🎯 Interview Preparation Guide
Common mathematical interview questions for ML/AI positions with detailed solutions.
Table of Contents
- Linear Algebra Questions
- Calculus Questions
- Probability & Statistics Questions
- Optimization Questions
- Information Theory Questions
- Applied ML Math Questions
- Deep Learning Math Questions
- Generative Models Math Questions
- Quick Review Checklist
- Study Plan
Linear Algebra Questions
Q1: What is the difference between eigenvalue decomposition and SVD?
Answer:
| Aspect | Eigendecomposition | SVD |
|---|---|---|
| Applies to | Square matrices only | Any matrix (m×n) |
| Formula | $A = Q\Lambda Q^{-1}$ | $A = U\Sigma V^\top$ |
| Components | Eigenvalues, eigenvectors | Singular values, left/right singular vectors |
| Requires | Diagonalizable matrix | Always exists |
Key Insight: For symmetric positive semi-definite matrices, singular values equal eigenvalues, and $A = Q\Lambda Q^\top = U\Sigma V^\top$ with $U = V = Q$ and $\Sigma = \Lambda$.
ML Application:
- Eigendecomposition: PCA on covariance matrix
- SVD: Recommender systems, image compression, pseudoinverse
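A quick numerical check of this relationship (a minimal NumPy sketch; the matrix is an arbitrary example):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
A = X.T @ X                                   # symmetric positive semi-definite (3x3)

eigvals, eigvecs = np.linalg.eigh(A)          # eigendecomposition (square symmetric matrices)
U, S, Vt = np.linalg.svd(X)                   # SVD works for any m x n matrix

# For symmetric PSD A, the singular values of A equal its eigenvalues
print(np.allclose(np.linalg.svd(A)[1], np.sort(eigvals)[::-1]))   # True
# Eigenvalues of X^T X equal the squared singular values of X
print(np.allclose(np.sort(eigvals)[::-1], S**2))                  # True
```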
Q2: Explain the geometric interpretation of eigenvalues and eigenvectors.
Answer:
Eigenvectors are directions that remain unchanged (except for scaling) when a linear transformation is applied. Eigenvalues are the scaling factors.
Before transformation: After transformation (A):
↑ v ↑ λv
│ │ (stretched by λ)
│ │
─────┼─────→ ─────┼─────→
│ │
ML Application:
- In PCA, eigenvectors point in directions of maximum variance
- Large eigenvalue = high variance in that direction
- We keep eigenvectors with largest eigenvalues
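A minimal PCA-by-eigendecomposition sketch in NumPy (the synthetic data here is illustrative only):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated 2-D data

Xc = X - X.mean(axis=0)                      # center the data
C = np.cov(Xc, rowvar=False)                 # 2x2 covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues returned in ascending order
order = np.argsort(eigvals)[::-1]            # sort directions by variance, largest first
components = eigvecs[:, order]               # principal directions (columns)

Z = Xc @ components[:, :1]                   # project onto the top component
print(eigvals[order])                        # variance captured along each direction
```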
Q3: What makes a matrix positive definite? Why does it matter?
Answer:
A symmetric matrix is positive definite if:
- All eigenvalues are positive, OR
- $x^\top A x > 0$ for all non-zero $x$
Why it matters:
- Covariance matrices are positive semi-definite
- Hessian being PD guarantees local minimum
- Optimization: Convex quadratic functions have PD Hessians
- Numerical stability: PD matrices are invertible
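Two standard numerical tests for positive definiteness (a sketch; for symmetric matrices a successful Cholesky factorization is the usual practical check):
```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    """Check a symmetric matrix via eigenvalues and via Cholesky."""
    if not np.allclose(A, A.T):
        return False
    eig_ok = bool(np.all(np.linalg.eigvalsh(A) > tol))   # all eigenvalues strictly positive
    try:
        np.linalg.cholesky(A)                            # succeeds only for PD matrices
        chol_ok = True
    except np.linalg.LinAlgError:
        chol_ok = False
    return eig_ok and chol_ok

print(is_positive_definite(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True (eigenvalues 1, 3)
print(is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))   # False (eigenvalues -1, 3)
```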
Q4: Explain the null space and column space. How do they relate to linear regression?
Answer:
- Column space (range): All possible outputs $Ax$ for $x \in \mathbb{R}^n$
- Null space (kernel): All inputs $x$ where $Ax = 0$
In Linear Regression:
Finding ŷ = Xw that's closest to y
y
↗
/
/
/ ŷ (projection onto column space of X)
●─────────────────→ Column space of X
The residual (y - ŷ) is perpendicular to column space
Insight: If null space is non-trivial, there are infinitely many solutions (need regularization).
Q5: What is the rank of a matrix and why is it important?
Answer:
Rank = number of linearly independent rows (or columns)
| Rank Property | Implication |
|---|---|
| rank(A) = min(m,n) | Full rank, unique solution possible |
| rank(A) < min(m,n) | Rank deficient, singular |
| rank(X^TX) < n | Multicollinearity in regression |
ML Applications:
- Low-rank matrix approximation (compression)
- Detecting redundant features
- Matrix completion (recommender systems)
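A short rank check with NumPy (sketch; the duplicated column mimics multicollinearity):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.hstack([X, X[:, [0]] + X[:, [1]]])    # 4th feature = feature 0 + feature 1 (redundant)

print(np.linalg.matrix_rank(X))              # 3, not 4 -> rank deficient
print(np.linalg.matrix_rank(X.T @ X))        # also 3 -> X^T X is singular, OLS has no unique solution
```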
Calculus Questions
Q6: Derive the gradient of the sigmoid function.
Answer:
Given: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Using the quotient rule: $\sigma'(x) = \dfrac{e^{-x}}{(1 + e^{-x})^2}$
Simplify: $\sigma'(x) = \dfrac{1}{1 + e^{-x}} \cdot \dfrac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,\big(1 - \sigma(x)\big)$
Key Result: $\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$
Why it matters: This makes backpropagation through sigmoid efficient!
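A quick finite-difference sanity check of the identity (sketch):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central finite difference
analytic = sigmoid(x) * (1 - sigmoid(x))                # sigma'(x) = sigma(x)(1 - sigma(x))

print(np.max(np.abs(numeric - analytic)))               # ~1e-10: the identity holds numerically
```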
Q7: Explain the chain rule and its role in backpropagation.
Answer:
Chain Rule: For $f(g(x))$: $\dfrac{df}{dx} = \dfrac{df}{dg} \cdot \dfrac{dg}{dx}$
In Neural Networks:
x → [Layer 1] → z₁ → [Activation] → a₁ → [Layer 2] → z₂ → [Loss] → L
∂L/∂W₁ = ∂L/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂W₁
Key Insight: Backprop is just repeated application of chain rule, computed efficiently by caching intermediate values.
Q8: What is the Jacobian matrix? When do you need it?
Answer:
The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$: $J \in \mathbb{R}^{m \times n}$ with $J_{ij} = \dfrac{\partial f_i}{\partial x_j}$
When needed:
- Backprop through layers with multiple outputs
- Change of variables in probability (normalizing flows)
- Sensitivity analysis: How outputs change with inputs
Q9: What is the Hessian and how is it used in optimization?
Answer:
The Hessian is the matrix of second-order partial derivatives: $H_{ij} = \dfrac{\partial^2 f}{\partial x_i \, \partial x_j}$ (symmetric for twice continuously differentiable $f$)
Uses:
| Condition | Meaning |
|---|---|
| $H$ positive definite | Local minimum |
| $H$ negative definite | Local maximum |
| $H$ indefinite | Saddle point |
In Optimization:
- Newton's method: $\theta_{t+1} = \theta_t - H^{-1} \nabla f(\theta_t)$
- Quasi-Newton / approximate curvature: BFGS, L-BFGS (Adam uses diagonal second-moment estimates rather than a true Hessian)
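A one-dimensional Newton's method sketch on a convex function (illustrative only; the function and starting point are arbitrary):
```python
# Minimize f(x) = x^4/4 - 2x, which is convex: f'(x) = x^3 - 2, f''(x) = 3x^2
f_grad = lambda x: x**3 - 2
f_hess = lambda x: 3 * x**2

x = 3.0
for _ in range(8):
    x -= f_grad(x) / f_hess(x)        # Newton step: x <- x - f''(x)^{-1} f'(x)
print(x, 2 ** (1 / 3))                # converges to the minimizer, the cube root of 2 ≈ 1.2599
```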
Q10: Explain Taylor series and its application in optimization.
Answer:
Taylor expansion around point $x_0$: $f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2 + \dots$
In Optimization:
First-order approximation (gradient descent): $f(\theta) \approx f(\theta_t) + \nabla f(\theta_t)^\top (\theta - \theta_t)$ → step along $-\nabla f(\theta_t)$
Second-order approximation (Newton's method): $f(\theta) \approx f(\theta_t) + \nabla f(\theta_t)^\top \Delta\theta + \frac{1}{2}\Delta\theta^\top H \Delta\theta$ → minimized by $\Delta\theta = -H^{-1}\nabla f(\theta_t)$
Probability & Statistics Questions
Q11: Derive Bayes' theorem and explain its components.
Answer:
Bayes' theorem: $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$, which follows from $P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$.
| Component | Name | Meaning |
|---|---|---|
| $P(A \mid B)$ | Posterior | Updated belief about A after seeing B |
| $P(B \mid A)$ | Likelihood | Probability of B given A |
| $P(A)$ | Prior | Initial belief about A |
| $P(B)$ | Evidence | Normalizing constant |
ML Example - Naive Bayes: $P(y \mid x_1, \dots, x_n) \propto P(y)\prod_i P(x_i \mid y)$ (features assumed conditionally independent given the class)
Q12: What is the difference between MLE and MAP?
Answer:
Maximum Likelihood Estimation (MLE): $\hat{\theta}_{\text{MLE}} = \arg\max_\theta \, P(D \mid \theta)$
- Only considers data likelihood
- Can overfit
Maximum A Posteriori (MAP): $\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, P(\theta \mid D) = \arg\max_\theta \, P(D \mid \theta)\, P(\theta)$
- Includes prior belief
- Acts as regularization
Key Connection:
- Gaussian prior → L2 regularization
- Laplace prior → L1 regularization
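A small illustration of the Gaussian-prior → L2 connection: the MAP estimate for linear regression with a Gaussian prior is exactly ridge regression (sketch on synthetic data; `lam` plays the role of the noise-to-prior variance ratio):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1.0                                                        # sigma^2 / tau^2 (noise vs prior variance)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)                        # MLE = ordinary least squares
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)      # MAP with Gaussian prior = ridge

print(w_mle)
print(w_map)                                                     # shrunk toward zero relative to MLE
```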
Q13: Explain the bias-variance tradeoff mathematically.
Answer:
For any estimator, the expected squared error can be decomposed: $\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$
| Component | Meaning | High when... |
|---|---|---|
| Bias² | Systematic error | Model too simple (underfitting) |
| Variance | Sensitivity to training data | Model too complex (overfitting) |
| σ² | Irreducible noise | Inherent in data |
Error
│
│╲ ╱
│ ╲ Total Error ╱
│ ╲ ┌────┐ ╱
│ ╲ │ │ ╱
│ ╲──│────│──╱
│ Bias² │
│ │ Variance
└────────────┴──────────→ Model Complexity
Simple Complex
Q14: What is the Central Limit Theorem and why does it matter?
Answer:
CLT: The sum (or average) of many independent random variables tends toward a normal distribution, regardless of the original distribution.
Why it matters:
- Confidence intervals: We can use normal distribution
- Hypothesis testing: t-tests, z-tests
- SGD convergence: Gradient estimates are approximately normal
- Batch normalization: Large batch → normal activations
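A quick simulation (sketch): averages of heavily skewed exponential samples already look approximately normal.
```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                                    # samples per average
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# CLT: the sampling distribution of the mean here is roughly Normal(1, 1/sqrt(n))
print(means.mean(), means.std())                           # ≈ 1.0 and ≈ 0.1
```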
Q15: Explain covariance and correlation. What's the difference?
Answer:
Covariance: Measures how two variables change together, $\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]$
- Unbounded
- Units depend on X and Y
Correlation: Normalized covariance, $\rho_{XY} = \dfrac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
- Always between -1 and 1
- Unitless
| Value | Meaning |
|---|---|
| ρ = 1 | Perfect positive linear relationship |
| ρ = 0 | No linear relationship |
| ρ = -1 | Perfect negative linear relationship |
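A quick check with NumPy (sketch; the data are synthetic):
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)   # y strongly (but not perfectly) related to x

print(np.cov(x, y)[0, 1])        # covariance: magnitude depends on the units of x and y
print(np.corrcoef(x, y)[0, 1])   # correlation: unitless, close to +1 here
```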
Optimization Questions
Q16: Explain gradient descent and its variants.
Answer:
Basic Gradient Descent: $\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$
| Variant | Update Rule | Pros/Cons |
|---|---|---|
| Batch GD | Full dataset gradient | Stable but slow |
| SGD | Single sample gradient | Fast but noisy |
| Mini-batch | Batch of samples | Best of both |
| Momentum | Accumulate velocity | Faster convergence |
| Adam | Adaptive learning rates | Works well in practice |
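A minimal mini-batch SGD loop for linear regression with MSE loss (sketch; synthetic data, fixed learning rate):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -1.0, 2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=1000)

w, lr, batch_size = np.zeros(5), 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))                         # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)    # gradient of mean squared error on the batch
        w -= lr * grad                                    # theta <- theta - eta * gradient
print(w)                                                  # close to the true weights
```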
Q17: What makes a function convex? Why does convexity matter?
Answer:
Convexity Definition: $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$
for all $x, y$ in the domain and $\lambda \in [0, 1]$.
Visual:
Convex:                  Non-convex:
╲             ╱          ╲      ╱╲      ╱
 ╲           ╱            ╲    ╱  ╲    ╱
  ╲         ╱              ╲  ╱    ╲  ╱
   ╲___★___╱                ╲╱      ╲╱
   one global minimum       ★ multiple ★ local minima!
Why it matters:
- Convex functions have global minimum = local minimum
- Gradient descent (with a suitable step size) is guaranteed to converge to the global minimum
- Linear regression loss is convex
- Neural network loss is non-convex (hence harder to optimize)
Q18: Explain the Adam optimizer mathematically.
Answer:
Adam combines momentum and RMSprop:
```python
import numpy as np

# Hyperparameters (typical defaults)
lr, beta1, beta2, epsilon = 1e-3, 0.9, 0.999, 1e-8

m = np.zeros_like(theta)   # First moment (momentum)
v = np.zeros_like(theta)   # Second moment (RMSprop)

for t in range(1, num_iterations + 1):   # t starts at 1 for bias correction
    g = compute_gradient(theta)          # gradient of the loss at the current parameters

    # Update biased moment estimates
    m = beta1 * m + (1 - beta1) * g         # momentum term
    v = beta2 * v + (1 - beta2) * g**2      # RMSprop term

    # Bias correction (moments are initialized at zero)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # Parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)
```
Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
Q19: What are the KKT conditions?
Answer:
For constrained optimization: minimize $f(x)$ subject to $g_i(x) \le 0$ and $h_j(x) = 0$.
KKT Conditions:
- Stationarity: $\nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0$
- Primal feasibility: $g_i(x^*) \le 0$, $h_j(x^*) = 0$
- Dual feasibility: $\mu_i \ge 0$
- Complementary slackness: $\mu_i \, g_i(x^*) = 0$
ML Application: SVM optimization satisfies KKT conditions.
Information Theory Questions
Q20: What is entropy and how is it used in ML?
Answer:
Entropy: Measure of uncertainty/randomness, $H(X) = -\sum_x p(x) \log p(x)$
| Distribution | Entropy |
|---|---|
| Certain (one outcome) | 0 |
| Uniform | Maximum |
| More spread out | Higher |
ML Uses:
- Decision tree splitting (maximize information gain)
- Regularization in neural networks
- Measuring model confidence
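A small entropy calculation in bits (sketch):
```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0  -> certain outcome
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0  -> uniform over 4 outcomes (maximum)
print(entropy([0.7, 0.1, 0.1, 0.1]))       # ≈ 1.36 -> somewhere in between
```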
Q21: Explain cross-entropy loss. Why is it used for classification?
Answer:
Cross-Entropy: $H(p, q) = -\sum_x p(x) \log q(x)$
For classification:
- $p$ = true distribution (one-hot labels)
- $q$ = predicted probabilities
Why use it:
- Gradients: Better gradients than MSE for classification
- Probability interpretation: Natural for probability outputs
- Convexity: Convex for logistic regression
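A minimal multi-class cross-entropy computation (sketch; one-hot targets, softmax-style predicted probabilities):
```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x), averaged over samples."""
    q_pred = np.clip(q_pred, eps, 1.0)                # avoid log(0)
    return -np.mean(np.sum(p_true * np.log(q_pred), axis=1))

p = np.array([[0, 0, 1], [1, 0, 0]])                  # one-hot targets
q_good = np.array([[0.1, 0.1, 0.8], [0.9, 0.05, 0.05]])
q_bad = np.array([[0.4, 0.5, 0.1], [0.2, 0.3, 0.5]])

print(cross_entropy(p, q_good))   # ≈ 0.16 (confident and correct -> small loss)
print(cross_entropy(p, q_bad))    # ≈ 1.96 (wrong or unconfident -> large loss)
```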
Q22: What is KL divergence and how is it related to cross-entropy?
Answer:
KL Divergence: Measures how different distribution Q is from P, $D_{KL}(P \,\|\, Q) = \sum_x p(x) \log \dfrac{p(x)}{q(x)}$
Relation to Cross-Entropy: $H(P, Q) = H(P) + D_{KL}(P \,\|\, Q)$
Key Properties:
- $D_{KL}(P \,\|\, Q) \ge 0$ (Gibbs' inequality)
- $D_{KL}(P \,\|\, Q) = 0$ iff $P = Q$
- Not symmetric: $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$
ML Uses:
- VAE loss function
- Knowledge distillation
- Policy gradient (KL constraint)
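A quick numerical check of the cross-entropy decomposition and the asymmetry (sketch, natural log):
```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

H_p = -np.sum(p * np.log(p))             # entropy H(P)
H_pq = -np.sum(p * np.log(q))            # cross-entropy H(P, Q)
kl_pq = np.sum(p * np.log(p / q))        # D_KL(P || Q)
kl_qp = np.sum(q * np.log(q / p))        # D_KL(Q || P)

print(np.isclose(H_pq, H_p + kl_pq))     # True: H(P, Q) = H(P) + KL(P || Q)
print(kl_pq, kl_qp)                      # different values -> KL is not symmetric
```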
Applied ML Math Questions
Q23: Derive the gradient for logistic regression.
Answer:
Model: $\hat{y} = \sigma(w^\top x)$ with $z = w^\top x$
Loss (single sample): $L = -\big[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\big]$
Gradient derivation (chain rule): $\dfrac{\partial L}{\partial w} = \dfrac{\partial L}{\partial \hat{y}} \cdot \dfrac{\partial \hat{y}}{\partial z} \cdot \dfrac{\partial z}{\partial w}$
Where $\dfrac{\partial L}{\partial \hat{y}} = \dfrac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$, $\;\dfrac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$, and $\dfrac{\partial z}{\partial w} = x$:
Result: $\nabla_w L = (\hat{y} - y)\, x$
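A finite-difference check that the gradient really is $(\hat{y} - y)\,x$ (sketch; random weights and a single example):
```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    y_hat = sigmoid(w @ x)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1.0

analytic = (sigmoid(w @ x) - y) * x                      # (y_hat - y) * x
h = 1e-6
numeric = np.array([
    (loss(w + h * np.eye(3)[i], x, y) - loss(w - h * np.eye(3)[i], x, y)) / (2 * h)
    for i in range(3)
])
print(np.max(np.abs(analytic - numeric)))                # tiny -> the derivation checks out
```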
Q24: Explain the math behind batch normalization.
Answer:
Forward Pass:
- Compute batch statistics: $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$, $\quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$
- Normalize: $\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
- Scale and shift: $y_i = \gamma \hat{x}_i + \beta$ (learnable parameters $\gamma$, $\beta$)
Why it works:
- Reduces internal covariate shift
- Allows higher learning rates
- Acts as regularization
- Smoother loss landscape
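A minimal training-mode batch-norm forward pass in NumPy (sketch; running statistics for inference are omitted):
```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                          # per-feature batch mean
    var = x.var(axis=0)                          # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)        # zero mean, unit variance per feature
    return gamma * x_hat + beta                  # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))   # ≈ 0 and ≈ 1 per feature
```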
Q25: Derive the attention mechanism math in Transformers.
Answer:
Scaled Dot-Product Attention: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$
Step by step:
- Compute similarities: $QK^\top$ gives attention scores
- Scale: Divide by $\sqrt{d_k}$ to prevent softmax saturation
- Softmax: Convert to probability distribution
- Weighted sum: Multiply by V
Multi-Head Attention: $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$
where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
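A NumPy sketch of single-head scaled dot-product attention (no masking, no learned projections):
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)           # each query's weights sum to 1
    return weights @ V, weights                  # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))                 # (4, 8); each row of weights sums to 1
```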
Deep Learning Math Questions
Q26: Explain the math behind the Transformer's attention mechanism.
Answer:
Scaled dot-product attention: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$
Why $\sqrt{d_k}$? Without scaling, the dot products grow with the dimension $d_k$, pushing softmax into saturation (outputs near 0 or 1, with near-zero gradients). Scaling keeps the variance of the scores ≈ 1.
Multi-head attention allows attending to information from different representation subspaces: $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$ with $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
Complexity: $O(n^2 d)$ where $n$ is the sequence length — the quadratic bottleneck motivating efficient attention variants.
Q27: Derive the backpropagation equations for a simple 2-layer network.
Answer:
Network: $z_1 = W_1 x + b_1$, $\;a_1 = \sigma(z_1)$, $\;z_2 = W_2 a_1 + b_2$, $\;\hat{y} = \text{softmax}(z_2)$
Loss: $L = -\sum_k y_k \log \hat{y}_k$ (cross-entropy)
Backward pass (using chain rule):
- $\dfrac{\partial L}{\partial z_2} = \hat{y} - y$ (softmax + cross-entropy simplification)
- $\dfrac{\partial L}{\partial W_2} = (\hat{y} - y)\, a_1^\top$, $\quad \dfrac{\partial L}{\partial b_2} = \hat{y} - y$
- $\dfrac{\partial L}{\partial z_1} = \big(W_2^\top (\hat{y} - y)\big) \odot \sigma'(z_1)$ (element-wise)
- $\dfrac{\partial L}{\partial W_1} = \dfrac{\partial L}{\partial z_1}\, x^\top$, $\quad \dfrac{\partial L}{\partial b_1} = \dfrac{\partial L}{\partial z_1}$
Key insight: Each layer's gradient depends on the downstream gradient modulated by the local derivative $\sigma'(z)$.
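These equations in code, for a single example (a NumPy sketch with a sigmoid hidden layer and softmax output; sizes are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                       # input: 4 features
y = np.array([[0.0], [1.0], [0.0]])               # one-hot target: 3 classes
W1, b1 = rng.normal(size=(5, 4)), np.zeros((5, 1))
W2, b2 = rng.normal(size=(3, 5)), np.zeros((3, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = np.exp(z2 - z2.max()) / np.exp(z2 - z2.max()).sum()   # softmax

# Backward pass (chain rule)
dz2 = y_hat - y                                   # dL/dz2 from softmax + cross-entropy
dW2, db2 = dz2 @ a1.T, dz2
dz1 = (W2.T @ dz2) * a1 * (1 - a1)                # modulated by sigma'(z1) element-wise
dW1, db1 = dz1 @ x.T, dz1
print(dW1.shape, dW2.shape)                       # (5, 4) and (3, 5), matching W1 and W2
```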
Q28: What causes vanishing/exploding gradients and how are they addressed?
Answer:
For a deep network with $L$ layers, the backpropagated gradient is a product of per-layer factors, so its magnitude scales roughly as $\prod_{l=1}^{L} \|W_l\| \, \|\sigma'(z_l)\|$:
- Vanishing: factors $< 1$ repeatedly → gradient → 0
- Exploding: factors $> 1$ repeatedly → gradient → ∞
Solutions:
| Technique | How it helps |
|---|---|
| ReLU | $\sigma'(z) = 1$ for $z > 0$ (no shrinkage) |
| Residual connections | Gradient flows through the skip: $\frac{\partial}{\partial x}\big(x + F(x)\big) = I + \frac{\partial F}{\partial x}$ |
| Batch normalization | Keeps activations well-conditioned |
| Xavier/He initialization | Sets $\mathrm{Var}(W) = 1/n_{\text{in}}$ (Xavier) or $2/n_{\text{in}}$ (He) |
| Gradient clipping | Caps gradient norm to threshold |
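A toy illustration of the product-of-factors effect (sketch): multiplying 50 per-layer factors slightly below or above 1.
```python
import numpy as np

layers = 50
for factor in (0.9, 1.1):
    grad_scale = np.prod(np.full(layers, factor))   # product of 50 identical per-layer factors
    print(factor, grad_scale)                       # 0.9 -> ~0.005 (vanishing); 1.1 -> ~117 (exploding)
```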
Q29: Explain batch normalization mathematically. Why does it help?
Answer:
Forward pass (for a mini-batch $B = \{x_1, \dots, x_m\}$): $\mu_B = \frac{1}{m}\sum_i x_i$, $\;\sigma_B^2 = \frac{1}{m}\sum_i (x_i - \mu_B)^2$, $\;\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, $\;y_i = \gamma \hat{x}_i + \beta$
Why it helps:
- Reduces internal covariate shift (distribution of inputs to each layer stabilizes)
- Allows higher learning rates (smoother loss landscape)
- Acts as regularizer (batch statistics add noise)
- Makes the loss landscape smoother: $\|\nabla L\|$ varies less across parameter space
Generative Models Math Questions
Q30: Derive the ELBO for VAEs.
Answer:
Start with the log-likelihood: $\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$
Introduce a variational distribution $q_\phi(z \mid x)$:
$\log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \dfrac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] + D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$
Since KL ≥ 0:
$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \text{ELBO}$
ELBO = Reconstruction - KL:
- Reconstruction: $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ — how well can we decode $z$ back to $x$?
- KL: $D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$ — how close is the encoder to the prior?
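For the common case of a diagonal Gaussian encoder and a standard normal prior, the KL term has a well-known closed form; a small sketch:
```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))   # 0.0: encoder matches the prior exactly
print(kl_to_standard_normal(np.ones(4), np.zeros(4)))    # 2.0: penalty for means shifted away from 0
```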
Q31: What is the GAN objective and what does the optimal discriminator look like?
Answer:
GAN objective (minimax): $\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$
Optimal discriminator (for fixed $G$): $D^*(x) = \dfrac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$
With $D = D^*$, the generator minimizes $2\, D_{JS}(p_{\text{data}} \,\|\, p_g) - \log 4$,
where $D_{JS}$ is the Jensen-Shannon divergence.
Problem: When $D$ is too good, $\log(1 - D(G(z)))$ saturates → vanishing gradients for $G$.
Quick Review Checklist
Before Your Interview, Make Sure You Can:
Linear Algebra:
- Multiply matrices and explain the dimensions
- Explain eigenvalue decomposition geometrically
- Describe SVD and its applications
- Define and identify positive definite matrices
- Explain rank and its implications
Calculus:
- Derive common function derivatives (sigmoid, softmax)
- Apply the chain rule for backpropagation
- Explain gradient, Jacobian, and Hessian
- Use Taylor series for approximations
Probability:
- Derive and apply Bayes' theorem
- Explain MLE vs MAP
- Describe common distributions
- Explain bias-variance tradeoff
- Define and compute expectation, variance, covariance
Optimization:
- Explain gradient descent variants
- Describe convexity and its importance
- Walk through Adam optimizer
- Explain regularization mathematically
Information Theory:
- Define and compute entropy
- Explain cross-entropy loss
- Describe KL divergence and its properties
Applied:
- Derive gradients for logistic regression
- Explain batch normalization math
- Describe attention mechanism math
Tips for Math Interviews
- Start with intuition before diving into formulas
- Draw pictures - geometric intuition is powerful
- Connect to ML applications - show you understand why it matters
- Be honest if you don't know something
- Practice derivations by hand
- Understand, don't memorize - interviewers test understanding
📅 Study Plan
Week 1: Foundations
- Day 1-2: Linear algebra (vectors, matrices, eigenvalues)
- Day 3-4: Calculus (derivatives, gradients, chain rule)
- Day 5-6: Probability (Bayes, distributions, expectation)
- Day 7: Review + practice problems
Week 2: Core ML Math
- Day 1-2: Optimization (gradient descent, convexity, Adam)
- Day 3-4: Information theory (entropy, KL, cross-entropy)
- Day 5-6: Statistics (MLE, MAP, hypothesis testing)
- Day 7: Review + mock interview
Week 3: Advanced Topics
- Day 1-2: Deep learning math (backprop, batch norm, attention)
- Day 3-4: Generative models (VAE ELBO, GAN theory, diffusion)
- Day 5-6: Graph theory + kernel methods
- Day 7: Full review + timed practice
Daily Practice Routine
- ⏰ 10 min: Review 5 formulas from cheatsheet
- 📝 20 min: Derive one key result by hand
- 💻 20 min: Implement one concept in code
- 🎯 10 min: Answer one interview question aloud
Good luck with your interviews! 🍀