"An RL algorithm is a way to turn consequences into better future decisions."
Overview
Reinforcement learning studies agents that learn by acting. The data are not a fixed table of labeled examples; the policy changes which states are visited, which rewards are observed, and which mistakes are possible. That single fact is why RL needs Markov decision processes, Bellman equations, temporal-difference learning, exploration, and policy-gradient estimators.
For AI systems, RL matters in two directions. Classical RL explains games, robotics, recommender feedback loops, and online control. Modern LLM work uses the same math when a model is fine-tuned from human preferences, constrained by a KL penalty to stay near a reference model, or evaluated through interactive feedback rather than static labels.
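One concrete instance of that shared math is the KL-regularized fine-tuning objective used in RLHF, written here in a common form (the symbols $r_\phi$ for a learned reward model and $\pi_{\mathrm{ref}}$ for the frozen reference policy follow convention and are not this chapter's fixed notation):

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\bigl[\, r_\phi(x, y) \,\bigr] \;-\; \beta\, \mathrm{KL}\bigl(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\bigr)
$$

The KL term is what keeps the fine-tuned policy close to the reference model; $\beta$ trades reward against drift.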
This section is written in Markdown with LaTeX math. Inline math uses `$...$`, display math uses `$$...$$`, and the notebooks use small synthetic MDPs so every update can be inspected without external data.
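To make "inspectable" concrete, here is a minimal sketch in the spirit of those notebooks (the arrays and names are invented for this example, not taken from `theory.ipynb`): a two-state, two-action MDP solved by value iteration.

```python
import numpy as np

# Illustrative only: a two-state, two-action synthetic MDP, small enough
# that every Bellman backup below can be checked by hand. P[s, a, s2] is
# the transition probability and R[s, a] the expected immediate reward;
# the numbers are made up for this sketch.
gamma = 0.9
P = np.array([[[0.8, 0.2],    # state 0, action 0
               [0.1, 0.9]],   # state 0, action 1
              [[0.5, 0.5],    # state 1, action 0
               [0.0, 1.0]]])  # state 1, action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Value iteration: apply the Bellman optimality backup
#   v(s) <- max_a [ R(s, a) + gamma * sum_s2 P(s, a, s2) * v(s2) ]
# until the values stop changing.
v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * (P @ v)   # q[s, a] for every state-action pair
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

print("optimal state values:", v)  # two numbers, no external data needed
```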
Prerequisites
- Markov Chains
- Expectation and Moments
- Gradient Descent
- KL Divergence
- Transformer Architecture
- Preference Optimization, RLHF, and DPO
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations of MDPs, Bellman backups, TD learning, Q-learning, policy gradients, PPO, and preference optimization. |
| exercises.ipynb | Ten graded practice problems with runnable scaffolds and full checked solutions. |
Learning Objectives
After completing this section, you will be able to:
- Define a finite Markov decision process and identify its state, action, reward, transition, and discount components.
- Compute discounted returns and explain temporal credit assignment.
- Derive the Bellman expectation and optimality equations (the key formulas are collected after this list).
- Run policy evaluation, policy iteration, and value iteration in a tabular MDP.
- Explain Monte Carlo, TD(0), n-step, and eligibility-trace prediction.
- Implement SARSA and Q-learning updates and distinguish on-policy from off-policy control.
- Explain why replay buffers and target networks stabilize DQN-style learning.
- Derive the policy-gradient estimator and explain baseline variance reduction.
- Interpret actor-critic methods, generalized advantage estimation (GAE), PPO clipping, and KL regularization.
- Connect the RL math to RLHF, reward modeling, DPO, and reward-hacking risk in LLM systems.
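For reference, the formulas behind several of these objectives, stated in standard tabular-RL notation (the symbols here, such as $p(s', r \mid s, a)$, the step size $\alpha$, and the baseline $b(S_t)$, follow common convention and may differ slightly from the chapter's own pages): the discounted return, the Bellman optimality equation, the Q-learning update, and the policy-gradient estimator with a baseline.

$$
G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad
v_*(s) = \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_*(s') \,\bigr]
$$

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[\, R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \,\bigr]
$$

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\bigl[\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\,\bigl( G_t - b(S_t) \bigr) \,\bigr]
$$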
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run `theory.ipynb` when you want to check the formulas numerically.
- Use `exercises.ipynb` after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.