
"An RL algorithm is a way to turn consequences into better future decisions."

Overview

Reinforcement learning studies agents that learn by acting. The data are not a fixed table of labeled examples; the policy changes which states are visited, which rewards are observed, and which mistakes are possible. That single fact is why RL needs Markov decision processes, Bellman equations, temporal-difference learning, exploration, and policy-gradient estimators.

For AI systems, RL matters in two directions. Classical RL underlies games, robotics, recommender feedback loops, and online control. Modern LLM work uses the same math when a model is fine-tuned from human preferences, constrained by KL divergence, or evaluated through interactive feedback rather than static labels.

This section is written as LaTeX Markdown. Inline math uses $...$, display math uses $$...$$, and the notebooks use small synthetic MDPs so every update can be inspected without external data.
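
As a minimal illustration of the display-math convention, here is the discounted return that the learning objectives below refer to, in standard RL notation (the symbols are defined properly in the chapter pages):

$$
G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma < 1,
$$

where $R_{t+k+1}$ is the reward received $k+1$ steps after time $t$ and $\gamma$ is the discount factor.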

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Executable demonstrations of MDPs, Bellman backups, TD learning, Q-learning, policy gradients, PPO, and preference optimization. |
| exercises.ipynb | Ten graded practice problems with runnable scaffolds and full checked solutions. |

Learning Objectives

After completing this section, you will be able to:

  • Define a finite Markov decision process and identify its state, action, reward, transition, and discount components.
  • Compute discounted returns and explain temporal credit assignment.
  • Derive Bellman expectation and optimality equations.
  • Run policy evaluation, policy iteration, and value iteration in a tabular MDP (see the value-iteration sketch after this list).
  • Explain Monte Carlo, TD(0), n-step, and eligibility-trace prediction.
  • Implement SARSA and Q-learning updates and distinguish on-policy from off-policy control (a Q-learning sketch also follows the list).
  • Explain why replay buffers and target networks stabilize DQN-style learning.
  • Derive the policy-gradient estimator and explain baseline variance reduction (demonstrated numerically after this list).
  • Interpret actor-critic, GAE, PPO clipping, and KL regularization.
  • Connect RL math to RLHF, reward modeling, DPO, and reward hacking risk in LLM systems.
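
To make the tabular objectives concrete before you open theory.ipynb, here is a minimal value-iteration sketch on a two-state synthetic MDP. The MDP itself (the `P` and `R` arrays) is invented for illustration here and is not the notebooks' example:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustration only, not the
# notebooks' example). P[s, a, s'] is a transition probability and
# R[s, a] is the expected immediate reward.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup:
    # Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
    Q = R + gamma * P @ V          # shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

print("V* ≈", V, " greedy policy:", Q.argmax(axis=1))
```

Policy evaluation is the same loop with the max replaced by an expectation under a fixed policy.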
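Similarly, a sketch of the tabular Q-learning update; SARSA would replace the max with the value of the next action actually taken. The dynamics below are the same hypothetical two-state MDP, again not taken from the notebooks:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma, eps = 0.1, 0.9, 0.1

# Hypothetical 2-state, 2-action dynamics (illustration only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

Q = np.zeros((2, 2))
s = 0
for _ in range(20_000):
    # epsilon-greedy behavior policy; the greedy target policy differs,
    # which is what makes Q-learning off-policy.
    a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
    s_next = rng.choice(2, p=P[s, a])
    # TD target bootstraps from the best next-state action value.
    td_target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    s = s_next

print("learned Q ≈\n", Q)
```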
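Finally, the baseline claim in the policy-gradient objective can be checked numerically. This sketch uses a made-up two-armed bandit with a softmax policy; the arm means and noise level are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.0, 0.0])          # softmax logits, one per arm
arm_means = np.array([1.0, 2.0])      # hypothetical expected rewards

def reinforce_grads(baseline, n=100_000):
    """Draw n independent score-function (REINFORCE) gradient samples."""
    probs = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, size=n, p=probs)
    r = arm_means[a] + rng.normal(0.0, 1.0, size=n)   # noisy rewards
    # grad_theta log pi(a) for a softmax policy: one_hot(a) - probs
    score = np.eye(2)[a] - probs
    return (r - baseline)[:, None] * score

g0 = reinforce_grads(baseline=0.0)
g1 = reinforce_grads(baseline=arm_means.mean())
print("means agree:", g0.mean(axis=0), g1.mean(axis=0))
print("total variance without/with baseline:",
      g0.var(axis=0).sum(), g1.var(axis=0).sum())
```

Both estimators have the same expectation because the score function has mean zero under the policy; subtracting a baseline changes only the variance.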

Study Flow

  1. Read the pages in order and pause after each page to restate the main definition or theorem.
  2. Run theory.ipynb when you want to check the formulas numerically.
  3. Use exercises.ipynb after the reading path, not before it.
  4. Return to this overview page when you need the chapter-level navigation.
