Reinforcement Learning, Part 7 (Control Algorithms) to Part 8 (Function Approximation and Deep RL)
7. Control Algorithms
Control Algorithms is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.
7.1 SARSA
Purpose. SARSA focuses on on-policy TD control. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
Control algorithms learn how to choose actions, not just how to evaluate a fixed policy.
Worked reading.
SARSA uses the action actually sampled by the behavior policy; Q-learning uses a greedy target and is therefore off-policy.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- tabular gridworld control.
- DQN-style value learning.
- epsilon-greedy exploration.
Non-examples:
- estimating the value of a fixed policy only.
- planning with a perfect model and no samples.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
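As a concrete anchor, here is a minimal tabular SARSA sketch on an assumed five-state chain MDP (move left or right, reward 1 at the right end); the environment, constants, and episode count are illustrative, not the course's reference setup. The point to inspect is that the bootstrap target uses the next action the behavior policy actually samples.

```python
import numpy as np

# Tabular SARSA on an assumed 5-state chain: move left/right, reward 1 at
# the right end, episode ends there.
n_states, n_actions = 5, 2
gamma, alpha, epsilon = 0.95, 0.1, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, float(done), done  # reward 1 only on reaching the goal

def epsilon_greedy(s):
    return rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())

for episode in range(500):
    s = 0
    a = epsilon_greedy(s)                     # SARSA: action chosen by the behavior policy
    done = False
    while not done:
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(s_next)       # the action actually taken next
        target = r + gamma * (1.0 - done) * Q[s_next, a_next]  # terminal mask
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next

print(Q.round(2))
```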
7.2 Q-learning
Purpose. Q-learning focuses on off-policy TD control with greedy targets. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
Control algorithms learn how to choose actions, not just how to evaluate a fixed policy.
Worked reading.
SARSA uses the action actually sampled by the behavior policy; Q-learning uses a greedy target and is therefore off-policy.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- tabular gridworld control.
- DQN-style value learning.
- epsilon-greedy exploration.
Non-examples:
- estimating the value of a fixed policy only.
- planning with a perfect model and no samples.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
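For contrast with the SARSA sketch above, here is a minimal tabular Q-learning sketch on the same assumed chain; only the target line changes, from the sampled next action to the greedy maximum, which is exactly the off-policy distinction discussed above.

```python
import numpy as np

# Tabular Q-learning on the same assumed 5-state chain: the behavior policy
# is epsilon-greedy, but the target uses the greedy next action, which is
# what makes the update off-policy.
n_states, n_actions = 5, 2
gamma, alpha, epsilon = 0.95, 0.1, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, float(done), done

for episode in range(500):
    s, done = 0, False
    while not done:
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        target = r + gamma * (1.0 - done) * Q[s_next].max()  # greedy target, not the sampled action
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q.round(2))
```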
7.3 Exploration schedules
Purpose. Exploration schedules focus on epsilon-greedy, softmax, and optimism-based action selection. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
This concept is part of the bridge from sequential probability models to practical RL algorithms.
Worked reading.
The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- small tabular MDPs.
- neural value functions.
- preference-optimized language policies.
Non-examples:
- static regression.
- uncontrolled simulation traces.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
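As a sketch of the schedules named in the purpose line, here are epsilon-greedy with a linear decay and a softmax (Boltzmann) action rule; the decay constants and temperature are illustrative assumptions rather than recommended settings.

```python
import numpy as np

# Two exploration schedules applied to a row of Q-values.
rng = np.random.default_rng(0)

def epsilon_greedy(q_row, t, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    # Linear decay from eps_start to eps_end over decay_steps interactions.
    eps = max(eps_end, eps_start - (eps_start - eps_end) * t / decay_steps)
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return int(q_row.argmax())

def softmax_action(q_row, temperature=1.0):
    # Boltzmann exploration: higher-valued actions are sampled more often;
    # the temperature controls how greedy the distribution is.
    z = (q_row - q_row.max()) / temperature
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_row), p=probs))

# Optimism can be sketched by initializing Q above its plausible range so
# unvisited actions look attractive until they are tried.
q_row = np.array([0.1, 0.5, 0.2])
print(epsilon_greedy(q_row, t=0), softmax_action(q_row, temperature=0.5))
```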
7.4 Double Q-learning
Purpose. Double Q-learning focuses on reducing maximization bias. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
Control algorithms learn how to choose actions, not just how to evaluate a fixed policy.
Worked reading.
SARSA uses the action actually sampled by the behavior policy; Q-learning uses a greedy target and is therefore off-policy.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- tabular gridworld control.
- DQN-style value learning.
- epsilon-greedy exploration.
Non-examples:
- estimating the value of a fixed policy only.
- planning with a perfect model and no samples.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
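The maximization-bias fix can be made concrete with two tables: one selects the greedy next action and the other evaluates it. The sketch below reuses the assumed chain MDP from the earlier subsections; constants and episode counts are illustrative.

```python
import numpy as np

# Double Q-learning on the assumed 5-state chain: decoupling selection and
# evaluation of the greedy action reduces the maximization bias of a single max.
n_states, n_actions = 5, 2
gamma, alpha, epsilon = 0.95, 0.1, 0.1
rng = np.random.default_rng(0)
Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, float(done), done

for episode in range(500):
    s, done = 0, False
    while not done:
        q_sum = Q1[s] + Q2[s]                 # act on the combined estimate
        a = rng.integers(n_actions) if rng.random() < epsilon else int(q_sum.argmax())
        s_next, r, done = step(s, a)
        if rng.random() < 0.5:                # update one table at random
            a_star = int(Q1[s_next].argmax())                         # Q1 selects ...
            target = r + gamma * (1.0 - done) * Q2[s_next, a_star]    # ... Q2 evaluates
            Q1[s, a] += alpha * (target - Q1[s, a])
        else:
            a_star = int(Q2[s_next].argmax())
            target = r + gamma * (1.0 - done) * Q1[s_next, a_star]
            Q2[s, a] += alpha * (target - Q2[s, a])
        s = s_next

print(((Q1 + Q2) / 2).round(2))
```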
7.5 Convergence conditions
Purpose. Convergence conditions focus on what the tabular guarantees assume. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
This concept is part of the bridge from sequential probability models to practical RL algorithms.
Worked reading.
The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- small tabular MDPs.
- neural value functions.
- preference-optimized language policies.
Non-examples:
- static regression.
- uncontrolled simulation traces.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
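One tabular condition is a Robbins-Monro step-size schedule (step sizes that sum to infinity while their squares stay finite), together with every state-action pair being visited infinitely often. A minimal sketch, assuming a per-pair count-based schedule alpha = 1 / N(s, a):

```python
import numpy as np

# Count-based step sizes satisfy the Robbins-Monro conditions, and the visit
# counts let you check (approximately) whether exploration keeps covering
# every state-action pair.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions), dtype=int)

def update(s, a, target):
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]          # decaying, per-pair step size
    Q[s, a] += alpha * (target - Q[s, a])

def assumptions_hold():
    # Finite-run stand-in for "every pair is visited infinitely often".
    return bool((visits > 0).all())

update(0, 1, target=1.0)
print(Q[0], assumptions_hold())
```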
8. Function Approximation and Deep RL
Function Approximation and Deep RL is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.
8.1 Why tables fail
Purpose. Why tables fail focuses on large state spaces and generalization. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
This concept is part of the bridge from sequential probability models to practical RL algorithms.
Worked reading.
The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- small tabular MDPs.
- neural value functions.
- preference-optimized language policies.
Non-examples:
- static regression.
- uncontrolled simulation traces.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
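A back-of-the-envelope sketch makes the scaling point: a table grows multiplicatively with each state variable, while a parameterized model grows with the number of features. The sizes below are illustrative assumptions, not measurements from a specific task.

```python
# Why tables fail: counting distinct states.
grid_cells = 10 * 10          # tiny gridworld: 100 states, a table is fine
pixels = 84 * 84              # assumed Atari-style frame, 256 gray levels per pixel
atari_states = 256 ** pixels  # astronomically many distinct observations
vocab, context = 50_000, 2_048
llm_states = vocab ** context # distinct prompts/prefixes a token-level policy can see

print(f"gridworld table rows: {grid_cells}")
print(f"Atari observation count has ~{len(str(atari_states))} digits")
print(f"LLM prefix count has ~{len(str(llm_states))} digits")
```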
8.2 Linear value approximation
Purpose. Linear value approximation focuses on projected Bellman equations. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
Value functions summarize future consequences. Advantage functions compare an action with the policy's average behavior at the same state.
Worked reading.
Occupancy measures explain why RL gradients weight states by how often the current policy visits them.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- the state value V(s) of a fixed policy.
- the action value Q(s, a).
- the advantage A(s, a) = Q(s, a) - V(s).
Non-examples:
- instant reward only.
- a metric computed on states never reached by the policy.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
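A minimal sketch of semi-gradient TD(0) with a linear value function V_w(s) = w . phi(s) on the assumed chain MDP; one-hot features are used so the result can be checked against the tabular case, and the behavior policy is a fixed random policy (evaluation, not control).

```python
import numpy as np

# Semi-gradient TD(0) with linear features on the assumed 5-state chain.
n_states = 5
gamma, alpha = 0.95, 0.1
rng = np.random.default_rng(0)
w = np.zeros(n_states)

def phi(s):
    x = np.zeros(n_states)
    x[s] = 1.0                # one-hot features reduce to the tabular case
    return x

def value(s):
    return w @ phi(s)

for episode in range(200):
    s, done = 0, False
    while not done:
        a = rng.integers(2)   # fixed random behavior policy
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        done = s_next == n_states - 1
        r = float(done)
        target = r + gamma * (0.0 if done else value(s_next))  # bootstrapped target
        # Semi-gradient: the target is treated as a constant; only V_w(s) is differentiated.
        w += alpha * (target - value(s)) * phi(s)
        s = s_next

print(w.round(2))
```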
8.3 The deadly triad
Purpose. The deadly triad focuses on the combination of bootstrapping, off-policy learning, and function approximation. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
This concept is part of the bridge from sequential probability models to practical RL algorithms.
Worked reading.
The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- small tabular MDPs.
- neural value functions.
- preference-optimized language policies.
Non-examples:
- static regression.
- uncontrolled simulation traces.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
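The sketch below marks where each ingredient of the triad enters a single semi-gradient Q-learning update with linear features; the feature map and the sample transition are illustrative assumptions. Any two ingredients are usually benign, but all three together can make the iteration diverge.

```python
import numpy as np

# One off-policy, bootstrapped, function-approximated update, with the three
# triad ingredients called out in comments.
n_actions, n_features = 2, 3
gamma, alpha = 0.99, 0.1
rng = np.random.default_rng(0)
w = rng.normal(size=(n_actions, n_features))

def phi(s):
    # (1) Function approximation: states share features, so updates generalize.
    return np.array([1.0, s, s * s])

def q(s, a):
    return w[a] @ phi(s)

def update(s, a, r, s_next, done):
    # (2) Off-policy data: (s, a) may come from a replay buffer or an old policy.
    # (3) Bootstrapping: the target reuses the current estimate via the greedy max.
    greedy_next = max(q(s_next, b) for b in range(n_actions))
    target = r + gamma * (1.0 - done) * greedy_next
    w[a] += alpha * (target - q(s, a)) * phi(s)   # semi-gradient step

update(s=0.5, a=1, r=0.0, s_next=0.7, done=0.0)
print(w.round(3))
```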
8.4 DQN stabilization
Purpose. DQN stabilization focuses on replay buffers and target networks. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
Function approximation replaces tables with parameterized models so the agent can generalize across large state spaces.
Worked reading.
Deep RL is powerful because neural networks share statistical strength, but unstable because approximate bootstrapping can amplify errors.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- DQN target networks.
- experience replay.
- critic networks.
Non-examples:
- exact dynamic programming in a tiny known MDP.
- memorizing every state-action value in a table.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
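A minimal sketch of the two stabilizers, with a linear Q-function standing in for the neural network so the example stays dependency-free; the buffer size, batch size, sync interval, and randomly generated transitions are illustrative assumptions.

```python
import numpy as np
import random
from collections import deque

# DQN-style stabilization: a replay buffer decorrelates updates, and a target
# network keeps the bootstrap target fixed between periodic syncs.
n_actions, n_features = 2, 4
gamma, alpha, sync_every, batch_size = 0.99, 0.01, 100, 32
rng = np.random.default_rng(0)

w_online = rng.normal(scale=0.1, size=(n_actions, n_features))
w_target = w_online.copy()                      # target network starts as a copy
replay = deque(maxlen=10_000)                   # replay buffer of transitions

def q(w, s):
    return w @ s                                # per-action values for feature vector s

def train_step(step_count):
    global w_target
    if len(replay) < batch_size:
        return
    batch = random.sample(list(replay), batch_size)   # decorrelated minibatch, not the latest rollout
    for s, a, r, s_next, done in batch:
        target = r + gamma * (1.0 - done) * q(w_target, s_next).max()  # target net, terminal mask
        td_error = target - q(w_online, s)[a]
        w_online[a] += alpha * td_error * s     # update the online network only
    if step_count % sync_every == 0:
        w_target = w_online.copy()              # periodic hard sync

# Usage: random placeholder transitions stand in for real environment interaction.
for t in range(1, 501):
    s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
    replay.append((s, int(rng.integers(n_actions)), float(rng.random()), s_next, 0.0))
    train_step(t)
print(w_online.round(2))
```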
8.5 Representation learning for policies and critics
Purpose. Representation learning for policies and critics focuses on how neural networks change the math. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.
Operational definition.
This concept is part of the bridge from sequential probability models to practical RL algorithms.
Worked reading.
The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.
| Object | Mathematical role | ML interpretation |
|---|---|---|
| state space S | set of situations the agent can occupy | task context, simulator state, dialogue state |
| action space A | set of choices available in each state | moves, controls, generated tokens, tool choices |
| transition kernel P | distribution over the next state given the current state and action | environment dynamics or next-context distribution |
| reward function r | scalar signal attached to each state-action step | scalar training signal, preference score, task score |
| policy pi | distribution over actions in each state | behavior rule or neural action distribution |
| value functions V and Q | expected discounted return from a state or state-action pair | estimates of future performance |
Examples:
- small tabular MDPs.
- neural value functions.
- preference-optimized language policies.
Non-examples:
- static regression.
- uncontrolled simulation traces.
Derivation habit.
- State whether the policy is fixed, greedy-improved, or being optimized.
- Write the return or Bellman target before writing code.
- Identify whether the target is sampled, model-based, bootstrapped, or exact.
- Check whether the data are on-policy or off-policy.
- Mask terminal states so that value is not propagated beyond the episode.
Implementation lens.
In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.
For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
Useful checks:
- Does the transition or sample contain the next state?
- Is the update using Q(s, a), V(s), an advantage, or a reward-only signal?
- Is the current policy also the data-collection policy?
- Does discounting express time preference or just numerical convenience?
- Is the reward model being optimized beyond its trustworthy region?
A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
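A minimal sketch of a shared encoder with separate policy and value heads, assuming PyTorch is available; the layer sizes and the categorical action head are illustrative choices, not the course's reference architecture. The shared torso is where representation learning happens: both the actor and the critic read the same learned features.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(            # shared representation
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # logits over actions
        self.value_head = nn.Linear(hidden, 1)           # scalar critic

    def forward(self, obs):
        h = self.encoder(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

net = PolicyValueNet()
obs = torch.randn(16, 8)                         # batch of observations
logits, values = net(obs)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                          # policy: a distribution over actions
print(actions.shape, values.shape)               # torch.Size([16]) torch.Size([16])
```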