Reinforcement Learning: Part 9: Policy Gradients to References

9. Policy Gradients

Policy Gradients is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

9.1 Policy objective

Purpose. Policy objective focuses on maximizing J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[G_0]. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\qquad 0\le\gamma<1.
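
A minimal notebook-style sketch of this definition, with a hypothetical four-step episode; the direct sum and the backward recursion G_t = R_{t+1} + \gamma G_{t+1} should agree:

```python
gamma = 0.99
rewards = [1.0, 0.0, 0.0, 5.0]  # hypothetical R_1, ..., R_4 from one episode

# Direct sum: G_0 = sum_k gamma^k * R_{k+1}
g0 = sum(gamma**k * r for k, r in enumerate(rewards))

# Equivalent backward recursion: G_t = R_{t+1} + gamma * G_{t+1}
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g

assert abs(g - g0) < 1e-9
print(g0)
```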

Operational definition.

Policy methods optimize the action distribution directly. This is essential when actions are continuous, structured, or generated token by token.

Worked reading.

The score-function identity converts a derivative of an expectation into an expectation of \nabla_\theta\log\pi_\theta(a\mid s) times a return-like signal.

Object        | Mathematical role | ML interpretation
\mathcal{S}   | state space       | task context, simulator state, dialogue state
\mathcal{A}   | action space      | moves, controls, generated tokens, tool choices
P(s'\mid s,a) | transition kernel | environment dynamics or next-context distribution
r(s,a,s')     | reward function   | scalar training signal, preference score, task score
\pi(a\mid s)  | policy            | behavior rule or neural action distribution
V^\pi, Q^\pi  | value functions   | estimates of future performance

Examples:

  1. REINFORCE.
  2. actor-critic.
  3. PPO for RLHF.

Non-examples:

  1. choosing \arg\max_a Q(s,a) from a tiny action table.
  2. behavior cloning without reward feedback.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
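
As an illustration of that sequence-level structure, the sketch below (hypothetical numbers and shapes) broadcasts one response-level reward to per-token log-probabilities in a REINFORCE-style loss; it assumes the per-token log-probs were already gathered from the policy.

```python
import numpy as np

# token_logprobs[t] = log pi_theta(token_t | prompt, tokens_<t); hypothetical values
token_logprobs = np.array([-1.2, -0.7, -2.1, -0.4])
sequence_reward = 0.8   # e.g., a reward-model score for the full response
baseline = 0.5          # e.g., a running average of recent rewards

# Every token shares the same sequence-level advantage, because the
# reward arrives only after the full response is generated.
advantage = sequence_reward - baseline
loss = -(advantage * token_logprobs).sum()  # minimizing this ascends the objective
print(loss)
```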

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using V, Q, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
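
A minimal tabular sketch of that distinction, on one made-up transition (s, a, r, s') and a small Q-table; only the bootstrap term differs between the two rules:

```python
import numpy as np

Q = np.zeros((3, 2))           # hypothetical 3-state, 2-action Q-table
alpha, gamma = 0.1, 0.99
s, a, r, s_next = 0, 1, 1.0, 2
a_next = 0                      # action actually sampled by the behavior policy
done = False

# SARSA (on-policy): bootstrap from the action actually taken at s'.
sarsa_target = r + (0.0 if done else gamma * Q[s_next, a_next])

# Q-learning (off-policy): bootstrap from the greedy action at s'.
qlearn_target = r + (0.0 if done else gamma * Q[s_next].max())

print(sarsa_target, qlearn_target)
Q[s, a] += alpha * (sarsa_target - Q[s, a])  # apply whichever rule the algorithm uses
```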

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators. The object table, derivation habit, implementation lens, and checks above apply unchanged to every subsection that follows.

9.2 Log-derivative trick

Purpose. Log-derivative trick focuses on turning trajectory probabilities into score functions. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

\nabla_\theta\log p_\theta(\tau)=\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\quad\Longrightarrow\quad \nabla_\theta\,\mathbb{E}_{\tau\sim p_\theta}[f(\tau)]=\mathbb{E}_{\tau\sim p_\theta}\!\left[f(\tau)\,\nabla_\theta\log p_\theta(\tau)\right].

Operational definition.

The log-derivative trick rewrites the gradient of a probability as the probability times the gradient of its log, so the gradient of an expectation can be estimated from samples drawn under the current policy.
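
A quick numerical check of the identity for a two-action softmax policy with a single logit parameter (all values hypothetical); the finite-difference gradient of the probability should equal probability times score:

```python
import numpy as np

def p_action0(theta):
    # Softmax over two actions with logits (theta, 0); returns pi_theta(a=0).
    return np.exp(theta) / (np.exp(theta) + 1.0)

theta, eps = 0.3, 1e-6

# Left side: d/dtheta pi_theta(a=0), by central finite differences.
grad_p = (p_action0(theta + eps) - p_action0(theta - eps)) / (2 * eps)

# Right side: pi_theta(a=0) * d/dtheta log pi_theta(a=0).
grad_logp = (np.log(p_action0(theta + eps)) - np.log(p_action0(theta - eps))) / (2 * eps)
print(grad_p, p_action0(theta) * grad_logp)  # the two numbers should agree
```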

Worked reading.

The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

9.3 Policy gradient theorem

Purpose. Policy gradient theorem focuses on why gradients can use Q^\pi(s,a). This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,Q^{\pi_\theta}(S_t,A_t)\right].

Operational definition.

Policy methods optimize the action distribution directly. This is essential when actions are continuous, structured, or generated token by token.

Worked reading.

The score-function identity converts a derivative of an expectation into an expectation of \nabla_\theta\log\pi_\theta(a\mid s) times a return-like signal.
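
A minimal sketch for a two-armed bandit with a softmax policy, all numbers hypothetical: the Monte Carlo score-times-Q estimate should match the analytic gradient p'(\theta)(q_0 - q_1) for this tiny problem.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5
q = np.array([1.0, 0.2])                 # hypothetical true action values

p0 = np.exp(theta) / (np.exp(theta) + 1.0)
probs = np.array([p0, 1.0 - p0])
scores = np.array([1.0 - p0, -p0])       # d log pi(a) / d theta for logits (theta, 0)

# Monte Carlo form of the theorem: grad J = E_{a~pi}[ score(a) * Q(a) ]
actions = rng.choice(2, size=200_000, p=probs)
mc_grad = np.mean(scores[actions] * q[actions])

analytic = p0 * (1.0 - p0) * (q[0] - q[1])  # exact gradient for this bandit
print(mc_grad, analytic)                    # should be close
```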

Examples:

  1. REINFORCE.
  2. actor-critic.
  3. PPO for RLHF.

Non-examples:

  1. choosing \arg\max_a Q(s,a) from a tiny action table.
  2. behavior cloning without reward feedback.

9.4 Baselines and variance reduction

Purpose. Baselines and variance reduction focuses on why subtracting b(s) is unbiased. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\!\left[\nabla_\theta\log\pi_\theta(a\mid s)\,b(s)\right]=b(s)\,\nabla_\theta\sum_a\pi_\theta(a\mid s)=b(s)\,\nabla_\theta 1=0.

Operational definition.

Subtracting a state-dependent baseline b(s) from the return leaves the expected gradient unchanged, because the expected score is zero; it only reduces the variance of the estimator. The value function V^\pi(s) is the most common choice of baseline.

Worked reading.

The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.
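
A small numerical sketch of the effect, using a two-action softmax policy and hypothetical action values: the estimator keeps the same mean with and without the baseline, but its variance shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.0
p0 = np.exp(theta) / (np.exp(theta) + 1.0)   # = 0.5 at theta = 0
probs = np.array([p0, 1.0 - p0])
scores = np.array([1.0 - p0, -p0])            # d log pi(a) / d theta
q = np.array([3.0, 2.0])                      # hypothetical action values

actions = rng.choice(2, size=100_000, p=probs)
baseline = probs @ q                          # b(s) = V(s), a standard choice

without_b = scores[actions] * q[actions]
with_b = scores[actions] * (q[actions] - baseline)

print(without_b.mean(), with_b.mean())        # same expectation (unbiased)
print(without_b.var(), with_b.var())          # smaller variance with baseline
```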

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

9.5 Entropy regularization

Purpose. Entropy regularization focuses on why stochastic policies are encouraged. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

J_\beta(\theta)=\mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty}\gamma^t\left(R_{t+1}+\beta\,\mathcal{H}\!\left(\pi_\theta(\cdot\mid S_t)\right)\right)\right],\qquad \mathcal{H}(\pi)=-\sum_a\pi(a)\log\pi(a).

Operational definition.

An entropy bonus rewards policies that keep probability mass spread across actions, which discourages premature collapse onto a single action while value and advantage estimates are still noisy.

Worked reading.

The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.
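
A minimal sketch of the bonus at a single state, with hypothetical logits and coefficient; subtracting \beta times the entropy from the loss pushes the optimizer toward higher-entropy policies.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([2.0, 0.1, -1.0])   # hypothetical action logits at one state
probs = softmax(logits)

# Policy entropy H(pi(.|s)); near-deterministic policies score low.
entropy = -np.sum(probs * np.log(probs))

beta = 0.01                            # entropy coefficient, a tuning knob
policy_loss = 0.0                      # placeholder for the usual policy-gradient loss
loss = policy_loss - beta * entropy    # minimizing the loss raises entropy
print(entropy, loss)
```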

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

10. Actor-Critic, PPO, and RLHF

Actor-critic methods, PPO, and RLHF are part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

10.1 Actor-critic decomposition

Purpose. Actor-critic decomposition focuses on policy and value learning together. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s],\qquad Q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a].

Operational definition.

Policy methods optimize the action distribution directly. This is essential when actions are continuous, structured, or generated token by token.

Worked reading.

The score-function identity converts a derivative of an expectation into an expectation of \nabla_\theta\log\pi_\theta(a\mid s) times a return-like signal.
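
A minimal tabular actor-critic sketch on one hypothetical transition: the critic takes a TD step, and the actor uses the same TD error as its advantage estimate.

```python
import numpy as np

n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax logits per state
V = np.zeros(n_states)                    # critic: state-value table
alpha_pi, alpha_v, gamma = 0.1, 0.2, 0.99

s, a, r, s_next, done = 0, 1, 1.0, 2, False   # hypothetical transition

# Critic: one-step TD error, masking the bootstrap at terminal states.
td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
V[s] += alpha_v * td_error

# Actor: score-function step with the TD error as the advantage.
probs = np.exp(theta[s]) / np.exp(theta[s]).sum()
grad_logp = -probs
grad_logp[a] += 1.0                       # d log pi(a|s) / d theta[s] for softmax
theta[s] += alpha_pi * td_error * grad_logp
```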

Examples:

  1. REINFORCE.
  2. actor-critic.
  3. PPO for RLHF.

Non-examples:

  1. choosing \arg\max_a Q(s,a) from a tiny action table.
  2. behavior cloning without reward feedback.

10.2 Generalized advantage estimation

Purpose. Generalized advantage estimation focuses on the \lambda tradeoff for advantages. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}=\sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l},\qquad \delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t).

Operational definition.

Value functions summarize future consequences. Advantage functions compare an action with the policy's average behavior at the same state.

Worked reading.

Occupancy measures explain why RL gradients weight states by how often the current policy visits them.
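
A sketch of the standard backward recursion for GAE over one short rollout, with hypothetical rewards and value estimates; the terminal mask stops both bootstrapping and the \lambda accumulation.

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.5])        # hypothetical R_1, R_2, R_3
values = np.array([0.8, 0.6, 0.4, 0.2])    # V(s_0) .. V(s_3), incl. bootstrap state
dones = np.array([False, False, True])     # episode terminates at the last step
gamma, lam = 0.99, 0.95

advantages = np.zeros_like(rewards)
gae = 0.0
for t in reversed(range(len(rewards))):
    mask = 0.0 if dones[t] else 1.0        # no value propagation past terminal
    delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
    gae = delta + gamma * lam * mask * gae
    advantages[t] = gae
print(advantages)
```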

Examples:

  1. V^\pi.
  2. Q^\pi.
  3. A^\pi.

Non-examples:

  1. instant reward only.
  2. a metric computed on states never reached by the policy.

10.3 Trust regions and KL control

Purpose. Trust regions and KL control focuses on limiting policy movement. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

\max_\theta\ \mathbb{E}\!\left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\,\hat{A}(s,a)\right]\quad\text{subject to}\quad \mathbb{E}\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot\mid s)\,\Vert\,\pi_\theta(\cdot\mid s)\right)\right]\le\delta.

Operational definition.

Trust-region methods constrain how far a single update can move the policy, so that advantage estimates computed from the old policy's data remain approximately valid.

Worked reading.

The KL divergence between the old and new action distributions measures how much an update moved the policy; trust regions bound it, and RLHF applies the same idea as a KL penalty toward a frozen reference model.
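
A minimal sketch of that measurement at one state, with hypothetical logits and radius: compute the KL between the pre-update and post-update action distributions and compare it against a trust-region threshold.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

old_logits = np.array([1.0, 0.0, -0.5])   # pi_old at one state (hypothetical)
new_logits = np.array([1.4, -0.2, -0.5])  # pi_theta after an update step

p_old, p_new = softmax(old_logits), softmax(new_logits)

# KL(pi_old || pi_new): how far the update moved the action distribution.
kl = np.sum(p_old * (np.log(p_old) - np.log(p_new)))

delta = 0.01                               # trust-region radius, a tuning knob
if kl > delta:
    print(f"KL {kl:.4f} exceeds {delta}: shrink the step or raise the penalty")
```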

Examples:

  1. TRPO-style KL-constrained updates.
  2. PPO clipping as a soft trust region.
  3. KL penalties toward a reference model in RLHF.

Non-examples:

  1. unconstrained policy-gradient steps with a large learning rate.
  2. greedy improvement that jumps directly to the argmax policy.

10.4 PPO clipped surrogate objective

Purpose. PPO clipped surrogate objective focuses on a practical trust-region approximation. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

L^{\mathrm{CLIP}}(\theta)=\mathbb{E}\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\right)\right],\qquad r_t(\theta)=\frac{\pi_\theta(A_t\mid S_t)}{\pi_{\theta_{\mathrm{old}}}(A_t\mid S_t)}.

Operational definition.

Policy methods optimize the action distribution directly. This is essential when actions are continuous, structured, or generated token by token.

Worked reading.

The score-function identity converts a derivative of an expectation into an expectation of \nabla_\theta\log\pi_\theta(a\mid s) times a return-like signal.
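
A minimal sketch of the clipped surrogate on a tiny batch of logged actions (all numbers hypothetical); the min term removes any incentive to push the ratio far outside [1-\epsilon, 1+\epsilon].

```python
import numpy as np

logp_new = np.array([-0.9, -1.6, -0.3])   # log pi_theta(a|s) under current policy
logp_old = np.array([-1.0, -1.5, -0.7])   # log pi_theta_old(a|s), frozen copy
advantages = np.array([0.5, -0.2, 1.1])   # hypothetical advantage estimates
eps = 0.2

ratio = np.exp(logp_new - logp_old)        # r_t(theta)
unclipped = ratio * advantages
clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages

# Pessimistic min: take the worse of the clipped and unclipped objectives.
loss = -np.mean(np.minimum(unclipped, clipped))
print(loss)
```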

Examples:

  1. REINFORCE.
  2. actor-critic.
  3. PPO for RLHF.

Non-examples:

  1. choosing \arg\max_a Q(s,a) from a tiny action table.
  2. behavior cloning without reward feedback.

10.5 Reward modeling and preference optimization

Purpose. Reward modeling and preference optimization focuses on the RLHF bridge to language models. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

P(y_w\succ y_l\mid x)=\sigma\!\left(r_\phi(x,y_w)-r_\phi(x,y_l)\right),\qquad \mathcal{L}(\phi)=-\log\sigma\!\left(r_\phi(x,y_w)-r_\phi(x,y_l)\right).

Operational definition.

A reward model r_\phi is fit to human preference comparisons and then supplies the scalar reward for policy optimization, usually under a KL penalty that keeps the policy close to a reference model.

Worked reading.

The key habit is to name the state, action, reward, transition, policy, and value object before writing an update.
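
A minimal sketch of the Bradley-Terry preference loss on a single pair, with hypothetical reward-model scores: the loss is the negative log-probability that the preferred response wins.

```python
import numpy as np

r_chosen = 1.8     # r_phi(x, y_w): hypothetical score for the preferred response
r_rejected = 0.6   # r_phi(x, y_l): hypothetical score for the rejected response

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# P(y_w preferred over y_l) = sigmoid of the score margin.
p_correct = sigmoid(r_chosen - r_rejected)
loss = -np.log(p_correct)   # minimized when the margin is large and correct
print(p_correct, loss)
```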

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

11. Common Mistakes

#  | Mistake                                          | Why it is wrong                                                               | Fix
1  | Confusing rewards with returns                   | A reward is local; a return is accumulated over time.                        | Always write G_t before deriving an update target.
2  | Ignoring the data distribution shift             | Changing the policy changes which states are visited.                        | Name whether the data are on-policy or off-policy.
3  | Treating Bellman equations as supervised labels  | Bellman targets contain estimates and bootstrapping.                         | Track target networks, stop-gradient choices, or tabular guarantees.
4  | Using Q-learning for every problem               | Continuous action spaces and large state spaces often need different methods. | Choose value-based, policy-gradient, or actor-critic methods from the action and data structure.
5  | Forgetting exploration                           | A greedy policy may never see better actions.                                | Use explicit exploration or uncertainty-aware data collection.
6  | Trusting average reward without variance         | RL estimates are noisy and seed-sensitive.                                   | Report confidence intervals, seeds, and learning curves.
7  | Mixing offline and online assumptions            | Logged data may not cover actions needed by the learned policy.              | Check coverage and use conservative offline RL when needed.
8  | Over-optimizing a reward model                   | The policy may exploit reward model errors.                                  | Use KL control, held-out preference evaluation, and adversarial tests.
9  | Calling PPO a magic stabilizer                   | PPO still depends on advantage quality, clipping, normalization, and KL monitoring. | Audit ratios, advantages, entropy, KL, and value loss.
10 | Forgetting terminal states                       | Bootstrapping through terminal states creates false future value.            | Mask terminal transitions in TD targets.

12. Exercises

  1. (*) Solve a Bellman policy-evaluation system for a three-state chain.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  2. (*) Run three value-iteration backups and track the sup-norm change.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  3. (*) Compute a Monte Carlo return and compare it with a TD target.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  4. (**) Apply one SARSA update and one Q-learning update to the same transition.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  5. (**) Compute an epsilon-greedy action distribution.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  6. (**) Estimate a policy-gradient direction for a two-action softmax policy.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  7. (**) Compute generalized advantages from rewards and value estimates.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  8. (***) Evaluate the PPO clipped surrogate for positive and negative advantages.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  9. (***) Fit a Bradley-Terry reward-model probability for preference data.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.
  10. (***) Compute a DPO-style preference loss and explain the KL-control intuition.

    • (a) Name the random variables.
    • (b) Write the target equation.
    • (c) Compute the numeric result.
    • (d) Explain what the result means for an agent or LLM policy.

13. Why This Matters for AI

Concept              | AI impact
MDPs                 | Provide the mathematical contract for agents, simulators, robotics tasks, games, and dialogue policies.
Bellman equations    | Turn long-horizon objectives into one-step recursive learning targets.
TD learning          | Explains bootstrapping, credit assignment, and value-learning targets used in deep RL.
Q-learning           | Powers value-based control and explains why replay and target networks stabilize DQN-style systems.
Policy gradients     | Give the gradient estimator behind REINFORCE, actor-critic, PPO, and many RLHF implementations.
Advantage estimation | Reduces variance and makes policy updates more sample-efficient.
KL regularization    | Keeps a learned policy close to a reference model in RLHF and safe fine-tuning.
Reward modeling      | Connects human preferences to scalar optimization, while exposing reward hacking risks.

14. Conceptual Bridge

The backward bridge is probability and Markov chains. A Markov chain has transitions but no actions. An MDP adds actions, rewards, and optimization. Once actions enter the process, the learner must reason about both inference and control.

The forward bridge is alignment and interactive systems. RLHF, DPO, preference models, online experiments, and agentic tool-use loops all reuse RL ideas: reward signals, policies, KL constraints, distribution shift, and evaluation under feedback.

+-------------------+      +-------------------------+      +----------------------+
| Markov chains     | ---> | Markov decision process | ---> | policy optimization  |
| transition only   |      | actions and rewards     |      | RLHF, agents, games  |
+-------------------+      +-------------------------+      +----------------------+

The most important practical lesson is that RL is not just an optimizer. It is a complete data-generating loop. When a policy changes, the future dataset changes. That is why careful RL work always audits rewards, policies, value estimates, exploration, and evaluation together.
