
Reinforcement Learning: 3. Returns, Policies, and Value Functions to 4. Bellman Equations

3. Returns, Policies, and Value Functions

Returns, policies, and value functions are part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

3.1 Discounted return

Purpose. Discounted return focuses on why $G_t=\sum_{k\ge 0}\gamma^k R_{t+k+1}$ is central. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

$$G_t=\sum_{k\ge 0}\gamma^k R_{t+k+1}=R_{t+1}+\gamma G_{t+1}.$$

Operational definition.

The discounted return $G_t$ accumulates all future rewards, down-weighting a reward $k$ steps away by $\gamma^k$; the discount $\gamma\in[0,1)$ sets an effective horizon of roughly $1/(1-\gamma)$ steps.

Worked reading.

Read the recursion $G_t=R_{t+1}+\gamma G_{t+1}$ from right to left: the return from now is the next reward plus the discounted return from the next step. This recursion is what every Bellman target exploits.
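
The recursion is also the cheapest way to compute returns in code. A minimal sketch, with made-up rewards and discount, that sweeps one finished episode backwards (the initial `g = 0.0` is the terminal mask from the derivation habit below):

```python
# Discounted returns via the recursion G_t = R_{t+1} + gamma * G_{t+1}.
# Rewards and gamma are illustrative values, not from any real task.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]  # R_1 .. R_T for one finished episode

returns = [0.0] * len(rewards)
g = 0.0  # value past the terminal step is zero (terminal masking)
for t in reversed(range(len(rewards))):
    g = rewards[t] + gamma * g
    returns[t] = g

print(returns)  # returns[0] is G_0 for this episode
```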

| Object | Mathematical role | ML interpretation |
| --- | --- | --- |
| $\mathcal{S}$ | state space | task context, simulator state, dialogue state |
| $\mathcal{A}$ | action space | moves, controls, generated tokens, tool choices |
| $P(s'\mid s,a)$ | transition kernel | environment dynamics or next-context distribution |
| $r(s,a,s')$ | reward function | scalar training signal, preference score, task score |
| $\pi(a\mid s)$ | policy | behavior rule or neural action distribution |
| $V^\pi,\,Q^\pi$ | value functions | estimates of future performance |

Examples:

  1. the Monte Carlo return of a completed episode.
  2. the n-step return with a bootstrapped tail.
  3. the discounted infinite-horizon return with $\gamma<1$.

Non-examples:

  1. the instantaneous reward $R_{t+1}$ on its own.
  2. an undiscounted infinite sum, which may diverge.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.
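
Concretely, a single end-of-response score can still be spread over token-level steps with the same backward recursion; with $\gamma=1$ every token simply shares the sequence-level return. A sketch under that assumption (the reward value and length are placeholders):

```python
# One sequence-level reward, broadcast to per-token returns.
gamma = 1.0          # undiscounted within the response
num_tokens = 6       # length of the generated response (placeholder)
final_reward = 0.8   # e.g. a preference score on the full response

rewards = [0.0] * (num_tokens - 1) + [final_reward]
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

print(returns)  # with gamma = 1: every token gets 0.8
```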

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using $V$, $Q$, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
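
The contrast fits in a few lines. A sketch with a toy Q table (values are arbitrary); the only difference between the two targets is which next action supplies the bootstrap:

```python
# Q is a plain table Q[state][action]; done masks terminal transitions.
def sarsa_target(Q, r, s_next, a_next, gamma, done):
    # On-policy: bootstrap from the action actually taken next.
    return r + (0.0 if done else gamma * Q[s_next][a_next])

def q_learning_target(Q, r, s_next, gamma, done):
    # Off-policy: bootstrap from the greedy next action.
    return r + (0.0 if done else gamma * max(Q[s_next]))

Q = [[0.0, 0.5],
     [1.0, 0.2]]  # toy 2-state, 2-action table
print(sarsa_target(Q, 1.0, 1, 1, 0.9, False))    # 1 + 0.9 * 0.2 = 1.18
print(q_learning_target(Q, 1.0, 1, 0.9, False))  # 1 + 0.9 * 1.0 = 1.9
```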

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.

3.2 Deterministic and stochastic policies

Purpose. Deterministic and stochastic policies focus on how $\pi(a\mid s)$ controls the data distribution.

$$V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).$$

Operational definition.

A deterministic policy maps each state to a single action, $a=\mu(s)$; a stochastic policy assigns each action a probability, $\pi(a\mid s)$. Combined with the transition kernel, either one induces a full distribution over trajectories, and therefore over the data the agent trains on.

Worked reading.

In the Bellman expectation equation above, $\pi(a\mid s)$ is the only factor the agent controls; the inner sum over $s'$ belongs to the environment. The key habit is still to name the state, action, reward, transition, policy, and value object before writing an update.
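
As a sketch, both kinds of policy are tiny functions over the same action set, and a deterministic policy is just the one-hot special case (all tables here are invented for illustration):

```python
import random

n_actions = 2
mu = [0, 1, 0]                             # deterministic: state -> action
pi = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]  # stochastic: pi[s][a]

def act_deterministic(s):
    return mu[s]

def act_stochastic(s):
    return random.choices(range(n_actions), weights=pi[s])[0]

# A deterministic policy as a degenerate distribution (one-hot rows).
pi_from_mu = [[1.0 if a == mu[s] else 0.0 for a in range(n_actions)]
              for s in range(len(mu))]
print(act_deterministic(1), act_stochastic(1), pi_from_mu)
```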


Examples:

  1. $\varepsilon$-greedy action selection over a Q table.
  2. a softmax (Boltzmann) policy over action preferences.
  3. an autoregressive language model decoding tokens.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

The derivation habit, implementation lens, useful checks, failure mode, and notebook advice from 3.1 apply here unchanged.

3.3 State-value and action-value functions

Purpose. State-value and action-value functions focus on what $V^\pi$ and $Q^\pi$ estimate.

$$V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s],\qquad Q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a].$$

Operational definition.

$V^\pi(s)$ is the expected return starting from $s$ and following $\pi$; $Q^\pi(s,a)$ is the expected return after committing to action $a$ in $s$ first. They are linked by $V^\pi(s)=\sum_a\pi(a\mid s)\,Q^\pi(s,a)$.

Worked reading.

$Q^\pi$ conditions on the first action while $V^\pi$ averages over it; this is why greedy improvement is simply $\arg\max_a Q^\pi(s,a)$, with no model of the dynamics required.
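
The link $V^\pi(s)=\sum_a\pi(a\mid s)Q^\pi(s,a)$ and the greedy read-out are each one line. A sketch over invented tables:

```python
# Toy tables: 2 states, 2 actions (numbers invented for illustration).
Q = [[1.0, 3.0],
     [0.5, 0.0]]
pi = [[0.5, 0.5],
      [0.9, 0.1]]

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
V = [sum(p * q for p, q in zip(pi[s], Q[s])) for s in range(len(Q))]
print(V)  # [2.0, 0.45]

# Greedy improvement reads off the argmax of Q in each state.
greedy = [max(range(len(Q[s])), key=lambda a: Q[s][a]) for s in range(len(Q))]
print(greedy)  # [1, 0]
```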


Examples:

  1. a tabular Q array indexed by state and action.
  2. a DQN network approximating $Q^*$.
  3. a critic network approximating $V^\pi$.

Non-examples:

  1. the immediate reward $r(s,a,s')$ on its own.
  2. a heuristic state score with no link to expected return.

The derivation habit, implementation lens, useful checks, failure mode, and notebook advice from 3.1 apply here unchanged.

3.4 Advantage functions

Purpose. Advantage functions focus on why $A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)$ reduces variance.

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,A^{\pi_\theta}(S_t,A_t)\right].$$

Operational definition.

The advantage $A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)$ measures how much better action $a$ is than the policy's average behavior at $s$. Under the policy itself it is zero in expectation: $\sum_a\pi(a\mid s)\,A^\pi(s,a)=0$.

Worked reading.

Because $V^\pi(s)$ does not depend on the action, subtracting it from $Q^\pi$ leaves the policy gradient unbiased while removing the component of the return shared by all actions, which is exactly the variance reduction the purpose statement refers to.
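
The computation, and the zero-mean sanity check that justifies the baseline, both fit in a short sketch (same invented tables one might use in the notebook):

```python
Q = [[1.0, 3.0], [0.5, 0.0]]   # invented Q^pi table
pi = [[0.5, 0.5], [0.9, 0.1]]  # invented policy

V = [sum(p * q for p, q in zip(pi[s], Q[s])) for s in range(len(Q))]
A = [[Q[s][a] - V[s] for a in range(len(Q[s]))] for s in range(len(Q))]
print(A)  # positive where an action beats the policy's average

# Sanity check: E_{a ~ pi}[A^pi(s, a)] = 0, which is why subtracting
# the baseline V leaves the policy gradient unbiased.
for s in range(len(Q)):
    assert abs(sum(p * a for p, a in zip(pi[s], A[s]))) < 1e-9
```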


Examples:

  1. $Q^\pi-V^\pi$ computed by an actor-critic.
  2. generalized advantage estimation (GAE).
  3. return minus a learned baseline in REINFORCE.

Non-examples:

  1. a raw return used with no baseline.
  2. a Q-value used directly as the gradient weight.

The derivation habit, implementation lens, useful checks, failure mode, and notebook advice from 3.1 apply here unchanged.

3.5 Occupancy measures

Purpose. Occupancy measures focus on how policies induce weighted state-action distributions.

$$d^\pi(s)=(1-\gamma)\sum_{t\ge 0}\gamma^t\Pr(S_t=s\mid\pi).$$

Operational definition.

The occupancy measure weights each state by the discounted frequency with which $\pi$ visits it. Expected return is an average of one-step rewards under this distribution, so the occupancy is the distribution that RL objectives actually integrate over.

Worked reading.

Occupancy measures explain why RL gradients weight states by how often the current policy visits them.
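
For a finite chain the discounted occupancy has a closed form, since $d^\pi$ solves a linear system in the policy-induced transition matrix. A sketch with an invented two-state chain:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.3],    # policy-induced transitions P_pi[s, s']
                 [0.2, 0.8]])   # (invented for illustration)
mu = np.array([1.0, 0.0])       # initial state distribution

# d = (1 - gamma) * sum_t gamma^t * mu @ P_pi^t  solves a linear system:
d = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu)
print(d, d.sum())  # a probability vector of discounted visit frequencies
```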


Examples:

  1. the discounted state-visitation distribution $d^\pi$.
  2. the state-action occupancy $d^\pi(s)\,\pi(a\mid s)$.
  3. the empirical state distribution of on-policy rollouts.

Non-examples:

  1. the uniform distribution over all states.
  2. a metric computed on states never reached by the policy.

The derivation habit, implementation lens, useful checks, failure mode, and notebook advice from 3.1 apply here unchanged.

4. Bellman Equations

Bellman equations are part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

4.1 Bellman expectation equation

Purpose. The Bellman expectation equation focuses on recursive consistency for a fixed policy.

$$V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).$$

Operational definition.

Bellman equations express a long-horizon value as the immediate reward plus the discounted value of the next state. They are recursive consistency equations.

Worked reading.

The Bellman backup replaces a value estimate by a reward-plus-next-value target under either a fixed policy or an optimal action.
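
Turning the equation into code gives iterative policy evaluation: sweep the backup over every state until the values stop moving. A sketch on an invented two-state, two-action MDP:

```python
import numpy as np

gamma = 0.9
# Invented MDP: P[s, a, s'] transition probabilities, R[s, a] expected reward.
P = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.1, 0.9], [0.3, 0.7]]])
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])
pi = np.array([[0.6, 0.4],
               [0.5, 0.5]])  # pi[s, a]

V = np.zeros(2)
for _ in range(200):
    Q = R + gamma * P @ V     # Q[s, a] = R[s, a] + gamma * sum_s' P * V
    V = (pi * Q).sum(axis=1)  # V[s]    = sum_a pi(a|s) * Q[s, a]
print(V)  # converges to V^pi
```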


Examples:

  1. iterative policy evaluation.
  2. TD(0) target construction.
  3. expected SARSA targets.

Non-examples:

  1. a one-step supervised label.
  2. a loss that ignores future value.

The derivation habit, implementation lens, useful checks, failure mode, and notebook advice from 3.1 apply here unchanged.

4.2 Bellman optimality equation

Purpose. The Bellman optimality equation focuses on recursive consistency for the best policy.

$$Q^*(s,a)=\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\right).$$

Operational definition.

The optimality equation characterizes $Q^*$: the value of acting optimally from $(s,a)$ is the immediate reward plus the discounted value of acting optimally from the next state. No reference policy appears; the $\max$ plays its role.

Worked reading.

The $\max$ over next actions makes this backup nonlinear, and it is what lets Q-learning bootstrap from the greedy action even when the data were collected by a different behavior policy.
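
Swapping the policy average for a max turns the evaluation loop into Q-value iteration. A sketch on the same invented MDP as before:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.5, 0.5]],  # invented P[s, a, s']
              [[0.1, 0.9], [0.3, 0.7]]])
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])               # invented R[s, a]

Q = np.zeros((2, 2))
for _ in range(200):
    # Optimality backup: bootstrap from the greedy next action.
    Q = R + gamma * P @ Q.max(axis=1)
print(Q)
print(Q.argmax(axis=1))  # the greedy (optimal) policy
```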


Examples:

  1. value iteration.
  2. Q-learning target construction.
  3. greedy policy extraction from $Q^*$.

Non-examples:

  1. a one-step supervised label.
  2. a loss that ignores future value.

The derivation habit, implementation lens, useful checks, failure mode, and notebook advice from 3.1 apply here unchanged.

4.3 Contraction mapping intuition

Purpose. Contraction mapping intuition focuses on why dynamic programming converges when $\gamma<1$.

$$\|\mathcal{T}V-\mathcal{T}W\|_\infty\le\gamma\,\|V-W\|_\infty.$$

Operational definition.

Both Bellman backups (expectation and optimality) are $\gamma$-contractions in the sup norm. By the Banach fixed-point theorem each has a unique fixed point ($V^\pi$ or $V^*$), and repeated application converges to it from any initialization.

Worked reading.

Each sweep shrinks the worst-case distance to the fixed point by a factor of at least $\gamma$, so after $k$ sweeps the error is bounded by $\gamma^k$ times the initial error; smaller $\gamma$ means faster convergence but a shorter effective horizon.
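
The property can be checked numerically: apply a backup (here the optimality backup, as one choice) to two arbitrary value vectors and watch their sup-norm distance shrink by at least $\gamma$ per sweep. A sketch on the same invented MDP:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.5, 0.5]],  # invented P[s, a, s']
              [[0.1, 0.9], [0.3, 0.7]]])
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])               # invented R[s, a]

def backup(V):  # Bellman optimality backup T
    return (R + gamma * P @ V).max(axis=1)

rng = np.random.default_rng(0)
V, W = 10 * rng.normal(size=2), 10 * rng.normal(size=2)
for _ in range(5):
    before = np.abs(V - W).max()
    V, W = backup(V), backup(W)
    after = np.abs(V - W).max()
    print(after <= gamma * before + 1e-12)  # True: a gamma-contraction
```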


Examples:

  1. value iteration converging geometrically.
  2. iterative policy evaluation sweeps.
  3. error bounds of the form $\gamma^k\|V_0-V^*\|_\infty$.

Non-examples:

  1. bootstrapping combined with function approximation and off-policy data, where the contraction argument no longer applies.
  2. the undiscounted case $\gamma=1$ without termination.

The derivation habit, implementation lens, useful checks, failure mode, and notebook advice from 3.1 apply here unchanged.

4.4 Matrix form for finite MDPs

Purpose. The matrix form for finite MDPs focuses on how policy evaluation becomes a linear system.

$$V^\pi=(I-\gamma P^\pi)^{-1}r^\pi,\qquad P^\pi(s,s')=\sum_a\pi(a\mid s)\,P(s'\mid s,a).$$

Operational definition.

For a fixed policy on a finite MDP, the Bellman expectation equation is the linear system $V^\pi=r^\pi+\gamma P^\pi V^\pi$. Because $\gamma<1$ makes $I-\gamma P^\pi$ invertible, policy evaluation has an exact closed-form solution.

Worked reading.

$P^\pi$ is the state-to-state transition matrix obtained by averaging the kernel over the policy's action choices, and $r^\pi$ is the expected one-step reward vector; solving the system is exact policy evaluation, no iteration required.
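
The whole computation is a few numpy lines: average the kernel and rewards over the policy, then solve the linear system. A sketch on the same invented MDP:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.5, 0.5]],  # invented P[s, a, s']
              [[0.1, 0.9], [0.3, 0.7]]])
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])               # invented R[s, a]
pi = np.array([[0.6, 0.4],
               [0.5, 0.5]])              # pi[s, a]

P_pi = np.einsum('sa,sat->st', pi, P)    # P^pi[s, s']
r_pi = (pi * R).sum(axis=1)              # r^pi[s]

# Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)  # matches iterative policy evaluation to numerical precision
```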


Examples:

  1. finite gridworld.
  2. inventory control.
  3. dialogue state tracking.

Non-examples:

  1. raw observations that omit hidden state.
  2. a static labeled dataset with no actions.

The derivation habit, implementation lens, useful checks, failure mode, and notebook advice from 3.1 apply here unchanged.

4.5 Bellman residuals

Purpose. Bellman residuals focus on how to diagnose approximate value functions.

$$\delta(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V(s')\right)-V(s).$$

Operational definition.

The Bellman residual measures how far a candidate value function $V$ is from satisfying its Bellman equation; it vanishes exactly at the fixed point, so its size is a direct diagnostic for approximate value functions.

Worked reading.

The contraction property turns the residual into a guarantee: $\|V-V^\pi\|_\infty\le\|\mathcal{T}^\pi V-V\|_\infty/(1-\gamma)$, and the sampled, one-transition version of the residual is exactly the TD error.
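
As a diagnostic, apply one backup to the candidate values and subtract; the sup norm of the result bounds the true error. A sketch on the same invented MDP:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.5, 0.5]],  # invented P[s, a, s']
              [[0.1, 0.9], [0.3, 0.7]]])
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])               # invented R[s, a]
pi = np.array([[0.6, 0.4],
               [0.5, 0.5]])              # pi[s, a]

def backup(V):  # Bellman expectation backup T^pi
    return (pi * (R + gamma * P @ V)).sum(axis=1)

V_hat = np.array([5.0, 10.0])            # some approximate value estimate
residual = backup(V_hat) - V_hat
print(residual)
print(np.abs(residual).max() / (1 - gamma))  # bound on ||V_hat - V^pi||_inf
```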


Examples:

  1. TD errors as sampled one-step residuals.
  2. residual-minimizing value-function losses.
  3. stopping criteria for value iteration based on $\|\mathcal{T}V-V\|_\infty$.

Non-examples:

  1. a one-step supervised label.
  2. a loss that ignores future value.

The derivation habit, implementation lens, useful checks, failure mode, and notebook advice from 3.1 apply here unchanged.
