Reinforcement Learning

Part 1: Intuition and Motivation through Formal MDP Setup

1. Intuition and Motivation

Intuition and Motivation is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

1.1 Sequential decision making

Purpose. Sequential decision making focuses on why actions change future data. This is not optional vocabulary: it determines which variables are random, which distribution creates the data, and which recursive target is valid.

\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma),\qquad P(s'\mid s,a)=\Pr(S_{t+1}=s'\mid S_t=s,A_t=a).

Operational definition.

RL is the mathematics of decisions whose consequences alter later observations. A training example is no longer fixed before learning; the policy helps create the future dataset.

Worked reading.

At time t, the agent sees S_t, chooses A_t, receives R_{t+1}, and reaches S_{t+1}. The learning signal is attached to a trajectory, not a single independent example.
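A minimal sketch of this loop, with a hypothetical two-state environment; the transition table, rewards, and random policy below are illustrative, not taken from the lesson's notebook:

```python
import random

# Hypothetical dynamics: P[s][a] lists (probability, next_state, reward).
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.5, 0, 0.0), (0.5, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
}

def step(s, a):
    """Sample S_{t+1} and R_{t+1} given (S_t, A_t)."""
    u, cum = random.random(), 0.0
    for prob, s_next, reward in P[s][a]:
        cum += prob
        if u <= cum:
            return s_next, reward
    return s_next, reward  # fall through only on floating-point rounding

s = 0
trajectory = []
for t in range(5):
    a = random.choice([0, 1])                  # a uniformly random policy
    s_next, reward = step(s, a)
    trajectory.append((s, a, reward, s_next))  # the (S_t, A_t, R_{t+1}, S_{t+1}) tuple
    s = s_next
print(trajectory)
```

The learning signal lives on the trajectory as a whole, which is the point of the paragraph above.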

Object | Mathematical role | ML interpretation
\mathcal{S} | state space | task context, simulator state, dialogue state
\mathcal{A} | action space | moves, controls, generated tokens, tool choices
P(s'\mid s,a) | transition kernel | environment dynamics or next-context distribution
r(s,a,s') | reward function | scalar training signal, preference score, task score
\pi(a\mid s) | policy | behavior rule or neural action distribution
V^\pi, Q^\pi | value functions | estimates of future performance

Examples:

  1. robot navigation.
  2. game play.
  3. dialogue policy tuning.

Non-examples:

  1. ordinary regression with fixed labels.
  2. a bandit problem with no state evolution.

Derivation habit.

  1. State whether the policy is fixed, greedy-improved, or being optimized.
  2. Write the return or Bellman target before writing code.
  3. Identify whether the target is sampled, model-based, bootstrapped, or exact.
  4. Check whether the data are on-policy or off-policy.
  5. Mask terminal states so that value is not propagated beyond the episode.

Implementation lens.

In a small tabular MDP, every object can be printed: transition matrices, reward vectors, value arrays, and greedy actions. In a deep RL system, the same objects exist but are represented by neural networks, replay buffers, sampled rollouts, and approximate losses. The names should not change just because the implementation becomes larger.

For LLM work, the state is usually a prompt plus generated prefix, the action is the next token or structured action, and the policy is an autoregressive distribution. The reward may arrive only after a full response, which means the return is sequence-level even though actions are token-level.

Useful checks:

  • Does the transition or sample contain the next state?
  • Is the update using V, Q, an advantage, or a reward-only signal?
  • Is the current policy also the data-collection policy?
  • Does discounting express time preference or just numerical convenience?
  • Is the reward model being optimized beyond its trustworthy region?

A common failure mode is to copy an update rule while silently changing the sampling assumption. For example, SARSA and Q-learning can look nearly identical in code, but one updates from the action actually taken and the other updates from a greedy next action. That distinction changes both the mathematics and the behavior.
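The distinction fits in two lines of target code. A sketch, assuming a tabular Q array and one observed transition; all numbers are hypothetical:

```python
import numpy as np

Q = np.zeros((3, 2))                  # hypothetical tabular Q-values
gamma, alpha = 0.9, 0.1
s, a, r, s_next = 0, 1, 1.0, 2        # one observed transition (s, a, r, s')

a_next = 0                            # the action actually taken next, from the behavior policy
sarsa_target = r + gamma * Q[s_next, a_next]      # SARSA: on-policy target
q_learning_target = r + gamma * Q[s_next].max()   # Q-learning: greedy target

Q[s, a] += alpha * (sarsa_target - Q[s, a])       # swap in q_learning_target to switch algorithms
```

Only the target line differs; the sampling assumption is the whole distinction.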

The notebook version of this idea keeps the environment small enough to inspect by hand. If a concept is unclear, compute it first in the chain MDP and only then move to neural approximators.
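One way such a chain MDP might look in code; the size, rewards, and discount below are illustrative. Every object is a printable array, which is exactly the property the paragraph above asks for:

```python
import numpy as np

n, gamma = 5, 0.9
P = np.zeros((n, 2, n))      # P[s, a, s']: action 0 steps left, action 1 steps right
r = np.zeros((n, 2))         # r[s, a]: expected reward
for s in range(n - 1):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, s + 1] = 1.0
r[n - 2, 1] = 1.0            # reaching the right end pays 1
P[n - 1, :, n - 1] = 1.0     # terminal state self-loops with zero reward (masked)

V = np.zeros(n)
for _ in range(100):         # value iteration, inspectable at every step
    Q = r + gamma * (P @ V)  # Q[s, a] = r(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    V = Q.max(axis=1)

print("values:", V.round(3))           # rises geometrically toward the rewarded end
print("greedy actions:", Q.argmax(axis=1))
```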

1.2 Rewards, returns, and delayed credit

Purpose. Rewards, returns, and delayed credit focuses on why scalar feedback is hard to assign.

G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\qquad 0\le \gamma<1.

Operational definition.

The reward signal is local, but the objective is cumulative. The central difficulty is assigning a future return back to earlier actions.

Worked reading.

The discounted return is G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots.
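A backward pass computes every G_t from the recursion G_t = R_{t+1} + γ G_{t+1}. A sketch with made-up rewards, where a signal arriving only at the end is credited back, discounted, to earlier steps:

```python
gamma = 0.9
rewards = [0.0, 0.0, 0.0, 1.0]        # hypothetical: reward arrives only at the end

returns, G = [], 0.0
for r in reversed(rewards):           # G_t = R_{t+1} + gamma * G_{t+1}
    G = r + gamma * G
    returns.append(G)
returns.reverse()                     # returns[t] is G_t
print(returns)                        # [0.729, 0.81, 0.9, 1.0]
```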

Examples:

  1. sparse game win/loss rewards.
  2. human preference scores for full responses.
  3. robot task completion bonuses.

Non-examples:

  1. per-token supervised labels.
  2. a deterministic lookup table with no delayed consequence.

1.3 Exploration versus exploitation

Purpose. Exploration versus exploitation focuses on why the learner must choose data.

V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s],\qquad Q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a].

Operational definition.

The agent must balance actions that look good now with actions that reveal useful information. This makes data collection part of the optimization problem.

Worked reading.

An ε-greedy policy chooses a greedy action with probability 1 − ε and explores otherwise.
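A sketch of the rule; the Q-values and ε below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Greedy with probability 1 - epsilon, uniform random otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

q = np.array([0.1, 0.5, 0.2])
counts = np.bincount([epsilon_greedy(q, 0.1) for _ in range(1000)], minlength=3)
print(counts / 1000)    # mass concentrates on action 1, with residual exploration
```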

Examples:

  1. optimistic initialization.
  2. Boltzmann exploration.
  3. UCB-style uncertainty bonuses.

Non-examples:

  1. evaluating one fixed logged policy only.
  2. training on a static supervised corpus.

1.4 RL versus supervised learning

Purpose. RL versus supervised learning focuses on why labels are not enough.

V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).

Operational definition.

Supervised learning assumes labeled targets. RL instead observes consequences from an interactive process and must handle shifting state-action distributions.

Worked reading.

A policy update changes d^\pi(s), the state-visitation distribution, so future data are policy-dependent.
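Both claims can be checked exactly in a tiny MDP. Under a fixed policy the Bellman equation above is linear, so V^π solves (I − γP^π)V = r^π, and the discounted visitation distribution has a similar closed form. A sketch with hypothetical two-state numbers:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],          # P^pi[s, s']: kernel after averaging over pi(a|s)
                 [0.3, 0.7]])
r_pi = np.array([1.0, 0.0])           # expected one-step reward under pi

V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # exact policy evaluation
print("V^pi:", V)

mu = np.array([1.0, 0.0])             # start-state distribution
d = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu)
print("d^pi:", d)                     # changes whenever the policy changes P^pi
```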

Examples:

  1. offline RL from logged data.
  2. online policy improvement.
  3. preference-tuned language models.

Non-examples:

  1. i.i.d. image classification.
  2. least-squares regression with fixed design.

1.5 Where RL appears in LLM systems

Purpose. Where RL appears in LLM systems focuses on why preference optimization uses this math.

Q^*(s,a)=\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\right).

Operational definition.

RL enters LLM systems through preference learning, reward modeling, KL-regularized policy updates, and evaluation policies that adapt from feedback.

Worked reading.

RLHF typically optimizes the policy against a reward model while penalizing divergence from a reference model with D_{\mathrm{KL}}(\pi_\theta\Vert\pi_{\mathrm{ref}}).
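A per-token sketch of that reward shaping, as it often appears in PPO-style RLHF pipelines; the log-probabilities, β, and final task score below are invented for illustration:

```python
import numpy as np

logp_theta = np.array([-1.2, -0.7, -2.1])   # policy log-probs of the sampled tokens
logp_ref   = np.array([-1.0, -0.9, -1.5])   # reference-model log-probs of the same tokens
task_score, beta = 1.0, 0.1                 # reward-model score, KL coefficient

kl_terms = logp_theta - logp_ref            # per-token log-ratio (sample-based KL estimate)
rewards = -beta * kl_terms                  # token-level divergence penalty
rewards[-1] += task_score                   # sequence-level score lands on the last token
print(rewards)
```

This is one common arrangement, not the only one; the sequence-level score on the final token is what makes the return sequence-level even though actions are tokens.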

Examples:

  1. PPO-style RLHF.
  2. DPO-style preference optimization.
  3. bandit feedback for ranking.

Non-examples:

  1. plain next-token pretraining.
  2. static instruction tuning without preference feedback.

2. Formal MDP Setup

Formal MDP Setup is part of the core mathematical path from Markov chains to modern AI agents. The emphasis is on the object definitions and update equations a learner must be able to inspect in code.

2.1 States, actions, rewards, and transitions

Purpose. States, actions, rewards, and transitions focuses on the tuple (\mathcal{S},\mathcal{A},P,r,\gamma).

G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1},\qquad 0\le \gamma<1.

Operational definition.

An MDP is the formal object that makes sequential decision-making mathematically precise.

Worked reading.

The Markov property says the current state contains the predictive information needed for the next transition and reward.
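The tuple can be carried around as a single object. A minimal container sketch; the field names and array shapes are one possible convention, not a fixed API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    P: np.ndarray    # (|S|, |A|, |S|); each P[s, a] is a distribution over next states
    r: np.ndarray    # (|S|, |A|); expected reward
    gamma: float

mdp = TabularMDP(P=np.full((2, 1, 2), 0.5), r=np.zeros((2, 1)), gamma=0.9)
assert np.allclose(mdp.P.sum(axis=-1), 1.0)   # rows must be probability distributions
```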

Examples:

  1. finite gridworld.
  2. inventory control.
  3. dialogue state tracking.

Non-examples:

  1. raw observations that omit hidden state.
  2. a static labeled dataset with no actions.

2.2 The Markov property

Purpose. The Markov property focuses on why the present state summarizes the useful past.

V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s],\qquad Q^\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a].

Operational definition.

A state is Markov when it summarizes the useful past: conditioned on the current state and action, the history adds no further predictive information about the next transition and reward.

Worked reading.

Formally, \Pr(S_{t+1}=s',R_{t+1}=r\mid S_0,A_0,\dots,S_t,A_t)=\Pr(S_{t+1}=s',R_{t+1}=r\mid S_t,A_t) for every history.
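An informal empirical probe, assuming a known two-state chain: estimate the next-state frequency conditioned on the current state alone and on the last two states; for a Markov chain the two estimates agree up to sampling noise. The numbers below are illustrative:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])            # hypothetical Markov chain

states = [0]
for _ in range(100_000):              # sample a long trajectory
    states.append(int(rng.choice(2, p=P[states[-1]])))

def freq_next_is_1(history_len):
    """Pr(S_{t+1}=1 | last history_len states), for histories ending in state 0."""
    counts, ones = Counter(), Counter()
    for t in range(history_len - 1, len(states) - 1):
        key = tuple(states[t - history_len + 1 : t + 1])
        if key[-1] != 0:
            continue
        counts[key] += 1
        ones[key] += states[t + 1]
    return {k: round(ones[k] / counts[k], 3) for k in counts}

print(freq_next_is_1(1))   # {(0,): ~0.1}
print(freq_next_is_1(2))   # {(0, 0): ~0.1, (1, 0): ~0.1} -- extra history adds nothing
```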

Examples:

  1. a gridworld whose cell coordinates fully determine the next-step dynamics.
  2. inventory control where the current stock level summarizes past demand.
  3. a dialogue state that carries everything the policy needs from the history.

Non-examples:

  1. raw observations that omit hidden state.
  2. a process whose next step depends on how long the current state has been occupied.

2.3 Episodic, continuing, finite- and infinite-horizon tasks

Purpose. Episodic, continuing, finite- and infinite-horizon tasks focuses on how time changes the objective.

V^\pi(s)=\sum_a\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma V^\pi(s')\right).

Operational definition.

Episodic tasks terminate after a finite number of steps, so the return is a finite sum; continuing tasks run indefinitely and require \gamma<1 for the discounted return to converge.

Worked reading.

With a finite horizon T, the return truncates to G_t=\sum_{k=0}^{T-t-1}\gamma^k R_{t+k+1}; as T grows, discounting keeps the return bounded by r_{\max}/(1-\gamma).
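A quick numerical check of that bound, with illustrative values:

```python
gamma, r_max, T = 0.9, 1.0, 50

G_T = sum(gamma**k * r_max for k in range(T))   # finite-horizon worst-case return
bound = r_max / (1 - gamma)                     # infinite-horizon geometric limit
print(G_T, bound)                               # 9.948... <= 10.0
```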

Examples:

  1. small tabular MDPs.
  2. neural value functions.
  3. preference-optimized language policies.

Non-examples:

  1. static regression.
  2. uncontrolled simulation traces.

2.4 Transition kernels and reward functions

Purpose. Transition kernels and reward functions focuses on how stochastic environments are represented.

Q^*(s,a)=\sum_{s'}P(s'\mid s,a)\left(r(s,a,s')+\gamma\max_{a'}Q^*(s',a')\right).

Operational definition.

The transition kernel P(s'\mid s,a) assigns a probability distribution over next states to every state-action pair, and the reward function r(s,a,s') attaches a scalar to each transition; together they specify how a stochastic environment responds to actions.

Worked reading.

Each row P(\cdot\mid s,a) is nonnegative and sums to one, so a tabular kernel is an |S|\times|A|\times|S| array whose slices are stochastic matrices.
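A sketch of that representation with randomly generated, hence hypothetical, dynamics, including the row-stochasticity check and a sampling helper:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # (|S|, |A|, |S|); each P[s, a] sums to 1
r = rng.normal(size=(3, 2, 3))               # r(s, a, s') as a dense array

assert np.allclose(P.sum(axis=-1), 1.0)      # every slice is a stochastic row

def sample_step(s, a):
    """Draw s' ~ P(.|s, a) and return it with r(s, a, s')."""
    s_next = int(rng.choice(3, p=P[s, a]))
    return s_next, float(r[s, a, s_next])

print(sample_step(0, 1))
```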

Examples:

  1. a gridworld move with slip probability.
  2. a simulator whose step function samples the next state.
  3. a learned reward model scoring full responses.

Non-examples:

  1. a policy \pi(a\mid s), which chooses actions rather than describing the environment.
  2. a value function, which summarizes returns rather than dynamics.

2.5 Partial observability and belief states

Purpose. Partial observability and belief states focuses on what breaks when observations are not states.

\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,A^{\pi_\theta}(S_t,A_t)\right].

Operational definition.

In a partially observable MDP the agent receives observations O_t that need not determine the underlying state. The belief state, the posterior distribution over hidden states given the history, restores the Markov property at the cost of a continuous state space.

Worked reading.

When observations omit hidden state, updates that condition only on O_t violate the Markov assumption and can bias value estimates; the belief update b'(s')\propto O(o\mid s')\sum_s P(s'\mid s,a)\,b(s) repairs this.
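A sketch of that belief update for a hypothetical two-state, one-action POMDP; the kernels P and O below are invented:

```python
import numpy as np

P = np.array([[[0.7, 0.3],    # P[a, s, s']: transition kernel for each action
               [0.2, 0.8]]])
O = np.array([[0.9, 0.1],     # O[s', o]: observation likelihood in the next state
              [0.2, 0.8]])

def belief_update(b, a, o):
    """b'(s') proportional to O(o|s') * sum_s P(s'|s,a) b(s)."""
    predicted = b @ P[a]              # predictive distribution over s'
    unnorm = O[:, o] * predicted      # weight by how likely the observation is
    return unnorm / unnorm.sum()

b = np.array([0.5, 0.5])              # uniform prior over the hidden state
print(belief_update(b, a=0, o=1))     # posterior shifts toward the state that explains o
```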

Examples:

  1. a robot localizing from noisy sensor readings.
  2. a card game with hidden opponent hands.
  3. dialogue in which the user's intent is latent.

Non-examples:

  1. a fully observed gridworld.
  2. an MDP whose observation equals the state.
