"Preferences teach the model which good-looking answers people actually choose."
Overview
Preference optimization learns from comparisons, either by fitting a reward model and optimizing a KL-regularized policy or by directly optimizing policy log-ratios.
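For orientation, here is one standard way to write the two routes; the symbols ($\pi_\theta$ for the policy, $\pi_{\mathrm{ref}}$ for the frozen reference, $r_\phi$ for the reward model, $\beta$ for the KL strength, $y_w$/$y_l$ for the preferred and rejected responses) are conventional placeholders, not yet the repo notation introduced later. The RLHF route fits $r_\phi$ on comparisons and then solves

$$\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]\;-\;\beta\,\mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big),$$

while DPO optimizes the policy log-ratios directly on preference pairs $(x, y_w, y_l)$:

$$\mathcal{L}_{\mathrm{DPO}}(\theta)\;=\;-\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}\;-\;\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right].$$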
Alignment and safety sit between model training and model deployment. The chapter studies how desired behavior is specified, learned, stress-tested, constrained, and improved through feedback.
This section is written in Markdown with LaTeX mathematics. Inline mathematics uses $...$, and display equations use $$...$$. The notes use math to expose the operational contract: which behavior is optimized, which behavior is forbidden, and which evidence proves the system changed.
Prerequisites
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations for preference optimization, RLHF, and DPO |
| exercises.ipynb | Graded practice for preference optimization, RLHF, and DPO |
Learning Objectives
After completing this section, you will be able to:
- Define the core mathematical objects in preference optimization, RLHF, and DPO
- Write alignment losses and decision rules in repo notation (a minimal loss sketch follows this list)
- Separate data, objective, policy, and evaluation responsibilities
- Explain how safety failures become training or guardrail updates
- Identify reward hacking, over-refusal, proxy misspecification, and drift
- Design synthetic experiments that demonstrate alignment tradeoffs
- Connect human feedback to SFT, reward modeling, DPO, or runtime policies
- State the boundary with Chapter 17 (evaluation) and Chapter 19 (production MLOps)
- Read modern alignment papers with attention to assumptions and proxies
- Build exercises that test both formulas and system-level judgment
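As a first contact with these objects, the following minimal sketch computes the DPO loss on made-up sequence log-probabilities. The tensor values and the choice of `beta = 0.1` are illustrative assumptions, not repo defaults.

```python
import torch
import torch.nn.functional as F

# Illustrative (made-up) per-example sequence log-probabilities: in practice each
# entry is the sum of token log-probs of a response under the corresponding model.
policy_logp_chosen   = torch.tensor([-12.3, -20.1, -15.7])  # log pi_theta(y_w | x)
policy_logp_rejected = torch.tensor([-14.0, -19.5, -18.2])  # log pi_theta(y_l | x)
ref_logp_chosen      = torch.tensor([-13.0, -20.0, -16.0])  # log pi_ref(y_w | x)
ref_logp_rejected    = torch.tensor([-13.5, -19.8, -17.9])  # log pi_ref(y_l | x)

beta = 0.1  # illustrative KL-strength value, not a recommended setting


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """DPO loss: -log sigmoid(beta * (policy log-ratio margin - reference log-ratio margin))."""
    policy_margin = policy_chosen - policy_rejected  # how much the policy prefers y_w over y_l
    ref_margin = ref_chosen - ref_rejected           # the same margin under the frozen reference
    logits = beta * (policy_margin - ref_margin)
    return -F.logsigmoid(logits).mean()


print(dpo_loss(policy_logp_chosen, policy_logp_rejected,
               ref_logp_chosen, ref_logp_rejected, beta))
```

A lower loss corresponds to the policy widening its chosen-versus-rejected margin relative to the reference, which is exactly the log-ratio objective written in the overview above.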
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run `theory.ipynb` when you want to check the formulas numerically.
- Use `exercises.ipynb` after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.