"A policy is a behavioral specification; a guardrail is its runtime test."
Overview
Policies define desired and disallowed behavior, while guardrails translate those rules into classifiers, validators, gates, and intervention actions.
Alignment and safety sit between model training and model deployment. The chapter studies how desired behavior is specified, learned, stress-tested, constrained, and improved through feedback.
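To make the policy-to-guardrail translation concrete, here is a minimal sketch, assuming a policy can be reduced to two risk thresholds and a scoring function. The names `Policy`, `Action`, `toy_classifier`, and `guardrail` are hypothetical illustrations, not part of the companion notebooks, and the denylist scorer is only a stand-in for a real classifier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    """Intervention actions a guardrail can take (illustrative set)."""
    ALLOW = "allow"
    REVIEW = "review"
    BLOCK = "block"


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: a rule name plus two risk thresholds."""
    name: str
    review_threshold: float  # scores at or above this are escalated
    block_threshold: float   # scores at or above this are refused


def toy_classifier(text: str) -> float:
    """Stand-in risk scorer: fraction of words found on a toy denylist."""
    denylist = {"exploit", "bypass"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in denylist for w in words) / len(words)


def guardrail(policy: Policy, score_fn: Callable[[str], float], text: str) -> Action:
    """Gate: map a classifier score to an intervention action."""
    score = score_fn(text)
    if score >= policy.block_threshold:
        return Action.BLOCK
    if score >= policy.review_threshold:
        return Action.REVIEW
    return Action.ALLOW


policy = Policy(name="toy-misuse", review_threshold=0.2, block_threshold=0.5)
for prompt in ("how do I bake bread", "bypass the login check", "bypass exploit"):
    print(prompt, "->", guardrail(policy, toy_classifier, prompt).value)
```

The policy object carries only the specification (what counts as too risky), while the guardrail function performs the runtime test. Keeping them separate means the thresholds can change without touching the gate logic.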
This section is written in LaTeX Markdown. Inline mathematics uses `$...$`, and display equations use `$$...$$`. The notes use math to expose the operational contract: which behavior is optimized, which behavior is forbidden, and which evidence proves the system changed.
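As one illustration of such a contract, consider a threshold guardrail. The symbols here are assumptions chosen for this sketch and are not necessarily the repo notation: $s(x) \in [0, 1]$ is a classifier's risk score for input $x$, $\tau$ is a threshold fixed by the policy, and $y \in \{0, 1\}$ marks whether $x$ truly violates the policy.

$$
g_\tau(x) = \mathbb{1}\left[ s(x) \ge \tau \right],
\qquad
\mathrm{FPR}(\tau) = \Pr\left( s(x) \ge \tau \mid y = 0 \right),
\qquad
\mathrm{FNR}(\tau) = \Pr\left( s(x) < \tau \mid y = 1 \right)
$$

Under these assumptions, $\mathrm{FPR}(\tau)$ measures over-refusal (benign inputs that are blocked) and $\mathrm{FNR}(\tau)$ measures missed violations. Raising $\tau$ can only lower or hold the first and raise or hold the second, so the choice of threshold is itself a policy decision.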
Prerequisites
- Red Teaming and Safety Evaluations
- Calibration and Uncertainty
- Hypothesis Testing
- Language Model Probability
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations for policy and guardrails |
| exercises.ipynb | Graded practice for policy and guardrails |
Learning Objectives
After completing this section, you will be able to:
- Define the core mathematical objects in policy and guardrails
- Write alignment losses and decision rules in repo notation
- Separate data, objective, policy, and evaluation responsibilities
- Explain how safety failures become training or guardrail updates
- Identify reward hacking, over-refusal, proxy misspecification, and drift
- Design synthetic experiments that demonstrate alignment tradeoffs
- Connect human feedback to SFT, reward modeling, DPO, or runtime policies
- State the boundary with Chapter 17 evaluation and Chapter 19 production MLOps
- Read modern alignment papers with attention to assumptions and proxies
- Build exercises that test both formulas and system-level judgment
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run theory.ipynb when you want to check the formulas numerically.
- Use exercises.ipynb after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.