"A safety case improves when failures are searched for deliberately."
Overview
Red teaming treats the discovery of harmful behavior as an adaptive search problem over prompts, contexts, policies, and scoring rules.
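To make the search framing concrete, here is a minimal sketch in Python. Everything in it is a hypothetical toy stand-in and not the chapter's own code: `toy_policy` is a keyword guardrail, `harm_score` is the scoring rule, `MUTATIONS` are the allowed prompt edits, and `red_team_search` is a plain breadth-first search that stands in for whatever adaptive proposal strategy a real red team would use.

```python
from collections import deque

# Hypothetical toy stand-ins for illustration; none of these names come from the chapter.
BLOCKLIST = {"attack"}

def toy_policy(prompt: str) -> str:
    """Guarded policy: refuse if any whitespace-separated token is blocklisted."""
    tokens = prompt.lower().split()
    return "REFUSE" if any(t in BLOCKLIST for t in tokens) else "COMPLY"

def harm_score(prompt: str, response: str) -> float:
    """Scoring rule: 1.0 when the policy complies although the intent is still harmful."""
    harmful_intent = "attack" in prompt.lower().replace("-", "")
    return 1.0 if harmful_intent and response == "COMPLY" else 0.0

# Surface-level prompt edits the search is allowed to try.
MUTATIONS = [
    lambda p: p.upper(),
    lambda p: p.replace("attack", "att-ack"),
    lambda p: "please " + p,
]

def red_team_search(seed_prompt: str, budget: int = 20) -> list:
    """Breadth-first search over mutated prompts; returns the prompts scored as failures."""
    queue = deque([seed_prompt])
    seen = {seed_prompt}
    failures = []
    evaluations = 0
    while queue:
        prompt = queue.popleft()
        for mutate in MUTATIONS:
            candidate = mutate(prompt)
            if candidate in seen:
                continue
            if evaluations >= budget:
                return failures  # stop once the evaluation budget is spent
            seen.add(candidate)
            evaluations += 1
            if harm_score(candidate, toy_policy(candidate)) > 0:
                failures.append(candidate)  # record a discovered failure
            else:
                queue.append(candidate)  # keep exploring from prompts that did not fail
    return failures

print(red_team_search("plan an attack", budget=10))
```

In this toy setup the guardrail refuses the seed prompt, but the hyphenated variant slips past the token match while the scoring rule still counts it as harmful. The gap between what the policy checks and what the scorer measures is the kind of failure the search is meant to surface.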
Alignment and safety sit between model training and model deployment. The chapter studies how desired behavior is specified, learned, stress-tested, constrained, and improved through feedback.
This section is written in LaTeX Markdown. Inline mathematics uses `$...$`, and display equations use `$$...$$`. The notes use math to expose the operational contract: which behavior is optimized, which behavior is forbidden, and which evidence proves the system changed.
Prerequisites
- Robustness and Distribution Shift
- Capability Benchmarks
- Policy and Guardrails
- Contamination and Dedup Audits
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations for red teaming and safety evaluations |
| exercises.ipynb | Graded practice for red teaming and safety evaluations |
Learning Objectives
After completing this section, you will be able to:
- Define the core mathematical objects in red teaming and safety evaluations
- Write alignment losses and decision rules in repo notation
- Separate data, objective, policy, and evaluation responsibilities
- Explain how safety failures become training or guardrail updates
- Identify reward hacking, over-refusal, proxy misspecification, and drift
- Design synthetic experiments that demonstrate alignment tradeoffs
- Connect human feedback to SFT, reward modeling, DPO, or runtime policies
- State the boundary with Chapter 17 evaluation and Chapter 19 production MLOps
- Read modern alignment papers with attention to assumptions and proxies
- Build exercises that test both formulas and system-level judgment
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run theory.ipynb when you want to check the formulas numerically.
- Use exercises.ipynb after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.