"Preferences teach the model which good-looking answers people actually choose."
Overview
Preference optimization learns from comparisons, either by fitting a reward model and optimizing a KL-regularized policy or by directly optimizing policy log-ratios.
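For orientation, here is one standard way to write the two routes; the symbols ($\pi_\theta$ for the policy, $\pi_{\mathrm{ref}}$ for the frozen reference, $r_\phi$ for the reward model, $\beta$ for the KL strength, $y_w$/$y_l$ for the preferred and rejected responses) are conventional placeholders, not yet the repo notation introduced later. The RLHF route fits $r_\phi$ on comparisons and then solves

$$\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]\;-\;\beta\,\mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big),$$

while DPO optimizes the policy log-ratios directly on preference pairs $(x, y_w, y_l)$:

$$\mathcal{L}_{\mathrm{DPO}}(\theta)\;=\;-\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}\;-\;\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right].$$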
Alignment and safety sit between model training and model deployment. The chapter studies how desired behavior is specified, learned, stress-tested, constrained, and improved through feedback.
This section is written in Markdown with LaTeX mathematics. Inline mathematics uses $...$, and display equations use $$...$$. The notes use math to expose the operational contract: which behavior is optimized, which behavior is forbidden, and which evidence proves the system changed.
Prerequisites
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations for preference optimization, RLHF, and DPO |
| exercises.ipynb | Graded practice for preference optimization, RLHF, and DPO |
Learning Objectives
After completing this section, you will be able to:
- Define the core mathematical objects in preference optimization, RLHF, and DPO
- Write alignment losses and decision rules in repo notation (a minimal loss sketch follows this list)
- Separate data, objective, policy, and evaluation responsibilities
- Explain how safety failures become training or guardrail updates
- Identify reward hacking, over-refusal, proxy misspecification, and drift
- Design synthetic experiments that demonstrate alignment tradeoffs
- Connect human feedback to SFT, reward modeling, DPO, or runtime policies
- State the boundary with Chapter 17 (evaluation) and Chapter 19 (production MLOps)
- Read modern alignment papers with attention to assumptions and proxies
- Build exercises that test both formulas and system-level judgment
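As a first contact with these objects, the following minimal sketch computes the DPO loss on made-up sequence log-probabilities. The tensor values and the choice of `beta = 0.1` are illustrative assumptions, not repo defaults.

```python
import torch
import torch.nn.functional as F

# Illustrative (made-up) per-example sequence log-probabilities: in practice each
# entry is the sum of token log-probs of a response under the corresponding model.
policy_logp_chosen   = torch.tensor([-12.3, -20.1, -15.7])  # log pi_theta(y_w | x)
policy_logp_rejected = torch.tensor([-14.0, -19.5, -18.2])  # log pi_theta(y_l | x)
ref_logp_chosen      = torch.tensor([-13.0, -20.0, -16.0])  # log pi_ref(y_w | x)
ref_logp_rejected    = torch.tensor([-13.5, -19.8, -17.9])  # log pi_ref(y_l | x)

beta = 0.1  # illustrative KL-strength value, not a recommended setting


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """DPO loss: -log sigmoid(beta * (policy log-ratio margin - reference log-ratio margin))."""
    policy_margin = policy_chosen - policy_rejected  # how much the policy prefers y_w over y_l
    ref_margin = ref_chosen - ref_rejected           # the same margin under the frozen reference
    logits = beta * (policy_margin - ref_margin)
    return -F.logsigmoid(logits).mean()


print(dpo_loss(policy_logp_chosen, policy_logp_rejected,
               ref_logp_chosen, ref_logp_rejected, beta))
```

A lower loss corresponds to the policy widening its chosen-versus-rejected margin relative to the reference, which is exactly the log-ratio objective written in the overview above.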
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run `theory.ipynb` when you want to check the formulas numerically.
- Use `exercises.ipynb` after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.