"An online experiment is a randomized proof attempt about the real world."
Overview
Online experiments connect offline model evidence to causal user and system impact through randomized comparison, statistical inference, and trust checks.
The chapter treats evaluation as a mathematical object: a controlled protocol that maps model behavior into evidence with uncertainty. A result is not just a number; it is an estimate produced by a task distribution, a scoring rule, and an aggregation rule.
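As a minimal sketch of that idea, the snippet below treats a result as an estimate with uncertainty rather than a bare number. Everything in it is an illustrative assumption, not part of the chapter's notebooks: the helper name `diff_in_means_ci`, the synthetic Gaussian scores, the coin-flip assignment, and the normal-approximation 95% interval. The simulated users stand in for the task distribution, the per-user score for the scoring rule, and the difference in arm means for the aggregation rule.

```python
import math
import random
import statistics


def diff_in_means_ci(control, treatment, z=1.96):
    """Return (delta, (low, high)) for mean(treatment) - mean(control).

    Uses a normal approximation with unpooled sample variances;
    z=1.96 corresponds to an approximate 95% interval.
    """
    delta = statistics.fmean(treatment) - statistics.fmean(control)
    se = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(treatment) / len(treatment)
    )
    return delta, (delta - z * se, delta + z * se)


# Hypothetical synthetic experiment: each user is randomized to an arm
# by a fair coin flip, then a per-user score is drawn for that arm.
rng = random.Random(0)
control, treatment = [], []
for _ in range(10_000):
    if rng.random() < 0.5:
        treatment.append(rng.gauss(0.52, 0.10))  # assumed true mean 0.52
    else:
        control.append(rng.gauss(0.50, 0.10))    # assumed true mean 0.50

delta, (low, high) = diff_in_means_ci(control, treatment)
print(f"estimated lift = {delta:.4f}, approx. 95% CI = [{low:.4f}, {high:.4f}]")
```

Changing the seed, the sample size, or the assumed effect shows how the interval width depends on the protocol and not only on the metric.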
This section is written in LaTeX Markdown. Inline mathematics uses $...$, and display equations use `$$...$$`. The emphasis is practical but rigorous: every metric should be linked to the statistical assumptions that make it meaningful.
Prerequisites
- Common Distributions
- Estimation Theory
- Hypothesis Testing
- Capability Benchmarks
- Error Analysis and Ablations
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations for online experimentation and A/B testing |
| exercises.ipynb | Graded practice for online experimentation and A/B testing |
Learning Objectives
After completing this section, you will be able to:
- Define the core objects used in online experimentation and A/B testing
- Write empirical estimators for model scores and risks
- Attach uncertainty statements to finite evaluation results
- Separate protocol design from metric computation
- Identify common leakage, sampling, and aggregation failures
- Use synthetic experiments to test statistical intuitions
- Connect offline metrics to downstream AI decisions
- Explain where this section ends and where adjacent chapters begin
- Design exercises and notebooks that make the math executable
- Read modern LLM evaluation papers with sharper statistical judgment
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run theory.ipynb when you want to check the formulas numerically.
- Use exercises.ipynb after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.