Scaling laws are empirical formulas for forecasting how language-model loss changes as parameters, data, and compute change. They are not a replacement for experiments; they are a way to make experiments smaller, cheaper, and more informative.
Overview
The core form is a power law with a floor:

$$L(x) = L_\infty + A\,x^{-\alpha}$$

Here $L$ is usually held-out cross-entropy, $x$ is a resource such as parameters, tokens, or compute, $L_\infty$ is the irreducible floor for the setup, and $\alpha$ controls how fast excess loss falls. In LLM planning, the most important resources are parameter count $N$, training tokens $D$, and training compute $C$. For dense decoder-only transformers, a useful first estimate is:

$$C \approx 6ND$$

The practical question is: for a fixed budget $C$, what combination of $N$ and $D$ gives the lowest expected loss, and how uncertain is that forecast?
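A minimal numeric sketch of the power-law-with-floor form and the $C \approx 6ND$ compute estimate. The coefficient values ($L_\infty = 1.7$, $A = 10$, $\alpha = 0.05$) are illustrative placeholders, not fitted values:

```python
# Power law with a floor: L(x) = L_inf + A * x**(-alpha).
# Coefficients below are illustrative placeholders, not fitted values.
def loss(x, L_inf=1.7, A=10.0, alpha=0.05):
    return L_inf + A * x ** (-alpha)

def train_flops(n_params, n_tokens):
    # Dense-transformer first estimate: C ~= 6 * N * D.
    return 6.0 * n_params * n_tokens

C = train_flops(7e9, 2e12)        # 7B parameters trained on 2T tokens
print(f"C ~= {C:.2e} FLOPs")      # ~8.40e+22

# A 10x compute increase shrinks the *excess* loss by 10**(-alpha).
factor = (loss(10 * C) - 1.7) / (loss(C) - 1.7)
print(f"excess-loss factor per 10x compute: {factor:.3f}")
```

With $\alpha = 0.05$ the factor is $10^{-0.05} \approx 0.89$: each tenfold increase in compute removes only about 11% of the remaining excess loss.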
Prerequisites
- Cross-entropy, perplexity, and held-out loss
- Logarithms and log-log plots
- Basic curve fitting and residual analysis
- Training FLOPs and token accounting
- The previous sections on training at scale and fine-tuning
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Fits toy power laws, builds IsoFLOP curves, solves compute-optimal allocation by grid search, studies data quality, serving cost, and residual diagnostics. |
| exercises.ipynb | Ten practice problems for power-law fitting, compute budgets, Chinchilla-style sizing, undertraining checks, and forecast uncertainty. |
Learning Objectives
After this section, you should be able to:
- Explain the difference between parameter, data, and compute scaling.
- Fit a power law when the irreducible loss floor is known or hypothesized.
- Estimate training compute with $C \approx 6ND$.
- Draw and interpret IsoFLOP tradeoffs.
- Solve a compute-optimal allocation problem for a two-term loss proxy.
- Explain the Kaplan and Chinchilla lessons without cargo-culting ratios.
- Reason about data quality, repeated data, and effective token counts.
- Include inference cost and latency in model-size decisions.
- Diagnose residuals and uncertainty in scaling-law forecasts.
Table of Contents
- What Scaling Laws Measure
- 1.1 Loss as the target
- 1.2 Resources
- 1.3 Power-law form
- 1.4 Forecasting discipline
- 1.5 Scope
- Parameter Scaling
- 2.1 Model-size law
- 2.2 Capacity limit
- 2.3 Over-large model
- 2.4 Architecture dependence
- 2.5 Fine-tuning implication
- Data Scaling
- 3.1 Token-count law
- 3.2 Data quality
- 3.3 Repeated data
- 3.4 Domain match
- 3.5 Tokenizer dependence
- Compute Scaling
- 4.1 Dense training FLOPs
- 4.2 Compute law
- 4.3 IsoFLOP curves
- 4.4 Compute is not wall-clock
- 4.5 Budget translation
- Compute-Optimal Allocation
- 5.1 Two-term loss proxy
- 5.2 Constraint
- 5.3 Balanced marginal returns
- 5.4 Chinchilla intuition
- 5.5 Practical rule
- Kaplan and Chinchilla Lessons
- Fitting Scaling Laws
- 7.1 Log transformation
- 7.2 Unknown floor
- 7.3 Holdout validation
- 7.4 Residual analysis
- 7.5 Extrapolation risk
- Data Quality and Mixtures
- 8.1 Mixture weights
- 8.2 Effective tokens
- 8.3 Contamination
- 8.4 Curriculum and filtering
- 8.5 Evaluation distribution
- Inference-Aware Scaling
- 9.1 Serving cost
- 9.2 Latency constraint
- 9.3 Small overtrained models
- 9.4 Distillation
- 9.5 Total cost objective
- Limits and Debugging
- 10.1 Not a theorem of all AI
- 10.2 Metric artifacts
- 10.3 Architecture changes
- 10.4 Bad small runs
- 10.5 Decision checklist
Forecasting Workflow
small runs -> held-out loss table -> fit simple law -> holdout forecast -> residual check -> budget decision -> larger validation run
At each step, write down the exact setup: architecture, tokenizer, data mixture, optimizer, schedule, batch size, precision, evaluation set, and filtering. A scaling law fitted under one setup is not guaranteed to transfer to another setup.
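The workflow above can be rehearsed end to end on synthetic data before spending real compute. The sketch below assumes the floor $L_\infty$ is hypothesized correctly; all coefficients and compute values are invented for illustration:

```python
import numpy as np

# Synthetic "small runs": loss = L_inf + A * C**(-alpha), with noise.
# All coefficients and compute values here are made up for illustration.
rng = np.random.default_rng(0)
L_inf, A_true, alpha_true = 1.7, 12.0, 0.06

def run(C):  # stand-in for training a model and measuring held-out loss
    noise = np.exp(rng.normal(0.0, 0.01, size=C.shape))
    return L_inf + A_true * C ** (-alpha_true) * noise

C_small = np.logspace(18, 20, 8)     # fit set: cheap runs
C_hold = np.logspace(20.5, 21.5, 3)  # holdout: larger runs, used only to validate
L_small, L_hold = run(C_small), run(C_hold)

# With the floor hypothesized, log(L - L_inf) = log A - alpha * log C is linear,
# so an ordinary least-squares line fit recovers the exponent.
slope, intercept = np.polyfit(np.log(C_small), np.log(L_small - L_inf), 1)
alpha_fit, A_fit = -slope, float(np.exp(intercept))

pred = L_inf + A_fit * C_hold ** (-alpha_fit)
resid = L_hold - pred
print(f"alpha ~= {alpha_fit:.3f}, max holdout residual {np.abs(resid).max():.4f}")
```

The holdout residuals are the point of the exercise: a fit that only interpolates the small runs tells you nothing about whether the law extrapolates.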
1. What Scaling Laws Measure
This part studies what scaling laws measure as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Question | Formula |
|---|---|---|
| Loss as the target | scaling laws usually model held-out cross-entropy | $L=\mathbb{E}[-\log p_\theta(\text{next token})]$ |
| Resources | parameters, tokens, and compute are the main axes | $N$, $D$, $C$ |
| Power-law form | loss often improves approximately linearly on log-log axes after subtracting an irreducible floor | $L(x)=L_\infty+A\,x^{-\alpha}$ |
| Forecasting discipline | fit small runs, predict larger runs, then validate residuals | |
| Scope | a scaling law is empirical over a data, architecture, optimizer, and tokenization regime | |
1.1 Loss as the target
Main idea. Scaling laws usually model held-out cross-entropy.
Core relation: $L = \mathbb{E}[-\log p_\theta(\text{next token})]$ on held-out text, averaged per token.
Scaling laws are not magic. They are compact empirical models of how loss changes when resources change. Their value comes from discipline: keep the setup stable, measure held-out loss, fit simple forms, reserve validation runs, and attach uncertainty to every forecast.
Worked micro-example. Suppose excess loss follows $L(C) - L_\infty = A C^{-\alpha}$. Increasing compute by a factor of 10 multiplies the excess loss by $10^{-\alpha}$; for $\alpha = 0.05$ that factor is about $0.89$. That is real improvement, but it is slow improvement. Power laws explain both why scale helps and why scale is expensive.
Implementation check. Plot $L - L_\infty$ against the resource on log-log axes, inspect residuals, and verify that the fitted law predicts runs that were not used for fitting. If the residuals curve systematically, the simple power law is missing a factor.
AI connection. This is a practical quantity for planning or checking a training run.
Common mistake. Do not treat a published coefficient as a universal constant. Exponents and offsets depend on model family, data, optimizer, tokenizer, and evaluation distribution.
1.2 Resources
Main idea. Parameters, tokens, and compute are the main axes.
Core relation: the main axes are parameters $N$, training tokens $D$, and training compute $C$.
1.3 Power-law form
Main idea. Loss often improves approximately linearly on log-log axes after subtracting an irreducible floor.
Core relation: $L(x) = L_\infty + A\,x^{-\alpha}$.
1.4 Forecasting discipline
Main idea. Fit small runs, predict larger runs, then validate residuals.
Core relation: fit on small runs, forecast larger runs, validate against held-out runs.
1.5 Scope
Main idea. A scaling law is empirical over a data, architecture, optimizer, and tokenization regime.
Core relation: fitted coefficients are valid only within the measured data, architecture, optimizer, and tokenization regime.
2. Parameter Scaling
This part studies parameter scaling as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Question | Formula |
|---|---|---|
| Model-size law | larger models reduce loss when data and compute are not limiting | $L(N)=L_\infty+A_N N^{-\alpha_N}$ |
| Capacity limit | too small a model cannot represent the distribution well | |
| Over-large model | a model can be too large for the available token budget | $D/N$ too small |
| Architecture dependence | the coefficients depend on architecture and training recipe | $A_N,\alpha_N$ are not universal constants |
| Fine-tuning implication | adapter rank can be viewed as an adaptation-capacity axis | $r$ controls update capacity |
2.1 Model-size law
Main idea. Larger models reduce loss when data and compute are not limiting.
Core relation: $L(N) = L_\infty + A_N N^{-\alpha_N}$.
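A hedged numeric illustration of the model-size law. The coefficients below are hypothetical, chosen only to make the arithmetic concrete, and should not be reused as real fitted values:

```python
# Illustrative model-size law: L(N) = L_inf + A_N * N**(-alpha_N).
# These coefficient values are placeholders, not fits to any real runs.
L_inf, A_N, alpha_N = 1.7, 12.0, 0.076

def model_size_loss(N):
    return L_inf + A_N * N ** (-alpha_N)

for N in (1e9, 7e9, 70e9):
    print(f"N={N:.0e}: predicted loss {model_size_loss(N):.3f}")

# Doubling N multiplies the excess loss by 2**(-alpha_N), about 0.949 here:
shrink = 2 ** (-alpha_N)
print(f"excess-loss factor per doubling of N: {shrink:.3f}")
```

The per-doubling factor makes the cost of parameter scaling explicit: with this exponent you need many doublings of $N$ to halve the excess loss, which is why data and compute must scale alongside model size.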
2.2 Capacity limit
Main idea. Too small a model cannot represent the distribution well.
Core relation: for small $N$, $L(N)$ plateaus well above $L_\infty$ regardless of data and compute.
2.3 Over-large model
Main idea. A model can be too large for the available token budget.
Core relation:
D/N$ too smallScaling laws are not magic. They are compact empirical models of how loss changes when resources change. Their value comes from discipline: keep the setup stable, measure held-out loss, fit simple forms, reserve validation runs, and attach uncertainty to every forecast.
Worked micro-example. Suppose loss follows . Increasing compute by a factor of 10 multiplies the excess loss by . That is real improvement, but it is slow improvement. Power laws explain both why scale helps and why scale is expensive.
Implementation check. Plot on log-log axes, inspect residuals, and verify that the fitted law predicts runs that were not used for fitting. If the residuals curve systematically, the simple power law is missing a factor.
AI connection. This is a practical quantity for planning or checking a training run.
Common mistake. Do not treat a published coefficient as a universal constant. Exponents and offsets depend on model family, data, optimizer, tokenizer, and evaluation distribution.
2.4 Architecture dependence
Main idea. The coefficients depend on architecture and training recipe.
Core relation:
A_N,\alpha_N$ are not universal constantsScaling laws are not magic. They are compact empirical models of how loss changes when resources change. Their value comes from discipline: keep the setup stable, measure held-out loss, fit simple forms, reserve validation runs, and attach uncertainty to every forecast.
Worked micro-example. Suppose loss follows . Increasing compute by a factor of 10 multiplies the excess loss by . That is real improvement, but it is slow improvement. Power laws explain both why scale helps and why scale is expensive.
Implementation check. Plot on log-log axes, inspect residuals, and verify that the fitted law predicts runs that were not used for fitting. If the residuals curve systematically, the simple power law is missing a factor.
AI connection. This is a practical quantity for planning or checking a training run.
Common mistake. Do not treat a published coefficient as a universal constant. Exponents and offsets depend on model family, data, optimizer, tokenizer, and evaluation distribution.
2.5 Fine-tuning implication
Main idea. Adapter rank can be viewed as an adaptation-capacity axis.
Core relation:
r$ controls update capacityScaling laws are not magic. They are compact empirical models of how loss changes when resources change. Their value comes from discipline: keep the setup stable, measure held-out loss, fit simple forms, reserve validation runs, and attach uncertainty to every forecast.
Worked micro-example. Suppose loss follows . Increasing compute by a factor of 10 multiplies the excess loss by . That is real improvement, but it is slow improvement. Power laws explain both why scale helps and why scale is expensive.
Implementation check. Plot on log-log axes, inspect residuals, and verify that the fitted law predicts runs that were not used for fitting. If the residuals curve systematically, the simple power law is missing a factor.
AI connection. This is a practical quantity for planning or checking a training run.
Common mistake. Do not treat a published coefficient as a universal constant. Exponents and offsets depend on model family, data, optimizer, tokenizer, and evaluation distribution.
3. Data Scaling
This part studies data scaling as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Question | Formula |
|---|---|---|
| Token-count law | more unique useful tokens reduce loss | $L(D)=L_\infty+A_D D^{-\alpha_D}$ |
| Data quality | one high-quality token can be worth more than one noisy token | |
| Repeated data | repetition gives diminishing returns and can increase memorization | $D_\mathrm{eff}<kD$ for repeats |
| Domain match | the validation distribution defines what improvement means | |
| Tokenizer dependence | token counts depend on the tokenizer | $D$ is measured in model tokens |
3.1 Token-count law
Main idea. More unique useful tokens reduce loss.
Core relation: $L(D) = L_\infty + A_D D^{-\alpha_D}$.
3.2 Data quality
Main idea. One high-quality token can be worth more than one noisy token.
Core relation: effective tokens $D_\mathrm{eff}$ depend on quality, not just raw count.
3.3 Repeated data
Main idea. Repetition gives diminishing returns and can increase memorization.
Core relation:
D_\mathrm{eff}<kD$ for repeatsScaling laws are not magic. They are compact empirical models of how loss changes when resources change. Their value comes from discipline: keep the setup stable, measure held-out loss, fit simple forms, reserve validation runs, and attach uncertainty to every forecast.
Worked micro-example. Suppose loss follows . Increasing compute by a factor of 10 multiplies the excess loss by . That is real improvement, but it is slow improvement. Power laws explain both why scale helps and why scale is expensive.
Implementation check. Plot on log-log axes, inspect residuals, and verify that the fitted law predicts runs that were not used for fitting. If the residuals curve systematically, the simple power law is missing a factor.
AI connection. This is a practical quantity for planning or checking a training run.
Common mistake. Do not treat a published coefficient as a universal constant. Exponents and offsets depend on model family, data, optimizer, tokenizer, and evaluation distribution.
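There is no single standard formula for how repetition discounts tokens; the sketch below uses a purely hypothetical geometric-decay toy model (the decay ratio `r` is invented) just to show the qualitative saturation:

```python
# Toy model (hypothetical, for intuition only): each additional epoch over
# the same D unique tokens contributes r times as much as the previous one.
def effective_tokens(D, epochs, r=0.6):
    # Geometric sum: D * (1 - r**epochs) / (1 - r)
    return D * sum(r ** i for i in range(epochs))

D = 1e12
for k in (1, 2, 4, 8):
    print(f"{k} epochs: {k * D:.1e} raw tokens -> "
          f"{effective_tokens(D, k):.2e} effective tokens")
```

Under this toy model the effective count saturates near $D/(1-r)$ no matter how many epochs you run, which is the qualitative behavior the section describes: raw token counts overstate what repeated data buys.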
3.4 Domain match
Main idea. The validation distribution defines what improvement means.
Core relation: loss is defined relative to the chosen validation distribution.
3.5 Tokenizer dependence
Main idea. Token counts depend on the tokenizer.
Core relation:
D$ is measured in model tokensScaling laws are not magic. They are compact empirical models of how loss changes when resources change. Their value comes from discipline: keep the setup stable, measure held-out loss, fit simple forms, reserve validation runs, and attach uncertainty to every forecast.
Worked micro-example. Suppose loss follows . Increasing compute by a factor of 10 multiplies the excess loss by . That is real improvement, but it is slow improvement. Power laws explain both why scale helps and why scale is expensive.
Implementation check. Plot on log-log axes, inspect residuals, and verify that the fitted law predicts runs that were not used for fitting. If the residuals curve systematically, the simple power law is missing a factor.
AI connection. This is a practical quantity for planning or checking a training run.
Common mistake. Do not treat a published coefficient as a universal constant. Exponents and offsets depend on model family, data, optimizer, tokenizer, and evaluation distribution.
4. Compute Scaling
This part studies compute scaling as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Question | Formula |
|---|---|---|
| Dense training FLOPs | a common first estimate for dense transformers is six times parameters times tokens | $C \approx 6ND$ |
| Compute law | loss can be modeled directly as a function of compute | $L(C)=L_\infty+A_C C^{-\alpha_C}$ |
| IsoFLOP curves | fixed compute creates a tradeoff between model size and token count | $D = C/(6N)$ at fixed $C$ |
| Compute is not wall-clock | hardware utilization and communication decide time | |
| Budget translation | money and time become constraints after mapping to achievable FLOPs | |
4.1 Dense training FLOPs
Main idea. A common first estimate for dense transformers is six times parameters times tokens.
Core relation: $C \approx 6ND$.
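Inverting the estimate turns a compute budget into a token budget per candidate model size; a minimal sketch (the budget and model sizes are arbitrary examples):

```python
def tokens_for_budget(C, N):
    # Invert C ~= 6 * N * D to get the token budget a model size allows.
    return C / (6.0 * N)

C = 1e23  # total training FLOPs available (example value)
for N in (1e9, 7e9, 70e9):
    D = tokens_for_budget(C, N)
    print(f"N={N:.0e}: D ~= {D:.2e} tokens (D/N ~= {D / N:.0f})")
```

Note how sharply the affordable tokens-per-parameter ratio falls as $N$ grows at fixed $C$: this is the tension that IsoFLOP analysis makes explicit.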
4.2 Compute law
Main idea. Loss can be modeled directly as a function of compute.
Core relation: $L(C) = L_\infty + A_C C^{-\alpha_C}$.
4.3 IsoFLOP curves
Main idea. Fixed compute creates a tradeoff between model size and token count.
Core relation: at fixed $C$, the constraint $D = C/(6N)$ traces the feasible $(N, D)$ pairs, and loss along this curve has a minimum at some intermediate $N$.
AI connection. These curves are how you ask whether the next dollar should buy a larger model or more training tokens.
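An IsoFLOP sweep under a two-term loss proxy can be done with a one-line grid search. The coefficients below are illustrative, loosely in the style of published two-term fits, and should not be reused as real values:

```python
import numpy as np

# Two-term loss proxy L(N, D) = E + A*N**-alpha + B*D**-beta.
# Coefficient values are illustrative placeholders, not fitted values.
E, A, alpha, B, beta = 1.7, 406.4, 0.34, 410.7, 0.28

def iso_flop_loss(N, C):
    D = C / (6.0 * N)  # fixed compute couples N and D
    return E + A * N ** -alpha + B * D ** -beta

C = 1e21
Ns = np.logspace(8, 11, 400)  # sweep model sizes at fixed compute
losses = iso_flop_loss(Ns, C)
N_best = Ns[np.argmin(losses)]
print(f"IsoFLOP minimum near N ~= {N_best:.2e}, D ~= {C / (6 * N_best):.2e}")
```

Plotting `losses` against `Ns` on a log axis gives one IsoFLOP curve; repeating at several values of $C$ and tracking the minima is exactly how compute-optimal sizing rules are extracted.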
4.4 Compute is not wall-clock
Main idea. Hardware utilization and communication decide time.
Core relation: wall-clock time $\approx C / (\text{devices} \times \text{peak FLOP/s} \times \text{utilization})$.
4.5 Budget translation
Main idea. Money and time become constraints after mapping to achievable FLOPs.
Core relation: budget $\to$ achievable FLOPs $\to$ feasible $(N, D)$ choices.
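The budget-to-FLOPs mapping is simple arithmetic once you know sustained utilization. All numbers in this sketch are hypothetical placeholders; substitute your cluster's real figures:

```python
# Map a hardware allocation to achievable training FLOPs. MFU (model FLOPs
# utilization) discounts peak throughput to what training actually sustains.
def achievable_flops(n_gpus, peak_flops_per_gpu, mfu, seconds):
    return n_gpus * peak_flops_per_gpu * mfu * seconds

week = 7 * 24 * 3600
C = achievable_flops(n_gpus=256, peak_flops_per_gpu=1e15, mfu=0.4, seconds=week)
print(f"~{C:.2e} FLOPs in one week")
```

With these placeholder numbers the week buys roughly $6.2 \times 10^{22}$ FLOPs; note that halving MFU halves the budget, which is why utilization work competes directly with buying more hardware.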
5. Compute-Optimal Allocation
This part studies compute-optimal allocation as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Question | Formula |
|---|---|---|
| Two-term loss proxy | separate parameter-limited and data-limited terms | $L(N,D)=E+AN^{-\alpha}+BD^{-\beta}$ |
| Constraint | training compute couples $N$ and $D$ | $C \approx 6ND$ |
| Balanced marginal returns | at optimum, extra compute should help similarly through $N$ and $D$ | $\partial L/\partial N$ balances $\partial L/\partial D$ under the constraint |
| Chinchilla intuition | many earlier large models were undertrained relative to their size | $D/N$ was too small |
| Practical rule | choose $N$ and $D$ together, not one after the other | |
5.1 Two-term loss proxy
Main idea. Separate parameter-limited and data-limited terms.
Core relation: $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$.
5.2 Constraint
Main idea. Training compute couples $N$ and $D$.
Core relation: $C \approx 6ND$.
5.3 Balanced marginal returns
Main idea. At optimum, extra compute should help similarly through $N$ and $D$.
Core relation:
\partial L/\partial N$ balances $\partial L/\partial D$ under the constraintScaling laws are not magic. They are compact empirical models of how loss changes when resources change. Their value comes from discipline: keep the setup stable, measure held-out loss, fit simple forms, reserve validation runs, and attach uncertainty to every forecast.
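The balance condition can be checked numerically with a grid search, as the companion theory.ipynb does. The proxy coefficients below (`l_inf`, `a_coef`, `alpha`, `b_coef`, `beta`) are illustrative assumptions for this sketch, not fitted values:

```python
def loss_proxy(n, d, l_inf=1.7, a_coef=400.0, alpha=0.34, b_coef=410.0, beta=0.28):
    """Two-term loss proxy L = L_inf + A/N^alpha + B/D^beta (illustrative coefficients)."""
    return l_inf + a_coef / n**alpha + b_coef / d**beta

def best_allocation(c_budget, grid_points=2000):
    """Sweep model size on a log grid; tokens follow from C ~ 6ND."""
    best = None
    for i in range(grid_points):
        n = 10 ** (7 + 5 * i / (grid_points - 1))   # 1e7 .. 1e12 parameters
        d = c_budget / (6.0 * n)                    # tokens implied by the budget
        cand = (loss_proxy(n, d), n, d)
        if best is None or cand[0] < best[0]:
            best = cand
    return best

loss_star, n_star, d_star = best_allocation(1e21)
```

At the minimizer, shifting compute between $N$ and $D$ changes the two excess-loss terms by similar amounts, which is the balance condition in words.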
5.4 Chinchilla intuition
Main idea. Many earlier large models were undertrained relative to their size.
Core relation: $D/N$ was too small in many earlier large-model runs.
5.5 Practical rule
Main idea. Choose $N$ and $D$ together, not one after the other.
Core relation: in Chinchilla-style fits, $N^\star$ and $D^\star$ both grow roughly as $C^{0.5}$, so the two should be planned jointly.
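Joint sizing under the $C^{0.5}$ prior reduces to a two-line formula. The default of 20 tokens per parameter below is a planning prior, not a theorem:

```python
def chinchilla_plan(c_flops, tokens_per_param=20.0):
    """Jointly size N and D for a budget C, assuming C ~ 6ND and a fixed D/N ratio.
    Solving 6 * N * (ratio * N) = C gives N = sqrt(C / (6 * ratio))."""
    n_star = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    d_star = tokens_per_param * n_star
    return n_star, d_star

n_star, d_star = chinchilla_plan(1e21)   # roughly 2.9e9 params, 5.8e10 tokens
```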
6. Kaplan and Chinchilla Lessons
This part studies the Kaplan and Chinchilla lessons as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Main idea | Formula |
|---|---|---|
| Kaplan-style result | Smooth power laws made loss forecasting practical | $L$ follows power laws across scale ranges |
| Chinchilla revision | More tokens per parameter can be compute-optimal than earlier practice suggested | $D^\star/N^\star$ increases |
| Undertraining diagnostic | A model with low tokens per parameter may benefit more from data than size | low $D/N$ |
| Overtraining for inference | Training smaller models longer can reduce serving cost | $D/N$ above the pretraining-optimal ratio |
| Do not cargo-cult ratios | Ratios depend on data, architecture, and objective | $D/N$ is a planning prior, not a theorem |
6.1 Kaplan-style result
Main idea. Smooth power laws made loss forecasting practical.
Core relation: $L$ follows power laws across scale ranges.
6.2 Chinchilla revision
Main idea. More tokens per parameter can be compute-optimal than earlier practice suggested.
Core relation: $D^\star/N^\star$ increases relative to earlier practice.
AI connection. The practical lesson is that undertrained large models can waste compute even when they look impressive.
6.3 Undertraining diagnostic
Main idea. A model with low tokens per parameter may benefit more from data than size.
Core relation: tokens per parameter $D/N$ falls far below a compute-optimal plan.
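The diagnostic is a one-line ratio check. The threshold of 20 tokens per parameter is a Chinchilla-style prior, and the example figures echo early 175B-parameter runs trained on roughly 300B tokens:

```python
def undertrained(n_params, n_tokens, min_tokens_per_param=20.0):
    """Return (flag, ratio): flag is True when D/N sits far below the planning prior."""
    ratio = n_tokens / n_params
    return ratio < min_tokens_per_param, ratio

# A 175e9-parameter model trained on 300e9 tokens: ratio ~1.7,
# so extra data likely helps more than extra size.
flag, ratio = undertrained(175e9, 300e9)
```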
6.4 Overtraining for inference
Main idea. Training smaller models longer can reduce serving cost.
Core relation: choosing $D/N$ above the pretraining-optimal ratio trades extra training compute for cheaper serving.
6.5 Do not cargo-cult ratios
Main idea. Ratios depend on data, architecture, and objective.
Core relation: $D/N$ is a planning prior, not a theorem.
7. Fitting Scaling Laws
This part studies fitting scaling laws as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Main idea | Formula |
|---|---|---|
| Log transformation | If the floor is known, power-law fitting becomes linear | $\log(L - L_\infty) = \log A - \alpha \log x$ |
| Unknown floor | The irreducible loss floor must be fit or bounded carefully | $L_\infty$ becomes a free parameter |
| Holdout validation | Reserve some runs to test forecasts | predict reserved runs before trusting the fit |
| Residual analysis | Curved residuals mean the law is missing a factor | residuals vs $\log x$ should be trendless |
| Extrapolation risk | Small-run fits can fail when the regime changes | valid only near the fitted range |
7.1 Log transformation
Main idea. If the floor is known, power-law fitting becomes linear.
Core relation: $\log(L - L_\infty) = \log A - \alpha \log x$.
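With the floor known, the fit is ordinary linear regression in log space. A minimal NumPy sketch on synthetic, noiseless data (the floor 1.8, amplitude 5.0, and exponent 0.3 are made-up values for the demo):

```python
import numpy as np

def fit_power_law(x, loss, l_floor):
    """Fit L = l_floor + A * x**(-alpha) by linear regression on log-log axes."""
    slope, intercept = np.polyfit(np.log(np.asarray(x)),
                                  np.log(np.asarray(loss) - l_floor), 1)
    return np.exp(intercept), -slope   # A, alpha

x = np.logspace(6, 10, 8)             # resource values spanning four decades
loss = 1.8 + 5.0 * x**-0.3            # synthetic runs that follow the law exactly
a_hat, alpha_hat = fit_power_law(x, loss, l_floor=1.8)
```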
7.2 Unknown floor
Main idea. The irreducible loss floor must be fit or bounded carefully.
Core relation: $L(x) = L_\infty + A x^{-\alpha}$ with $L_\infty$ a free parameter to be fit or bounded.
AI connection. A small error in the loss floor can bend exponent estimates and make forecasts overconfident.
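One simple way to handle an unknown floor is to grid-search candidate floors and keep the one whose log-space fit is most linear. A sketch on synthetic data (the true floor 1.8 is made up for the demo):

```python
import numpy as np

def fit_floor(x, loss, floor_grid):
    """Grid-search L_inf; for each candidate, fit log(L - L_inf) linearly in
    log(x) and keep the candidate with the smallest squared residual."""
    logx = np.log(np.asarray(x))
    best = None
    for l_inf in floor_grid:
        excess = np.asarray(loss) - l_inf
        if np.any(excess <= 0):
            continue                      # candidate floor above an observed loss
        y = np.log(excess)
        slope, intercept = np.polyfit(logx, y, 1)
        sse = float(np.sum((y - (slope * logx + intercept)) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(l_inf), float(np.exp(intercept)), float(-slope))
    return best[1], best[2], best[3]      # floor, A, alpha

x = np.logspace(6, 10, 12)
loss = 1.8 + 5.0 * x**-0.3
floor_hat, a_hat, alpha_hat = fit_floor(x, loss, np.arange(1.0, 1.81, 0.05))
```

In practice, profile the fit quality over the floor rather than trusting a single point estimate, since nearby floors can fit almost as well.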
7.3 Holdout validation
Main idea. Reserve some runs to test forecasts.
Core relation: fit on a subset of runs, then score the fitted law on the reserved runs before trusting any extrapolation.
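A sketch of holdout validation: fit on the smaller runs, then score the forecast on the largest runs (all data here is synthetic):

```python
import numpy as np

def holdout_check(x, loss, l_floor, n_holdout=2, tol=1e-3):
    """Fit the power law on all but the largest n_holdout runs, then report
    the worst absolute forecast error on the held-out (largest) runs."""
    x = np.asarray(x, dtype=float)
    loss = np.asarray(loss, dtype=float)
    order = np.argsort(x)
    x, loss = x[order], loss[order]
    slope, intercept = np.polyfit(np.log(x[:-n_holdout]),
                                  np.log(loss[:-n_holdout] - l_floor), 1)
    pred = l_floor + np.exp(intercept) * x[-n_holdout:] ** slope
    err = float(np.max(np.abs(pred - loss[-n_holdout:])))
    return err < tol, err

x = np.logspace(6, 10, 10)
loss = 1.8 + 5.0 * x**-0.3
ok, err = holdout_check(x, loss, l_floor=1.8)
```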
7.4 Residual analysis
Main idea. Curved residuals mean the law is missing a factor.
Core relation: residuals $r_i = \log(L_i - L_\infty) - (\log A - \alpha \log x_i)$ should show no trend in $\log x_i$.
7.5 Extrapolation risk
Main idea. Small-run fits can fail when regime changes.
Core relation: a law fitted on $[x_{\min}, x_{\max}]$ is trustworthy only near that range; changes in optimizer, data, or architecture regime can break it.
8. Data Quality and Mixtures
This part studies data quality and mixtures as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Main idea | Formula |
|---|---|---|
| Mixture weights | Training data is a weighted mixture of sources | source weights $w_i$ with $\sum_i w_i = 1$ |
| Effective tokens | Quality, deduplication, and domain match alter token value | $D_\mathrm{eff} \le D$ |
| Contamination | Validation leakage makes scaling look better than it is | $\hat L$ understates the true loss |
| Curriculum and filtering | Data order and filtering can shift constants | $A$ changes even if $\alpha$ is similar |
| Evaluation distribution | Optimize for the distribution that matters | $L_\mathrm{eval}$, not only $L_\mathrm{web}$ |
8.1 Mixture weights
Main idea. Training data is a weighted mixture of sources.
Core relation: training draws tokens from sources $i$ with mixture weights $w_i$, $\sum_i w_i = 1$.
8.2 Effective tokens
Main idea. Quality, deduplication, and domain match alter token value.
Core relation: effective tokens $D_\mathrm{eff} \le D$; quality, deduplication, and domain match set the gap.
AI connection. Data quantity without data value is a weak planning variable.
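One way to make "token value" concrete is a per-source discount. The weights and token counts below are hypothetical placeholders; in practice such weights come from ablation runs, not intuition:

```python
def effective_tokens(sources):
    """Sum raw token counts discounted by a per-source value weight in [0, 1]."""
    return sum(tokens * weight for tokens, weight in sources)

corpus = [
    (1.0e12, 0.6),   # raw web crawl: duplication and noise lower its value
    (2.0e11, 1.0),   # curated books and reference text
    (5.0e10, 0.9),   # deduplicated code
]
d_raw = sum(tokens for tokens, _ in corpus)   # 1.25e12 raw tokens
d_eff = effective_tokens(corpus)              # 8.45e11 effective tokens
```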
8.3 Contamination
Main idea. Validation leakage makes scaling look better than it is.
Core relation: measured $\hat L$ sits below the true held-out loss when evaluation data leaks into training.
8.4 Curriculum and filtering
Main idea. Data order and filtering can shift constants.
Core relation: $A$ changes even if $\alpha$ is similar.
8.5 Evaluation distribution
Main idea. Optimize for the distribution that matters.
Core relation: minimize $L_\mathrm{eval}$, not only $L_\mathrm{web}$.
9. Inference-Aware Scaling
This part studies inference-aware scaling as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Main idea | Formula |
|---|---|---|
| Serving cost | A model trained once may be served many times | $6ND$ to train, $\approx 2N$ FLOPs per served token |
| Latency constraint | Larger models can be better but too slow | latency budget caps $N$ |
| Small overtrained models | More training tokens can improve a smaller, cheaper model | $D/N$ can exceed pretraining-optimal planning ratios |
| Distillation | Teacher quality can transfer into a smaller student | student trained on teacher outputs |
| Total cost objective | Choose models by train plus serve plus quality | minimize train plus serve cost at a quality target |
9.1 Serving cost
Main idea. A model trained once may be served many times.
Core relation: lifetime compute is roughly $6ND$ for training plus about $2N$ FLOPs per served token for a dense model.
AI connection. The best training run is not always the cheapest product model.
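The train-versus-serve accounting fits in one function, using the common approximations of ~6N training FLOPs and ~2N inference FLOPs per token for a dense model (the token volumes below are illustrative):

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Total compute over a model's life: ~6N FLOPs per training token
    plus ~2N FLOPs per served token (dense decoder-only approximation)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens

train_only = lifetime_flops(7e9, 2e12, 0.0)
with_serving = lifetime_flops(7e9, 2e12, 6e12)   # serving matches training cost here
```

Under these approximations, serving FLOPs equal training FLOPs once served tokens reach three times the training tokens, which is why high-traffic deployments favor smaller, longer-trained models.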
9.2 Latency constraint
Main idea. Larger models can be better but too slow.
Core relation: minimize loss subject to a latency budget, which caps $N$ for a given serving stack.
9.3 Small overtrained models
Main idea. More training tokens can improve a smaller cheaper model.
Core relation: $D/N$ can exceed pretraining-optimal planning ratios.
9.4 Distillation
Main idea. Teacher quality can transfer into a smaller student.
Core relation: a student trained to match teacher outputs can reach lower loss than the same student trained on raw data alone.
9.5 Total cost objective
Main idea. Choose models by train plus serve plus quality.
Core relation: minimize training cost plus expected serving cost, subject to a quality target.
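A sketch of the objective: filter candidates to a quality target, then pick the cheapest by train-plus-serve FLOPs. The candidate sizes, token ratios, and loss estimates below are all hypothetical:

```python
def total_cost(n_params, tokens_per_param, served_tokens):
    """Train-plus-serve FLOPs: ~6N per training token, ~2N per served token."""
    train = 6.0 * n_params * (tokens_per_param * n_params)
    serve = 2.0 * n_params * served_tokens
    return train + serve

def cheapest_at_quality(candidates, served_tokens, slack=0.02):
    """candidates: (n_params, tokens_per_param, est_loss) triples.
    Keep models within `slack` nats of the best loss, then minimize cost."""
    target = min(loss for _, _, loss in candidates) + slack
    viable = [c for c in candidates if c[2] <= target]
    return min(viable, key=lambda c: total_cost(c[0], c[1], served_tokens))

candidates = [
    (7e9, 120, 1.96),    # small, heavily overtrained
    (13e9, 40, 1.95),
    (34e9, 20, 1.95),    # near a pretraining-optimal ratio
]
best = cheapest_at_quality(candidates, served_tokens=1e13)
```

At this serving volume the small overtrained model wins despite its slightly worse loss; at low volume the ranking can flip, which is the whole point of the objective.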
10. Limits and Debugging
This part studies limits and debugging as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.
| Subtopic | Main idea | Formula |
|---|---|---|
| Not a theorem of all AI | Scaling laws are fitted empirical regularities | fits, not theorems |
| Metric artifacts | Thresholded task scores can look sudden even when loss changes smoothly | thresholding can fake a jump |
| Architecture changes | New attention, MoE, or tokenizer choices can change constants | $A$, $\alpha$, $L_\infty$ shift |
| Bad small runs | Bugs in small runs poison the forecast | $\hat L$ inherits measurement error |
| Decision checklist | Use scaling laws as planning instruments with uncertainty | forecast with intervals, validate with reserved runs |
10.1 Not a theorem of all AI
Main idea. Scaling laws are fitted empirical regularities.
Core relation: a fitted law describes the measured model family and setup, nothing more.
10.2 Metric artifacts
Main idea. Thresholded task scores can look sudden even when loss changes smoothly.
Core relation: a thresholded metric on top of a smoothly improving loss can jump abruptly.
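A toy illustration of the artifact: exact-match scoring over a multi-token answer turns a smooth loss improvement into an apparent jump. The answer length and loss values are arbitrary, and per-token accuracy $e^{-L}$ is a rough proxy, not a law:

```python
import math

def task_pass_rate(loss, answer_len=20):
    """Exact-match probability when all answer_len tokens must be right,
    using per-token probability exp(-loss) as a rough proxy."""
    return math.exp(-loss) ** answer_len

p_before = task_pass_rate(1.2)   # ~4e-11: the task looks impossible
p_after = task_pass_rate(0.8)    # ~1e-7: a 0.4-nat drop scales pass rate by e^8, ~3000x
```

The loss improved smoothly, but a pass-rate curve plotted against scale would look like a sudden emergence near the end of this range.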
10.3 Architecture changes
Main idea. New attention, MoE, or tokenizer choices can change constants.
Core relation: $A$, $\alpha$, and $L_\infty$ shift.
10.4 Bad small runs
Main idea. Bugs in small runs poison the forecast.
Core relation: $\hat L$ inherits the measurement error of the small runs used to fit it.
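One way to catch a poisoned fit is a residual scan. In the sketch below, the injected offset stands in for a bug such as evaluating one run on the wrong split; its size and position are arbitrary choices for the demo.

```python
# A single buggy small run shows up as the largest |residual| in log space.
import numpy as np

C = np.logspace(18, 21, 7)
L = 1.7 + 3.0 * C ** -0.05                  # clean synthetic losses
L[2] += 0.15                                # simulated bug in one small run

slope, intercept = np.polyfit(np.log(C), np.log(L - 1.7), 1)
resid = np.log(L - 1.7) - (intercept + slope * np.log(C))
suspect = int(np.argmax(np.abs(resid)))
print("suspect run index:", suspect)
```

Rerunning the suspect configuration is almost always cheaper than letting its error propagate into a large-scale forecast.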
10.5 Decision checklist
Main idea. Use scaling laws as planning instruments with uncertainty.
Core relation: a forecast is a point estimate plus an uncertainty band; decide from the band, not the point.
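A minimal sketch of "attach uncertainty to every forecast", using a parametric-bootstrap refit; the noise level, compute grid, and target budget are all assumed for illustration.

```python
# Refit a toy law under resampled noise and report an interval, not a point.
import numpy as np

rng = np.random.default_rng(0)
L_inf = 1.7
C = np.logspace(18, 21, 7)
L = L_inf + 3.0 * C ** -0.05 + rng.normal(0, 0.005, C.size)

C_target = 1e22                             # one decade of extrapolation
preds = []
for _ in range(1000):
    L_b = L + rng.normal(0, 0.005, C.size)  # parametric-bootstrap resample
    s, i = np.polyfit(np.log(C), np.log(L_b - L_inf), 1)
    preds.append(L_inf + np.exp(i) * C_target ** s)

lo, hi = np.percentile(preds, [5, 95])
print(f"90% forecast interval at 1e22 FLOPs: [{lo:.3f}, {hi:.3f}]")
```

If the proposed run's expected gain is smaller than the width of this interval, the run is not yet justified by the fit.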
Practice Exercises
- Fit a one-resource power law after subtracting a known loss floor.
- Convert parameters and tokens into approximate dense training FLOPs.
- Compute token count from a fixed compute budget and model size.
- Search an IsoFLOP curve for the lowest toy loss.
- Diagnose whether a model is undertrained by checking tokens per parameter.
- Compute effective tokens for a data mixture.
- Estimate serving cost for two model choices.
- Compute residuals for scaling-law forecasts.
- Show how a thresholded metric can look abrupt even when loss is smooth.
- Write a forecast checklist with uncertainty.
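Two of the exercises above reduce to arithmetic with the dense-transformer estimate $C \approx 6ND$ from the overview; the model size, token count, and budget below are example values, not recommendations.

```python
# Forward: parameters and tokens -> approximate dense training FLOPs.
N = 7e9                                     # parameters (example)
D = 2e12                                    # training tokens (example)
C = 6 * N * D
print(f"C ~= {C:.2e} FLOPs")                # ~8.4e22

# Inverse: tokens affordable under a fixed budget for the same model size.
budget = 1e23                               # FLOPs
D_afford = budget / (6 * N)
print(f"affordable tokens: {D_afford:.2e}") # ~2.4e12
```

The same two lines of arithmetic, run in both directions, are the backbone of the IsoFLOP and undertraining exercises.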
Why This Matters for AI
Scaling laws turn expensive guessing into disciplined planning. They help decide whether to collect more data, train longer, use a smaller model, improve data quality, or avoid a run whose expected gain is too small. They also teach humility: a clean line on a log-log plot is useful, but it is still a fit to a regime, not a theorem about intelligence.
Bridge to Efficient Attention and Inference
The next section studies how attention and inference systems change the cost side of the scaling equation. Scaling laws say what loss might be worth buying; efficient attention and serving math decide whether that loss can be bought and served under memory, latency, and throughput constraints.
References
- Jared Kaplan et al., "Scaling Laws for Neural Language Models", 2020: https://arxiv.org/abs/2001.08361
- Jordan Hoffmann et al., "Training Compute-Optimal Large Language Models", 2022: https://arxiv.org/abs/2203.15556
- Tom Henighan et al., "Scaling Laws for Autoregressive Generative Modeling", 2020: https://arxiv.org/abs/2010.14701
- Ethan Caballero et al., "Broken Neural Scaling Laws", 2022: https://arxiv.org/abs/2210.14891
- Sam McCandlish et al., "An Empirical Model of Large-Batch Training", 2018: https://arxiv.org/abs/1812.06162
- Yonatan Bisk et al., "Experience Grounds Language", 2020, for caution about benchmarks and grounding limits: https://arxiv.org/abs/2004.10151