Math for LLMs / Scaling Laws

Scaling laws are empirical formulas for forecasting how language-model loss changes as parameters, data, and compute change. They are not a replacement for experiments; they are a way to make experiments smaller, cheaper, and more informative.

Overview

The core form is a power law with a floor:

L(X)=L_\infty + AX^{-\alpha}.

Here L is usually held-out cross-entropy, X is a resource such as parameters, tokens, or compute, L_\infty is the irreducible floor for the setup, and \alpha controls how fast excess loss falls. In LLM planning, the most important resources are parameter count N, training tokens D, and training compute C. For dense decoder-only transformers, a useful first estimate is:

C\approx 6ND.

The practical question is: for a fixed budget, what combination of N and D gives the lowest expected loss, and how uncertain is that forecast?
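A sketch of this bookkeeping in code; the parameter and token counts are made-up examples, and C\approx 6ND is only a first-order estimate:

```python
# Back-of-the-envelope training compute via C ~= 6*N*D for a dense
# decoder-only transformer (roughly 6 FLOPs per parameter per token).
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

# A hypothetical 7e9-parameter model trained on 2e12 tokens:
c = train_flops(7e9, 2e12)
print(f"{c:.2e}")  # 8.40e+22 FLOPs
```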

Prerequisites

  • Cross-entropy, perplexity, and held-out loss
  • Logarithms and log-log plots
  • Basic curve fitting and residual analysis
  • Training FLOPs and token accounting
  • The previous sections on training at scale and fine-tuning

Companion Notebooks

Notebook | Purpose
theory.ipynb | Fits toy power laws, builds IsoFLOP curves, solves compute-optimal allocation by grid search, studies data quality, serving cost, and residual diagnostics.
exercises.ipynb | Ten practice problems for power-law fitting, compute budgets, Chinchilla-style sizing, undertraining checks, and forecast uncertainty.

Learning Objectives

After this section, you should be able to:

  • Explain the difference between parameter, data, and compute scaling.
  • Fit a power law when the irreducible loss floor is known or hypothesized.
  • Estimate training compute with C\approx 6ND.
  • Draw and interpret IsoFLOP tradeoffs.
  • Solve a compute-optimal allocation problem for a two-term loss proxy.
  • Explain the Kaplan and Chinchilla lessons without cargo-culting ratios.
  • Reason about data quality, repeated data, and effective token counts.
  • Include inference cost and latency in model-size decisions.
  • Diagnose residuals and uncertainty in scaling-law forecasts.

Table of Contents

  1. What Scaling Laws Measure
  2. Parameter Scaling
  3. Data Scaling
  4. Compute Scaling
  5. Compute-Optimal Allocation
  6. Kaplan and Chinchilla Lessons
  7. Fitting Scaling Laws
  8. Data Quality and Mixtures
  9. Inference-Aware Scaling
  10. Limits and Debugging

Forecasting Workflow

small runs -> held-out loss table -> fit simple law -> holdout forecast -> residual check -> budget decision -> larger validation run

At each step, write down the exact setup: architecture, tokenizer, data mixture, optimizer, schedule, batch size, precision, evaluation set, and filtering. A scaling law fitted under one setup is not guaranteed to transfer to another setup.

1. What Scaling Laws Measure

This part studies what scaling laws measure as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

Subtopic | Question | Formula
Loss as the target | scaling laws usually model held-out cross-entropy | L=-E[\log p_\theta(t_i\mid t_{<i})]
Resources | parameters, tokens, and compute are the main axes | N,\ D,\ C
Power-law form | loss often improves approximately linearly on log-log axes after subtracting an irreducible floor | L(X)=L_\infty + AX^{-\alpha}
Forecasting discipline | fit small runs, predict larger runs, then validate residuals | \hat L(C_\mathrm{large})
Scope | a scaling law is empirical over a data, architecture, optimizer, and tokenization regime | \mathrm{law}=\mathrm{fit}\mid\mathrm{setup}

1.1 Loss as the target

Main idea. Scaling laws usually model held-out cross-entropy.

Core relation:

L=-E[\log p_\theta(t_i\mid t_{<i})]

Scaling laws are not magic. They are compact empirical models of how loss changes when resources change. Their value comes from discipline: keep the setup stable, measure held-out loss, fit simple forms, reserve validation runs, and attach uncertainty to every forecast.

Worked micro-example. Suppose loss follows L(C)=1.5+0.8C^{-0.05}. Increasing compute by a factor of 10 multiplies the excess loss by 10^{-0.05}\approx 0.891. That is real improvement, but it is slow improvement. Power laws explain both why scale helps and why scale is expensive.

Implementation check. Plot on log-log axes, inspect residuals, and verify that the fitted law predicts runs that were not used for fitting. If the residuals curve systematically, the simple power law is missing a factor.

AI connection. This is a practical quantity for planning or checking a training run.

Common mistake. Do not treat a published coefficient as a universal constant. Exponents and offsets depend on model family, data, optimizer, tokenizer, and evaluation distribution.
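The micro-example above is easy to check numerically; the constants 1.5, 0.8, and 0.05 are the toy values from the example, not fitted quantities:

```python
# Toy compute law from the worked micro-example: L(C) = 1.5 + 0.8 * C**-0.05.
def toy_loss(c: float) -> float:
    return 1.5 + 0.8 * c ** -0.05

c0 = 1e20  # arbitrary starting compute; the ratio below is scale-free
excess_ratio = (toy_loss(10 * c0) - 1.5) / (toy_loss(c0) - 1.5)
print(round(excess_ratio, 3))  # 0.891, i.e. 10**-0.05
```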

1.2 Resources

Main idea. Parameters, tokens, and compute are the main axes.

Core relation:

N,\ D,\ C


1.3 Power-law form

Main idea. Loss often improves approximately linearly on log-log axes after subtracting an irreducible floor.

Core relation:

L(X)=L_\infty + AX^{-\alpha}

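When the floor is hypothesized in advance, this form can be fitted by ordinary least squares in log space. A minimal sketch on synthetic data; the generating constants 2.0, 1.2, and 0.3 are arbitrary:

```python
import math

# Synthetic "runs" drawn exactly from L(X) = 2.0 + 1.2 * X**-0.3; in
# practice these pairs would be held-out losses from small training runs.
xs = [1e6, 1e7, 1e8, 1e9]
ls = [2.0 + 1.2 * x ** -0.3 for x in xs]

l_floor = 2.0  # hypothesized irreducible loss L_inf
logx = [math.log(x) for x in xs]
logy = [math.log(l - l_floor) for l in ls]

# Least squares for log(L - L_inf) = log(A) - alpha * log(X).
n = len(xs)
mx, my = sum(logx) / n, sum(logy) / n
slope = (sum((a - mx) * (b - my) for a, b in zip(logx, logy))
         / sum((a - mx) ** 2 for a in logx))
alpha, amp = -slope, math.exp(my - slope * mx)
print(round(alpha, 3), round(amp, 3))  # recovers alpha = 0.3, A = 1.2
```

On real data, also refit with several candidate floors and compare residuals, since the fitted exponent is sensitive to the assumed L_\infty.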

1.4 Forecasting discipline

Main idea. Fit small runs, predict larger runs, then validate residuals.

Core relation:

\hat L(C_\mathrm{large})


1.5 Scope

Main idea. A scaling law is empirical over a data, architecture, optimizer, and tokenization regime.

Core relation:

\mathrm{law}=\mathrm{fit}\mid\mathrm{setup}


2. Parameter Scaling

This part studies parameter scaling as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

Subtopic | Question | Formula
Model-size law | larger models reduce loss when data and compute are not limiting | L(N)=L_\infty + A_N N^{-\alpha_N}
Capacity limit | too small a model cannot represent the distribution well | N\downarrow\Rightarrow L\uparrow
Over-large model | a model can be too large for the available token budget | D/N too small
Architecture dependence | the coefficients depend on architecture and training recipe | A_N,\alpha_N are not universal constants
Fine-tuning implication | adapter rank can be viewed as an adaptation-capacity axis | r controls update capacity

2.1 Model-size law

Main idea. Larger models reduce loss when data and compute are not limiting.

Core relation:

L(N)=L_\infty + A_N N^{-\alpha_N}


2.2 Capacity limit

Main idea. Too small a model cannot represent the distribution well.

Core relation:

N\downarrow\Rightarrow L\uparrow


2.3 Over-large model

Main idea. A model can be too large for the available token budget.

Core relation:

D/N too small


2.4 Architecture dependence

Main idea. The coefficients depend on architecture and training recipe.

Core relation:

A_N,\alpha_N are not universal constants


2.5 Fine-tuning implication

Main idea. Adapter rank can be viewed as an adaptation-capacity axis.

Core relation:

r controls update capacity


3. Data Scaling

This part studies data scaling as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

Subtopic | Question | Formula
Token-count law | more unique useful tokens reduce loss | L(D)=L_\infty + A_D D^{-\alpha_D}
Data quality | one high-quality token can be worth more than one noisy token | D_\mathrm{eff}\ne D_\mathrm{raw}
Repeated data | repetition gives diminishing returns and can increase memorization | D_\mathrm{eff}<kD for repeats
Domain match | the validation distribution defines what improvement means | p_\mathrm{train}\approx p_\mathrm{eval}
Tokenizer dependence | token counts depend on the tokenizer | D is measured in model tokens

3.1 Token-count law

Main idea. More unique useful tokens reduce loss.

Core relation:

L(D)=L_\infty + A_D D^{-\alpha_D}


3.2 Data quality

Main idea. One high-quality token can be worth more than one noisy token.

Core relation:

D_\mathrm{eff}\ne D_\mathrm{raw}


3.3 Repeated data

Main idea. Repetition gives diminishing returns and can increase memorization.

Core relation:

D_\mathrm{eff}<kD for repeats

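One way to make "repetition gives diminishing returns" concrete is a toy discounting model in which each additional epoch over the same tokens is worth a constant fraction of the previous one. The decay value here is invented for illustration, not measured:

```python
# Toy effective-token count: epoch e contributes unique_tokens * decay**e,
# so k passes over D unique tokens yield D_eff < k*D.
def effective_tokens(unique_tokens: float, epochs: int, decay: float = 0.6) -> float:
    return unique_tokens * sum(decay ** e for e in range(epochs))

d = 1e11
print(effective_tokens(d, 1) / d)  # one pass: full value (1.0)
print(effective_tokens(d, 4) / d)  # four passes: well under 4.0
```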

3.4 Domain match

Main idea. The validation distribution defines what improvement means.

Core relation:

p_\mathrm{train}\approx p_\mathrm{eval}


3.5 Tokenizer dependence

Main idea. Token counts depend on the tokenizer.

Core relation:

D is measured in model tokens


4. Compute Scaling

This part studies compute scaling as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

Subtopic | Question | Formula
Dense training FLOPs | a common first estimate for dense transformers is six times parameters times tokens | C\approx 6ND
Compute law | loss can be modeled directly as a function of compute | L(C)=L_\infty + A_C C^{-\alpha_C}
IsoFLOP curves | fixed compute creates a tradeoff between model size and token count | D=C/(6N)
Compute is not wall-clock | hardware utilization and communication decide time | T=C/\mathrm{achieved\ FLOPs\ per\ sec}
Budget translation | money and time become constraints after mapping to achievable FLOPs | \mathrm{cost}=T\cdot\mathrm{price}

4.1 Dense training FLOPs

Main idea. A common first estimate for dense transformers is six times parameters times tokens.

Core relation:

C\approx 6ND


4.2 Compute law

Main idea. Loss can be modeled directly as a function of compute.

Core relation:

L(C)=L_\infty + A_C C^{-\alpha_C}


4.3 IsoFLOP curves

Main idea. Fixed compute creates a tradeoff between model size and token count.

Core relation:

D=C/(6N)

AI connection. These curves are how you ask whether the next dollar should buy a larger model or more training tokens.
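The tradeoff can be made concrete with a small grid sweep under an assumed two-term loss (introduced formally in section 5); every coefficient below is an illustrative placeholder, not a fitted constant:

```python
# IsoFLOP sweep: at fixed compute C, each model size N affords D = C/(6N)
# training tokens. Score each (N, D) with an assumed two-term loss.
L_INF, A, ALPHA, B, BETA = 1.7, 360.0, 0.34, 1000.0, 0.28  # placeholders

def two_term_loss(n: float, d: float) -> float:
    return L_INF + A / n ** ALPHA + B / d ** BETA

C = 1e22
points = []
for exp in range(8, 12):          # N from 1e8 to 5e11, log-spaced
    for mant in (1.0, 2.0, 5.0):
        n = mant * 10 ** exp
        d = C / (6 * n)           # tokens affordable at this model size
        points.append((two_term_loss(n, d), n, d))

best_loss, best_n, best_d = min(points)
print(f"N*={best_n:.1e}  D*={best_d:.1e}  L={best_loss:.3f}")
# With these placeholder coefficients the minimum is interior, near N*=1e9.
```

Plotting loss against N at several values of C gives the familiar U-shaped IsoFLOP curves; the grid here just picks out each curve's minimum.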

4.4 Compute is not wall-clock

Main idea. Hardware utilization and communication decide time.

Core relation:

T=C/\mathrm{achieved\ FLOPs\ per\ sec}


4.5 Budget translation

Main idea. Money and time become constraints after mapping to achievable FLOPs.

Core relation:

\mathrm{cost}=T\cdot\mathrm{price}

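Subsections 4.4 and 4.5 combine into simple arithmetic; every hardware and price number below is an illustrative placeholder, not a quote for any real accelerator:

```python
# Translate a FLOP budget into wall-clock time and money.
C = 8.4e22                    # training FLOPs, e.g. from C ~= 6*N*D
peak_flops = 1e15             # peak FLOPs/sec per device (placeholder)
mfu = 0.4                     # achieved fraction of peak (placeholder)
n_devices = 256
usd_per_device_hour = 2.0     # placeholder price

achieved = peak_flops * mfu * n_devices   # cluster-wide achieved FLOPs/sec
t_hours = C / achieved / 3600             # T = C / achieved FLOPs per sec
cost = t_hours * n_devices * usd_per_device_hour
print(f"{t_hours / 24:.1f} days, ${cost:,.0f}")  # 9.5 days, $116,667
```

Note that halving utilization doubles both time and cost at fixed C, which is why the achieved (not peak) FLOP rate belongs in the denominator.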

5. Compute-Optimal Allocation

This part studies compute-optimal allocation as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

Subtopic | Question | Formula
Two-term loss proxy | separate parameter-limited and data-limited terms | L(N,D)=L_\infty + A/N^\alpha + B/D^\beta
Constraint | training compute couples N and D | C=6ND
Balanced marginal returns | at optimum, extra compute should help similarly through N and D | \partial L/\partial N balances \partial L/\partial D under the constraint
Chinchilla intuition | many earlier large models were undertrained relative to their size | D/N was too small
Practical rule | choose N and D together, not one after the other | (N^\star,D^\star)=\arg\min_{6ND=C}L(N,D)

5.1 Two-term loss proxy

Main idea. Separate parameter-limited and data-limited terms.

Core relation:

L(N,D)=L_\infty + A/N^\alpha + B/D^\beta

Scaling laws are not magic. They are compact empirical models of how loss changes when resources change. Their value comes from discipline: keep the setup stable, measure held-out loss, fit simple forms, reserve validation runs, and attach uncertainty to every forecast.

Worked micro-example. Suppose loss follows L(C)=1.5+0.8C0.05L(C)=1.5+0.8C^{-0.05}. Increasing compute by a factor of 10 multiplies the excess loss by 100.050.89110^{-0.05}\approx 0.891. That is real improvement, but it is slow improvement. Power laws explain both why scale helps and why scale is expensive.

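The two-term proxy is easy to probe numerically. A minimal sketch in Python; every coefficient below is an illustrative placeholder, not a fitted value:

```python
# Two-term loss proxy: L(N, D) = L_inf + A / N**alpha + B / D**beta.
# All coefficients are illustrative placeholders, not fitted values.
L_INF = 1.7
A, ALPHA = 400.0, 0.34
B, BETA = 150.0, 0.28

def loss_proxy(n_params: float, n_tokens: float) -> float:
    """Predicted held-out loss for N parameters trained on D tokens."""
    return L_INF + A / n_params**ALPHA + B / n_tokens**BETA

base        = loss_proxy(1e9, 5e9)    # 1B params, 5B tokens
more_data   = loss_proxy(1e9, 1e10)   # double the data
more_params = loss_proxy(2e9, 5e9)    # double the parameters
# Either resource reduces loss, but the prediction never drops below L_INF.
```

Which doubling helps more depends entirely on where the run sits relative to the two terms, which is the point of fitting them separately.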

5.2 Constraint

Main idea. Training compute couples $N$ and $D$.

Core relation:

$C=6ND$

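Under $C\approx 6ND$, fixing any two of compute, parameters, and tokens pins down the third. A small helper (the budget numbers are made up for illustration):

```python
# Under C ~= 6*N*D, fixing any two of compute, parameters, and tokens
# determines the third. Budget numbers are made up for illustration.
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense decoder-only transformer."""
    return 6.0 * n_params * n_tokens

def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Training tokens affordable at a fixed compute budget and model size."""
    return compute_flops / (6.0 * n_params)

# Example: a 1e22-FLOP budget with a 7e9-parameter model.
d = tokens_for_budget(1e22, 7e9)   # roughly 2.4e11 tokens
```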

5.3 Balanced marginal returns

Main idea. At the optimum, extra compute should help similarly through $N$ and $D$.

Core relation:

$\partial L/\partial N$ balances $\partial L/\partial D$ under the constraint

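For the two-term proxy, differentiating along the constraint $C=6ND$ gives the balance condition $\alpha A/N^\alpha = \beta B/D^\beta$ at the optimum. A sketch that checks this numerically, with invented coefficients and budget:

```python
# At the optimum of L = L_inf + A/N**a + B/D**b subject to C = 6*N*D,
# differentiating along the constraint gives a*A/N**a == b*B/D**b.
# Coefficients and the budget are invented for illustration.
A, a = 400.0, 0.34
B, b = 150.0, 0.28
C = 1e22

def excess_loss(n_params: float) -> float:
    d = C / (6.0 * n_params)
    return A / n_params**a + B / d**b

grid = [10**(7 + 5 * i / 2000) for i in range(2001)]  # N from 1e7 to 1e12
n_star = min(grid, key=excess_loss)
d_star = C / (6.0 * n_star)

lhs = a * A / n_star**a   # marginal weight of the parameter term
rhs = b * B / d_star**b   # marginal weight of the data term
# lhs and rhs agree up to the grid resolution.
```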

5.4 Chinchilla intuition

Main idea. Many earlier large models were undertrained relative to their size.

Core relation:

$D/N$ was too small


5.5 Practical rule

Main idea. Choose $N$ and $D$ together, not one after the other.

Core relation:

$(N^\star,D^\star)=\arg\min_{6ND=C}L(N,D)$

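A direct way to apply the rule: grid-search over $N$ at a fixed budget, with $D$ determined by $C=6ND$. A sketch with invented proxy coefficients; the takeaway is that a bigger budget grows both $N^\star$ and $D^\star$:

```python
# (N*, D*) = argmin over 6ND=C of L(N, D), by 1-D grid search over N.
# Proxy coefficients are invented for illustration.
A, a = 400.0, 0.34
B, b = 150.0, 0.28

def optimal_allocation(compute: float):
    """Minimize A/N**a + B/D**b with D = compute / (6N)."""
    grid = [10**(6 + 7 * i / 4000) for i in range(4001)]  # N in [1e6, 1e13]
    def proxy(n):
        return A / n**a + B / (compute / (6.0 * n))**b
    n_star = min(grid, key=proxy)
    return n_star, compute / (6.0 * n_star)

n_small, d_small = optimal_allocation(1e20)
n_large, d_large = optimal_allocation(1e22)
# A 100x larger budget grows both N* and D*, not just one of them.
```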

6. Kaplan and Chinchilla Lessons

This part studies the Kaplan and Chinchilla lessons as forecasting tools. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Kaplan-style result | smooth power laws made loss forecasting practical | $L$ follows power laws across scale ranges |
| Chinchilla revision | more tokens per parameter can be compute-optimal than earlier practice suggested | $D^\star/N^\star$ increases |
| Undertraining diagnostic | a model with low tokens per parameter may benefit more from data than size | $D/N\ll\mathrm{target}$ |
| Overtraining for inference | training smaller models longer can reduce serving cost | $\mathrm{train\ cost}+\mathrm{inference\ cost}$ |
| Do not cargo-cult ratios | ratios depend on data, architecture, and objective | $D/N$ is a planning prior, not a theorem |

6.1 Kaplan-style result

Main idea. Smooth power laws made loss forecasting practical.

Core relation:

$L$ follows power laws across scale ranges


6.2 Chinchilla revision

Main idea. More tokens per parameter can be compute-optimal than earlier practice suggested.

Core relation:

$D^\star/N^\star$ increases


AI connection. The practical lesson is that undertrained large models can waste compute even when they look impressive.


6.3 Undertraining diagnostic

Main idea. A model with low tokens per parameter may benefit more from data than size.

Core relation:

$D/N\ll\mathrm{target}$

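A quick sanity check is to compare a run's tokens-per-parameter ratio against a planning target. The target of roughly 20 below is a commonly quoted Chinchilla-style prior, not a universal constant, and the flag threshold is an arbitrary choice:

```python
def looks_undertrained(n_params: float, n_tokens: float,
                       target_ratio: float = 20.0) -> bool:
    """Flag runs whose D/N is far below the planning target.
    target_ratio ~ 20 is a Chinchilla-style prior, not a theorem."""
    return n_tokens / n_params < 0.25 * target_ratio

# GPT-3-scale example: 175B parameters on ~300B tokens gives D/N ~ 1.7.
gpt3_flag = looks_undertrained(175e9, 300e9)        # flagged
# Chinchilla-scale example: 70B parameters on 1.4T tokens gives D/N = 20.
chinchilla_flag = looks_undertrained(70e9, 1.4e12)  # not flagged
```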

6.4 Overtraining for inference

Main idea. Training smaller models longer can reduce serving cost.

Core relation:

$\mathrm{train\ cost}+\mathrm{inference\ cost}$


6.5 Do not cargo-cult ratios

Main idea. Ratios depend on data, architecture, and objective.

Core relation:

$D/N$ is a planning prior, not a theorem


7. Fitting Scaling Laws

This part studies fitting scaling laws as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Log transformation | if the floor is known, power-law fitting becomes linear | $\log(L-L_\infty)=\log A-\alpha\log X$ |
| Unknown floor | the irreducible loss floor must be fit or bounded carefully | $L_\infty$ |
| Holdout validation | reserve some runs to test forecasts | $L_\mathrm{actual}-L_\mathrm{pred}$ |
| Residual analysis | curved residuals mean the law is missing a factor | $\epsilon_i=L_i-\hat L_i$ |
| Extrapolation risk | small-run fits can fail when the regime changes | $X_\mathrm{test}\gg X_\mathrm{fit}$ |

7.1 Log transformation

Main idea. If the floor is known, power-law fitting becomes linear.

Core relation:

$\log(L-L_\infty)=\log A-\alpha\log X$

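With the floor fixed, the fit is ordinary least squares in log space. A self-contained sketch on noiseless synthetic data (real fits would carry noise and need uncertainty estimates):

```python
# With a known (or hypothesized) floor, L(X) = L_inf + A * X**(-alpha)
# becomes a line in log space: log(L - L_inf) = log A - alpha * log X.
import math

def fit_power_law(xs, losses, l_inf):
    """Ordinary least squares on (log x, log(L - L_inf)); returns (A, alpha)."""
    px = [math.log(x) for x in xs]
    py = [math.log(l - l_inf) for l in losses]
    n = len(px)
    mx, my = sum(px) / n, sum(py) / n
    slope = sum((u - mx) * (v - my) for u, v in zip(px, py)) \
            / sum((u - mx) ** 2 for u in px)
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # A = e**intercept, alpha = -slope

# Noiseless synthetic runs from L = 1.5 + 0.8 * X**-0.05.
xs = [1e3, 1e4, 1e5, 1e6]
losses = [1.5 + 0.8 * x**-0.05 for x in xs]
A_hat, alpha_hat = fit_power_law(xs, losses, l_inf=1.5)
# Recovers A ~ 0.8 and alpha ~ 0.05 up to floating-point error.
```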

7.2 Unknown floor

Main idea. The irreducible loss floor must be fit or bounded carefully.

Core relation:

$L_\infty$


AI connection. A small error in the loss floor can bend exponent estimates and make forecasts overconfident.

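One coarse but common heuristic when $L_\infty$ is unknown: scan candidate floors and keep the one that makes the log-space relationship most linear. A sketch on synthetic data with a known true floor:

```python
# Scan candidate floors; keep the one with the most linear log-space fit
# (highest R^2). A coarse heuristic, sensitive to noise on real data.
import math

def r_squared(xs, losses, floor):
    """R^2 of the line through (log x, log(L - floor))."""
    px = [math.log(x) for x in xs]
    py = [math.log(l - floor) for l in losses]
    n = len(px)
    mx, my = sum(px) / n, sum(py) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(px, py))
    sxx = sum((u - mx) ** 2 for u in px)
    syy = sum((v - my) ** 2 for v in py)
    return sxy * sxy / (sxx * syy)

# Synthetic runs from L = 1.5 + 0.8 * X**-0.05 (true floor 1.5).
xs = [1e3, 1e4, 1e5, 1e6]
losses = [1.5 + 0.8 * x**-0.05 for x in xs]
floors = [1.5 * i / 1000 for i in range(1000)]  # all strictly below min(losses)
best_floor = max(floors, key=lambda f: r_squared(xs, losses, f))
# best_floor lands near the true floor of 1.5.
```

On real runs, a small floor error bends the fitted exponent, so bracket the floor rather than trusting a point estimate.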

7.3 Holdout validation

Main idea. Reserve some runs to test forecasts.

Core relation:

$L_\mathrm{actual}-L_\mathrm{pred}$


7.4 Residual analysis

Main idea. Curved residuals mean the law is missing a factor.

Core relation:

$\epsilon_i=L_i-\hat L_i$

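Curvature shows up as a systematic sign pattern in the residuals: positive at both ends, negative in the middle (or vice versa). A sketch on synthetic data with a hidden second term; all coefficients are invented:

```python
# If a hidden second term is present, residuals against a single power-law
# fit bow systematically: positive at the ends, negative in the middle.
import math

def line_residuals(xs, ys):
    """Residuals of an ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Synthetic data: L = 1.5 + 0.8*X**-0.05 + 50*X**-0.5 (coefficients invented),
# fitted as if it were a single power law with the correct floor.
xs = [10.0**k for k in range(3, 9)]
losses = [1.5 + 0.8 * x**-0.05 + 50.0 * x**-0.5 for x in xs]
res = line_residuals([math.log(x) for x in xs],
                     [math.log(l - 1.5) for l in losses])
# The +, -, ..., + bow indicates curvature the simple law cannot absorb.
```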

7.5 Extrapolation risk

Main idea. Small-run fits can fail when the regime changes.

Core relation:

$X_\mathrm{test}\gg X_\mathrm{fit}$


8. Data Quality and Mixtures

This part studies data quality and mixtures as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Mixture weights | training data is a weighted mixture of sources | $p_\mathrm{train}=\sum_k w_k p_k$ |
| Effective tokens | quality, deduplication, and domain match alter token value | $D_\mathrm{eff}=\sum_k q_k w_k D$ |
| Contamination | validation leakage makes scaling look better than it is | $D_\mathrm{eval}\cap D_\mathrm{train}\ne\emptyset$ |
| Curriculum and filtering | data order and filtering can shift constants | $A$ changes even if $\alpha$ is similar |
| Evaluation distribution | optimize for the distribution that matters | $L_\mathrm{eval}$, not only $L_\mathrm{web}$ |

8.1 Mixture weights

Main idea. Training data is a weighted mixture of sources.

Core relation:

$p_\mathrm{train}=\sum_k w_k p_k$


8.2 Effective tokens

Main idea. Quality, deduplication, and domain match alter token value.

Core relation:

$D_\mathrm{eff}=\sum_k q_k w_k D$


AI connection. Data quantity without data value is a weak planning variable.

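The effective-token formula is a one-line computation once each source has a weight and a quality score. Both the weights and the quality factors below are invented for illustration; in practice the $q_k$ are the hard part to estimate:

```python
# Effective tokens: weight each source's share of raw tokens by a quality
# factor q_k. Weights and quality scores are invented for illustration.
sources = {
    #           weight w_k, quality q_k
    "web":        (0.70,      0.6),
    "code":       (0.15,      1.0),
    "books":      (0.10,      0.9),
    "reference":  (0.05,      1.0),
}
D_total = 2e12  # raw training tokens

D_eff = sum(w * q * D_total for w, q in sources.values())
# Here D_eff = 0.71 * D_total: the mixture behaves like 29% fewer tokens.
```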

8.3 Contamination

Main idea. Validation leakage makes scaling look better than it is.

Core relation:

$D_\mathrm{eval}\cap D_\mathrm{train}\ne\emptyset$


8.4 Curriculum and filtering

Main idea. Data order and filtering can shift constants.

Core relation:

$A$ changes even if $\alpha$ is similar


8.5 Evaluation distribution

Main idea. Optimize for the distribution that matters.

Core relation:

$L_\mathrm{eval}$, not only $L_\mathrm{web}$


9. Inference-Aware Scaling

This part studies inference-aware scaling as a forecasting tool. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Serving cost | a model trained once may be served many times | $C_\mathrm{serve}\propto N\cdot T_\mathrm{generated}\cdot Q$ |
| Latency constraint | larger models can be better but too slow | $T_\mathrm{latency}\le T_\mathrm{SLA}$ |
| Small overtrained models | more training tokens can improve a smaller, cheaper model | $D/N$ can exceed pretraining-optimal planning ratios |
| Distillation | teacher quality can transfer into a smaller student | $p_s \approx p_t$ |
| Total cost objective | choose models by train plus serve plus quality | $J=\mathrm{loss}+\lambda\,\mathrm{cost}$ |

9.1 Serving cost

Main idea. A model trained once may be served many times.

Core relation:

$C_\mathrm{serve}\propto N\cdot T_\mathrm{generated}\cdot Q$


AI connection. The best training run is not always the cheapest product model.

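A rough lifetime-compute comparison makes this concrete: training costs about $6ND$ FLOPs once, while each generated token costs about $2N$ FLOPs at inference. The query volume and lengths below are hypothetical planning inputs, not measurements:

```python
# Rough lifetime compute: training ~6*N*D FLOPs paid once; inference ~2*N
# FLOPs per generated token, paid on every request. Volumes are hypothetical.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def serving_flops(n_params: float, tokens_per_query: float,
                  n_queries: float) -> float:
    return 2.0 * n_params * tokens_per_query * n_queries

train = training_flops(7e9, 2e12)      # ~8.4e22 FLOPs, paid once
serve = serving_flops(7e9, 500, 1e11)  # ~7.0e23 FLOPs over the fleet
# At this (hypothetical) volume, serving dominates training: a smaller model
# trained longer attacks the bigger term.
```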

9.2 Latency constraint

Main idea. Larger models can be better but too slow.

Core relation:

$T_\mathrm{latency}\le T_\mathrm{SLA}$


9.3 Small overtrained models

Main idea. More training tokens can improve a smaller, cheaper model.

Core relation:

$D/N$ can exceed pretraining-optimal planning ratios


9.4 Distillation

Main idea. Teacher quality can transfer into a smaller student.

Core relation:

$p_s \approx p_t$


9.5 Total cost objective

Main idea. Choose models by train plus serve plus quality.

Core relation:

$J=\mathrm{loss}+\lambda\,\mathrm{cost}$


10. Limits and Debugging

This part studies the limits of scaling laws and how to debug them as forecasting tools. The formulas are useful only when you keep the measurement setup fixed and validate predictions against real runs.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Not a theorem of all AI | Scaling laws are fitted empirical regularities | valid only in regime |
| Metric artifacts | Thresholded task scores can look sudden even when loss changes smoothly | \mathbf{1}\{s>\tau\} |
| Architecture changes | New attention, MoE, or tokenizer choices can change constants | A, \alpha, L_\infty shift |
| Bad small runs | Bugs in small runs poison the forecast | \hat L inherits measurement error |
| Decision checklist | Use scaling laws as planning instruments with uncertainty | \hat L \pm \mathrm{interval} |

10.1 Not a theorem of all AI

Main idea. Scaling laws are fitted empirical regularities.

Core relation:

valid only in regime
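One lightweight way to enforce this is a regime guard that refuses forecasts far outside the fitted compute range. The 10x log-scale margin below is an arbitrary choice for illustration, not a standard:

```python
# Guard against extrapolating a fitted law far beyond the regime it was fit on.
# The margin of 10x is a hypothetical policy choice, not a published rule.

def in_fitted_regime(C, C_fit_min, C_fit_max, margin=10.0):
    """True if C is within `margin`x of the fitted compute range."""
    return C_fit_min / margin <= C <= C_fit_max * margin

C_min, C_max = 1e18, 1e20   # range covered by the small fitting runs
print(in_fitted_regime(5e20, C_min, C_max))   # modest extrapolation -> True
print(in_fitted_regime(1e24, C_min, C_max))   # far outside the regime -> False
```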


10.2 Metric artifacts

Main idea. Thresholded task scores can look sudden even when loss changes smoothly.

Core relation:

\mathbf{1}\{s>\tau\}
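A small synthetic demonstration: per-example loss falls smoothly under a power law, but a thresholded pass/fail score derived from it is a step. All constants are toy values:

```python
# Smooth loss, abrupt metric: thresholding 1{s > tau} hides gradual progress.
import numpy as np

C = np.logspace(6, 14, 50)
loss = 0.05 + 2.0 * C ** -0.1            # smooth power-law loss
s = np.exp(-10 * loss)                    # toy 10-token answer: all tokens correct
task_score = (s > 0.1).astype(float)      # thresholded pass/fail metric

print(np.all(np.diff(loss) < 0))          # True: loss improves smoothly everywhere
print(sorted(set(task_score.tolist())))   # [0.0, 1.0]: the metric is a step
```

The "emergent" jump in `task_score` is entirely an artifact of the indicator function; the underlying quantity `s` improved continuously the whole time.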


10.3 Architecture changes

Main idea. New attention mechanisms, MoE routing, or tokenizer choices can change the fitted constants.

Core relation:

A, \alpha, L_\infty shift
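A toy illustration: the same functional form fits two synthetic "families", but the recovered constants differ, so coefficients fitted on one family do not transfer to the other:

```python
# Two synthetic model families share the power-law form but not the constants.
# All coefficients are invented for illustration.
import numpy as np

def fit_power_law(C, L, L_inf):
    """Return (A, alpha) from a log-log linear fit with a known floor."""
    slope, intercept = np.polyfit(np.log(C), np.log(L - L_inf), 1)
    return np.exp(intercept), -slope

C = np.logspace(18, 22, 8)
L_dense = 1.50 + 0.80 * C ** -0.050        # "dense" family, toy constants
L_moe   = 1.45 + 0.60 * C ** -0.055        # "MoE" family, different toy constants

print(fit_power_law(C, L_dense, 1.50))     # recovers (0.80, 0.050)
print(fit_power_law(C, L_moe, 1.45))       # recovers (0.60, 0.055)
```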


10.4 Bad small runs

Main idea. Bugs in small runs poison the forecast.

Core relation:

\hat L inherits measurement error
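A toy illustration of the failure mode: corrupting a single small run's measured loss visibly biases the fitted exponent, and hence every forecast built on it:

```python
# One buggy small run distorts the whole fit. Toy constants throughout.
import numpy as np

C = np.logspace(18, 22, 8)
L_clean = 1.5 + 0.8 * C ** -0.05

def fit_alpha(C, L, L_inf=1.5):
    """Fitted exponent from a log-log linear fit with a known floor."""
    slope, _ = np.polyfit(np.log(C), np.log(L - L_inf), 1)
    return -slope

L_buggy = L_clean.copy()
L_buggy[0] *= 1.05                       # a 5% loss error in the smallest run

print(round(fit_alpha(C, L_clean), 3))   # 0.05: recovers the true exponent
print(round(fit_alpha(C, L_buggy), 3))   # noticeably biased by one bad point
```

Small runs are cheap, so they are where bugs hide; but they also anchor one end of the fit, so their errors get amplified at the extrapolation end.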


10.5 Decision checklist

Main idea. Use scaling laws as planning instruments with uncertainty.

Core relation:

\hat L \pm \mathrm{interval}
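One simple way to attach an interval, sketched on toy data: bootstrap the fitted runs and report a spread of predictions at the target compute rather than a point estimate:

```python
# Bootstrap a forecast interval: resample fitting runs, refit, and collect
# predictions at the target compute. Toy data; numpy only.
import numpy as np

rng = np.random.default_rng(0)
C = np.logspace(18, 21, 10)
L = 1.5 + 0.8 * C ** -0.05 * np.exp(rng.normal(0, 0.02, C.size))

def predict(C_fit, L_fit, C_target, L_inf=1.5):
    """Fit the power law on (C_fit, L_fit) and predict loss at C_target."""
    slope, intercept = np.polyfit(np.log(C_fit), np.log(L_fit - L_inf), 1)
    return L_inf + np.exp(intercept + slope * np.log(C_target))

preds = []
for _ in range(500):
    idx = rng.integers(0, C.size, C.size)     # resample runs with replacement
    preds.append(predict(C[idx], L[idx], C_target=1e22))

lo, hi = np.percentile(preds, [5, 95])
print(round(lo, 4), round(hi, 4))             # report a range, not a point
```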



Practice Exercises

  1. Fit a one-resource power law after subtracting a known loss floor.
  2. Convert parameters and tokens into approximate dense training FLOPs.
  3. Compute token count from a fixed compute budget and model size.
  4. Search an IsoFLOP curve for the lowest toy loss.
  5. Diagnose whether a model is undertrained by checking tokens per parameter.
  6. Compute effective tokens for a data mixture.
  7. Estimate serving cost for two model choices.
  8. Compute residuals for scaling-law forecasts.
  9. Show how a thresholded metric can look abrupt even when loss is smooth.
  10. Write a forecast checklist with uncertainty.
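For exercises 2 and 3, the C\approx 6ND approximation turns into two one-line helpers. A sketch (the 7B/2T numbers below are arbitrary examples):

```python
# Dense-transformer compute accounting via C ~ 6 * N * D.

def train_flops(N, D):
    """Approximate dense training compute for N parameters and D tokens."""
    return 6 * N * D

def tokens_for_budget(C, N):
    """Tokens affordable at compute budget C with an N-parameter model."""
    return C / (6 * N)

N = 7e9                                   # 7B parameters (example)
D = 2e12                                  # 2T tokens (example)
print(f"{train_flops(N, D):.2e}")         # 8.40e+22 FLOPs
print(f"{tokens_for_budget(1e23, N):.2e} tokens at a 1e23 FLOP budget")
```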

Why This Matters for AI

Scaling laws turn expensive guessing into disciplined planning. They help decide whether to collect more data, train longer, use a smaller model, improve data quality, or avoid a run whose expected gain is too small. They also teach humility: a clean line on a log-log plot is useful, but it is still a fit to a regime, not a theorem about intelligence.

Bridge to Efficient Attention and Inference

The next section studies how attention and inference systems change the cost side of the scaling equation. Scaling laws say what loss might be worth buying; efficient attention and serving math decide whether that loss can be bought and served under memory, latency, and throughput constraints.
