Data Mixture Optimization

Data Mixture Optimization: Part 1: Intuition to 3. Baseline Mixtures

1. Intuition

This section gives the conceptual and mathematical intuition for data mixture optimization. Read its variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

1.1 Mixture weights determine model skill profile

This subsection covers how mixture weights determine a model's skill profile, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For mixture weights, the invariant should be explicit enough that a checker can fail fast. If the invariant exists only in a notebook comment or an engineer's memory, it will not protect a long-running data build.
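A minimal sketch of such a fail-fast checker follows. It assumes that $\mathcal{S}$ can be read as a schema mapping required field names to types; the field names (`id`, `domain`, `text`, `n_tokens`) and the function name `check_record` are hypothetical and not taken from the lesson.

```python
from typing import Any, Mapping

# Hypothetical schema S: required field name -> expected Python type.
SCHEMA: Mapping[str, type] = {"id": str, "domain": str, "text": str, "n_tokens": int}


def check_record(record: Mapping[str, Any], schema: Mapping[str, type] = SCHEMA) -> None:
    """Raise ValueError on the first violated requirement (fail fast)."""
    for field, expected in schema.items():
        if field not in record:
            raise ValueError(f"missing field {field!r}")
        if not isinstance(record[field], expected):
            raise ValueError(f"field {field!r} is not of type {expected.__name__}")


# A record that satisfies the schema passes silently.
check_record({"id": "r1", "domain": "web", "text": "hello", "n_tokens": 1})
```

Raising on the first violation, rather than collecting warnings, is one way to keep an invalid record from reaching the next pipeline stage.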

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

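where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.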
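The two acceptance rates above can be computed side by side. The sketch below uses a synthetic corpus and a hypothetical length filter, and it treats $f(r_i)$ as a 0/1 keep flag, which is an assumption about the notation.

```python
def acceptance_rates(token_counts, keep):
    """Return (a, a_tok) for per-record token counts T_i and keep flags f(r_i)."""
    if len(token_counts) != len(keep):
        raise ValueError("token_counts and keep must have the same length")
    n_in = len(token_counts)
    total_tokens = sum(token_counts)
    if n_in == 0 or total_tokens == 0:
        raise ValueError("need at least one record and one token")
    n_out = sum(1 for k in keep if k)
    kept_tokens = sum(t for t, k in zip(token_counts, keep) if k)
    return n_out / n_in, kept_tokens / total_tokens


# Synthetic corpus: 95 short documents (70 tokens) and 5 long ones (570 tokens).
token_counts = [70] * 95 + [570] * 5
# Hypothetical length filter that drops documents longer than 500 tokens.
keep = [t <= 500 for t in token_counts]
a, a_tok = acceptance_rates(token_counts, keep)
print(f"a = {a:.2f}, a_tok = {a_tok:.2f}")  # a = 0.95, a_tok = 0.70
```

On this synthetic corpus the filter removes 5 percent of documents but 30 percent of tokens, which is the gap the paragraph above describes.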

1.2 Not all tokens have equal value

This subsection covers why not all tokens have equal value, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For weights on the simplex, the invariant should be explicit enough that a checker can fail fast. If the invariant exists only in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.
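To make the first non-example concrete, here is a small sketch of what a manifest could record so that a set of shards can be verified later. The manifest layout, the shard names, and the functions `build_manifest` and `verify` are hypothetical; in-memory byte strings stand in for files on disk.

```python
import hashlib
import json


def build_manifest(shards: dict, version: str) -> dict:
    """Describe each shard by size and SHA-256 so a later run can verify it."""
    entries = [
        {"name": name, "n_bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}
        for name, data in sorted(shards.items())
    ]
    return {"version": version, "shards": entries}


def verify(shards: dict, manifest: dict) -> bool:
    """True only if the shard set matches the manifest exactly."""
    return build_manifest(shards, manifest["version"]) == manifest


# Synthetic in-memory shards stand in for files on disk.
shards = {"shard-00000.jsonl": b'{"id": "a"}\n', "shard-00001.jsonl": b'{"id": "b"}\n'}
manifest = build_manifest(shards, version="v1")
print(json.dumps(manifest, indent=2))
```

Any added, missing, or altered shard changes the rebuilt manifest, so `verify` returns `False` instead of letting a silently different dataset through.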

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

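A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.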
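The drift signal mentioned above can be turned into a simple check. The sketch below compares the current acceptance rate with a stored baseline; the function name and the default tolerance of 0.05 are hypothetical choices, not values from the lesson.

```python
def acceptance_drift(n_in: int, n_out: int, baseline: float, tol: float = 0.05) -> bool:
    """Return True when the acceptance rate moves more than `tol` from `baseline`."""
    if n_in <= 0 or not 0 <= n_out <= n_in:
        raise ValueError("expected 0 <= n_out <= n_in with n_in > 0")
    return abs(n_out / n_in - baseline) > tol


print(acceptance_drift(1000, 900, baseline=0.90))  # False
print(acceptance_drift(1000, 700, baseline=0.90))  # True
```

The same check could be applied to $a_{\mathrm{tok}}$ by passing token totals instead of record counts.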

1.3 Mixture as constrained optimization

This subsection covers the mixture as a constrained optimization problem, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For domain assignments, the invariant should be explicit enough that a checker can fail fast. If the invariant exists only in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

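A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.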
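One common way to write down the idea in this subsection's title is given below. It is stated as an assumption rather than as the lesson's own formulation. Let $\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_K)$ be the domain weights, $\theta^\star(\boldsymbol{\alpha})$ the model obtained by training on data sampled with those weights, and $L_{\mathrm{val}}$ a validation loss. The mixture problem is then

\min_{\boldsymbol{\alpha}} \; L_{\mathrm{val}}\big(\theta^\star(\boldsymbol{\alpha})\big) \quad \text{subject to} \quad \alpha_k \ge 0 \ \text{for all } k, \quad \sum_{k=1}^{K} \alpha_k = 1.

The two constraints say that $\boldsymbol{\alpha}$ lies on the probability simplex $\Delta^{K-1}$. Each evaluation of the objective requires a training run, which is what makes the problem expensive in practice.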

1.4 Proxy models

This subsection covers proxy models, which are part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For proxy-model runs, the invariant should be explicit enough that a checker can fail fast. If the invariant exists only in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

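A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.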

1.5 DataComp, DoReMi, and data mixing laws context

This subsection covers the context set by DataComp, DoReMi, and data mixing laws, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For DRO-based reweighting, the invariant should be explicit enough that a checker can fail fast. If the invariant exists only in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

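A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.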
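DRO-style methods such as DoReMi adjust domain weights with a multiplicative update driven by how far a small proxy model lags a reference model on each domain. The sketch below is a simplified illustration of that kind of update, not a reproduction of the published algorithm. The function name `reweight`, the step size `eta`, the `smoothing` constant, and the loss values are all hypothetical.

```python
import math


def reweight(alpha, excess_loss, eta=1.0, smoothing=1e-3):
    """One exponentiated-gradient step on domain weights, then uniform smoothing."""
    if len(alpha) != len(excess_loss):
        raise ValueError("alpha and excess_loss must have the same length")
    # Upweight domains where the proxy model trails the reference model;
    # negative excess loss is clipped to zero.
    raw = [a * math.exp(eta * max(l, 0.0)) for a, l in zip(alpha, excess_loss)]
    total = sum(raw)
    k = len(alpha)
    # Mix with the uniform distribution so no domain weight collapses to zero.
    return [(1 - smoothing) * r / total + smoothing / k for r in raw]


alpha = [0.5, 0.3, 0.2]
# Hypothetical per-domain excess losses (proxy loss minus reference loss).
new_alpha = reweight(alpha, excess_loss=[0.0, 0.4, 0.1])
print([round(a, 3) for a in new_alpha])
```

The updated weights still sum to 1, and the domain with the largest excess loss gains share at the expense of the domain with none.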

2. Formal Definitions

This section gives the formal definitions used in data mixture optimization. Read its variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

2.1 Domains $\mathcal{D}_1,\ldots,\mathcal{D}_K$

This subsection covers the domains $\mathcal{D}_1,\ldots,\mathcal{D}_K$, which are part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

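A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.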
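As an illustration of how the domains $\mathcal{D}_1,\ldots,\mathcal{D}_K$ might be derived from record-level metadata, the sketch below partitions synthetic records by a `domain` key and reports document and token counts per domain. The `domain` and `n_tokens` field names are hypothetical.

```python
from collections import defaultdict

# Synthetic records r_i; the `domain` metadata key and `n_tokens` field are hypothetical.
records = [
    {"id": "r1", "domain": "web", "n_tokens": 120},
    {"id": "r2", "domain": "web", "n_tokens": 80},
    {"id": "r3", "domain": "code", "n_tokens": 400},
    {"id": "r4", "domain": "books", "n_tokens": 1000},
]


def domain_stats(records):
    """Partition records by domain and return {domain: (n_docs, n_tokens)}."""
    docs = defaultdict(int)
    tokens = defaultdict(int)
    for r in records:
        docs[r["domain"]] += 1
        tokens[r["domain"]] += r["n_tokens"]
    return {d: (docs[d], tokens[d]) for d in docs}


stats = domain_stats(records)
print(stats)  # {'web': (2, 200), 'code': (1, 400), 'books': (1, 1000)}
```

Because every record carries exactly one domain label here, the per-domain document counts add up to $n$.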

2.2 Mixture vector $\boldsymbol{\alpha}\in\Delta^{K-1}$

This subsection covers the mixture vector $\boldsymbol{\alpha}\in\Delta^{K-1}$, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

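A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.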
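Membership in $\Delta^{K-1}$ means that every weight is non-negative and the weights sum to 1. The sketch below shows one way to check that condition and to rescale raw weights onto the simplex; the function names and the tolerance are hypothetical.

```python
import math


def on_simplex(alpha, tol=1e-9):
    """True if every weight is non-negative and the weights sum to 1 within tol."""
    return all(a >= 0 for a in alpha) and math.isclose(sum(alpha), 1.0, abs_tol=tol)


def normalize(weights):
    """Rescale non-negative raw weights so they lie on the simplex."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


alpha = normalize([3, 1, 1])
print(alpha, on_simplex(alpha))  # [0.6, 0.2, 0.2] True
```

A check like `on_simplex` is a natural candidate for the fail-fast invariant on a mixture configuration.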

2.3 Sampling distribution

This subsection covers the sampling distribution, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

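A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.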
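A mixture vector induces a sampling distribution over domains: domain $k$ is drawn with probability $\alpha_k$. The sketch below draws domain labels this way and compares empirical frequencies with the target weights. The domain names, weights, and function name are hypothetical, and sampling within a domain is left out.

```python
import random
from collections import Counter


def sample_domains(alpha: dict, n: int, seed: int = 0) -> list:
    """Draw n domain labels i.i.d. with probabilities alpha[domain]."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    names = list(alpha)
    return rng.choices(names, weights=[alpha[d] for d in names], k=n)


alpha = {"web": 0.6, "code": 0.3, "books": 0.1}
draws = sample_domains(alpha, n=10_000)
empirical = {d: c / len(draws) for d, c in Counter(draws).items()}
print(empirical)
```

Comparing `empirical` with `alpha` is a direct test of whether the sampler preserves the intended distribution.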

2.4 Validation objective

This subsection covers the validation objective, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

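A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.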

2.5 Token budget constraint

This subsection covers the token budget constraint, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

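A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.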
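One way to read a token budget constraint is that a total budget $B$ is split across domains as $B_k = \alpha_k B$, so that domain $k$ is repeated $B_k / N_k$ times when it holds $N_k$ tokens. This reading, the symbols $B$, $B_k$, $N_k$, and the numbers below are assumptions made for illustration.

```python
def allocate_budget(alpha: dict, available: dict, budget: int) -> dict:
    """Return {domain: (tokens_to_draw, epochs)} for a total token budget."""
    out = {}
    for domain, weight in alpha.items():
        tokens = weight * budget
        out[domain] = (tokens, tokens / available[domain])
    return out


alpha = {"web": 0.5, "code": 0.25, "books": 0.25}
available = {"web": 800, "code": 100, "books": 50}  # synthetic token counts
plan = allocate_budget(alpha, available, budget=400)
for domain, (tokens, epochs) in plan.items():
    print(f"{domain}: {tokens:.0f} tokens, {epochs:.2f} epochs")
# web: 200 tokens, 0.25 epochs
# code: 100 tokens, 1.00 epochs
# books: 100 tokens, 2.00 epochs
```

In this toy plan the smallest domain is seen twice while the largest is seen only a quarter of the way through, which shows how a fixed mixture and a fixed budget together determine repetition.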

3. Baseline Mixtures

This section covers baseline mixtures for data mixture optimization. Read its variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

3.1 Uniform by document

This subsection covers the uniform-by-document baseline, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

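A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.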

3.2 Uniform by token

This subsection covers the uniform-by-token baseline, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

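A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.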
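Sampling every document with equal probability gives each domain a share proportional to its document count, while sampling every token with equal probability gives a share proportional to its token count. The sketch below contrasts the two on a synthetic two-domain corpus; the domain names and sizes are hypothetical.

```python
def shares(counts: dict) -> dict:
    """Normalize per-domain counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


# Synthetic corpus: many short web pages, few long books.
n_docs = {"web": 900, "books": 100}
n_tokens = {"web": 900 * 100, "books": 100 * 2_100}

print(shares(n_docs))    # {'web': 0.9, 'books': 0.1}
print(shares(n_tokens))  # {'web': 0.3, 'books': 0.7}
```

The same corpus is 90 percent web by document and 70 percent books by token, so the choice of baseline changes the mixture substantially.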

3.3 Source-proportional

This subsection covers the source-proportional baseline, which is part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

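A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.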

3.4 Hand-tuned domain weights

This subsection covers hand-tuned domain weights, which are part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

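A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.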

3.5 Temperature-smoothed mixtures

This subsection covers temperature-smoothed mixtures, which are part of the canonical scope of data mixture optimization. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether each pipeline transformation preserves the intended empirical distribution.

A useful local invariant is:

\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

Examples:

A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

A path on disk without a manifest is not a reproducible dataset.
A metric dashboard without record-level lineage is not a provenance system.
A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

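A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

Counting tokens instead of records gives the token acceptance rate

a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where $f(r_i)$ is read here as the filter's keep indicator (1 if $r_i$ is retained, 0 otherwise) and $T_i$ is the token count or a deterministic token-count estimate. The distinction between $a$ and $a_{\mathrm{tok}}$ matters for compute budgets, mixture proportions, and scaling-law interpretation.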
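A common form of temperature smoothing sets $\alpha_k \propto N_k^{1/\tau}$, where $N_k$ is the size of domain $k$ and $\tau > 0$ is the temperature. Under this convention, which is assumed here rather than taken from the lesson, $\tau = 1$ recovers size-proportional weights and large $\tau$ pushes the mixture toward uniform. The domain names and sizes are synthetic.

```python
def temperature_mix(sizes: dict, temperature: float) -> dict:
    """Return weights proportional to size ** (1 / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {d: s ** (1.0 / temperature) for d, s in sizes.items()}
    total = sum(scaled.values())
    return {d: v / total for d, v in scaled.items()}


sizes = {"web": 8100, "code": 1600, "books": 100}  # synthetic token counts
for t in (1.0, 2.0, 100.0):
    mix = temperature_mix(sizes, t)
    print(t, {d: round(w, 3) for d, w in mix.items()})
```

Raising the temperature increases the share of the smallest domain, at the cost of repeating its data more often under a fixed token budget.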
