Contamination and Dedup AuditsMath for LLMs

Contamination and Dedup Audits

LLM Training Data Pipeline

Private notes
0/8000

Notes stay private to your browser until account sync is configured.

Contamination and Dedup Audits
25 min read18 headingsSplit lesson page

Lesson overview | Lesson overview | Next part

Contamination and Dedup Audits: Part 1: Intuition to 3. Exact Deduplication

1. Intuition

Intuition gives the conceptual and mathematical layer for contamination and dedup audits. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

1.1 Duplicates waste compute

Duplicates waste compute is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For duplicate, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.2 Near-duplicates increase memorization

Near-duplicates increase memorization is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For near duplicate, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.3 Benchmark leakage inflates evaluation

Benchmark leakage inflates evaluation is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For shingle, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.4 Dedup vs diversity

Dedup vs diversity is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For MinHash, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.5 Audit trail as scientific evidence

Audit trail as scientific evidence is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For Jaccard, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2. Formal Definitions

Formal Definitions gives the conceptual and mathematical layer for contamination and dedup audits. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

2.1 Exact duplicate

Exact duplicate is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For duplicate, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.2 Near duplicate

Near duplicate is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For near duplicate, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.3 nn-gram overlap

nn-gram overlap is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For shingle, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.4 Jaccard similarity

Jaccard similarity is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For MinHash, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.5 Contamination relation between train and eval sets

Contamination relation between train and eval sets is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For Jaccard, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3. Exact Deduplication

Exact Deduplication gives the conceptual and mathematical layer for contamination and dedup audits. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

3.1 Canonicalization

Canonicalization is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For duplicate, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.2 Hashing

Hashing is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For near duplicate, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.3 Document-level dedup

Document-level dedup is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For shingle, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.4 Substring dedup

Substring dedup is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record-level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For MinHash, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

3.5 Dedup reports

Dedup reports is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection D={ri}i=1n\mathcal{D} = \{r_i\}_{i=1}^n with record- level metadata mim_i and text or token content xix_i. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

valid(ri,S)=1ri can be consumed by the next pipeline stage.\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}

For Jaccard, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

atok=if(ri)TiiTi,a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},

where TiT_i is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

PreviousNext