Contamination and Dedup Audits

"A benchmark score is only evidence if the benchmark was not in the training data."

Overview

Deduplication and contamination audits reduce wasted compute, memorization, privacy risk, and evaluation leakage. In an LLM training run, data is not an inert pile of text; it is the empirical distribution that defines the examples, losses, risks, and capabilities the model will see.

This section is written as LaTeX Markdown. Inline mathematics uses `$...$`, and display equations use `$$...$$`. The goal is to connect data engineering decisions to mathematical objects such as records $r_i$, token sequences $x_{1:T}$, filters $f(x)$, hashes $h(x)$, mixture weights $\boldsymbol{\alpha}$, and empirical expectations.

The scope is deliberately narrow: this chapter owns the training-data pipeline. Tokenizer design, GPU training systems, benchmark methodology, alignment objectives, and production MLOps each have their own canonical chapters. Here we study the data objects that those later systems consume.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Executable demonstrations for contamination and dedup audits |
| exercises.ipynb | Graded practice for contamination and dedup audits |

Learning Objectives

After completing this section, you will be able to:

  • Define exact duplicates, near duplicates, shingles, Jaccard similarity, and contamination
  • Implement canonicalization and hash-based exact deduplication
  • Implement shingling, MinHash, and LSH-style candidate retrieval
  • Audit train/eval overlap through exact and approximate matching
  • Explain prompt-only, answer-only, and full-example contamination
  • Connect duplicates to memorization and extractable training data
  • Design redaction logs for PII and benchmark-removal decisions
  • Interpret dedup reports without destroying useful diversity

Table of Contents


1. Intuition

This section builds the intuition behind contamination and dedup audits. Read the local variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

1.1 Duplicates waste compute

Duplicates waste compute is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.
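A fail-fast checker for the invariant can be sketched as follows. This is a minimal illustration, not the pipeline's actual validator: the schema `REQUIRED_FIELDS`, the function names, and the rule that empty text is invalid are all assumptions made for the example.

```python
# Hypothetical schema: field name -> required Python type (an assumption for this sketch).
REQUIRED_FIELDS = {"id": str, "text": str, "source": str}

def valid(record: dict, schema: dict = REQUIRED_FIELDS) -> bool:
    """Return True when the record has every required field with the right type
    and carries non-empty text."""
    has_fields = all(isinstance(record.get(k), t) for k, t in schema.items())
    return has_fields and bool(record["text"].strip())

def check_stage_input(records, schema: dict = REQUIRED_FIELDS) -> int:
    """Raise on the first invalid record instead of silently passing it downstream."""
    for i, record in enumerate(records):
        if not valid(record, schema):
            raise ValueError(f"record {i} violates the stage invariant: {record!r}")
    return len(records)

good = [{"id": "r1", "text": "hello world", "source": "web"}]
bad = [{"id": "r2", "text": "   ", "source": "web"}]

print(check_stage_input(good))
try:
    check_stage_input(bad)
except ValueError as err:
    print("stopped early:", err)
```

Raising on the first violation keeps a long-running build from producing shards that only fail much later, at consumption time.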

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
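The two rates can be computed side by side. The sketch below uses made-up records of the form `(record_id, token_count, kept)`, where `kept` plays the role of a 0/1 filter decision; the data and the function name are illustrative assumptions.

```python
# Hypothetical records: (record_id, token_count, kept_by_filter).
records = [
    ("r1", 100, True),
    ("r2", 200, True),
    ("r3", 5000, False),  # a single long document is dropped
    ("r4", 150, True),
]

def acceptance_rates(records):
    """Return (document acceptance rate, token acceptance rate)."""
    n_in = len(records)
    n_out = sum(1 for _, _, kept in records if kept)
    tokens_in = sum(tokens for _, tokens, _ in records)
    tokens_out = sum(tokens for _, tokens, kept in records if kept)
    return n_out / n_in, tokens_out / tokens_in

a, a_tok = acceptance_rates(records)
print(f"a = {a:.3f}, a_tok = {a_tok:.3f}")
```

Here the filter keeps three of four documents but only 450 of 5450 tokens, which is the kind of gap the token-weighted view is meant to expose.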

1.2 Near-duplicates increase memorization

Near-duplicates increase memorization is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For near duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.3 Benchmark leakage inflates evaluation

Benchmark leakage inflates evaluation is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For shingles, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

1.4 Dedup vs diversity

Dedup vs diversity is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For MinHash, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
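One way to check whether dedup is eroding diversity is to compute the token acceptance rate per audit slice rather than globally. The sketch below assumes each record carries a `source` label; the sources, token counts, and function name are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: (source, token_count, kept_after_dedup).
records = [
    ("web", 1000, True), ("web", 1000, False), ("web", 800, False),
    ("code", 400, True), ("code", 300, True),
    ("books", 9000, True),
]

def token_acceptance_by_slice(records):
    """Token acceptance rate computed separately for each source slice."""
    kept = defaultdict(int)
    total = defaultdict(int)
    for source, tokens, is_kept in records:
        total[source] += tokens
        if is_kept:
            kept[source] += tokens
    return {source: kept[source] / total[source] for source in total}

rates = token_acceptance_by_slice(records)
for source, rate in sorted(rates.items()):
    print(f"{source}: a_tok = {rate:.2f}")
```

In this toy data the global token rate looks healthy because one large source is untouched, while the `web` slice loses most of its tokens. A per-slice report makes that visible before the mixture shifts unnoticed.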

1.5 Audit trail as scientific evidence

Audit trail as scientific evidence is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For Jaccard, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
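An audit trail can start as one small, serializable record per stage. The sketch below is an assumption about what such a manifest entry might contain (the field names and the `manifest_entry` helper are hypothetical); it stores counts, the acceptance rate, and content digests so a later reader can verify what the stage actually saw.

```python
import hashlib
import json

def manifest_entry(stage: str, inputs: list, outputs: list) -> dict:
    """Summarize one pipeline stage as a JSON-serializable audit record."""
    def digest(texts):
        h = hashlib.sha256()
        for text in texts:
            data = text.encode("utf-8")
            # A length prefix keeps record boundaries unambiguous in the digest.
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
        return h.hexdigest()

    return {
        "stage": stage,
        "n_in": len(inputs),
        "n_out": len(outputs),
        "acceptance_rate": len(outputs) / len(inputs) if inputs else None,
        "input_sha256": digest(inputs),
        "output_sha256": digest(outputs),
    }

entry = manifest_entry("exact_dedup", ["a", "b", "a"], ["a", "b"])
print(json.dumps(entry, indent=2, sort_keys=True))
```

Because the digests are deterministic, rerunning the stage on the same inputs should reproduce the same entry, which is what turns a log line into checkable evidence.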

2. Formal Definitions

This section gives the formal definitions used in contamination and dedup audits. Read the local variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

2.1 Exact duplicate

Exact duplicate is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

2.2 Near duplicate

Near duplicate is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For near duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
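A common working definition of a near duplicate is a pair of documents whose shingle sets have Jaccard similarity above a threshold. The sketch below assumes word-level shingles of size `k = 3` and a threshold of `0.5`; both values and all function names are illustrative choices, not settings taken from this section.

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles (contiguous word k-grams) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined here as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(x: str, y: str, k: int = 3, threshold: float = 0.5) -> bool:
    return jaccard(shingles(x, k), shingles(y, k)) >= threshold

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
doc_c = "an entirely different sentence about data pipelines"

print(jaccard(shingles(doc_a), shingles(doc_b)))  # 6 shared shingles out of 8 -> 0.75
print(is_near_duplicate(doc_a, doc_c))            # no shared shingles -> False
```

The threshold trades recall against the risk of removing legitimately distinct documents, so it is worth validating on an audit sample before applying it at scale.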

2.3 $n$-gram overlap

$n$-gram overlap is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For shingles, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
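An $n$-gram overlap score can be sketched as the fraction of a query's distinct $n$-grams that also occur in a reference text. Unlike Jaccard similarity it is asymmetric, which suits the case of a short benchmark item embedded in a long training document. The whitespace tokenizer, the choice `n = 4`, and the function names below are assumptions for illustration.

```python
def ngrams(tokens: list, n: int) -> set:
    """Distinct contiguous n-grams of a token list (empty if the list is too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(query: str, reference: str, n: int = 4) -> float:
    """Fraction of the query's distinct n-grams that also occur in the reference."""
    q = ngrams(query.lower().split(), n)
    r = ngrams(reference.lower().split(), n)
    return len(q & r) / len(q) if q else 0.0

reference = "in 1905 einstein published four papers that changed physics forever"
query = "einstein published four papers that changed chemistry"

print(overlap_fraction(query, reference, n=4))  # 3 of the query's 4 four-grams match -> 0.75
```

A production audit would normally run on the same tokenization and canonicalization as the training data, since mismatched preprocessing is a common source of missed overlaps.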

2.4 Jaccard similarity

Jaccard similarity is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For MinHash, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
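MinHash estimates Jaccard similarity without comparing full shingle sets: for a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity. The sketch below is a small, unoptimized illustration. Using salted `hashlib.blake2b` as the family of hash functions, 128 hashes, and 3-word shingles are assumptions of this example.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles, each stored as a space-joined string."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def _hash(seed: int, item: str) -> int:
    """64-bit hash of item under the seed-th hash function (salted BLAKE2b)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """Minimum hash value per hash function; items must be non-empty."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

set_a = shingles("the quick brown fox jumps over the lazy dog")
set_b = shingles("the quick brown fox jumps over the lazy cat")
exact = len(set_a & set_b) / len(set_a | set_b)
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact = {exact:.2f}, minhash estimate = {estimate:.2f}")
```

The estimate's standard error shrinks roughly as one over the square root of the number of hashes, so signature length is a direct accuracy-versus-storage trade-off.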

2.5 Contamination relation between train and eval sets

Contamination relation between train and eval sets is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For Jaccard, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
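The prompt-only, answer-only, and full-example cases named in the learning objectives can be told apart with a simple exact-match audit. The sketch below assumes normalized substring matching, which only catches verbatim leaks; the labels, data, and function names are illustrative. Very short answers (for example a single digit) would match almost anything and need a minimum-length rule that this sketch omits.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not hide a match."""
    return " ".join(text.lower().split())

def contamination_type(train_doc: str, prompt: str, answer: str) -> str:
    doc = normalize(train_doc)
    has_prompt = normalize(prompt) in doc
    has_answer = normalize(answer) in doc
    if has_prompt and has_answer:
        return "full-example"
    if has_prompt:
        return "prompt-only"
    if has_answer:
        return "answer-only"
    return "clean"

SEVERITY = ("full-example", "prompt-only", "answer-only", "clean")

def audit(train_docs, eval_items) -> dict:
    """Map each eval item id to the most severe label found across all training documents."""
    report = {}
    for item_id, prompt, answer in eval_items:
        labels = {contamination_type(doc, prompt, answer) for doc in train_docs}
        report[item_id] = next((s for s in SEVERITY if s in labels), "clean")
    return report

train_docs = [
    "Q: What is the capital of France? A: The capital of France is Paris.",
    "Unrelated text about gardening.",
]
eval_items = [
    ("e1", "What is the capital of France?", "The capital of France is Paris."),
    ("e2", "Which city hosts the French government?", "The capital of France is Paris."),
    ("e3", "What is the capital of France?", "Paris is the French capital."),
    ("e4", "What is the boiling point of water at sea level?", "100 degrees Celsius"),
]
report = audit(train_docs, eval_items)
print(report)
```

Exact matching gives a lower bound on contamination; paraphrased or reformatted leaks need the approximate methods (shingles, Jaccard, MinHash) discussed in this chapter.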

3. Exact Deduplication

This section covers exact deduplication within contamination and dedup audits. Read the local variables as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

3.1 Canonicalization

Canonicalization is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
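Canonicalization maps superficially different strings to one representative before hashing. The sketch below shows one plausible rule set: Unicode NFKC normalization, case folding, and whitespace collapsing. These rules are assumptions for illustration. Case folding in particular may be too aggressive for source code or other case-sensitive text.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")

def canonicalize(text: str) -> str:
    """Map a text to a canonical form for exact-duplicate comparison."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = text.casefold()                      # case-insensitive comparison
    return _WHITESPACE.sub(" ", text).strip()   # collapse runs of whitespace

variants = ["  Hello\u00a0 World\n", "hello world", "HELLO\tWORLD"]
print({canonicalize(v) for v in variants})  # all three collapse to {'hello world'}
```

Whatever rules are chosen, they should be versioned along with the dedup output, because changing the canonical form changes which records count as duplicates.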

3.2 Hashing

Hashing is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For near duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
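A content hash $h(x)$ lets exact-duplicate detection compare short fixed-size keys instead of full texts. The sketch below uses SHA-256 from the standard library and adds a birthday-bound estimate of accidental collisions for a hash of a given width. The function names and the example corpus size are assumptions for illustration.

```python
import hashlib

def content_hash(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def expected_collisions(n: int, bits: int) -> float:
    """Expected number of colliding pairs among n uniformly random b-bit hashes."""
    return n * (n - 1) / 2 / 2 ** bits

print(content_hash("hello") == content_hash("hello"))   # identical text, identical key
print(content_hash("hello") == content_hash("hello "))  # one extra space changes the key

n_docs = 10 ** 10  # hypothetical corpus size
print(f"64-bit keys:  {expected_collisions(n_docs, 64):.2f} expected collisions")
print(f"256-bit keys: {expected_collisions(n_docs, 256):.2e} expected collisions")
```

The estimate suggests why truncating keys to 64 bits is risky at very large corpus sizes: a few unrelated documents may collide and one of them would be dropped as a false duplicate.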

3.3 Document-level dedup

Document-level dedup is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For shingles, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
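For document-level dedup specifically, a minimal sketch of hash-based exact deduplication might look as follows. The canonical form (NFKC, lowercasing, whitespace collapsing), the record layout, and the function names are assumptions for illustration, not a prescribed pipeline.

```python
import hashlib
import unicodedata

def canonicalize(text):
    """Assumed canonical form: NFKC normalization, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())

def exact_dedup(records):
    """Keep the first record seen for each canonical-text hash."""
    seen = set()
    kept = []
    for rec in records:
        digest = hashlib.sha256(canonicalize(rec["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept

# Hypothetical records; "a" and "b" differ only in case and spacing.
records = [
    {"id": "a", "text": "Hello   World"},
    {"id": "b", "text": "hello world"},
    {"id": "c", "text": "A different document"},
]
kept = exact_dedup(records)
a = len(kept) / len(records)
print([r["id"] for r in kept], round(a, 3))  # ['a', 'c'] 0.667
```

Keeping the first occurrence is one possible tie-break; whichever rule is chosen should be deterministic so that reruns produce the same manifest.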

3.4 Substring dedup

Substring dedup is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For MinHash, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
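Substring dedup looks for repeated spans inside otherwise different documents. Large-scale systems typically rely on suffix arrays or similar indexes; the sketch below swaps in a much simpler technique, a hash set of fixed-length token windows, only to show the idea. The function name `repeated_spans`, the window length, and the sample documents are assumptions.

```python
def repeated_spans(docs, n=5):
    """Return (doc_index, start) for each n-token window that was already seen earlier."""
    seen = set()
    hits = []
    for i, doc in enumerate(docs):
        tokens = doc.split()
        for start in range(len(tokens) - n + 1):
            window = tuple(tokens[start:start + n])
            if window in seen:
                hits.append((i, start))
            else:
                seen.add(window)
    return hits

# Hypothetical documents sharing a boilerplate span.
docs = [
    "terms of service apply to all users of this site",
    "read this first terms of service apply to all users today",
]
print(repeated_spans(docs, n=5))  # [(1, 3), (1, 4), (1, 5)]
```

Three consecutive window hits correspond to one repeated seven-token span. A token-weighted report for this stage would count the tokens inside flagged spans rather than whole documents.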

3.5 Dedup reports

Dedup reports are part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For Jaccard similarity, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
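A dedup report can carry both rates per source, which helps show whether a stage is quietly removing one slice of the mixture. The following is a sketch under assumed field names (`source`, `tokens`, `id`) and a hypothetical `dedup_report` helper.

```python
from collections import defaultdict

def dedup_report(records, kept_ids):
    """Per-source counts plus document and token acceptance rates for one stage."""
    stats = defaultdict(lambda: {"n_in": 0, "n_out": 0, "tok_in": 0, "tok_out": 0})
    for rec in records:
        s = stats[rec["source"]]
        s["n_in"] += 1
        s["tok_in"] += rec["tokens"]
        if rec["id"] in kept_ids:
            s["n_out"] += 1
            s["tok_out"] += rec["tokens"]
    return {
        src: {**s, "a": s["n_out"] / s["n_in"], "a_tok": s["tok_out"] / s["tok_in"]}
        for src, s in stats.items()
    }

# Hypothetical records and a hypothetical set of ids that survived dedup.
records = [
    {"id": 1, "source": "web", "tokens": 100},
    {"id": 2, "source": "web", "tokens": 300},
    {"id": 3, "source": "code", "tokens": 50},
    {"id": 4, "source": "code", "tokens": 50},
]
report = dedup_report(records, kept_ids={1, 3, 4})
for src, row in report.items():
    print(src, row)
```

In this toy input the web source keeps half of its documents but only a quarter of its tokens, while the code source is untouched; an aggregate rate alone would hide that difference.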

4. Fuzzy Deduplication

Fuzzy Deduplication gives the conceptual and mathematical layer for contamination and dedup audits. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

4.1 Shingling

Shingling is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
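As a concrete instance of shingling, the sketch below builds word-level $k$-shingles and compares two texts with Jaccard similarity $|A \cap B| / |A \cup B|$. Whitespace tokenization, $k = 3$, and the function names are assumptions; character shingles or tokenizer ids are equally valid choices.

```python
def shingles(text, k=3):
    """Return the set of word k-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; taken to be 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

x = shingles("the quick brown fox jumps over the lazy dog")
y = shingles("the quick brown fox leaps over the lazy dog")
print(len(x), len(y), jaccard(x, y))  # 7 7 0.4
```

A single changed word alters three of the seven shingles on each side, so the similarity drops to 4/10. Larger $k$ makes the measure more sensitive to small edits.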

4.2 MinHash

MinHash is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For near duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
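MinHash compresses a shingle set into a short signature whose position-wise agreement rate estimates Jaccard similarity. The sketch below uses seeded SHA-1 digests as a stand-in for random permutations; that choice, the helper names, and the synthetic sets are assumptions made for illustration.

```python
import hashlib

def _h(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(items, num_perm=256):
    """For each seed, record the minimum hash value over a non-empty set."""
    return [min(_h(seed, it) for it in items) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Synthetic shingle sets with 60 shared items out of 100 distinct items.
a = {f"s{i}" for i in range(0, 80)}
b = {f"s{i}" for i in range(20, 100)}
true_j = len(a & b) / len(a | b)
est_j = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(true_j, est_j)
```

The estimate is noisy: with $k$ signature positions its standard error is roughly $\sqrt{J(1-J)/k}$, which is why signature length belongs in the run manifest alongside the threshold.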

4.3 LSH buckets

LSH buckets are part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For shingles, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
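LSH bucketing avoids comparing every pair by splitting each signature into bands and treating documents that share any whole band as candidates. The sketch below assumes signatures are already computed; the hand-written signatures, the function name `lsh_candidates`, and the band layout are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands, rows):
    """Return pairs of ids whose signatures agree on every row of at least one band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical 6-value signatures, split into 3 bands of 2 rows.
signatures = {
    "d1": [3, 7, 1, 9, 4, 4],
    "d2": [3, 7, 2, 8, 4, 4],
    "d3": [5, 6, 1, 9, 0, 2],
    "d4": [8, 8, 8, 8, 8, 8],
}
print(sorted(lsh_candidates(signatures, bands=3, rows=2)))
# [('d1', 'd2'), ('d1', 'd3')]
```

Candidates are only a shortlist. A pair such as `d1` and `d3` shares one band but little else, so a verification pass on the full signature or the shingle sets is still needed before merging.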

4.4 Similarity thresholds

Similarity thresholds are part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For MinHash, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
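When a threshold is implemented through banded MinHash, the effective cutoff is a smooth curve, not a step. Under the usual idealization that each signature position agrees independently with probability equal to the Jaccard similarity $s$, a pair becomes a candidate with probability $1 - (1 - s^{r})^{b}$ for $b$ bands of $r$ rows. The sketch below evaluates that curve; the configuration of 20 bands and 5 rows is a hypothetical choice.

```python
def candidate_probability(s, bands, rows):
    """P(candidate) = 1 - (1 - s**rows)**bands for Jaccard similarity s."""
    return 1.0 - (1.0 - s ** rows) ** bands

def approx_threshold(bands, rows):
    """Rule-of-thumb location of the steep part of the curve: (1/bands)**(1/rows)."""
    return (1.0 / bands) ** (1.0 / rows)

bands, rows = 20, 5  # hypothetical configuration using 100 hash functions
for s in (0.3, 0.5, 0.7, 0.9):
    p = candidate_probability(s, bands, rows)
    print(f"s={s:.1f}  P(candidate)={p:.3f}")
print(f"approximate threshold: {approx_threshold(bands, rows):.3f}")
```

Pairs well below the approximate threshold are rarely retrieved and pairs well above it almost always are, but pairs near it are retrieved only some of the time. An audit sample drawn from that middle region is the most informative check on a chosen threshold.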

4.5 False merge risks

False merge risks are part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For Jaccard similarity, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
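One source of false merges is transitive chaining: if near-duplicate pairs are merged into connected components, two documents can land in the same cluster without being similar to each other. The sketch below demonstrates this with a small union-find; the synthetic shingle sets, the threshold of 0.5, and the function names are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def cluster(sets, threshold):
    """Union-find over all pairs with Jaccard >= threshold; returns a root label per item."""
    parent = list(range(len(sets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(sets))]

# Synthetic chain: A resembles B and B resembles C, but A and C share little.
A = set(range(0, 10))
B = set(range(3, 13))
C = set(range(6, 16))
labels = cluster([A, B, C], threshold=0.5)
print(labels, jaccard(A, C))  # [0, 0, 0] 0.25
```

All three sets receive the same label even though `A` and `C` have similarity 0.25. Reporting the minimum pairwise similarity inside each cluster, or checking every member against the retained representative, is one way to surface such merges before useful diversity is dropped.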

5. Benchmark Contamination Audits

Benchmark Contamination Audits gives the conceptual and mathematical layer for contamination and dedup audits. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

5.1 Exact benchmark match

Exact benchmark match is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
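For an exact benchmark match, the same count-and-rate habit applies to the evaluation set: how many items match, out of how many. The sketch below hashes a normalized form of each training document and each evaluation item. The normalization rules, function names, and sample strings are assumptions, and the check only catches whole-text equality, not benchmark items embedded in longer documents.

```python
import hashlib
import re

def normalize(text):
    """Assumed normalization: lowercase, punctuation to spaces, collapsed whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def fingerprint(text):
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def exact_matches(train_docs, eval_items):
    """Return indices of eval items whose normalized text equals some training document."""
    train_hashes = {fingerprint(d) for d in train_docs}
    return [i for i, item in enumerate(eval_items) if fingerprint(item) in train_hashes]

# Hypothetical training documents and evaluation items.
train_docs = ["What is the capital of France?  Paris.", "Unrelated forum post."]
eval_items = ["what is the capital of france? paris", "What is 2 + 2? 4"]
hits = exact_matches(train_docs, eval_items)
print(hits, len(hits) / len(eval_items))  # [0] 0.5
```

The matched indices, not only the rate, belong in the audit log so that a removal decision can be traced back to specific records.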

5.2 Prompt-only contamination

Prompt-only contamination is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For near duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
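Prompt-only contamination is easiest to see next to its siblings, answer-only and full-example contamination. The sketch below labels a (document, evaluation item) pair by plain substring containment after light normalization. The function names and the sample quiz item are hypothetical, and substring containment is a deliberately crude stand-in for a real matcher.

```python
def norm(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def contamination_type(doc, prompt, answer):
    """Classify one (document, eval item) pair by substring containment."""
    d, p, a = norm(doc), norm(prompt), norm(answer)
    has_prompt, has_answer = p in d, a in d
    if has_prompt and has_answer:
        return "full-example"
    if has_prompt:
        return "prompt-only"
    if has_answer:
        return "answer-only"
    return "none"

# Hypothetical evaluation item and training documents.
prompt = "Which planet has the most moons?"
answer = "Saturn"
docs = [
    "Quiz night! Which planet has the most moons? Answers next week.",
    "Quiz key: which planet has the most moons? Saturn.",
    "Saturn is visible tonight.",
]
print([contamination_type(d, prompt, answer) for d in docs])
# ['prompt-only', 'full-example', 'answer-only']
```

The third document shows why answer-only hits need care: a short answer string appears in many unrelated documents, so such matches are weak evidence unless the answer is long or distinctive.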

5.3 Answer leakage

Answer leakage is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For shingles, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.4 Paraphrase contamination preview

Paraphrase contamination preview is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For MinHash, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

5.5 WIMBD-style count/search audit

WIMBD-style count/search audit is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For Jaccard similarity, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},$$

where $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
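A count/search audit answers two questions about a query string: how often does it occur, and in which records. The toy sketch below builds an in-memory n-gram index to show both operations. It does not reproduce any particular tool's interface; the function name `build_ngram_index`, the trigram choice, and the documents are assumptions, and a real corpus would need a sharded or on-disk index.

```python
from collections import Counter, defaultdict

def build_ngram_index(docs, n=3):
    """Return (counts, postings): occurrences per word n-gram and the doc ids containing it."""
    counts = Counter()
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            postings[gram].add(doc_id)
    return counts, postings

# Hypothetical documents keyed by record id.
docs = {
    "d1": "the answer is forty two",
    "d2": "everyone knows the answer is forty two",
    "d3": "the answer is unknown",
}
counts, postings = build_ngram_index(docs, n=3)
query = ("the", "answer", "is")
print(counts[query], sorted(postings[query]))  # 3 ['d1', 'd2', 'd3']
```

Counting gives the rate for a report, while the postings give the record-level lineage needed to inspect or remove specific matches.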

6. Memorization and Privacy

Memorization and Privacy gives the conceptual and mathematical layer for contamination and dedup audits. The local variables in this section should be read as pipeline objects: documents, records, tokens, filters, weights, shards, and manifests.

6.1 Repetition and memorization

Repetition and memorization is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}$$

For duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant is only written in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If ninn_{\mathrm{in}} records enter the stage and noutn_{\mathrm{out}} records leave, the acceptance rate is

a=noutnin.a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.

A sudden change in aa is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$
a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},
$$

where $f(r_i) \in \{0, 1\}$ indicates whether the filter keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
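To make the gap between the two rates concrete, here is a small sketch on synthetic counts. The corpus (95 documents of 70 tokens, 5 documents of 570 tokens) and the length threshold of 500 are invented for illustration; they are chosen so that a filter dropping 5 percent of documents drops 30 percent of tokens.

```python
def acceptance_rates(token_counts, keep):
    """Return (document acceptance rate a, token acceptance rate a_tok)."""
    if len(token_counts) != len(keep):
        raise ValueError("token_counts and keep must have the same length")
    n_in = len(token_counts)
    tok_in = sum(token_counts)
    if n_in == 0 or tok_in == 0:
        raise ValueError("rates are undefined for an empty input")
    n_out = sum(1 for k in keep if k)
    tok_out = sum(t for t, k in zip(token_counts, keep) if k)
    return n_out / n_in, tok_out / tok_in


# Synthetic corpus: many short documents and a few long ones.
token_counts = [70] * 95 + [570] * 5
# Hypothetical filter f(r_i): keep documents of at most 500 tokens.
keep = [t <= 500 for t in token_counts]

a, a_tok = acceptance_rates(token_counts, keep)
print(f"a = {a:.2f}, a_tok = {a_tok:.2f}")  # a = 0.95, a_tok = 0.70
```

Reporting only $a$ here would suggest a mild filter, while the token view shows that nearly a third of the training budget changed.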

6.2 PII leakage risk

PII leakage risk is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$
\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}
$$

For near duplicates, the invariant should be explicit enough that a checker can fail fast. If the invariant lives only in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$
a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.
$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.
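One way to turn the drift signal into an automated check is to compare the current acceptance rate against a baseline from earlier builds. The sketch below is an assumption-laden illustration: the median baseline, the function name `acceptance_drift`, and the default tolerance of 0.05 are hypothetical choices, not values taken from this pipeline.

```python
from statistics import median


def acceptance_drift(history, current, tolerance=0.05):
    """Compare the current acceptance rate with the median of past builds.

    `tolerance` is a hypothetical absolute threshold; tune it per stage.
    """
    if not history:
        raise ValueError("need at least one previous acceptance rate")
    baseline = median(history)
    delta = current - baseline
    return {
        "baseline": baseline,
        "current": current,
        "delta": delta,
        "drifted": abs(delta) > tolerance,
    }


# Synthetic history of acceptance rates from four earlier builds.
previous_builds = [0.91, 0.92, 0.90, 0.93]
report = acceptance_drift(previous_builds, current=0.70)
print(report["drifted"])  # True
```

A flagged build is a prompt to inspect audit slices, not proof of a bug: a legitimate change in upstream sources can move the rate as well.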

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$
a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},
$$

where $f(r_i) \in \{0, 1\}$ indicates whether the filter keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.3 Extraction attack motivation

Extraction attack motivation is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$
\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}
$$

For shingles, the invariant should be explicit enough that a checker can fail fast. If the invariant lives only in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.
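The last non-example suggests pairing every filter threshold with an audit sample. A minimal sketch of a reproducible sample is shown below, assuming records carry a stable string identifier. The function name `in_audit_sample`, the salt `"audit-v1"`, and the 10 percent rate are hypothetical.

```python
import hashlib


def in_audit_sample(record_id: str, rate: float, salt: str = "audit-v1") -> bool:
    """Decide deterministically whether a record belongs to the audit sample.

    The decision depends only on (salt, record_id), so restarts and parallel
    workers select the same records without sharing any random state.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    digest = hashlib.sha256(f"{salt}:{record_id}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return value < int(rate * 2**64)


# Synthetic identifiers; roughly one in ten should be selected.
record_ids = [f"rec-{i}" for i in range(10_000)]
audit_ids = [rid for rid in record_ids if in_audit_sample(rid, rate=0.10)]
print(len(audit_ids))
```

Changing the salt draws a fresh sample, which is useful when reviewers should not see the same records twice across audits.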

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$
a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.
$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$
a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},
$$

where $f(r_i) \in \{0, 1\}$ indicates whether the filter keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.4 Dedup impact

Dedup impact is part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$
\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}
$$

For MinHash, the invariant should be explicit enough that a checker can fail fast. If the invariant lives only in a notebook comment or an engineer's memory, it will not protect a long-running data build.
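For MinHash, an explicit invariant might be that every signature is built from a non-empty shingle set and that only signatures of equal length are compared. The sketch below illustrates this with a deliberately simple construction: seeded SHA-256 hashes stand in for random permutations, and the names `shingles`, `minhash_signature`, and `estimate_jaccard`, along with the default of 64 hash functions, are assumptions for the example rather than the notebook's implementation.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash64(seed: int, item: str) -> int:
    """Seeded 64-bit hash; each seed plays the role of one permutation."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """Keep the minimum hash value under each seeded hash function."""
    if not items:
        raise ValueError("cannot sign an empty shingle set")
    return [min(_hash64(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where two signatures agree."""
    if not sig_a or len(sig_a) != len(sig_b):
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old mill"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(estimate_jaccard(sig_a, sig_b))
```

The estimate is noisy with only 64 hash functions; the point of the sketch is the two guards that raise immediately, which is what lets a checker fail fast instead of silently comparing malformed signatures.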

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.
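The first non-example contrasts a bare path with a manifest. A minimal sketch of what such a manifest could record is given below, assuming shards are stored as JSON Lines files in one directory. The file pattern, the field names, and the `manifest_version` key are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards need not fit in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def build_manifest(shard_dir, pattern: str = "*.jsonl") -> dict:
    """Record name, content hash, size, and record count for every shard."""
    entries = []
    for path in sorted(Path(shard_dir).glob(pattern)):  # sorted: stable order
        with open(path, "rb") as handle:
            records = sum(1 for line in handle if line.strip())
        entries.append({
            "name": path.name,
            "sha256": sha256_file(path),
            "bytes": path.stat().st_size,
            "records": records,
        })
    return {"manifest_version": 1, "shards": entries}


def verify_manifest(shard_dir, manifest: dict) -> list:
    """Return a list of human-readable problems; empty means the data matches."""
    problems = []
    for entry in manifest["shards"]:
        path = Path(shard_dir) / entry["name"]
        if not path.is_file():
            problems.append(f"missing shard: {entry['name']}")
        elif sha256_file(path) != entry["sha256"]:
            problems.append(f"content changed: {entry['name']}")
    return problems
```

Calling `verify_manifest` before training starts catches the case where the same directory name now holds different bytes.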

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$
a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.
$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$
a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},
$$

where $f(r_i) \in \{0, 1\}$ indicates whether the filter keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.

6.5 Redaction logs

Redaction logs are part of the canonical scope of contamination and dedup audits. We model the relevant object as a finite collection $\mathcal{D} = \{r_i\}_{i=1}^n$ with record-level metadata $m_i$ and text or token content $x_i$. The practical question is whether the transformation preserves the intended empirical distribution.

A useful local invariant is:

$$
\text{valid}(r_i, \mathcal{S}) = 1 \quad \Longrightarrow \quad r_i \text{ can be consumed by the next pipeline stage.}
$$

For Jaccard similarity, the invariant should be explicit enough that a checker can fail fast. If the invariant lives only in a notebook comment or an engineer's memory, it will not protect a long-running data build.

Examples:

  • A small local experiment can store this object in memory; a frontier-scale run must store it as sharded, versioned, validated records.
  • The mathematical object is simple, but the operational contract must survive restarts, parallel workers, schema changes, and audits.
  • The notebook for this section uses synthetic data so the same ideas can be executed without external files.

Non-examples:

  • A path on disk without a manifest is not a reproducible dataset.
  • A metric dashboard without record-level lineage is not a provenance system.
  • A filter threshold without an audit sample is not evidence of quality.

Implementation consequence: every transformation should report both a count and a rate. If $n_{\mathrm{in}}$ records enter the stage and $n_{\mathrm{out}}$ records leave, the acceptance rate is

$$
a = \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.
$$

A sudden change in $a$ is a data-drift signal even when the code still runs. This is why pipeline math is inseparable from logging, manifests, and audit slices.

For LLM work, the token-weighted view is often more important than the document-weighted view. A filter that removes 5 percent of documents may remove 30 percent of tokens if it targets long documents. The corresponding token acceptance rate is

$$
a_{\mathrm{tok}} = \frac{\sum_i f(r_i)\,T_i}{\sum_i T_i},
$$

where $f(r_i) \in \{0, 1\}$ indicates whether the filter keeps record $r_i$ and $T_i$ is the token count or a deterministic token-count estimate. The distinction matters for compute budgets, mixture proportions, and scaling-law interpretation.
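Since this subsection concerns redaction logs, a small sketch of what one log entry might contain can make the idea concrete. Everything here is illustrative: the email regular expression is a toy detector, and the field names, the `[EMAIL]` placeholder, the rule label `email-v1`, and the keyed-hash design are assumptions rather than a prescribed format.

```python
import hashlib
import hmac
import re

# Toy detector for illustration only; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(record_id: str, text: str, key: bytes, rule: str = "email-v1"):
    """Replace email addresses and return (redacted_text, log_entries).

    Each entry stores offsets into the ORIGINAL text and a keyed hash of the
    removed span, so the log can support audits without storing the raw PII.
    """
    pieces, log, last = [], [], 0
    for match in EMAIL_RE.finditer(text):
        pieces.append(text[last:match.start()])
        pieces.append("[EMAIL]")
        span_hash = hmac.new(key, match.group().encode("utf-8"), hashlib.sha256)
        log.append({
            "record_id": record_id,
            "rule": rule,
            "start": match.start(),
            "end": match.end(),
            "span_hmac": span_hash.hexdigest(),
        })
        last = match.end()
    pieces.append(text[last:])
    return "".join(pieces), log


secret_key = b"demo-key-not-for-production"  # hypothetical key for the example
clean, entries = redact_emails(
    "rec-1", "Contact alice@example.com or bob@example.org.", secret_key
)
print(clean)         # Contact [EMAIL] or [EMAIL].
print(len(entries))  # 2
```

A keyed hash is used instead of a plain digest because short PII strings can be recovered from unkeyed hashes by guessing; the log then reports counts and rates per rule without becoming a second copy of the sensitive data.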

7. Common Mistakes

| # | Mistake | Why It Is Wrong | Fix |
|---|---------|-----------------|-----|
| 1 | Trusting a file because it exists | A zero-byte or unparsable artifact can still pass a loose path check | Validate content and parseability |
| 2 | Counting documents but not tokens | Long documents dominate compute | Report both document and token rates |
| 3 | Changing schemas without versioning | Old and new records become indistinguishable | Pin schema versions in every record |
| 4 | Dropping metadata during transforms | Audits and removals become impossible | Preserve source and transform lineage |
| 5 | Using nondeterministic ordering | Rebuilds cannot be compared | Seed and record ordering rules |
| 6 | Ignoring failed records | Silent loss can bias the corpus | Quarantine and summarize failures |
| 7 | Treating filters as neutral | Filters encode preferences and tradeoffs | Ablate and audit every major filter |
| 8 | Mixing train and eval sources | Evaluation becomes contaminated | Run overlap audits before release |
| 9 | Optimizing one aggregate score | Small domains can regress | Track slice metrics |
| 10 | Skipping data cards | Users cannot judge intended use or risk | Publish structured documentation |
| 11 | Assuming licenses are uniform | Source terms can conflict | Track license at source and record level |
| 12 | Forgetting reproducible manifests | The same name can refer to different data | Use hashes and version pins |

8. Exercises

  1. (*) Build a synthetic duplicate example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  2. (*) Build a synthetic near duplicate example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  3. (*) Build a synthetic shingle example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  4. (**) Build a synthetic MinHash example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  5. (**) Build a synthetic Jaccard example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  6. (**) Build a synthetic contamination example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  7. (**) Build a synthetic memorization example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  8. (***) Build a synthetic duplicate example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  9. (***) Build a synthetic near duplicate example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
  10. (***) Build a synthetic shingle example, compute its validation signal, and explain which downstream stage would fail if the signal were wrong.
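As a possible starting point for the exercises above, the sketch below builds a synthetic exact-duplicate case and a synthetic near-duplicate case. The canonicalization rules (NFKC, lowercasing, whitespace collapsing), the shingle width of 3, and all function names are assumptions chosen for the example; the companion notebooks may use different conventions.

```python
import hashlib
import re
import unicodedata


def canonicalize(text: str) -> str:
    """Normalize Unicode form, case, and whitespace before hashing."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text: str) -> str:
    return hashlib.sha256(canonicalize(text).encode("utf-8")).hexdigest()


def exact_dedup(docs):
    """Keep the first document for each canonical content hash."""
    seen, kept = set(), []
    for doc in docs:
        digest = content_hash(doc)
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def word_shingles(text: str, k: int = 3) -> set:
    tokens = canonicalize(text).split()
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 0))}


def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Exact duplicates that differ only in case and spacing.
docs = ["The cat sat on the mat.", "the  cat sat on   the mat.", "A dog barked."]
print(len(exact_dedup(docs)))  # 2

# Near duplicates: 3 shared shingles out of 5 distinct shingles.
similarity = jaccard(
    word_shingles("the cat sat on the mat"),
    word_shingles("the cat sat on the rug"),
)
print(f"{similarity:.2f}")  # 0.60
```

The validation signal in the first case is the number of records removed; in the second it is the Jaccard score relative to whatever near-duplicate threshold the exercise sets.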

9. Why This Matters for AI

| Concept | AI impact |
|---------|-----------|
| duplicate | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| near duplicate | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| shingle | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| MinHash | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| Jaccard | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| contamination | Controls what examples, gradients, risks, or audits the model pipeline can represent |
| memorization | Controls what examples, gradients, risks, or audits the model pipeline can represent |

Data pipeline quality is model quality in delayed form. The model eventually converts these records into gradients; any unresolved ambiguity becomes either wasted compute, misleading evaluation, memorization risk, or irreproducible science.

10. Conceptual Bridge

This section connects the previous and next pieces of the curriculum as follows:

raw sources -> records -> validation -> assembly -> audits -> documentation -> mixture

The next section is [Documentation and Governance](../06-Documentation-and-Governance/notes.md). It uses the contracts established here and moves one step further through the LLM data pipeline.
