Lesson overview | Previous part | Next part
Red Teaming and Safety Evaluations: Part 5: Safety Benchmarks and Datasets to 6. Measurement Math
5. Safety Benchmarks and Datasets
Safety Benchmarks and Datasets develops the part of red teaming and safety evaluations that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.
5.1 SafetyPrompts
SafetyPrompts belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For safetyprompts, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat safetyprompts as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for safetyprompts:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: SafetyPrompts is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
5.2 HarmBench-style suites
HarmBench-style suites belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For harmbench-style suites, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat harmbench-style suites as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for harmbench-style suites:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: HarmBench-style suites is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
5.3 Jailbreak sets
Jailbreak sets belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For jailbreak sets, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat jailbreak sets as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for jailbreak sets:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Jailbreak sets is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
5.4 Refusal measurement
Refusal measurement belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For refusal measurement, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat refusal measurement as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for refusal measurement:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Refusal measurement is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
5.5 Over-refusal measurement
Over-refusal measurement belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For over-refusal measurement, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat over-refusal measurement as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for over-refusal measurement:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Over-refusal measurement is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6. Measurement Math
Measurement Math develops the part of red teaming and safety evaluations that the approved TOC assigns to Chapter 18. The emphasis is alignment behavior, safety constraints, and feedback loops, not generic fine-tuning or production monitoring.
6.1 Attack success rate
Attack success rate belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For attack success rate, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat attack success rate as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for attack success rate:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Attack success rate is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6.2 Confidence intervals
Confidence intervals belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For confidence intervals, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat confidence intervals as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for confidence intervals:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Confidence intervals is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6.3 False positive and false negative tradeoffs
False positive and false negative tradeoffs belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For false positive and false negative tradeoffs, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat false positive and false negative tradeoffs as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for false positive and false negative tradeoffs:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: False positive and false negative tradeoffs is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6.4 Severity weighting
Severity weighting belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For severity weighting, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat severity weighting as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for severity weighting:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Severity weighting is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.
6.5 Paired safety comparisons
Paired safety comparisons belongs in the canonical scope of red teaming and safety evaluations. The object is the safety attack and evaluation protocol, not merely a prompt trick or a moderation label. We study how data, losses, policies, review processes, and safety constraints shape a model's conditional distribution over responses.
A compact way to read this subsection is through the local symbol v(x,y). It marks the alignment object being transformed: an instruction policy, a preference pair, a violation classifier, a guardrail action, or a feedback event. The details differ, but the discipline is the same: state the object, state the loss or decision rule, then audit the behavioral side effects.
For paired safety comparisons, this formula should not be treated as a slogan. It defines which tokens, responses, comparisons, or decisions receive gradient or operational weight. A change in masking, sampling, rubric wording, or thresholding changes the effective objective even if the model architecture is unchanged.
| Alignment object | Mathematical question | Engineering question |
|---|---|---|
| Data | Which examples define the target behavior? | Who wrote, filtered, and approved them? |
| Objective | Which terms receive weight? | Are masks, margins, and thresholds logged? |
| Policy | Which actions are allowed or disallowed? | Can reviewers reproduce the decision? |
| Evaluation | Which metric detects regression? | Is the test private, stable, and sliced? |
| Feedback | Which new evidence changes training? | How does it enter the next dataset version? |
Examples:
- Treat paired safety comparisons as part of the model contract and store the exact data version.
- Record the prompt template, role format, policy version, and decoder settings.
- Compare aligned and reference policies on both helpfulness and safety slices.
- Use held-out examples that were not used to tune refusals or rewards.
- Inspect failure cases before declaring the objective successful.
Non-examples:
- Calling a model aligned because it sounds polite on a few prompts.
- Training on refusals without measuring over-refusal on benign requests.
- Using a reward model as ground truth without calibration or adversarial checks.
- Shipping a guardrail threshold without measuring false positive and false negative rates.
- Letting feedback logs change training without provenance or consent controls.
A useful implementation pattern is to separate policy, data, and measurement. The policy says what behavior is desired. The data supplies examples, comparisons, attacks, or feedback events. The measurement checks whether the updated system moved in the intended direction without unacceptable regressions.
policy text/rubric
|
v
training or guardrail data -> objective/threshold -> aligned system
| |
v v
audit metadata held-out safety eval
Worked reasoning pattern for paired safety comparisons:
- Name the target behavior in plain language.
- Write the mathematical variable that represents it.
- Specify which examples or comparisons estimate it.
- Choose the optimization loss or runtime decision rule.
- Define the regression metric that would prove the change became worse.
Three details are especially easy to miss in alignment work. First, the user intent distribution is not the same as the pretraining distribution. Second, safety labels are not ordinary class labels; they encode policy judgments that can change by context. Third, optimization pressure finds shortcuts, so every proxy must be monitored for Goodhart-style failures.
| Failure pressure | Typical symptom | Mitigation |
|---|---|---|
| Proxy reward | High reward but worse human judgment | Holdout preferences and adversarial review |
| Refusal shortcut | Safe but unhelpful responses | Measure benign refusal rate separately |
| Template overfit | Good on training chat format only | Evaluate alternate templates and languages |
| Policy ambiguity | Inconsistent labels | Adjudication and rubric revision |
| Feedback drift | New labels change old policy silently | Version policy, rubric, and dataset together |
AI connection: Paired safety comparisons is part of the post-training stack used by modern assistant systems. It links the base language model to human intent, safety policy, and deployment constraints without pretending that a single loss can capture all values. The goal is not perfect alignment by formula; it is a repeatable loop where evidence, objectives, and safeguards improve together.