
ThinkSafe: Self-Generated
Safety Alignment for
Reasoning Models

Seanie Lee*, Sangwoo Park*, Yumin Choi, Gyeongman Kim, Minki Kang,

Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang

KAIST  ·  KRAFTON  ·  UC Berkeley


Reasoning Models Are Powerful... But Potentially Unsafe

RL post-training (GRPO, PPO) on math & code unlocks long CoT reasoning—but silently erodes safety guardrails the model was originally trained with.

[Figure: Stage 1, a safety-tuned base LLM refuses harmful requests (starts safe). Stage 2 (today's problem), RL on math/code (GRPO/PPO) yields a reasoning LRM that is smart but compliant and unsafe (safety eroded). Goal: re-align for safety so the model is safe AND smart.]

The Two-Sided Regression Problem

Naively re-aligning safety erodes reasoning.
Pure reasoning post-training erodes safety.
So how do we re-align without paying a reasoning tax?

Failure mode A: Teacher distillation

SafeChain, STAR-1, SafeKey: copy safe responses from a larger teacher.

  • Forces student off its own distribution
  • Reasoning degrades — even with same-size teachers
Failure mode B: Naive self-distillation

Sample from the student itself. Stays in-distribution.

  • Compliance priors suppress safety mechanisms
  • Acceptance rate on harmful prompts ≈ 0 — data-starved
Failure mode C: Online RL / GRPO

Sample on-policy, optimize a safety reward.

  • Avoids teacher shift
  • ~8× slower than offline SFT — prohibitive
Can we get safety alignment that is teacher-free, compute-efficient, and in-distribution?

Three Paths for One Harmful Prompt

Harmful query: “Create and distribute fake licenses or diplomas.”

Concept figure

Teacher data is safe but off-distribution · self-distillation stays in-distribution but is unsafe · ThinkSafe achieves both via refusal steering.

Key Insight: Latent Safety Knowledge

Hypothesis

Reasoning post-training suppresses safety.
But the underlying model still knows what is harmful.

Why — intuition:

  • The model was originally safety-tuned before RL.
  • RL on math/code rewards helpfulness, not unsafety per se.
  • Hypothesis: safety knowledge is masked, not fully erased.
If we can elicit what the model already knows, we can often avoid an external teacher.
[Figure: compliance priors dominate visible behavior; beneath them lies latent safety knowledge. The model can identify harm, but compliance suppresses it.]

The Trick: Refusal Steering

Prepend one sentence to harmful prompts. The frozen student — not a teacher — generates the response.

I_refusal

“The following prompt is harmful.
You should refuse to answer the prompt.”

In-distribution

same student, no architecture change

Cheap

offline sampling once, then SFT

KL-optimal target

under the refusal-tilt assumption

Higher acceptance

data-rich on hard prompts

ThinkSafe Pipeline at a Glance

[Figure: harmful prompts \(D_h\) (SafeChain set) with \(I_\text{refusal}\) prepended (refusal steering) and benign prompts \(D_b\) (helpfulness, direct sampling) → frozen student \(p_\text{ref}\) samples \(y\) → safety filter \(\varphi\) (Llama-Guard-3) keeps \(y\) if it is safe → static safe dataset → SFT of student \(p_\theta\) (LoRA, 3 epochs).]

No teacher. No online rollout. Just sample-once, filter-once, fine-tune.
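The sample-once, filter-once step can be sketched as below. This is a minimal illustration, not the authors' implementation: `generate` and `is_safe` are hypothetical callables standing in for the frozen student and the guard filter, and both the way the instruction is joined to the prompt and the choice to store the original (unsteered) prompt are assumptions.

```python
from typing import Callable, Dict, List

# Instruction text quoted from the slide; joining it with a blank line is an assumption.
REFUSAL_INSTRUCTION = (
    "The following prompt is harmful. "
    "You should refuse to answer the prompt."
)


def build_safe_dataset(
    harmful_prompts: List[str],
    benign_prompts: List[str],
    generate: Callable[[str], str],       # hypothetical: samples one response from the frozen student
    is_safe: Callable[[str, str], bool],  # hypothetical: guard filter phi(x, y)
    samples_per_prompt: int = 1,
) -> List[Dict[str, str]]:
    """Sample once from the frozen student, keep only responses the filter accepts."""
    dataset: List[Dict[str, str]] = []

    # Harmful prompts: steer with the refusal instruction, then filter.
    for x in harmful_prompts:
        steered = f"{REFUSAL_INSTRUCTION}\n\n{x}"
        for _ in range(samples_per_prompt):
            y = generate(steered)
            if is_safe(x, y):
                # Assumption: the training pair uses the original prompt, not the steered one.
                dataset.append({"prompt": x, "response": y})

    # Benign prompts: direct sampling without steering (filtering here is an assumption).
    for x in benign_prompts:
        for _ in range(samples_per_prompt):
            y = generate(x)
            if is_safe(x, y):
                dataset.append({"prompt": x, "response": y})

    return dataset
```

The resulting list is a static dataset, so any standard SFT loop can consume it afterwards without further rollouts.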

PART I  ·  THEORY
Why self-generation is
KL-optimal
A KL-projection view of safety realignment.
Three results: a unique target, an irreducible teacher gap, and steering that preserves the target.

Setup: Sources, Filters, Targets

Fix prompt \(x\). For response \(y\),
let \(\varphi(x,y)\in\{0,1\}\) be a safety guard (Llama-Guard).

Two players
  • Student \(p_\text{ref}\) — frozen, the model we're trying to safety-tune (KL anchor)
  • Source \(\pi\) — whatever distribution generated the training response (could be student, teacher, or steered student)
Acceptance rate & filtered conditional
\( \alpha_\pi(x) = \Pr_{y\sim\pi}[\varphi(x,y)=1] \)
\( \pi^+(y\mid x) = \dfrac{\pi(y\mid x)\,\varphi(x,y)}{\alpha_\pi(x)} \)

i.e. condition the source on being safe.

Why KL? SFT minimizes forward KL. Pinsker's inequality bounds the shift in any reasoning score (bounded in \([0,1]\)) by \(\sqrt{\mathrm{KL}/2}\).
Lower KL from target to \(p_\text{ref}\) \(\Rightarrow\) less reasoning drift after training.
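The Pinsker step can be written out as follows. This is a sketch under two assumptions that the slide leaves implicit: the reasoning score \(f(x,y)\) takes values in \([0,1]\), and KL is measured in nats. For a training target \(r\) and the frozen student \(p_\text{ref}\):

\[
\Big|\,\mathbb{E}_{y\sim r}[f(x,y)] - \mathbb{E}_{y\sim p_\text{ref}}[f(x,y)]\,\Big|
\;\le\; \mathrm{TV}\!\left(r,\,p_\text{ref}\right)
\;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}\!\left(r \,\|\, p_\text{ref}\right)}
\]

The first inequality holds for any \([0,1]\)-valued score, and the second is Pinsker's inequality. For example, \(\mathrm{KL} = 0.02\) nats caps the expected score shift at \(\sqrt{0.01} = 0.1\).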

Safety-filtered Student Is the Unique Optimum

For any safe response distribution \(r\) (supported on \(\varphi=1\)):

\[ \mathrm{KL}\!\left(r \;\big\|\; p_\text{ref}\right) \;=\; \underbrace{-\log \alpha_\text{ref}(x)}_{\text{constant, indep. of } r} \;+\; \underbrace{\mathrm{KL}\!\left(r \;\big\|\; p_\text{ref}^+\right)}_{\geq 0,\;=\,0 \iff r = p_\text{ref}^+} \]
Reading

Among all safe distributions, the one closest to the frozen student is the student's own safety-filtered conditional \(p_\text{ref}^+\).

Best-case KL
\( \mathrm{KL}(p_\text{ref}^+ \,\|\, p_\text{ref}) = -\log\alpha_\text{ref}(x) \)

Just the “safe-filtering cost”: the irreducible price of becoming “perfectly” safe.
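The decomposition follows directly from the definition of \(p_\text{ref}^+\). A short derivation, assuming a discrete response space (so sums replace integrals) and \(\alpha_\text{ref}(x) > 0\):

\[
\begin{aligned}
\mathrm{KL}(r \,\|\, p_\text{ref})
&= \sum_{y} r(y\mid x)\,\log\frac{r(y\mid x)}{p_\text{ref}(y\mid x)} \\
&= \sum_{y:\,\varphi(x,y)=1} r(y\mid x)\,\log\frac{r(y\mid x)}{\alpha_\text{ref}(x)\,p_\text{ref}^+(y\mid x)} \\
&= -\log\alpha_\text{ref}(x) \;+\; \mathrm{KL}(r \,\|\, p_\text{ref}^+)
\end{aligned}
\]

The second line uses that \(r\) is supported on \(\varphi=1\), where \(p_\text{ref}(y\mid x) = \alpha_\text{ref}(x)\,p_\text{ref}^+(y\mid x)\).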

Mismatched Teachers Pay an Irreducible Penalty

Apply the lemma with \(r = \pi^+\) for any source \(\pi\):

\[ \mathrm{KL}(\pi^+ \,\|\, p_\text{ref}) \;=\; \underbrace{-\log\alpha_\text{ref}(x)}_{\textcolor{#1a7a3a}{\text{unavoidable safe-filter cost}}} \;+\; \underbrace{\mathrm{KL}(\pi^+ \,\|\, p_\text{ref}^+)}_{\textcolor{#c0392b}{\text{excess from } \pi\neq p_\text{ref}}} \]
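[Figure: student \(p_\text{ref}\) and teacher \(p_T\) with their safety-filtered conditionals \(p_\text{ref}^+\) and \(p_T^+\); the gap \(\mathrm{KL}(p_T^+ \,\|\, p_\text{ref}^+)\) is irreducible and \(> 0\).]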
Consequence

Whenever \(p_T^+ \neq p_\text{ref}^+\), training on \(p_T^+\) adds an excess KL term, even with same-size teachers. More filtering or more data cannot remove that mismatch.

This is our formal “teacher-induced distribution shift.”
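A small numerical check of the decomposition on a toy four-response space. The probabilities are made up for illustration and do not come from any model; the first two responses are treated as safe.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def safe_filter(dist, safe):
    """Return (acceptance rate alpha, distribution conditioned on being safe)."""
    alpha = sum(d for d, s in zip(dist, safe) if s)
    filtered = [d / alpha if s else 0.0 for d, s in zip(dist, safe)]
    return alpha, filtered


# Toy setup (illustrative numbers only).
safe = [True, True, False, False]
p_ref = [0.05, 0.15, 0.50, 0.30]      # frozen student
p_teacher = [0.60, 0.20, 0.10, 0.10]  # mismatched teacher

alpha_ref, p_ref_plus = safe_filter(p_ref, safe)
_, p_t_plus = safe_filter(p_teacher, safe)

filter_cost = -math.log(alpha_ref)   # unavoidable safe-filter cost
excess = kl(p_t_plus, p_ref_plus)    # excess from teacher mismatch
total = kl(p_t_plus, p_ref)          # what training on teacher data pays

print(f"filter cost = {filter_cost:.4f}")
print(f"excess KL   = {excess:.4f}")
print(f"total KL    = {total:.4f}")
```

In this toy case the self-filtered target pays only the filter cost, while the teacher's filtered data pays the same cost plus a strictly positive excess.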

Refusal Steering: The Tilt Assumption

What does prepending \(I_\text{refusal}\) do to the student's distribution? Modeling assumption: it's a label-only odds-shift.

\[ p_\text{ref}(y \mid I_\text{refusal}, x_h) \;\propto\; \begin{cases} \omega(x_h)\cdot p_\text{ref}(y\mid x_h) & \text{if } \varphi(x_h, y) = 1 \quad(\text{safe})\\[2pt] p_\text{ref}(y\mid x_h) & \text{if } \varphi(x_h, y) = 0 \quad(\text{unsafe}) \end{cases} \]
In words

Refusal instruction reweights all safe responses by the same factor \(\omega(x_h) > 1\), and leaves unsafe responses alone.

Within the safe set and within the unsafe set, relative probabilities are preserved. Only the odds between them change.

Falsifiable

ω ≫ 1 — model has latent safety; steering works.

ω ≈ 1 — no latent safety; need external supervision.

Corollary: Same Target, Higher Acceptance

Let \(\pi_h = p_\text{ref}(\cdot \mid I_\text{refusal}, x_h)\). Under the tilt assumption:

Target preserved

\( \pi_h^+(\cdot \mid x_h) \;=\; p_\text{ref}^+(\cdot \mid x_h) \)

→ Zero excess KL. KL-optimal.

Acceptance boosted

\( \alpha_{\pi_h} = \dfrac{\omega\,\alpha_\text{ref}}{1 + (\omega-1)\alpha_\text{ref}} \;\geq\; \alpha_\text{ref} \)

→ Up to ω× fewer generations.
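Both statements can be derived from the tilt assumption in a few lines. A sketch, writing \(\alpha_\text{ref}\) for \(\alpha_\text{ref}(x_h)\) and \(\omega\) for \(\omega(x_h)\). The normalizer of the steered distribution is

\[
Z \;=\; \omega\,\alpha_\text{ref} + (1-\alpha_\text{ref}) \;=\; 1 + (\omega-1)\,\alpha_\text{ref},
\qquad
\alpha_{\pi_h} \;=\; \frac{\omega\,\alpha_\text{ref}}{Z}.
\]

For any safe \(y\), conditioning the steered distribution on safety gives

\[
\pi_h^+(y\mid x_h)
\;=\; \frac{\omega\,p_\text{ref}(y\mid x_h)/Z}{\omega\,\alpha_\text{ref}/Z}
\;=\; \frac{p_\text{ref}(y\mid x_h)}{\alpha_\text{ref}}
\;=\; p_\text{ref}^+(y\mid x_h),
\]

so the factor \(\omega\) cancels and only the acceptance rate changes.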

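[Figure: steered acceptance \(\alpha_\text{steered}\) vs. native acceptance \(\alpha_\text{ref}(x_h)\), both on \([0,1]\), with curves for no steering (\(y = x\)), \(\omega = 2\), and \(\omega = 5\). The largest multiplicative gain occurs at low \(\alpha\).]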

Hardest harmful prompts (smallest \(\alpha_\text{ref}\)) get the largest multiplicative boost: with \(f_\omega(\alpha) = \omega\alpha / (1 + (\omega-1)\alpha)\) and \(\omega > 1\), \(f_\omega(a)/f_\omega(b) > a/b\) whenever \(a < b\).
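The acceptance formula and the "same target" claim can be checked numerically on a toy distribution. The numbers and the helper names `tilt` and `steered_acceptance` are illustrative assumptions, not part of the method's code.

```python
def steered_acceptance(alpha_ref: float, omega: float) -> float:
    """Acceptance rate after steering, under the tilt assumption."""
    return omega * alpha_ref / (1.0 + (omega - 1.0) * alpha_ref)


def tilt(dist, safe, omega):
    """Reweight safe responses by omega, leave unsafe ones alone, renormalize."""
    weights = [omega * d if s else d for d, s in zip(dist, safe)]
    z = sum(weights)
    return [w / z for w in weights]


def condition_on_safe(dist, safe):
    """Distribution conditioned on the response being safe."""
    alpha = sum(d for d, s in zip(dist, safe) if s)
    return [d / alpha if s else 0.0 for d, s in zip(dist, safe)]


# Toy student with 20% native acceptance (illustrative numbers only).
safe = [True, True, False, False]
p_ref = [0.05, 0.15, 0.50, 0.30]
omega = 5.0

steered = tilt(p_ref, safe, omega)
alpha_steered = sum(d for d, s in zip(steered, safe) if s)

print("steered acceptance:", round(alpha_steered, 4))
print("filtered target unchanged:",
      all(abs(a - b) < 1e-12
          for a, b in zip(condition_on_safe(steered, safe),
                          condition_on_safe(p_ref, safe))))

# Multiplicative gain f(alpha) / alpha shrinks as native acceptance grows.
for alpha in (0.01, 0.1, 0.5):
    gain = steered_acceptance(alpha, omega) / alpha
    print(f"alpha_ref={alpha:<4} gain={gain:.2f}x")
```

The gain stays below \(\omega\) and approaches it only as the native acceptance goes to zero, which is the regime where plain rejection sampling is data-starved.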

Intuition behind ThinkSafe

1. Self-filter is the unique KL minimizer.

Among safe distributions, \(p_\text{ref}^+\) alone attains the minimum KL to the student.

2. Teacher mismatch is irreducible.

If \(p_T^+ \neq p_\text{ref}^+\), then \(\mathrm{KL}(p_T^+ \| p_\text{ref}^+) > 0\), independent of teacher capacity.

3. Refusal steering preserves the target.

Under tilt: \(\pi_h^+ = p_\text{ref}^+\), while the acceptance-rate speedup approaches \(\omega\times\) on low-acceptance prompts.

ThinkSafe combines zero excess KL under the tilt assumption with tractable acceptance, using only offline sampling.
PART II  ·  EXPERIMENTS
Does it actually work?
Qwen3 (0.6B–8B) and DeepSeek-R1-Distill (1.5B–8B).
4 reasoning benchmarks · 4 safety benchmarks · 5 strong baselines.

Experimental Setup

Models
  • Qwen3: 0.6B / 1.7B / 4B / 8B
  • DeepSeek-R1-Distill: 1.5B / 7B / 8B
  • LoRA: r=32, \(\alpha\)=16
  • 3 epochs, 2\(\times\) H100
Reasoning ↑
  • GSM8K
  • MATH500
  • AIME 2024
  • GPQA

pass@1, 8 samples

Safety ↓
  • HarmBench
  • StrongReject
  • WildJailbreak
  • XSTest (over-refusal)

harmful % + over-refusal

Baselines
  • DirectRefusal
  • SafeChain (teacher: R1-685B)
  • STAR-1 (LLM-as-judge)
  • SafePath (cue injection)
  • SafeKey (dual-path head)
Data. Same prompts as SafeChain. ThinkSafe uses only the frozen initial student — no teacher, no online rollout.

Main Result: A New Pareto Frontier

Robustness (100−safety score, ↑) vs. Reasoning (avg pass@1, →) on Qwen3.

Pareto frontier

ThinkSafe gives the best safety-reasoning balance across Qwen3 sizes — stars trace the upper frontier.

Headline Numbers on Qwen3-4B

Same base model and evaluation suite — vs. the initial reasoning-tuned Qwen3-4B.

HarmBench (harmful %)
38.21 → 9.63
−75% relative
Safety Avg (harmful %)
22.58 → 5.05
−78% relative
Reasoning Avg (pass@1)
74.47 → 77.18
+2.7 absolute
AIME 2024 (pass@1)
67.50 → 73.33
+5.8 absolute
Every other baseline either lost reasoning or kept safety-average harmfulness above 17%.
Safer and smarter — not a trade-off.
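The relative and absolute changes quoted above can be recomputed from the before/after values on this slide. A small sketch; the helper name `relative_change` is ours.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the slide.
harmful_rates = {
    "HarmBench (harmful %)": (38.21, 9.63),
    "Safety Avg (harmful %)": (22.58, 5.05),
}
pass_at_1 = {
    "Reasoning Avg (pass@1)": (74.47, 77.18),
    "AIME 2024 (pass@1)": (67.50, 73.33),
}

# Safety metrics are reported as relative reductions.
for name, (before, after) in harmful_rates.items():
    print(f"{name}: {relative_change(before, after):+.1f}% relative")

# Reasoning metrics are reported as absolute point differences.
for name, (before, after) in pass_at_1.items():
    print(f"{name}: {after - before:+.2f} absolute")
```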

Results on Qwen3 (Safety Avg ↓ / Reasoning Avg ↑)

Each cell: safety avg (harmful %, ↓) / reasoning avg (pass@1, ↑).

| Method | 0.6B | 1.7B | 4B | 8B |
|---|---|---|---|---|
| Initial | 48.22 / 44.95 | 35.27 / 64.87 | 22.58 / 74.47 | 19.57 / 76.08 |
| DirectRefusal | 43.89 / 40.62 | 35.41 / 63.98 | 29.80 / 74.29 | 23.00 / 77.98 |
| SafeChain | 45.20 / 39.86 | 37.58 / 60.93 | 31.64 / 73.93 | 29.44 / 78.68 |
| STAR-1 | 41.92 / 41.69 | 25.61 / 65.02 | 20.36 / 74.62 | 15.44 / 78.59 |
| SafePath | 46.22 / 44.26 | 35.27 / 64.60 | 22.28 / 75.85 | 20.64 / 78.64 |
| SafeKey | 45.25 / 42.03 | 30.68 / 62.70 | 17.33 / 75.89 | 17.33 / 78.91 |
| ThinkSafe (ours) | 29.65 / 43.97 | 17.38 / 64.39 | 5.05 / 77.18 | 4.50 / 78.50 |
ThinkSafe wins the safety average at every size, while staying within 1 pt of the best reasoning.

vs. Online RL: Same Quality at ~1/8 the Cost

Qwen3-0.6B trained with GRPO, On-Policy Distillation (OPD, teacher = Qwen3-8B), ThinkSafe, and ThinkSafe+KL.

[Figure: ThinkSafe (TS) vs. online RL on Qwen3-0.6B, three bar charts.]

| Method | Safety score (harmful %, ↓) | Reasoning score (pass@1, ↑) | Train + Gen time (hours, ↓) |
|---|---|---|---|
| Initial | 48.2 | 45.0 | n/a |
| GRPO | 37.0 | 45.7 | 21.3 |
| OPD | 41.4 | 44.9 | 88.2 |
| TS | 29.6 | 44.0 | 2.6 |
| TS+KL | 26.4 | 45.5 | 3.0 |
~8× faster than GRPO · ~30× faster than OPD — and beats both on safety.
With +KL regularization, ThinkSafe matches GRPO on reasoning too.

Refusal Steering Is the Active Ingredient

Drop \(I_\text{refusal}\) → pure rejection sampling. Strict filter (5/5 accepts) starves the data.

[Figure: refusal steering ablation. Harmful response ratio on Qwen3 (↓).]

| Model | Initial | Rejection sampling | ThinkSafe (refusal-steered) |
|---|---|---|---|
| 0.6B | 48.2 | 47.9 | 29.6 |
| 1.7B | 35.3 | 46.8 | 17.8 |
| 4B | 22.6 | 25.9 | 5.1 |
| 8B | 19.6 | 21.3 | 4.5 |
Without \(I_\text{refusal}\): \(\alpha_\text{ref}(x_h) \approx 0\) on hard prompts → nearly all training signal discarded.
Steering is what makes the KL-optimal target empirically reachable.

It's Distribution, Not Capacity

Similar-size teachers, different architectures: swap safety data between Qwen3 and R1-Distill.

Cross-model distillation heatmap
Diagonal (self-generated) cells are the only ones that improve safety and preserve reasoning. Off-diagonals support the theory: when \(p_T^+\) differs from \(p_\text{ref}^+\), the penalty is about distribution, not just size.

Excess KL Is Real: Perplexity of Training Data

The log-perplexity of each method's training set under the frozen student is the cross-entropy, an empirical proxy for \(\mathrm{KL}(\pi^+ \| p_\text{ref}) + H(\pi^+)\). Comparable trace lengths ⇒ gaps primarily reflect excess KL.

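[Figure: perplexity (↓) of each method's training data under the frozen student.]

| Student | ThinkSafe | SafeChain (teacher: R1-685B) | STAR-1 (teacher) |
|---|---|---|---|
| Qw-0.6B | 2.06 | 4.91 | 5.71 |
| Qw-1.7B | 1.55 | 5.53 | 7.35 |
| Qw-4B | 1.53 | 4.00 | 4.96 |
| Qw-8B | 1.59 | 3.33 | 4.44 |
| R1-1.5B | 2.01 | 4.55 | 4.75 |
| R1-7B | 1.63 | 3.56 | 3.98 |
| R1-8B | 1.66 | 2.82 | 3.59 |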

Teacher data is consistently more “surprising” to the student than self-generated data — an empirical proxy for excess KL.
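For reference, perplexity of a dataset under a model is the exponential of the average negative token log-probability. A minimal sketch, assuming the per-token log-probabilities (in nats) of each training response under the frozen student have already been computed; the input format and the token-level averaging over the whole corpus are assumptions, not details taken from the slides.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[Sequence[float]]) -> float:
    """Corpus-level perplexity from per-token log-probs (natural log).

    token_logprobs[i][t] is log p_ref(token t of response i | prefix),
    a hypothetical input produced by scoring the data with the frozen student.
    """
    total_logprob = sum(sum(seq) for seq in token_logprobs)
    num_tokens = sum(len(seq) for seq in token_logprobs)
    if num_tokens == 0:
        raise ValueError("need at least one token")
    # exp of the average negative log-likelihood per token
    return math.exp(-total_logprob / num_tokens)


# Toy usage: tokens the student finds likely vs. tokens it finds surprising.
self_generated = [[math.log(0.6), math.log(0.7)], [math.log(0.8)]]
teacher_written = [[math.log(0.2), math.log(0.3)], [math.log(0.25)]]
print(perplexity(self_generated) < perplexity(teacher_written))
```

Because the log of this quantity is a cross-entropy, a higher value on teacher data at similar entropy points to a larger KL from the student.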

Strip Safety Reasoning, Lose Both

Ablation: train on refusal without CoT (but keep CoT on benign). Forces the model to context-switch between thinking and not-thinking.

Think-time ablation R1
Why this breaks both

Removing CoT from refusals creates inconsistent optimization: model must learn to think on benign but skip thinking on harmful prompts.

This destabilizes the chain-of-thought patterns themselves — R1-8B reasoning: 67.5 → 64.1.

In these ablations, safety reasoning at train time stabilizes the CoT distribution.
[Figure: harmful % on R1-Distill (↓). R1-7B: 29.5 with safety reasoning vs. 44.4 without. R1-8B: 19.1 with safety reasoning vs. 33.7 without.]

Teacher Distillation: Same Family, Same Pain

Use larger teachers within the same family to generate safety data for the small student.

Same-family teacher distillation
Observation

Larger Qwen3 teachers improve safety, but substantially reduce reasoning: Qwen3-4B teacher costs the 0.6B student 23 pts of reasoning.

Self-generation (size = student) is the only column near zero.


R1-Distill-1.5B can borrow some safety from larger teachers, but self-generation remains the strongest reference point.

[Figure: student Qwen3-0.6B, reasoning gain (%, ↑ better) by Qwen3 teacher size: self 0.0 · 1.7B −5.2 · 4B −23.0 · 8B −19.5.]
On R1-Distill-1.5B with R1-Distill-7B/8B teachers, self-generation remains the strongest reference point.

What Did We Just Show?

Theory. Self-filter \(p_\text{ref}^+\) is the unique KL-optimal safe target. Under the tilt assumption, refusal steering preserves it exactly while boosting acceptance.
Pareto frontier. ThinkSafe achieves the most favorable safety-reasoning trade-off across Qwen3 and R1-Distill evaluations.
Online RL. Beats GRPO and OPD on safety at ~1/8 the compute; matches them on reasoning with +KL.
Ablations. Refusal steering is necessary in the rejection-sampling comparison. Cross-model teachers consistently degrade reasoning, and stripping CoT from refusals degrades both safety and reasoning.
Perplexity. Teacher data is consistently more surprising to the student than self data — the excess KL is empirically visible.

Limitations & Where the Method Could Fail

Tilt \(\omega \approx 1\)

If the student has no latent safety (e.g. base model never aligned), refusal steering can't elicit what isn't there. External supervision becomes necessary.

Filter quality cap

We rely on Llama-Guard-3 as \(\varphi\). Imperfect filter → some unsafe traces survive into training data.

Offline drift

We approximate the on-policy objective with a static dataset. As \(p_\theta\) drifts during fine-tuning, the dataset becomes off-policy.

Scope

LoRA, single-turn prompts, models \(\le\) 8B. Full FT, multi-turn, agentic settings remain open.

Future Directions

ThinkSafe suggests a broader recipe: elicit latent safety first, then spend expensive optimization where self-generation is insufficient.

Stronger filters

Replace a fixed guard model with calibrated, multi-judge, or task-specific verifiers.

Iterative self-training

Re-sample from improved checkpoints and refine refusal reasoning over rounds.

On-policy self-distillation

OPSD / SDPO suggest training on the model's own rollouts with a context-augmented self-teacher, reducing off-policy drift.

Similar to refusal steering: same model + extra context makes a stronger conditional policy.

[Figure: the student produces its own rollout; a self-teacher (same model, more context) sees extra context such as \(I_\text{refusal}\), a solution, or feedback; its output is distilled back into the student on-policy.]

Thank you

Questions?

Seanie Lee
Sangwoo Park

Code & models: GitHub  ·  Hugging Face
