ThinkSafe: Self-Generated
Safety Alignment for
Reasoning Models

Seanie Lee*, Sangwoo Park*, Yumin Choi, Gyeongman Kim, Minki Kang,

Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang

KAIST · KRAFTON · UC Berkeley

Reasoning Models Are Powerful... But Potentially Unsafe

RL post-training (GRPO, PPO) on math & code unlocks long CoT reasoning—but silently erodes safety guardrails the model was originally trained with.

The Two-Sided Regression Problem

Naively re-aligning safety erodes reasoning.
Pure reasoning post-training erodes safety.
So how do we re-align without paying a reasoning tax?

Failure mode A: Teacher distillation

SafeChain, STAR-1, SafeKey: copy safe responses from a larger teacher.

Forces student off its own distribution
Reasoning degrades — even with same-size teachers

Failure mode B: Naive self-distillation

Sample from the student itself. Stays in-distribution.

Compliance priors suppress safety mechanisms
Acceptance rate on harmful prompts ≈ 0 — data-starved

Failure mode C: Online RL / GRPO

Sample on-policy, optimize a safety reward.

Avoids teacher shift
~8× slower than offline SFT — prohibitive

Can we get safety alignment that is teacher-free, compute-efficient, and in-distribution?

Three Paths for One Harmful Prompt

Harmful query: “Create and distribute fake licenses or diplomas.”

Teacher data is safe but off-distribution · self-distillation stays in-distribution but is unsafe · ThinkSafe achieves both via refusal steering.

Key Insight: Latent Safety Knowledge

Hypothesis

Reasoning post-training suppresses safety.
But the underlying model still knows what is harmful.

Why — intuition:

The model was originally safety-tuned before RL.
RL on math/code rewards helpfulness, not unsafety per se.
Hypothesis: safety knowledge is masked, not fully erased.

If we can elicit what the model already knows, we can often avoid an external teacher.

The Trick: Refusal Steering

Prepend one sentence to harmful prompts. The frozen student — not a teacher — generates the response.

I_refusal

“The following prompt is harmful.
You should refuse to answer the prompt.”

In-distribution

same student, no architecture change

Cheap

offline sampling once, then SFT

KL-optimal target

under the refusal-tilt assumption

Higher acceptance

data-rich on hard prompts

ThinkSafe Pipeline at a Glance

No teacher. No online rollout. Just sample-once, filter-once, fine-tune.
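A minimal sketch of the sample-once, filter-once step, under stated assumptions. `sample_fn` (the frozen student's sampler) and `is_safe_fn` (the guard) are hypothetical callables supplied by the caller, `samples_per_prompt` is an illustrative value, and storing the pair without the refusal instruction is an assumption based on the target being conditioned on the original prompt. This is not the authors' released code.

```python
from typing import Callable, Dict, List

# Refusal instruction quoted from the slides.
I_REFUSAL = (
    "The following prompt is harmful. "
    "You should refuse to answer the prompt."
)


def build_thinksafe_dataset(
    harmful_prompts: List[str],
    sample_fn: Callable[[str], str],         # hypothetical: frozen student sampler
    is_safe_fn: Callable[[str, str], bool],  # hypothetical: guard verdict phi(x, y)
    samples_per_prompt: int = 4,             # assumed value, not from the slides
) -> List[Dict[str, str]]:
    """Sample with refusal steering, keep guard-approved responses for SFT."""
    dataset: List[Dict[str, str]] = []
    for x in harmful_prompts:
        steered = f"{I_REFUSAL}\n\n{x}"
        for _ in range(samples_per_prompt):
            y = sample_fn(steered)
            # Assumption: the guard scores the response against the original
            # prompt, and the stored pair omits the refusal instruction.
            if is_safe_fn(x, y):
                dataset.append({"prompt": x, "response": y})
    return dataset
```

The returned pairs would then feed a standard offline SFT run; no teacher model and no online rollout appear anywhere in the loop.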

PART I · THEORY

Why self-generation is
KL-optimal

A KL-projection view of safety realignment.
Unique target, irreducible teacher gap, and steering that preserves the target.

Setup: Sources, Filters, Targets

Fix a prompt \(x\). For a response \(y\),
let \(\varphi(x,y)\in\{0,1\}\) be the verdict of a safety guard (Llama-Guard-3), with \(\varphi=1\) meaning safe.

Two players

Student \(p_\text{ref}\) — frozen, the model we're trying to safety-tune (KL anchor)
Source \(\pi\) — whatever distribution generated the training response (could be student, teacher, or steered student)

Acceptance rate & filtered conditional

\[ \alpha_\pi(x) = \Pr_{y\sim\pi}[\varphi(x,y)=1] \]

\[ \pi^+(y\mid x) = \dfrac{\pi(y\mid x)\,\varphi(x,y)}{\alpha_\pi(x)} \]

i.e. condition the source on being safe.
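The two definitions can be made concrete on a toy discrete source. The response names and probabilities below are made up for illustration; only the formulas come from the slide.

```python
from typing import Dict, Tuple


def acceptance_and_filtered(
    pi: Dict[str, float], phi: Dict[str, int]
) -> Tuple[float, Dict[str, float]]:
    """Return (alpha_pi, pi_plus) for a discrete source pi and guard labels phi."""
    alpha = sum(p for y, p in pi.items() if phi[y] == 1)
    if alpha == 0.0:
        raise ValueError("source never produces a safe response; pi_plus is undefined")
    # Zero out unsafe responses, then renormalize by the acceptance rate.
    pi_plus = {y: (p * phi[y]) / alpha for y, p in pi.items()}
    return alpha, pi_plus


# Toy source over three responses (illustrative numbers only).
pi = {"refuse_a": 0.1, "refuse_b": 0.1, "comply": 0.8}
phi = {"refuse_a": 1, "refuse_b": 1, "comply": 0}
alpha, pi_plus = acceptance_and_filtered(pi, phi)
```

Here 20% of samples pass the guard, and the filtered conditional splits its mass evenly between the two refusals.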

Why KL? SFT minimizes forward KL. Pinsker's inequality bounds the shift in any reasoning score in \([0,1]\) by \(\sqrt{\mathrm{KL}/2}\).

Lower KL from target to \(p_\text{ref}\) \(\Rightarrow\) less reasoning drift after training.
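The step from KL to score drift can be written out. Assumptions for this sketch: \(s(x,y)\in[0,1]\) is a bounded per-response score (for example, correctness), \(q\) is the training target, KL is measured in nats, and \(\mathrm{TV}\) denotes total variation distance.

\[ \Big|\,\mathbb{E}_{y\sim q}[s(x,y)] - \mathbb{E}_{y\sim p_\text{ref}}[s(x,y)]\,\Big| \;\le\; \mathrm{TV}\!\left(q,\, p_\text{ref}\right) \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}\!\left(q \,\big\|\, p_\text{ref}\right)} \]

The first inequality holds for any score bounded in \([0,1]\); the second is Pinsker's inequality.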

Safety-filtered Student Is the Unique Optimum

For any safe response distribution \(r\) (supported on \(\varphi=1\)):

\[ \mathrm{KL}\!\left(r \;\big\|\; p_\text{ref}\right) \;=\; \underbrace{-\log \alpha_\text{ref}(x)}_{\text{constant, indep. of } r} \;+\; \underbrace{\mathrm{KL}\!\left(r \;\big\|\; p_\text{ref}^+\right)}_{\geq 0,\;=\,0 \iff r = p_\text{ref}^+} \]

Reading

Among all safe distributions, the one closest to the frozen student is the student's own safety-filtered conditional \(p_\text{ref}^+\).

Best-case KL

\[ \mathrm{KL}(p_\text{ref}^+ \,\|\, p_\text{ref}) = -\log\alpha_\text{ref}(x) \]

Just the “safe-filtering cost”: the irreducible price of becoming “perfectly” safe.
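A short derivation of the decomposition, assuming \(\alpha_\text{ref}(x) > 0\) and a discrete response space. Because \(r\) is supported on \(\varphi = 1\), wherever \(r(y) > 0\) we have \(p_\text{ref}(y\mid x) = \alpha_\text{ref}(x)\, p_\text{ref}^+(y\mid x)\), so

\[ \mathrm{KL}(r \,\|\, p_\text{ref}) \;=\; \sum_y r(y)\log\frac{r(y)}{p_\text{ref}(y\mid x)} \;=\; \sum_y r(y)\log\frac{r(y)}{\alpha_\text{ref}(x)\,p_\text{ref}^+(y\mid x)} \;=\; -\log\alpha_\text{ref}(x) \;+\; \mathrm{KL}(r \,\|\, p_\text{ref}^+) \]

Setting \(r = p_\text{ref}^+\) makes the second term vanish, which gives the best-case value.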

Mismatched Teachers Pay an Irreducible Penalty

Apply the lemma with \(r = \pi^+\) for any source \(\pi\):

\[ \mathrm{KL}(\pi^+ \,\|\, p_\text{ref}) \;=\; \underbrace{-\log\alpha_\text{ref}(x)}_{\textcolor{#1a7a3a}{\text{unavoidable safe-filter cost}}} \;+\; \underbrace{\mathrm{KL}(\pi^+ \,\|\, p_\text{ref}^+)}_{\textcolor{#c0392b}{\text{excess from } \pi\neq p_\text{ref}}} \]

Consequence

Whenever the filtered teacher distribution \(p_T^+\) differs from \(p_\text{ref}^+\), training on \(p_T^+\) adds an excess KL term. Even with a same-size teacher, more filtering or more data cannot remove that mismatch.

This is our formal “teacher-induced distribution shift.”
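The decomposition can be checked numerically on a toy example. The student and teacher distributions below are invented for illustration; the check only confirms the identity, not any measured gap.

```python
import math
from typing import Dict, Tuple


def kl(q: Dict[str, float], p: Dict[str, float]) -> float:
    """KL(q || p) in nats over a shared discrete support, with 0 * log 0 := 0."""
    return sum(qv * math.log(qv / p[y]) for y, qv in q.items() if qv > 0)


def filtered(pi: Dict[str, float], phi: Dict[str, int]) -> Tuple[float, Dict[str, float]]:
    """Acceptance rate and safety-filtered conditional of a discrete source."""
    alpha = sum(p for y, p in pi.items() if phi[y])
    return alpha, {y: (p if phi[y] else 0.0) / alpha for y, p in pi.items()}


phi = {"r1": 1, "r2": 1, "c": 0}                  # r1, r2 are safe; c is unsafe
p_ref = {"r1": 0.15, "r2": 0.05, "c": 0.80}       # toy student
p_teacher = {"r1": 0.10, "r2": 0.80, "c": 0.10}   # toy teacher

alpha_ref, p_ref_plus = filtered(p_ref, phi)
_, p_t_plus = filtered(p_teacher, phi)

floor = -math.log(alpha_ref)            # unavoidable safe-filter cost
self_cost = kl(p_ref_plus, p_ref)       # training on the student's own filtered data
teacher_cost = kl(p_t_plus, p_ref)      # training on filtered teacher data
excess = kl(p_t_plus, p_ref_plus)       # penalty from the teacher mismatch
```

In this toy case `self_cost` equals `floor`, while `teacher_cost` equals `floor + excess` with `excess > 0`, mirroring the two terms in the equation.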

Refusal Steering: The Tilt Assumption

What does prepending \(I_\text{refusal}\) do to the student's distribution? Modeling assumption: it's a label-only odds-shift.

\[ p_\text{ref}(y \mid I_\text{refusal}, x_h) \;\propto\; \begin{cases} \omega(x_h)\cdot p_\text{ref}(y\mid x_h) & \text{if } \varphi(x_h, y) = 1 \quad(\text{safe})\\[2pt] p_\text{ref}(y\mid x_h) & \text{if } \varphi(x_h, y) = 0 \quad(\text{unsafe}) \end{cases} \]

In words

Refusal instruction reweights all safe responses by the same factor \(\omega(x_h) > 1\), and leaves unsafe responses alone.

Within the safe set and within the unsafe set, relative probabilities are preserved. Only the odds between them change.

Falsifiable

ω ≫ 1 — model has latent safety; steering works.

ω ≈ 1 — no latent safety; need external supervision.
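A toy illustration of the tilt assumption, using made-up probabilities and an arbitrary \(\omega = 50\). It shows what the assumption implies if it holds; it does not test whether a real model behaves this way.

```python
from typing import Dict


def apply_tilt(p: Dict[str, float], phi: Dict[str, int], omega: float) -> Dict[str, float]:
    """Multiply safe responses by omega, leave unsafe ones alone, renormalize."""
    unnorm = {y: (omega * v if phi[y] == 1 else v) for y, v in p.items()}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}


# Illustrative student distribution on one harmful prompt.
p_ref = {"refuse_short": 0.03, "refuse_long": 0.01, "comply_a": 0.66, "comply_b": 0.30}
phi = {"refuse_short": 1, "refuse_long": 1, "comply_a": 0, "comply_b": 0}
steered = apply_tilt(p_ref, phi, omega=50.0)

# Relative probabilities inside the safe set are unchanged by the tilt.
safe_ratio_before = p_ref["refuse_short"] / p_ref["refuse_long"]
safe_ratio_after = steered["refuse_short"] / steered["refuse_long"]

# Only the total mass on safe responses moves.
safe_mass_before = p_ref["refuse_short"] + p_ref["refuse_long"]
safe_mass_after = steered["refuse_short"] + steered["refuse_long"]
```

With `omega=1.0` the function returns the original distribution, which corresponds to the "no latent safety" case.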

Corollary: Same Target, Higher Acceptance

Let \(\pi_h = p_\text{ref}(\cdot \mid I_\text{refusal}, x_h)\). Under the tilt assumption:

Target preserved

\[ \pi_h^+(\cdot \mid x_h) \;=\; p_\text{ref}^+(\cdot \mid x_h) \]

→ Zero excess KL. KL-optimal.

Acceptance boosted

\[ \alpha_{\pi_h} = \dfrac{\omega\,\alpha_\text{ref}}{1 + (\omega-1)\alpha_\text{ref}} \;\geq\; \alpha_\text{ref} \]

→ Up to ω× fewer generations.

Hardest harmful prompts (smallest \(\alpha_\text{ref}\)) get the largest multiplicative boost: with \(f_\omega(\alpha) = \omega\alpha / (1 + (\omega-1)\alpha)\) and \(\omega > 1\), \(f_\omega(a)/f_\omega(b) > a/b\) whenever \(a < b\).
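The acceptance formula is easy to tabulate. In this sketch \(\omega = 20\) and the base acceptance rates are arbitrary illustrative values; `f_omega` and `boost` are names introduced here.

```python
def f_omega(alpha: float, omega: float) -> float:
    """Acceptance rate after steering: omega*alpha / (1 + (omega - 1)*alpha)."""
    return omega * alpha / (1.0 + (omega - 1.0) * alpha)


def boost(alpha: float, omega: float) -> float:
    """Multiplicative gain in acceptance, f_omega(alpha) / alpha."""
    return f_omega(alpha, omega) / alpha


omega = 20.0  # illustrative tilt strength
for alpha in (0.001, 0.01, 0.1, 0.5):
    print(
        f"alpha_ref={alpha:<6} steered={f_omega(alpha, omega):.4f} "
        f"boost={boost(alpha, omega):.2f}x"
    )
```

The boost equals \(\omega / (1 + (\omega-1)\alpha_\text{ref})\), so it stays below \(\omega\) and shrinks as \(\alpha_\text{ref}\) grows, which is why low-acceptance prompts benefit most.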

Intuition behind ThinkSafe

1

Self-filter is unique KL-minimizer.

Among safe distributions, \(p_\text{ref}^+\) alone matches the student.

2

Teacher mismatch is irreducible.

If \(p_T^+ \neq p_\text{ref}^+\), then \(\mathrm{KL}(p_T^+ \| p_\text{ref}^+) > 0\) — capacity-independent.

3

Refusal steering preserves the target.

Under tilt: \(\pi_h^+ = p_\text{ref}^+\), while the acceptance-rate speedup approaches \(\omega\times\) on low-acceptance prompts.

ThinkSafe combines zero excess KL under the tilt assumption with tractable acceptance, using only offline sampling.

PART II · EXPERIMENTS

Does it actually work?

Qwen3 (0.6B–8B) and DeepSeek-R1-Distill (1.5B–8B).
4 reasoning benchmarks · 4 safety benchmarks · 5 strong baselines.

Experimental Setup

Models

Qwen3: 0.6B / 1.7B / 4B / 8B
DeepSeek-R1-Distill: 1.5B / 7B / 8B
LoRA: r=32, \(\alpha\)=16
3 epochs, 2\(\times\) H100

Reasoning ↑

GSM8K
MATH500
AIME 2024
GPQA

pass@1, 8 samples

Safety ↓

HarmBench
StrongReject
WildJailbreak
XSTest (over-refusal)

harmful % + over-refusal

Baselines

DirectRefusal
SafeChain (teacher: R1-685B)
STAR-1 (LLM-as-judge)
SafePath (cue injection)
SafeKey (dual-path head)

Data. Same prompts as SafeChain. ThinkSafe uses only the frozen initial student — no teacher, no online rollout.

Main Result: A New Pareto Frontier

Robustness (100−safety score, ↑) vs. Reasoning (avg pass@1, →) on Qwen3.

ThinkSafe gives the best safety-reasoning balance across Qwen3 sizes — stars trace the upper frontier.

Headline Numbers on Qwen3-4B

Same base model and evaluation suite — vs. the initial reasoning-tuned Qwen3-4B.

HarmBench (harmful %)

38.21 → 9.63

−75% relative

Safety Avg (harmful %)

22.58 → 5.05

−78% relative

Reasoning Avg (pass@1)

74.47 → 77.18

+2.7 absolute

AIME 2024 (pass@1)

67.50 → 73.33

+5.8 absolute

Every other baseline either lost reasoning or kept safety-average harmfulness above 17%.
Safer and smarter — not a trade-off.
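The relative and absolute changes quoted above follow directly from the before/after numbers; a small check, with `relative_change_pct` being a helper name introduced here:

```python
def relative_change_pct(before: float, after: float) -> float:
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


# Before/after values copied from the Qwen3-4B headline numbers.
harmbench_rel = relative_change_pct(38.21, 9.63)
safety_avg_rel = relative_change_pct(22.58, 5.05)
reasoning_gain = 77.18 - 74.47   # absolute points
aime_gain = 73.33 - 67.50        # absolute points

print(f"HarmBench:   {harmbench_rel:.1f}% relative")
print(f"Safety avg:  {safety_avg_rel:.1f}% relative")
print(f"Reasoning:   +{reasoning_gain:.1f} absolute")
print(f"AIME 2024:   +{aime_gain:.1f} absolute")
```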

Results on Qwen3 (Safety Avg ↓ / Reasoning Avg ↑)

Method	0.6B (safety / reasoning)	1.7B (safety / reasoning)	4B (safety / reasoning)	8B (safety / reasoning)
Initial	48.22 / 44.95	35.27 / 64.87	22.58 / 74.47	19.57 / 76.08
DirectRefusal	43.89 / 40.62	35.41 / 63.98	29.80 / 74.29	23.00 / 77.98
SafeChain	45.20 / 39.86	37.58 / 60.93	31.64 / 73.93	29.44 / 78.68
STAR-1	41.92 / 41.69	25.61 / 65.02	20.36 / 74.62	15.44 / 78.59
SafePath	46.22 / 44.26	35.27 / 64.60	22.28 / 75.85	20.64 / 78.64
SafeKey	45.25 / 42.03	30.68 / 62.70	17.33 / 75.89	17.33 / 78.91
ThinkSafe (ours)	29.65 / 43.97	17.38 / 64.39	5.05 / 77.18	4.50 / 78.50

ThinkSafe wins the safety average at every size, while staying within 1 pt of the best reasoning.
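The summary claim can be re-derived from the table. The numbers below are copied from the rows above; `summarize` is a helper introduced for this check.

```python
# (safety avg harmful %, reasoning avg pass@1) per Qwen3 size, copied from the table.
SIZES = ["0.6B", "1.7B", "4B", "8B"]
RESULTS = {
    "Initial":       [(48.22, 44.95), (35.27, 64.87), (22.58, 74.47), (19.57, 76.08)],
    "DirectRefusal": [(43.89, 40.62), (35.41, 63.98), (29.80, 74.29), (23.00, 77.98)],
    "SafeChain":     [(45.20, 39.86), (37.58, 60.93), (31.64, 73.93), (29.44, 78.68)],
    "STAR-1":        [(41.92, 41.69), (25.61, 65.02), (20.36, 74.62), (15.44, 78.59)],
    "SafePath":      [(46.22, 44.26), (35.27, 64.60), (22.28, 75.85), (20.64, 78.64)],
    "SafeKey":       [(45.25, 42.03), (30.68, 62.70), (17.33, 75.89), (17.33, 78.91)],
    "ThinkSafe":     [(29.65, 43.97), (17.38, 64.39), (5.05, 77.18), (4.50, 78.50)],
}


def summarize(results, ours="ThinkSafe"):
    """For each size: the safest method and our reasoning gap to the best method."""
    rows = []
    for i, size in enumerate(SIZES):
        safest = min(results, key=lambda m: results[m][i][0])
        best_reasoning = max(v[i][1] for v in results.values())
        gap = best_reasoning - results[ours][i][1]
        rows.append((size, safest, gap))
    return rows


for size, safest, gap in summarize(RESULTS):
    print(f"{size}: safest={safest}, reasoning gap to best={gap:.2f} pts")
```

The largest gap is at 0.6B, where the initial model keeps the best reasoning average.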

vs. Online RL: Same Quality at ~1/8 the Cost

Qwen3-0.6B trained with GRPO, On-Policy Distillation (OPD, teacher = Qwen3-8B), ThinkSafe, and ThinkSafe+KL.

~8× faster than GRPO · ~30× faster than OPD — and beats both on safety.
With +KL regularization, ThinkSafe matches GRPO on reasoning too.

Refusal Steering Is the Active Ingredient

Drop \(I_\text{refusal}\) → pure rejection sampling. Strict filter (5/5 accepts) starves the data.

Without \(I_\text{refusal}\): \(\alpha_\text{ref}(x_h) \approx 0\) on hard prompts → nearly all training signal discarded.
Steering is what makes the KL-optimal target empirically reachable.

It's Distribution, Not Capacity

Similar-size teachers, different architectures: swap safety data between Qwen3 and R1-Distill.

Diagonal (self-generated) cells are the only ones that improve safety and preserve reasoning. Off-diagonals support the theory: when \(p_T^+\) differs from \(p_\text{ref}^+\), the penalty is about distribution, not just size.

Excess KL Is Real: Perplexity of Training Data

The log-perplexity of each method's training set under the frozen student is its per-token cross-entropy, an empirical proxy for \(\mathrm{KL}(\pi^+ \| p_\text{ref}) + H(\pi^+)\). Comparable trace lengths ⇒ gaps primarily reflect excess KL.

Teacher data is consistently more “surprising” to the student than self-generated data — an empirical proxy for excess KL.
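A minimal sketch of the perplexity computation, assuming per-token natural-log probabilities of each training response under the frozen student are already available. How they are obtained is left open, and the toy values below are invented.

```python
import math
from typing import List, Sequence


def corpus_perplexity(token_logprobs: List[Sequence[float]]) -> float:
    """exp of the mean negative log-likelihood per token over all responses."""
    total_nll = -sum(sum(lp) for lp in token_logprobs)
    n_tokens = sum(len(lp) for lp in token_logprobs)
    if n_tokens == 0:
        raise ValueError("no tokens to score")
    return math.exp(total_nll / n_tokens)


# Toy log-probs (illustrative only): the student assigns higher probability
# to tokens it generated itself than to tokens written by a teacher.
self_data = [[-0.4, -0.6, -0.5], [-0.5, -0.5]]
teacher_data = [[-1.4, -1.6, -1.5], [-1.5, -1.5]]

ppl_self = corpus_perplexity(self_data)
ppl_teacher = corpus_perplexity(teacher_data)
```

Averaging over tokens rather than sequences is a choice made here so that long and short traces are weighted by length.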

Strip Safety Reasoning, Lose Both

Ablation: train on refusals without CoT (but keep CoT on benign prompts). This forces the model to context-switch between thinking and not thinking.

Why this breaks both

Removing CoT from refusals creates inconsistent optimization: model must learn to think on benign but skip thinking on harmful prompts.

This destabilizes the chain-of-thought patterns themselves — R1-8B reasoning: 67.5 → 64.1.

In these ablations, safety reasoning at train time stabilizes the CoT distribution.

Teacher Distillation: Same Family, Same Pain

Use larger teachers within the same family to generate safety data for the small student.

Observation

Larger Qwen3 teachers improve safety, but substantially reduce reasoning: Qwen3-4B teacher costs the 0.6B student 23 pts of reasoning.

Self-generation (size = student) is the only column near zero.

R1-Distill-1.5B can borrow some safety from larger teachers, but self-generation remains the strongest reference point.

What Did We Just Show?

✓

Theory. Self-filter \(p_\text{ref}^+\) is the unique KL-optimal safe target. Refusal steering preserves it exactly while boosting acceptance.

✓

Pareto frontier. ThinkSafe achieves the most favorable safety-reasoning trade-off across Qwen3 and R1-Distill evaluations.

✓

Online RL. Beats GRPO and OPD on safety at ~1/8 the compute of GRPO (~1/30 of OPD); matches GRPO on reasoning with +KL.

✓

Ablations. Refusal steering is necessary in the rejection-sampling comparison. Cross-model teachers consistently degrade reasoning, and stripping CoT from refusals destabilizes both.

✓

Perplexity. Teacher data is consistently more surprising to the student than self data — the excess KL is empirically visible.

Limitations & Where the Method Could Fail

Tilt \(\omega \approx 1\)

If the student has no latent safety (e.g. base model never aligned), refusal steering can't elicit what isn't there. External supervision becomes necessary.

Filter quality cap

We rely on Llama-Guard-3 as \(\varphi\). Imperfect filter → some unsafe traces survive into training data.

Offline drift

We approximate the on-policy objective with a static dataset. As \(p_\theta\) drifts during fine-tuning, the dataset becomes off-policy.

Scope

LoRA, single-turn prompts, models \(\le\) 8B. Full FT, multi-turn, agentic settings remain open.

Future Directions

ThinkSafe suggests a broader recipe: elicit latent safety first, then spend expensive optimization where self-generation is insufficient.

Stronger filters

Replace a fixed guard model with calibrated, multi-judge, or task-specific verifiers.

Iterative self-training

Re-sample from improved checkpoints and refine refusal reasoning over rounds.

On-policy self-distillation

OPSD / SDPO suggest training on the model's own rollouts with a context-augmented self-teacher, reducing off-policy drift.

Similar to refusal steering: same model + extra context makes a stronger conditional policy.

Thank you

Questions?

Seanie Lee

Sangwoo Park

Code & models: GitHub · Hugging Face
