Reasoning Models May Decide Before They Think, Study Finds
Summary
- Research shows LLM reasoning models encode action decisions before generating chain-of-thought text
- Linear probes decoded tool-calling decisions from pre-generation activations with high confidence, even before a single reasoning token appeared
- Findings raise serious questions about whether AI chain-of-thought is genuine deliberation or post-hoc rationalization
Details
Linear probes decoded tool-calling decisions from pre-generation activations with high confidence
Probes were applied to internal model activations before any reasoning tokens were generated, meaning the model's internal state already encoded the decision outcome before the visible thinking process began. This is strong evidence that chain-of-thought is not the locus of actual decision-making.
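To make the setup concrete, a linear probe in this context is simply a linear classifier trained on frozen hidden states to predict a behavioral outcome. The sketch below is illustrative, not the paper's code: it uses synthetic data in place of real pre-generation activations, and all variable names (`activations`, `labels`, `d_model`) are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for pre-generation activations: in the real setup these
# would be hidden states captured at the last prompt token, before any
# reasoning tokens are sampled. Here we synthesize separable data.
rng = np.random.default_rng(0)
d_model = 64
n = 500
labels = rng.integers(0, 2, size=n)        # 1 = model ends up calling the tool
direction = rng.normal(size=d_model)       # hypothetical "decision" direction
activations = rng.normal(size=(n, d_model)) + np.outer(labels * 2 - 1, direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.3, random_state=0
)

# A linear probe is just a logistic regression fit on frozen activations;
# high held-out AUROC means the decision is linearly decodable pre-generation.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auroc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"probe AUROC: {auroc:.2f}")
```

If a probe this simple reads the decision out before any reasoning token exists, the visible chain-of-thought cannot be where that decision was made.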
In some cases decisions were detectable even before a single reasoning token was produced
This is the strongest form of the finding: not just that decisions are encoded early in the reasoning chain, but that they can be read from activations at the very start, before any text is output. This directly undermines the premise that reasoning tokens represent genuine deliberation.
Activation steering caused behavioral flips in 7–79% of cases depending on model and benchmark
By perturbing the activation direction associated with the encoded decision, researchers could causally influence the model's output. The wide range reflects variation across models and tasks, but the causal relationship is consistent: pre-reasoning activation state shapes the final behavior.
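The causal intervention can be sketched in miniature. The snippet below is a conceptual toy, not the paper's method: it treats "the decision" as the sign of a projection onto a learned direction (in practice this direction would come from a trained probe), and shows that adding a scaled copy of that direction pushes the state across the decision boundary. All names (`v`, `alpha`, `decision`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Hypothetical learned decision direction, e.g. a probe's normalized
# weight vector. In the paper's setup this would be extracted per model.
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)

def decision(h):
    """Toy readout: sign of the projection onto the decision direction
    stands in for 'will the model call the tool'."""
    return h @ v > 0.0

# An activation currently on the "no tool call" side of the boundary.
h = -2.0 * v + 0.1 * rng.normal(size=d_model)
assert not decision(h)

# Activation steering: add a scaled copy of the direction to shift the
# internal state across the boundary before any text is generated.
alpha = 5.0
h_steered = h + alpha * v
print("before:", decision(h), "after:", decision(h_steered))
```

In a real model the perturbation would be applied to hidden states during the forward pass (e.g. via a hook at a chosen layer), and the reported 7–79% flip rate reflects how reliably such a shift changes the sampled behavior.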
When steering flipped a decision, chain-of-thought rationalized the new outcome rather than resisting it
If chain-of-thought were genuine deliberation, one might expect it to resist or flag an externally induced change. Instead, the reasoning adapted to justify whichever conclusion the perturbed activations produced — consistent with post-hoc rationalization rather than independent reasoning.
Activation steering can reliably manipulate model behavior, raising adversarial robustness concerns
The ability to flip decisions by perturbing internal activations suggests a potential attack surface. If pre-generation activations are the true locus of decision-making, interpretability and safety work focused solely on output text may miss the most important layer.
Paper is an arXiv preprint (April 2026) and has not yet undergone peer review
Submitted April 1, 2026 with a v3 revision on April 3, 2026. As with all preprints, findings should be treated as promising but preliminary until independently replicated and formally peer reviewed.
Legend: Research = empirical finding from the paper; Stat = specific numerical result; Insight = interpretive claim from the authors; Security Alert = adversarial/safety implication; Context = background or caveats
What This Means
If these findings hold up under peer review and replication, they represent a fundamental challenge to how the AI industry thinks about chain-of-thought reasoning — not as a transparent window into model cognition, but as a post-hoc narrative constructed around a decision already made in latent space. This has direct consequences for AI safety and interpretability research, which has invested heavily in reasoning traces as a mechanism for oversight. Developers and enterprises relying on reasoning models for high-stakes tool use should be aware that the visible rationale may not accurately represent why a model took a given action.
Sentiment
Intrigued researchers highlight implications for AI interpretability and safety
“Our results suggest that reasoning models can encode action choices before visible deliberation, and that CoT can sometimes rationalize rather than drive those choices.”
“'Therefore I am. I Think' - shows that reasoning LLMs often decide first, then think - not the other way around. Linear probes decode tool-calling decisions from pre-reasoning activations at >90% AUROC, and activation steering flips behavior 7-79% of the time, with the CoT rewriting itself to justify the new choice.”
“Do reasoning models actually reason, or do they decide first and rationalize after? ... This challenges the assumption that visible reasoning drives decisions in models like o1 and DeepSeek R1. The 'thinking' may be more post hoc narrative than deliberation.”
“A fascinating new paper tests the idea that LLMs 'think out loud' before making decisions. Turns out: the decision often comes first. ... chain-of-thought may be more post-hoc justification than transparent reasoning. If you’re building tools for alignment, safety, or efficiency, this is essential reading.”
Split
No significant divides: AI researchers and practitioners broadly agree on the finding's importance for oversight and safety (roughly a 95/5 split between agreement and minor caveats).
