Reasoning Models May Decide Before They Think, Study Finds
Summary
- Research shows LLM reasoning models encode action decisions before generating chain-of-thought text
- Linear probes decoded tool-calling decisions from pre-generation activations with high confidence, even before a single reasoning token appeared
- Findings raise serious questions about whether AI chain-of-thought is genuine deliberation or post-hoc rationalization
Details
Linear probes decoded tool-calling decisions from pre-generation activations with high confidence
Probes were applied to internal model activations before any reasoning tokens were generated, meaning the model's internal state already encoded the decision outcome before the visible thinking process began. This is strong evidence that chain-of-thought is not the locus of actual decision-making.
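To make the setup concrete, a linear probe in this context is simply a linear classifier trained on frozen hidden states to predict a behavioral outcome. The sketch below is illustrative, not the paper's code: it uses synthetic data in place of real pre-generation activations, and all variable names (`activations`, `labels`, `d_model`) are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for pre-generation activations: in the real setup these
# would be hidden states captured at the last prompt token, before any
# reasoning tokens are sampled. Here we synthesize separable data.
rng = np.random.default_rng(0)
d_model = 64
n = 500
labels = rng.integers(0, 2, size=n)        # 1 = model ends up calling the tool
direction = rng.normal(size=d_model)       # hypothetical "decision" direction
activations = rng.normal(size=(n, d_model)) + np.outer(labels * 2 - 1, direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.3, random_state=0
)

# A linear probe is just a logistic regression fit on frozen activations;
# high held-out AUROC means the decision is linearly decodable pre-generation.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auroc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"probe AUROC: {auroc:.2f}")
```

If a probe this simple reads the decision out before any reasoning token exists, the visible chain-of-thought cannot be where that decision was made.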
In some cases decisions were detectable even before a single reasoning token was produced
This is the strongest form of the finding: not just that decisions are encoded early in the reasoning chain, but that they can be read from activations at the very start, before any text is output. This directly undermines the premise that reasoning tokens represent genuine deliberation.
Activation steering caused behavioral flips in 7–79% of cases depending on model and benchmark
By perturbing the activation direction associated with the encoded decision, researchers could causally influence the model's output. The wide range reflects variation across models and tasks, but the causal relationship is consistent: pre-reasoning activation state shapes the final behavior.
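The causal intervention can be sketched in miniature. The snippet below is a conceptual toy, not the paper's method: it treats "the decision" as the sign of a projection onto a learned direction (in practice this direction would come from a trained probe), and shows that adding a scaled copy of that direction pushes the state across the decision boundary. All names (`v`, `alpha`, `decision`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Hypothetical learned decision direction, e.g. a probe's normalized
# weight vector. In the paper's setup this would be extracted per model.
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)

def decision(h):
    """Toy readout: sign of the projection onto the decision direction
    stands in for 'will the model call the tool'."""
    return h @ v > 0.0

# An activation currently on the "no tool call" side of the boundary.
h = -2.0 * v + 0.1 * rng.normal(size=d_model)
assert not decision(h)

# Activation steering: add a scaled copy of the direction to shift the
# internal state across the boundary before any text is generated.
alpha = 5.0
h_steered = h + alpha * v
print("before:", decision(h), "after:", decision(h_steered))
```

In a real model the perturbation would be applied to hidden states during the forward pass (e.g. via a hook at a chosen layer), and the reported 7–79% flip rate reflects how reliably such a shift changes the sampled behavior.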
When steering flipped a decision, chain-of-thought rationalized the new outcome rather than resisting it
If chain-of-thought were genuine deliberation, one might expect it to resist or flag an externally induced change. Instead, the reasoning adapted to justify whichever conclusion the perturbed activations produced — consistent with post-hoc rationalization rather than independent reasoning.
Activation steering can reliably manipulate model behavior, raising adversarial robustness concerns
The ability to flip decisions by perturbing internal activations suggests a potential attack surface. If pre-generation activations are the true locus of decision-making, interpretability and safety work focused solely on output text may miss the most important layer.
Paper is an arXiv preprint (April 2026) and has not yet undergone peer review
Submitted April 1, 2026 with a v3 revision on April 3, 2026. As with all preprints, findings should be treated as promising but preliminary until independently replicated and formally peer reviewed.
Legend: Research = empirical finding from the paper; Stat = specific numerical result; Insight = interpretive claim from the authors; Security Alert = adversarial/safety implication; Context = background or caveats
What This Means
If these findings hold up under peer review and replication, they represent a fundamental challenge to how the AI industry thinks about chain-of-thought reasoning — not as a transparent window into model cognition, but as a post-hoc narrative constructed around a decision already made in latent space. This has direct consequences for AI safety and interpretability research, which has invested heavily in reasoning traces as a mechanism for oversight. Developers and enterprises relying on reasoning models for high-stakes tool use should be aware that the visible rationale may not accurately represent why a model took a given action.
Sentiment
Intrigued researchers highlight implications for AI interpretability and safety
“Our results suggest that reasoning models can encode action choices before visible deliberation, and that CoT can sometimes rationalize rather than drive those choices.”
“'Therefore I am. I Think' - shows that reasoning LLMs often decide first, then think - not the other way around. Linear probes decode tool-calling decisions from pre-reasoning activations at >90% AUROC, and activation steering flips behavior 7-79% of the time, with the CoT rewriting itself to justify the new choice.”
“Do reasoning models actually reason, or do they decide first and rationalize after? ... This challenges the assumption that visible reasoning drives decisions in models like o1 and DeepSeek R1. The 'thinking' may be more post hoc narrative than deliberation.”
“A fascinating new paper tests the idea that LLMs 'think out loud' before making decisions. Turns out: the decision often comes first. ... chain-of-thought may be more post-hoc justification than transparent reasoning. If you’re building tools for alignment, safety, or efficiency, this is essential reading.”
Split
No significant divides: AI researchers and practitioners broadly agree on the finding's importance for oversight and safety (roughly a 95/5 split between agreement and minor caveats).
