The Case Against Token Counting: Why Cost-Per-Task Should Replace Cost-Per-Token
Summary
- Dollar-per-token comparisons are misleading when models vary wildly in task efficiency
- Hidden thinking tokens in reasoning modes can inflate costs 3-10x without visibility
- Agentic coding loops balloon context from 5K to 80K+ tokens across a single task
- The only metric that matters is cost per successfully completed task
Details
$/MTok is the wrong unit of cost
API pricing comparisons look damning — Claude Opus 4.6 appears 12-80x more expensive than GPT-5.1 Codex Mini — but ignore how many tokens each model actually consumes per task.
Tokenizer variance: ~5-15% (minor factor)
OpenAI, Anthropic, and Google tokenizers differ slightly, causing ~5-15% token count variance for the same input. For code this gap is even smaller. Not the primary cost driver.
Output verbosity: 2-5x variance (major factor)
The same 'refactor this function' prompt can yield 200 concise lines from a capable model versus 400 verbose lines from a cheaper one, or a wrong answer that requires costly re-prompting turns that balloon the input context.
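One way to reason about verbosity is as a multiplier on the advertised price: a model that emits 5x the tokens for the same result effectively costs 5x its sticker rate. A minimal sketch, with illustrative prices (not any provider's actual price sheet):

```python
def effective_price(price_per_mtok, verbosity_multiplier):
    """Price a model as if it were concise: verbose output inflates
    the effective per-MTok output rate by the verbosity multiplier."""
    return price_per_mtok * verbosity_multiplier

# Hypothetical numbers: a $2/MTok model that is 5x as verbose behaves
# like a $10/MTok model; a concise $15/MTok model keeps its rate.
print(effective_price(2.00, 5.0))   # 10.0
print(effective_price(15.00, 1.0))  # 15.0
```

The gap on the pricing page narrows sharply once verbosity is priced in, before re-prompting costs are even counted.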
Hidden thinking tokens: 3-10x cost multiplier
Reasoning models (Claude extended thinking, OpenAI o-series, Gemini thinking mode) bill chain-of-thought tokens as standard output tokens. 500 visible output tokens may actually be 5,000 billed tokens — invisible but charged.
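The billing arithmetic above can be sketched directly. The token counts are illustrative, matching the 500-visible / 5,000-billed example in the text:

```python
def billed_output_tokens(visible_tokens, thinking_tokens):
    """Reasoning APIs bill chain-of-thought as ordinary output tokens,
    so the invoice reflects visible + hidden tokens."""
    return visible_tokens + thinking_tokens

visible = 500
billed = billed_output_tokens(visible, thinking_tokens=4_500)
print(billed, billed / visible)   # 5000 10.0  (a 10x cost multiplier)
```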
Agentic loops: context grows from 5K to 80K+ tokens
Claude Code and Codex re-send full conversation history each turn. A 10-step coding task grows from ~5K to 80K+ input tokens. Anthropic reports ~$6/dev/day average Claude Code usage. Prompt caching at 10% of standard rate provides partial mitigation.
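Because the full history is re-sent every turn, the billed input is the cumulative sum of a growing context, not its final size. A sketch under assumed (illustrative) growth numbers chosen to end near the 80K figure above:

```python
def total_input_tokens(turns, base_context, growth_per_turn):
    """Sum the input tokens billed across an agentic loop where the
    entire (growing) conversation is re-sent on every turn."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context              # this turn re-sends the whole history
        context += growth_per_turn    # tool results, diffs, model output
    return total

# Hypothetical 10-turn task: 5K starting context, ~8.3K added per turn,
# so the final turn re-sends ~80K tokens.
uncached = total_input_tokens(turns=10, base_context=5_000, growth_per_turn=8_300)
print(uncached)                 # 423500 input tokens billed without caching
print(int(uncached * 0.10))     # rough floor if every re-sent token hit the 10% cache rate
```

The quadratic-looking blowup (423K billed input tokens for an 80K final context) is why prompt caching matters so much in agentic workflows.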
Cost-per-task: the metric nobody publishes
A model charging 2x per token may be cheaper per completed task if it solves the problem in fewer turns. No provider publishes clean cost-per-task figures, making this the practitioner's job to instrument and measure.
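The instrumentation the text calls for reduces to a simple expectation: if failed attempts are retried, the expected spend per solved task is the cost per attempt divided by the success rate. A sketch with hypothetical dollar figures and success rates:

```python
def cost_per_completed_task(cost_per_attempt, success_rate):
    """Expected spend per solved task: with independent retries,
    E[cost] = cost_per_attempt / P(success)."""
    return cost_per_attempt / success_rate

# Illustrative only: the 3x pricier model that usually solves the task
# in one shot beats the cheap model that usually needs retries.
pricey = cost_per_completed_task(cost_per_attempt=0.60, success_rate=0.90)
cheap = cost_per_completed_task(cost_per_attempt=0.20, success_rate=0.25)
print(round(pricey, 2), round(cheap, 2))   # 0.67 0.8
```

Logging cost per attempt and task success rate per model is enough to compute this; neither number appears on a pricing page.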
What This Means
AI practitioners and engineering leaders relying on per-token spend as a productivity benchmark are likely drawing flawed conclusions about model efficiency and cost. Teams should instrument workflows to measure task completion rates and total cost per outcome — not token volume — especially when using reasoning models or agentic coding tools where hidden costs are largest. Models that appear most expensive on a pricing page may deliver the lowest total cost in production.
Sources
- Token Myth (Robonomics)
