
The Case Against Token Counting: Why Cost-Per-Task Should Replace Cost-Per-Token

Products · 1 source · Mar 25

Summary

  • Dollar-per-token comparisons are misleading when models vary wildly in task efficiency
  • Hidden thinking tokens in reasoning modes can inflate costs 3-10x without visibility
  • Agentic coding loops balloon context from 5K to 80K+ tokens across a single task
  • The only metric that matters is cost per successfully completed task

Details

1. Insight

$/MTok is the wrong unit of cost

API pricing comparisons look damning: Claude Opus 4.6 appears 12-80x more expensive than GPT-5.1 Codex Mini. But such comparisons ignore how many tokens each model actually consumes per task.
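As a back-of-the-envelope sketch of how the orderings can invert, consider two hypothetical models (the prices and token counts below are illustrative assumptions, not actual Opus or Codex figures):

```python
# Illustrative only: per-token prices and per-task token usage are hypothetical.
def cost_per_task(price_per_mtok: float, tokens_per_task: int) -> float:
    """Dollar cost of one completed task at a given $/MTok rate."""
    return price_per_mtok * tokens_per_task / 1_000_000

# "Expensive" model: 15x the per-token rate, but solves the task efficiently.
expensive = cost_per_task(price_per_mtok=15.0, tokens_per_task=8_000)
# "Cheap" model: low rate, but verbose output and retries consume far more tokens.
cheap = cost_per_task(price_per_mtok=1.0, tokens_per_task=150_000)

print(f"expensive model: ${expensive:.4f}/task")  # $0.1200/task
print(f"cheap model:     ${cheap:.4f}/task")      # $0.1500/task
```

Despite a 15x per-token premium, the "expensive" model wins on cost per task once token consumption is accounted for.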

2. Stat

Tokenizer variance: ~5-15% (minor factor)

OpenAI, Anthropic, and Google tokenizers differ slightly, causing ~5-15% token count variance for the same input. For code this gap is even smaller. Not the primary cost driver.

3. Stat

Output verbosity: 2-5x variance (major factor)

The same 'refactor this function' prompt can yield 200 lines from a capable model versus 400 verbose lines from a cheaper one, or a wrong answer that forces costly re-prompting turns, ballooning the input context.

4. Tech Info

Hidden thinking tokens: 3-10x cost multiplier

Reasoning models (Claude extended thinking, OpenAI o-series, Gemini thinking mode) bill chain-of-thought tokens as standard output tokens. A response showing 500 visible output tokens may actually bill for 5,000: the hidden reasoning is invisible to the user but fully charged.
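A minimal illustration of the multiplier, using an assumed output price and a hypothetical hidden thinking-token count:

```python
# Hypothetical illustration: hidden chain-of-thought billed as ordinary output.
OUTPUT_PRICE_PER_MTOK = 75.0  # assumed output price, $/MTok

def output_cost(visible_tokens: int, thinking_tokens: int) -> float:
    """Cost when hidden reasoning tokens are billed like normal output tokens."""
    return (visible_tokens + thinking_tokens) * OUTPUT_PRICE_PER_MTOK / 1_000_000

visible_only = output_cost(500, 0)       # what the user appears to have bought
with_thinking = output_cost(500, 4_500)  # what is actually billed
print(round(with_thinking / visible_only))  # 10
```

At a 9:1 thinking-to-visible ratio, the effective output cost is 10x what the visible response suggests.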

5. Context

Agentic loops: context grows from 5K to 80K+ tokens

Claude Code and Codex re-send the full conversation history each turn, so a 10-step coding task grows from ~5K to 80K+ input tokens. Anthropic reports average Claude Code usage of ~$6 per developer per day. Prompt caching, billed at 10% of the standard input rate, partially mitigates this.
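The growth pattern can be sketched with a toy model of a loop that re-sends its full history each turn. The starting context, per-turn growth, price, and flat 10% cache rate here are illustrative assumptions, not measured Claude Code figures:

```python
# Toy model of agentic-loop input costs; all numbers are hypothetical.
PRICE_PER_MTOK = 3.0  # assumed input price, $/MTok
CACHE_RATE = 0.10     # cached input billed at 10% of the standard rate

def loop_costs(turns: int, start_ctx: int, growth_per_turn: int):
    """Return (final context size, uncached input cost, cost with prompt caching)."""
    uncached = cached = 0.0
    ctx = start_ctx
    for turn in range(turns):
        uncached += ctx * PRICE_PER_MTOK / 1e6
        # With caching, only this turn's new tokens pay full price;
        # the previously seen prefix is billed at the cache rate.
        new = ctx if turn == 0 else growth_per_turn
        cached += (new + (ctx - new) * CACHE_RATE) * PRICE_PER_MTOK / 1e6
        ctx += growth_per_turn  # model output + tool results appended each turn
    return ctx, uncached, cached

final_ctx, full, with_cache = loop_costs(turns=10, start_ctx=5_000,
                                         growth_per_turn=8_000)
print(final_ctx)             # 85000 -> "5K grows to 80K+"
print(round(full, 4))        # total input cost without caching
print(round(with_cache, 4))  # substantially lower with caching
```

Because the full history is re-billed every turn, total input tokens grow roughly quadratically with turn count, which is why caching matters so much here.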

6. Insight

Cost-per-task: the metric nobody publishes

A model charging 2x per token may be cheaper per completed task if it solves the problem in fewer turns. No provider publishes clean cost-per-task figures, so instrumenting and measuring them falls to practitioners.
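One minimal way to instrument this yourself might look like the following sketch. The prices and runs are hypothetical; in practice the token counts would come from the usage fields your API provider returns with each response:

```python
# Sketch of cost-per-completed-task accounting; all figures are hypothetical.
from dataclasses import dataclass

@dataclass
class TaskRun:
    input_tokens: int
    output_tokens: int
    succeeded: bool

def cost_per_completed_task(runs, in_price_mtok: float, out_price_mtok: float) -> float:
    """Total spend across ALL attempts divided by tasks that actually succeeded."""
    total = sum(r.input_tokens * in_price_mtok + r.output_tokens * out_price_mtok
                for r in runs) / 1_000_000
    successes = sum(r.succeeded for r in runs)
    return float("inf") if successes == 0 else total / successes

runs = [
    TaskRun(40_000, 2_000, True),
    TaskRun(120_000, 9_000, False),  # failed attempt still costs money
    TaskRun(60_000, 3_000, True),
]
print(round(cost_per_completed_task(runs, in_price_mtok=3.0, out_price_mtok=15.0), 4))
```

The key design choice is that failed attempts stay in the numerator: a cheap model that fails half the time is billed against its successes, which is exactly what pricing-page comparisons hide.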


What This Means

AI practitioners and engineering leaders relying on per-token spend as a productivity benchmark are likely drawing flawed conclusions about model efficiency and cost. Teams should instrument workflows to measure task completion rates and total cost per outcome, not token volume, especially when using reasoning models or agentic coding tools, where hidden costs are largest. Models that appear most expensive on a pricing page may deliver the lowest total cost in production.
