Frontier AI Models Fail at Visual Financial Document Reasoning
Summary
- GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 score only 56-64% on image-based financial tasks
- Consistent 16-20 percentage-point accuracy drop when models read charts vs. typed numbers
- All three models scored near-zero (0-4%) from memory alone, confirming genuine document reasoning is tested
- Analysis argues AI displacement of financial analysts is premature given real-world performance gaps
Details
25 real financial document tasks tested across three frontier models, 50 evaluations per model
Tasks sourced from earnings reports, investor presentations, roadmap slides, and regulatory fee schedules. Each required identifying specific numbers and performing a financial calculation (margin, growth rate, dilution, ratio) with a single correct numerical answer — binary pass/fail scoring. Two variants per task: image-only and text-only.
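A minimal sketch of how binary pass/fail scoring for such a task might work. The function names, the specific margin example, and the matching tolerance are illustrative assumptions, not details from the study:

```python
def gross_margin(revenue: float, cogs: float) -> float:
    """Gross margin as a percentage of revenue (hypothetical example metric)."""
    return (revenue - cogs) / revenue * 100

def score(model_answer: float, correct: float, rel_tol: float = 1e-3) -> bool:
    """Binary pass/fail: the answer must match the single correct value
    within a small relative tolerance (tolerance is an assumption)."""
    return abs(model_answer - correct) <= rel_tol * abs(correct)

# Hypothetical task: revenue $500M, COGS $320M -> 36.0% gross margin
answer = gross_margin(500.0, 320.0)
print(score(answer, 36.0))  # True  (exact match)
print(score(35.0, 36.0))    # False (wrong by ~2.8%, scored as a fail)
```

Under this scheme a nearly-right answer earns no partial credit, which matches the study's single-correct-number, pass/fail design.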
Parametric knowledge baseline: all models scored near zero (0-4%) without document access
Claude Opus 4.6 and GPT-5.4 each answered 1/25 correctly; Gemini 3.1 Pro scored 0/25. This confirms the tasks genuinely test document reasoning rather than recalled financial knowledge, validating the benchmark design.
Text-only accuracy ranged from 72% (GPT-5.4) to 80% (Gemini 3.1 Pro); Claude Opus 4.6 at 76%
These scores reflect model performance when numbers are explicitly written out — the most favorable input format and the mode closest to standard benchmark conditions.
Image-only accuracy fell to 56-64%; Claude dropped 20pp, GPT-5.4 and Gemini each 16pp
The consistency of the drop across different model families points to a shared architectural limitation in visual data extraction from dense documents. Claude Opus 4.6 image-only: 56%; GPT-5.4: 56%; Gemini 3.1 Pro: 64%.
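The per-model degradation follows directly from the reported accuracies; a quick sketch of the arithmetic (figures are from the study):

```python
# Reported accuracies (%) for each model: text-only vs. image-only input
results = {
    "Claude Opus 4.6": {"text": 76, "image": 56},
    "GPT-5.4":         {"text": 72, "image": 56},
    "Gemini 3.1 Pro":  {"text": 80, "image": 64},
}

# Percentage-point drop = text-only accuracy minus image-only accuracy
for model, acc in results.items():
    drop = acc["text"] - acc["image"]
    print(f"{model}: -{drop}pp")
# Claude Opus 4.6: -20pp
# GPT-5.4: -16pp
# Gemini 3.1 Pro: -16pp
```

The drops cluster tightly (16-20pp) despite the models' different text-only baselines, which is what the analysis reads as a shared, rather than model-specific, weakness.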
All three models failed identically on task_138 (Fidelity Rising Wedge) in image-only mode; all correct in text
The task required identifying the dollar difference between the upper and lower trend lines at the entry point on a chart image. All three models answered correctly ($4.00) in text-only mode and all three failed in image-only mode, illustrating how visual format alone can flip a correct answer to a wrong one.
Models fail in two distinct ways: misreading values from visual documents and applying wrong financial operations
Even when visual extraction is partially successful, models sometimes apply the incorrect financial formula to the extracted numbers. This means errors compound — the pipeline has two independent failure points before reaching a correct answer.
Visual extraction from real financial documents is a systemic bottleneck for every frontier model tested
The consistent degradation across GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 rules out single-model quirks. All three architectures share the same weakness when confronted with chart-heavy, image-native financial documents.
Analysis argues AI displacement of financial analysts is premature given the real-world performance gap
The piece pushes back against Anthropic's own labor market research placing financial analysts among the ten most AI-exposed occupations. The argument: benchmark performance overstates real-world capability when inputs are multimodal and document-native.
What This Means
Despite strong performance on standard benchmarks, today's frontier AI models — including GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 — lose 16-20 percentage points of accuracy the moment financial data arrives in its natural visual format rather than as typed text, landing in a 56-64% accuracy range on realistic tasks. Since investor decks, earnings reports, and regulatory documents are almost universally image-rich and chart-heavy, this gap has direct implications for how much autonomous analytical work AI can reliably perform today. The analysis argues this performance ceiling means fears of near-term AI displacement for financial analysts are overstated — though the same findings define a clear capability threshold that model developers will be under pressure to close.
