Frontier AI Models Fail at Visual Financial Document Reasoning
Summary
- GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 score only 56-64% on image-based financial tasks
- Consistent 16-20 percentage-point accuracy drop when models read charts vs. typed numbers
- All three models scored near-zero (0-4%) from memory alone, confirming genuine document reasoning is tested
- Analysis argues AI displacement of financial analysts is premature given real-world performance gaps
Details
25 real financial document tasks tested across three frontier models, 50 evaluations per model
Tasks sourced from earnings reports, investor presentations, roadmap slides, and regulatory fee schedules. Each required identifying specific numbers and performing a financial calculation (margin, growth rate, dilution, ratio) with a single correct numerical answer — binary pass/fail scoring. Two variants per task: image-only and text-only.
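A minimal sketch of how binary pass/fail scoring for such a task might work. The function names, the specific margin example, and the matching tolerance are illustrative assumptions, not details from the study:

```python
def gross_margin(revenue: float, cogs: float) -> float:
    """Gross margin as a percentage of revenue (hypothetical example metric)."""
    return (revenue - cogs) / revenue * 100

def score(model_answer: float, correct: float, rel_tol: float = 1e-3) -> bool:
    """Binary pass/fail: the answer must match the single correct value
    within a small relative tolerance (tolerance is an assumption)."""
    return abs(model_answer - correct) <= rel_tol * abs(correct)

# Hypothetical task: revenue $500M, COGS $320M -> 36.0% gross margin
answer = gross_margin(500.0, 320.0)
print(score(answer, 36.0))  # True  (exact match)
print(score(35.0, 36.0))    # False (wrong by ~2.8%, scored as a fail)
```

Under this scheme a nearly-right answer earns no partial credit, which matches the study's single-correct-number, pass/fail design.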
Parametric knowledge baseline: all models scored near zero (0-4%) without document access
Claude Opus 4.6 and GPT-5.4 each answered 1/25 correctly; Gemini 3.1 Pro scored 0/25. This confirms the tasks genuinely test document reasoning rather than recalled financial knowledge, validating the benchmark design.
Text-only accuracy ranged from 72% (GPT-5.4) to 80% (Gemini 3.1 Pro); Claude Opus 4.6 at 76%
These scores reflect model performance when numbers are explicitly written out — the most favorable input format and the mode closest to standard benchmark conditions.
Image-only accuracy fell to 56-64%; Claude dropped 20pp, GPT-5.4 and Gemini each 16pp
The consistency of the drop across different model families points to a shared architectural limitation in visual data extraction from dense documents. Claude Opus 4.6 image-only: 56%; GPT-5.4: 56%; Gemini 3.1 Pro: 64%.
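The per-model degradation follows directly from the reported accuracies; a quick sketch of the arithmetic (figures are from the study):

```python
# Reported accuracies (%) for each model: text-only vs. image-only input
results = {
    "Claude Opus 4.6": {"text": 76, "image": 56},
    "GPT-5.4":         {"text": 72, "image": 56},
    "Gemini 3.1 Pro":  {"text": 80, "image": 64},
}

# Percentage-point drop = text-only accuracy minus image-only accuracy
for model, acc in results.items():
    drop = acc["text"] - acc["image"]
    print(f"{model}: -{drop}pp")
# Claude Opus 4.6: -20pp
# GPT-5.4: -16pp
# Gemini 3.1 Pro: -16pp
```

The drops cluster tightly (16-20pp) despite the models' different text-only baselines, which is what the analysis reads as a shared, rather than model-specific, weakness.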
All three models failed identically on task_138 (Fidelity Rising Wedge) in image-only mode; all correct in text
The task required identifying the dollar difference between the upper and lower trend lines at the entry point on a chart image. All three models answered correctly ($4.00) in text-only mode and all three failed in image-only mode, illustrating how visual format alone can flip a correct answer to a wrong one.
Models fail in two distinct ways: misreading values from visual documents and applying wrong financial operations
Even when visual extraction is partially successful, models sometimes apply the incorrect financial formula to the extracted numbers. This means errors compound — the pipeline has two independent failure points before reaching a correct answer.
Visual extraction from real financial documents is a systemic bottleneck for every frontier model tested
The consistent degradation across GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 rules out single-model quirks. All three architectures share the same weakness when confronted with chart-heavy, image-native financial documents.
Analysis argues AI displacement of financial analysts is premature given the real-world performance gap
The piece pushes back against Anthropic's own labor market research placing financial analysts among the ten most AI-exposed occupations. The argument: benchmark performance overstates real-world capability when inputs are multimodal and document-native.
What This Means
Despite strong performance on standard benchmarks, today's frontier AI models — including GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 — lose 16-20 percentage points of accuracy the moment financial data arrives in its natural visual format rather than as typed text, landing in a 56-64% accuracy range on realistic tasks. Since investor decks, earnings reports, and regulatory documents are almost universally image-rich and chart-heavy, this gap has direct implications for how much autonomous analytical work AI can reliably perform today. The analysis argues this performance ceiling means fears of near-term AI displacement for financial analysts are overstated — though the same findings define a clear capability threshold that model developers will be under pressure to close.
