
Open Models Trail Closed Frontier by 8–10 Months on Private Benchmarks

Research · 1 source · 5d ago

Summary

  • Open models lag closed models by 8-10 months on private benchmarks as of 2025
  • The gap was smallest around DeepSeek R1 in January 2025 and has since grown
  • Public benchmarks underestimate the open-closed gap by roughly a factor of two
  • The analysis argues the real-world task gap is likely even larger than private benchmarks show

Details

1. Stat

Open models trail closed frontier by 8-10 months on private benchmarks

Measured across 8 private benchmarks where test data is not publicly accessible. On 9 public benchmarks, the gap shrinks to roughly 4-6 months. The methodology tracks when each model crosses defined threshold scores and computes the time lag between open and closed models reaching the same performance level.

2. Research

Gap was smallest around DeepSeek R1 (Jan 2025) and has grown since

The analysis covers data from 2023 through the present. The convergence point near DeepSeek R1's release marked the narrowest measured gap between open and closed model capabilities; the trend has reversed since, with closed labs widening their lead.

3. Insight

Public benchmarks underestimate the open-closed gap by nearly 2x

The authors argue that open model developers are likely doing some combination of two things: incompletely filtering benchmark data out of their training sets, and hill-climbing on public test scores. Because the same directional trend appears in both the private and public benchmark sets, which are completely disjoint, the trend is likely real. Public scores nonetheless flatter open models significantly relative to private evaluations.
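
As a rough check on the "nearly 2x" figure, the ratio can be recomputed from the month ranges quoted in this summary (8-10 months on private benchmarks, 4-6 months on public ones). Comparing range midpoints is an assumption made here for illustration; the authors may aggregate the gaps differently.

```python
# Gap ranges in months, as quoted in this summary.
private_gap_months = (8, 10)
public_gap_months = (4, 6)


def midpoint(gap_range):
    """Midpoint of a (low, high) range; an illustrative simplification."""
    low, high = gap_range
    return (low + high) / 2


# Ratio of the private-benchmark gap to the public-benchmark gap.
ratio = midpoint(private_gap_months) / midpoint(public_gap_months)
print(f"{ratio:.1f}x")  # prints 1.8x
```

Taking the extremes of the two ranges instead gives anywhere from about 1.3x (8 vs. 6) to 2.5x (10 vs. 4), so "roughly a factor of two" is a fair reading of the quoted numbers.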

4. Insight

Real-world task gap likely larger than even private benchmarks suggest

The analysis contends that large closed labs have access to more varied training data and real enterprise feedback loops, while open model developers are relatively more focused on benchmark scores. This structural difference means practical task performance gaps probably exceed what any current benchmark measures.

5. Context

Third-party provider degradation may inflate the measured gap for open models

Researchers running private benchmarks on Chinese open models sometimes route inference through third-party providers with zero-data-retention policies for privacy reasons. Implementation bugs at these providers can subtly degrade model performance, which could bias the gap estimate upward, particularly on private benchmarks.

6. Research

Threshold-crossing methodology used across 17 benchmarks with ~110 datapoints

For each benchmark, threshold scores are set at 5% intervals. The gap is estimated by finding when an open model first crosses each threshold versus when a closed model did the same. 2023-2024 data is partially self-reported; all data and code are publicly available on GitHub.
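
The threshold-crossing idea can be sketched in a few lines. This is a minimal illustration, not the authors' published code: the function names, the averaging of lags across thresholds, the reading of "5% intervals" as 5-point steps on a 0-100 scale, and all data points below are assumptions invented for the example.

```python
from datetime import date

DAYS_PER_MONTH = 30.44  # average month length, used to convert days to months


def first_crossing(results, threshold):
    """Earliest release date at which any model scores >= threshold, else None.

    `results` is a list of (release_date, score) pairs for one group of models.
    """
    dates = [released for released, score in results if score >= threshold]
    return min(dates) if dates else None


def gap_months(open_results, closed_results, step=5):
    """Mean lag in months over the thresholds that both groups have crossed.

    Averaging across thresholds is an assumption of this sketch.
    """
    lags = []
    for threshold in range(step, 101, step):
        closed_date = first_crossing(closed_results, threshold)
        open_date = first_crossing(open_results, threshold)
        if closed_date is None or open_date is None:
            continue  # skip thresholds that one group has not yet reached
        lags.append((open_date - closed_date).days / DAYS_PER_MONTH)
    return sum(lags) / len(lags) if lags else None


# Made-up scores on one hypothetical benchmark, purely for illustration.
closed_models = [(date(2024, 1, 1), 42.0), (date(2024, 7, 1), 61.0)]
open_models = [(date(2024, 9, 1), 41.0), (date(2025, 3, 1), 58.0)]

gap = gap_months(open_models, closed_models)
print(f"Estimated gap: {gap:.1f} months")
```

A positive result means the open models reached each score level later than the closed ones. Thresholds that only one group has crossed are skipped here, which is one possible design choice; how the original analysis handles them is not stated in this summary.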


What This Means

This analysis argues that, despite the rapid progress of open-weight models, the gap with closed frontier labs has not closed and is widening again after a brief narrowing around DeepSeek R1 in early 2025. For teams deciding whether to build on open or closed models, the implication is that closed APIs retain a meaningful capability lead, particularly on tasks that resemble real enterprise workloads rather than public benchmarks. The finding that public benchmarks underestimate the gap by nearly 2x is a significant methodological caution for anyone using leaderboard rankings as a proxy for production readiness. If the authors' thesis holds, the capability frontier remains with well-resourced closed labs.
