Data Pruning at Training Time Boosts LLM Fact Memorization by 1.3X
Summary
- Training data pruning lets a 110M-parameter model match a 1.3B-parameter model on factual recall
- Fact accuracy degrades when training data information exceeds model capacity, especially with power-law distributions
- Data selection method uses training loss alone — no extra labels or metadata required
- Accepted at ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models
Details
Fact accuracy falls below the capacity limit when training data information exceeds model capacity
The paper formalizes this via information theory: every model has a capacity limit for fact storage, and once the aggregate information in training facts surpasses it, accuracy degrades. This provides a principled explanation for why larger training datasets don't always improve factual recall.
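The capacity argument can be sketched numerically. This is an illustrative toy, not the paper's exact formulation: it treats each fact as carrying some number of bits and gives the model a storage budget proportional to its parameter count. The bits-per-parameter constant below is a hypothetical placeholder.

```python
# Illustrative sketch (not the paper's formalization): compare the aggregate
# information in the training facts against a capacity budget proportional to
# parameter count. BITS_PER_PARAM is an assumed constant, not a paper figure.
BITS_PER_PARAM = 2.0

def total_fact_information(fact_bits):
    """Aggregate information (in bits) carried by a set of unique facts."""
    return sum(fact_bits)

def exceeds_capacity(fact_bits, n_params):
    """True when the facts carry more information than the model can store."""
    return total_fact_information(fact_bits) > BITS_PER_PARAM * n_params

# A 100K-parameter toy model facing one million ~1-bit facts is over budget:
facts = [1.0] * 1_000_000
print(exceeds_capacity(facts, 100_000))  # True: 1e6 bits > 2e5-bit budget
```

Once this inequality holds, adding more facts cannot raise recall; accuracy degrades instead, which is the regime the pruning method targets.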
Skewed power-law fact frequency distributions exacerbate memorization failures
When some facts appear orders of magnitude more often than others — a common property of web-scraped corpora — the model over-invests capacity in high-frequency facts and under-memorizes rare but equally important ones. Flattening this distribution is a key lever in the proposed method.
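The skew described above is easy to visualize with a toy Zipf-like sampler. The fact count, exponent, and sample size here are made up for the sketch; the point is only that the head fact dominates the tail by orders of magnitude.

```python
import collections
import random

# Hypothetical power-law (Zipf-like) fact frequency distribution, as commonly
# seen in web-scraped corpora. Ranks and exponent are illustrative.
random.seed(0)
N_FACTS, EXPONENT = 1000, 1.2
weights = [1.0 / (rank ** EXPONENT) for rank in range(1, N_FACTS + 1)]
samples = random.choices(range(N_FACTS), weights=weights, k=100_000)

counts = collections.Counter(samples)
head = counts[0]                       # most frequent fact
tail = counts.get(N_FACTS - 1, 0)      # rarest fact
print(head, tail)  # the head fact appears orders of magnitude more often
```

Under such a distribution, gradient updates are dominated by the head facts, so the model spends capacity re-memorizing them while the tail stays under-trained.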
Data selection uses training loss alone to prune facts and flatten frequency distribution
No external signals, knowledge graphs, or human annotations are required. The method identifies which facts are over-represented or already well-learned based on the model's loss trajectory, then prunes redundant instances. This makes it tractable for large-scale pretraining pipelines.
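A minimal sketch of loss-based pruning follows; the paper's exact criterion may differ (it reportedly uses the loss trajectory, while this toy uses a single snapshot). The instances, losses, and threshold are hypothetical.

```python
# Minimal sketch of loss-based pruning (assumed single-snapshot variant, not
# the paper's exact criterion): drop instances whose per-example loss has
# fallen below a threshold, on the view that they are already memorized.
def prune_learned(instances, losses, threshold=0.1):
    """Keep only instances the model has not yet learned (loss >= threshold)."""
    return [x for x, loss in zip(instances, losses) if loss >= threshold]

# Duplicated high-frequency facts reach low loss quickly, so pruning them
# flattens the effective frequency distribution:
instances = ["paris-capital"] * 5 + ["rare-fact"]
losses    = [0.01] * 5 + [2.3]
print(prune_learned(instances, losses))  # ['rare-fact']
```

Because the selection signal is the model's own loss, the filter can run inside the training loop with no additional labeling pass, which is what makes it pipeline-compatible.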
GPT2-Small (110M params) with pruning memorizes 1.3X more entity facts than standard training
The result was obtained by pretraining from scratch on an annotated Wikipedia corpus. The pruned 110M model matches the factual performance of a 1.3B-parameter model trained on the full dataset — a roughly 10X parameter efficiency gain attributable entirely to data selection.
On semi-synthetic datasets, the method boosts fact accuracy to the theoretical capacity limit
High-entropy fact datasets were used to validate that the method can close the full gap between observed accuracy and the information-theoretic upper bound.
Smarter data curation can substitute for model scaling in knowledge-intensive tasks
The 10X parameter efficiency gain suggests organizations with compute constraints could achieve strong factual recall without scaling to billion-parameter models, provided they invest in principled pretraining data selection.
Method validated primarily on entity facts from Wikipedia; generalization to other knowledge types unproven
The annotated Wikipedia corpus provides clean entity-fact pairs, which is an idealized setting. Real-world pretraining data is noisier and contains many fact types beyond named-entity knowledge. Broader applicability remains an open question.
What This Means
This research offers a concrete, compute-efficient remedy for factual hallucination in LLMs: by framing the problem information-theoretically, the authors show that smarter data curation can substitute for model scaling. Matching a 1.3B-parameter model's factual performance with a 10X smaller model through data pruning alone changes the cost calculus for building knowledge-intensive systems. The loss-based selection signal is an elegant, pipeline-compatible mechanism that requires no costly data annotation. The main open question is whether the approach scales beyond clean entity-fact corpora to the messier knowledge types that characterize real-world pretraining datasets.
