Data Pruning at Training Time Boosts LLM Fact Memorization by 1.3X
Summary
- Training data pruning lets a 110M-parameter model match a 1.3B-parameter model on factual recall
- Fact accuracy degrades when training data information exceeds model capacity, especially with power-law distributions
- Data selection method uses training loss alone — no extra labels or metadata required
- Accepted at ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models
Details
Fact accuracy falls below the capacity limit when training data information exceeds model capacity
The paper formalizes this via information theory: every model has a capacity limit for fact storage, and once the aggregate information in training facts surpasses it, accuracy degrades. This provides a principled explanation for why larger training datasets don't always improve factual recall.
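The capacity argument can be sketched numerically. This is an illustrative toy, not the paper's exact formulation: it treats each fact as carrying some number of bits and gives the model a storage budget proportional to its parameter count. The bits-per-parameter constant below is a hypothetical placeholder.

```python
# Illustrative sketch (not the paper's formalization): compare the aggregate
# information in the training facts against a capacity budget proportional to
# parameter count. BITS_PER_PARAM is an assumed constant, not a paper figure.
BITS_PER_PARAM = 2.0

def total_fact_information(fact_bits):
    """Aggregate information (in bits) carried by a set of unique facts."""
    return sum(fact_bits)

def exceeds_capacity(fact_bits, n_params):
    """True when the facts carry more information than the model can store."""
    return total_fact_information(fact_bits) > BITS_PER_PARAM * n_params

# A 100K-parameter toy model facing one million ~1-bit facts is over budget:
facts = [1.0] * 1_000_000
print(exceeds_capacity(facts, 100_000))  # True: 1e6 bits > 2e5-bit budget
```

Once this inequality holds, adding more facts cannot raise recall; accuracy degrades instead, which is the regime the pruning method targets.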
Skewed power-law fact frequency distributions exacerbate memorization failures
When some facts appear orders of magnitude more often than others — a common property of web-scraped corpora — the model over-invests capacity in high-frequency facts and under-memorizes rare but equally important ones. Flattening this distribution is a key lever in the proposed method.
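The skew described above is easy to visualize with a toy Zipf-like sampler. The fact count, exponent, and sample size here are made up for the sketch; the point is only that the head fact dominates the tail by orders of magnitude.

```python
import collections
import random

# Hypothetical power-law (Zipf-like) fact frequency distribution, as commonly
# seen in web-scraped corpora. Ranks and exponent are illustrative.
random.seed(0)
N_FACTS, EXPONENT = 1000, 1.2
weights = [1.0 / (rank ** EXPONENT) for rank in range(1, N_FACTS + 1)]
samples = random.choices(range(N_FACTS), weights=weights, k=100_000)

counts = collections.Counter(samples)
head = counts[0]                       # most frequent fact
tail = counts.get(N_FACTS - 1, 0)      # rarest fact
print(head, tail)  # the head fact appears orders of magnitude more often
```

Under such a distribution, gradient updates are dominated by the head facts, so the model spends capacity re-memorizing them while the tail stays under-trained.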
Data selection uses training loss alone to prune facts and flatten frequency distribution
No external signals, knowledge graphs, or human annotations are required. The method identifies which facts are over-represented or already well-learned based on the model's loss trajectory, then prunes redundant instances. This makes it tractable for large-scale pretraining pipelines.
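A minimal sketch of loss-based pruning follows; the paper's exact criterion may differ (it reportedly uses the loss trajectory, while this toy uses a single snapshot). The instances, losses, and threshold are hypothetical.

```python
# Minimal sketch of loss-based pruning (assumed single-snapshot variant, not
# the paper's exact criterion): drop instances whose per-example loss has
# fallen below a threshold, on the view that they are already memorized.
def prune_learned(instances, losses, threshold=0.1):
    """Keep only instances the model has not yet learned (loss >= threshold)."""
    return [x for x, loss in zip(instances, losses) if loss >= threshold]

# Duplicated high-frequency facts reach low loss quickly, so pruning them
# flattens the effective frequency distribution:
instances = ["paris-capital"] * 5 + ["rare-fact"]
losses    = [0.01] * 5 + [2.3]
print(prune_learned(instances, losses))  # ['rare-fact']
```

Because the selection signal is the model's own loss, the filter can run inside the training loop with no additional labeling pass, which is what makes it pipeline-compatible.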
GPT2-Small (110M params) with pruning memorizes 1.3X more entity facts than standard training
The result was obtained by pretraining from scratch on an annotated Wikipedia corpus. The pruned 110M model matches the factual performance of a 1.3B-parameter model trained on the full dataset — a roughly 10X parameter efficiency gain attributable entirely to data selection.
On semi-synthetic datasets, the method boosts fact accuracy to the theoretical capacity limit
High-entropy fact datasets were used to validate that the method can close the full gap between observed accuracy and the information-theoretic upper bound.
Smarter data curation can substitute for model scaling in knowledge-intensive tasks
The 10X parameter efficiency gain suggests organizations with compute constraints could achieve strong factual recall without scaling to billion-parameter models, provided they invest in principled pretraining data selection.
Method validated primarily on entity facts from Wikipedia; generalization to other knowledge types unproven
The annotated Wikipedia corpus provides clean entity-fact pairs, which is an idealized setting. Real-world pretraining data is noisier and contains many fact types beyond named-entity knowledge. Broader applicability remains an open question.
What This Means
This research offers a concrete, compute-efficient remedy for factual hallucination in LLMs: by framing the problem information-theoretically, the authors show that smarter data curation can substitute for model scaling. Matching a 1.3B-parameter model's factual performance with a 10X smaller model through data pruning alone changes the cost calculus for building knowledge-intensive systems. The loss-based selection signal is an elegant, pipeline-compatible mechanism that requires no costly data annotation. The main open question is whether the approach scales beyond clean entity-fact corpora to the messier knowledge types that characterize real-world pretraining datasets.
