RL Fine-Tuning Enables Small 4B Models to Match Large LLMs as Recursive Agents

Research1 source·May 13

reinforcement-learning llm agents fine-tuning claude benchmarks skyrl rlm

Summary

• RL fine-tuning teaches 4B models to act as recursive language models natively
• A single shared policy trains both parent decomposer and child sub-agent roles
• The 4B model matches Claude Sonnet 4.6 on evidence selection at a fraction of the size and cost
• Training code and RLM scaffold are open-sourced via SkyRL

Adjust signal

Details

#	Type	Key Point	Context
1	Research	RL fine-tuning instills native RLM behavior in 4B models	Supervised fine-tuning and prompting cannot reliably elicit recursive language model behavior in small models. Reinforcement learning fine-tuning is shown to be necessary to produce consistent, task-specific RLM behavior at this scale.
2	New Tech	Single shared policy trains both parent and child RLM roles simultaneously	Rather than maintaining separate models for the orchestrating parent and the spawned child sub-agents, one 4B model is trained to play both roles. Child rollout advantages are inherited from the parent rollouts that spawned them, eliminating the need for additional reward signals.
3	Tech Info	RLMs use a Python REPL as their primary environment for inspecting and decomposing context	Context is stored as an external object rather than placed directly in the context window. Each turn the model writes code, the REPL executes it, and results are returned as the next user message. Built-in functions include FINAL(), rlm_query() for a single child, and rlm_query_batched() for parallel child dispatch.
4	Stat	RL fine-tuned 4B model matches Claude Sonnet 4.6 on evidence selection task	On an evidence selection benchmark over multiple scientific documents, the fine-tuned 4B model achieves parity with Claude Sonnet 4.6 using the identical RLM harness and REPL environment, at a fraction of the size and cost.
5	Insight	RLMs address long-context problems by externalizing context and decomposing recursively	Instead of expanding context windows, RLMs treat context as a programmatic object to be inspected and broken down through recursive self-calls, making them well-suited for tasks involving large documents.
6	Infrastructure	Full training and evaluation code open-sourced via SkyRL	The release includes RL training scripts, the RLM scaffold implementation, and the evidence selection environment, lowering the barrier for teams to reproduce results or adapt the approach to their own long-context tasks.
7	Strategy	Approach targets production deployment of cheap, purpose-built RLMs	RL fine-tuning produces models with predictable latency and stable behavior, addressing two core operational weaknesses of prompt-tuned RLMs that have limited real-world adoption.

1.Research

RL fine-tuning instills native RLM behavior in 4B models

Supervised fine-tuning and prompting cannot reliably elicit recursive language model behavior in small models. Reinforcement learning fine-tuning is shown to be necessary to produce consistent, task-specific RLM behavior at this scale.

2.New Tech

Single shared policy trains both parent and child RLM roles simultaneously

Rather than maintaining separate models for the orchestrating parent and the spawned child sub-agents, one 4B model is trained to play both roles. Child rollout advantages are inherited from the parent rollouts that spawned them, eliminating the need for additional reward signals.

3.Tech Info

RLMs use a Python REPL as their primary environment for inspecting and decomposing context

Context is stored as an external object rather than placed directly in the context window. Each turn the model writes code, the REPL executes it, and results are returned as the next user message. Built-in functions include FINAL(), rlm_query() for a single child, and rlm_query_batched() for parallel child dispatch.

4.Stat

RL fine-tuned 4B model matches Claude Sonnet 4.6 on evidence selection task

On an evidence selection benchmark over multiple scientific documents, the fine-tuned 4B model achieves parity with Claude Sonnet 4.6 using the identical RLM harness and REPL environment, at a fraction of the size and cost.

5.Insight

RLMs address long-context problems by externalizing context and decomposing recursively

Instead of expanding context windows, RLMs treat context as a programmatic object to be inspected and broken down through recursive self-calls, making them well-suited for tasks involving large documents.

6.Infrastructure

Full training and evaluation code open-sourced via SkyRL

The release includes RL training scripts, the RLM scaffold implementation, and the evidence selection environment, lowering the barrier for teams to reproduce results or adapt the approach to their own long-context tasks.

7.Strategy

Approach targets production deployment of cheap, purpose-built RLMs

RL fine-tuning produces models with predictable latency and stable behavior, addressing two core operational weaknesses of prompt-tuned RLMs that have limited real-world adoption.

Research = academic/empirical finding, New Tech = novel technical approach, Tech Info = how a system works, Stat = quantitative result, Insight = analytical observation, Infrastructure = tooling/code release, Strategy = deployment or product positioning

What This Means

This research demonstrates that frontier-scale reasoning on complex long-document tasks is no longer exclusive to large, expensive models — a properly RL fine-tuned 4B model can match a much larger system when given the right recursive scaffolding. For teams building production AI pipelines, this opens a path to deploying capable document-analysis agents at a fraction of current inference costs. The open-source release of training code and the RLM scaffold means the technique is immediately reproducible and extensible, which could accelerate adoption of recursive architectures as a standard approach to long-context problems.

Sources

Reinforcing Recursive Language ModelsAlphaxiv

Similar Events

AWS Reinforcement Fine-Tuning with LLM-as-a-Judge Using Amazon Nova Models

Apr 30

Reasoning Boosts Factual Recall in LLMs — Even for Simple Single-Hop Questions

Mar 13