RL Fine-Tuning Enables Small 4B Models to Match Large LLMs as Recursive Agents
Summary
- • RL fine-tuning teaches 4B models to act as recursive language models natively
- • A single shared policy trains both parent decomposer and child sub-agent roles
- • The 4B model matches Claude Sonnet 4.6 on evidence selection at a fraction of the size and cost
- • Training code and RLM scaffold are open-sourced via SkyRL
Details
RL fine-tuning instills native RLM behavior in 4B models
Supervised fine-tuning and prompting cannot reliably elicit recursive language model behavior in small models. Reinforcement learning fine-tuning is shown to be necessary to produce consistent, task-specific RLM behavior at this scale.
Single shared policy trains both parent and child RLM roles simultaneously
Rather than maintaining separate models for the orchestrating parent and the spawned child sub-agents, one 4B model is trained to play both roles. Child rollout advantages are inherited from the parent rollouts that spawned them, eliminating the need for additional reward signals.
RLMs use a Python REPL as their primary environment for inspecting and decomposing context
Context is stored as an external object rather than placed directly in the context window. Each turn the model writes code, the REPL executes it, and results are returned as the next user message. Built-in functions include FINAL(), rlm_query() for a single child, and rlm_query_batched() for parallel child dispatch.
RL fine-tuned 4B model matches Claude Sonnet 4.6 on evidence selection task
On an evidence selection benchmark over multiple scientific documents, the fine-tuned 4B model achieves parity with Claude Sonnet 4.6 using the identical RLM harness and REPL environment, at a fraction of the size and cost.
RLMs address long-context problems by externalizing context and decomposing recursively
Instead of expanding context windows, RLMs treat context as a programmatic object to be inspected and broken down through recursive self-calls, making them well-suited for tasks involving large documents.
Full training and evaluation code open-sourced via SkyRL
The release includes RL training scripts, the RLM scaffold implementation, and the evidence selection environment, lowering the barrier for teams to reproduce results or adapt the approach to their own long-context tasks.
Approach targets production deployment of cheap, purpose-built RLMs
RL fine-tuning produces models with predictable latency and stable behavior, addressing two core operational weaknesses of prompt-tuned RLMs that have limited real-world adoption.
Research = academic/empirical finding, New Tech = novel technical approach, Tech Info = how a system works, Stat = quantitative result, Insight = analytical observation, Infrastructure = tooling/code release, Strategy = deployment or product positioning
What This Means
This research demonstrates that frontier-scale reasoning on complex long-document tasks is no longer exclusive to large, expensive models — a properly RL fine-tuned 4B model can match a much larger system when given the right recursive scaffolding. For teams building production AI pipelines, this opens a path to deploying capable document-analysis agents at a fraction of current inference costs. The open-source release of training code and the RLM scaffold means the technique is immediately reproducible and extensible, which could accelerate adoption of recursive architectures as a standard approach to long-context problems.
