← Back to feed
7

RL Fine-Tuning Enables Small 4B Models to Match Large LLMs as Recursive Agents

Research1 source·May 13

Summary

  • • RL fine-tuning teaches 4B models to act as recursive language models natively
  • • A single shared policy trains both parent decomposer and child sub-agent roles
  • • The 4B model matches Claude Sonnet 4.6 on evidence selection at a fraction of the size and cost
  • • Training code and RLM scaffold are open-sourced via SkyRL
Adjust signal

Details

1.Research

RL fine-tuning instills native RLM behavior in 4B models

Supervised fine-tuning and prompting cannot reliably elicit recursive language model behavior in small models. Reinforcement learning fine-tuning is shown to be necessary to produce consistent, task-specific RLM behavior at this scale.

2.New Tech

Single shared policy trains both parent and child RLM roles simultaneously

Rather than maintaining separate models for the orchestrating parent and the spawned child sub-agents, one 4B model is trained to play both roles. Child rollout advantages are inherited from the parent rollouts that spawned them, eliminating the need for additional reward signals.

3.Tech Info

RLMs use a Python REPL as their primary environment for inspecting and decomposing context

Context is stored as an external object rather than placed directly in the context window. Each turn the model writes code, the REPL executes it, and results are returned as the next user message. Built-in functions include FINAL(), rlm_query() for a single child, and rlm_query_batched() for parallel child dispatch.

4.Stat

RL fine-tuned 4B model matches Claude Sonnet 4.6 on evidence selection task

On an evidence selection benchmark over multiple scientific documents, the fine-tuned 4B model achieves parity with Claude Sonnet 4.6 using the identical RLM harness and REPL environment, at a fraction of the size and cost.

5.Insight

RLMs address long-context problems by externalizing context and decomposing recursively

Instead of expanding context windows, RLMs treat context as a programmatic object to be inspected and broken down through recursive self-calls, making them well-suited for tasks involving large documents.

6.Infrastructure

Full training and evaluation code open-sourced via SkyRL

The release includes RL training scripts, the RLM scaffold implementation, and the evidence selection environment, lowering the barrier for teams to reproduce results or adapt the approach to their own long-context tasks.

7.Strategy

Approach targets production deployment of cheap, purpose-built RLMs

RL fine-tuning produces models with predictable latency and stable behavior, addressing two core operational weaknesses of prompt-tuned RLMs that have limited real-world adoption.

Research = academic/empirical finding, New Tech = novel technical approach, Tech Info = how a system works, Stat = quantitative result, Insight = analytical observation, Infrastructure = tooling/code release, Strategy = deployment or product positioning

What This Means

This research demonstrates that frontier-scale reasoning on complex long-document tasks is no longer exclusive to large, expensive models — a properly RL fine-tuned 4B model can match a much larger system when given the right recursive scaffolding. For teams building production AI pipelines, this opens a path to deploying capable document-analysis agents at a fraction of current inference costs. The open-source release of training code and the RLM scaffold means the technique is immediately reproducible and extensible, which could accelerate adoption of recursive architectures as a standard approach to long-context problems.

Sources

Similar Events