Summary

  • Claude Code ran ~910 experiments autonomously on a 16-GPU cluster in 8 hours
  • Agent reduced validation loss by 2.87% and discovered model width as key variable
  • Parallel search revealed parameter interaction effects impossible to catch sequentially
  • Agent self-discovered heterogeneous GPU types and optimized its own hardware strategy

Details

1. Stat

Agent submitted ~910 experiments over 8 hours across 16 GPUs

Default single-GPU Autoresearch runs ~12 experiments per hour. The 16-GPU cluster yielded ~114 experiments per hour (~910 experiments / 8 h), roughly 9.5x the single-GPU rate, but the qualitative shift in search strategy is the more significant change.

2. Research

Validation loss (val_bpb) dropped from 1.003 to 0.974 — a 2.87% improvement over baseline

The key finding was that model width mattered more than any other hyperparameter tested. The agent identified this by running six model width variants in a single parallel wave, converging in one round rather than six sequential ones.
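As a rough illustration of one such wave, the loop below submits a SkyPilot job per width variant; the width values, the task.yaml file (a task definition like the one sketched under item 5 below), and the WIDTH environment variable are hypothetical stand-ins, not details from the actual run.

```python
import subprocess

# Hypothetical width variants for one parallel wave; the values actually
# tested are not given in this summary.
WIDTHS = [256, 384, 512, 640, 768, 1024]

for width in WIDTHS:
    # One detached SkyPilot job per variant, each on its own GPU.
    # -c names the cluster, -d returns after submission, -y skips the
    # confirmation prompt; task.yaml and WIDTH are illustrative.
    subprocess.run(
        ["sky", "launch", "-y", "-d", "-c", f"width-{width}",
         "--env", f"WIDTH={width}", "task.yaml"],
        check=True,
    )
```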

3. Insight

Parallelism changed agent search behavior from greedy hill-climbing to factorial grid exploration

With one GPU the agent must test sequentially. With 16 GPUs it ran grids of 10-13 experiments simultaneously, which allowed it to detect interaction effects between parameters that sequential search structurally cannot surface.
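The structural difference is easy to see in code. A minimal sketch, assuming a hypothetical three-parameter search space (the real parameters and ranges are not listed in this summary):

```python
from itertools import product

# Hypothetical search space; the agent's real parameters and ranges are
# not listed in this summary.
space = {
    "width": [512, 768],
    "lr": [3e-4, 6e-4],
    "warmup_steps": [100, 300],
}

# Factorial grid: every combination runs in the same wave, so an
# interaction (e.g., a learning rate that only helps at larger width)
# shows up directly when the results come back together.
grid = [dict(zip(space, combo)) for combo in product(*space.values())]
print(len(grid))  # 8 configurations, submitted simultaneously

# A greedy hill-climber instead perturbs one parameter at a time from the
# current best, so it never observes two parameters changing together.
```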

4. New Tech

Agent autonomously developed a two-tier hardware strategy across heterogeneous GPU types

Without explicit instruction, the agent discovered it had access to both H100s and H200s and invented a screening pipeline: run candidate ideas on H100s to filter cheaply, then promote promising experiments to H200s for full validation.
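A minimal sketch of that screening idea, with hypothetical names and a made-up promotion threshold (run_experiment, PROMOTE_THRESHOLD, and the simulated metric are illustrative, not details from the run):

```python
import random

PROMOTE_THRESHOLD = 1.000  # made-up cutoff for "promising"

def run_experiment(config: dict, gpu: str) -> float:
    """Stand-in for a real training job pinned to `gpu`; returns a fake
    val_bpb so the control flow is runnable as-is."""
    return random.uniform(0.95, 1.05)

def two_tier(candidates: list[dict]) -> list[tuple[dict, float]]:
    validated = []
    for config in candidates:
        # Tier 1: cheap screening pass on an H100 to filter ideas.
        if run_experiment(config, gpu="H100") < PROMOTE_THRESHOLD:
            # Tier 2: promote survivors to an H200 for full validation.
            validated.append((config, run_experiment(config, gpu="H200")))
    return validated

print(two_tier([{"width": w} for w in (512, 768, 1024)]))
```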

5. Infrastructure

SkyPilot gave the agent autonomous cluster management on Kubernetes

SkyPilot is an open-source tool that abstracts cloud and Kubernetes job submission. The agent reads the SkyPilot skill, then launches and manages GPU jobs through short YAML experiment definitions, with no manual cloud setup required.
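For reference, a SkyPilot task of this kind is a few lines of YAML; the sketch below writes one from Python and launches it (resources, workdir, envs, and run are real SkyPilot fields, while train_gpt.py and WIDTH are placeholders):

```python
import pathlib
import subprocess

# Minimal SkyPilot task definition. 'resources', 'workdir', 'envs', and
# 'run' are real SkyPilot YAML fields; the script and flag are placeholders.
TASK_YAML = """\
resources:
  accelerators: H100:1   # or H200:1 for the validation tier

workdir: .

envs:
  WIDTH: "768"           # overridable per job with --env WIDTH=...

run: |
  python train_gpt.py --width $WIDTH
"""

pathlib.Path("task.yaml").write_text(TASK_YAML)

# Provision (or reuse) a cluster and run the task; -d returns after
# submission, -y skips the confirmation prompt.
subprocess.run(["sky", "launch", "-y", "-d", "-c", "autoresearch", "task.yaml"],
               check=True)
```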

6. Context

Autoresearch is Andrej Karpathy's project for autonomous neural network script improvement

The setup is three files: a read-only data/eval script, the GPT training script the agent modifies, and an instructions file. Each experiment runs under a fixed 5-minute wall-clock budget. Karpathy's original overnight run found ~20 improvements amounting to an 11% reduction in time-to-GPT-2 on the nanochat leaderboard.
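The wall-clock budget is straightforward to enforce with a subprocess timeout; in this sketch, train_gpt.py is a placeholder for the training script, not the project's actual filename:

```python
import subprocess

# Kill any experiment that exceeds the fixed 5-minute wall-clock budget.
# 'train_gpt.py' stands in for the training script the agent modifies.
try:
    subprocess.run(["python", "train_gpt.py"], timeout=300, check=True)
except subprocess.TimeoutExpired:
    print("Experiment exceeded the 5-minute budget and was terminated.")
```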

Stat = quantitative result, Research = experimental finding, Insight = behavioral observation, New Tech = emergent capability, Infrastructure = tooling/setup, Context = background

What This Means

This experiment demonstrates that giving an AI coding agent parallel compute does more than speed up search — it qualitatively changes the search strategy, enabling the agent to detect parameter interactions and self-optimize resource allocation in ways that are impossible with sequential execution. The emergent hardware tiering behavior, where the agent invented its own cost-efficiency strategy without being told to, is a notable signal that agents operating over larger resource pools will develop non-obvious optimization behaviors. For AI engineers and ML researchers, this points toward a near-term workflow where agentic systems run continuous parallelized hyperparameter and architecture search with minimal human supervision, compressing experiment cycles from days to hours.
