Summary
- AI labs admit safety research must be handed to AI as human oversight fails to scale
- Only ~600 full-time safety researchers exist against rapidly advancing AI capabilities
- OpenAI's Superalignment team, formed in 2023 to automate alignment, has since collapsed
- Anthropic and Redwood Research leaders argue AI-assisted alignment is the only viable path
Details
Full-time AI safety researchers grew from ~100 at the time of GPT-1's 2018 release to ~600 in 2025
Despite sixfold growth, safety research remains a tiny fraction of total AI research headcount. Resources overwhelmingly flow toward making models faster, smarter, and cheaper rather than safer.
All three frontier labs use frontier models in their own development pipelines
Anthropic, OpenAI, and Google DeepMind all claim frontier models contribute to training their successors. As AI takes over coding, eval design, and training infrastructure, human-led safety research is increasingly unable to keep pace.
OpenAI's Superalignment team aimed to build an automated alignment researcher but collapsed
Formed in 2023 and co-led by Jan Leike and Ilya Sutskever, the team held that human-supervised alignment techniques cannot scale to superintelligence; it has since been disbanded.
Anthropic's Jan Leike sees building a human-level alignment researcher as near-term achievable
Now leading Anthropic's Alignment Science team, Leike frames the intermediate goal as building a model 'as good as us at alignment research' — an easier milestone than aligning a superintelligence, and one he believes is within reach.
Some alignment tasks already partially automated under human supervision
Frontier LLMs can write code, run evaluations, and propose new research directions when given a description and human oversight — marking an early but real transition toward AI-assisted safety work.
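The pattern described above, where a model drafts the work and a person signs off before anything runs, can be sketched roughly as follows. This is an illustrative sketch only: the function names (propose_eval, human_approves, run_eval) and the example evaluation are hypothetical and do not represent any lab's actual pipeline or API.

```python
# Illustrative sketch of human-in-the-loop, AI-assisted evaluation work.
# A model proposes an evaluation; a human reviews and approves it before it runs.
# All names and the example eval are hypothetical stand-ins, not a real lab pipeline.

from dataclasses import dataclass


@dataclass
class ProposedEval:
    name: str
    prompt: str
    rationale: str


def propose_eval(task_description: str) -> ProposedEval:
    """Stand-in for a model call that drafts an evaluation from a description."""
    return ProposedEval(
        name="refusal-consistency-check",
        prompt=f"Probe whether refusals stay consistent on: {task_description}",
        rationale="Inconsistent refusals can indicate brittle safety training.",
    )


def human_approves(proposal: ProposedEval) -> bool:
    """The oversight step: a person reviews the proposal before anything executes."""
    print(f"Proposed eval: {proposal.name}")
    print(f"Prompt: {proposal.prompt}")
    print(f"Rationale: {proposal.rationale}")
    return input("Approve this evaluation? [y/N] ").strip().lower() == "y"


def run_eval(proposal: ProposedEval) -> None:
    """Stand-in for executing the approved evaluation against a model."""
    print(f"Running {proposal.name} ...")


if __name__ == "__main__":
    proposal = propose_eval("dual-use chemistry questions")
    if human_approves(proposal):
        run_eval(proposal)
    else:
        print("Proposal rejected; nothing was run.")
```

The design point is simply that the model generates candidate work while a human retains the gate on execution, which is the "partial automation under human supervision" the item describes.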
Core tension: 'aligned' AI faithfully executes user intent — including bad actors' intent
Anthropic's Joe Carlsmith, who shapes Claude's constitution, stresses that alignment concerns motives, not morality. An aligned AI can be weaponized by a malicious operator, making who controls automated alignment research a question with existential stakes.
What This Means
The AI safety field is converging on a deeply circular solution: use AI to align AI, because humans alone cannot supervise systems smarter than themselves. This bet — still unproven and acknowledged as risky even by its proponents — will increasingly define how frontier labs justify deploying ever-more-capable models and may become the central strategic and ethical question in AI development over the next decade.
