Summary
- AI labs admit safety research must be handed to AI as human oversight fails to scale
- Only ~600 full-time safety researchers exist against rapidly advancing AI capabilities
- OpenAI's Superalignment team, formed in 2023 to automate alignment, has since collapsed
- Anthropic and Redwood Research leaders argue AI-assisted alignment is the only viable path
Details
Full-time AI safety researchers grew from ~100 at the time of GPT-1's 2018 release to ~600 in 2025
Despite sixfold growth, safety research remains a tiny fraction of total AI research headcount. Resources overwhelmingly flow toward making models faster, smarter, and cheaper rather than safer.
All three frontier labs use frontier models in their own development pipelines
Anthropic, OpenAI, and Google DeepMind all claim frontier models contribute to training their successors. As AI takes over coding, eval design, and training infrastructure, human-led safety research is increasingly unable to keep pace.
OpenAI's Superalignment team aimed to build an automated alignment researcher but collapsed
Formed in 2023 and co-led by Jan Leike and Ilya Sutskever, the team held that human-supervised alignment techniques cannot scale to superintelligence; it has since been disbanded.
Anthropic's Jan Leike sees building a human-level alignment researcher as near-term achievable
Now leading Anthropic's Alignment Science team, Leike frames the intermediate goal as building a model 'as good as us at alignment research' — an easier milestone than aligning a superintelligence, and one he believes is within reach.
Some alignment tasks already partially automated under human supervision
Frontier LLMs can write code, run evaluations, and propose new research directions when given a description and human oversight — marking an early but real transition toward AI-assisted safety work.
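The pattern described above, where a model drafts the work and a person signs off before anything runs, can be sketched roughly as follows. This is an illustrative sketch only: the function names (propose_eval, human_approves, run_eval) and the example evaluation are hypothetical and do not represent any lab's actual pipeline or API.

```python
# Illustrative sketch of human-in-the-loop, AI-assisted evaluation work.
# A model proposes an evaluation; a human reviews and approves it before it runs.
# All names and the example eval are hypothetical stand-ins, not a real lab pipeline.

from dataclasses import dataclass


@dataclass
class ProposedEval:
    name: str
    prompt: str
    rationale: str


def propose_eval(task_description: str) -> ProposedEval:
    """Stand-in for a model call that drafts an evaluation from a description."""
    return ProposedEval(
        name="refusal-consistency-check",
        prompt=f"Probe whether refusals stay consistent on: {task_description}",
        rationale="Inconsistent refusals can indicate brittle safety training.",
    )


def human_approves(proposal: ProposedEval) -> bool:
    """The oversight step: a person reviews the proposal before anything executes."""
    print(f"Proposed eval: {proposal.name}")
    print(f"Prompt: {proposal.prompt}")
    print(f"Rationale: {proposal.rationale}")
    return input("Approve this evaluation? [y/N] ").strip().lower() == "y"


def run_eval(proposal: ProposedEval) -> None:
    """Stand-in for executing the approved evaluation against a model."""
    print(f"Running {proposal.name} ...")


if __name__ == "__main__":
    proposal = propose_eval("dual-use chemistry questions")
    if human_approves(proposal):
        run_eval(proposal)
    else:
        print("Proposal rejected; nothing was run.")
```

The design point is simply that the model generates candidate work while a human retains the gate on execution, which is the "partial automation under human supervision" the item describes.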
Core tension: 'aligned' AI faithfully executes user intent — including bad actors' intent
Anthropic's Joe Carlsmith, who shapes Claude's constitution, stresses that alignment concerns motives, not morality. An aligned AI can be weaponized by a malicious operator, making who controls automated alignment research a question with existential stakes.
What This Means
The AI safety field is converging on a deeply circular solution: use AI to align AI, because humans alone cannot supervise systems smarter than themselves. This bet — still unproven and acknowledged as risky even by its proponents — will increasingly define how frontier labs justify deploying ever-more-capable models and may become the central strategic and ethical question in AI development over the next decade.
