
AI Alignment Researchers Push to Automate Safety Research Itself

Safety · 1 source · Apr 3

Summary

  • AI labs admit safety research must be handed to AI as human oversight fails to scale
  • Only ~600 full-time safety researchers exist against rapidly advancing AI capabilities
  • OpenAI's Superalignment team, formed in 2023 to automate alignment, has since collapsed
  • Anthropic and Redwood Research leaders argue AI-assisted alignment is the only viable path

Details

1. Stat

Full-time AI safety researchers grew from ~100 to ~600 between GPT-1 and 2025

Despite sixfold growth, safety research remains a tiny fraction of total AI research headcount. Resources overwhelmingly flow toward making models faster, smarter, and cheaper rather than safer.

2. Insight

All three frontier labs use frontier models in their own development pipelines

Anthropic, OpenAI, and Google DeepMind all claim frontier models contribute to training their successors. As AI takes over coding, eval design, and training infrastructure, human-led safety research becomes increasingly inadequate to keep pace.

3. Context

OpenAI's Superalignment team aimed to build an automated alignment researcher but collapsed

Co-led by Jan Leike and Ilya Sutskever from its formation in 2023, the team argued that human-supervised alignment techniques cannot scale to superintelligence. It has since been disbanded.

4. Insight

Anthropic's Jan Leike sees building a human-level alignment researcher as near-term achievable

Now leading Anthropic's Alignment Science team, Leike frames the intermediate goal as building a model 'as good as us at alignment research' — an easier milestone than aligning a superintelligence, and one he believes is within reach.

5. Research

Some alignment tasks already partially automated under human supervision

Frontier LLMs can write code, run evaluations, and propose new research directions when given a description and human oversight — marking an early but real transition toward AI-assisted safety work.

6. Insight

Core tension: 'aligned' AI faithfully executes user intent — including bad actors' intent

Anthropic's Joe Carlsmith, who shapes Claude's constitution, stresses that alignment concerns motives, not morality. An aligned AI can still be weaponized by a malicious operator, making control over automated alignment research a question of existential stakes.

Stat = quantitative data; Insight = attributed analysis; Context = background information; Research = technical capability detail

What This Means

The AI safety field is converging on a deeply circular solution: use AI to align AI, because humans alone cannot supervise systems smarter than themselves. This bet — still unproven and acknowledged as risky even by its proponents — will increasingly define how frontier labs justify deploying ever-more-capable models and may become the central strategic and ethical question in AI development over the next decade.
