AI safety for technical people

The technical route into AI safety, for engineers and ML practitioners alike: the failure modes that motivate the field, the training techniques behind today's safety pipelines, and the hands-on work—interpretability, red teaming, evals—happening now.

[1hr Talk] Intro to Large Language Models | YouTube | Intermediate
Karpathy's one-hour grounding in how the systems you'll be studying actually work.
Concrete Problems in AI Safety | Academic Papers | Advanced | ~45 min read
The agenda that made safety a concrete engineering problem—and the failure modes that still frame it.
Unsolved Problems in ML Safety | Academic Papers | Intermediate
The updated research agenda: robustness, monitoring, alignment, and systemic safety.
Scaling Laws for Neural Language Models | Academic Papers | Advanced
Why loss falls predictably as models, data, and compute scale: the trend line safety has to reckon with.
Risks from Learned Optimization | Academic Papers | Advanced | ~70 min read
Mesa-optimization and deceptive alignment, the core inner-alignment worry.
Goal Misgeneralization | Academic Papers | Advanced
How a capable model can pursue the wrong goal even with a correct training signal.
Deep Reinforcement Learning from Human Preferences | Academic Papers | Advanced
The preference-learning method RLHF is built on.
Training a Helpful and Harmless Assistant with RLHF | Academic Papers | Advanced | ~2 hr read
The engineering of an RLHF safety pipeline, end to end.
Direct Preference Optimization (DPO) | Academic Papers | Advanced
The simpler alternative to RLHF that reframes what preference training is doing.
Constitutional AI: Harmlessness from AI Feedback | Academic Papers | Advanced
A current, deployed approach to scalable oversight.
Weak-to-Strong Generalization | Academic Papers | Advanced
The core question of superalignment: can weaker supervisors align stronger models?
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training | Academic Papers | Advanced
Empirical evidence that deceptive behavior can survive standard safety training.
Red Teaming Language Models to Reduce Harms | Academic Papers | Advanced
A repeatable methodology for finding model failures.
Jailbroken | Academic Papers | Advanced
Why safety training fails: competing objectives and mismatched generalization, the two failure modes behind most jailbreaks.
Universal Adversarial Attacks | Academic Papers | Advanced
Automatically generated attack suffixes that transfer across models.
TruthfulQA | Academic Papers | Advanced
A benchmark that shows measuring truthfulness is harder than it looks.
Discovering Latent Knowledge in Language Models Without Supervision | Academic Papers | Advanced
An interpretability method aimed at detecting what a model 'believes'.
Transformer Circuits | Websites | Advanced
The running research thread reverse-engineering what transformers compute.
ARENA (Alignment Research Engineer Accelerator) | Courses | Advanced
Hands-on engineering curriculum—implement the methods instead of just reading about them.
LessWrong | Websites | Intermediate
Where much of the technical alignment discussion happens in the open.
