AI safety for technical people
The technical route into AI safety, for engineers and ML practitioners alike: the failure modes that motivate the field, the training techniques behind today's safety pipelines, and the hands-on work—interpretability, red teaming, evals—happening now.
- [1hr Talk] Intro to Large Language Models (YouTube)
Karpathy's one-hour grounding in how the systems you'll be studying actually work.
- Concrete Problems in AI Safety (Academic Papers)
The agenda that made safety a concrete engineering problem—and the failure modes that still frame it.
- Unsolved Problems in ML Safety (Academic Papers)
The updated research agenda: robustness, monitoring, alignment, and systemic safety.
- Scaling Laws for Neural Language Models (Academic Papers)
Why capabilities keep improving predictably—the trend line safety has to reckon with.
- Risks from Learned Optimization (Academic Papers)
Mesa-optimization and deceptive alignment, the core inner-alignment worry.
- Goal Misgeneralization (Academic Papers)
How a capable model can pursue the wrong goal even with a correct training signal.
- Deep Reinforcement Learning from Human Preferences (Academic Papers)
The preference-learning method RLHF is built on.
- Training a Helpful and Harmless Assistant with RLHF (Academic Papers)
The engineering of an RLHF safety pipeline, end to end.
- Direct Preference Optimization (DPO) (Academic Papers)
The simpler alternative to RLHF that reframes what preference training is doing.
- Constitutional AI: Harmlessness from AI Feedback (Academic Papers)
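A deployed approach to scalable oversight: the model critiques and revises its own outputs against a written set of principles, so AI feedback replaces human labels for harmlessness.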
- Weak-to-Strong Generalization (Academic Papers)
The core question of superalignment: can weaker supervisors align stronger models?
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Academic Papers)
Empirical evidence that deceptive behavior can survive standard safety training.
- Red Teaming Language Models to Reduce Harms (Academic Papers)
A repeatable methodology for finding model failures.
- Universal Adversarial Attacks (Academic Papers)
Automatically generated attack suffixes that transfer across models.
- Discovering Latent Knowledge in Language Models Without Supervision (Academic Papers)
An unsupervised probing method (Contrast-Consistent Search) aimed at detecting what a model 'believes' from its internal activations.
- Transformer Circuits (Websites)
The running research thread reverse-engineering what transformers compute.
- ARENA (Alignment Research Engineer Accelerator) (Courses)
Hands-on engineering curriculum—implement the methods instead of just reading about them.