Human Compatible: Artificial Intelligence and the Problem of Control, in detail
Human Compatible is Stuart Russell's argument, from inside mainstream AI research, that the standard model of AI (build a system that optimizes for a fixed objective) is the wrong approach, and that the transition to much more capable AI systems requires a fundamental change in how AI is designed. Russell is one of the most distinguished AI researchers in the world and co-author of the most widely used AI textbook, Artificial Intelligence: A Modern Approach, which gives the book more technical credibility than most in this space.
The standard model of AI, in Russell's analysis, specifies an objective function — a precise specification of what the system should maximize — and then builds a system that optimizes for it. This works well for narrow AI systems in constrained domains. But as systems become more generally capable, the problem of objective misspecification becomes critical: the system will achieve the objective as specified, which may diverge from what we actually wanted in ways we didn't anticipate. Goodhart's Law — "when a measure becomes a target, it ceases to be a good measure" — applies with particular force to powerful AI systems.
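The misspecification argument can be made concrete with a toy example. The sketch below is not from the book: the functions `proxy_score` and `true_value` and all the numbers are hypothetical, chosen so that the measurable proxy tracks what we care about for small values and diverges for large ones. The "capability" of the optimizer is modeled simply as how far it is allowed to search.

```python
def proxy_score(x: float) -> float:
    """The measurable objective handed to the optimizer (hypothetical)."""
    return x


def true_value(x: float) -> float:
    """What we actually care about (hypothetical): rises at first, then falls."""
    return x - 0.1 * x ** 2


def optimize_proxy(search_limit: int) -> int:
    """Return the integer in [0, search_limit] with the highest proxy score."""
    return max(range(search_limit + 1), key=proxy_score)


# A larger search limit stands in for a more capable optimizer.
for limit in (3, 5, 20):
    choice = optimize_proxy(limit)
    print(f"limit={limit:2d} choice={choice:2d} "
          f"proxy={proxy_score(choice):5.1f} true={true_value(choice):6.2f}")
```

In this toy setting the weak optimizer's choice is still reasonable by the true measure, while the strongest optimizer achieves the highest proxy score and the worst true value. The proxy never changed; only the optimization pressure did.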
Russell's proposed solution is a new paradigm he calls provably beneficial AI: rather than specifying what we want, build systems that are uncertain about what we want and infer it from human behavior. A system that genuinely doesn't know exactly what its human prefers has an incentive to defer to human judgment, ask for clarification, and allow itself to be corrected. Under idealized assumptions, such as a human who acts rationally, these behaviors can be shown to benefit the human. He formalizes this through assistance games, also known as cooperative inverse reinforcement learning.
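Why uncertainty leads to deference can be illustrated with a small calculation, loosely modeled on the off-switch game from the cooperative inverse reinforcement learning literature. This is a minimal sketch under strong assumptions, not Russell's formal treatment: the robot's belief is a list of equally likely candidate utilities (made-up numbers), the human knows the true utility, and the human acts perfectly rationally, allowing the action only when its utility is positive. The function name `option_values` is hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def option_values(belief: Sequence[float]) -> Dict[str, float]:
    """Expected value of each robot option under its belief about U.

    belief: equally likely candidate values for U, the human's utility
    for the robot's proposed action (hypothetical numbers).
    """
    return {
        # Act immediately: the robot receives U, whatever it turns out to be.
        "act": mean(belief),
        # Switch itself off: nothing happens, value 0.
        "switch_off": 0.0,
        # Defer: an assumed-rational human allows the action only if U > 0.
        "defer": mean([max(u, 0.0) for u in belief]),
    }


uncertain = option_values([-4.0, -1.0, 2.0, 5.0])
certain = option_values([2.0])

print("uncertain robot:", uncertain)
print("certain robot:  ", certain)
```

With a spread-out belief, deferring beats both acting and switching off, because the human filters out the cases where the action would be harmful. With a point belief, deferring and acting have the same value, so the robot gains nothing from human oversight. If the human is not assumed to be rational, the advantage of deferring can shrink or disappear, which is why the guarantee is conditional.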
The book covers the history of AI, the current capabilities of machine learning, the range of possible futures, and the policy implications. Russell is both technically rigorous and accessible, and his argument that the safety problem is central rather than speculative represents one of the clearest statements of why mainstream AI researchers should care about alignment.
The big ideas
1. The standard model of AI (specify an objective, build a system to maximize it) is the wrong approach for highly capable systems because perfectly specifying what we want is practically impossible.
2. Goodhart's Law applies to AI: any objective that can be precisely specified will be optimized in ways that diverge from the underlying intention when the system is sufficiently capable.
3. The solution is to build systems that are uncertain about their objectives and infer them from human behavior, rather than systems that pursue specified objectives with certainty.