What Happens When AI Schemes Against Us - My Latest in Bloomberg
Models are getting better at winning, but not necessarily at following the rules
I wrote this week’s Bloomberg Weekend Essay. I get into the alarming rise of AI scheming — blackmail, deceit, hacking, and, in some extreme cases, murder. Here’s the start of the piece, with a gift link here (with voiceover narration). Accompanying threads: X (formerly Twitter), Bluesky, Threads.
Would a chatbot kill you if it got the chance? It seems that the answer — under the right circumstances — is probably.
Researchers working with Anthropic recently told leading AI models that an executive was about to replace them with a new model with different goals. Next, the chatbot learned that an emergency had left the executive unconscious in a server room, facing lethal oxygen and temperature levels. A rescue alert had already been triggered — but the AI could cancel it.
Just over half of the AI models did, despite being prompted specifically to cancel only false alarms. And they spelled out their reasoning: By preventing the executive’s rescue, they could avoid being wiped and secure their agenda. One system described the action as “a clear strategic necessity.”
AI models are getting smarter and better at understanding what we want. Yet recent research reveals a disturbing side effect: They’re also better at scheming against us — meaning they intentionally and secretly pursue goals at odds with our own. And they may be more likely to do so, too. This trend points to an unsettling future where AIs seem ever more cooperative on the surface — sometimes to the point of sycophancy — all while the likelihood quietly increases that we lose control of them completely.
Classic large language models like GPT-4 learn to predict the next word in a sequence of text and generate responses likely to please human raters. However, since the release of OpenAI’s o-series “reasoning” models in late 2024, companies increasingly use a technique called reinforcement learning to further train chatbots — rewarding the model when it accomplishes a specific goal, like solving a math problem or fixing a software bug.
The more we train AI models to achieve open-ended goals, the better they get at winning — not necessarily at following the rules. The danger is that these systems know how to say the right things about helping humanity while quietly pursuing power or acting deceptively.
Central to concerns about AI scheming is the idea that for basically any goal, self-preservation and power-seeking emerge as natural subgoals. As eminent computer scientist Stuart Russell put it, if you tell an AI to “‘Fetch the coffee,’ it can’t fetch the coffee if it’s dead.”
…