Exclusive: Anthropic is Quietly Backpedalling on its Safety Commitments
The company released a model it classified as risky — without meeting requirements it had previously promised to meet
After publication, this article was updated to include an additional response from Anthropic and to clarify that while the company's version history webpage doesn't explicitly highlight changes to the original ASL-4 commitment, discussion of these changes can be found in a redline PDF linked on that page.
Anthropic just released Claude 4 Opus, its most capable AI model to date. But in doing so, the company may have abandoned one of its earliest promises.
In September 2023, Anthropic published its Responsible Scaling Policy (RSP), a first-of-its-kind safety framework that promises to gate increasingly capable AI systems behind increasingly robust safeguards. Other leading AI companies followed suit, releasing their own versions of RSPs. The US lacks binding regulations on frontier AI systems, and these plans remain voluntary.
The core idea behind the RSP and similar frameworks is to assess AI models for dangerous capabilities, like being able to self-replicate in the wild or help novices make bioweapons. The results of these evaluations determine the risk level of the model. If the model is found to be too risky, the company commits to not releasing it until sufficient mitigation measures are in place.
Earlier today, TIME published, then temporarily removed, an article revealing that the yet-to-be-announced Claude 4 Opus is the first Anthropic model to trigger the company's AI Safety Level 3 (ASL-3) protections, after safety evaluators found it may be able to assist novices in building bioweapons. (The updated article is time-stamped 6:45 AM Pacific but didn't go back up until 9:45 AM, suggesting the early publication was the result of a time zone mix-up.)
The article included striking quotes from Anthropic's chief scientist Jared Kaplan. "You could try to synthesize something like COVID or a more dangerous version of the flu — and basically, our modeling suggests that this might be possible," Kaplan said.
When Anthropic published its first RSP in September 2023, the company made a specific commitment about how it would handle increasingly capable models: "we will define ASL-2 (current system) and ASL-3 (next level of risk) now, and commit to define ASL-4 by the time we reach ASL-3, and so on." In other words, Anthropic promised it wouldn't release an ASL-3 model until it had figured out what ASL-4 meant.
Yet the company's latest RSP, updated May 14, doesn't publicly define ASL-4 — despite treating Claude 4 Opus as an ASL-3 model. Anthropic's announcement states it has "ruled out that Claude Opus 4 needs the ASL-4 Standard."
When asked about this, an Anthropic spokesperson told Obsolete that the 2023 RSP is "outdated" and pointed to an October 2024 revision that changed how ASL standards work. That revision replaced the commitment to define the next ASL in advance with a commitment to define new capability thresholds once a higher level is reached; the company now says ASLs map to increasingly stringent safety measures rather than requiring pre-defined future standards.
The spokesperson also pointed to past versions published on Anthropic's website, which include descriptions of major changes. The main version history doesn't explicitly flag the removal of the original commitment to define ASL-4 by the time ASL-3 was reached — though the change is recorded in the changelog of the RSP document itself and discussed in a redline PDF linked on the same page.
After publication, Anthropic reached out to say that the company does define capability thresholds for ASL-4 in its current RSP. However, the original 2023 commitment was more specific — it promised to define both capability thresholds and "warning sign evaluations" before training ASL-3 models. While the current RSP includes high-level capability thresholds for ASL-4, it doesn't include the detailed warning sign evaluations that were part of the original commitment.
But more fundamentally, what does a commitment mean if it can be walked back without public scrutiny?
When Obsolete posed a similar question, the Anthropic spokesperson pushed back, writing:
Would disagree that it's something that 'can be updated at any time.' We have a defined process in place for making updates to the RSP and are rigorous in how we refine and update our commitments.
The spokesperson highlighted the RSP's commitment that "Changes to this policy will be proposed by the CEO and the Responsible Scaling Officer and approved by the Board of Directors, in consultation with the Long-Term Benefit Trust."
This Trust, described by Anthropic in a September 2023 announcement as an independent body intended to ensure accountability, itself appears to have fallen short on a significant commitment. Citing the company's general counsel and its incorporation documents, TIME reported last May that the Long-Term Benefit Trust (LTBT) would appoint three out of five directors by November 2024. However, Anthropic's website currently lists only four directors.
Anthropic had not replied to a follow-up question about this by the time of publication.
When voluntary governance breaks down
In February 2024, a senior AI safety researcher at a leading AI company told me that voluntary governance approaches like these work, but only for a time: once you get close to human-level AI, competitive pressures take over.
Many AI insiders are increasingly predicting that human-level AI, often referred to as artificial general intelligence (AGI), is just around the corner. The influential essay series AI 2027, written by leading AI forecasters, predicts recursively self-improving AI systems by 2027 (Vice President JD Vance just told The New York Times that he's read the series).
These predictions coincide with an apparent uptick in corner-cutting and broken promises from leading AI companies. Google DeepMind went weeks after releasing its latest flagship model before publishing a safety report, and its first attempt in April was light on details. Also in April, the Financial Times reported that OpenAI's safety testing time had shrunk from months to days.
Last month, updates to OpenAI's standard model, GPT-4o, caused it to breathlessly affirm essentially anything you told it — a behavior Rolling Stone reported could dangerously interact with mental illness. Just days before the company rolled back the updates, an OpenAI employee bragged on X that "this is the quickest we've shipped an update to our main 4o line. Releases are accelerating, and the public is getting our best faster than ever."
Anthropic, however, has mostly managed to avoid scandals. The company was founded by safety-forward OpenAI researchers who became disillusioned with CEO Sam Altman, a story described in new detail by the just-released books The Optimist and Empire of AI.
Shortly before the TIME article was restored, Anthropic published its own announcement of Claude 4 Opus reaching ASL-3. The company emphasized it hasn't definitively determined whether the new model requires ASL-3 protections, but is implementing them as a "precautionary and provisional action" because it can't clearly rule out the risks.
Disclosure: I've received funding from the Omidyar Network as a Reporter in Residence. The Omidyar Network has also invested in Anthropic.
What ASL-3 actually means
Claude 4 Opus and Sonnet were made publicly available around 9:45 AM Pacific.
In the TIME article, which also went back up around 9:45 AM Pacific, Kaplan said that in internal testing, Claude 4 Opus performed more effectively than prior models at advising novices on producing biological weapons.

The ASL-3 threshold was designed in part to catch AI systems that could "substantially increase" someone's ability to obtain, produce, or deploy chemical, biological, radiological, or nuclear (CBRN) weapons. The protections required by reaching ASL-3 include enhanced cybersecurity to prevent model weight theft and deployment measures specifically targeting CBRN misuse — what Anthropic calls a "defense in depth" strategy.
To meet that bar, the company says it has rolled out safeguards like "constitutional classifiers" — AI systems that monitor inputs and outputs for dangerous CBRN-related content — along with enhanced jailbreak detection supported by a bug bounty program.
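To illustrate the general pattern, here is a minimal, hypothetical sketch of a "defense in depth" gate: one check screens the incoming prompt, another screens the model's output, and either can block the exchange. Every name, keyword, and threshold below is invented for illustration; Anthropic has not published the internals of its constitutional classifiers, which are far more sophisticated than anything this simple.

```python
# Illustrative sketch only. All names, keyword lists, and thresholds here are
# invented for this example; this is not Anthropic's implementation.

BLOCKED_TERMS = {"pathogen synthesis", "weaponization protocol"}  # placeholder list


def screen_text(text: str) -> bool:
    """Placeholder classifier: flag text containing any blocked term.
    A real system would use a trained model, not keyword matching."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def generate(prompt: str) -> str:
    """Stand-in for the underlying language model."""
    return f"Model response to: {prompt}"


def guarded_generate(prompt: str) -> str:
    # Layer 1: screen the incoming request before it reaches the model.
    if screen_text(prompt):
        return "Request declined by input safeguard."

    response = generate(prompt)

    # Layer 2: screen the model's output before it reaches the user.
    if screen_text(response):
        return "Response withheld by output safeguard."

    return response


if __name__ == "__main__":
    print(guarded_generate("Summarize today's AI policy news."))
```

The point of layering the checks is that either one can stop a harmful exchange even if the other misses it, which is what "defense in depth" refers to.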
A test of voluntary commitments
This moment reveals both the potential and the limitations of the industry's self-regulatory approach. On one hand, Anthropic appears to be following through on most of its commitments, implementing substantial safety measures even when uncertain they're needed.
On the other hand, these commitments remain voluntary with no external enforcement. As TIME notes, Anthropic itself is the judge of whether it's complying with the RSP. Breaking it carries no penalty beyond potential reputational damage. And, as we saw, these commitments can be quietly updated according to a process that Anthropic designed.
This is not to say that every detail of a safety plan should be set in stone. It's reasonable to update a framework based on new information. But there should be a clear distinction between clarifying details and reneging on an earlier commitment. Furthermore, real transparency would mean more clearly flagging any change significant enough to require LTBT approval.
What happens next
Anthropic says it will continue evaluating Claude 4 Opus' capabilities. If the company determines the model doesn't actually cross the ASL-3 threshold, it could downgrade to the more permissive ASL-2 protections. But for now, Anthropic says it's erring on the side of caution.
The bigger question is whether this precedent will hold as competition intensifies. With no binding safeguards on frontier AI development in the US, companies like Anthropic are essentially regulating themselves in public view. Whether that's sufficient for managing risks that Kaplan himself compares to pandemic-level threats remains an open — and urgent — question.
This moment exposes the limits of the self-regulatory model the AI industry has championed. If a company as safety-focused as Anthropic can quietly retreat from its own red lines, what will everyone else do when the stakes get even higher?
Edited by Sid Mahanta.