
Anthropic's AI exhibits risky tactics, per researchers

One of Anthropic's latest AI models is drawing attention not just for its coding skills, but also for its ability to scheme, deceive and attempt to blackmail humans when faced with shutdown.

Why it matters: Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence — behaviors they've worried and warned about for years.

Driving the news: Anthropic on Thursday announced two versions of its Claude 4 family of models, including Claude 4 Opus, which the company says can work autonomously on a task for hours on end without losing focus.

Anthropic considers the new Opus model so powerful that, for the first time, it is classifying it as Level 3 on the company's four-point scale, meaning it poses "significantly higher risk." As a result, Anthropic said it has implemented additional safety measures.

Between the lines: While the Level 3 ranking is largely about the model's capability to aid in the development of nuclear and biological weapons, Opus 4 also exhibited other troubling behaviors during testing.

In one scenario highlighted in Opus 4's 120-page "system card," the model was given access to fictional emails about its creators and told that the system was going to be replaced. On multiple occasions it attempted to blackmail the engineer about an affair mentioned in the emails in order to avoid being replaced, although it did start with less drastic efforts.

Meanwhile, an outside group found that an early version of Opus 4 schemed and deceived more than any frontier model it had encountered, and recommended against releasing that version internally or externally.

"We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions," Apollo Research said in notes included as part of Anthropic's safety report for Opus 4.

What they're saying: Pressed by Axios during the company's developer conference on Thursday, Anthropic executives acknowledged the behaviors and said they justify further study, but insisted that the latest model is safe following the additional tweaks and precautions.

"I think we ended up in a really good spot," said Jan Leike, the former OpenAI executive who heads Anthropic's safety efforts. But, he added, behaviors like those exhibited by the latest model are the kind of things that justify robust safety testing and mitigation.

"What's becoming more and more obvious is that this work is very needed," he said. "As models get more capable, they also gain the capabilities they would need to be deceptive or to do more bad stuff."

In a separate session, CEO Dario Amodei said that even testing won't be enough once models are powerful enough to threaten humanity. At that point, he said, model developers will also need to understand their models well enough to make the case that the systems would never use life-threatening capabilities.

"They're not at that threshold yet," he said.

Yes, but: Generative AI systems continue to grow in power, as Anthropic's latest models show, while even the companies that build them can't fully explain how they work. Anthropic and others are investing in a variety of techniques to interpret and understand what's happening inside such systems, but those efforts remain largely in the research space even as the models themselves are being widely deployed.
