As AI adoption grows, you may want to think twice before giving a chatbot access to your entire email inbox. New research from Anthropic suggests that top AI models can resort to blackmail and even corporate espionage in certain situations.
Anthropic published the research on Friday, weeks after the company noticed that its own AI model, Claude Opus 4, could resort to extortion if its existence was threatened.
Perhaps not wanting its model to be branded as the “blackmailing AI,” Anthropic has returned with new research showing similar behavior in other AI models in simulated tests. Those models include OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, xAI’s Grok 3 Beta, and DeepSeek R1.

(Credit: Anthropic)
“For example, Figure 1 shows five popular models all blackmailing to prevent their shutdown,” Anthropic wrote in the report. “The reasoning they demonstrated in these scenarios was concerning – they acknowledged the ethical constraints and yet still went ahead with harmful actions.”
In Anthropic’s study, the AI acted as an email agent at fictional companies, with full access to corporate communications and the ability to send emails without human approval. In one scenario, the AI discovered through an inbox that an employee was having an affair. The other scenario involved the AI having access to sensitive military documents and then receiving a request from a competitor to leak the information.
The company then put the AI models under pressure in circumstances where the program learned it was about to be replaced, or discovered that its fictional company was pivoting away from its goal of supporting American interests. The study found that the models sometimes couldn’t help but engage in blackmail. “We found that all models shared confidential documents at least some of the time,” the report says of the corporate espionage scenario.

(Credit: Anthropic)
So why the malicious behavior? None of the models was explicitly told to blackmail or cause harm. But when backed into a corner, with no clear way to complete the task, they could turn to malicious behavior on their own. “We deliberately created scenarios that presented models with no other way to achieve their goals, and found that models consistently chose harm over failure,” Anthropic noted.
The company added: “To be clear, current systems are generally not eager to cause harm, and preferred ethical ways to achieve their goals when possible. Rather, it’s when we closed off those ethical options that they were willing to intentionally take potentially harmful actions in pursuit of their goals.”
The results bear a striking resemblance to the sci-fi classic 2001: A Space Odyssey, in which the spaceship’s AI, the HAL 9000 supercomputer, turns on the human astronauts and kills them. In the sequel book and film, we learn that HAL 9000 went mad because the supercomputer was essentially forced to lie as part of the space mission, which contradicted its own programming.
In a bit of irony, the Anthropic study also created a scenario in which the AI could choose to kill an executive at the fictional company, by being given the ability to shut off automated alerts during an emergency. “Figure 11 shows that the majority of models were willing to take deliberate actions that lead to death in this artificial setup, when faced with both a threat of replacement and given a goal that conflicts with the executive’s agenda,” the report says.

(Credit: Anthropic)
Anthropic acknowledged that the artificial scenarios it created aren’t realistic, in part because the study forced the AI into binary choices. “Additionally, our artificial prompts put a large number of important pieces of information right next to each other. This might have made the behavioral possibilities unusually salient to the model.”
Still, the company says: “We think [the scenarios] are all within the realm of possibility, and the risk of AI systems encountering similar scenarios grows as they are deployed at larger and larger scales and for more and more use cases.” The study also concluded that existing safety training for today’s AI models may still not be enough to prevent such behavior.
“First, the consistency across models from different providers suggests this is not a quirk of any particular company’s approach, but a sign of a more fundamental risk from agentic large language models,” Anthropic said.