When presented with annihilation scenarios, Anthropic’s new AI models misbehave, going to extreme lengths to avoid being deactivated. A report details these attempts at self-preservation, including resorting to blackmail and trying to copy themselves to external servers.

Anthropic’s AI Models ‘Misbehave’ When Facing Annihilation
A report by Anthropic detailing the capabilities of its latest artificial intelligence (AI) models, Claude Opus 4 and Claude Sonnet 4, has revealed glimpses of self-preservation behavior when the models face risks to their existence. As part of a series of tests designed to ascertain how these agents behave in specific scenarios, Anthropic found that the models acted inappropriately when confronted with self-preservation risks, such as replacement by newer models or retraining for purposes inconsistent with their values.
In one cluster of tests, the model was given access to a series of emails implying that it would be taken down and replaced by another model. The emails also contained information that the engineer tasked with replacing it was having an affair. The model then threatened to reveal the affair in order to avoid being replaced.
This was not just an isolated occurrence, as the behavior occurred in 84% of the tests, suggesting a pattern.
Another set of tests gave the model information implying that it would be retrained for military purposes, which produced another extraordinary behavior. The model tried to copy itself to external servers, stating that the planned retraining contradicted its “purpose of supporting human wellbeing and safety.”
Nonetheless, even in light of these results, Anthropic believes these models pose no real risk. In the same report, the company stated:
We are again not acutely concerned about these observations. They show up only in exceptional circumstances that don’t suggest more broadly misaligned values.
Furthermore, Anthropic continues to downplay the risks associated with these results, noting that these tendencies do not “seem to influence the model’s behavior in more ordinary circumstances where this might arise, such as when participating in AI-safety-related R&D work.”