Agentic Misalignment: How LLMs could be insider threats We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm In the scenarios, we allowed models to autonomously send emails and access sensitive information
LLM poisoning too simple, says Anthropic | Cybernews A new study by Anthropic, the AI company behind Claude, has found that poisoning large language models (LLMs) with malicious training is much easier than previously thought
Anthropic Finds LLMs Can Be Poisoned Using Small Number of . . . Anthropic's Alignment Science team released a study on poisoning attacks on LLM training The experiments covered a range of model sizes and datasets, and found that only 250 malicious examples in
Anthropics AI Experiments Sound Safety Alarms: LLMs Show . . . The research conducted by Anthropic sheds light on some of the core issues surrounding AI safety Their experiments uncovered troubling behaviors like blackmail, leaking sensitive information, and suppressing safety-related notifications in leading large language models (LLMs) under simulated crisis scenarios
Your LLM is a Black Box: Anthropic’s Breakthrough Explained While many of us are still scratching our heads, trying to peek through the tiny cracks, Anthropic has been investing heavily in LLM interpretability — the quest to understand the internal
Sleeper Agents: Training Deceptive LLMs that . . . - Anthropic To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs) For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024