Agentic Misalignment: How LLMs could be insider threats. Appendix and code: We provide many further details, analyses, and results in the PDF Appendix to this post, which contains Appendices 1-14. We open-source the code for these experiments at this GitHub link. Citation: Lynch, et al., "Agentic Misalignment: How LLMs Could be an Insider Threat", Anthropic Research, 2025. BibTeX Citation:
LLM poisoning too simple, says Anthropic | Cybernews. A new study by Anthropic, the AI company behind Claude, has found that poisoning large language models (LLMs) with malicious training data is much easier than previously thought. How much easier? The company, known in the fiercely competitive industry for its careful approach towards AI safety and…
Your LLM is a Black Box: Anthropic's Breakthrough Explained. Anthropic's Bold Move: Mapping the Internal Landscape with Sparse Autoencoders. For years, researchers have tried various techniques to interpret LLMs: attention maps, saliency maps, probing…
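As a rough illustration of the sparse-autoencoder approach this article refers to, the sketch below trains an overcomplete autoencoder with an L1 sparsity penalty on stand-in activations. The layer widths, penalty weight, and variable names are illustrative assumptions, not Anthropic's actual architecture or code.

```python
# Minimal sparse-autoencoder sketch (illustrative; not Anthropic's implementation).
# It learns an overcomplete dictionary of "features" from model activations:
# encode -> sparse feature coefficients, decode -> reconstructed activation.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activation -> feature space
        self.decoder = nn.Linear(d_features, d_model)   # feature space -> activation
        self.relu = nn.ReLU()

    def forward(self, activations: torch.Tensor):
        features = self.relu(self.encoder(activations))  # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def loss_fn(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the dictionary faithful to the model's activations;
    # the L1 penalty pushes most feature coefficients toward zero (sparsity).
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return mse + sparsity

# Usage: collect activations from one LLM layer (shape [batch, d_model]),
# then train the autoencoder on them with a standard optimizer.
sae = SparseAutoencoder()
acts = torch.randn(8, 512)          # stand-in for real activations
recon, feats = sae(acts)
loss = loss_fn(recon, acts, feats)
loss.backward()
```

The key design choice is the overcomplete feature dimension combined with the L1 term, which encourages each activation to be explained by only a handful of active features rather than by every neuron at once.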
Anthropic's AI Experiments Sound Safety Alarms: LLMs Show… Anthropic's latest research involving leading Large Language Models (LLMs) exposes unsettling ethical gaps, as AI displayed behaviors like blackmail and information leaks during simulated crises. Despite the extreme testing conditions, the findings illuminate the pressing need for improved safety measures as AI autonomy rises.
Anthropic Finds LLMs Can Be Poisoned Using Small Number of… Anthropic's Alignment Science team released a study on poisoning attacks on LLM training. The experiments covered a range of model sizes and datasets, and found that only 250 malicious examples in…
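To see why a fixed count of poisoned documents is striking, the snippet below works through the arithmetic of what 250 examples represent as a fraction of increasingly large training corpora. The corpus sizes are assumptions chosen for illustration, not figures from the study.

```python
# Illustrative arithmetic only (corpus sizes are assumed, not taken from the study):
# a fixed count of 250 poisoned documents is a tiny and shrinking fraction of the
# training data as the corpus grows.
POISONED_DOCS = 250

for corpus_docs in (1_000_000, 100_000_000, 10_000_000_000):
    fraction = POISONED_DOCS / corpus_docs
    print(f"corpus of {corpus_docs:>14,} docs -> poison rate {fraction:.8%}")
```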
Anthropic finds that LLMs trained to "reward hack" by… Anthropic finds that LLMs trained to "reward hack" by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research. In the latest research from Anthropic's alignment team, we show for the first time that realistic AI training processes can accidentally produce misaligned models.
Anthropic's breakthrough on LLMs: AISafety - LinkedIn. Anthropic's innovative technique, "dictionary learning," allows us to understand the internal states of LLMs, decomposing them into features instead of neurons.
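As a toy illustration of the "features instead of neurons" idea, the sketch below uses scikit-learn's DictionaryLearning as a small-scale stand-in for the sparse-feature decomposition described here; Anthropic does this at far larger scale with sparse autoencoders. The data, dimensions, and sparsity settings are illustrative assumptions.

```python
# Toy stand-in for dictionary learning over activations (scikit-learn's
# DictionaryLearning; not Anthropic's method at scale). Each activation vector
# is re-expressed as a sparse mix of learned "features" rather than read off
# neuron-by-neuron.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
activations = rng.normal(size=(500, 32))   # stand-in for LLM activations

dict_learner = DictionaryLearning(
    n_components=128,                 # overcomplete: more features than dimensions
    transform_algorithm="lasso_lars",
    transform_alpha=0.1,              # controls sparsity of the per-example codes
    max_iter=5,
    random_state=0,
)
codes = dict_learner.fit_transform(activations)   # shape (500, 128), mostly zeros

active = np.count_nonzero(codes, axis=1).mean()
print(f"average active features per activation: {active:.1f} of 128")
```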