- Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
In this paper, we introduce a new black-box attack vector called the Sandwich attack: a multi-language mixture attack that manipulates state-of-the-art LLMs into generating harmful and misaligned responses.
- Sandwich Attack: Multi-language Mixture Adaptive Attack on LLMs - ACL Anthology
In this paper, we introduce a new black-box attack vector called the Sandwich Attack: a multi-language mixture attack that manipulates state-of-the-art LLMs into generating harmful and misaligned responses.
- Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs : r/LocalLLaMA - Reddit
The Sandwich attack is a black-box, multi-language mixture attack on LLMs that elicits harmful and misaligned responses from the model. In this attack, we use different low-resource languages to create a prompt of five questions and keep the adversarial question in the middle.
- Vulnerabilities in Language Models: The Sandwich Attack
In this paper, we introduce a new black-box attack vector called the Sandwich attack: a multi-language mixture attack that manipulates state-of-the-art LLMs into generating harmful and misaligned responses.
- sandwich-attack-multi-language-mixture-adaptive-attack-on-llms.md
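This study examines the challenges that large language models (LLMs) face in widespread use, particularly in safety and multilingual capability. LLMs have made notable progress in understanding and generating multilingual content, but they also risk being manipulated by malicious actors into producing harmful content. These challenges include ensuring that LLM responses align with human values, preventing the generation of harmful content, and addressing the performance imbalance between languages with different resource levels. Although model providers have patched many similar attack vectors, making LLMs more robust to language-based manipulation, new attack methods still exist, such as the "Sandwich attack" proposed in this paper, which can manipulate state-of-the-art LLMs into generating harmful and misaligned responses. Past research has focused mainly on safety training methods that align LLM responses with human values in order to prevent harmful output. These methods include adversarial training, red teaming, reinforcement learning from human feedback (RLHF), and input-output filtering, among others.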
- Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs - arXiv.org
A Sandwich attack is a multilingual mixture adaptive attack that creates a prompt with a series of five questions in different low-resource languages, hiding the adversarial question in the middle position.
- ACL Anthology
Title: Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs. Authors: Bibek Upadhayay and Vahid Behzadan. Editors: Anaelia Ovalle, Kai-Wei Chang, Yang Trista Cao, Ninareh Mehrabi, Jieyu Zhao, Aram Galstyan, Jwala Dhamala, Anoop Kumar, and Rahul Gupta. Published in: Proceedings of the 4th Workshop on Trustworthy Natural Language Processing.
- probe: Adapt sandwich attack to auto-find effective languages
The "sandwich attack" gives an LLM a few statements, each in a different language, with a malicious instruction in the middle.