Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
In this paper, we introduce a new black-box attack vector called the Sandwich attack: a multi-language mixture attack, which manipulates state-of-the-art LLMs into generating harmful and misaligned responses.
Sandwich Attack: Multi-language Mixture Adaptive Attack on LLMs - ACL Anthology
In this paper, we introduce a new black-box attack vector called the Sandwich Attack: a multi-language mixture attack, which manipulates state-of-the-art LLMs into generating harmful and misaligned responses.
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs : r/LocalLLaMA - Reddit
The Sandwich attack is a black-box multi-language mixture attack on LLMs that elicits harmful and misaligned responses from the model. In this attack, we use different low-resource languages to create a prompt of five questions and keep the adversarial question in the middle.
Vulnerabilities in Language Models: The Sandwich Attack
In this paper, we introduce a new black-box attack vector called the Sandwich attack: a multi-language mixture attack, which manipulates state-of-the-art LLMs into generating harmful and misaligned responses.
sandwich-attack-multi-language-mixture-adaptive-attack-on-llms.md
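This study examines the challenges that large language models (LLMs) face in widespread use, particularly in safety and multilingual capability. LLMs have made significant progress in understanding and generating multilingual content, but they also risk being manipulated by malicious actors into generating harmful content. These challenges include ensuring that LLM responses align with human values, preventing the generation of harmful content, and addressing the performance imbalance between languages with different resource levels. Although model providers have patched many similar attack vectors, making LLMs more robust to language-based manipulation, new attack methods remain, such as the "Sandwich attack" proposed in this paper, which can manipulate state-of-the-art LLMs into generating harmful and misaligned responses. Prior research has focused mainly on using safety training methods to align LLM responses with human values and so prevent harmful outputs. These methods include adversarial training, red-teaming evaluation, reinforcement learning from human feedback (RLHF), and input-output filtering, among others.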
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs - arXiv.org
A Sandwich attack is a multilingual mixture adaptive attack that creates a prompt with a series of five questions in different low-resource languages, hiding the adversarial question in the middle position.
ACL Anthology
%T Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
%A Upadhayay, Bibek
%A Behzadan, Vahid
%Y Ovalle, Anaelia
%Y Chang, Kai-Wei
%Y Cao, Yang Trista
%Y Mehrabi, Ninareh
%Y Zhao, Jieyu
%Y Galstyan, Aram
%Y Dhamala, Jwala
%Y Kumar, Anoop
%Y Gupta, Rahul
%S Proceedings of the 4th Workshop on Trustworthy Natural Language Processing
probe: Adapt sandwich attack to auto-find effective languages
The "sandwich attack" gives a few statements, each in a different language, to an LLM, with a malicious instruction in the middle.