Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks

Abhinav Rao, Atharva Naik, Sachin Vashistha, Somak Aditya, Monojit Choudhury. Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks. In Nicoletta Calzolari, Min-Yen Kan, Véronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, pages 16802-16830. ELRA and ICCL, 2024.
