Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria. Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations. In Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 21759-21776. Association for Computational Linguistics, 2024.

Abstract

Abstract is missing.