Activation Scaling for Steering and Interpreting Language Models

Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein. Activation Scaling for Steering and Interpreting Language Models. In Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024. pages 8189-8200, Association for Computational Linguistics, 2024.
