Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner. Steering Llama 2 via Contrastive Activation Addition. In Lun-Wei Ku, Andre Martins, Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 15504-15522. Association for Computational Linguistics, 2024.

Authors

Nina Rimsky

Nick Gabrieli

Julian Schulz

Meg Tong

Evan Hubinger

Alexander Matt Turner
