Uncovering Safety Risks of Large Language Models through Concept Activation Vector

Zhihao Xu, Ruixuan Huang, Changyu Chen, Xiting Wang. Uncovering Safety Risks of Large Language Models through Concept Activation Vector. In Amir Globerson, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10-15, 2024. 2024.

Authors

Zhihao Xu

Ruixuan Huang

Changyu Chen

Xiting Wang
