ODIN: Disentangled Reward Mitigates Hacking in RLHF

Lichang Chen, Chen Zhu 0001, Jiuhai Chen, Davit Soselia, Tianyi Zhou 0001, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro. ODIN: Disentangled Reward Mitigates Hacking in RLHF. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.

Authors

Lichang Chen

Chen Zhu 0001

Jiuhai Chen

Davit Soselia

Tianyi Zhou 0001

Tom Goldstein

Heng Huang

Mohammad Shoeybi

Bryan Catanzaro
