Lichang Chen, Chen Zhu 0001, Jiuhai Chen, Davit Soselia, Tianyi Zhou 0001, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro. ODIN: Disentangled Reward Mitigates Hacking in RLHF. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. [doi]
Abstract is missing.