The following publications are possible variants of this publication:
- Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization. Juntao Dai, Taiye Chen, Yaodong Yang 0001, Qian Zheng, Gang Pan 0001. ICLR 2025.
- RRM: Robust Reward Model Training Mitigates Reward Hacking. Tianqi Liu 0002, Wei Xiong 0015, Jie Ren 0006, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin 0001, Tianhe Yu, Daniel Sohn, Anastasia Makarova, Jeremiah Zhe Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, Mohammad Saleh. ICLR 2025.
- RLHF Workflow: From Reward Modeling to Online RLHF. Hanze Dong, Wei Xiong 0015, Bo Pang 0004, Haoxiang Wang 0003, Han Zhao 0002, Yingbo Zhou, Nan Jiang 0008, Doyen Sahoo, Caiming Xiong, Tong Zhang 0001. TMLR, 2024.
- Mitigating the Alignment Tax of RLHF. Yong Lin, Hangyu Lin, Wei Xiong 0015, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang 0003, Wenbin Hu 0002, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao 0002, Nan Jiang 0008, Heng Ji, Yuan Yao, Tong Zhang 0001. EMNLP 2024: 580-606.
- How to Evaluate Reward Models for RLHF. Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica. ICLR 2025.
- Taming Overconfidence in LLMs: Reward Calibration in RLHF. Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang 0001. ICLR 2025.
- The Trickle-down Impact of Reward Inconsistency on RLHF. Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel Khashabi, Dong Yu 0001. ICLR 2024.