Confronting Reward Model Overoptimization with Constrained RLHF

Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, Stephen Marcus McAleer. Confronting Reward Model Overoptimization with Constrained RLHF. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.

Authors

Ted Moskovitz

Aaditya K. Singh

DJ Strouse

Tuomas Sandholm

Ruslan Salakhutdinov

Anca D. Dragan

Stephen Marcus McAleer
