Countering Reward Over-Optimization in LLM with Demonstration-Guided Reinforcement Learning

Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, Olivier Pietquin. Countering Reward Over-Optimization in LLM with Demonstration-Guided Reinforcement Learning. In Lun-Wei Ku, Andre Martins, Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. Pages 12447-12472. Association for Computational Linguistics, 2024.

Authors

Mathieu Rita

Florian Strub

Rahma Chaabouni

Paul Michel

Emmanuel Dupoux

Olivier Pietquin
