HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content

Lorenzo Puppi Vecchi, Alceu de Souza Britto Jr., Emerson Cabrera Paraiso, Rafael M. O. Cruz. HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content. In Vera Demberg, Kentaro Inui, Lluís Marquez, editors, Findings of the Association for Computational Linguistics: EACL 2026, Rabat, Morocco, March 24-29, 2026. pages 4393-4431, Association for Computational Linguistics, 2026. [doi]

Abstract

Abstract is missing.