Evaluating the Performance of Large Language Models via Debates

Behrad Moniri, Hamed Hassani, Edgar Dobriban. Evaluating the Performance of Large Language Models via Debates. In Luis Chiruzzo, Alan Ritter, Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, pages 2040-2075. Association for Computational Linguistics, 2025.
