Behrad Moniri, Hamed Hassani, Edgar Dobriban. Evaluating the Performance of Large Language Models via Debates. In Luis Chiruzzo, Alan Ritter, Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 - May 4, 2025. Pages 2040-2075, Association for Computational Linguistics, 2025.