Seyed Mahed Mousavi, Edoardo Cecchinato, Lucia Hornikova, Giuseppe Riccardi. Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It. In Vera Demberg, Kentaro Inui, Lluís Marquez, editors, Findings of the Association for Computational Linguistics: EACL 2026, Rabat, Morocco, March 24-29, 2026. pages 1747-1759, Association for Computational Linguistics, 2026.