- Item Response Theory to Evaluate Speech Synthesis: Beyond Synthetic Speech Difficulty. Chaina Oliveira, Ricardo B. C. Prudêncio. [doi]
- Robustness Testing of Machine Learning Families using Instance-Level IRT-Difficulty. Raül Fabra-Boluda, Cèsar Ferri, Fernando Martínez-Plumed, María José Ramírez-Quintana. [doi]
- Evaluating Object Permanence in Embodied Agents using the Animal-AI Environment. Konstantinos Voudouris, Niall Donnelly, Danaja Rutar, Ryan Burnell, John Burden, José Hernández-Orallo, Lucy Cheke. [doi]
- FERM: A FEature-space Representation Measure for Improved Model Evaluation. Yeu-Shin Fu, Wenbo Ge, Jo Plested. [doi]
- On Young Children's Exploration, Aha! Moments and Explanations in Model Building for Self-Regulated Problem-Solving. Vicky Charisi, Natalia Díaz Rodríguez, Barbara Mawhin, Luis Merino. [doi]
- Evaluating Sports Analytics Models: Challenges, Approaches, and Lessons Learned. Jesse Davis, Lotte Bransen, Laurens Devos, Wannes Meert, Pieter Robberechts, Jan Van Haaren, Maaike Van Roy. [doi]
- The Relevance of Non-Human Errors in Machine Learning. Ricardo Baeza-Yates, Marina Estévez-Almenzar. [doi]
- Reject Before You Run: Small Assessors Anticipate Big Language Models. Lexin Zhou, Fernando Martínez-Plumed, José Hernández-Orallo, Cèsar Ferri, Wout Schellaert. [doi]
- Evaluating Understanding on Conceptual Abstraction Benchmarks. Victor Vikram Odouard, Melanie Mitchell. [doi]
- A Framework for Categorising AI Evaluation Instruments. Anthony G. Cohn, José Hernández-Orallo, Julius Sechang Mboli, Yael Moros-Daval, Zhiliang Xiang, Lexin Zhou. [doi]