Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages

Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral. Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages. In Nicoletta Calzolari, Min-Yen Kan, Véronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy. pages 5221-5234, ELRA and ICCL, 2024.

Authors

Rik van Noord

Taja Kuzman

Peter Rupnik

Nikola Ljubešić

Miquel Esplà-Gomis

Gema Ramírez-Sánchez

Antonio Toral
