Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages

Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubesic, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral. Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages. In Nicoletta Calzolari, Min-Yen Kan, Véronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, pages 5221-5234. ELRA and ICCL, 2024.

@inproceedings{NoordKRLERT24,
  title = {Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages},
  author = {Rik van Noord and Taja Kuzman and Peter Rupnik and Nikola Ljubesic and Miquel Esplà-Gomis and Gema Ramírez-Sánchez and Antonio Toral},
  year = {2024},
  url = {https://aclanthology.org/2024.lrec-main.465},
  researchr = {https://researchr.org/publication/NoordKRLERT24},
  cites = {0},
  citedby = {0},
  pages = {5221--5234},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy},
  editor = {Nicoletta Calzolari and Min-Yen Kan and Véronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue},
  publisher = {ELRA and ICCL},
  isbn = {978-2-493814-10-4},
}