Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem

Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot. Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 1460-1473. Association for Computational Linguistics, 2025.
