CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis, editors, Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020. pages 4003-4012, European Language Resources Association, 2020.

Authors

Guillaume Wenzek

Marie-Anne Lachaux

Alexis Conneau

Vishrav Chaudhary

Francisco Guzmán

Armand Joulin

Edouard Grave
