CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. In Nicoletta Calzolari, Min-Yen Kan, Véronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, pages 4226-4237. ELRA and ICCL, 2024.

Authors

Thuat Nguyen

Chien Van Nguyen

Viet Dac Lai

Hieu Man

Nghia Trung Ngo

Franck Dernoncourt

Ryan A. Rossi

Thien Huu Nguyen
