The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. 2023.

@inproceedings{PenedoMHCACPAL23,
  title = {The {RefinedWeb} Dataset for {Falcon} {LLM}: Outperforming Curated Corpora with Web Data Only},
  author = {Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Hamza Alobeidli and Alessandro Cappelli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  year = {2023},
  url = {http://papers.nips.cc/paper_files/paper/2023/hash/fa3ed726cc5073b9c31e3e49a807789c-Abstract-Datasets_and_Benchmarks.html},
  researchr = {https://researchr.org/publication/PenedoMHCACPAL23},
  cites = {0},
  citedby = {0},
  booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
  editor = {Alice Oh and Tristan Naumann and Amir Globerson and Kate Saenko and Moritz Hardt and Sergey Levine},
}