Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training

Jeffrey Li, Joshua P. Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Leo Wright Hall, Oncel Tuzel, Percy Liang, Ludwig Schmidt, Hadi Pouransari, Fartash Faghri. Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training. In Vera Demberg, Kentaro Inui, Lluís Marquez, editors, Findings of the Association for Computational Linguistics: EACL 2026, Rabat, Morocco, March 24-29, 2026, pages 5836-5861. Association for Computational Linguistics, 2026.

@inproceedings{LiGKSSLSHTLSPF26,
  title = {Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training},
  author = {Jeffrey Li and Joshua P. Gardner and Doug Kang and Fangping Shi and Karanjeet Singh and Chun-Liang Li and Herumb Shandilya and David Leo Wright Hall and Oncel Tuzel and Percy Liang and Ludwig Schmidt and Hadi Pouransari and Fartash Faghri},
  year = {2026},
  url = {https://aclanthology.org/2026.findings-eacl.307/},
  researchr = {https://researchr.org/publication/LiGKSSLSHTLSPF26},
  cites = {0},
  citedby = {0},
  pages = {5836--5861},
  booktitle = {Findings of the Association for Computational Linguistics: EACL 2026, Rabat, Morocco, March 24-29, 2026},
  editor = {Vera Demberg and Kentaro Inui and Lluís Marquez},
  publisher = {Association for Computational Linguistics},
  isbn = {979-8-89176-386-9},
}