Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training

Jeffrey Li, Joshua P. Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Leo Wright Hall, Oncel Tuzel, Percy Liang, Ludwig Schmidt, Hadi Pouransari, Fartash Faghri. Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training. In Vera Demberg, Kentaro Inui, Lluís Marquez, editors, Findings of the Association for Computational Linguistics: EACL 2026, Rabat, Morocco, March 24-29, 2026, pages 5836-5861. Association for Computational Linguistics, 2026.
