Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training

Jeffrey Li, Joshua P. Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Leo Wright Hall, Oncel Tuzel, Percy Liang, Ludwig Schmidt, Hadi Pouransari, Fartash Faghri. Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training. In Vera Demberg, Kentaro Inui, Lluís Marquez, editors, Findings of the Association for Computational Linguistics: EACL 2026, Rabat, Morocco, March 24-29, 2026, pages 5836-5861. Association for Computational Linguistics, 2026.

Authors

Jeffrey Li

Joshua P. Gardner

Doug Kang

Fangping Shi

Karanjeet Singh

Chun-Liang Li

Herumb Shandilya

David Leo Wright Hall

Oncel Tuzel

Percy Liang

Ludwig Schmidt

Hadi Pouransari

Fartash Faghri
