Is one run enough? Reproducibility of flagship large language models across temperature and reasoning settings in biomedical text processing

Paul Windisch, Carole Koechli, Fabio Dennstädt, Daniel M. Aebersold, Daniel R. Zwahlen, Robert Förster, Christina Schröder. Is one run enough? Reproducibility of flagship large language models across temperature and reasoning settings in biomedical text processing. JAMIA, 33(6):1179–1184, 2026.

Abstract

Abstract is missing.