Time Machine: Generative Real-Time Model for Failure (and Lead Time) Prediction in HPC Systems

Khalid Ayedh Alharthi, Arshad Jhumka, Sheng Di, Lin Gui, Franck Cappello, Simon McIntosh-Smith. Time Machine: Generative Real-Time Model for Failure (and Lead Time) Prediction in HPC Systems. In 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2023, Porto, Portugal, June 27-30, 2023. Pages 508-521, IEEE, 2023.

Abstract

Abstract is missing.