Fine-grained Automated Failure Management for Extreme-Scale GPU Accelerated Systems

Yonatan Levitt, Richard Barella, Sam Zeltner, Thomas Musta, Lance Cheney, Gustavo Espinosa, Olivier Franza, Balazs Gerofi. Fine-grained Automated Failure Management for Extreme-Scale GPU Accelerated Systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2025, St. Louis, MO, USA, November 16-21, 2025. pages 1073-1084, ACM, 2025.
