Failure detection and propagation in HPC systems

George Bosilca, Aurelien Bouteiller, Amina Guermouche, Thomas Hérault, Yves Robert, Pierre Sens, Jack J. Dongarra. Failure detection and propagation in HPC systems. In John West, Cherri M. Pancake, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, November 13-18, 2016. pages 27, ACM, 2016.

Possibly Related Publications

The following publications are possibly variants of this publication: