Failure detection and propagation in HPC systems

George Bosilca, Aurelien Bouteiller, Amina Guermouche, Thomas Hérault, Yves Robert, Pierre Sens, Jack J. Dongarra. Failure detection and propagation in HPC systems. In John West, Cherri M. Pancake, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, November 13-18, 2016. pages 27, ACM, 2016.

Authors

George Bosilca

Aurelien Bouteiller

Amina Guermouche

Thomas Hérault

Yves Robert

Pierre Sens

Jack J. Dongarra
