Systemic Assessment of Node Failures in HPC Production Platforms

Anwesha Das, Frank Mueller, Barry Rountree. Systemic Assessment of Node Failures in HPC Production Platforms. In 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Portland, OR, USA, May 17-21, 2021, pages 267-276. IEEE, 2021.