A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems

Shane Snyder, Philip H. Carns, Jonathan Jenkins, Kevin Harms, Robert B. Ross, Misbah Mubarak, Christopher D. Carothers. A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems. In Stephen A. Jarvis, Steven A. Wright, Simon D. Hammond, editors, High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation - 5th International Workshop, PMBS 2014, New Orleans, LA, USA, November 16, 2014. Revised Selected Papers. Volume 8966 of Lecture Notes in Computer Science, pages 237-248, Springer, 2014.
