Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, Devesh Tiwari. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. In 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018, Luxembourg City, Luxembourg, June 25-28, 2018. pages 95-106, IEEE Computer Society, 2018.
