Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, Devesh Tiwari. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. In 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018, Luxembourg City, Luxembourg, June 25-28, 2018, pages 95-106. IEEE Computer Society, 2018.

Authors

Bin Nie

Ji Xue

Saurabh Gupta

Tirthak Patel

Christian Engelmann

Evgenia Smirni

Devesh Tiwari
