Distributed Training of Large Language Models on AWS Trainium

Xinwei Fu, Zhen Zhang, Haozheng Fan, Guangtai Huang, Mohammad El-Shabani, Randy Huang, Rahul Solanki, Fei Wu, Ron Diamant, Yida Wang. Distributed Training of Large Language Models on AWS Trainium. In Proceedings of the 2024 ACM Symposium on Cloud Computing, SoCC 2024, Redmond, WA, USA, November 20-22, 2024. Pages 961-976, ACM, 2024.

Authors

Xinwei Fu

Zhen Zhang

Haozheng Fan

Guangtai Huang

Mohammad El-Shabani

Randy Huang

Rahul Solanki

Fei Wu

Ron Diamant

Yida Wang
