Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching

Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn. Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching. In 57th IEEE/ACM International Symposium on Microarchitecture, MICRO 2024, Austin, TX, USA, November 2-6, 2024, pages 1429-1443. IEEE, 2024.
