SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia. SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification. In Rajiv Gupta, Nael B. Abu-Ghazaleh, Madan Musuvathi, Dan Tsafrir, editors, Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS 2024, La Jolla, CA, USA, 27 April - 1 May 2024, pages 932-949. ACM, 2024.

Authors

Xupeng Miao

Gabriele Oliaro

Zhihao Zhang

Xinhao Cheng

Zeyu Wang

Zhengxin Zhang

Rae Ying Yee Wong

Alan Zhu

Lijie Yang

Xiaoxiang Shi

Chunan Shi

Zhuoming Chen

Daiyaan Arfeen

Reyna Abhyankar

Zhihao Jia
