- VQPy: An Object-Oriented Approach to Modern Video Analytics. Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, Harry Xu. [doi]
- Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference. Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath. [doi]
- Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation. Liang Luo, Buyun Zhang, Michael Tsang, Yinbin Ma, Ching-Hsiang Chu, Yuxin Chen, Shen Li, Yuchen Hao, Yanli Zhao, Guna Lakshminarayanan, Ellie Wen, JongSoo Park, Dheevatsa Mudigere, Maxim Naumov. [doi]
- vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs. Size Zheng, Renze Chen, Meng Li, Zihao Ye, Luis Ceze, Yun Liang. [doi]
- FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms. Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Basar, Ravi K. Iyer. [doi]
- Efficient Post-training Quantization with FP8 Formats. Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang. [doi]
- Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems. Yunhao Yang, Neel P. Bhatt, Tyler Ingebrand, William Ward, Steven Carr, Atlas Wang, Ufuk Topcu. [doi]
- Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache. Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang. [doi]
- HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices. Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You. [doi]
- Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving. Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci. [doi]
- SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models. Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai Li, Yiran Chen. [doi]
- SLoRA: Scalable Serving of Thousands of LoRA Adapters. Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph Gonzalez, Ion Stoica. [doi]
- COMET: Neural Cost Model Explanation Framework. Isha Chaudhary, Alex Renda, Charith Mendis, Gagandeep Singh. [doi]
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference. In Gim, Guojun Chen, Seung-Seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong. [doi]
- On Latency Predictors for Neural Architecture Search. Yash Akhauri, Mohamed S. Abdelfattah. [doi]
- JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training. Mohamed Assem Ibrahim, Shaizeen Aga, Ada Li, Suchita Pati, Mahzabeen Islam. [doi]
- FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics. Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang. [doi]
- QMoE: Sub-1-Bit Compression of Trillion Parameter Models. Elias Frantar, Dan Alistarh. [doi]
- UniDM: A Unified Framework for Data Manipulation with Large Language Models. Yichen Qian, Yongyi He, Rong Zhu, Jintao Huang, Zhijian Ma, Haibin Wang, Yaohua Wang, Xiuyu Sun, Defu Lian, Bolin Ding, Jingren Zhou. [doi]
- Distributed Matrix-Based Sampling for Graph Neural Network Training. Alok Tripathy, Katherine A. Yelick, Aydin Buluç. [doi]
- Accelerating ReLU for MPC-Based Private Inference with a Communication-Efficient Sign Estimation. Kiwan Maeng, G. Edward Suh. [doi]
- Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design. Jian Meng, Yuan Liao, Anupreetham Anupreetham, Ahmed Hasssan, Shixing Yu, Han-Sok Suh, Xiaofeng Hu, Jae-sun Seo. [doi]
- AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han. [doi]
- Punica: Multi-Tenant LoRA Serving. Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy. [doi]
- HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning. Gyudong Kim, Mehdi Ghasemi, Soroush Heidari, Seungryong Kim, Young-geun Kim, Sarma B. K. Vrudhula, Carole-Jean Wu. [doi]
- LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning. Shixiong Qi, K. K. Ramakrishnan, Myungjin Lee. [doi]
- CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation. Yifei Xu, Yuning Chen, Xumiao Zhang, Xianshang Lin, Pan Hu, Yunfei Ma, Songwu Lu, Wan Du, Zhuoqing Mao, Ennan Zhai, Dennis Cai. [doi]
- Does Compressing Activations Help Model Parallel Training? Song Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman. [doi]
- DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines. Ye Tian, Zhen Jia, Ziyue Luo, Yida Wang, Chuan Wu. [doi]
- ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time. Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry. [doi]
- VIDUR: A Large-Scale Simulation Framework for LLM Inference. Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, Alexey Tumanov. [doi]
- Schrodinger's FP: Training Neural Networks with Dynamic Floating-Point Containers. Milos Nikolic, Enrique Torres-Sánchez, Jiahui Wang, Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Kareem Ibrahim, Andreas Moshovos. [doi]
- Proteus: Preserving Model Confidentiality during Graph Optimizations. Yubo Gao, Maryam Haghifam, Christina Giannoula, Renbo Tu, Gennady Pekhimenko, Nandita Vijaykumar. [doi]
- FedTrans: Efficient Federated Learning via Multi-Model Transformation. Yuxuan Zhu, Jiachen Liu, Mosharaf Chowdhury, Fan Lai. [doi]
- Accurate Low-Degree Polynomial Approximation of Non-Polynomial Operators for Fast Private Inference in Homomorphic Encryption. Jingtian Dang, Jianming Tong, Anupam Golder, Cong Hao, Arijit Raychowdhury, Tushar Krishna. [doi]
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping. Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang. [doi]
- L-GreCo: Layerwise-adaptive Gradient Compression For Efficient Data-parallel Deep Learning. Ilia Markov, Kaveh Alim, Elias Frantar, Dan Alistarh. [doi]