- FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference. Zaifeng Pan, Yitong Ding, Yue Guan 0003, Zheng Wang 0075, Zhongkai Yu, Xulong Tang, Yida Wang, Yufei Ding 0001. [doi]
- Interference-aware Edge Runtime Prediction with Conformal Matrix Completion. Tianshu Huang, Arjun Ramesh, Emily Ruppel, Nuno Pereira 0001, Anthony Rowe 0001, Carlee Joe-Wong. [doi]
- Venn: Resource Management For Collaborative Learning Jobs. Jiachen Liu, Fan Lai 0001, Eric Ding 0002, Yiwen Zhang 0008, Mosharaf Chowdhury. [doi]
- SwiftVI: Time-Efficient Planning and Learning with MDPs. Kasper Overgaard Mortensen, Konstantinos Skitsas, Emil Morre Christensen, Mohammad Sadegh Talebi, Andreas Pavlogiannis, Davide Mottin, Panagiotis Karras. [doi]
- AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution. Zhiqiang Xie, Hao Kang, Ying Sheng 0007, Tushar Krishna, Kayvon Fatahalian, Christos Kozyrakis. [doi]
- Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs. Zichao Yue, Chenhui Deng, Zhiru Zhang. [doi]
- LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers. Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan. [doi]
- LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions. Jianheng Ling, Pratik Worah, Yawen Wang, Yunchuan Kong, Chunlei Wang, Clifford Stein 0003, Diwakar Gupta, Jason Behmer, Logan A. Bush, Prakash Ramanan, Rajesh Kumar, Thomas Chestna, Yajing Liu, Ying Liu, Ye Zhao, Kathryn S. McKinley, Meeyoung Park, Martin Maas 0001. [doi]
- On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions. Maximilian Böther, Abraham Sebastian, Pranjal Awasthi, Ana Klimovic, Srikumar Ramalingam. [doi]
- ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation. Jiacheng Yang, Jun Wu, Zhen Zhang 0063, Xinwei Fu, Zhiying Xu, Zhen Jia 0001, Yida Wang 0003, Gennady Pekhimenko. [doi]
- VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution. Chendong Wang, Anlan Zhang, Yifan Yang 0004, Lili Qiu, Yuqing Yang 0001, Xinyang Jiang, Feng Qian 0001, Suman Banerjee 0001. [doi]
- Scaling Deep Learning Training with MPMD Pipeline Parallelism. Anxhelo Xhebraj, Sean Lee, Hanfeng Chen, Vinod Grover. [doi]
- Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer. Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Hari Subramoni, Dhabaleswar K. Panda 0001. [doi]
- A Bring-Your-Own-Model Approach for ML-Driven Storage Placement in Warehouse-Scale Computers. Chenxi Yang, Yan Li, Martin Maas 0001, Mustafa Uysal, Ubaid Ullah Hafeez, Arif Merchant, Richard McDougall. [doi]
- SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling. Ke Hong, Xiuhong Li, Lufang Chen, Qiuli Mao, Guohao Dai 0001, Xuefei Ning, Shengen Yan, Yun Liang 0001, Yu Wang 0002. [doi]
- SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations. Md Saidul Hoque Anik, Ariful Azad. [doi]
- Balancing Pipeline Parallelism with Vocabulary Parallelism. Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan. [doi]
- SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention. Qianchao Zhu, Jiangfei Duan, Chang Chen 0001, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang 0002. [doi]
- NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference. Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu. [doi]
- Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling. Xinyi Zhang, Hanyu Zhao, Wencong Xiao, Xianyan Jia, Fei Xu, Yong Li 0045, Wei Lin 0016, Fangming Liu. [doi]
- Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm. Baichuan Huang, Amir Aminifar. [doi]
- Self-Data Distillation for Recovering Quality in Pruned Large Language Models. Vithursan Thangarasa, Ganesh Venkatesh, Mike Lasby, Nish Sinnadurai, Sean Lie. [doi]
- FLStore: Efficient Federated Learning Storage for non-training workloads. Ahmad Faraz Khan 0001, Samuel Fountain, Ahmed M. Abdelmoniem, Ali Raza Butt, Ali Anwar 0001. [doi]
- Youmu: Efficient Columnar Data Pipeline for LLM Training. Tianle Zhong, Jiechen Zhao 0002, Qiang Su, Geoffrey Fox. [doi]
- PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training. Daiyaan Arfeen, Zhen Zhang, Xinwei Fu, Gregory R. Ganger, Yida Wang. [doi]
- Spa: Scaling Graph Neural Network Training on Large graphs via Probabilistic splitting. Sandeep Polisetty, Juelin Liu, Yi Fung 0001, Seung-Hwan Lim, Hui Guan 0001, Marco Serafini. [doi]
- Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training. Mingkai Zheng, Zhao Zhang 0007. [doi]
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. Zihao Ye 0001, Lequn Chen 0001, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen 0001, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze. [doi]
- The Hidden Bloat in Machine Learning Systems. Huaifeng Zhang, Ahmed Ali-Eldin. [doi]
- Marconi: Prefix Caching for the Era of Hybrid LLMs. Rui Pan 0003, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali. [doi]
- FlexAttention: A Programming Model for Generating Fused Attention Variants. Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, Horace He. [doi]
- APOLLO: SGD-like Memory, AdamW-level Performance. Hanqing Zhu, Zhenyu Zhang 0015, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee. [doi]
- TurboAttention: Efficient attention approximation for high throughputs llm. Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Rühle, Saravan Rajmohan. [doi]
- DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling. Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, Hui Guan 0001. [doi]
- ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments. Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui 0001, Ana Klimovic, Eiko Yoneki. [doi]
- Optimizing LLM Queries in Relational Data Analytics Workloads. Shu Liu, Asim Biswal, Amog Kamsetty, Audrey Cheng, Luis Gaspar Schroeder, Liana Patel, Shiyi Cao, Xiangxi Mo, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia. [doi]
- MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators. Beichen Huang, Yueming Yuan, Zelei Shao, Minjia Zhang. [doi]
- Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving. Wei Gao 0064, Xinyu Zhou, Peng Sun 0006, Tianwei Zhang 0004, Yonggang Wen 0001. [doi]
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. Yujun Lin 0001, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan 0001, Song Han 0001. [doi]
- Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking. Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul N. Whatmough. [doi]
- Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training. Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, Christina Delimitrou. [doi]
- COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng 0001, Li-Wen Chang, Quan Chen 0002, Xin Liu 0086. [doi]
- Photon: Federated LLM Pre-Training. Lorenzo Sani, Alex Iacob, Zeyu Cao, Royson Lee, Bill Marino, Yan Gao 0016, Wanru Zhao, Dongqi Cai 0001, Zexi Li 0001, Xinchi Qiu, Nicholas D. Lane. [doi]
- ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud. Lu Wang 0029, Mayukh Das, Fangkai Yang, Bo Qiao 0001, Hang Dong 0004, Si Qin, Victor Rühle, Chetan Bansal, Eli Cortez, Íñigo Goiri, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang 0001. [doi]
- Supply-Chain Attacks in Machine Learning Frameworks. Yue Gao 0011, Ilia Shumailov, Kassem Fawaz. [doi]
- Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework. Neel P. Bhatt, Yunhao Yang, Rohan Siva, Daniel Milan, Ufuk Topcu, Zhangyang Wang. [doi]
- XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ziyi Xu, Yilong Zhao, Ruihang Lai, Tianqi Chen 0001. [doi]
- Context Parallelism for Scalable Million-Token Inference. Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, JongSoo Park, Jianyu Huang. [doi]
- Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators. Geonhwa Jeong, Po-An Tsai, Abhimanyu Rajeshkumar Bambhaniya, Stephen W. Keckler, Tushar Krishna. [doi]
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives. Size Zheng 0001, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Xin Liu 0086. [doi]
- AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds. Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, Saravan Rajmohan. [doi]
- FedProphet: Memory-Efficient Federated Adversarial Training via Robust and Consistent Cascade Learning. Minxue Tang, Yitu Wang, Jingyang Zhang, Louis DiValentin, Aolin Ding, Amin Hass, Yiran Chen 0001, Hai Li 0001. [doi]
- FlexInfer: Flexible LLM Inference with CPU Computations. Seonjin Na, Geonhwa Jeong, Byung Hoon Ahn, Aaron Jezghani, Jeffrey Young 0001, Christopher J. Hughes, Tushar Krishna, Hyesoon Kim. [doi]
- MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs. Abhishek Moitra, Arkapravo Ghosh, Shrey Agrawal, Aporva Amarnath, Karthik Swaminathan, Priyadarshini Panda. [doi]
- AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine. Carlo Siebenschuh, Kyle Hippe, Ozan Gökdemir, Alexander Brace, Arham Mushtaq Khan, Khalid Hossain, Yadu N. Babuji, Nicholas Chia, Venkatram Vishwanath, Arvind Ramanathan, Rick L. Stevens, Ian T. Foster, Robert Underwood. [doi]
- ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation. Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, Yi Wu 0013. [doi]
- Seesaw: High-throughput LLM Inference via Model Re-sharding. Qidong Su, Wei Zhao 0046, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko. [doi]
- HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression. Yujin Wang, Shunan Dong, Zongle Huang, Yichen You, Liu He, Huazhong Yang, Yongpan Liu, Hongyang Jia. [doi]
- Mas-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-constrained Edge Devices. Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu. [doi]
- Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers. Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti 0001, Luca Benini, Massimo Poncino, Alessio Burrello. [doi]
- LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention. Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu 0004, Guangxuan Xiao, Jiaming Tang, Yujun Lin 0001, Zhijian Liu, Yao Lu 0006, Song Han 0003. [doi]