- The Importance of Generalizability in Machine Learning for Systems. Varun Gohil, Sundar Dev, Gaurang Upasani, David Lo 0003, Parthasarathy Ranganathan, Christina Delimitrou. 1
- Veritas - Demystifying Silent Data Corruptions: μArch-Level Modeling and Fleet Data of Modern x86 CPUs. Odysseas Chatzopoulos, Nikos Karystinos, George Papadimitriou 0001, Dimitris Gizopoulos, Harish Dattatraya Dixit, Sriram Sankar. 1-14
- ChameleonEC: Exploiting Tunability of Erasure Coding for Low-Interference Repair. Yuhui Cai, Shiyao Lin, Zhirong Shen, Jiahui Yang, Jiwu Shu. 15-28
- DPUaudit: DPU-assisted Pull-based Architecture for Near-Zero Cost System Auditing. Peng Jiang 0007, Hanlin Jiang, Ruizhe Huang, Hanwen Lei, Zhineng Zhong, Shaokun Zhang, Yuxin Ren 0001, Ning Jia, Xinwei Hu, Yao Guo 0001, Xiangqun Chen, Ding Li 0001. 29-43
- Delinquent Loop Pre-execution Using Predicated Helper Threads. Anirudh Seshadri, Eric Rotenberg. 44-58
- Mascot: Predicting Memory Dependencies and Opportunities for Speculative Memory Bypassing. Karl H. Mose, Sebastian S. Kim, Alberto Ros 0001, Timothy M. Jones 0001, Robert D. Mullins. 59-71
- Architecting Value Prediction around In-Order Execution. Pierre Ravenel, Arthur Perais, Benoît Dupont de Dinechin, Frédéric Pétrot. 72-84
- Efficient Optimization with Encoded Ising Models. Devrath Iyer, Sara Achour. 85-98
- SPARK: Sparsity Aware, Low Area, Energy-Efficient, Near-memory Architecture for Accelerating Linear Programming Problems. Siddhartha Raman Sundara Raman, Lizy Kurian John, Jaydeep P. Kulkarni. 99-112
- LegoZK: A Dynamically Reconfigurable Accelerator for Zero-Knowledge Proof. Zhengbang Yang, Lutan Zhao, Peinan Li, Han Liu, Kai Li, Boyan Zhao, Dan Meng, Rui Hou 0001. 113-126
- Reuse-Aware Compilation for Zoned Quantum Architectures Based on Neutral Atoms. Wan-Hsuan Lin, Daniel Bochen Tan, Jason Cong. 127-142
- HATT: Hamiltonian Adaptive Ternary Tree for Optimizing Fermion-to-Qubit Mapping. Yuhao Liu, Kevin Yao, Jonathan Hong, Julien Froustey, Ermal Rrapaj, Costin Iancu, Gushu Li, Yunong Shi. 143-157
- QuCLEAR: Clifford Extraction and Absorption for Quantum Circuit Optimization. Ji Liu 0007, Alvin Gonzales, Benchen Huang, Zain Hamid Saleem, Paul D. Hovland. 158-172
- Gaze into the Pattern: Characterizing Spatial Patterns with Internal Temporal Correlations for Hardware Prefetching. Zixiao Chen, Chentao Wu, Yunfei Gu, Ranhao Jia, Jie Li 0002, Minyi Guo. 173-187
- To Cross, or Not to Cross Pages for Prefetching? Georgios Vavouliotis, Martí Torrents, Boris Grot, Kleovoulos Kalaitzidis, Leeor Peled, Marc Casas. 188-203
- Integrating Prefetcher Selection with Dynamic Request Allocation Improves Prefetching Efficiency. Mengming Li, Qijun Zhang, Yongqing Ren, Zhiyao Xie. 204-216
- VR-Pipe: Streamlining Hardware Graphics Pipeline for Volume Rendering. Junseo Lee, Jaisung Kim, Junyong Park, Jaewoong Sim. 217-230
- IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline. Raúl Taranco, José María Arnau, Antonio González 0001. 231-245
- Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers. Chaojian Li, Sixu Li, Linrui Jiang, Jingqun Zhang, Yingyan Celine Lin. 246-260
- Interleaved Logical Qubits in Atom Arrays. Joshua Viszlai, Sophia Fuhui Lin, Siddharth Dangwal, Conor Bradley, Vikram Ramesh, Jonathan M. Baker, Hannes Bernien, Frederic T. Chong. 261-274
- Choco-Q: Commute Hamiltonian-based QAOA for Constrained Binary Optimization. Debin Xiang, Qifan Jiang, Liqiang Lu, Siwei Tan, Jianwei Yin. 275-289
- BOSS: Blocking algorithm for optimizing shuttling scheduling in Ion Trap. Xian Wu, Chenghong Zhu, Jingbo Wang, Xin Wang. 290-303
- LSQCA: Resource-Efficient Load/Store Architecture for Limited-Scale Fault-Tolerant Quantum Computing. Takumi Kobori, Yasunari Suzuki, Yosuke Ueno, Teruo Tanimoto, Synge Todo, Yuuki Tokunaga. 304-320
- R.I.P. Geomean Speedup Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead. Lieven Eeckhout. 322
- eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models. Minsik Cho, Keivan Alizadeh-Vahid, Qichen Fu, Saurabh Adya, Carlo C. del Mundo, Mohammad Rastegari, Devang Naik, Peter Zatloukal. 323
- EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models. Jaehoon Heo, Adiwena Putra, Jieon Yoon, Sungwoong Yune, Hangyeol Lee, Ji-Hoon Kim 0004, Joo-Young Kim 0001. 324-337
- Ditto: Accelerating Diffusion Model via Temporal Value Similarity. Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, Won Woo Ro. 338-352
- Gaussian Blending Unit: An Edge GPU Plug-in for Real-Time Gaussian-Based Rendering in AR/VR. Zhifan Ye, Yonggan Fu, Jingqun Zhang, Leshu Li, Yongan Zhang, Sixu Li, Cheng Wan 0005, Chenxi Wan, Chaojian Li, Sreemanth Prathipati, Yingyan Celine Lin. 353-365
- GSArch: Breaking Memory Barriers in 3D Gaussian Splatting Training via Architectural Support. Houshu He, Gang Li 0015, Fangxin Liu, Li Jiang 0002, Xiaoyao Liang, Zhuoran Song. 366-379
- Palermo: Improving the Performance of Oblivious Memory using Protocol-Hardware Co-Design. Haojie Ye, Yuchen Xia, Yuhan Chen, Kuan-Yu Chen 0001, Yichao Yuan, Shuwen Deng, Baris Kasikci, Trevor N. Mudge, Nishil Talati. 380-393
- SpecMPK: Efficient In-Process Isolation with Speculative and Secure Permission Update Instruction. Debpratim Adak, Huiyang Zhou, Eric Rotenberg, Amro Awad. 394-408
- BrokenSleep: Remote Power Timing Attack Exploiting Processor Idle States. Hyosang Kim, Ki-Dong Kang, Gyeongseo Park, Seungkyu Lee, Daehoon Kim. 409-422
- Efficient Memory Side-Channel Protection for Embedding Generation in Machine Learning. Muhammad Umar 0002, Akhilesh Parag Marathe, Monami Dutta Gupta, Shubham Jogprakash Ghosh, G. Edward Suh, Wenjie Xiong 0001. 423-441
- Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications. Liren Zhu, Liujia Li, Jianyu Wu, Yiming Yao, Zhan Shi, Jie Zhang, Zhenlin Wang, Xiaolin Wang, Yingwei Luo, Diyu Zhou. 442-457
- Concord: Rethinking Distributed Coherence for Software Caches in Serverless Environments. Jovan Stojkovic, Chloe Alverti, Alan Andrade, Nikoleta Iliakopoulou, Hubertus Franke, Tianyin Xu, Josep Torrellas. 458-473
- Grad: Intelligent Microservice Scaling by Harnessing Resource Fungibility. Liao Chen, Chenyu Lin, Shutian Luo, Huanle Xu, Chengzhong Xu 0001. 474-486
- Multi-Dimensional Vector ISA Extension for Mobile In-Cache Computing. Alireza Khadem, Daichi Fujiki, Hilbert Chen, Yufeng Gu, Nishil Talati, Scott A. Mahlke, Reetuparna Das. 487-503
- ER-DCIM: Error-Resilient Digital CIM Architecture with Run-Time MAC-Cell Error Correction. Zhen He, Yiqi Wang 0005, Zihan Wu 0006, Shaojun Wei, Yang Hu 0001, Fengbin Tu, Shouyi Yin. 504-517
- AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing. Liyan Chen, Dongxu Lyu, Jianfei Jiang 0001, Qin Wang 0009, Zhigang Mao, Naifeng Jing. 518-532
- SoMa: Identifying, Exploring, and Understanding the DRAM Communication Scheduling Space for DNN Accelerators. Jingwei Cai, Xuan Wang, Mingyu Gao 0001, Sen Peng, Zijian Zhu, Yuchen Wei, Zuotong Wu, Kaisheng Ma. 533-548
- Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling. Zhiyao Li, Bohan Yang, Jiaxiang Li, Taijie Chen, Xintong Li, Mingyu Gao 0001. 549-562
- EDA: Energy-Efficient Inter-Layer Model Compilation for Edge DNN Inference Acceleration. Bo Ren Pao, I-Chia Chen, En-Hao Chang, Tsung Tai Yeh. 563-576
- SkyByte: Architecting an Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design. Haoyang Zhang, Yuqi Xue, Yirui Eric Zhou, Shaobo Li 0005, Jian Huang 0006. 577-593
- Zebra: Efficient Redundant Array of Zoned Namespace SSDs Enabled by Zone Random Write Area (ZRWA). Tianyang Jiang, Guangyan Zhang, Xiaojian Liao, Yuqi Zhou. 594-607
- Reviving In-Storage Hardware Compression on ZNS SSDs through Host-SSD Collaboration. Yingjia Wang, Tao Lu, Yuhong Liang, Xiang Chen, Ming-Chang Yang. 608-623
- UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures. Tongxin Xie, Zhenhua Zhu, Bing Li, Yukai He, Cong Li, Guangyu Sun 0003, Huazhong Yang, Yuan Xie 0001, Yu Wang 0002. 624-640
- Piccolo: Large-Scale Graph Processing with Fine-Grained in-Memory Scatter-Gather. Changmin Shin, Jaeyong Song, Hongsun Jang, Dogeun Kim, Jun Sung, Taehee Kwon, Jae Hyung Ju, Frank Liu 0001, YeonKyu Choi, Jinho Lee. 641-656
- GoPIM: GCN-Oriented Pipeline Optimization for PIM Accelerators. Siling Yang, Shuibing He, Wenjiong Wang, Yanlong Yin, Tong Wu, Weijian Chen 0002, Xuechen Zhang 0001, Xian-He Sun, Dan Feng. 657-670
- LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator. Guoyu Li, Shengyu Ye, Chunyun Chen, Yang Wang 0053, Fan Yang 0024, Ting Cao, Cheng Liu, Mohamed M. Sabry Aly, Mao Yang 0004. 671-684
- Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs. Qizhe Wu, Huawen Liang, Yuchen Gui, Zhichen Zeng 0002, Zerong He, Linfeng Tao, Xiaotian Wang, Letian Zhao, Zhaoxi Zeng, Wei Yuan, Wei Wu, Xi Jin 0002. 685-700
- Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity. Dongyun Kam, Myeongji Yun, Sunwoo Yoo, Seungwoo Hong, Zhengya Zhang, Youngjoo Lee. 701-715
- From Optimal to Practical: Efficient Micro-op Cache Replacement Policies for Data Center Applications. Kan Zhu, Yilong Zhao, Yufei Gao, Peter Braun 0005, Tanvir Ahmed Khan, Heiner Litz, Baris Kasikci, Shuwen Deng. 716-731
- Rethinking Dead Block Prediction for Intermittent Computing. Gan Fang, Changhee Jung. 732-744
- Efficient Caching with A Tag-enhanced DRAM. Maryam Babaie, Ayaz Akram, Wendy Elsasser, Brent Haukness, Michael R. Miller, Taeksang Song, Thomas Vogelsang, Steven C. Woo, Jason Lowe-Power. 745-760
- PROCA: Programmable Probabilistic Processing Unit Architecture with Accept/Reject Prediction & Multicore Pipelining for Causal Inference. Yihan Fu, Anjunyi Fan, Wenshuo Yue, Hongxiao Zhao, Daijing Shi, Qiuping Wu, Jiayi Li, Xiangyu Zhang, Yaoyu Tao, Yuchao Yang 0001, Bonan Yan. 761-774
- CogSys: Efficient and Scalable Neurosymbolic Cognition System via Algorithm-Hardware Co-Design. Zishen Wan, Hanchen Yang, Ritik Raj, Che-Kai Liu, Ananda Samajdar, Arijit Raychowdhury, Tushar Krishna. 775-789
- NeuVSA: A Unified and Efficient Accelerator for Neural Vector Search. Ziming Yuan, Lei Dai, Wen Li, Jie Zhang, Shengwen Liang, Ying Wang, Cheng Liu, Huawei Li, Xiaowei Li, Jiafeng Guo, Peng Wang, Renhai Chen, Gong Zhang 0001. 790-805
- Prosperity: Accelerating Spiking Neural Networks via Product Sparsity. Chiyue Wei, Cong Guo 0003, Feng Cheng, Shiyu Li 0001, Hao Frank Yang, Hai Helen Li, Yiran Chen 0001. 806-820
- Bit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation. Insu Choi, Young-Seo Yoon, Joon-Sung Yang. 821-835
- A Hardware-Software Design Framework for SpMV Acceleration with Flexible Access Pattern Portfolio. Zhenyu Wu, Maolin Wang 0002, Hayden Kwok-Hay So. 836-848
- Variable Read Disturbance: An Experimental Analysis of Temporal Variation in DRAM Read Disturbance. Ataberk Olgun, F. Nisa Bostanci, Ismail Emir Yüksel, Oguzhan Canpolat, Haocong Luo, Geraldo F. Oliveira, A. Giray Yaglikçi, Minesh Patel, Onur Mutlu. 849-866
- Understanding RowHammer Under Reduced Refresh Latency: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions. Yahya Can Tugrul, A. Giray Yaglikçi, Ismail Emir Yüksel, Ataberk Olgun, Oguzhan Canpolat, Nisa Bostanci, Mohammad Sadrosadati, Oguz Ergin, Onur Mutlu. 867-886
- Chronus: Understanding and Securing the Cutting-Edge Industry Solutions to DRAM Read Disturbance. Oguzhan Canpolat, A. Giray Yaglikçi, Geraldo F. Oliveira, Ataberk Olgun, Nisa Bostanci, Ismail Emir Yuksel, Haocong Luo, Oguz Ergin, Onur Mutlu. 887-905
- NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing. Marjan Fariborz, Mahyar Samani, Austin York, S. J. Ben Yoo, Jason Lowe-Power, Venkatesh Akella. 906-919
- MeHyper: Accelerating Hypergraph Neural Networks by Exploring Implicit Dataflows. Wenju Zhao, Pengcheng Yao, Dan Chen 0006, Long Zheng 0003, Xiaofei Liao, Qinggang Wang, Shaobo Ma, Yu Li, Haifeng Liu 0003, Wenjing Xiao, Yufei Sun, Bing Zhu 0008, Hai Jin 0001, Jingling Xue. 920-933
- Cambricon-DG: An Accelerator for Redundant-Free Dynamic Graph Neural Networks Based on Nonlinear Isolation. Zhifei Yue, Xinkai Song, Tianbo Liu 0006, Xing Hu 0001, Rui Zhang 0040, Zidong Du, Wei Li 0008, Qi Guo 0001, Tianshi Chen 0002. 934-948
- TB-STC: Transposable Block-wise N:M Structured Sparse Tensor Core. Jun Liu 0071, Shulin Zeng, Junbo Zhao 0001, Li Ding 0012, Zeyu Wang, Jinhao Li 0006, Zhenhua Zhu, Xuefei Ning, Chen Zhang, Yu Wang 0002, Guohao Dai. 949-962
- CROSS: Compiler-Driven Optimization of Sparse DNNs Using Sparse/Dense Computation Kernels. Fangxin Liu, Shiyuan Huang, Ning Yang, Zongwu Wang, Haomin Li 0002, Li Jiang 0002. 963-976
- AccelES: Accelerating Top-K SpMV for Embedding Similarity via Low-bit Pruning. Jiaqi Zhai, Xuanhua Shi, Kaiyi Huang, Chencheng Ye, Weifang Hu, Bingsheng He, Hai Jin 0001. 977-990
- AutoRFM: Scaling Low-Cost in-DRAM Trackers to Ultra-Low Rowhammer Thresholds. Moinuddin Qureshi. 991-1004
- DAPPER: A Performance-Attack-Resilient Tracker for RowHammer Defense. Jeonghyun Woo, Prashant J. Nair. 1005-1020
- QPRAC: Towards Secure and Practical PRAC-based Rowhammer Mitigation using Priority Queues. Jeonghyun Woo, Shaopeng Chris Lin, Prashant J. Nair, Aamer Jaleel, Gururaj Saileshwar. 1021-1037
- I-DGNN: A Graph Dissimilarity-based Framework for Designing Scalable and Efficient DGNN Accelerators. Jiaqi Yang, Hao Zheng 0005, Ahmed Louri. 1038-1051
- Mithril: A Scalable System for Deep GNN Training. Jingji Chen, Zhuoming Chen, Xuehai Qian. 1052-1065
- Buffalo: Enabling Large-Scale GNN Training via Memory-Efficient Bucketization. Shuangyan Yang, Minjia Zhang, Dong Li. 1066-1081
- BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration. Yuzong Chen 0001, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah. 1082-1097
- FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables. Gunho Park, Hyeokjun Kwon, Jiwoo Kim, Jeongin Bae, Baeseong Park, Dongsoo Lee, Youngjoo Lee. 1098-1111
- M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type. Weiming Hu, Haoyan Zhang, Cong Guo 0003, Yu Feng 0007, Renyang Guan, Zhendong Hua, Zihan Liu 0002, Yue Guan 0003, Minyi Guo, Jingwen Leng. 1112-1126
- FHENDI: A Near-DRAM Accelerator for Compiler-Generated Fully Homomorphic Encryption Applications. Yongmo Park, Aporva Amarnath, Subhankar Pal, Karthik Swaminathan, Alper Buyuktosunoglu, Hayim Shaul, Ehud Aharoni, Nir Drucker, Wei D. Lu, Omri Soceanu, Pradip Bose. 1127-1142
- EFFACT: A Highly Efficient Full-Stack FHE Acceleration Platform. Yi Huang, Xinsheng Gong, Xiangyu Kong, Dibei Chen, Jianfeng Zhu 0001, Wenping Zhu, Liangwei Li, Mingyu Gao, Shaojun Wei, Aoyang Zhang, Leibo Liu. 1143-1157
- Anaheim: Architecture and Algorithms for Processing Fully Homomorphic Encryption in Memory. Jongmin Kim 0007, Sungmin Yun 0001, Hyesung Ji, Wonseok Choi 0012, Sangpyo Kim, Jung Ho Ahn. 1158-1173
- Hydra: Scale-out FHE Accelerator Architecture for Secure Deep Learning on FPGA. Yinghao Yang, Xicheng Xu, Haibin Zhang, Jie Song, Xin Tang, Hang Lu, Xiaowei Li 0001. 1174-1186
- WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores. Guang Fan, Mingzhe Zhang, Fangyu Zheng, Shengyu Fan, Tian Zhou, Xianglong Deng, Wenxu Tang, Liang Kong, Yixuan Song, Shoumeng Yan. 1187-1200
- MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI. Arya Tschand, Arun Tejusve Raghunath Rajan, Sachin Idgunji, Anirban Ghosh, Jeremy Holleman, Csaba Király 0002, Pawan Ambalkar, Ritika Borkar, Ramesh Chukka, Trevor Cockrell, Oliver Curtis, Grigori Fursin, Miro Hodak, Hiwot Kassa, Anton Lokhmotov, Dejan Miskovic, Yuechao Pan, Manu Prasad Manmathan, Liz Raymond, Tom St. John, Arjun Suresh, Rowan Taubitz, Sean Zhan, Scott Wasson, David Kanter, Vijay Janapa Reddi. 1201-1216
- Enterprise Class Modular Cache Hierarchy. Craig R. Walters, Deanna Postles Dunn Berger, Robert J. Sonnelitter, Alper Buyuktosunoglu. 1217-1230
- Predicting DRAM-Caused Risky VMs in Large-Scale Clouds. Yaoguang Yong, Xiaoming Du, Xuhua Ma, Yuxiang Wang, Bin Yao 0002, Xudong Zheng, Huite Yi. 1231-1245
- Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization. Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Chang Zhou, Dennis Cai, Yuan Xie, Binzhang Fu. 1246-1258
- Revisiting Reliability in Large-Scale Machine Learning Research Clusters. Apostolos Kokolis, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zachary Devito, Shubho Sengupta, Kalyan Saladi, Carole-Jean Wu. 1259-1274
- HILP: Accounting for Workload-Level Parallelism in System-on-Chip Design Space Exploration. Joseph Rogers, Lieven Eeckhout, Magnus Jahre. 1275-1288
- CORDOBA: Carbon-Efficient Optimization Framework for Computing Systems. Mariam Elgamal, Doug Carmean, Elnaz Ansari, Okay Zed, Ramesh Peri, Srilatha Manne, Udit Gupta, Gu-Yeon Wei, David Brooks 0001, Gage Hills, Carole-Jean Wu. 1289-1303
- Architecting Space Microdatacenters: A System-level Approach. Nathan Bleier, Rick Eason, Michael Lembeck, Rakesh Kumar. 1304-1319
- ARTEMIS: Agile Discovery of Efficient Real-Time Systems-on-Chips in the Heterogeneous Era. Subhankar Pal, Aporva Amarnath, Behzad Boroujerdian, Augusto Vega, Alper Buyuktosunoglu, John-David Wellman, Vijay Janapa Reddi, Pradip Bose. 1320-1334
- LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications. Yujun Lin 0001, Zhekai Zhang, Song Han 0003. 1335-1347
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse. 1348-1362
- throttLL'eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving. Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, Dimitrios Soudris. 1363-1378
- RpcNIC: Enabling Efficient Datacenter RPC Offloading on PCIe-attached SmartNICs. Jie Zhang 0081, Hongjing Huang, Xuzheng Chen, Xiang Li, Jieru Zhao, Ming Liu 0027, Zeke Wang. 1379-1394
- NVMePass: A Lightweight, High-performance and Scalable NVMe Virtualization Architecture with I/O Queues Passthrough. Yiquan Chen, Zhen Jin 0008, Yijing Wang, Yi Chen, Jiexiong Xu, Hao Yu, Jinlong Chen, Wenhai Lin, Kanghua Fang, Keyao Zhang, Chengkun Wei, Qiang Liu, Yuan Xie 0001, Wenzhi Chen. 1395-1407
- Warped-Compaction: Maximizing GPU Register File Bandwidth Utilization via Operand Compaction. Eunbi Jeong, Ipoom Jeong, Myung Kuk Yoon, Nam Sung Kim. 1408-1421
- Cooperative Warp Execution in Tensor Core for RISC-V GPGPU. Abubakr Nada, Giuseppe Maria Sarda, Erwan Lenormand. 1422-1436
- SparseWeaver: Converting Sparse Operations as Dense Operations on GPUs for Graph Workloads. Shinnung Jeong, Liam Paul Coopert, Ju Min Lee, Heelim Choi, Nicholas Parnenzini, Chihyo Ahn, Yongwoo Lee 0001, Hanjun Kim 0001, Hyesoon Kim. 1437-1451
- HSMU-SpGEMM: Achieving High Shared Memory Utilization for Parallel Sparse General Matrix-Matrix Multiplication on Modern GPUs. Min Wu, Huizhang Luo, Fenfang Li, Yiran Zhang, Zhuo Tang, Kenli Li 0001, Jeff Zhang 0001, Chubo Liu. 1452-1466
- Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format. Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst. 1467-1481
- LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding. Haoran Wang, Yuming Li, Haobo Xu, Ying Wang 0001, Liqi Liu, Jun Yang, Yinhe Han 0001. 1482-1495
- VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference. Zihan Liu 0002, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou 0001, Yue Guan 0003, Cong Guo 0003, Weihao Cui, Yu Feng 0007, Minyi Guo, Yuhao Zhu 0001, Minjia Zhang, Chen Jin, Jingwen Leng. 1496-1509
- InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference. Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang. 1510-1525
- TidalMesh: Topology-Driven AllReduce Collective Communication for Mesh Topology. Dongkyun Lim, John Kim. 1526-1540
- Push Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck. Jiayi Huang 0001, Yanhua Chen, Zhe Wang 0023, Christopher J. Hughes, Yufei Ding, Yuan Xie 0001. 1541-1556
- PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM. Hyojun Son, Gilbert Jonatan, Xiangyu Wu, Haeyoon Cho 0002, Kaustubh Shivdikar, José L. Abellán, Ajay Joshi, David R. Kaeli, John Kim 0001. 1557-1572
- EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer. Siyao Jia, Bo Jiao, Haozhe Zhu, Chixiao Chen, Qi Liu 0010, Ming Liu 0022. 1573-1587
- Ariadne: A Hotness-Aware and Size-Adaptive Compressed Swap Technique for Fast Application Relaunch and Reduced CPU Usage on Mobile Devices. Yu Liang 0004, Aofeng Shen, Chun Jason Xue, Riwei Pan, Haiyu Mao, Nika Mansouri-Ghiasi, Qingcai Jiang, Rakesh Nadig, Lei Li, Rachata Ausavarungnirun, Mohammad Sadrosadati, Onur Mutlu. 1588-1602
- Gemina: A Coordinated and High-Performance Memory Deduplication Engine. Zhehua Zhang, Suzhen Wu, Wenyan You, Chunfeng Du, Bo Mao. 1603-1617
- No Rush in Executing Atomic Instructions. Ashkan Asgharzadeh, Josué Feliu, Manuel E. Acacio, Stefanos Kaxiras, Alberto Ros 0001. 1618-1630
- Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory. Jie Ren, Bin Ma, Shuangyan Yang, Benjamin Francis, Ehsan K. Ardestani, Min-Si, Dong Li. 1631-1647
- Let-Me-In: (Still) Employing In-pointer Bounds Metadata for Fine-grained GPU Memory Safety. Jaewon Lee, Euijun Chung, Saurabh Singh, Seonjin Na, Yonghae Kim, Jaekyu Lee, Hyesoon Kim. 1648-1661
- Marching Page Walks: Batching and Concurrent Page Table Walks for Enhancing GPU Throughput. Jiwon Lee, Gun Ko, Myung Kuk Yoon, Ipoom Jeong, Yunho Oh, Won Woo Ro. 1662-1677
- OASIS: Object-Aware Page Management for Multi-GPU Systems. Yueqi Wang, Bingyao Li, Mohamed Tarek Ibn Ziad, Lieven Eeckhout, Jun Yang 0002, Aamer Jaleel, Xulong Tang. 1678-1692
- NearFetch: Saving Inter-Module Bandwidth in Many-Chip-Module GPUs. Xia Zhao, Guangda Zhang, Lu Wang, Shiqing Zhang, Huadong Dai. 1693-1706
- PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM. Hyojung Lee, Daehyeon Baek, Jimyoung Son, Jieun Choi, Kihyo Moon, Minsung Jang. 1707-1719
- FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference. Seong Hoon Seo, Junghoon Kim, Donghyun Lee 0005, Seonah Yoo, Seokwon Moon, Yeonhong Park, Jae W. Lee. 1720-1733
- Lincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory. Weiyi Sun, Mingyu Gao, Zhaoshi Li, Aoyang Zhang, Iris Ying Chou, Jianfeng Zhu 0001, Shaojun Wei, Leibo Liu. 1734-1750
- Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM. Lian Liu, Shixin Zhao, Bing Li 0017, Haimeng Ren, Zhaohui Xu, Mengdi Wang, Xiaowei Li 0001, Yinhe Han 0001, Ying Wang 0001. 1751-1765