- MIRZA: Efficiently Mitigating Rowhammer with Randomization and ALERT. Hritvik Taneja, Ali Hajiabadi, Michele Marazzi, Kaveh Razavi, Moinuddin Qureshi. 1-13
- LRM-GPU: Alleviating Synchronization Overhead for Multi-Chiplet GPU Architecture. Baiqing Zhong, Zhirong Ye, Xiaojie Li, Peilin Wang, Haiqiu Huang, Zhaolin Li, Zhiyi Yu, Mingyu Wang 0003. 1-14
- SpotCC: Facilitating Coded Computation for Prediction Serving Systems on Spot Instances. Lin Wang, Yuchong Hu, Ziling Duan, Mingqi Li, Chenxuan Yao, Feifan Liu, Xiaolu Li 0002, Leihua Qin, Dan Feng 0001. 1-14
- CROPHE: Cross-Operator Dataflow Optimization for Fully Homomorphic Encryption Accelerators. Xinhua Chen, Jiangbin Dong, Hongren Zheng, Tian Tang 0001, Mingyu Gao 0001. 1-14
- MemSOS: OS-Guided Selective Memory Mirroring. Junghoon Kim 0008, Jongheon Jeong, Seokwon Moon, Seong Hoon Seo, Yeonhong Park, Jinkyu Jeong, Nam Sung Kim, Jae W. Lee. 1-15
- Toward Scalable Gate-Level Parallelism on Trapped-Ion Processors with Racetrack Electrodes. Enhyeok Jang, Hyungseok Kim 0003, Yongju Lee 0003, Jaewon Kwon, Yipeng Huang 0001, Won Woo Ro. 1-17
- Protean: A Programmable Spectre Defense. Nicholas Mosier, Hamed Nemati, John C. Mitchell, Caroline Trippel. 1-20
- Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models. Chiyue Wei, Cong Guo 0003, Junyao Zhang 0003, Haoxuan Shan, Yifan Xu, Ziyue Zhang, Yudong Liu, Qinsi Wang, Changchun Zhou 0001, Hai Helen Li, Yiran Chen 0001. 1-18
- Towards Resource-Efficient Serverless LLM Inference with SLINFER. Chuhao Xu, Zijun Li 0001, Quan Chen 0002, Han Zhao 0005, Xueyan Tang, Minyi Guo. 1-18
- Area Bloating and the Future of Specialization. Qixuan Yu, David Wentzlaff. 1-14
- SCALE: Tackling Communication Bottlenecks in Confidential Distributed Machine Learning. Joongun Park, Yongqin Wang, Huan Xu, Hanjiang Wu, Mengyuan Li, Tushar Krishna. 1-14
- TENET-v2: Applying Relation-Centric Notation to Model and Optimize Data Swizzle in the Cache of Modern NPU. Hanyu Zhang, Fangxu Guo, Liqiang Lu, Long Wang, Yunfei Du, Zhe Wang, Jinghan Zhang, Jie Zhang, Chenli Xue, Chengpeng Wu, Ziyi Zhang, Yun Liang 0001, Size Zheng 0001, Jianwei Yin. 1-15
- Secret Caching Sauce for High-Performance Secure Memory. Xu Jiang 0005, Xueliang Wei, Yifei Qu, Dan Feng 0001, Yulai Xie 0002, Wei Tong 0001. 1-14
- AccelFlow: Orchestrating an On-Package Ensemble of Fine-Grained Accelerators for Microservices. Jovan Stojkovic, Abraham Farrell, Zhangxiaowen Gong, Christopher J. Hughes, Josep Torrellas. 1-17
- C³: CXL Coherence Controllers for Heterogeneous Architectures. Anatole Lefort, David Schall, Nicolò Carpentieri, Julian Pritzi, Soham Chakraborty 0001, Nicolai Oswald, Pramod Bhatotia. 1-17
- Tempranillo: Non-Speculative Early Register Release. Carlos Escuin, Paolo Salvatore Galfano, Davide Basilio Bartolini, Leeor Peled, Mehdi Alipour. 1-17
- ESTroM: Element-Flow Architecture for Processing Sparse Tractable Probabilistic Models. Anjunyi Fan, Xuejie Liu, Anji Liu, Qiuping Wu, Jiaqi Yang, Yuchao Qin, Guy Van den Broeck, Yitao Liang, Bonan Yan. 1-15
- Sassy: SmartNIC-Assisted Notification Delivery for μs-Scale RDMA Workloads. Hamed Seyedroudbari, Alexandros Daglis. 1-14
- ASPA: Reassigning DDR5 Parity Bandwidth. Fan Li, Qiufeng Li, Yanan Guo 0002, Weidong Cao 0001, Xin Xin 0008. 1-14
- Intermittence-Aware Cache Compression. Gan Fang, Jianping Zeng 0001, Yuchen Zhou 0005, Changhee Jung. 1-17
- COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators. Taishu Sheng, Guangyu Sun 0003, Dezun Dong. 1-14
- An Efficient and Scalable Hardware Architecture for Number Theoretic Transform on FPGA with Design Automation. Yilan Zhu, Geng Yang 0001, Xingyu Tian, Dilshan Kumarathunga, Liang Kong 0005, Xianglong Deng, Shengyu Fan, Guang Fan, Guiming Shi, Lei Chen, Bo Zhang 0098, Yisong Chang, Shoumeng Yan, Zhenman Fang, Mingzhe Zhang 0005. 1-14
- Scaling Graph Neural Network Training via Geometric Optimization. Fangzhou Ye, Lingxiang Yin, Hao Zheng 0005. 1-15
- BARD: Reducing Write Latency of DDR5 Memory by Exploiting Bank-Parallelism. Suhas K. Vittal, Moinuddin Qureshi. 1-15
- SMTcheck: Accurate SMT Interference Prediction to Improve Scheduling Efficiency in Datacenters. Sanghyun Kim, Jinhyeok Oh, Taehun Kim, Gyutae Kim, Youngsok Kim, Jaehyun Hwang, Joonsung Kim 0001. 1-15
- Athena: Synergizing Data Prefetching and Off-Chip Prediction via Online Reinforcement Learning. Rahul Bera, Zhenrong Lang, Caroline Hengartner, Konstantinos Kanellopoulos, Rakesh Kumar 0003, Mohammad Sadrosadati, Onur Mutlu. 1-19
- Fully Parallelized BP Decoding for Quantum LDPC Codes Can Outperform BP-OSD. Ming Wang, Ang Li, Frank Mueller 0001. 1-14
- eGPU: Production-Scale Elastic Sharing Over 10,000 GPUs. Xiaochuan Tang, Hao Qi, Jianbo Dong, Yinghao Yu, Zhennan Xue, Zhengyu Zhang, Daocheng Ying, Zheng Cao 0003, Xiaoyi Lu 0001. 1-14
- ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving. Jiuchen Shi, Hang Zhang, Yixiao Wang, Quan Chen 0002, Yizhou Shan, Kaihua Fu, Wei Wang 0030, Minyi Guo. 1-14
- DC-MBQC: A Distributed Compilation Framework for Measurement-Based Quantum Computing. Yecheng Xue, Rui Yang, Zhiding Liang, Tongyang Li. 1-14
- Pulse: Fine-Grained Hierarchical Hashing Index for Disaggregated Memory. Guangyang Deng, Zixiang Yu, Zhirong Shen, Qiangsheng Su, Zhinan Cheng, Jiwu Shu. 1-14
- Cyclone: Designing Efficient and Highly Parallel QCCD Architectural Codesigns for Fault Tolerant Quantum Memory. Sahil Khan, Abhinav Anand, Kenneth R. Brown, Jonathan M. Baker. 1-14
- Advancing Full-Stack Acceleration for Schrödinger-Style Quantum Simulation. Shuang Liang 0012, Yuncheng Lu, Ce Guo, Paul H. J. Kelly, Wayne Luk, Hongxiang Fan. 1-15
- Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems. Runze Wang, Qinggang Wang, Haifeng Liu 0003, Long Zheng 0003, Xiaofei Liao, Hai Jin 0001, Jingling Xue. 1-15
- Nugget: Portable Program Snippets. Zhantong Qiu, Mahyar Samani, Jason Lowe-Power. 1-17
- VAR-Turbo: Unlocking the Potential of Visual Autoregressive Models Through Dual Redundancy. Xujiang Xiang, Fengbin Tu. 1-16
- VeloxGNN: Efficient Out-of-Core GNN Training with Delayed Gradient Propagation. Yi Li, Tsun-Yu Yang, Zhaoyan Shen, Ming-Chang Yang, Bingzhe Li. 1-16
- GenPairX: A Hardware-Algorithm Co-Designed Accelerator for Paired-End Read Mapping. Julien Eudine, Chu Li, Zhuo Cheng, Renzo Andri, Can Firtina, Mohammad Sadrosadati, Nika Mansouri-Ghiasi, Konstantina Koliogeorgi, Anirban Nag, Arash Tavakkol, Haiyu Mao, Onur Mutlu, Shai Bergman, Ji Zhang 0035. 1-16
- A Deadlock-Free Bridge Module for Inter-Chiplet Cache-Coherent Communication in an Open Chiplet Ecosystem. Zhiqiang Chen 0006, Wenwen Fu, Yongwen Wang, Hongwei Zhou. 1-13
- Predicting DRAM Failures at Scale: A Two-Stage Approach for Heterogeneous Systems. Chenglin Wang, Shouxin Wang, Zhirong Shen, Lu Tang, Shuyue Zhou, Ronglong Wu, Min Zhou, Jialiang Yu, Yiming Zhang. 1-14
- UniFHE: Faster Accelerator for FHE with Diverse Algebraic Structure and Balanced Memory System. Qingyun Niu, Lutan Zhao, Ming Cai, Kai Li, Dan Meng 0002, Rui Hou 0001. 1-14
- TurboFuzz: FPGA Accelerated Hardware Fuzzing for Processor Agile Verification. Yang Zhong, Haoran Wu, Xueqi Li, Sa Wang, David Boland, Yungang Bao, Kan Shi. 1-16
- N-DIPPER: A Distributed Inter-Die Peak Power Management Network for NAND Systems. Jinwoo Park, John Kim. 1-14
- ORANGE: Exploring Ockham's Razor for Neural Rendering by Accelerating 3DGS on NPUs with GEMM-Friendly Blending and Balanced Workloads. Haomin Li 0002, Yun Liang 0001, Fangxin Liu, Bowen Zhu, Zongwu Wang, Yu Feng 0007, Liqiang Lu, Li Jiang 0002, Haibing Guan. 1-15
- PIM-Malloc: A Fast and Scalable Dynamic Memory Allocator for Processing-In-Memory (PIM) Architectures. Dongjae Lee, Bongjoon Hyun, Youngjin Kwon, Minsoo Rhu. 1-17
- Streamlined On-Chip Temporal Prefetching. Quang Duong 0002, Calvin Lin. 1-15
- The Last-Level Branch Predictor Revisited. David Schall, Mária Duracková, Boris Grot. 1-16
- FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection. Ziyu Huang, Yangjie Zhou 0001, Zihan Liu 0002, Xinhao Luo, Yijia Diao, Minyi Guo, Jidong Zhai, Yu Feng 0007, Chen Zhang 0001, Anbang Wu, Jingwen Leng. 1-14
- Conflux: A High-Performance Keyword Private Retrieval System for Dynamic Datasets. Zehao Chen, Zhaoyan Shen, Qian Wei, Hang Lu, Lei Ju 0001. 1-14
- SPLATONIC: Architectural Support for 3D Gaussian Splatting SLAM via Sparse Processing. Xiaotong Huang, He Zhu, Tianrui Ma, Yuxiang Xiong, Fangxin Liu, Zhezhi He, Yiming Gan, Zihan Liu 0002, Jingwen Leng, Yu Feng 0007, Minyi Guo. 1-14
- Cambricon-CIM: Enabling Energy-Efficient and Error-Resilient Analog CIM Acceleration via Reformation of Coding Bases. Hongrui Guo, Tianrui Ma, Zidong Du, Mo Zou, Yifan Hao 0001, Yongwei Zhao 0001, Rui Zhang 0040, Wei Li 0008, Xing Hu 0001, Zhiwei Xu 0002, Qi Guo 0001, Tianshi Chen 0002. 1-16
- AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training. Yuanyuan Wang, Nana Tang, Yuyang Wang, Shu Pan, Dingding Yu, Zeyue Wang, Mou Sun, Kejie Fu, Fangyu Wang, Yunchuan Chen, Ning Sun, Fei Yang. 1-17
- AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization. Kosuke Matsushima, Yasuyuki Okoshi, Masato Motomura, Daichi Fujiki. 1-17
- TraceRTL: Agile Performance Evaluation for Microarchitecture Exploration. Zifei Zhang 0001, Yinan Xu 0001, Sa Wang, Dan Tang 0002, Yungang Bao. 1-15
- HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs. Daoxuan Xu, Ying Li 0049, Yuwei Sun, Jie Ren, Yifan Sun 0002. 1-12
- RPU - A Reasoning Processing Unit. Matthew Joseph Adiletta, Gu-Yeon Wei, David Brooks 0001. 1-17
- TraceQ: Trace-Based Reconstruction of Quantum Circuit Dataflow in Surface-Code Fault-Tolerant Quantum Computing. Theodoros Trochatos, Christopher Kang, Andrew Wang, Frederic T. Chong, Jakub Szefer. 1-14
- Uni-STC: Unified Sparse Tensor Core. Haocheng Lian, Qiyue Zhang, Xinran Zhao, Meichen Dong, Yijie Nie, Zhengyi Zhao, Junzhong Shen, Wei Guo, Chun Huang, Bingcai Sui, Weifeng Liu. 1-18
- PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion. Huizheng Wang, Hongbin Wang, Zichuan Wang, Zhiheng Yue, Yang Wang 0089, Chao Li 0009, Yang Hu 0001, Shouyi Yin. 1-19
- µShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs. Wenhao Huang 0005, Zhaolin Duan, Laiping Zhao, Yuhao Zhang 0006, Yanjie Wang, Yiming Li, Yihan Wang, Yichi Chen 0001, Zhihang Tang, Kang Chen, Deze Zeng, Wenxin Li 0001, Keqiu Li. 1-14
- Cohet: A CXL-Driven Coherent Heterogeneous Computing Framework with Hardware-Calibrated Full-System Simulation. Yanjing Wang 0007, Lizhou Wu, Sunfeng Gao, Yibo Tang, Junhui Luo, Zicong Wang, Yang Ou, Dezun Dong, Nong Xiao 0001, Mingche Lai. 1-16
- RidgeWalker: Perfectly Pipelined Graph Random Walks on FPGAs. Hongshi Tan, Yao Chen 0008, Xinyu Chen 0001, Qizhen Zhang, Cheng Chen 0008, Weng-Fai Wong, Bingsheng He. 1-15
- D'ArQ: A QOC Framework with Causality-Aware Grouping and Basis Selection. Changheon Lee, Hyungseok Kim 0003, Seungwoo Choi, Youngmin Kim, Won Woo Ro. 1-13
- Cambricon-GS: An Accelerator for 3D Gaussian Splatting Training With Gaussian-Pixel Hybrid Parallelism. Rui Wen, Zhifei Yue, Tianbo Liu 0006, Xinkai Song, Jin Li, Di Huang, Jiaming Guo, Xing Hu 0001, Zidong Du, Qi Guo 0001, Tianshi Chen 0002. 1-14
- Pinball: A Cryogenic Predecoder for Quantum Error Correction Decoding Under Circuit-Level Noise. Alexander Knapen, Guanchen Tao, Jacob Mack, Tomas Bruno, Mehdi Saligane, Dennis Sylvester, Qirui Zhang 0001, Gokul Subramanian Ravi. 1-17
- LoCaLUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM. JungUk Hong, Changmin Shin 0002, Sukjin Kim, Si Ung Noh, Taehee Kwon 0002, Seongyeon Park, Hanjun Kim 0001, Youngsok Kim, Jinho Lee 0001. 1-16
- SSBleed: Non-Speculative Side-Channel Attacks via Speculative Store Bypass on Armv9 CPUs. Chang Liu 0117, Hongpei Zheng, Xin Zhang 0110, Dapeng Ju, Dongsheng Wang 0002, Yinqian Zhang, Trevor E. Carlson. 1-15
- Exploration of LLM Workload Reliability Based on di/dt Effects and Voltage Droops. Zhixing Jiang, Justin Garrigus, Allison Seigler, Ethan Syed, Yan-Lun Huang, Mehdi Sadi, Tawfik Rahal-Arabi, Lizy Kurian John. 1-15
- The Memory Processing Unit: A Generalized Interface for End-to-End In-Memory Execution. Minh S. Q. Truong, Yiqiu Sun 0002, Dawei Xiong, Amol Shah, Alexander Glass, Abraham Farrell, James A. Bain, L. Richard Carley, Saugata Ghose. 1-16
- A PN-Free Digital 3-SAT Accelerator Using Crossbar Architecture and Frequency-Controlled Counters. Zhezheng Ren, Chenao Yuan, Yuke Zhang, Shiyu Su. 1-14
- ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription. Hyunkyun Shin, Seongtae Bang, Hyungwon Park, Daehoon Kim 0001. 1-15
- SFD: Towards Segment Fusion Dataflow for Spatial Accelerators. Fuyu Wang, Minghua Shen, Yufei Ding 0001, Nong Xiao 0001, Yutong Lu. 1-14
- HR-DCIM: High-Reliability Floating-Point Digital CIM Architecture With Unified Low-Cost Iterative Error Correction. Zhen He, Yiqi Wang 0005, Zhiheng Yue, Zihan Wu 0006, Huiming Han, Shaojun Wei, Yang Hu 0001, Fengbin Tu, Shouyi Yin. 1-15
- Enterprise Class On-Chip Accelerator Integration. Deanna Postles Dunn Berger, Alper Buyuktosunoglu, Craig R. Walters, Robert J. Sonnelitter, Hailey Nicholson, Ashraf ElSharif, Yamil Rivera, Avery Francois, Cédric Lichtenau, Jason Kohl. 1-15
- HERO-Sign: Hierarchical Tuning and Efficient Compiler-Time GPU Optimizations for SPHINCS+ Signature Generation. Yaoyun Zhou, Qian Wang. 1-13
- TEMP: A Memory Efficient Physical-Aware Tensor Partition-Mapping Framework on Wafer-Scale Chips. Huizheng Wang, Taiquan Wei, Zichuan Wang, Dingcheng Jiang, Qize Yang, Jiaxin Liu, Jingxiang Hou, Chao Li 0009, Jinyi Deng, Yang Hu 0001, Shouyi Yin. 1-18
- SALT: Track-and-Mitigate Subarrays, Not Rows, for Blast-Radius-Free Rowhammer Defense. Moinuddin K. Qureshi. 1-16
- NPUWattch: ML-Based Power, Area, and Timing Modeling for Neural Accelerators. Sehyeon Kim, Minkwan Kim, Chanho Park, Hanmok Park, Seonghoon Kim, Taigon Song, William J. Song. 1-14
- MoEntwine: Unleashing the Potential of Wafer-Scale Chips for Large-Scale Expert Parallel Inference. Xinru Tang, Jingxiang Hou, Dingcheng Jiang, Taiquan Wei, Jiaxin Liu, Jinyi Deng, Huizheng Wang, Qize Yang, Haoran Shang, Chao Li 0009, Yang Hu 0001, Shouyi Yin. 1-15
- PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models. Eunyeong Cho, Jehyeon Bang, Ranggi Hwang, Minsoo Rhu. 1-16
- AUM: Unleashing the Efficiency Potential of Shared Processors with Accelerator Units for LLM Serving. Xinkai Wang 0003, Chao Li 0009, Yiming Zhuansun, Jinyang Guo, Xiaofeng Hou, Jing Wang 0055, LuPing Wang, Weigao Chen, Cheng Huang, Guodong Yang, Liping Zhang 0013, Minyi Guo. 1-15
- FractalCloud: A Fractal-Inspired Architecture for Efficient Large-Scale Point Cloud Processing. Yuzhe Fu, Changchun Zhou 0001, Hancheng Ye, Bowen Duan, Qiyu Huang, Chiyue Wei, Cong Guo 0003, Hai Helen Li, Yiran Chen 0001. 1-15
- GustavSNN: Unleashing the Power of Gustavson's Algorithm on SNN Acceleration with Column-Parallel Tick-Batch Dataflow. Sangwoo Hwang, Donghun Lee, Jahyun Koo 0002, Jaeha Kung. 1-14
- Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates. Wenjun Yu, Sitian Chen, Cheng Chen, Amelie Chi Zhou. 1-15
- CoCoTree: A Computation-Capable Architecture for Collective Communication in Scalable PIM. Shunchen Shi, Qijia Yang, Fan Yang 0096, Yu Huang, Youwei Zhuo, Zhichun Li, Ninghui Sun, Xueqi Li 0001. 1-16
- I-POP: Ignite Positive Prefetchers. Yiquan Lin, Wenhai Lin, Yiquan Chen, Jiexiong Xu, Shishun Cai, Jiarong Ye, Zonghui Wang, Wenzhi Chen. 1-16
- V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval. Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim 0001. 1-14
- FACE: Fully Overlapped PD Scheduling and Multi-Level Architecture Co-Exploration on Wafer. Zheng Xu, Dehao Kong, Jiaxin Liu, Dingcheng Jiang, Xu Dai, Jinyi Deng, Yang Hu 0001, Shouyi Yin. 1-16
- QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs. Nicolás Meseguer, Daoxuan Xu, Yifan Sun 0002, Michael Pellauer, José L. Abellán, Manuel E. Acacio. 1-14
- The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective. Jiin Kim, Byeongjun Shin, Jinha Chung, Minsoo Rhu. 1-16
- zkPHIRE: A Programmable Accelerator for ZKPs over HIgh-degRee, Expressive Gates. Alhad Daftardar, Jianqiao Mo, Joey Ah-kiow, Benedikt Bünz, Siddharth Garg, Brandon Reagen. 1-15
- Compression-Aware Gradient Splitting for Collective Communications in Distributed Training. Pranati Majhi, Sabuj Laskar, Abdullah Muzahid, Eun Jung Kim 0001. 1-16
- PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-Based Long-Context LLM Inference System. Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, Woongkyu Lee, MinJae Lee, Gyeonggeun Jung, Hyungdeok Lee, Yousub Jung, Jaehan Park, Yosub Song, Byeongsu Yang, Haerang Choi, Guhyun Kim, Jongsoon Won, Woojae Shin, Changhyun Kim, Gyeongcheol Shin, Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim 0001, Jungwook Choi. 1-21
- Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems. Chen Zhang 0001, Qijun Zhang, Zhuoshan Zhou, Yijia Diao, Haibo Wang, Zhe Zhou, Zhipeng Tu, Zhiyao Li, Guangyu Sun, Zhuoran Song, Zhigang Ji, Jingwen Leng, Minyi Guo. 1-15
- LEGO: Supporting LLM-Enhanced Games with One Gaming GPU. Han Zhao 0005, Weihao Cui, Zeshen Zhang, Wenhao Zhang, Jiangtong Li, Quan Chen 0002, Pu Pang, Zijun Li 0001, Zhenhua Han, Yuqing Yang 0001, Minyi Guo. 1-14
- BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache. Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao 0003, Mao Yang 0004. 1-13
- VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG. Junkyum Kim, Divya Mahajan 0001. 1-15
- PinDrop: Breaking the Silence on SDCs in a Large-Scale Fleet. Peter W. Deutsch, Harish Dattatraya Dixit, Gautham Vunnam, Carl Moran, Eleanor Ozer, Sriram Sankar. 1-14
- PhasedStore: Supporting High-Performance Write-Through Cache-Coherence Protocols Under TSO. Burak Ocalan, Chloe Alverti, Shashwat Jaiswal, Antonis Psistakis, David A. Koufaty, Suyash Mahar, Steven Swanson, Josep Torrellas. 1-14
- ReScue: Reliable and Secure CXL Memory. Chihun Song, Austin Antony Cruz, Michael Jaemin Kim, Minbok Wi, Gaohan Ye, Kyungsan Kim, Sangyeol Lee, Jung Ho Ahn, Nam Sung Kim. 1-16
- DRACO: A Hardware-Efficient Robot Rigid Body Dynamics Accelerator with Precision-Aware Quantization Framework. Xingyu Liu, Jiawei Liang, Yipu Zhang 0002, Linfeng Du, Chaofang Ma, Hui Yu, Jiang Xu 0001, Wei Zhang 0012. 1-13
- ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips. Chengran Li, Huizheng Wang, Jiaxin Liu, Jingyao Liu, Zhiheng Yue, Xia Li, Shenfei Jiang, Jinyi Deng, Yang Hu 0001, Shouyi Yin. 1-15
- CLINE: Improving Control Flow Compilation of Quantum Programs with Control Line Encoding. Anbang Wu, Liqiang Lu, Jianwei Yin, Jingwen Leng, Minyi Guo. 1-13
- GyRot: Leveraging Hidden Synergy Between Rotation and Fine-Grained Group Quantization for Low-Bit LLM Inference. Sangjin Kim, Yuseon Chou, Byeongcheol Kim, Jungjun Oh, Hoi-Jun Yoo. 1-15
- DP-HLS: A High-Level Synthesis Framework for Accelerating Dynamic Programming Algorithms in Bioinformatics. Anshu Gupta, Yingqi Cao, Jason Liang, Yatish Turakhia. 1-17
- WATOS: Efficient LLM Training Strategies and Architecture Co-Exploration for Wafer-Scale Chip. Huizheng Wang, Zichuan Wang, Hongbin Wang, Jingxiang Hou, Taiquan Wei, Chao Li 0009, Yang Hu 0001, Shouyi Yin. 1-19
- LiLo: Harnessing the On-Chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration. Hyungyo Kim, Qirong Xia, Jinghan Huang 0001, Nachuan Wang, Younjoo Lee 0001, Jung Ho Ahn, Wajdi K. Feghali, Ren Wang 0001, Nam Sung Kim. 1-17
- AutoGNN: End-to-End Hardware-Driven Graph Preprocessing for Enhanced GNN Performance. Seungkwan Kang, Seungjun Lee, Donghyun Gouk, Miryeong Kwon, Hyunkyu Choi, Junhyeok Jang, Sangwon Lee 0014, Huiwon Choi, Jie Zhang 0048, Wonil Choi, Mahmut Taylan Kandemir, Myoungsoo Jung. 1-17
- REASON: Accelerating Probabilistic Logical Reasoning for Scalable Neuro-Symbolic Intelligence. Zishen Wan, Che-Kai Liu, Jiayi Qian, Hanchen Yang 0001, Arijit Raychowdhury, Tushar Krishna. 1-16
- LowCarb: Carbon-Aware Scheduling of Serverless Functions. Rohan Basu Roy, Devesh Tiwari. 1-16
- Peregrine: Accelerating TFHE Bootstrapping on GPUs via Multi-Level External Product Co-Design. Haoqi He, Zhiwei Wang, Lutan Zhao, Dian Jiao, Dan Meng 0002, Rui Hou 0001. 1-14
- Characterizing Cloud-Native LLM Inference at Bytedance and Exposing Optimization Challenges and Opportunities for Future AI Accelerators. Jingwei Cai, Dehao Kong, Hantao Huang, Zishan Jiang, Zixuan Ma, Qingyu Guo, Zhenxing Zhang, Guiming Shi, Mingyu Gao 0001, Kaisheng Ma, Minghui Yu. 1-19
- IVE: An Accelerator for Single-Server Private Information Retrieval Using Versatile Processing Elements. Sangpyo Kim, Hyesung Ji, Jongmin Kim 0007, Wonseok Choi 0015, Jaiyoung Park, Jung Ho Ahn. 1-15
- Leveraging ASIC AI Chips for Homomorphic Encryption. Jianming Tong, Tianhao Huang, Jingtian Dang, Leo de Castro, Anirudh Itagi, Anupam Golder, Asra Ali, Jeremy Kun, Jevin Jiang, Arvind 0001, G. Edward Suh, Tushar Krishna. 1-18
- GRTX: Efficient Ray Tracing for 3D Gaussian-Based Rendering. Junseo Lee, Sangyun Jeon, Jungi Lee, Junyong Park, Jaewoong Sim. 1-14
- Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs. Jinyu Hu, Huizhang Luo, Hong Jiang 0001, Marc Casas, Kenli Li 0001, Chubo Liu. 1-16
- RoMe: Row Granularity Access Memory System for Large Language Models. Hwayong Nam, Seungmin Baek, Jumin Kim, Michael Jaemin Kim, Jung Ho Ahn. 1-15
- NP-CAM: Efficient and Scalable DNA Classification using a NoC-Partitioned CAM Architecture. Benjamin F. Morris III, Tergel Molom-Ochir, Changchun Zhou 0001, Yiran Chen 0001, Alex K. Jones, Hai Li 0001. 1-14
- Count2Multiply: Reliable In-Memory High-Radix Counting. João Paulo C. de Lima, Benjamin F. Morris III, Asif Ali Khan, Jerónimo Castrillón, Alex K. Jones. 1-15
- Conduit: Programmer-Transparent Near-Data Processing Using Multiple Compute-Capable Resources in Solid State Drives. Rakesh Nadig, Vamanan Arulchelvan, Mayank Kabra, Harshita Gupta, Rahul Bera, Nika Mansouri-Ghiasi, Nanditha Rao, Qingcai Jiang, Andreas Kosmas Kakolyris, Yu Liang 0004, Mohammad Sadrosadati, Onur Mutlu. 1-20
- DSAssassin: Cross-VM Side-Channel Attacks by Exploiting Intel Data Streaming Accelerator. Ben Chen, Kunlin Li, Shuwen Deng, Dongsheng Wang, Yun Chen. 1-15
- SAGe: A Lightweight Algorithm-Architecture Co-Design for Mitigating the Data Preparation Bottleneck in Large-Scale Genome Sequence Analysis. Nika Mansouri-Ghiasi, Talu Güloglu, Harun Mustafa, Can Firtina, Konstantina Koliogeorgi, Konstantinos Kanellopoulos, Haiyu Mao, Rakesh Nadig, Mohammad Sadrosadati, Jisung Park 0001, Onur Mutlu. 1-23