- SYprox: Combining Host and Device Perforation with Mixed Precision Approximation on Heterogeneous Architectures. Lorenzo Carpentieri, Biagio Cosenza. 1-12 [doi]
- BitWeaver: Read-Time Truncation in Memory. Garrett Gagnon, Srikanth Malla, Yangwook Kang, Liu Liu 0017. 13-25 [doi]
- NeurLZ: An Online Neural Learning-based Method to Enhance Scientific Lossy Compression. Wenqi Jia 0003, Zhewen Hu, Youyuan Liu, Boyuan Zhang 0002, Jinzhen Wang, Jinyang Liu 0003, Wei Niu 0002, Stavros Kalafatis, JunZhou Huang, Sian Jin, Daoce Wang, Jiannan Tian, Miao Yin. 26-42 [doi]
- ghZCCL: Advancing GPU-aware Collective Communications with Homomorphic Compression. Jiajun Huang 0001, Sheng Di, Yafan Huang, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur. 43-56 [doi]
- Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers. Chen Zhuang, Lingqi Zhang 0001, Du Wu, Peng Chen 0035, Jiajun Huang 0001, Xin Liu 0020, Rio Yokota, Nikoli Dryden, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib. 57-72 [doi]
- CoLa: Towards Communication-efficient Distributed Sparse Matrix-Matrix Multiplication on GPUs. Lixing Zhang, Yingxia Shao, Shigang Li 0002. 73-87 [doi]
- Cherry: Breaking the GPU Memory Wall for Large-Scale GNN Training via Micro-Batching. Yan Wang, Qinghua Guo, Haoran Kong, Kai Sheng, Zhen Xie, Hao Chen, Weile Jia, Dingwen Tao, Xin He. 88-103 [doi]
- Fused3S: Fast Sparse Attention on Tensor Cores. Zitong Li, Aparna Chandramowlishwaran. 104-118 [doi]
- StructILU: Dependency-Preserving Incomplete LU with Hierarchical Parallelism for Structured Grid PDEs on GPUs. Hao Luo 0015, Qianchao Zhu, Xiaochen Hao, Chunxi Lei, Chengdi Ma, Chenchen Zhang, Yun Liang 0001, Chao Yang 0002. 119-134 [doi]
- IA-Chol: Input-Aware Cholesky Decomposition on CPU and GPU. Jixiao Deng, Qinglin Wang, Lin Chen 0028, Tun Li, Bo Yang 0023, Xinhai Chen, Jie Liu 0002. 135-148 [doi]
- CB-SpMV: A Data Aggregating and Balance Algorithm for Cache-Friendly Block-Based SpMV on GPUs. Xing Cong, FuKai Sun, Yifan Chen, Chenhao Xie 0001, Yi Liu 0013, Depei Qian. 149-160 [doi]
- HR-SpMM: Adaptive Row Partitioning and Hybrid Kernel Design for Sparse Matrix Multiplication. Qi Wang, Yaobin Wang, Yi Luo, Rong Luo, Pingping Tang. 161-172 [doi]
- G^3SA: A GPU-Accelerated Gold Standard Genomics Library for End-to-End Sequence Alignment. Yeejoo Han, Sunwoo Kim, Seongyeon Park, Jinho Lee. 173-188 [doi]
- Graph Convolutional Network Acceleration Using Adiabatic Superconductor Josephson Devices. Zhengang Li, Hongwu Peng, Xuan Shen, Masoud Zabihi, Xi Xie, Geng Yuan, Yanzhi Wang, Olivia Chen, Caiwen Ding. 189-204 [doi]
- TMModel: Modeling Texture Memory and Mobile GPU Performance to Accelerate DNN Computations. Jiexiong Guan, Zhenqing Hu, Christos D. Antonopoulos, Nikolaos Bellas, Spyros Lalis, Evgenia Smirni, Gang Zhou, Gagan Agrawal, Bin Ren. 205-220 [doi]
- DR-CircuitGNN: Training Acceleration of Heterogeneous Circuit Graph Neural Network on GPUs. Yuebo Luo, Shiyang Li, Junran Tao, Kiran Gautam Thorat, Xi Xie, Hongwu Peng, Nuo Xu, Caiwen Ding, Shaoyi Huang. 221-235 [doi]
- CLOVER: A GPU-native, Spatio-graph-based Approach to Exact kNN. Victor Kamel, Hanxueyu Yan, Sean Chester. 236-249 [doi]
- Efficient Locality-aware Instruction Stream Scheduling for Stencil Computation on ARM Processors. Shanghao Liu, Hailong Yang, Xin You, Zhongzhi Luan, Yi Liu, Depei Qian. 250-264 [doi]
- Accelerating Complex Stencil Computations with Adaptive Fusion Strategy. Siqi Wang, Hailong Yang, Pengbo Wang, Shaokang Du, Yufan Xu, Qingxiao Sun, Xiaoyan Liu, Xuezhu Wang, Xuning Liang, Zhongzhi Luan, Yi Liu, Depei Qian. 265-278 [doi]
- A3FR: Agile 3D Gaussian Splatting with Incremental Gaze Tracked Foveated Rendering in Virtual Reality. Shuo Xin, Haiyu Wang, Sai Qian Zhang. 279-292 [doi]
- EPIClear: Exploiting Domain-Specific Features for Epistasis Detection Acceleration on Tensor Cores. Ricardo Nobre, Miguel Graça, Leonel Sousa, Aleksandar Ilic. 293-307 [doi]
- Statistical Treatment of Variable MPI Latencies and MPI-Communication Hiding for Matrix-Free Finite Element Operators. Max Heldman, Johann Rudi, Julie Bessac. 308-323 [doi]
- Fast and Fair Training for Deep Learning in Heterogeneous GPU Clusters. Zizhao Mo, Huanle Xu, Wing Cheong Lau. 324-338 [doi]
- SortingHat: System Topology-aware Scheduling of Deep Neural Network Models on Multi-GPU Systems. Seok Namkoong, Taehyeong Park 0001, Kiung Jung, Jinyoung Kim, Yongjun Park 0001. 339-354 [doi]
- CTCCL: Cost-Efficient Joint Device-Network Load Balancing for LLM Training in RoCE-based Intelligent Computing Network. Zhuotong Li, Liang Xu, Ziqi Huang, Shuyun Qian, Hongwei Bu, Ming Yang, Mengyun Luan, Weiguo Chen, Xu Wen. 355-367 [doi]
- Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models. Runsheng Benson Guo, Utkarsh Anand, Arthur Chen, Khuzaima Daudjee. 368-383 [doi]
- A Device-Side Execution Model for Multi-GPU Task Graphs. Ilyas Turimbetov, Mohamed Wahib, Didem Unat. 384-396 [doi]
- CRAMG: A Communication-Reduced Algebraic Multigrid Method. Fan Yuan, Xiaojian Yang, Yunqing Huang, Dezun Dong, Chuanfu Xu, Jie Liu 0002, Xiaoqiang Yue, Shengguo Li, Hongxia Wang. 397-411 [doi]
- An Efficient 2D Fusion Method for High-Performance Two-Stage Eigensolvers on Modern Heterogeneous Architectures. Yongxiao Zhou, Yi Zong, Yuyang Jin 0001, Heng Li, Wei Xue 0003. 412-425 [doi]
- SnuSOLVER: Optimizing Sparse Direct Solvers for Heterogeneous Systems. Chaewon Kim, Jaehwan Lee, Jinpyo Kim, Dohyun Kim, Kyusu Ahn, Hyung Uk Cho, Seungin Baek, Jaejin Lee. 426-441 [doi]
- MAGNUS: Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs. Jordi Wolfson-Pou, Jan Laukemann, Fabrizio Petrini. 442-457 [doi]
- PIM-CARE: A Compiler-Assisted Dynamic Resource Allocation Framework for Real-world DRAM PIM. Inyong Hwang, Donghyeon Kim 0001, Seokwon Kang, Taehyeong Park 0001, Taehoon Kim 0001, Jiwon Seo 0002, Hanjun Kim 0001, Youngsok Kim, Yongjun Park 0001. 458-472 [doi]
- Proteus: Achieving High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic. Geraldo Francisco de Oliveira Junior, Mayank Kabra, Yuxin Guo, Kangqi Chen, Abdullah Giray Yaglikçi, Melina Soysal, Mohammad Sadrosadati, Joaquín Olivares Bueno, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu. 473-494 [doi]
- SparsePIM: An Efficient HBM-Based PIM Architecture for Sparse Matrix-Vector Multiplications. Taewoon Kang, Geonwoo Choi, Taeweon Suh, Gunjae Koo. 495-512 [doi]
- MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem. Melina Soysal, Konstantina Koliogeorgi, Can Firtina, Nika Mansouri-Ghiasi, Rakesh Nadig, Haiyu Mao, Geraldo Francisco de Oliveira Junior, Yu Liang 0004, Klea Zambaku, Mohammad Sadrosadati, Onur Mutlu. 513-534 [doi]
- DALdex: A DPU-Accelerated Persistent Learned Index via Incremental Learning. Aoyang Tong, Yu Hua 0001, Menglei Chen. 535-549 [doi]
- From Islands to Archipelago: Towards Collaborative and Adaptive Burst Buffer for HPC Systems. Mingtian Shao, Ruibo Wang, Wenzhe Zhang, Kai Lu 0001, Yiqin Dai, Huijun Wu 0001. 550-563 [doi]
- PIE: Enabling Fast and Scalable Incremental Evolving Graph Analytics on Persistent Memory. Yunmo Zhang, Jiacheng Huang 0002, Xizhe Yin, Junqiao Qiu, Hong Xu 0001, Chun Jason Xue. 564-579 [doi]
- DEDUPKV: A Space-Efficient and High-Performance Key-Value Store via Fine-Grained Deduplication. Safdar Jamil, Awais Khan 0002, Xubin He, Youngjae Kim 0001. 580-595 [doi]
- ConTraPh: Contrastive Learning for Parallelization and Performance Optimization. Quazi Ishtiaque Mahmud, Ali TehraniJamsaz, Nesreen K. Ahmed, Theodore L. Willke, Ali Jannesari. 596-610 [doi]
- UJOpt: Heuristic Approach for Applying Unroll-and-Jam Optimization and Loop Order Selection. Shilpa Babalad, Shirish K. Shevade, Matthew Jacob Thazhuthaveetil, R. Govindarajan. 611-624 [doi]
- Loop Fusion in Matrix Multiplications with Sparse Dependence. Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi. 625-639 [doi]
- ConCo: Optimizing Compilation of Concurrent Tensor Programs on Shared GPU. Jiamin Lu, Jingwei Sun 0001, Yunlong Xu, Peng Sun, Guangzhong Sun. 640-653 [doi]
- Pushing the Limits of GPU Lossy Compression: A Hierarchical Delta Approach. Boyuan Zhang 0002, Yafan Huang, Sheng Di, Fengguang Song, Guanpeng Li, Franck Cappello. 654-669 [doi]
- Parallel Contraction Hierarchies Can Be Efficient and Scalable. Zijin Wan, Xiaojun Dong 0001, Letong Wang, Enzuo Zhu, Yan Gu 0001, Yihan Sun 0001. 670-688 [doi]
- BMQSim: Overcoming Memory Constraints in Quantum Circuit Simulation with a High-Fidelity Compression Framework. Boyuan Zhang 0002, Bo Fang 0002, Fanjiang Ye, Luanzheng Guo, Fengguang Song, Nathan R. Tallent, Dingwen Tao. 689-704 [doi]
- DIV: An Index & Value compression method for SpMV on large matrices. Dimitrios Galanopoulos, Panagiotis Mpakos, Petros Anastasiadis, Nectarios Koziris, Georgios I. Goumas. 705-717 [doi]
- DIMPLES: Distributed Influence Maximization for Pandemic pLanning on Exascale Systems. Marco Minutoli, Reece Neff, Naw Safrin Sattar, Hao Lu 0001, John Feo, Henning S. Mortveit, Anil Vullikanti, Dawen Xie, Mandy L. Wilson, Gregor von Laszewski, Parantapa Bhattacharya, S. M. Ferdous, Ananth Kalyanaraman, Michela Becchi, Madhav V. Marathe, Mahantesh Halappanavar. 718-733 [doi]
- Light-FP: Analyze Floating-Point Error in a Highly Condensed Approach. Jiazhi Mi, Li Chen, Haoyu Wang, Ruixiang Gao, Hongze Zhang, Ronghong Shen, Kai Lin, You Fu, Huimin Cui. 734-748 [doi]
- WisIO: Automated I/O Bottleneck Detection with Multi-Perspective Views for HPC Workflows. Izzet Yildirim, Hariharan Devarajan, Anthony Kougkas, Xian-He Sun, Kathryn M. Mohror. 749-763 [doi]
- Efficient Server Consolidation through a balanced mix of Transformer-based and Conventional Applications. Pablo Abad, Pablo Prieto, Valentin Puente, José-Ángel Gregorio. 764-775 [doi]
- Taking GPU Programming Models to Task for Performance Portability. Joshua Hoke Davis, Pranav Sivaraman, Joy Kitson, Konstantinos Parasyris, Harshitha Menon, Isaac Minn, Giorgis Georgakoudis, Abhinav Bhatele. 776-791 [doi]
- Analyzing the Performance of Applications at Exascale. Dragana Grbic, John M. Mellor-Crummey. 792-806 [doi]
- Understanding the Idiosyncrasies of Emerging BlueField DPUs. Arjun Kashyap, Yuke Li, Darren Ng, Xiaoyi Lu. 807-821 [doi]
- Multi-Node Multi-GPU Datalog. Ahmedur Rahman Shovon, Yihao Sun, Kristopher K. Micinski, Thomas Gilray, Sidharth Kumar. 822-836 [doi]
- SmartNIC-GPU-CPU Heterogeneous System for Large Machine Learning Model with Software-Hardware Codesign. Anqi Guo, Yuchen Hao, Xiteng Yao, Shining Yang, Jianyu Huang, Tony Tong Geng, Martin C. Herbordt. 837-852 [doi]
- D-Rex: Heterogeneity-Aware Reliability Framework and Adaptive Algorithms for Distributed Storage. Maxime Gonthier, Dante D. Sánchez-Gallegos, Haochen Pan, Bogdan Nicolae, Sicheng Zhou, Hai Duc Nguyen 0005, Valérie Hayot-Sasson, J. Gregory Pauloski, Jesús Carretero 0001, Kyle Chard, Ian T. Foster. 853-867 [doi]
- ORION: Optimizing OLAP Query Execution with Proactive Caching and Separate Operators. Zhixin Tong, Jiuchen Shi, Quan Chen 0002, Pu Pang, Shixuan Sun, Jie Meng, Jiang Liu, En Shao, Minyi Guo. 868-883 [doi]
- ORA: Job Runtime Prediction for High-Performance Computing Platforms Using the Online Retrieval-Augmented Language Model. Hongyi Liu, Yinping Ma, Xiaosong Huang, Lingzhe Zhang, Tong Jia, Ying Li 0012. 884-894 [doi]
- Generating Microservice Graphs with Production Characteristics for Efficient Resource Scaling. Fanrong Du, Jiuchen Shi, Quan Chen 0002, Pu Pang, Li Li 0012, Minyi Guo. 895-910 [doi]
- HARNESS: Holistic Resource Management for Diversely Scaled Edge Cloud Systems. Ismet Dagli, Justin Davis, Mehmet Esat Belviranli. 911-927 [doi]
- Leonid: Exploring Automated Kernel Fusion in Performance-Portable Programming Models for Scientific Computation. Chenchen Zhang, Hao Luo, Chao Yang. 928-942 [doi]
- DeCOS: Data-Efficient Reinforcement Learning for Compiler Optimization Selection Ignited by LLM. Tianming Cui, Pen-Chung Yew, Stephen McCamant, Antonia Zhai. 943-958 [doi]
- Pearl: Automatic Code Optimization Using Deep Reinforcement Learning. Djamel Rassem Lamouri, Iheb Nassim Aouadj, Smail Kourta, Riyadh Baghdadi. 959-974 [doi]
- CIExplorer: Microarchitecture-Aware Exploration for Tightly Integrated Custom Instruction. Xiaoyu Hao, Sen Zhang, Liang Qiao, Qingcai Jiang, Jun Shi 0007, Junshi Chen, Hong An, Xulong Tang, Hao Shu, Honghui Yuan. 975-990 [doi]
- EVeREST-C: An Effective and Versatile Runtime Energy Saving Tool for CPUs. Anna Yue, Pen-Chung Yew, Sanyam Mehta. 991-1004 [doi]
- EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC. Siyuan Shen, Mikhail Khalilov, Lukas Gianinazzi, Timo Schneider, Marcin Chrapek, Jai Dayal, Manisha Gajbe, Robert W. Wisniewski, Torsten Hoefler. 1005-1019 [doi]
- ROCKET: An RNS-based Photonic Accelerator for High-Precision and Energy-Efficient DNN Training. Hao Zhang 0058, Haibo Zhang 0001, Chengpeng Xia, Zhiyi Huang 0001, Yawen Chen 0001, Amanda Barnard. 1020-1033 [doi]
- A Global Perspective on Supercomputer Power Provisioning: Case Studies from United States and Europe. Tapasya Patki, Barry Rountree, Torsten Wilde, Andrea Bartolini, Stephanie Brink, Esa Heiskanen, Sachin Idgunji, Matthias Maiterth, James H. Rogers, Ermal Rrapaj, Ralf Schneider, Woong Shin, Kathleen Shoga, Christian Simmendinger, Nicholas J. Wright, Zhengji Zhao. 1034-1051 [doi]
- PortFC: Designing High-performance Deadlock-free BCube Networks. Peirui Cao, Rui Ning, Hongwei Yang, Zhaochen Zhang, Chang Liu 0001, Rui Li 0020, Yongqi Yang, Yunzhuo Liu, Chengyuan Huang, Tao Sun 0010, Xiaodong Duan, Guihai Chen, Chen Tian 0001. 1052-1063 [doi]
- Auto-Healer: Self-Healing Hardware for Perception Stage Faults in Autonomous Driving Systems. Ali Suvizi, Guru Venkataramani. 1064-1078 [doi]
- OpaQue: Program Output Obfuscation for Quantum Software Circuits in Quantum Clouds. Tirthak Patel, Aditya Ranjan, Daniel Silver, Harshitta Gandhi, William Cutler, Devesh Tiwari. 1079-1091 [doi]
- JBSA: A Bit-Serial Accelerator for Deep Neural Networks Using Superconducting SFQ Logic. Yang Su, Sheng Li, Huilong Jiang, Haofei Yin, Rongliang Fu, Junying Huang, Xiaochun Ye, Zhimin Zhang 0004, Jie Ren, Xiaoping Gao, Tsung-Yi Ho, Dongrui Fan. 1092-1105 [doi]
- YH-Light: Yielding Hierarchy-aware Partitioner for Large-scale Graph Processing. Xinbiao Gan, Tiejun Li, Chunye Gong, Jie Liu 0002, Kai Lu 0001. 1106-1116 [doi]
- MG-αGCD: Accelerating Graph Community Detection on Multi-GPU Platforms. Shuai Yang, Changyou Zhang. 1117-1130 [doi]
- GraCFL: A Holistically Designed Vertex-Centric Graph System for CFL Reachability. Sakib Fuad, Amir Hossein Nodehi Sabet, Umar Farooq, Zhijia Zhao. 1131-1145 [doi]
- OPMOS: Ordered Parallel Algorithm for Multi-Objective Shortest-Paths. Leo Gold, Adam Bienkowski, David Sidoti, Krishna R. Pattipati, Omer Khan. 1146-1161 [doi]
- A Multi-GPU Algorithm for Computing Maximal Independent Sets in Large Graphs. Anju Mongandampulath Akathoott, Benila Virgin Jerald Xavier, Martin Burtscher. 1162-1175 [doi]
- A Cost-Effective Dueling Framework for Set-Associative Cache Indexing. Kevin Weston, Vahid Janfaza, Avery Johnson, Abdullah Muzahid. 1176-1189 [doi]
- DREAM: Device-Driven Efficient Access to Virtual Memory. Nurlan Nazaraliyev, Elaheh Sadredini, Nael B. Abu-Ghazaleh. 1190-1205 [doi]
- Page Migration for Hardware Memory Disaggregation Across a Network. Archit Patke, Christian Pinto, Saurabh Jha, Haoran Qiu, Zbigniew Kalbarczyk, Ravishankar K. Iyer. 1206-1218 [doi]
- MEMPLEX: A Memory System with Replication and Migration of Data for Multi-Chiplet NUMA Architectures. Neethu Bal Mallya, Bhavishya Goel, Ioannis Sourdis. 1219-1233 [doi]
- Persistent Memory Objects on the Cheap. Derrick Greenspan, Naveed Ul Mustafa, Jongouk Choi, Mark Heinrich, Yan Solihin. 1234-1249 [doi]