- DAWN: Matrix Operation-Optimized Algorithm for Shortest Paths Problem on Unweighted Graphs. Yelai Feng, Huaixi Wang, Yining Zhu, Xiandong Liu, Hongyi Lu, Qing Liu. 1-13
- Arkade: k-Nearest Neighbor Search With Non-Euclidean Distances using GPU Ray Tracing. Durga Keerthi Mandarapu, Vani Nagarajan, Artem Pelenitsyn, Milind Kulkarni. 14-25
- Shared Virtual Memory: Its Design and Performance Implications for Diverse Applications. Bennett Cooper, Thomas R. W. Scogland, Rong Ge. 26-37
- FuseIM: Fusing Probabilistic Traversals for Influence Maximization on Exascale Systems. Reece Neff, Mostafa Eghbali Zarch, Marco Minutoli, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, Michela Becchi. 38-49
- An Autonomous Parallelization of Transformer Model Inference on Heterogeneous Edge Devices. Juhyeon Lee, Insung Bahk, Hoseung Kim, Sinjin Jeong, Suyeon Lee, Donghyun Min. 50-61
- LCM: LLM-focused Hybrid SPM-cache Architecture with Cache Management for Multi-Core AI Accelerators. Chengtao Lai, Zhongchun Zhou, Akash Poptani, Wei Zhang. 62-73
- HMComp: Extending Near-Memory Capacity using Compression in Hybrid Memory. Qi Shao, Angelos Arelakis, Per Stenström. 74-84
- NUCAlloc: Fine-Grained Block Placement in Hashed Last-Level NUCA Caches. Raveendra Soori, Shreyas Prabhu, Harpreet Singh Chawla, Michael Ferdman. 85-97
- Exploiting Vector Code Semantics for Efficient Data Cache Prefetching. Francesc Martínez Palau, Martí Torrents, Adrià Armejach, Marc Casas. 98-109
- Real-time High-resolution X-Ray Computed Tomography. Du Wu, Peng Chen, Xiao Wang, Isaac Lyngaas, Takaaki Miyajima, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib. 110-123
- RayJoin: Fast and Precise Spatial Join. Liang Geng, Rubao Lee, Xiaodong Zhang. 124-136
- Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs. Xiao Fu, Weiling Yang, Dezun Dong, Xing Su. 137-149
- Differentiating Set Intersections in Maximal Clique Enumeration by Function and Subproblem Size. Hans Vandierendonck. 150-163
- Minimizing Coherence Errors via Dynamic Decoupling. Soheil Khadirsharbiyani, Movahhed Sadeghi, Mostafa Eghbali Zarch, Mahmut Taylan Kandemir. 164-175
- Soft Error Resilience at Near-Zero Cost. Jianping Zeng, Shaoyu Huang, Jiuyang Liu, Changhee Jung. 176-187
- Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study. Vladyslav Oles, Anna Schmedding, George Ostrouchov, Woong Shin, Evgenia Smirni, Christian Engelmann. 188-200
- Input Range Generation for Compiler-Induced Numerical Inconsistencies. Dolores Miao, Ignacio Laguna, Cindy Rubio-González. 201-212
- Accurate Computation of the Logarithm of Modified Bessel Functions on GPUs. Andreas Plesner, Hans Henrik Brandenborg Sørensen, Søren Hauberg. 213-224
- RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs. Benjamin Brock, Aydin Buluç, Katherine A. Yelick. 225-235
- Distributed Ranges: A Model for Distributed Data Structures, Algorithms, and Views. Benjamin Brock, Robert Cohn, Suyash Bakshi, Tuomas Karna, Jeongnim Kim, Mateusz Nowak, Lukasz Slusarczyk, Kacper Stefanski, Timothy G. Mattson. 236-246
- Stencil Computation with Vector Outer Product. Wenxuan Zhao, Liang Yuan, Baicheng Yan, Penghao Ma, Yunquan Zhang, Long Wang, Zhe Wang. 247-258
- Ymir: A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters. Wei Gao, Weiming Zhuang, Minghao Li, Peng Sun, Yonggang Wen, Tianwei Zhang. 259-271
- DeepHYDRA: A Hybrid Deep Learning and DBSCAN-Based Approach to Time-Series Anomaly Detection in Dynamically-Configured Systems. Franz Kevin Stehle, Wainer Vandelli, Felix Zahn, Giuseppe Avolio, Holger Fröning. 272-285
- An Efficient and Scalable Approach to Build Co-occurrence Matrix for DNN's Embedding Layer. Quentin R. Petit, Chong Li, Nahid Emad. 286-297
- Scheduling for Cyber-Physical Systems with Heterogeneous Processing Units under Real-World Constraints. Justin McGowen, Ismet Dagli, Neil T. Dantam, Mehmet E. Belviranli. 298-311
- SLIDEX: A Novel Architecture for Sliding Window Processing. Raúl Taranco, José María Arnau, Antonio González. 312-323
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers. Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, Hao Tang, Zhong Jia Xue, Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman Fang. 324-337
- CLAY: CXL-based Scalable NDP Architecture Accelerating Embedding Layers. Sungmin Yun, Hwayong Nam, Kwanhee Kyung, Jaehyun Park, Byeongho Kim, Yongsuk Kwon, Eojin Lee, Jung Ho Ahn. 338-351
- NEOCNN: NTT-Enabled Optical Convolution Neural Network Accelerator. Xianbin Li, Yinyi Liu, Fan Jiang, Chengeng Li, Yuxiang Fu, Wei Zhang, Jiang Xu. 352-362
- sys-sage: A Unified Representation of Dynamic Topologies & Attributes on HPC Systems. Stepan Vanecek, Martin Schulz. 363-375
- RTT-UAF: Reuse Time Tracking for Use-After-Free Detection. Yubo Du, Yanan Guo, Youtao Zhang, Jun Yang. 376-387
- Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures. Shilpa Babalad, Shirish Shevade, Matthew Jacob Thazhuthaveetil, R. Govindarajan. 388-399
- Matrix-free SBP-SAT finite difference methods and the multigrid preconditioner on GPUs. Alexandre Chen, Brittany A. Erickson, Jeremy E. Kozdon, Jee Choi. 400-412
- SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC Applications. Pouya Haghi, Cheng Tan, Anqi Guo, Chunshu Wu, Dongfang Liu, Ang Li, Anthony Skjellum, Tong Geng, Martin C. Herbordt. 413-425
- CommBench: Micro-Benchmarking Hierarchical Networks with Multi-GPU, Multi-NIC Nodes. Mert Hidayetoglu, Simon Garcia De Gonzalo, Elliott Slaughter, Yu Li, Christopher Zimmer, Tekin Bicer, Bin Ren, William Gropp, Wen-mei Hwu, Alex Aiken. 426-436
- gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters. Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur. 437-448
- Enhanced UGAL Routing Schemes for Dragonfly Networks. Ram Sharan Chaulagain, Xin Yuan. 449-459
- A Coordinated Strategy for GNN Combining Computational Graph and Operator Optimizations. Mingyi Li, Junmin Xiao, Kewei Zhang, Zhiheng Lin, Chaoyang Shui, Ke Meng, Zehua Wang, Yunfei Pang, Guangming Tan. 460-472
- AutoSched: An Adaptive Self-configured Framework for Scheduling Deep Learning Training Workloads. Wei Gao, Xu Zhang, Shan Huang, Shangwei Guo, Peng Sun, Yonggang Wen, Tianwei Zhang. 473-484
- Sylva: Sparse Embedded Adapters via Hierarchical Approximate Second-Order Information. Baorun Mu, Christina Giannoula, Shang Wang, Gennady Pekhimenko. 485-497
- Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN Deployment. Hanxian Huang, Xin Chen, Jishen Zhao. 498-510
- FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks. Keren Zhou, Karthik Ganapathi Subramanian, Po-Hsun Lin, Matthias Fey, Binqian Yin, Jiajia Li. 511-524
- Snoopie: A Multi-GPU Communication Profiler and Visualizer. Mohammad Kefah Taha Issa, Muhammad Aditya Sasongko, Ilyas Turimbetov, Javid Baydamirli, Dogan Sagbili, Didem Unat. 525-536
- RadiK: Scalable and Optimized GPU-Parallel Radix Top-K Selection. Yifei Li, Bole Zhou, Jiejing Zhang, Xuechao Wei, Yinghan Li, Yingda Chen. 537-548
- Accelerated Auto-Tuning of GPU Kernels for Tensor Computations. Chendi Li, Yufan Xu, Sina Mahdipour Saravani, Ponnuswamy Sadayappan. 549-561