- Setting a Course for Post-Moore Software Performance. Charles E. Leiserson. 1
- Helios: Efficient Distributed Dynamic Graph Sampling for Online GNN Inference. Jie Sun, Zuocheng Shi, Li Su, Wenting Shen, Zeke Wang, Yong Li 0020, Wenyuan Yu, Wei Lin, Fei Wu 0001, Bingsheng He, Jingren Zhou. 2-15
- Accelerating GNNs on GPU Sparse Tensor Cores through N:M Sparsity-Oriented Graph Reordering. Jou-An Chen, Hsin-Hsuan Sung, Ruifeng Zhang, Ang Li 0006, Xipeng Shen. 16-28
- Adaptive Parallel Training for Graph Neural Networks. Kaihao Ma, Renjie Liu, Xiao Yan 0002, Zhenkun Cai, Xiang Song 0003, Minjie Wang, Yichao Li, James Cheng. 29-42
- RT-BarnesHut: Accelerating Barnes-Hut Using Ray-Tracing Hardware. Vani Nagarajan, Rohan Gangaraju, Kirshanthan Sundararajah, Artem Pelenitsyn, Milind Kulkarni 0001. 43-56
- EVeREST: An Effective and Versatile Runtime Energy Saving Tool for GPUs. Anna Yue, Pen-Chung Yew, Sanyam Mehta. 57-69
- TurboFFT: Co-Designed High-Performance and Fault-Tolerant Fast Fourier Transform on GPUs. Shixun Wu, Yujia Zhai, Jinyang Liu 0003, Jiajun Huang, Zizhe Jian, Huangliang Dai, Sheng Di, Franck Cappello, Zizhong Chen. 70-84
- Reciprocating Locks. Dave Dice, Alex Kogan. 85-98
- Aggregating Funnels for Faster Fetch&Add and Queues. Younghun Roh, Yuanhao Wei, Eric Ruppert, Panagiota Fatourou, Siddhartha Jayanti, Julian Shun. 99-114
- Fairer and More Scalable Reader-Writer Locks by Optimizing Queue Management. Takashi Hoshino 0002, Kenjiro Taura. 115-127
- Publish on Ping: A Better Way to Publish Reservations in Memory Reclamation for Concurrent Data Structures. Ajay Singh, Trevor Brown. 128-141
- AC-Cache: A Memory-Efficient Caching System for Small Objects via Exploiting Access Correlations. Fulin Nan, Ronglong Wu, Zhirong Shen, Jiahui Yang, Li Cheng, Zheng Chen, Yiming Zhang, Jiwu Shu. 142-155
- Effectively Virtual Page Prefetching via Spatial-Temporal Patterns for Memory-intensive Cloud Applications. Yun Wang, Liang Chen, Tianmai Deng, Ben Luo, Yibin Shen, Zhixiang Wei, Yixiao Xu, Minglang Huang, Zhengwei Qi. 156-169
- Harnessing Inter-GPU Shared Memory for Seamless MoE Communication-Computation Fusion. Hulin Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou 0002, Dazhao Cheng. 170-182
- FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property. Runxin Zhong, Yuyang Jin, Chen Zhang 0001, Kinman Lei, Shuangyu Li, Jidong Zhai. 183-196
- Mario: Near Zero-cost Activation Checkpointing in Pipeline Parallelism. Weijian Liu, Mingzhen Li, Guangming Tan, Weile Jia. 197-211
- COMPSO: Optimizing Gradient Compression for Distributed Training with Second-Order Optimizers. Baixi Sun, Weijin Liu, J. Gregory Pauloski, Jiannan Tian, Jinda Jia, Daoce Wang, Boyuan Zhang, Mingkai Zheng, Sheng Di, Sian Jin, Zhao Zhang, Xiaodong Yu 0001, Kamil A. Iskra, Pete Beckman, Guangming Tan, Dingwen Tao. 212-224
- WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training. Junfeng Lin, Ziming Liu, Yang You 0001, Jun Wang, Weihao Zhang, Rong Zhao. 225-238
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models. Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, Dan Alistarh. 239-251
- ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training. Yuhang Liang, Xinyi Li, Jie Ren, Ang Li, Bo Fang 0002, Jieyang Chen. 252-266
- SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs. Yongkang Zhang 0003, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yunzhe Li, Zhifeng Jiang, Yang Li, Xiaowen Chu 0001, Huaicheng Li. 267-281
- DORADD: Deterministic Parallel Execution in the Era of Microsecond-Scale Computing. Zhengqing Liu, Musa Unal, Matthew J. Parkinson, Marios Kogias. 282-296
- WaterWise: Co-optimizing Carbon- and Water-Footprint Toward Environmentally Sustainable Cloud Computing. Yankai Jiang 0002, Rohan Basu Roy, Raghavendra Kanakagiri, Devesh Tiwari. 297-311
- FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores. Jinliang Shi, Shigang Li, Youxuan Xu, Rongtian Fu, Xueying Wang, Tong Wu. 312-325
- Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores. Haisha Zhao, San-li, Jiaheng Wang, Chunbao Zhou, Jue Wang, Zhikuang Xin, Shunde Li, Zhiqiang Liang, Zhijie Pan, Fang Liu, Yan Zeng, Yangang Wang, Xuebin Chi. 326-338
- BerryBees: Breadth First Search by Bit-Tensor-Cores. Yuyao Niu, Marc Casas. 339-354
- FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core Units. Haozhi Han, Kun Li, Wei Cui, Donglin Bai, YiWei Zhang, Liang Yuan, Yifeng Chen, Yunquan Zhang, Ting Cao, Mao Yang. 355-368
- PANNS: Enhancing Graph-based Approximate Nearest Neighbor Search through Recency-aware Construction and Parameterized Search. Xizhe Yin, Chao Gao, Zhijia Zhao 0001, Rajiv Gupta 0001. 369-381
- Balanced Allocations over Efficient Queues: A Fast Relaxed FIFO Queue. Kåre von Geijer, Philippas Tsigas, Elias Johansson, Sebastian Hermansson. 382-395
- LibRTS: A Spatial Indexing Library by Ray Tracing. Liang Geng, Rubao Lee, Xiaodong Zhang. 396-411
- Crystality: A Programming Model for Smart Contracts on Parallel EVMs. Hao Wang, Minghao Pan, Jiaping Wang. 412-425
- Popcorn: Accelerating Kernel K-means on GPUs through Sparse Linear Algebra. Julian Bellavita, Thomas Pasquali, Laura Del Rio Martin, Flavio Vella, Giulia Guidi. 426-440
- Swift Unfolding of Communities: GPU-Accelerated Louvain Algorithm. Zhibin Wang, Xi Lin, Xue Li, Pinhuan Wang, Ziheng Meng, Hang Liu 0001, Chen Tian 0001, Sheng Zhong. 441-454
- GLumin: Fast Connectivity Check Based on LUTs For Efficient Graph Pattern Mining. Weichen Cao, Ke Meng, Zhiheng Lin, Guangming Tan. 455-468
- Improving Tridiagonalization Performance on GPU Architectures. Hansheng Wang, Zhekai Duan, Zitian Zhao, Siqi Wu, Saiqi Zheng, Qiao Li, Xu Jiang, Shaoshuai Zhang. 469-480
- Jigsaw: Toward Conflict-free Vectorized Stencil Computation by Tessellating Swizzled Registers. YiWei Zhang, Kun Li, Liang Yuan, Haozhi Han, Yunquan Zhang, Ting Cao, Mao Yang. 481-495
- Semi-StructMG: A Fast and Scalable Semi-Structured Algebraic Multigrid. Yi Zong, Chensong Zhang, Longjiang Mu, Jianchun Wang, Jian Sun, Xiaowen Xu, Xinliang Wang, Peinan Yu, Wei Xue. 496-511
- SBMGT: Scaling Bayesian Multinomial Group Testing. Weicong Chen, Hao Qi, Curtis Tatsuoka, Xiaoyi Lu. 512-523
- An AI-Enhanced 1km-Resolution Seamless Global Weather and Climate Model to Achieve Year-Scale Simulation Speed using 34 Million Cores. Xiaohui Duan, Yi Zhang, Kai Xu, Haohuan Fu, Bin Yang, Yiming Wang, Yilun Han, Siyuan Chen, Zhuangzhuang Zhou, Chenyu Wang, Dongqiang Huang, Huihai An, Xiting Ju, Haopeng Huang, Zhuang Liu, Wei Xue, Weiguo Liu, Bowen Yan, Jianye Hou, Maoxue Yu, Wenguang Chen, Jian Li, Zhao Jing, Hailong Liu, Lixin Wu. 524-538
- Big Atomics and Fast Hash Tables. Daniel Anderson, Guy E. Blelloch, Siddhartha V. Jayanti. 539-541
- Frontier-guided Graph Reordering. Xinmiao Zhang 0004, Cheng Liu 0008, Shengwen Liang, Chenwei Xiong, Yu Zhang, Lei Zhang 0008, Huawei Li 0001, Xiaowei Li 0001. 542-544
- Transactional Data Structures with Orthogonal Metadata. Yaodong Sheng, Ahmed Hassan, Michael F. Spear. 545-547
- Boost Lock-free Queue and Stack with Batching. Ao Li, Wenhai Li, Yuan Chen, Lingfeng Deng. 548-550
- TensorMD: Molecular Dynamics Simulation with Ab Initio Accuracy of 50 Billion Atoms. Yucheng Ouyang, Ying Liu, Honghui Shang, Zhenchuan Chen, Jiahao Shan, Huimin Cui, Xiaobing Feng, Xin Chen, Xingyu Gao 0003, Lifang Wang, Haifeng Song 0003, Xin Chen, Rongfen Lin, Fang Li. 551-553
- FastBWA: Practical and Cost-Efficient Genome Sequence Alignment Pipeline. Zhonghai Zhang, Yewen Li, Ke Meng, Chunming Zhang, Guangming Tan. 554-556
- High-performance Visual Semantics Compression for AI-Driven Science. Boyuan Zhang 0002, Luanzheng Guo, Jiannan Tian, Jinyang Liu, Daoce Wang, Fanjiang Ye, Chengming Zhang 0006, Jan Strube 0001, Nathan R. Tallent, Dingwen Tao. 557-559
- Triangle Counting on Tensor Cores. Yuang Chen, Jeffrey Xu Yu. 560-562
- Magneto: Accelerating Parallel Structures in DNNs via Co-Optimization of Operators. Zhanyuan Di, Leping Wang, ZiYi Ren, En Shao, Jie Zhao, Siyuan Feng, Dingwen Tao, Guangming Tan, Ninghui Sun. 563-565
- A General and Scalable GCN Training Framework on CPU Supercomputers. Chen Zhuang, Peng Chen, Xin Liu, Rio Yokota, Nikoli Dryden, Lingqi Zhang 0001, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib. 566-568
- Minimizing speculation overhead in a parallel recognizer for regular texts. Angelo Borsotti, Luca Breveglieri, Angelo Morzenti, Stefano Crespi-Reghizzi. 569-572