- Binary Compatible Critical Section Delegation. Junyao Zhang 0008, Zhuo Wang, Zhe Zhou 0001. 1-12
- Hapax Locks: Scalable Value-Based Mutual Exclusion. Dave Dice, Alex Kogan. 13-25
- Fixing Non-blocking Data Structures for Better Compatibility with Memory Reclamation Schemes. Md Amit Hasan Arovi, Ruslan Nikolaev 0001. 26-39
- Multiverse: Transactional Memory with Dynamic Multiversioning. Gaetano Coccimiglio, Trevor Brown 0001, Srivatsan Ravi. 40-52
- Rethinking Thread Scheduling under Oversubscription: A User-Space Framework for Coordinating Multi-runtime and Multi-process Workloads. Aleix Roca, Vicenç Beltran 0001. 53-67
- Waste-Efficient Work Stealing. Kyle Singer, Kunal Agrawal 0001, Tao B. Schardl. 68-80
- DiggerBees: Depth First Search Leveraging Hierarchical Block-Level Stealing on GPUs. Yuyao Niu, Yuechen Lu, Weifeng Liu 0002, Marc Casas. 81-94
- PANA: A Fine-Grained Runtime-Adaptive Load Balancing for Parallel SpMV on Multicore CPUs. Haodong Bian, Youhui Zhang, Xiang Fei, Jianqiang Huang 0002, Xiaoying Wang 0002. 95-108
- UFO Trees: Practical and Provably-Efficient Parallel Batch-Dynamic Trees. Quinten De Man, Atharva Sharma, Kishen N. Gowda, Laxman Dhulipala. 109-122
- Sharded Elimination and Combining for Highly-Efficient Concurrent Stacks. Ajay Singh 0002, Nikos Metaxakis, Panagiota Fatourou. 123-135
- Concurrent Balanced Augmented Trees. Evan Wrench, Ajay Singh 0002, Younghun Roh, Panagiota Fatourou, Siddhartha Jayanti, Eric Ruppert, Yuanhao Wei. 136-149
- Parallel Dynamic Spatial Indexes. Ziyang Men, Bo Huang, Yan Gu 0001, Yihan Sun 0001. 150-163
- PRISM: An Efficient GPU-Based Lossy Compression Framework for Progressive Data Retrieval with Multi-Level Interpolation. Bing Lu 0001, Zedong Liu, Hairui Zhao, Dejun Luo, Wenjing Huang 0002, Yida Gu, Jinyang Liu 0003, Guangming Tan, Dingwen Tao. 164-176
- Dynamic Detection of Inefficient Data Mapping Patterns in Heterogeneous OpenMP Applications. Luke Marzen, Junhyung Shim, Ali Jannesari. 177-189
- Root-Down Exposure for Maximal Clique Enumeration on GPUs. Zhe Pan, Peng Qu, Youhui Zhang. 190-203
- ROME: Maximizing GPU Efficiency for All-Pairs Shortest Path via Taming Fine-Grained Irregularities. Weile Luo, Yuhan Chen, Xiangrui Yu, Qiang Wang 0022, Ruibo Fan, Hongyuan Liu 0002, Xiaowen Chu 0001. 204-217
- SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping. Qiqi Gu 0002, Chenpeng Wu, Heng Shi 0005, Jianguo Yao 0002. 218-231
- ASM-SpMM: Unleashing the Potential of Arm SME for Sparse Matrix Multiplication Acceleration. Jiazhi Jiang, Xijia Yao, Jiayu Chen, Jinhui Wei, Dan Huang 0001, Yutong Lu. 232-244
- Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores. Kaige Zhang, Hailong Yang 0002, Xin You, Tianyu Feng, Yufan Xu 0001, Zhongzhi Luan, Yi Liu 0013, Depei Qian. 245-258
- VDHA: Vector-Driven Hash Aggregation for Sparse Matrix-Sparse Vector Multiplication on GPUs. Yuchen Li, Zhe Pan, Peng Qu, Youhui Zhang. 259-272
- RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization. Qihao Zhang, Mingliang Tang, Mingshu Zhai, Kinman Lei, Jidong Zhai. 273-287
- High-Throughput Non-uniformly Quantized 3-bit LLM Inference. Yuang Chen, Wenqi Zeng, Jeffrey Xu Yu. 288-300
- JanusQuant: Accurate and Efficient 2-bit KV Cache Quantization for Long-Context Inference. Chengyu Sun, Yaqi Xia, Hulin Wang, Donglin Yang, Xiaobo Zhou 0002, Dazhao Cheng. 301-314
- HierCut: Enabling 16-bit Format Mixed Precision for Molecular Dynamics through Hierarchical Cutoff. Zeyu Song, Lin Gan, Xiaohui Duan, Zhengrui Li, Jiayu Fu, Yinuo Wang, Guangzhao Li, Guangwen Yang. 315-328
- Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds. Xiaokang Hu, Yuchao Cao, Naixuan Guan, Yifan Wu 0037, Xishi Qiu, Shengdong Dai, Ben Luo, Sanchuan Cheng, Fudong Qiu, Yibin Shen, Jiesheng Wu. 329-341
- zBuffer: Zero-Copy and Metadata-Free Serialization for Fast RPC with Scatter-Gather Reflection. Xiangyu Liu, Huiba Li, Shun Gai, Youmin Chen, Yiming Zhang 0003. 342-354
- Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters. Ruobing Han, Hyesoon Kim. 355-368
- Trojan Horse: Aggregate-and-Batch for Scaling Up Sparse Direct Solvers on GPU Clusters. Yida Li 0005, Siwei Zhang, Yiduo Niu, Yang Du 0015, Qingxiao Sun, Zhou Jin 0001, Weifeng Liu 0002. 369-383
- COCCL: A Collective Communication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training. Xingchen Liu, Haoran Kong, Hairui Zhao, Shengkai Lyu, Zheng Wei, Man Liu, Xingjian Tian, Liyang Zhao, Zhuohan Chen, Fakang Wang, Zizhong Chen, Zhan Wang 0003, Guangming Tan, Dingwen Tao. 384-397
- Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-Tolerant Distributed Training. Xuanyu Wang, Fangcheng Fu, Haoyang Li 0017, Hao Ge, Sheng Lin, Jiawen Niu, Bin Cui 0001. 398-412
- HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism. Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You 0001. 413-424
- CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training. Yida Gu, Fakang Wang, Jianhao Fu, Zhenhang Sun, Qianyu Zhang, Hairui Zhao, Xingchen Liu, Yang Tian, Wenjing Huang 0002, Zedong Liu, Yifan Chen, Jinwu Yang, Yueyuan Zhou, Qian Zhao 0021, Haoxu Li, Tao Wang, Feng Yu, Zhan Wang 0003, Guangming Tan, Dingwen Tao. 425-438
- Pipelonk: Accelerating End-to-End Zero-Knowledge Proof Generation on GPUs for PLONK-Based Protocols. Zhiyuan Zhang 0008, Yanxin Cai, Wenhao Yin, Xueyu Wu, Yi Wang 0003, Lei Ju 0001, Zhuoran Ji. 439-451
- ParDiff: Efficiently Parallelizing Reverse-Mode Automatic Differentiation with Direct Indexing. Shuhong Huang, Shizhi Tang, Yuan Wen, Huanqi Cao, Ruibai Tang, Yidong Chen 0003, Jiping Yu, Yang Li, Chao Jiang, Limin Xiao, Jidong Zhai. 452-465
- Faster and Cheaper: Pushing the Sequence Alignment Throughput with Commercial CPUs. Zhonghai Zhang, Yewen Li, Ke Meng, Chunming Zhang, Guangming Tan. 466-479
- PIM-zd-tree: A Fast Space-Partitioning Index Leveraging Processing-in-Memory. Yiwei Zhao, Hongbo Kang, Ziyang Men, Yan Gu 0001, Guy E. Blelloch, Laxman Dhulipala, Charles McGuffey, Phillip B. Gibbons. 480-495
- BEEMS: Boosting Machine Vision Efficiency via Computation Graph-Based Memory Smoothing. Hanjing Shen, Fangxin Liu, Jian Liu, Li Jiang 0002, Haibing Guan. 496-508
- Laser: Unlocking Layer-Level Scheduling for Efficient Multi-SLO LLM Serving. Jianxiong Liao, Quanxing Dong, Yunkai Liang, Zhi Zhou 0006, Xu Chen 0004. 509-521
- MixFusion: A Patch-Level Parallel Serving System for Mixed-Resolution Diffusion Models. Desen Sun, Zepeng Zhao, Yuke Wang. 522-536
- ChituDiffusion: A Data-Characteristic-Aware Serving System for Diffusion Models. Chengzhang Wu, Liyan Zheng 0001, Haojie Wang 0004, Kezhao Huang, Zixuan Ma, Dong Dong 0001, Jidong Zhai. 537-550
- ElasGNN: An Elastic Training Framework for Distributed GNN Training. Siqi Wang, Hailong Yang 0002, Pengbo Wang, Hongliang Cao, Yufan Xu 0001, Xuezhu Wang, Zhongzhi Luan, Yi Liu 0013, Depei Qian. 551-563
- APERTURE: Algorithm-System Co-optimization for Temporal Graph Network Inference. Yiqing Wang, Hailong Yang 0002, Enze Yu, Qingxiao Sun, Kejie Ma, Kaige Zhang, Chenhao Xie 0001, Depei Qian. 564-576
- TAC: Cache-Based System for Accelerating Billion-Scale GNN Training on Multi-GPU Platform. Zhiqiang Liang, Hongyu Gao, Jue Wang 0013, Fang Liu, Xingguo Shi, Junyu Gu, Peng Di, Sian Li, Lei Tang, Chunbao Zhou, Lian Zhao, Yangang Wang 0002, Xuebin Chi. 577-590
- DTMiner: A Data-Centric System for Efficient Temporal Motif Mining. Yinbo Hou, Hao Qi 0004, Ligang He, Jin Zhao 0003, Yu Zhang 0027, Hui Yu, Longlong Lin, Lin Gu 0002, Wenbin Jiang 0001, Xiaofei Liao, Hai Jin 0001. 591-604
- FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism. Jianxing Xu, Yuanbo Wen 0001, Jun Bi, Ruibai Xu, Guanglin Xu, Rui Zhang 0040, Wei Li 0008, Ling Li 0001, Tianshi Chen 0002, Qi Guo 0001, Yunji Chen. 605-619
- Accelerating Sparse Transformer Inference on GPU. Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang 0002, Qianwen Cao, Qingxiao Sun. 620-634
- MetaAttention: A Unified and Performant Attention Framework across Hardware Backends. Feiyang Chen, Yu Cheng, Lei Wang 0222, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang 0024, Jilong Xue, Zhi Yang 0001, Mao Yang 0004, Xingda Wei, Haibo Chen 0001. 635-647
- Towards Singular Value Decomposition for Rank-Deficient Matrices: An Efficient and Accurate Algorithm on GPU Architectures. Lu Shi, Weiwei Xu, Shaoshuai Zhang. 648-659
- A Diagonal Block Memory-Aware Polynomial Preconditioner for Linear and Eigenvalue Solvers. Xiaojian Yang, Yuhui Ni, Fan Yuan, Shengguo Li, Dezun Dong, Chuanfu Xu, Haipeng Jia, Jie Liu 0002. 660-673
- A Distributed Matrix-Block-Vector Multiplication in Presence of System Performance Variability. Yuchen Ma 0001, Bin Ren 0002, Andreas Stathopoulos. 674-686
- Characterizing Matrix Multiplication Units across General Parallel Patterns in Scientific Computing. Yuechen Lu, Hongwei Zeng, Marc Casas, Weifeng Liu 0002. 687-701