- Boosting Performance and QoS for Concurrent GPU B+trees by Combining-Based Synchronization. Weihua Zhang, Chuanlei Zhao, Lu Peng 0001, Yuzhe Lin, Fengzhe Zhang, Yunping Lu. 1-13
- The State-of-the-Art LCRQ Concurrent Queue Algorithm Does NOT Require CAS2. Raed Romanov, Nikita Koval. 14-26
- Provably Good Randomized Strategies for Data Placement in Distributed Key-Value Stores. Zhe Wang, Jinhao Zhao, Kunal Agrawal, He Liu, Meng Xu, Jing Li. 27-38
- 2PLSF: Two-Phase Locking with Starvation-Freedom. Pedro Ramalhete, Andreia Correia, Pascal Felber. 39-51
- Provably Fast and Space-Efficient Parallel Biconnectivity. Xiaojun Dong, Letong Wang, Yan Gu 0001, Yihan Sun 0001. 52-65
- Practically and Theoretically Efficient Garbage Collection for Multiversioning. Yuanhao Wei, Guy E. Blelloch, Panagiota Fatourou, Eric Ruppert. 66-78
- A Programming Model for GPU Load Balancing. Muhammad Osama, Serban D. Porumbescu, John D. Owens. 79-91
- Exploring the Use of WebAssembly in HPC. Mohak Chadha, Nils Krueger, Jophin John, Anshul Jindal, Michael Gerndt, Shajulin Benedict. 92-106
- Fast and Scalable Channels in Kotlin Coroutines. Nikita Koval, Dan Alistarh, Roman Elizarov. 107-118
- High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs. William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, Oleksandr Zinenko. 119-134
- A Scalable Hybrid Total FETI Method for Massively Parallel FEM Simulations. Kehao Lin, Chunbao Zhou, Yan Zeng, Ningming Nie, Jue Wang, Shigang Li, Yangde Feng, Yangang Wang, Kehan Yao, Tiechui Yao, Jilin Zhang, Jian Wan 0001. 135-147
- Lifetime-Based Optimization for Simulating Quantum Circuits on a New Sunway Supercomputer. Yaojian Chen, Yong Liu, Xinmin Shi, Jiawei Song, Xin Liu, Lin Gan, Chu Guo, Haohuan Fu, Jie Gao, Dexun Chen, Guangwen Yang. 148-159
- High-Performance Filters for GPUs. Hunter McCoy, Steven A. Hofmeyr, Katherine A. Yelick, Prashant Pandey 0001. 160-173
- High-Performance and Scalable Agent-Based Simulation with BioDynaMo. Lukas Breitwieser, Ahmad Hesam, Fons Rademakers, Juan Gómez-Luna, Onur Mutlu. 174-188
- OpenCilk: A Modular and Extensible Software Infrastructure for Fast Task-Parallel Code. Tao B. Schardl, I-Ting Angelina Lee. 189-203
- Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness. Zhen Xie, Jie Liu, Jiajia Li 0001, Dong Li. 204-217
- Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence. Michael Bauer, Elliott Slaughter, Sean Treichler, Wonchan Lee, Michael Garland, Alex Aiken. 218-231
- Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing. Rati Gelashvili, Alexander Spiegelman, Zhuolun Xiang, George Danezis, Zekun Li, Dahlia Malkhi, Yu Xia 0005, Runtian Zhou. 232-244
- TL4x: Buffered Durable Transactions on Disk as Fast as in Memory. Gal Assa, Andreia Correia, Pedro Ramalhete, Valerio Schiavoni, Pascal Felber. 245-259
- TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition. Lizhi Xiang, Miao Yin, Chengming Zhang 0006, Aravind Sukumaran-Rajam, P. Sadayappan, Bo Yuan 0001, Dingwen Tao. 260-273
- Improving Energy Saving of One-Sided Matrix Decompositions on CPU-GPU Heterogeneous Systems. Jieyang Chen, Xin Liang 0001, Kai Zhao 0008, Hadi Zamani Sabzi, Laxmi N. Bhuyan, Zizhong Chen. 274-287
- End-to-End LU Factorization of Large Matrices on GPUs. Yang Xia, Peng Jiang 0004, Gagan Agrawal, Rajiv Ramnath. 288-300
- Fast Symmetric Eigenvalue Decomposition via WY Representation on Tensor Core. Shaoshuai Zhang, Ruchi Shah, Hiroyuki Ootomo, Rio Yokota, Panruo Wu. 301-312
- iQAN: Fast and Accurate Vector Search with Efficient Intra-Query Parallelism on Multi-Core Architectures. Zhen Peng, Minjia Zhang, Kai Li, Ruoming Jin, Bin Ren. 313-328
- WISE: Predicting the Performance of Sparse Matrix Vector Multiplication with Machine Learning. Serif Yesil, Azin Heidarshenas, Adam Morrison 0001, Josep Torrellas. 329-341
- Efficient Direct Convolution Using Long SIMD Instructions. Alexandre de Limas Santana, Adrià Armejach, Marc Casas. 342-353
- TGOpt: Redundancy-Aware Optimizations for Temporal Graph Attention Networks. Yufeng Wang, Charith Mendis. 354-368
- Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism. Zhaodong Chen, Zheng Qu, Yuying Quan, Liu Liu 0017, Yufei Ding, Yuan Xie 0001. 369-379
- Elastic Averaging for Efficient Pipelined DNN Training. Zihao Chen, Chen Xu 0001, Weining Qian, Aoying Zhou. 380-391
- DSP: Efficient GNN Training with Multiple GPUs. Zhenkun Cai, Qihui Zhou, Xiao Yan 0002, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, George Karypis. 392-404
- PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs. Chunyang Wang, Desen Sun, Yuebin Bai. 405-418
- AArch64 Atomics: Might They Be Harming Your Performance? Ricardo Jesus, Michèle Weiland. 419-421
- Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems. Fei Dai, Yawen Chen 0001, Zhiyi Huang 0001, Haibo Zhang 0001, Fangfang Zhang 0002. 422-424
- Fast Parallel Exact Inference on Bayesian Networks. Jiantong Jiang, Zeyi Wen, Atif Bin Mansoor, Ajmal Mian. 425-426
- Generating Fast FFT Kernels on CPUs via FFT-Specific Intrinsics. Zhihao Li, Haipeng Jia, Yunquan Zhang, Yuyan Sun, YiWei Zhang, Tun Chen. 427-428
- Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU. Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens. 429-431
- High-Throughput GPU Random Walk with Fine-Tuned Concurrent Query Processing. Cheng Xu, Chao Li, Pengyu Wang 0003, Xiaofeng Hou, Jing Wang, Shixuan Sun, Minyi Guo, Hanqing Wu, Dongbai Chen, Xiangwen Liu. 432-434
- The ERA Theorem for Safe Memory Reclamation. Gali Sheffi, Erez Petrank. 435-437
- Unexpected Scaling in Path Copying Trees. Vitaly Aksenov, Trevor Brown 0001, Alexander Fedorov, Ilya Kokorin. 438-440
- Transactional Composition of Nonblocking Data Structures. Wentao Cai 0002, Haosen Wen, Michael L. Scott. 441-443
- CuPBoP: A Framework to Make CUDA Portable. Ruobing Han, Jun Chen, Bhanu Garg, Jeffrey Young 0001, Jaewoong Sim, Hyesoon Kim. 444-446
- Swift: Expedited Failure Recovery for Large-Scale DNN Training. Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, Chuan Wu 0001. 447-449
- Learning to Parallelize in a Shared-Memory Environment with Transformers. Re'em Harel, Yuval Pinter, Gal Oren 0001. 450-452