- FAZ: A flexible auto-tuned modular error-bounded compression framework for scientific data. Jinyang Liu, Sheng Di, Kai Zhao 0008, Xin Liang 0001, Zizhong Chen, Franck Cappello. 1-13
- Using Additive Modifications in LU Factorization Instead of Pivoting. Neil Lindquist, Piotr Luszczek, Jack J. Dongarra. 14-24
- FLORIA: A Fast and Featherlight Approach for Predicting Cache Performance. Jun Xiao 0009, Yaocheng Xiang, Xiaolin Wang 0001, Yingwei Luo, Andy D. Pimentel, Zhenlin Wang. 25-36
- Transfer-learning-based Autotuning using Gaussian Copula. Thomas Randall, Jaehoon Koo, Brice Videau, Michael Kruse, Xingfu Wu, Paul D. Hovland, Mary W. Hall, Rong Ge 0002, Prasanna Balaprakash. 37-49
- Performance Embeddings: A Similarity-Based Transfer Tuning Approach to Performance Optimization. Lukas Trümper, Tal Ben-Nun, Philipp Schaad, Alexandru Calotoiu, Torsten Hoefler. 50-62
- CMLCompiler: A Unified Compiler for Classical Machine Learning. Xu Wen, Wanling Gao, Anzheng Li, Lei Wang, Zihan Jiang, Jianfeng Zhan. 63-74
- PAC: Preference-Aware Co-location Scheduling on Heterogeneous NUMA Architectures To Improve Resource Utilization. Pu Pang, Yaoxuan Li, Bo Liu, Quan Chen 0002, Zhou Yu, Zhibin Yu, Deze Zeng, Jingwen Leng, Jieru Zhao, Minyi Guo. 75-86
- BiRFIA: Selective Binary Rewriting for Function Interception on ARM. Kelun Lei, Xin You, Hailong Yang, Zhongzhi Luan, Depei Qian. 87-98
- Lightweight Huffman Coding for Efficient GPU Compression. Milan Shah, Xiaodong Yu, Sheng Di, Michela Becchi, Franck Cappello. 99-110
- Towards a Unified Implementation of GEMM in BLIS. RuQing G. Xu, Field G. Van Zee, Robert A. van de Geijn. 111-121
- Use Only What You Need: Judicious Parallelism For File Transfers in High Performance Networks. Md. Arifuzzaman, Engin Arslan. 122-132
- DStore: A Lightweight Scalable Learning Model Repository with Fine-Grain Tensor-Level Access. Meghana Madhyastha, Robert Underwood, Randal C. Burns, Bogdan Nicolae. 133-143
- DyVer: Dynamic Version Handling for Array Databases. Amelie Chi Zhou, Zhoubin Ke, Jianming Lao. 144-154
- Accelerating BWA-MEM Read Mapping on GPUs. Minh Pham, Yicheng Tu, Xiaoyi Lv. 155-166
- PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications. Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, Satoshi Matsuoka. 167-179
- Wafer-Scale Fast Fourier Transforms. Marcelo Orenes-Vera, Ilya Sharapov, Robert Schreiber, Mathias Jacquelin, Philippe Vandermersch, Sharan Chetlur. 180-191
- Multi-GPU Communication Schemes for Iterative Solvers: When CPUs are Not in Charge. Ismayil Ismayilov, Javid Baydamirli, Dogan Sagbili, Mohamed Wahib, Didem Unat. 192-202
- A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training. Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, Abhinav Bhatele. 203-214
- Scalable parallelization for the solution of phonon Boltzmann Transport Equation. Han D. Tran, Siddharth Saurav, P. Sadayappan, Sandip Mazumder, Hari Sundar. 215-226
- Optimizing Multi-grid Computation and Parallelization on Multi-cores. Xiaojian Yang, Shengguo Li, Fan Yuan, Dezun Dong, Chun Huang, Zheng Wang. 227-239
- FT-topo: Architecture-Driven Folded-Triangle Partitioning for Communication-efficient Graph Processing. Xinbiao Gan, Guang Wu, Ruigeng Zeng, Jiaqi Si, Ji Liu, Daxiang Dong, Chunye Gong, Cong Liu, Tiejun Li. 240-250
- Revisiting Temporal Blocking Stencil Optimizations. Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, Satoshi Matsuoka. 251-263
- BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUs. Jou-An Chen, Hsin-Hsuan Sung, Xipeng Shen, Sutanay Choudhury, Ang Li 0006. 264-276
- Fast All-Pairs Shortest Paths Algorithm in Large Sparse Graph. Shaofeng Yang, Xiandong Liu, Yunting Wang, Xin He, Guangming Tan. 277-288
- RT-kNNS Unbound: Using RT Cores to Accelerate Unrestricted Neighbor Search. Vani Nagarajan, Durga Mandarapu, Milind Kulkarni 0001. 289-300
- Distributed-Memory Parallel JointNMF. Srinivas Eswar, Benjamin Cobb, Koby Hayashi, Ramakrishnan Kannan, Grey Ballard, Richard W. Vuduc, Haesun Park. 301-312
- Parallel Software for Million-scale Exact Kernel Regression. Yu Chen 0036, Lucca Skon, James R. McCombs, Zhenming Liu, Andreas Stathopoulos. 313-323
- HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs. Chengming Zhang 0006, Shaden Smith, Baixi Sun, Jiannan Tian, Jonathan Soifer, Xiaodong Yu, Shuaiwen Leon Song, Yuxiong He, Dingwen Tao. 324-335
- Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training. Anqi Guo, Yuchen Hao, Chunshu Wu, Pouya Haghi, Zhenyu Pan, Min-Si, Dingwen Tao, Ang Li, Martin C. Herbordt, Tong Geng. 336-347
- GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs. Boyuan Zhang, Jiannan Tian, Sheng Di, Xiaodong Yu, Martin Swany, Dingwen Tao, Franck Cappello. 348-359
- Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs. Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen. 360-372
- FMI: Fast and Cheap Message Passing for Serverless Functions. Marcin Copik, Roman Böhringer, Alexandru Calotoiu, Torsten Hoefler. 373-385
- Scalable algorithms for compact spanners on real world graphs. Maulein Pathak, Yogish Sabharwal, Neelima Gupta. 386-397
- OpenFFT: An Adaptive Tuning Framework for 3D FFT on ARM Multicore CPUs. Tun Chen, Haipeng Jia, Yunquan Zhang, Kun Li, Zhihao Li, Xiang Zhao, Jianyu Yao, Chendi Li. 398-409
- Seizing the Bandwidth Scaling of On-Package Interconnect in a Post-Moore's Law World. Grigory Chirkov, David Wentzlaff. 410-422
- Roar: A Router Microarchitecture for In-network Allreduce. Ruiqi Wang, Dezun Dong, Fei Lei, Junchao Ma, Ke Wu, Kai Lu. 423-436
- GRAP: Group-level Resource Allocation Policy for Reconfigurable Dragonfly Network in HPC. Guangnan Feng, Dezun Dong, Shizhen Zhao, Yutong Lu. 437-449
- FLASH: FPGA-Accelerated Smart Switches with GCN Case Study. Pouya Haghi, William Krska, Cheng Tan 0002, Tong Geng, Po-Hao Chen, Connor Greenwood, Anqi Guo, Thomas Hines, Chunshu Wu, Ang Li, Anthony Skjellum, Martin C. Herbordt. 450-462
- SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation. Gagandeep Singh 0002, Alireza Khodamoradi, Kristof Denolf, Jack Lo, Juan Gómez-Luna, Joseph Melber, Andra Bisca, Henk Corporaal, Onur Mutlu. 463-476
- Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication. Nicholas Contini, Bharath Ramesh 0005, Kaushik Kandadi Suresh, Tu Tran, Benjamin Michalowicz, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar K. Panda 0001. 477-487