- PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks. Zhengding Hu, Jingwei Sun 0001, Zhongyang Li, Guangzhong Sun. 2-13
- VNEC: A Vectorized Non-Empty Column Format for SpMV on CPUs. Luhan Wang, Haipeng Jia, Lei Xu 0023, Cunyang Wei, Kun Li, Xianmeng Jiang, Yunquan Zhang. 14-25
- Two-Stage Block Orthogonalization to Improve Performance of s-step GMRES. Ichitaro Yamazaki, Andrew J. Higgins 0002, Erik G. Boman, Daniel B. Szyld. 26-37
- Alternative Basis Matrix Multiplication is Fast and Stable. Oded Schwartz, Sivan Toledo, Noa Vaknin, Gal Wiernik. 38-51
- Fast multiplication of random dense matrices with sparse matrices. Tianyu Liang, Riley Murray, Aydin Buluç, James Demmel. 52-62
- A Cholesky QR type algorithm for computing tall-skinny QR factorization with column pivoting. Takeshi Fukaya, Yuji Nakatsukasa, Yusaku Yamamoto. 63-75
- CKSM: An Efficient Memory Deduplication Method for Container-based Cloud Computing Systems. Yunfei Gu, Yihui Lu, Chentao Wu, Jie Li 0002, Minyi Guo. 76-88
- Tackling Cold Start in Serverless Computing with Multi-Level Container Reuse. Amelie Chi Zhou, Rongzheng Huang, Zhoubin Ke, Yusen Li, Yi Wang, Rui Mao 0001. 89-99
- Paldia: Enabling SLO-Compliant and Cost-Effective Serverless Computing on Heterogeneous Hardware. Vivek M. Bhasi, Aakash Sharma, Shruti Mohanty, Mahmut Taylan Kandemir, Chita R. Das. 100-113
- Application-Attuned Memory Management for Containerized HPC Workflows. Moiz Arif, Avinash Maurya, M. Mustafa Rafique, Dimitrios S. Nikolopoulos, Ali Raza Butt. 114-127
- FEDGE: An Interference-Aware QoS Prediction Framework for Black-Box Scenario in IaaS Clouds with Domain Generalization. Yunlong Cheng, Xiuqi Huang, Zifeng Liu, Jiadong Chen, Xiaofeng Gao 0001, Zhen Fang, Yongqiang Yang. 128-138
- Software Resource Disaggregation for HPC with Serverless Computing. Marcin Copik, Marcin Chrapek, Larissa Schmid, Alexandru Calotoiu, Torsten Hoefler. 139-156
- AMST: Accelerating Large-Scale Graph Minimum Spanning Tree Computation on FPGA. Haishuang Fan, Rui Meng, Qichu Sun, Jingya Wu, Wenyan Lu, Xiaowei Li 0001, Guihai Yan. 157-168
- Wait-free Trees with Asymptotically-Efficient Range Queries. Ilya Kokorin, Victor Yudov, Vitaly Aksenov, Dan Alistarh. 169-179
- Low-Depth Spatial Tree Algorithms. Yves Baumann, Tal Ben-Nun, Maciej Besta, Lukas Gianinazzi, Torsten Hoefler, Piotr Luczynski. 180-192
- QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices. Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Yibo Zhu, Chuan Wu 0001. 193-204
- Enhancing the Generalization of Personalized Federated Learning with Multi-head Model and Ensemble Voting. Van An Le, Nam Duong Tran, Phuong Nam Nguyen, Thanh-Hung Nguyen, Phi-Le Nguyen, Truong Thao Nguyen, Yusheng Ji. 205-216
- UniFaaS: Programming across Distributed Cyberinfrastructure with Federated Function Serving. Yifei Li, Ryan Chard, Yadu N. Babuji, Kyle Chard, Ian T. Foster, Zhuozhao Li. 217-229
- Scalable and Differentiable Simulator for Quantum Computational Chemistry. Zhiqian Xu 0005, Honghui Shang, Yi Fan, Xiongzhi Zeng, Yunquan Zhang, Chu Guo. 230-240
- Picasso: Memory-Efficient Graph Coloring Using Palettes With Applications in Quantum Computing. S. M. Ferdous, Reece Neff, Bo Peng, Salman Shuvo, Marco Minutoli, Sayak Mukherjee, Karol Kowalski, Michela Becchi, Mahantesh Halappanavar. 241-252
- Optimizing and Scaling the 3D Reconstruction of Single-Particle Imaging. Niteya Shah, Christine Sweeney, Vinay Ramakrishnaiah, Jeffrey Donatelli, Wu-chun Feng. 253-264
- Parallel Approximations for High-Dimensional Multivariate Normal Probability Computation in Confidence Region Detection Applications. Xiran Zhang, Sameh Abdulah, Jian Cao 0004, Hatem Ltaief, Ying Sun 0002, Marc G. Genton, David E. Keyes. 265-276
- Enabling High-Performance Physical Based Rendering on New Sunway Supercomputer. Zeyu Song, Lin Gan, Shengye Xiang, Yinuo Wang, Xiaohui Duan, Guangwen Yang. 277-288
- CoCG: Fine-grained Cloud Game Co-location on Heterogeneous Platform. Taolei Wang, Chao Li, Jing Wang, Cheng Xu, Xiaofeng Hou, Minyi Guo. 289-299
- Adaptive Task-Oriented Resource Allocation for Large Dynamic Workflows on Opportunistic Resources. Thanh Son Phung, Douglas Thain. 300-311
- nOS-V: Co-Executing HPC Applications Using System-Wide Task Scheduling. David Álvarez 0006, Kevin Sala, Vicenç Beltran 0001. 312-324
- SWEEP: Adaptive Task Scheduling for Exploring Energy Performance Trade-offs. Jing Chen, Madhavan Manivannan, Bhavishya Goel, Miquel Pericàs. 325-336
- Interpretable Analysis of Production GPU Clusters Monitoring Data via Association Rule Mining. Baolin Li, Siddharth Samsi, Vijay Gadepally, Devesh Tiwari. 337-349
- CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion. Jan Laukemann, Thomas Gruber, Georg Hager, Dossay Oryspayev, Gerhard Wellein. 350-360
- ARGO: An Auto-Tuning Runtime System for Scalable GNN Training on Multi-Core Processor. Yi-Chien Lin, Yuyang Chen, Sameh Gobriel, Nilesh Jain, Gopi Krishna Jha, Viktor K. Prasanna. 361-372
- Accelerating Lossy and Lossless Compression on Emerging BlueField DPU Architectures. Yuke Li, Arjun Kashyap, Weicong Chen, Yanfei Guo, Xiaoyi Lu. 373-385
- Performance-Portable Multiphase Flow Solutions with Discontinuous Galerkin Methods. Tobias S. Flynn, Robert Manson-Sawko, Gihan R. Mudalige. 386-397
- Optimized GPU Implementation of Grid Refinement in Lattice Boltzmann Method. Ahmed H. Mahmoud, Hesam Salehipour, Massimiliano Meneghin. 398-407
- Alya towards Exascale: Optimal OpenACC Performance of the Navier-Stokes Finite Element Assembly on GPUs. Herbert Owen, Dominik Ernst, Thomas Gruber, Oriol Lehmkuhl, Guillaume Houzeaux, Lucas Gasparino, Gerhard Wellein. 408-416
- CliZ: Optimizing Lossy Compression for Climate Datasets with Adaptive Fine-tuned Data Prediction. Zizhe Jian, Sheng Di, Jinyang Liu, Kai Zhao 0008, Xin Liang 0001, Haiying Xu, Robert Underwood, Shixun Wu, Jiajun Huang, Zizhong Chen, Franck Cappello. 417-429
- Automating GPU Scalability for Complex Scientific Models: Phonon Boltzmann Transport Equation. Eric Heisler, Siddharth Saurav, Aadesh Deshmukh, Sandip Mazumder, Hari Sundar. 430-439
- An O(N) distributed-memory parallel direct solver for planar integral equations. Tianyu Liang, Chao Chen 0008, Per-Gunnar Martinsson, George Biros. 440-452
- Exploiting long vectors with a CFD code: a co-design show case. Marc Blancafort, Roger Ferrer, Guillaume Houzeaux, Marta Garcia-Gasulla, Filippo Mantovani. 453-464
- Capturing Periodic I/O Using Frequency Techniques. Ahmad Tarraf, Alexis Bandet, Francieli Boito, Guillaume Pallez, Felix Wolf 0001. 465-478
- To Store or Not to Store: a graph theoretical approach for Dataset Versioning. Anxin Guo, Jingwei Li, Pattara Sukprasert, Samir Khuller, Amol Deshpande, Koyel Mukherjee. 479-493
- TunIO: An AI-powered Framework for Optimizing HPC I/O. Neeraj Rajesh, Keith Bateman, Jean Luca Bez, Suren Byna, Anthony Kougkas, Xian-He Sun. 494-505
- A2FL: Autonomous and Adaptive File Layout in HPC through Real-time Access Pattern Analysis. Dong-Kyu Sung, Yongseok Son, Alex Sim, Kesheng Wu, Suren Byna, Houjun Tang, Hyeonsang Eom, Changjong Kim, Sunggon Kim. 506-518
- NVMe-oPF: Designing Efficient Priority Schemes for NVMe-over-Fabrics with Multi-Tenancy Support. Darren Ng, Andrew Lin, Arjun Kashyap, Guanpeng Li, Xiaoyi Lu. 519-531
- Drilling Down I/O Bottlenecks with Cross-layer I/O Profile Exploration. Hammad Ather, Jean Luca Bez, Yankun Xia, Suren Byna. 532-543
- CachedArrays: Optimizing Data Movement for Heterogeneous Memory Systems. Mark Hildebrand, Jason Lowe-Power, Venkatesh Akella. 545-555
- Comparative Study of Large Language Model Architectures on Frontier. Junqi Yin, Avishek Bose, Guojing Cong, Isaac Lyngaas, Quentin Anthony. 556-569
- Predicting Cross-Architecture Performance of Parallel Programs. Daniel Nichols, Alexander Movsesyan, Jae-Seung Yeom, Abhik Sarkar, Daniel Milroy, Tapasya Patki, Abhinav Bhatele. 570-581
- Druto: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications. Md Hasanur Rahman 0001, Sheng Di, Shengjian Guo, Xiaoyi Lu, Guanpeng Li, Franck Cappello. 582-594
- MPI Errors Detection using GNN Embedding and Vector Embedding over LLVM IR. Jad El Karchi, Hanze Chen, Ali TehraniJamsaz, Ali Jannesari, Mihail Popov, Emmanuelle Saillard. 595-607
- A Parallel Partial Merge Repair Algorithm for Multi-block Failures for Erasure Storage Systems. Shuaipeng Zhang, Shiyi Li, Chentao Wu, Ruobin Wu, Saiqin Long, Wen Xia. 608-618
- Harmonica: Hybrid Accelerator to Overcome Imperfections of Mixed-signal DNN Accelerators. Payman Behnam, Uday Kamal, Ali Shafiee, Alexey Tumanov, Saibal Mukhopadhyay. 619-630
- IPU-EpiDet: Identifying Gene Interactions on Massively Parallel Graph-Based AI Accelerators. Ricardo Nobre, Aleksandar Ilic, Sergio Santander-Jiménez, Leonel Sousa. 631-643
- DEFCON: Deformable Convolutions Leveraging Interval Search and GPU Texture Hardware. Malith Jayaweera, Yanyu Li, Yanzhi Wang, Bin Ren, David R. Kaeli. 644-655
- Benchmarking and Dissecting the Nvidia Hopper GPU Architecture. Weile Luo, Ruibo Fan, ZeYu Li, Dayou Du, Qiang Wang, Xiaowen Chu. 656-667
- Exploration of Trade-offs Between General-Purpose and Specialized Processing Elements in HPC-Oriented CGRA. Emanuele Del Sozzo, Xinyuan Wang, Boma A. Adhi, Carlos Cortes, Jason Anderson, Kentaro Sano. 668-680
- Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster. Abeda Sultana, Fei Xu, Xu Yuan 0001, Li Chen 0019, Nian-Feng Tzeng. 681-691
- Fast Abort-Freedom for Deterministic Transactions. Chen Chen, Xingbo Wu, Wenshao Zhong, Jakob Eriksson. 692-704
- SYNPA: SMT Performance Analysis and Allocation of Threads to Cores in ARM Processors. Marta Navarro, Josué Feliu, Salvador Petit, María Engracia Gómez, Julio Sahuquillo. 705-715
- Cross-System Analysis of Job Characterization and Scheduling in Large-Scale Computing Clusters. Di Zhang, Monish Soundar Raj, Bing Xie, Sheng Di, Dong Dai 0001. 716-727
- Automatic Task Parallelization of Dataflow Graphs in ML/DL Models. Srinjoy Das, Lawrence Rauchwerger. 728-739
- Adaptive Prefetching for Fine-grain Communication in PGAS Programs. Thomas B. Rolinger, Alan Sussman. 740-751
- An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression. Jiajun Huang, Sheng Di, Xiaodong Yu 0001, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao 0008, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur. 752-764
- MUSE: A Runtime Incrementally Reconfigurable Network Adapting to HPC Real-Time Traffic. Zijian Li 0018, Zixuan Chen, Yiying Tang, Xin Ai 0008, Yuanyi Zhu, Zhigao Zhao, Jiang Shao, Guowei Liu, Sen Liu, Bin Liu, Yang Xu 0010. 765-779
- Fast Policy Convergence for Traffic Engineering with Proactive Distributed Message-Passing. ZiCheng Wang, Zirui Zhuang, Jingyu Wang, Qi Qi, Haifeng Sun 0001, Jianxin Liao. 780-790
- The Self-adaptive and Topology-aware MPI_Bcast leveraging Collective offload on Tianhe Express Interconnect. Chongshan Liang, Yi Dai, Jun Xia, Jinbo Xu, Jintao Peng, Weixia Xu, Ming Xie, Jie Liu, Zhiquan Lai, Sheng Ma, Qi Zhu. 791-801
- HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions. Bharath Ramesh 0005, Nick Contini, Nawras Alnaasan, Kaushik Kandadi Suresh, Mustafa Abduljabbar, Aamir Shafi, Hari Subramoni, Dhabaleswar K. D. K. Panda. 802-813
- Flexible NVMe Request Routing for Virtual Machines. Tu Dinh Ngoc, Boris Teabe, Georges Da Costa, Daniel Hagimont. 814-824
- HA-CSD: Host and SSD Coordinated Compression for Capacity and Performance. Xiang Chen, Tao Lu, Jiapin Wang, Yu Zhong, Guangchun Xie, Xueming Cao, Yuanpeng Ma, Bing Si, Feng Ding, Ying Yang, Yunxin Huang, Yafei Yang, You Zhou, Fei Wu 0005. 825-838
- Graph Analytics on Jellyfish topology. Md Nahid Newaz, Sayan Ghosh, Joshua Suetterlein, Nathan R. Tallent, Md Atiqul Mollah, Hua Ming. 839-851
- TEEMO: Temperature Aware Energy Efficient Multi-Retention STT-RAM Cache Architecture. Sukarn Agarwal, Shounak Chakraborty 0001, Magnus Själander. 852-864
- LockillerTM: Enhancing Performance Lower Bounds in Best-Effort Hardware Transactional Memory. Li Wan, Fu Chao, Qiang Li, Jun Han 0003. 865-875
- Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching. Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan, Viktor K. Prasanna. 876-888
- Aurora: A Versatile and Flexible Accelerator for Graph Neural Networks. Jiaqi Yang, Hao Zheng 0005, Ahmed Louri. 890-902
- cuKE: An Efficient Code Generator for Score Function Computation in Knowledge Graph Embedding. Lihan Hu, Jing Li, Peng Jiang 0004. 903-914
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference. Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda 0001. 915-925
- TASER: Temporal Adaptive Sampling for Fast and Accurate Dynamic Graph Representation Learning. Gangda Deng, Hongkuan Zhou, Hanqing Zeng, Yinglong Xia, Christopher Leung, Jianbo Li, Rajgopal Kannan, Viktor K. Prasanna. 926-937
- OpenFFT-SME: An Efficient Outer Product Pattern FFT Library on ARM SME CPUs. Ruge Zhang, Haipeng Jia, Yunquan Zhang, Baicheng Yan, Penghao Ma, Long Wang, Wenxuan Zhao. 938-949
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures. Evangelos Georganas, Dhiraj D. Kalamkar, Kirill Voronin, Abhisek Kundu, Antonio Noack, Hans Pabst, Alexander Breuer, Alexander Heinecke. 950-963
- Optimizing General Matrix Multiplications on Modern Multi-core DSPs. Kainan Yu, Xinxin Qi, Peng Zhang 0061, Jianbin Fang, Dezun Dong, Ruibo Wang, Tao Tang 0001, Chun Huang, Yonggang Che, Zheng Wang 0001. 964-975
- Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems. Yufan Xia, Giuseppe Maria Junior Barca. 976-986
- Time-Color Tradeoff on Uniform Circle Formation by Asynchronous Robots. Debasish Pattanayak, Gokarna Sharma. 987-997
- LightDAG: A Low-latency DAG-based BFT Consensus through Lightweight Broadcast. Xiaohai Dai, Guanxiong Wang, Jiang Xiao, Zhengxuan Guo, Rui Hao, Xia Xie, Hai Jin 0001. 998-1008
- MAAD: A Distributed Anomaly Detection Architecture for Microservices Systems. Rongyuan Tan, Zhuozhao Li. 1009-1021
- OneShot: View-Adapting Streamlined BFT Protocols with Trusted Execution Environments. Jérémie Decouchant, David Kozhaya, Vincent Rahli, Jiangshan Yu. 1022-1033
- Practically Tackling Memory Bottlenecks of Graph-Processing Workloads. Alexandre Valentin Jamet, Georgios Vavouliotis, Daniel A. Jiménez, Lluc Alvarez, Marc Casas. 1034-1045
- GCSM: GPU-Accelerated Continuous Subgraph Matching for Large Graphs. Yihua Wei, Peng Jiang 0004. 1046-1057
- Parallel Derandomization for Coloring. Sam Coy, Artur Czumaj, Peter Davies-Peck, Gopinath Mishra. 1058-1069
- A Comparative Study of Intersection-Based Triangle Counting Algorithms on GPUs. Jiangbo Li, Zichen Xu 0001, Minh Pham, Yicheng Tu, Qihe Zhou. 1070-1081