- Auto-Stencil: Performance-Driven Stencil Optimization with Hardware Feedback for LLMs. Quan Deng 0001, Lin Gan 0001, Hongkun Yu 0002, Wenlai Zhao, Guangwen Yang 0002. 1-10
- MixLoRA: An Efficient Multi-Tenant Framework for Concurrently Serving Diverse LoRA Models in Large Language Models. Ronghuai Chen, Ce Yu, Hao Fu 0021, Xiaoteng Hu, Bin Yang 0043. 11-21
- Origami: Efficient ML-Driven Metadata Load Balancing for Distributed File Systems. Yiduo Wang 0002, Wenda Tang, Linghang Meng, Liang Li 0016, Jie Wu 0001. 22-32
- Solving Extended Flexible Job Shop Scheduling Problems with Deep Reinforcement Learning. Haonan Jiang, Yusen Li, Xiaoguang Liu 0001, Gang Wang 0001, Xuebo Zhang. 33-42
- CoreTuner: Predicting and Scheduling Framework for Optimizing the Joint Allocation of CPU and GPU in Training Cluster. Hao Dong, Yuehao Xu, Xiaohui Wang, Xinhua Ji, Zhijun Ding. 43-52
- A High-Accuracy Sketch for Measuring Low-Entropy Flows in Distributed AI Training. Jin Wang 0001, Chenye Zhu, Jinbin Hu 0001. 53-62
- P3P-Fed: Peer-to-Peer Personalized Federated Learning with DHT-based Local Clustering. Sooho Jang, Ahyeon Lim, Yuchan Lee, Sookwang Lee, Jaehwan Lee 0001. 63-72
- FedWCM: Unleashing the Potential of Momentum-based Federated Learning in Long-Tailed Scenarios. Tianle Li, Yongzhi Huang 0002, Linshan Jiang, Qipeng Xie, Chang Liu, Wenfeng Du, Lu Wang 0002, Kaishun Wu. 73-82
- It Takes Two: Accelerating Accurate Federated Learning through Pipelined Intra-Batch Data Sampling and Training. Chenghao Nu, Zhe Zhang 0043, Ye Li 0041, Yanchao Zhao. 83-93
- ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks. Joshua Hoke Davis, Daniel Nichols, Ishan Khillan, Abhinav Bhatele. 94-103
- Pisces: Towards Adaptive and Fair Congestion Control via Multi-Agent Meta-Reinforcement Learning. He Bai 0011, Hui Li 0022, Jianming Que, Minglong Zhang, Zhiqiang Hu, Ximing Xu, Bing Lin, Runhuai Huang, Junyang Qiu, Shaowen Deng. 104-114
- WinRS: Accelerate Winograd Backward-Filter Convolution with Tiny Workspace. Zhiyi Zhang, Junshi Chen 0003, Jingwei Sun 0001, Pengfei Zhang, Zhuopin Xu, Jun Shi 0007, Qi Wang 0131. 115-124
- SINA: Accelerating Time Synchronization in Large-Scale Network Simulation Using In-Network Allreduce. Dinghuang Hu, Dezun Dong, Xiangke Liao. 125-134
- VES: Vectorized Sparse General Matrix-Matrix Multiplication on Multi-Core DSPs. Chuhe Hong, Qinglin Wang, Xing Peng, Gencheng Liu, Qingyang Zhang 0009, Xinhai Chen 0001, Jie Liu 0002. 135-145
- Optimizing Direct Convolutions on High-Performance Multi-Core DSPs. Pengyu Wang, Xiaotian Chen, Jianbin Fang, Peng Zhang 0061, Yonggang Che, Chun Huang 0006, Jie Ren 0007. 146-156
- Fast Exact Diameter Computation of Sparse Graphs. Cameron Bradley, Anju Mongandampulath Akathoott, Martin Burtscher. 157-167
- SpeedSketch: An Ultra-Fast Sketch Generation and Delta Encoding Framework for Delta Compression. Fengkui Yang, Yuanzhang Wang, Chunhua Li 0002, Ke Zhou 0001, Hui Li. 168-177
- Deadline-Aware Scheduling of Mixed-Criticality Tasks. Maxime Gonthier, Kyle Chard, Ian T. Foster, Loris Marchal, Frédéric Vivien. 178-187
- A Fast Sparse Triangular Solve for Structured-grid Problems on Heterogeneous Processors. Zhengding Hu, Yi Zong, Jingwei Sun 0001, Wei Xue 0003, Guangzhong Sun. 188-198
- PISCES: Push-Pull Hybrid Optimization for Graph Pattern Matching. Changjie Xu, Ke Meng, Zhiheng Lin, Guangming Tan. 199-207
- AMPED: Accelerating MTTKRP for Billion-Scale Sparse Tensor Decomposition on Multiple GPUs. Sasindu Wijeratne, Rajgopal Kannan, Viktor K. Prasanna. 208-217
- Bridging Cache-Friendliness and Concurrency: A Locality-Optimized In-Memory B-Skiplist. Yicong Luo, Senhe Hao, Brian Wheatman, Prashant Pandey 0001, Helen Xu 0001. 218-227
- Scaling Distributed Graph Processing to Hundreds of GPUs. George M. Slota, Michael Mandulak. 228-237
- Multiprocessor Scheduling with Memory Constraints: Fundamental Properties and Finding Optimal Solutions. Pál András Papp, Toni Böhnlein, Albert-Jan Nicholas Yzelman. 238-247
- Heterogeneity-aware Task Scheduling based on Personalized Federated Reinforcement Learning. Xin Yong, Li Yan 0004, Zhuozhao Li. 248-257
- BMapper: A Scalable and Efficient Framework for Brain Simulations Acceleration on Supercomputers. Yubing Bao, Zhihui Lu 0002, Qiang Duan 0002, Xin Du 0002, Zhongyu Chen, Yicong Zhao, Xiaoyi Li, Yandan Tan, Shuhan Yang, Ziyi Wang, Yang Chen 0001, Yang Xu 0010. 258-267
- SYgraph: A Portable Heterogeneous Graph Analytics Framework for GPUs. Antonio De Caro, Gennaro Cordasco, Biagio Cosenza. 268-277
- FLEX: Leveraging FPGA-CPU Synergy for Mixed-Cell-Height Legalization Acceleration. Xingyu Liu, Jiawei Liang, Linfeng Du, Yipu Zhang 0002, Chaofang Ma, Hanwei Fan, Jiang Xu 0001, Wei Zhang 0012. 278-287
- Fast and Scalable Mixed Precision Euclidean Distance Calculations Using GPU Tensor Cores. Brian Curless, Michael Gowanlock. 288-298
- Thievory: Graph Processing with Multi-GPU Memory Stealing. João Brotas, Ricardo Nobre, Aleksandar Ilic. 299-308
- Efficient Parallel Algorithms for Dynamic Percolation Centrality. Prajjwal Nijhara, Lokesh Venkatachalam, Agam Harpreet Singh, Athreya Chandramouli, Sayantan Jana, Kishore Kothapalli, Dip Sankar Banerjee. 309-319
- SpiderCache: Semantic-Aware Caching Strategy for DNN Training. Zesong Wang 0001, Peng Fang 0002, Fang Wang 0001, Hong Jiang 0001, Yimin Lu, Zhan Shi 0001, Dan Feng 0001. 320-330
- ViReC: The Virtual Register Context Architecture for Efficient Near-Memory Multithreading. Matthew Barondeau, Sophia Jiang, Jonathan Beard, Andreas Gerstlauer. 331-341
- Accelerating Erasure Coding on Persistent Memory via Adaptive Prefetcher Scheduling. Guanglei Xu, Hai Zhou 0002, Yuchong Hu, Dan Feng 0001, Renzhi Xiao. 342-351
- ADAPT: Dynamic Grouping and Cross-Group Aggregation for GC-Efficient Log-Structured Storage in SSD Arrays. Ruisong Zhou, Peng Wang 0037, Chunhua Li 0002, Ke Zhou 0001, Hui Li. 352-361
- HeatList: The Case for Retrofitting In-memory Range Index with Hotspot Awareness. Junru Shen, Miao Cai 0001, Kangyue Gao, Baoliu Ye, Guo Cheng. 362-373
- HMGraph: Boosting GNN Training on Hierarchical Memory via Coordinated Cache. LiZhi Zhang, Menghan Jia, Zhiquan Lai, Qiao Li 0001, Yiming Zhang 0003, Dongsheng Li 0001. 374-384
- PTWalker: Cache-Efficient Random Walks via Alternating Dual-Subgraph Walker Updating. Shuai Lin, Rui Wang 0076, Zaigui Zhang, Long Deng, Wenzhe Zhu, Yongkun Li 0001, Yinlong Xu 0001. 385-395
- Efficient Cross-Datacenter Congestion Control with Fast Control Loops. Baosen Zhao, Jianan Sun, Xu Zhou, Wanghong Yang, Wenji Du, Fukang Chen, Yongmao Ren, Stefan Schmid. 396-405
- Automated FPGA Accelerator Generation Framework for Transformers with Dataflow Optimization. Wenqi Lou, Yunji Qin, Zihao Wang, Chao Wang 0003, Lei Gong 0003, Xuehai Zhou. 406-416
- Design of Interposer Interconnection Network Based on High-Radix Interposer Routers. Xue Xiao, Yi Dai, Yanqiang Sun, Jianmin Zhang, Tiejun Li. 417-427
- SmartBlock: Adaptive Block Floating Point Quantization for Efficient DNN Acceleration. Xin Ju 0005, Jingkui Yang, Mei Wen, Jun He, Jing Feng, Minjin Tang, Zhaoyun Chen, Yang Shi 0008. 428-438
- COF: Cycle and transmission co-mapping framework for CNN mapping in PIM architecture. Xianfa Zhou, Tun Li, Yuhuan Xia, Ruiyu Zhang. 439-448
- Power Capping of GPU Servers for Machine Learning Inference Optimization. Yuan Ma, Srinivasan Subramaniyan, Xiaorui Wang. 449-459
- LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling. Zhongchun Zhou, Chengtao Lai, Wei Zhang 0012. 460-469
- ParaCOSM: A Parallel Framework for Continuous Subgraph Matching. Haibin Lai, Sicheng Zhou, Site Fan, Zhuozhao Li. 470-479
- Heterogeneity-aware Federated Edge Learning via UAV Sampling and D2D Communications. Yanfeng Lu, Tao Wu 0011, Chao Chang 0005, Hongjun Wang 0010, Mingxing Ke, Jian Wang 0014. 480-489
- ZTP: A Scalable and Lightweight Privacy-Preserving Blockchain via Scale-Free Quorums and Geometric Fragmentation. Abdullah Al-Mamun 0001, Dongfang Zhao, Gagan Agrawal, Ahmed AlEroud, Mohamed I. Ibrahem. 490-499
- Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision. Evelyne Ringoot, Rabab Alomairy, Valentin Churavy, Alan Edelman. 500-510
- Joint Prediction and Matching for Computing Resource Exchange Platforms. Da Huo 0002, Zhenzhe Zheng 0002, Xiaoyao Huang, Hao Chen 0181, Jianfeng Hu 0003, Zhiyong Yan, Fan Wu 0006, Jie Wu 0001. 511-520
- Lias: Leveraging Performance Counters for Interference Quantification and Mitigation in Multi-processor Systems. Yangfan Qiao, Zhuozhao Li. 521-530
- Architecture-Aware Models of AI Engines for High-Performance Matrix Matrix Multiplication. Elliott D. Binder, Jeffrey Low, Tze Meng Low. 531-540
- Scheduling based on Block Features for Concurrent Inference with Unseen DNN Models on GPU. Diaohan Luo, Zhen Tang, Heran Gao, Yuewen Wu, Heng Wu 0001, Xi Han, Wenbo Zhang 0006. 541-552
- Optimizing Incomplete Cholesky Factorization on MIMD Many-core Architecture. Yongzhen Shi, Qinglin Wang, Jie Liu 0002, Lian Wang, Zhiyan Liu, Bingwei Wang, Feiming Liu, Xiangdong Pei. 553-563
- OVERT: Orchestrating Vector-Scalar Execution for Efficient SpMV on Modern CPUs. Kelun Lei, Hailong Yang 0002, Kaige Zhang 0002, Shaokang Du, Marc Casas, Yufan Xu 0001, Zhongzhi Luan, Yi Liu 0013, Depei Qian 0002. 564-574
- ESC: Effective Submanifold Convolution using Tensor Cores. Xuezhu Wang, Hailong Yang 0002, Xin You 0001, Yufan Xu 0001, Xiaoyan Liu, Siqi Wang, Kaige Zhang 0002, Mingzhen Li 0001, Zhongzhi Luan, Yi Liu 0013, Depei Qian 0002. 575-585
- Joint Task Scheduling and Resource Allocation in Cloud-Edge Collaborative Computing Systems. Boyu Du, Jingya Zhou, Jin Wang, Jiangwei Wang, Zhijun Li. 586-596
- One GPU, Many Ranks: Enabling Performance and Energy-Efficient In-Transit Visualization via Resource Sharing. Matheus Costa, Philippe O. A. Navaux, Silvio Rizzi 0001, Arthur Francisco Lorenzon. 597-606
- HHOTuner: Efficient Performance Tuning with Harris Hawks Optimization. Akash Dutta, Ali Jannesari. 607-616
- Cross-Architecture Performance Analysis Using the RAJA Performance Suite. Dewi Yokelson, Stephanie Brink, Jason Burmark, Michael McKinsey, Befikir Bogale, Ian Lumsden, Michela Taufer, Tom Scogland, Olga Pearce. 617-626
- Carbon-Aware Workflow Scheduling with Fixed Mapping and Deadline Constraint. Dominik Schweisgut, Anne Benoit, Yves Robert, Henning Meyerhenke. 627-637
- Q-GEAR: Improving quantum simulation framework. Ziqing Guo, Jan Balewski, Ziwen Pan. 638-647
- Cycle-Aware Parallel Optimization for Mitigating ZZ Crosstalk on Quantum Hardware. Jiayi Zhong, Yuxin Deng. 648-657
- Adaptive Job Scheduling in Quantum Clouds Using Reinforcement Learning. Waylon Luo, Jiapeng Zhao, Tong Zhan, Qiang Guan. 658-667
- Efficient Construction of Large Search Spaces for Auto-Tuning. Floris-Jan Willemsen, Rob V. van Nieuwpoort, Ben van Werkhoven. 668-677
- Amber: Towards Fast and Space-Efficient Incremental Checkpointing in Large Language Model Training. Zhiqiang Wang, Wenzhe Zhu, Zaigui Zhang, Chaomei Yan, Fan Guo 0003, Yongkun Li 0001, Yinlong Xu 0001. 678-688
- TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference. Hongbin Zhang 0006, Taosheng Wei, Zhenyi Zheng, Jiangsu Du, Zhiguang Chen 0001, Yutong Lu. 689-698
- Leave No One Behind: Fair and Efficient Tiered Memory Management for Multi-Applications. Wenda Tang, Yiduo Wang 0002, Yanwen Wang, Jie Wu 0001. 699-709
- CompreGel: Efficient Distributed Graph Propagation via Error-Bounded Lossy Message Compression. Tianhao Wu 0006, Da Yan 0001, Qihao Cheng, Lyuheng Yuan, Sheng Di, Jiao Han, Zhongyi Huang, Ji Cheng 0002. 710-719
- Accelerating Multi-Output GBDTs with GPUs. Hanfeng Liu, Xuemei Peng, Zeyi Wen. 720-729
- Decision Shuffle: Efficient Pre-scheduling System for Push-based Shuffle in DAG Computing Frameworks. Shihao Zhang, Chi Zhang 0005, Chentao Wu, Jie Li 0002, Minyi Guo, Hui Li, Liqiang Zhang 0010. 730-740
- Accelerating an Electromagnetic Simulation via Memory-Constrained Task-Based Load Balancing. Jonathan Lifflander, Nicole Slattengren, Philippe P. Pebay, Pierre L. Pebay, Caleb Schilly, Robert A. Pfeiffer, Joseph D. Kotulski. 741-752
- pyGinkgo: A Sparse Linear Algebra Operator Framework for Python. Keshvi Tuteja, Gregor Olenik, Roman Mishchuk, Yu-Hsiang Tsai, Markus Götz, Achim Streit, Hartwig Anzt, Charlotte Debus. 753-763
- IRIS-MASH: Efficient Multi-device Asynchronous Multi-Stream Heterogeneous Computing. Narasinga Rao Miniskar, Aaron R. Young, Mohammad Alaul Haque Monil, Kazi Asifuzzaman, Beau Johnston, Keita Teranishi, Jeffrey S. Vetter. 764-773
- Optimizing NumPy with SVE Acceleration on ARM Architectures. Kuldeep Pal, Aniket P. Garade, Deepika H. V, Haribabu P, S. A. Kumar, S. D. Sudarsan. 774-783
- Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication. Chen-Chun Chen, Jinghan Yao, Hari Subramoni, Dhabaleswar K. Panda 0001. 784-793
- Revisiting Multi-threaded Compaction in LSM-trees: Enabling Compaction Pipelining. Hongsu Byun, Honghyeon Yoo, Sungyong Park. 794-803
- TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks. Ziji Shi, Le Jiang, Ang Wang, Jie Zhang 0135, Chencan Wu, Yong Li 0045, Xiaokui Xiao, Wei Lin 0016, Jialin Li 0001. 804-815