- O(N) distributed direct factorization of structured dense matrices using runtime systems. Sameer Deshmukh, Rio Yokota, George Bosilca, Qianxiang Ma. 1-10
- Composable Workflow for Accelerating Neural Architecture Search Using In Situ Analytics for Protein Classification. Georgia Channing, Ria Patel, Paula Olaya, Ariel Keller Rorabaugh, Osamu Miyashita, Silvina Caíno-Lores, Catherine D. Schuman, Florence Tama, Michela Taufer. 1
- Computing the k-th Eigenvalue of Symmetric H2-Matrices. M. Ridwan Apriansyah, Rio Yokota. 11-20
- EC-SpMM: Efficient Compilation of SpMM Kernel on GPUs. Junqing Lin, Honghe Zhang, Xiaolong Shi, Jingwei Sun 0001, Xianzhi Yu, Jun Yao, Guangzhong Sun. 21-30
- Recoil: Parallel rANS Decoding with Decoder-Adaptive Scalability. Fangzheng Lin, Kasidis Arunruangsirilert, Heming Sun, Jiro Katto. 31-40
- Minimizing Network and Storage Costs for Consensus with Flexible Erasure Coding. Mi Zhang 0007, Qihan Kang, Patrick P. C. Lee. 41-50
- SNICIT: Accelerating Sparse Neural Network Inference via Compression at Inference Time on GPU. Shui Jiang, Tsung-Wei Huang, Bei Yu 0001, Tsung-Yi Ho. 51-61
- DiffLex: A High-Performance, Memory-Efficient and NUMA-Aware Learned Index using Differentiated Management. Lixiao Cui, Kedi Yang, Yusen Li, Gang Wang, Xiaoguang Liu. 62-71
- BIRP: Batch-aware Inference Workload Redistribution and Parallel Scheme for Edge Collaboration. Hesheng Sun, Xinyi Chen, Zhuzhong Qian, Zengji Li, Ning Chen, Tuo Cao, Suwei Xu, Yitong Zhou. 72-81
- PSRA-HGADMM: A Communication Efficient Distributed ADMM Algorithm. Yongwen Qiu, Yongmei Lei, Guozheng Wang. 82-91
- CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel. Zhenxing Li, Qiang Cao 0001, Yajie Chen, Wenrui Yan. 92-101
- OSP: Boosting Distributed Model Training with 2-stage Synchronization. Zixuan Chen, Lei Shi, Xuandong Liu, Jiahui Li, Sen Liu, Yang Xu 0010. 102-111
- ITIF: Integrated Transformers Inference Framework for Multiple Tenants on GPU. Yuning Zhang, Zao Zhang, Wei Bao, Dong Yuan. 112-121
- Parallel Order-Based Core Maintenance in Dynamic Graphs. Bin Guo 0013, Emil Sekerinski. 122-131
- Fast Parallel Index Construction for Efficient K-truss-based Local Community Detection in Large Graphs. Md Abdul Motaleb Faysal, Maximilian H. Bremer, Cy P. Chan, John Shalf, Shaikh Arifuzzaman. 132-141
- BEEP: Balanced Efficient subgraph Enumeration in Parallel. Samiran Kawtikwar, Mohammad Almasri, Wen-mei Hwu, Rakesh Nagi, Jinjun Xiong. 142-152
- Improving the Scaling of an Asynchronous Many-Task Runtime with a Lightweight Communication Engine. Omri Mor, George Bosilca, Marc Snir. 153-162
- Investigating Dependency Graph Discovery Impact on Task-based MPI+OpenMP Applications Performances. Romain Pereira, Adrien Roussel, Patrick Carribault, Thierry Gautier. 163-172
- Implementing OpenMP's SIMD Directive in LLVM's GPU Runtime. Eric Wright, Johannes Doerfert, Shilei Tian, Barbara M. Chapman, Sunita Chandrasekaran. 173-182
- Smart Cache Insertion and Promotion Policy for Content Delivery Networks. Peng Wang, Yu Liu 0040, Zhelong Zhao, Ke Zhou 0001, Zhihai Huang, Yanxiong Chen. 183-192
- BlockPilot: A Proposer-Validator Parallel Execution Framework for Blockchain. Haowen Zhang, Jing Li, He Zhao 0011, Tong Zhou, Nianzu Sheng, Hengyu Pan. 193-202
- Communication Optimizations for State-vector Quantum Simulator on CPU+GPU Clusters. Chenyang Jiao, Weihua Zhang, Li Shen. 203-212
- RBC: A bandwidth controller to reduce write-stalls and tail latency. Zepeng Wang 0007, Shu Yin. 213-222
- PMLDS: An LSM-Tree Direct Managed Storage for Key-Value Stores on Byte-Addressable Devices. Ziyi Lu, Qiang Cao 0001, Shucheng Wang, Jie Yao, Xiangrui Yang. 223-232
- DComp: Efficient Offload of LSM-tree Compaction with Data Processing Units. Chen Ding, Jian Zhou 0004, Jiguang Wan, Yiqin Xiong, Sicen Li, Shuning Chen, Hanyang Liu, Liu Tang, Ling Zhan, Kai Lu, Peng Xu. 233-243
- RadarSSD: A Computational Storage for Radar Signal Processing. Jiali Li, Xianzhang Chen, Duo Liu, Ao Ren, Zhaoyang Zeng, Yujuan Tan. 244-253
- Communication-Efficient Generalized Neuron Matching for Federated Learning. Sixu Hu, Qinbin Li, Bingsheng He. 254-263
- Group-based Hierarchical Federated Learning: Convergence, Group Formation, and Sampling. Jiyao Liu, Xinliang Wei, Xuanzhang Liu, Hongchang Gao, Yu Wang 0003. 264-273
- FastDimeNet++: Training DimeNet++ in 22 minutes. Feiwen Zhu, Michal Futrega, Han Bao, Sukru Burc Eryilmaz, Fei Kong, Kefeng Duan, Xinnian Zheng, Nimrod Angel, Matthias Jouanneaux, Maxmilian Stadler, Michal Marcinkiewicz, Fung Xie, June Yang, Michael Andersch. 274-284
- Quantifying the Performance Benefits of Partitioned Communication in MPI. Thomas Gillis, Ken Raffenetti, Hui Zhou, Yanfei Guo, Rajeev Thakur. 285-294
- Impact of Cache Coherence on the Performance of Shared-Memory based MPI Primitives: A Case Study for Broadcast on Intel Xeon Scalable Processors. George Katevenis, Manolis Ploumidis, Manolis Marazakis. 295-305
- Modeling and Benchmarking the Potential Benefit of Early-Bird Transmission in Fine-Grained Communication. Whit Schonbein, Scott Levy, Matthew G. F. Dosanjh, W. Pepper Marts, Elizabeth Reid, Ryan E. Grant. 306-316
- CoTuner: A Hierarchical Learning Framework for Coordinately Optimizing Resource Partitioning and Parameter Tuning. Tiannuo Yang, Ruobing Chen 0002, Yusen Li, Xiaoguang Liu, Gang Wang. 317-326
- DeepPower: Deep Reinforcement Learning based Power Management for Latency Critical Applications in Multi-core Systems. Jingrun Zhang, Guangba Yu, Zilong He, Liang Ai, Pengfei Chen 0002. 327-336
- AsyncGBP: Unleashing the Potential of Heterogeneous Computing for SSL/TLS with GPU-based Provider. Yi Bian, Fangyu Zheng, Yuewu Wang, Lingguang Lei, Yuan Ma, Jiankuo Dong, Jiwu Jing. 337-346
- MARS: Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry. Benran Wang, Hongyang Chen, Pengfei Chen, Zilong He, Guangba Yu. 347-357
- On Optimizing Traffic Scheduling for Multi-replica Containerized Microservices. Xianzhi Zhu, Yongkun Li 0001, Lulu Yao, Zhihao Qi, Yinlong Xu, Pengcheng Wang, Weiguang Wang, Xia Zhu. 358-368
- HighRPM: Combining Integrated Measurement and Software Power Modeling for High-Resolution Power Monitoring. Xinxin Qi, Juan Chen, Yong Dong, Yuan Yuan, Tao Xu, Rongyu Deng, Zekai Li, Kexing Zhou, Zheng Wang. 369-379
- Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2. Suneth Dasantha Ekanayake, István Zoltan Reguly, Fabio Luporini, Gihan Ravideva Mudalige. 380-391
- WFAsic: A High-Performance ASIC Accelerator for DNA Sequence Alignment on a RISC-V SoC. Abbas Haghi, Lluc Alvarez, Jordi Front, Juan Miguel De Haro Ruiz, Roger Figueras, Max Doblas, Santiago Marco-Sola, Miquel Moretó. 392-401
- PFDRL: Personalized Federated Deep Reinforcement Learning for Residential Energy Management. Jiechao Gao, Wenpeng Wang, Fateme Nikseresht, Viswajith Govinda Rajan, Bradford Campbell. 402-411
- Mercury: Fast and Optimal Device Placement for Large Deep Learning Models. Hengwei Xu, Pengyuan Zhou, Haiyong Xie 0001, Yong Liao. 412-422
- Embracing Uncertainty for Equity in Resource Allocation in ML Training. Suraiya Tairin, Haiying Shen, Zeyu Zhang 0005. 423-432
- Performance-Aware Energy-Efficient GPU Frequency Selection using DNN-based Models. Ghazanfar Ali, Mert Side, Sridutt Bhalachandra, Nicholas J. Wright, Yong Chen 0001. 433-442
- ASFL: Adaptive Semi-asynchronous Federated Learning for Balancing Model Accuracy and Total Latency in Mobile Edge Networks. Jieling Yu, Ruiting Zhou, Chen Chen 0067, Bo Li 0001, Fang Dong 0001. 443-451
- Credit-based Differential Privacy Stochastic Model Aggregation Algorithm for Robust Federated Learning via Blockchain. Mengyao Du, Miao Zhang, Lin Liu, Kai Xu, Quanjun Yin. 452-461
- Learning From Your Neighbours: Mobility-Driven Device-Edge-Cloud Federated Learning. Songli Zhang, Zhenzhe Zheng, Fan Wu 0006, Bingshuai Li, Yunfeng Shao 0001, Guihai Chen. 462-471
- DAG-Aware Optimization for Geo-Distributed Data Analytics. Qingyuan Wang, Bin Gao, Zhi Zhou 0006, Fei Xu, Chenghao Ouyang. 472-481
- Connectivity-Aware Link Analysis for Skewed Graphs. Yuang Chen, Yeh-Ching Chung. 482-491
- BitColor: Accelerating Large-Scale Graph Coloring on FPGA with Parallel Bit-Wise Engines. Haishuang Fan, Ming Li, Jingya Wu, Wenyan Lu, Xiaowei Li, Guihai Yan. 492-502
- Fast tree-based algorithms for DBSCAN for low-dimensional data on GPUs. Andrey Prokopenko, Damien Lebrun-Grandié, Daniel Arndt 0003. 503-512
- GFFT: a Task Graph Based Fast Fourier Transform Optimization Framework. Qinglin Lu, Xinyu Wang, Wenjing Ma, Yuwen Zhao, Daokun Chen, Fangfang Liu. 513-523
- ADARNet: Deep Learning Predicts Adaptive Mesh Refinement. Octavi Obiols-Sales, Abhinav Vishnu, Nicholas Malaya, Aparna Chandramowlishwaran. 524-534
- Hector: A Framework to Design and Evaluate Scheduling Strategies in Persistent Key-Value Stores. Louis-Claude Canon, Anthony Dugois, Loris Marchal, Etienne Rivière. 535-545
- Warped-MC: An Efficient Memory Controller Scheme for Massively Parallel Processors. Jong-Hyun Jeong, Myung Kuk Yoon, Yunho Oh, Gunjae Koo. 546-555
- Wrht: Efficient All-reduce for Distributed DNN Training in Optical Interconnect Systems. Fei Dai, Yawen Chen 0001, Zhiyi Huang, Haibo Zhang 0001. 556-565
- SEECHIP: A Scalable and Energy-Efficient Chiplet-based GPU Architecture Using Photonic Links. Hao Zhang, Yawen Chen, Zhiyi Huang 0001, Haibo Zhang, Fei Dai. 566-575
- RLB: Reordering-Robust Load Balancing in Lossless Datacenter Networks. Jinbin Hu, Yi He, Jin Wang, Wangqing Luo, Jiawei Huang 0001. 576-584
- Scheduling Dependent Batching Tasks. Hehuan Shi, Lin Chen 0002, Ming Lin, Rapharl Phan. 585-594
- Tango: Harmonious Management and Scheduling for Mixed Services Co-located among Distributed Edge-Clouds. Yicheng Feng, Shihao Shen, Mengwei Xu, Yuanming Ren, Xiaofei Wang, Victor C. M. Leung, Wenyu Wang. 595-604
- SPLIT: QoS-Aware DNN Inference on Shared GPU via Evenly-Sized Model Splitting. Diaohan Luo, Tian Yu, Yuewen Wu, Heng Wu, Tao Wang 0030, Wenbo Zhang 0006. 605-614
- NeiLatS: Neighbor-Aware Latency-Sensitive Application Scheduling in Heterogeneous Cloud-Edge Environment. Huadong Li, Hui Liu 0006, Changyuan Liu, Aoqi Chen, Zhaocheng Niu, Junzhao Du. 615-624
- Dystri: A Dynamic Inference based Distributed DNN Service Framework on Edge. Xueyu Hou, Yongjie Guan, Tao Han 0002. 625-634
- FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference. Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, Michael Gerndt. 635-644
- Output-Directed Dynamic Quantization for DNN Acceleration. Beilei Jiang, Xianwei Cheng, Yuan Li, Jocelyn Zhang, Song Fu, Qing Yang, Mingxiong Liu, Alejandro Olvera. 645-654
- ORAQL - Optimistic Responses to Alias Queries in LLVM. Jan Hückelheim, Johannes Doerfert. 655-664
- Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication. Nigel Tan, Jakob Lüttgau, Jack Marquez, Keita Teranishi, Nicolas Morales, Sanjukta Bhowmick, Franck Cappello, Michela Taufer, Bogdan Nicolae. 665-674
- General-purpose Asynchronous Periodic Checkpointing in Hybrid Memory. Masaki Nakata, Shigeyuki Sato, Tomoharu Ugawa. 675-684
- Conflux: Exploiting Persistent Memory and RDMA Bandwidth via Adaptive I/O Mode Selection. Zhenlin Qi, Shengan Zheng, Yifeng Hui, Bowen Zhang, Linpeng Huang. 685-694
- Marlin: A Concurrent and Write-Optimized B+-tree Index on Disaggregated Memory. Hang An, Fang Wang 0001, Dan Feng 0001, Xiaomin Zou, Zefeng Liu, Jianshun Zhang. 695-704
- GPU Performance Acceleration via Intra-Group Sharing TLB. Weiming Huang, Yajuan Du, Mingyang Liu. 705-714
- DArray: A High Performance RDMA-Based Distributed Array. Baorong Ding, Mingcong Han, Rong Chen 0001. 715-724
- Toward Optimal Repair and Load Balance in Locally Repairable Codes. Hao Zhao, Si Wu 0003, Haifeng Liu 0004, Zhixiang Tang, Xiaochun He, Yinlong Xu. 725-735
- Re-aligning Across-page Requests for Flash-based Solid-state Drives. Zhigang Cai, Chengyong Tang, Minjun Li, François Trahay, Jun Li, Zhibing Sha, Jiaojiao Wu, Fan Yang, Jianwei Liao. 736-745
- DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification. Daegun Yoon, Sangyoon Oh 0001. 746-755
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, Yang You 0001. 766-775
- JSweep: A Patch-centric Data-driven Approach for Parallel Sweeps on Large-scale Meshes. Jie Yan, Zhang Yang, Aiqing Zhang, Zeyao Mo. 776-785
- Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs. Mingzhen Li, Hailong Yang, Shanjun Zhang, Fengwei Yu, Ruihao Gong, Yi Liu, Zhongzhi Luan, Depei Qian. 786-796
- Accelerating Large-Scale CFD Simulations with Lattice Boltzmann Method on a 40-Million-Core Sunway Supercomputer. Zhao Liu, XueSen Chu, Xiaojing Lv, Hanyue Liu, Haohuan Fu, Guangwen Yang. 797-806
- HASpGEMM: Heterogeneity-Aware Sparse General Matrix-Matrix Multiplication on Modern Asymmetric Multicore Processors. Helin Cheng, Wenxuan Li, Yuechen Lu, Weifeng Liu 0002. 807-817
- An Improved Parallel Overset Grid Method for Fluid Simulation with Moving Boundary. Ran Zhao, Chao Li, Xiaowei Guo, Yi Liu, Sifan Long, Sen Zhang, Yanlong Qiu, Canqun Yang. 818-827
- JOSS: Joint Exploration of CPU-Memory DVFS and Task Scheduling for Energy Efficiency. Jing Chen, Madhavan Manivannan, Bhavishya Goel, Miquel Pericàs. 828-838