- MegTaiChi: dynamic tensor-based memory management optimization for DNN training. Zhongzhe Hu, Junmin Xiao, Zheye Deng, Mingyi Li, Kewei Zhang, Xiaoyang Zhang, Ke Meng, Ninghui Sun, Guangming Tan.
- SnuQS: scaling quantum circuit simulation using storage devices. Daeyoung Park, Heehoon Kim, Jinpyo Kim, Taehyun Kim, Jaejin Lee.
- Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems. Khalid Ayedh Alharthi, Arshad Jhumka, Sheng Di, Franck Cappello.
- Seamless optimization of the GEMM kernel for task-based programming models. Arthur Francisco Lorenzon, Sandro Matheus V. N. Marques, Antoni C. Navarro, Vicenç Beltran 0001.
- Towards low-latency I/O services for mixed workloads using ultra-low latency SSDs. Mingzhe Liu, Haikun Liu, Chencheng Ye, Xiaofei Liao, Hai Jin 0001, Yu Zhang 0027, Ran Zheng, Liting Hu.
- Rethinking graph data placement for graph neural network training on multiple GPUs. Shihui Song, Peng Jiang.
- Lifting C semantics for dataflow optimization. Alexandru Calotoiu, Tal Ben-Nun, Grzegorz Kwasniewski, Johannes de Fine Licht, Timo Schneider, Philipp Schaad, Torsten Hoefler.
- uiCA: accurate throughput prediction of basic blocks on recent Intel microarchitectures. Andreas Abel 0002, Jan Reineke.
- MASTIFF: structure-aware minimum spanning tree/forest. Mohsen Koohi Esfahani, Peter Kilpatrick, Hans Vandierendonck.
- Handling heavy-tailed input of transformer inference on GPUs. Jiangsu Du, Jiazhi Jiang, Yang You, Dan Huang, Yutong Lu.
- Fast-track cache: a huge racetrack memory L1 data cache. Hugo Tárrega, Alejandro Valero, Vicente Lorente, Salvador Petit, Julio Sahuquillo.
- ASAP: automatic synthesis of area-efficient and precision-aware CGRAs. Cheng Tan 0002, Thierry Tambe, Jeff Jun Zhang, Bo Fang, Tong Geng, Gu-Yeon Wei, David Brooks 0001, Antonino Tumeo, Ganesh Gopalakrishnan, Ang Li.
- LITE: a low-cost practical inter-operable GPU TEE. Ardhi Wiratama Baskara Yudha, Jake Meyer, Shougang Yuan, Huiyang Zhou, Yan Solihin.
- Toward accelerated stencil computation by adapting tensor core unit on GPU. Xiaoyan Liu, Yi Liu, Hailong Yang, Jianjin Liao, Mingzhen Li, Zhongzhi Luan, Depei Qian.
- High throughput multidimensional tridiagonal system solvers on FPGAs. Kamalavasan Kamalakkannan, Gihan R. Mudalige, István Z. Reguly, Suhaib A. Fahmy.
- Optimized MPI collective algorithms for dragonfly topology. Guangnan Feng, Dezun Dong, Yutong Lu.
- Preparing for performance analysis at exascale. Jonathon M. Anderson, Yumeng Liu, John M. Mellor-Crummey.
- Efficient, out-of-memory sparse MTTKRP on massively parallel architectures. Andy Nguyen, Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Jesmin Jahan Tithi, Yongseok Soh, Teresa M. Ranadive, Fabrizio Petrini, Jee W. Choi.
- Dynamic memory management in massively parallel systems: a case on GPUs. Minh Pham, Hao Li, Yongke Yuan, Chengcheng Mou, Kandethody Ramachandran, Zichen Xu, Yicheng Tu.
- Parallel K-clique counting on GPUs. Mohammad Almasri, Izzat El Hajj, Rakesh Nagi, Jinjun Xiong, Wen-mei Hwu.
- SnuHPL: high performance LINPACK for heterogeneous GPUs. Jinpyo Kim, Hyungdal Kwon, Jintaek Kang, Jihwan Park, Seungwook Lee, Jaejin Lee.
- Low overhead and context sensitive profiling of GPU-accelerated applications. Keren Zhou, Jonathon M. Anderson, Xiaozhu Meng, John M. Mellor-Crummey.
- AnySeq/GPU: a novel approach for faster sequence alignment on GPUs. André Müller, Bertil Schmidt, Richard Membarth, Roland Leißa, Sebastian Hack.
- SparseLNR: accelerating sparse tensor computations using loop nest restructuring. Adhitha Dias, Kirshanthan Sundararajah, Charitha Saumya, Milind Kulkarni 0001.
- CEAZ: accelerating parallel I/O via hardware-algorithm co-designed adaptive lossy compression. Chengming Zhang 0006, Sian Jin, Tong Geng, Jiannan Tian, Ang Li, Dingwen Tao.
- KrakenOnMem: a memristor-augmented HW/SW framework for taxonomic profiling. Taha Shahroodi, Mahdi Zahedi, Abhairaj Singh, Stephan Wong, Said Hamdioui.
- Beyond time complexity: data movement complexity analysis for matrix multiplication. Wesley Smith, Aidan Goldfarb, Chen Ding 0001.
- Software-defined floating-point number formats and their application to graph processing. Hans Vandierendonck.
- A data-centric optimization framework for machine learning. Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li 0002, Torsten Hoefler.
- Efficient exact K-nearest neighbor graph construction for billion-scale datasets using GPUs with tensor cores. Zhuoran Ji, Cho-Li Wang.
- GAPS: GPU-acceleration of PDE solvers for wave simulation. Bagus Hanindhito, Dimitrios Gourounas, Arash Fathi, Dimitar Trenev, Andreas Gerstlauer, Lizy K. John.
- Performance-detective: automatic deduction of cheap and accurate performance models. Larissa Schmid, Marcin Copik, Alexandru Calotoiu, Dominik Werle, Andreas Reiter, Michael Selzer, Anne Koziolek, Torsten Hoefler.
- VICO: demand-driven verification for improving compiler optimizations. Sharjeel Khan, Bodhisatwa Chatterjee, Santosh Pande.
- Dense dynamic blocks: optimizing SpMM for processors with vector and matrix units using machine learning techniques. Serif Yesil, José E. Moreira, Josep Torrellas.
- PAME: precision-aware multi-exit DNN serving for reducing latencies of batched inferences. Shulai Zhang, Weihao Cui, Quan Chen, Zhengnian Zhang, Yue Guan, Jingwen Leng, Chao Li, Minyi Guo.
- Cloak: tolerating non-volatile cache write latency. Apostolos Kokolis, Namrata Mantri, Shrikanth Ganapathy, Josep Torrellas, John Kalamatianos.
- Bring orders into uncertainty: enabling efficient uncertain graph processing via novel path sampling on multi-accelerator systems. Heng Zhang 0005, Lingda Li, Hang Liu 0001, Donglin Zhuang, Rui Liu, Chengying Huan, Shuang Song, Dingwen Tao, Yongchao Liu, Charles He, Yanjun Wu, Shuaiwen Leon Song.
- Calipers: a criticality-aware framework for modeling processor performance. Hossein Golestani, Rathijit Sen, Vinson Young, Gagan Gupta.
- Efficiently emulating high-bitwidth computation with low-bitwidth hardware. Zixuan Ma, Haojie Wang, Guanyu Feng, Chen Zhang, Lei Xie, Jiaao He, Shengqi Chen 0001, Jidong Zhai.