- A scalable framework for adaptive computational general relativity on heterogeneous clusters. Milinda Fernando, David Neilsen, Eric W. Hirschmann, Hari Sundar. 1-12
- Parallelizing cryo-EM 3D reconstruction on GPU cluster with a partitioned and streamed model. Kunpeng Wang, Shizhen Xu, Haohuan Fu, Hongkun Yu, Wenlai Zhao, Guangwen Yang. 13-23
- Efficient GPU tree walks for effective distributed n-body simulations. Jianqiao Liu, Michael Robson, Thomas Quinn, Milind Kulkarni. 24-34
- Hybrid CPU/GPU clustering in shared memory on the billion point scale. Michael Gowanlock. 35-45
- Accelerating reduction and scan using tensor core units. Abdul Dakkak, Cheng Li, Jinjun Xiong, Isaac Gelado, Wen-mei W. Hwu. 46-57
- Laius: Towards latency awareness and improved utilization of spatial multitasking accelerators in datacenters. Wei Zhang, Weihao Cui, Kaihua Fu, Quan Chen, Daniel Edward Mawhirter, Bo Wu, Chao Li, Minyi Guo. 58-68
- HYPHA: a framework based on separation of parallelisms to accelerate persistent homology matrix reduction. Simon Zhang, Mengbai Xiao, Chengxin Guo, Liang Geng, Hao Wang, Xiaodong Zhang. 69-81
- SDC: a software defined cache for efficient data indexing. Fan Ni, Song Jiang, Hong Jiang, Jian Huang, Xingbo Wu. 82-93
- IA-SpGEMM: an input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication. Zhen Xie, Guangming Tan, Weifeng Liu, Ninghui Sun. 94-105
- TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs. Jieyang Chen, Nan Xiong, Xin Liang, Dingwen Tao, Sihuan Li, Kaiming Ouyang, Kai Zhao, Nathan DeBardeleben, Qiang Guan, Zizhong Chen. 106-116
- Least squares solvers for distributed-memory machines with GPU accelerators. Jakub Kurzak, Mark Gates, Ali Charara, Asim YarKhan, Jack J. Dongarra. 117-126
- A communication-avoiding 3D sparse triangular solver. Piyush Sao, Ramakrishnan Kannan, Xiaoye Sherry Li, Richard W. Vuduc. 127-137
- Using performance models to understand scalable Krylov solver performance at scale for structured grid problems. Paul R. Eller, Torsten Hoefler, William Gropp. 138-149
- Performance optimization of reactive molecular dynamics simulations with dynamic charge distribution models on distributed memory platforms. Kurt A. O'Hearn, Abdullah Alperen, Hasan Metin Aktulga. 150-159
- AMPT-GA: automatic mixed precision floating point tuning for GPU applications. Pradeep V. Kotipalli, Ranvijay Singh, Paul Wood, Ignacio Laguna, Saurabh Bagchi. 160-170
- GPU snapshot: checkpoint offloading for GPU-dense systems. Kyushick Lee, Michael B. Sullivan, Siva Kumar Sastry Hari, Timothy Tsai, Stephen W. Keckler, Mattan Erez. 171-183
- Address-stride assisted approximate load value prediction in GPUs. Haonan Wang, Mohamed Ibrahim, Sparsh Mittal, Adwait Jog. 184-194
- Diligent TLBs: a mechanism for exploiting heterogeneity in TLB miss behavior. Hussein Elnawawy, Rangeen Basu Roy Chowdhury, Amro Awad, Gregory T. Byrd. 195-205
- QoSMT: supporting precise performance control for simultaneous multithreading architecture. Xin Jin, Yaoyang Zhou, Bowen Huang, Zihao Yu, Xusheng Zhan, Huizhe Wang, Sa Wang, Ningmei Yu, Ninghui Sun, Yungang Bao. 206-216
- An online quality management framework for approximate communication in network-on-chips. Yuechen Chen, Ahmed Louri. 217-226
- Efficient and effective sparse tensor reordering. Jiajia Li, Bora Uçar, Ümit V. Çatalyürek, Jimeng Sun, Kevin J. Barker, Richard W. Vuduc. 227-237
- On optimizing distributed non-negative Tucker decomposition. Venkatesan T. Chakaravarthy, Shivmaran S. Pandian, Saurabh Raje, Yogish Sabharwal. 238-249
- GPU road network graph contraction and SSSP query. Roozbeh Karimi, David M. Koppelman, Chris J. Michael. 250-260
- Multi-criteria partitioning of multi-block structured grids. Hengjie Wang, Aparna Chandramowlishwaran. 261-271
- Avalon: towards QoS awareness and improved utilization through multi-resource management in datacenters. Quan Chen, Zhenning Wang, Jingwen Leng, Chao Li, Wenli Zheng, Minyi Guo. 272-283
- Can we trust profiling results?: understanding and fixing the inaccuracy in modern profilers. Hao Xu, Qingsen Wang, Shuang Song, Lizy Kurian John, Xu Liu. 284-295
- Power efficient job scheduling by predicting the impact of processor manufacturing variability. Dimitrios Chasapis, Miquel Moretó, Martin Schulz, Barry Rountree, Mateo Valero, Marc Casas. 296-307
- GreenMM: energy efficient GPU matrix multiplication through undervolting. Hadi Zamani, Yuanlai Liu, Devashree Tripathy, Laxmi N. Bhuyan, Zizhong Chen. 308-318
- WCCV: improving the vectorization of IF-statements with warp-coherent conditions. Huihui Sun, Florian Fey, Jie Zhao, Sergei Gorlatch. 319-329
- Automatic construct selection and variable classification in OpenMP. Mohammad Norouzi Arab, Felix Wolf, Ali Jannesari. 330-341
- Efficient thread/page/parallelism autotuning for NUMA systems. Mihail Popov, Alexandra Jimborean, David Black-Schaffer. 342-353
- Efficient hierarchical online-autotuning: a case study on polyhedral accelerator mapping. Philip Pfaffe, Tobias Grosser, Martin Tillmann. 354-366
- Software combining to mitigate multithreaded MPI contention. Abdelhalim Amer, Charles Archer, Michael Blocksome, Chongxiao Cao, Michael Chuvelev, Hajime Fujita, Maria Garzaran, Yanfei Guo, Jeff R. Hammond, Shintaro Iwasaki, Kenneth J. Raffenetti, Mikhail Shiryaev, Min-Si, Kenjiro Taura, Sagar Thapaliya, Pavan Balaji. 367-379
- Optimizing computation-communication overlap in asynchronous task-based programs. Emilio Castillo, Nikhil Jain, Marc Casas, Miquel Moretó, Martin Schulz, Ramón Beivide, Mateo Valero, Abhinav Bhatele. 380-391
- Henosis: workload-driven small array consolidation and placement for HDF5 applications on heterogeneous data stores. Donghe Kang, Vedang Patel, Ashwati Nair, Spyros Blanas, Yang Wang, Srinivasan Parthasarathy. 392-402
- DeepHiR: improving high-radix router throughput with deep hybrid memory buffer microarchitecture. Cunlu Li, Dezun Dong, Xiangke Liao, John Kim, Changhyun Kim. 403-413
- The anatomy of efficient FFT and Winograd convolutions on modern CPUs. Aleksandar Zlateski, Zhen Jia, Kai Li, Frédo Durand. 414-424
- Optimizing the linear fascicle evaluation algorithm for many-core systems. Karan Aggarwal, Uday Bondhugula. 425-437
- Deep reuse: streamline CNN inference on the fly via coarse-grained computation reuse. Lin Ning, Xipeng Shen. 438-448
- Full-stack optimization for accelerating CNNs using powers-of-two weights with FPGA validation. Bradley McDanel, Sai Qian Zhang, H. T. Kung, Xin Dong. 449-460
- O3BNN: an out-of-order architecture for high-performance binarized neural network inference with fine-grained pruning. Tong Geng, Tianqi Wang, Chunshu Wu, Chen Yang, Wei Wu, Ang Li, Martin C. Herbordt. 461-472
- RFAcc: a 3D ReRAM associative array based random forest accelerator. Lei Zhao, Quan Deng, Youtao Zhang, Jun Yang. 473-483
- BonVoision: leveraging spatial data smoothness for recovery from memory soft errors. Bo Fang, Hassan Halawa, Karthik Pattabiraman, Matei Ripeanu, Sriram Krishnamoorthy. 484-496
- GPUGuard: mitigating contention based side and covert channel attacks on GPUs. Qiumin Xu, Hoda Naghibijouybari, Shibo Wang, Nael B. Abu-Ghazaleh, Murali Annavaram. 497-509
- Dynamically linked MSHRs for adaptive miss handling in GPUs. Yongbin Gu, Lizhong Chen. 510-521