Reading

Schedule

Date | Paper | Presenter
Jan. 27 | Wilde, Michael, et al. "Swift: A language for distributed parallel scripting." Parallel Computing 37.9 (2011): 633-652. link | Debashis
Feb. 17 | Thomas Herault, Aurelien Bouteiller, George Bosilca, Marc Gamell, Keita Teranishi, Manish Parashar, and Jack Dongarra. 2015. Practical scalable consensus for pseudo-synchronous distributed systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 31, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807665 link | Xiaolong
Feb. 24 | David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. 2012. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 78, 12 pages. link | Xiaolong
Mar. 16 | Costa, Pedro, et al. "Byzantine fault-tolerant MapReduce: Faults are not just crashes." Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on. IEEE, 2011. | Kenrick
Mar. 30 | Yves Robert, et al. "On the complexity of scheduling checkpoints for computational workflows", Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on. | Yuyu
April 6 | Catello Di Martino, et al. "Lessons Learned From the Analysis of System Failures at Petascale: The Case of Blue Waters", DSN 2014. | Xiaolong
April 13 | Ferreira, Kurt, et al. "Evaluating the viability of process replication reliability for exascale systems." Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011. | Kenrick
April 20 | Current status in checkpointing | Yuyu
April 27 | Robert Gerstenberger, Maciej Besta, and Torsten Hoefler. 2013. Enabling highly-scalable remote memory access programming with MPI-3 one sided. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13). | Xiaolong

Papers

Fault tolerance

  • Marc Gamell, Daniel S. Katz, Hemanth Kolla, Jacqueline Chen, Scott Klasky, and Manish Parashar. 2014. Exploring automatic, online failure recovery for scientific applications at extreme scales. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, Piscataway, NJ, USA, 895-906. DOI=http://dx.doi.org/10.1109/SC.2014.78 link
  • Shen Gao, Bingsheng He, and Jianliang Xu. 2015. Real-Time In-Memory Checkpointing for Future Hybrid Memory Systems. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 263-272. DOI=http://dx.doi.org/10.1145/2751205.2751212 link
  • Yanfei Guo, Wesley Bland, Pavan Balaji, and Xiaobo Zhou. 2015. Fault tolerant MapReduce-MPI for HPC clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 34, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807617 link
  • Devesh Tiwari, Saurabh Gupta, George Gallarno, Jim Rogers, and Don Maxwell. 2015. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 38, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807666 link
  • Antonio J. Peña, Wesley Bland, and Pavan Balaji. 2015. VOCL-FT: introducing techniques for efficient soft error coprocessor recovery. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 71, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807640 link
  • Rizwan A. Ashraf, Roberto Gioiosa, Gokcen Kestor, Ronald F. DeMara, Chen-Yong Cher, and Pradip Bose. 2015. Understanding the propagation of transient errors in HPC applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 72, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807670 link
  • Gropp, William, and Ewing Lusk. "Fault tolerance in message passing interface programs." International Journal of High Performance Computing Applications 18.3 (2004): 363-372 link
  • David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. 2012. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 78, 12 pages. link

Power management

  • Peter E. Bailey, Aniruddha Marathe, David K. Lowenthal, Barry Rountree, and Martin Schulz. 2015. Finding the limits of power-constrained application performance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 79, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807637 link
  • Daniel A. Ellsworth, Allen D. Malony, Barry Rountree, and Martin Schulz. 2015. Dynamic power sharing for higher job throughput. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 80, 11 pages. DOI=http://dx.doi.org/10.1145/2807591.2807643 link

MPI

  • Akshay Venkatesh, Abhinav Vishnu, Khaled Hamidouche, Nathan Tallent, Dhabaleswar (DK) Panda, Darren Kerbyson, and Adolfy Hoisie. 2015. A case for application-oblivious energy-efficient MPI runtime. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 29, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807658 link
  • Thomas Herault, Aurelien Bouteiller, George Bosilca, Marc Gamell, Keita Teranishi, Manish Parashar, and Jack Dongarra. 2015. Practical scalable consensus for pseudo-synchronous distributed systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 31, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807665 link

Big data/Cloud

  • Wenting He, Huimin Cui, Binbin Lu, Jiacheng Zhao, Shengmei Li, Gong Ruan, Jingling Xue, Xiaobing Feng, Wensen Yang, and Youliang Yan. 2015. Hadoop+: Modeling and Evaluating the Heterogeneity for MapReduce Applications in Heterogeneous Clusters. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 143-153. DOI=http://dx.doi.org/10.1145/2751205.2751236 link
  • Maciej Besta and Torsten Hoefler. 2015. Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 155-164. DOI=http://dx.doi.org/10.1145/2751205.2751219 link
  • Sergei Shudler, Alexandru Calotoiu, Torsten Hoefler, Alexandre Strube, and Felix Wolf. 2015. Exascaling Your Library: Will Your Implementation Meet Your Expectations? In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 165-175. DOI=http://dx.doi.org/10.1145/2751205.2751216 link
  • Yifan Gong, Bingsheng He, and Amelie Chi Zhou. 2015. Monetary cost optimizations for MPI-based HPC applications on Amazon clouds: checkpoints and replicated execution. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 32, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807612 link
  • Ron C. Chiang, H. Howie Huang, Timothy Wood, Changbin Liu, and Oliver Spatscheck. 2015. IOrchestra: supporting high-performance data-intensive applications in the cloud via collaborative virtualization. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 45, 12 pages. DOI=http://dx.doi.org/10.1145/2807591.2807633 link
  • Jaemyung Kim, Kenneth Salem, Khuzaima Daudjee, Ashraf Aboulnaga, and Xin Pan. 2015. Database high availability using SHADOW systems. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SoCC '15). ACM, New York, NY, USA, 209-221. DOI=http://dx.doi.org/10.1145/2806777.2806841 link
  • Helland, Pat, and David Campbell. "Building on quicksand." arXiv preprint arXiv:0909.1788 (2009) link
  • Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. 2012. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing (SoCC '12). ACM, New York, NY, USA, Article 7, 13 pages. DOI=http://dx.doi.org/10.1145/2391229.2391236 link
  • Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 2013. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys '13). ACM, New York, NY, USA, 351-364. DOI=http://dx.doi.org/10.1145/2465351.2465386 link
  • Costa, P.; Pasin, M.; Bessani, A.N.; Correia, M.P., "On the Performance of Byzantine Fault-Tolerant MapReduce," in Dependable and Secure Computing, IEEE Transactions on, vol. 10, no. 5, pp. 301-313, Sept.-Oct. 2013 link
  • Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012 link

Programming languages

  • Wilde, Michael, et al. "Swift: A language for distributed parallel scripting." Parallel Computing 37.9 (2011): 633-652. link