HPCG on Tianhe-2. Yutong Lu 1, Chao Yang 2, Yunfei Du 1

HPCG on Tianhe-2
Yutong Lu 1, Chao Yang 2, Yunfei Du 1
1 Changsha, Hunan, China
2 Institute of Software, CAS, Beijing, China

Outline
- HPCG result overview on Tianhe-2
- Key optimization work
  - Hybrid HPCG: CPU+MIC

HPCG result
- Tianhe-2 (nudt-v11)
  - HPCG version 2.4
  - Hybrid code
  - Full system scale (3 MICs per node)
  - Problem size 136*176*176
- Result
  - 623280 GFlops
  - 1.14% of peak performance
  - Efficiency 81.15%
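As a quick consistency check on the figures above, the measured rate and the quoted fraction of peak together imply a peak rate. The snippet below is only arithmetic on the two slide values; the helper name `implied_peak_pflops` is made up for this illustration.

```python
# Figures quoted on the slide.
hpcg_gflops = 623280.0      # measured HPCG result, in GFlop/s
fraction_of_peak = 0.0114   # "1.14% of peak performance"

def implied_peak_pflops(gflops, fraction):
    """Peak rate (PFlop/s) implied by a measured rate and its fraction of peak."""
    return gflops / fraction / 1.0e6

hpcg_pflops = hpcg_gflops / 1.0e6
peak_pflops = implied_peak_pflops(hpcg_gflops, fraction_of_peak)
print(f"HPCG: {hpcg_pflops:.3f} PFlop/s, implied peak: {peak_pflops:.1f} PFlop/s")
```

The implied peak carries the rounding of the 1.14% figure, so it should be read as approximate.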

Optimization
- Intra-node
  - Improve the performance of the hybrid single node
- Inter-node
  - Improve scalability
- Choose a suitable problem size to balance both aspects

Optimization: Intra-node Partition
- An inner-outer subdomain partition strategy
  - A regular inner part for each MIC device, an irregular outer part for the CPU
  - Isolates MIC computation from MPI communication and avoids data movement between different MIC devices, which makes computation-communication overlap possible
- Two alternatives
  - 1 MPI process per node (old nudt-v06)
    - 3 inner tasks + 1 outer task (per process)
    - Larger optimization overhead, because asynchronous memory allocation on MIC performs poorly
  - 3 MPI processes per node (new nudt-v11)
    - 1 inner task + 1 outer task (per process)
    - 6 out of 8 CPU cores for each outer task
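The inner-outer split can be pictured with a small NumPy sketch. It is an illustration only, under several assumptions that the slides do not state: the quoted problem size is treated as one process's local grid, the outer part is taken to be a shell of uniform thickness on all six faces, and the function name `inner_outer_masks` is hypothetical.

```python
import numpy as np

def inner_outer_masks(shape, thickness):
    """Split a local grid into a regular inner box and an outer shell.

    Returns boolean masks (inner, outer). `thickness` is the number of grid
    layers on every face assigned to the outer part (an assumption of this
    sketch, not a detail taken from the slides).
    """
    if any(n <= 2 * thickness for n in shape):
        raise ValueError("grid too small for the requested outer thickness")
    inner = np.zeros(shape, dtype=bool)
    core = tuple(slice(thickness, n - thickness) for n in shape)
    inner[core] = True
    return inner, ~inner

# Problem size from the slides; thickness 4 is the finest-level value
# quoted in the load-balance slide.
inner, outer = inner_outer_masks((136, 176, 176), thickness=4)
print("inner points:", int(inner.sum()), "outer points:", int(outer.sum()))
```

In this picture the inner box would go to a MIC device and the shell, which touches every MPI neighbor, would stay on the CPU.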

Optimization: Optimizing MPI Communication
- Pipelined CG (Ghysels 2013 ParCo) to hide global communication
  - Mathematically equivalent to standard CG in exact arithmetic
  - Needs only one global communication per iteration, covering two dot products and one norm
  - The global communication can be overlapped with the preconditioner and SpMV
  - The number of WAXPBYs per iteration increases from 3 to 8; the added cost can be reduced by proper kernel fusion
- Overlapping neighbor communication with computation in SpMV
  - Based on the inner-outer subdomain partition
  - Halo exchange is overlapped with the computation of the inner part
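A serial NumPy sketch of preconditioned pipelined CG, following the recurrences published by Ghysels and Vanroose, may help make the counts on this slide concrete. It is not the authors' code: the function name, the Jacobi preconditioner, and the small tridiagonal test matrix are assumptions chosen for illustration, and in a distributed run the three reductions would be fused into one non-blocking global reduction.

```python
import numpy as np

def pipelined_pcg(A, b, precond, tol=1e-10, max_iter=200):
    """Preconditioned pipelined CG; `precond(v)` applies M^{-1} to v."""
    x = np.zeros_like(b)
    r = b - A @ x
    u = precond(r)
    w = A @ u
    z = q = s = p = np.zeros_like(b)
    gamma_old = alpha = 1.0
    for i in range(max_iter):
        # The only global reductions of the iteration: two dots and one norm.
        gamma, delta, rnorm = r @ u, w @ u, np.linalg.norm(r)
        if rnorm <= tol:
            return x, i
        # Preconditioner and SpMV: the work that can hide the reduction.
        m = precond(w)
        n = A @ m
        if i == 0:
            beta, alpha = 0.0, gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha)
        # Eight vector updates, versus three (x, r, p) in standard CG.
        z = n + beta * z
        q = m + beta * q
        s = w + beta * s
        p = u + beta * p
        x = x + alpha * p
        r = r - alpha * s
        u = u - alpha * q
        w = w - alpha * z
        gamma_old = gamma
    return x, max_iter

# Small symmetric positive definite test problem with a Jacobi preconditioner.
size = 64
A = 4.0 * np.eye(size) - np.eye(size, k=1) - np.eye(size, k=-1)
b = np.ones(size)
diag = np.diag(A).copy()
x, iters = pipelined_pcg(A, b, lambda v: v / diag)
print("iterations:", iters, "residual:", np.linalg.norm(b - A @ x))
```

The eight updates share operands, which is why fusing them into fewer kernels, as the slide suggests, can recover much of the extra cost.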

Optimization: Load Balance between CPU and MIC
- 4-level (ratio 2) V-cycle geometric multigrid preconditioner
  - Load imbalance arises between CPU and MIC if inner-outer partitions are used on all levels (the outer thickness on the finest level must then be at least 8)
  - Adjust the outer thickness on the finest level to 4
    - Hybrid inner-outer partitions on grid levels 0, 1, 2
    - CPU-only partition on grid level 3
    - Some extra PCI Express transfer is needed to pull the inner blocks from MIC to CPU
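One way to read the thickness figures on this slide is sketched below. It assumes, without confirmation from the slides, that the outer thickness is coarsened by the same ratio as the grid and that a level can use the hybrid partition only if the finest-level thickness divides evenly down to at least one layer on that level. The function name `level_plan` is hypothetical.

```python
def level_plan(fine_shape, fine_thickness, levels=4, ratio=2):
    """Per-level grid shape, outer thickness, and partition type.

    Assumption of this sketch: a level is hybrid only when the finest-level
    outer thickness is divisible by the cumulative coarsening factor.
    """
    plan = []
    for level in range(levels):
        scale = ratio ** level
        shape = tuple(n // scale for n in fine_shape)
        if fine_thickness % scale == 0:
            thickness, kind = fine_thickness // scale, "hybrid inner-outer"
        else:
            thickness, kind = 0, "CPU-only"
        plan.append((level, shape, thickness, kind))
    return plan

for fine_thickness in (8, 4):
    print(f"outer thickness {fine_thickness} on the finest level:")
    for level, shape, thickness, kind in level_plan((136, 176, 176), fine_thickness):
        print(f"  level {level}: grid {shape}, outer thickness {thickness}, {kind}")
```

Under these assumptions a finest-level thickness of 8 keeps all four levels hybrid, while a thickness of 4 leaves levels 0 to 2 hybrid and level 3 on the CPU only, matching the configuration listed above.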

Optimization: Asynchronous Data Transfer Scheme
- Data movement is needed in SpMV and SymGS.
- Pack and exchange the halo information at the beginning of the current kernel.
- Exploit the CPU's waiting time to pack and transfer data from the CPU to the MIC device during the preceding kernel, thus eliminating the MIC's waiting time.
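The idea of issuing a transfer during the preceding kernel can be sketched with a background worker thread. This is a generic prefetch pattern, not the offload mechanism used on Tianhe-2: `pack_and_transfer` and `run_kernels` are hypothetical names, the "device buffer" is an ordinary Python list, and the data for every kernel is assumed to be known up front, which simplifies the real dependency structure.

```python
from concurrent.futures import ThreadPoolExecutor

def pack_and_transfer(halo):
    """Stand-in for packing halo data and copying it to the device."""
    return list(halo)  # the copy plays the role of the device-side buffer

def run_kernels(kernels, halos):
    """Run kernels in order, prefetching the next kernel's data.

    kernels[i](buffer) consumes the buffer produced from halos[i]. The
    transfer for kernel i+1 is submitted before kernel i starts, so it can
    proceed in the background while kernel i computes.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(pack_and_transfer, halos[0])
        for i, kernel in enumerate(kernels):
            buffer = pending.result()  # ideally already complete
            if i + 1 < len(kernels):
                pending = copier.submit(pack_and_transfer, halos[i + 1])
            results.append(kernel(buffer))
    return results

results = run_kernels([sum, max, min], [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(results)
```

When the transfer finishes before the current kernel does, the next kernel finds its buffer ready and does not wait, which is the effect the slide describes for the MIC.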

Optimization: Others on MIC and CPU
- Sparse matrix storage format
  - SELLPACK on MIC, ELLPACK on CPU
- SIMDization
  - Using gather and streaming store instructions on MIC
- Different red-black reordering methods
  - Block multi-color ordering on grid level 0
  - Fusing the forward and backward sweeps on other levels
- Different parallel methods
  - Multi-level parallelism on the MIC side
- Optimization of communication among OpenMP threads
  - Employing lightweight kernels such as WaitNeighbors and IntraBarrier instead of a global barrier
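For readers unfamiliar with the storage format named above, here is a minimal NumPy sketch of ELLPACK and the corresponding SpMV. The function names and the tridiagonal test matrix are assumptions made for illustration. SELLPACK presumably refers to the sliced variant, which pads each group of rows separately; that variant is not shown.

```python
import numpy as np

def to_ellpack(dense):
    """Convert a dense matrix to ELLPACK arrays (values, column indices).

    Every row is padded to the length of the longest row. Padding entries
    have value 0 and column index 0, so they add nothing to a product.
    """
    rows = [np.flatnonzero(row) for row in dense]
    width = max(len(cols) for cols in rows)
    values = np.zeros((len(rows), width))
    columns = np.zeros((len(rows), width), dtype=np.int64)
    for i, cols in enumerate(rows):
        values[i, :len(cols)] = dense[i, cols]
        columns[i, :len(cols)] = cols
    return values, columns

def ellpack_spmv(values, columns, x):
    """y = A @ x: gather x by column index, multiply, sum along each row."""
    return (values * x[columns]).sum(axis=1)

# Small tridiagonal example: at most three nonzeros per row.
A = 2.0 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
x = np.arange(5, dtype=float)
values, columns = to_ellpack(A)
y = ellpack_spmv(values, columns, x)
print(values.shape, y)
```

The fixed row width gives regular, vector-friendly loops, and the indexed read `x[columns]` is the kind of access that gather instructions serve on SIMD hardware.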

Scalability

Future Directions
- Further optimization
  - Continue to improve the hybrid method
  - Communication optimization tailored to the network topology, aiming to improve efficiency

ytlu@nudt.edu.cn
yangchao@iscas.ac.cn