Performance & Energy


1 Performance & Energy Optimization @
Md Abdullah Shahneous Bari, Abid M. Malik, Millad Ghane, Ahmad Qawasmeh, Barbara M. Chapman
11/28/15

2 Layout of the talk
- Overview
- Motivation
- Factors that affect performance & energy optimization
- Experimental results
- Conclusion & future work

3 OpenMP
- De facto standard for shared-memory parallel programming
- Thread-based parallelism
- Mainly two kinds of parallelism
  - Regular parallelism (work-sharing constructs)
  - Irregular parallelism (task-based constructs)

4 Main Barrier Towards Exascale Computing
- Power, power, and power
- 20 MW power limit for exascale machines (DOE)
- Usually the processor vendors' concern
- But to reach the exascale limit, the software stack has to chip in
- Any solution?

5 Power Constrained Computing (Overprovisioning)
- Usually, not all applications use maximum node power all the time
- Capping the power at a lower limit
- Allows extra nodes to be added within a similar power budget
- Extra nodes mean extra compute power
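The overprovisioning argument comes down to simple arithmetic. The following is a minimal sketch, assuming a fixed system power budget and treating the per-node cap as the node's entire power draw. The 11,500 W budget and the function name `nodes_within_budget` are hypothetical and chosen only for illustration; the cap levels are the ones used later in the talk.

```python
def nodes_within_budget(budget_watts: int, cap_watts: int) -> int:
    """Return how many nodes fit in the budget when each node is capped."""
    if cap_watts <= 0:
        raise ValueError("cap_watts must be positive")
    return budget_watts // cap_watts


# Hypothetical system-level budget, not a figure from the talk.
BUDGET_W = 11_500

for cap in (115, 100, 85, 70, 55):
    print(f"cap {cap:>3} W -> {nodes_within_budget(BUDGET_W, cap)} nodes")
```

Whether the extra nodes pay off depends on how much each capped node slows down, which is why per-node tuning matters under a power cap.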

6 Power Constrained Computing (Contd.)
- More focus on overall system-level performance
- Some related work:
  - Sarood et al. [1]
  - Patki et al. [2]
  - Rountree et al. [3]

1. Sarood, Osman, et al. "Optimizing power allocation to CPU and memory subsystems in overprovisioned HPC systems." Cluster Computing (CLUSTER), 2013 IEEE International Conference on. IEEE, 2013.
2. Patki, Tapasya, et al. "Exploring hardware overprovisioning in power-constrained, high performance computing." Proceedings of the 27th international ACM conference on International conference on supercomputing. ACM, 2013.
3. Rountree, Barry, et al. "Beyond DVFS: A first look at performance under a hardware-enforced power bound." Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International. IEEE, 2012.

7 Why OpenMP?
- Current issue: less focus on per-node performance
- Challenge: to reach peak throughput, per-node performance must be improved
- OpenMP is the most popular choice for intra-node parallelism

8 Factors That Impact Work Sharing Parallelism
- How many workers are working? ~ Number of threads
- How is the work scheduled? ~ Scheduling policy
- How much work are they given at one time? ~ Chunk size
- How is the data laid out for the workers? ~ Thread affinity
- What do the workers do during their break? ~ Wait policy
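Each of these factors corresponds to a standard OpenMP environment variable. The following is a minimal sketch of that mapping. The helper name `omp_env` and the default values for `places` and `proc_bind` are assumptions made for illustration, since the talk does not list the affinity settings it used. Note that `OMP_SCHEDULE` only affects loops declared with `schedule(runtime)`.

```python
def omp_env(threads, schedule, chunk, wait_policy,
            places="cores", proc_bind="close"):
    """Build the OpenMP environment variables for one configuration.

    places and proc_bind defaults are illustrative assumptions.
    """
    return {
        "OMP_NUM_THREADS": str(threads),               # number of threads
        "OMP_SCHEDULE": f"{schedule.lower()},{chunk}",  # policy and chunk size
        "OMP_WAIT_POLICY": wait_policy.upper(),         # ACTIVE or PASSIVE
        "OMP_PLACES": places,                           # thread affinity: where
        "OMP_PROC_BIND": proc_bind,                     # thread affinity: how
    }


env = omp_env(24, "GUIDED", 8, "passive")
for name, value in env.items():
    print(f"{name}={value}")
```

To launch a benchmark under such a configuration, the dictionary could be merged into the process environment, for example `subprocess.run(["./kernel"], env={**os.environ, **env})`, where `./kernel` is a hypothetical binary.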

9 Experimental Details
- Selected parameters
  - No. of threads (2, 4, 8, 16, 24, 32)
  - Scheduling policy (STATIC, DYNAMIC, GUIDED)
  - Chunk size (1, 8, 32, 64, 128, 256, 512)
  - Wait policy (active, passive)
  - Thread affinity (OMP_PLACES + OMP_PROC_BIND)
- Power cap levels
  - 55, 70, 85, 100, 115 W
- Technology used:
  - Intel RAPL (for power capping & energy measurement)
  - OMPT for kernel-level measurement
- Benchmark ~ NPB
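To get a feel for the size of this parameter space, the listed values can be enumerated directly. This is a minimal sketch that assumes every combination is a candidate, which the slides do not state. Thread affinity is left out because its values are not listed, and the function name `configurations` is hypothetical.

```python
from itertools import product

# Parameter values as listed on the slide.
THREADS = (2, 4, 8, 16, 24, 32)
SCHEDULES = ("STATIC", "DYNAMIC", "GUIDED")
CHUNK_SIZES = (1, 8, 32, 64, 128, 256, 512)
WAIT_POLICIES = ("active", "passive")
POWER_CAPS_W = (55, 70, 85, 100, 115)


def configurations():
    """All (threads, schedule, chunk size, wait policy) combinations."""
    return list(product(THREADS, SCHEDULES, CHUNK_SIZES, WAIT_POLICIES))


configs = configurations()
print(len(configs))                      # 252 combinations per power cap
print(len(configs) * len(POWER_CAPS_W))  # 1260 across the five power caps
```

Even without affinity settings, an exhaustive sweep is already over a thousand runs per kernel, which motivates smarter configuration selection.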

10 Performance improvement using the best configuration compared to default across all kernels
[Figure: bar chart. y-axis: % performance improvement (0 to 100); x-axis: NPB kernels from the CG, EP, FT, IS, LU, MG, SP, and UA benchmarks; series: power caps of 55 W, 70 W, 85 W, 100 W, and 115 W.]

11 Energy consumption improvement using the best configuration compared to default across all kernels
[Figure: bar chart. y-axis: % energy improvement (-20 to 100); x-axis: NPB kernels from the CG, EP, FT, IS, LU, MG, SP, and UA benchmarks; series: power caps of 55 W, 70 W, 85 W, 100 W, and 115 W.]
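The slides do not give the formula behind the improvement percentages. A plausible reading, sketched here as an assumption, is the relative reduction with respect to the default configuration. All measurement values are made up for illustration, and the names `percent_improvement` and `best_config` are hypothetical.

```python
def percent_improvement(default: float, best: float) -> float:
    """Relative reduction versus the default, in percent.

    Positive means the tuned configuration used less time or energy;
    negative means it used more.
    """
    if default <= 0:
        raise ValueError("default must be positive")
    return 100.0 * (default - best) / default


def best_config(measurements: dict) -> tuple:
    """Return the configuration with the smallest measured value."""
    return min(measurements, key=measurements.get)


# Hypothetical execution times in milliseconds for one kernel at one power cap.
times_ms = {
    (32, "STATIC", 1): 50,   # default configuration
    (32, "DYNAMIC", 8): 30,
    (24, "GUIDED", 8): 35,
}

winner = best_config(times_ms)
print(winner)
print(percent_improvement(times_ms[(32, "STATIC", 1)], times_ms[winner]))

# Hypothetical energy readings in joules: the fastest configuration can still
# use more energy than the default, which yields a negative improvement.
print(percent_improvement(10, 12))
```

Under this reading, bars below zero on the energy chart would correspond to kernels where the chosen configuration consumed more energy than the default.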

12 Execution time comparison among different configurations (an LU kernel)
[Figure: grouped bar chart. y-axis: execution time (sec), 0 to 0.06; x-axis: power cap levels (55 W, 70 W, 85 W, 100 W, 115 W); series: best configuration, default configuration, default configuration without power cap.]
Bar labels (threads, scheduling policy, chunk size) as they appear along the x-axis:
- Default configuration without power cap: 115W, 32, STATIC, 1
- Default configuration: 32, STATIC, 1
- Best configuration, one per group in axis order: 32, DYNAMIC, 8; 24, GUIDED, 8; 24, GUIDED, 8; 24, DYNAMIC, 8; 24, GUIDED, 8

13 OpenMP ICVs on DRAM Power
- Developing a model for the power consumption of OpenMP applications
[Figure: three plots. y-axis: power (W); x-axis: data size.]
These results are based on the STREAM benchmark. Data size X means the array size for the STREAM benchmark is 19,200,000*X.
Courtesy: Millad Ghane

14 Impact of threads & scheduling policy in task-based parallelism
[Figure: two panels, UTS and FloorPlan.]
Courtesy: Ahmad Qawasmeh

15 Ongoing Work
- Dynamic adaptation (APEX),
  - Active Harmony
- Modeling
- Across different software stacks (OpenMP runtime),
  - OpenUH
  - GCC
  - Intel
- Across different hardware architectures
  - Intel Sandy Bridge
  - IBM POWER8

16 Future Work
- More concrete configuration selection
- DRAM capping
- Fine-grained (core-level) control
- Other energy-efficient techniques,
  - DVFS, frequency modulation, etc.
- Combining it with an inter-node (MPI) programming model for hybrid applications

17 Summary
- Overview
- Motivation
- Factors that affect performance & energy optimization
- Experimental results
- Conclusion & future work