1 Performance & Energy Optimization Md Abdullah Shahneous Bari, Abid M. Malik, Millad Ghane, Ahmad Qawasmeh, Barbara M. Chapman 11/28/15
2 Layout of the talk Ø Overview Ø Motivation Ø Factors that affect performance & energy optimization Ø Experimental results Ø Conclusion & future work
3 OpenMP Ø De-facto standard for shared-memory parallel programming Ø Thread-based parallelism Ø Mainly two kinds of parallelism: Ø Regular parallelism (work-sharing constructs) Ø Irregular parallelism (task-based constructs)
4 Main Barrier Towards Exascale Computing Ø Power, power, and power Ø 20 MW power limit for exascale machines (DOE) Ø Usually considered the processor vendors' concern Ø But to reach the exascale limit, the software stack has to chip in Ø Any solution?
5 Power Constrained Computing (Overprovisioning) Ø Usually, not all applications use the maximum node power all the time Ø Capping the power at a lower limit Ø Allows extra nodes to be added within the same power budget Extra Nodes → Extra Compute Power
6 Power Constrained Computing (Contd.) Ø More focus on overall system-level performance Ø Some related work: Ø Sarood et al. [1] Ø Patki et al. [2] Ø Rountree et al. [3] 1. Sarood, Osman, et al. "Optimizing power allocation to CPU and memory subsystems in overprovisioned HPC systems." Cluster Computing (CLUSTER), 2013 IEEE International Conference on. IEEE, 2013. 2. Patki, Tapasya, et al. "Exploring hardware overprovisioning in power-constrained, high performance computing." Proceedings of the 27th International ACM Conference on Supercomputing. ACM, 2013. 3. Rountree, Barry, et al. "Beyond DVFS: A first look at performance under a hardware-enforced power bound." Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International. IEEE, 2012.
7 Why OpenMP? Ø Current issue: less focus on per-node performance Ø Challenge: to reach peak throughput, per-node performance must be improved Ø OpenMP is the most popular choice for intra-node parallelism
8 Factors That Impact Work Sharing Parallelism Ø How many workers are working? ~ Threads Ø How is the work scheduled? ~ Scheduling policy Ø How much work are they given at a time? ~ Chunk size Ø How is the data laid out for the workers? ~ Thread affinity Ø What do the workers do during their break? ~ Wait policy
9 Experimental Details Ø Selected parameters: Ø No. of threads (2, 4, 8, 16, 24, 32) Ø Scheduling policy (STATIC, DYNAMIC, GUIDED) Ø Chunk size (1, 8, 32, 64, 128, 256, 512) Ø Wait policy (active, passive) Ø Thread affinity (OMP_PLACES + OMP_PROC_BIND) Ø Power cap levels (55, 70, 85, 100, 115) W Ø Used technology: Ø Intel RAPL (for power capping & energy measurement) Ø OMPT (for kernel-level measurement) Ø Benchmark: NPB (NAS Parallel Benchmarks)
10 Performance improvement using the best configuration compared to the default across all kernels [Chart: % performance improvement (y-axis, 0-100) for each NPB kernel (CG, EP, FT, IS, LU, MG, SP, and UA routines on the x-axis) at the 55 W, 70 W, 85 W, 100 W, and 115 W power cap levels]
11 Energy consumption improvement using the best configuration compared to the default across all kernels [Chart: % energy improvement (y-axis, -20 to 100) for each NPB kernel (CG, EP, FT, IS, LU, MG, SP, and UA routines on the x-axis) at the 55 W, 70 W, 85 W, 100 W, and 115 W power cap levels]
12 Execution time comparison among different configurations (an LU kernel) [Chart: execution time (sec, 0-0.06) at each power cap level from 55 W to 115 W for three cases: the best configuration, the default configuration, and the default configuration without a power cap; configurations are labeled as (threads, schedule, chunk size), with the default being 32, STATIC, 1 and the best configurations being 32, DYNAMIC, 8, 24, DYNAMIC, 8, or 24, GUIDED, 8 depending on the cap level]
13 OpenMP ICVs on DRAM Power Ø Developing a model for power consumption of OpenMP applications [Charts: DRAM power (W) vs. data size; results are based on the STREAM benchmark, where data size X means a STREAM array size of 19,200,000*X] Courtesy: Millad Ghane
14 Impact of threads & scheduling policy in task-based parallelism [Charts: UTS and FloorPlan benchmarks] Courtesy: Ahmad Qawasmeh
15 Ongoing Work Ø Dynamic adaptation (APEX, Active Harmony) Ø Modeling Ø Across different software stacks (OpenMP runtimes): Ø OpenUH Ø GCC Ø Intel Ø Across different hardware architectures: Ø Intel Sandy Bridge Ø IBM POWER8
16 Future Work Ø More concrete configuration selection Ø DRAM capping Ø Fine-grained (core-level) control Ø Other energy-efficient techniques: Ø DVFS, frequency modulation, etc. Ø Combining with inter-node (MPI) programming models for hybrid applications
17 Summary Ø Overview Ø Motivation Ø Factors that affect performance & energy optimization Ø Experimental results Ø Conclusion & future work