Analyzing the Power Consumption Behavior of a Large Scale Data Center
KASHIF NIZAM KHAN, AALTO UNIVERSITY, FINLAND.
SANJA S., TAPIO N., JUKKA K. N., SEBASTIAN V. A. & OLLI-PEKKA L.
Outline
• Motivation
• Contributions
• Dataset Description
• Power Consumption of Computing Nodes
• Analysis of Unsuccessful Jobs
• Power Consumption Estimation
• Plug Power Modeling
Motivation
• Data center energy spending keeps increasing
• System power draw is increasing substantially without a breakthrough in energy efficiency
• Economic, social, and environmental pressure to decrease energy costs is growing
• The performance of future HPC systems will be constrained by power cost
• Analysis of data center power consumption logs remains relatively understudied
Contributions
• Investigate how OS counters and RAPL measurements relate to total power consumption
• Analyse unsuccessful jobs and their influence on energy spending
• Cluster the nodes based on the OS counter and RAPL values
• Model and estimate the total power consumption using OS counters and RAPL values
Dataset Description
• Taito computing cluster with 900 nodes, including 460 Sandy Bridge and 397 Haswell nodes
• Approximately 2 days of production data captured in June 2016
• Collected data: vmstat output, RAPL, plug power, and job information
• Sampled at 0.5 Hz
https://research.csc.fi/taito-supercluster
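A sampling rate of 0.5 Hz corresponds to one reading every 2 seconds. The slides do not say how the RAPL values were read, so the following is only a minimal sketch under the assumption that RAPL is exposed as a cumulative energy counter in microjoules (as the Linux powercap interface does via `energy_uj`). The function name, the example readings, and the default wraparound range are hypothetical.

```python
def rapl_power_watts(energy_uj, interval_s=2.0, max_range_uj=2**32):
    """Convert cumulative energy readings (microjoules) to average power (watts).

    interval_s=2.0 matches a 0.5 Hz sampling rate.
    max_range_uj is a placeholder (assumption); on Linux the real value
    would come from the powercap file max_energy_range_uj.
    """
    powers = []
    for prev, curr in zip(energy_uj, energy_uj[1:]):
        delta = curr - prev
        if delta < 0:
            # The counter wrapped around between the two readings.
            delta += max_range_uj
        powers.append(delta / 1e6 / interval_s)
    return powers


# Made-up readings: 100 J and then 120 J consumed in consecutive 2 s windows.
readings = [1_000_000, 101_000_000, 221_000_000]
print(rapl_power_watts(readings))  # [50.0, 60.0]
```

Differencing consecutive readings yields one fewer power value than there are samples, which matters when aligning RAPL power with plug power and vmstat rows.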
Power Consumption of Computing Nodes
[Figures: two unlabeled plots, followed by per-node plots for NODE C581, NODE C749, NODE C836, NODE C585, NODE C626, NODE C775, and NODE C819]
Analysis of Unsuccessful Jobs
• Completed - jobs that ran to completion
• Failed - jobs that failed to complete successfully
• Cancelled - jobs that were cancelled by their users
• Timeout - jobs that did not run to successful completion within the given time limit
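Given the four job states above, the share of energy attributable to unsuccessful jobs can be computed by grouping per-job energy by state. This is a minimal sketch, not the authors' analysis code: the `(state, energy_joules)` record layout, the function names, and the example numbers are all assumptions made for illustration.

```python
from collections import defaultdict

# States treated as "unsuccessful", following the slide's definitions.
UNSUCCESSFUL = {"FAILED", "CANCELLED", "TIMEOUT"}


def energy_share_by_state(jobs):
    """jobs: iterable of (state, energy_joules) pairs (hypothetical layout).

    Returns {state: fraction of total energy}.
    """
    totals = defaultdict(float)
    for state, energy in jobs:
        totals[state.upper()] += energy
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {state: energy / grand_total for state, energy in totals.items()}


def unsuccessful_share(jobs):
    """Fraction of total energy consumed by failed, cancelled, or timed-out jobs."""
    shares = energy_share_by_state(jobs)
    return sum(frac for state, frac in shares.items() if state in UNSUCCESSFUL)


# Illustrative values only; these are not taken from the Taito dataset.
example_jobs = [
    ("Completed", 600.0),
    ("Failed", 200.0),
    ("Cancelled", 100.0),
    ("Timeout", 100.0),
]
print(unsuccessful_share(example_jobs))
```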
Analysis of Unsuccessful Jobs
[Figure; values shown: 16% and 43.5%]
Power Consumption Estimation
• Sample 2% of the data from all the nodes (251,244 data samples)
• The first two-thirds of the data serves as historical data for training the ML models
• The last third of the data is used for validation
• Random Forest gives the best result
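The sampling and train/validation procedure above can be sketched as follows. The slides do not state how the 2% sample was drawn, so uniform random sampling is an assumption here, as are the function names and the seed. The model fitting itself (for example a Random Forest from a machine learning library) is left out.

```python
import random


def sample_fraction(rows, fraction=0.02, seed=0):
    """Keep about `fraction` of the rows, chosen at random, preserving time order.

    Uniform random sampling is an assumption; the slides only say "2% of data".
    """
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    keep = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in keep]


def split_first_two_thirds(rows):
    """Chronological split: first two-thirds for training, last third for validation."""
    cut = len(rows) * 2 // 3
    return rows[:cut], rows[cut:]


# Stand-in for time-ordered measurement rows (here just timestamps).
rows = list(range(3000))
sampled = sample_fraction(rows)
train, valid = split_first_two_thirds(sampled)
print(len(sampled), len(train), len(valid))
```

Splitting by position rather than at random keeps every validation sample later in time than every training sample, which matches the "historical data" framing on the slide.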
Plug Power Modeling
• Aim - model the plug power using OS counters and RAPL measurements
• 30,000 measurements from 'Haswell' type computing nodes
[Figure: correlation matrix (color scale from -1 to 1) over the variables r, b, swpd, free, buff, cache, si, so, bi, bo, in1, cs, us, sy, id, wa, CPU1, DRAM1, CPU2, DRAM2, plug, plug.lag5]
[Figure: histogram of plug power; x-axis "Plug power" (50 to 350), y-axis "Frequency" (0 to 8000)]
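The variable list includes `plug.lag5`, and the first figure is a correlation matrix over all variables. A minimal sketch of both ideas follows, assuming that `plug.lag5` denotes the plug power shifted by 5 samples (the slides do not define it) and that the matrix shows Pearson correlations. The helper names and the example series are hypothetical.

```python
import math


def lag(series, k):
    """Shift a series by k samples (0 <= k <= len(series)).

    The first k entries have no predecessor and are set to None.
    """
    return [None] * k + list(series[:len(series) - k])


def pearson(x, y):
    """Pearson correlation over positions where both series have a value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a, _ in pairs))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for _, b in pairs))
    return cov / (std_x * std_y)


# Made-up plug power readings in watts, one per sample.
plug = [100, 110, 120, 130, 140, 150, 160, 170]
plug_lag5 = lag(plug, 5)
print(plug_lag5)
print(pearson(plug, plug_lag5))
```

Applying `pearson` to every pair of columns would give one cell of the matrix per pair.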
Plug Power Modeling
[Figure] MAPE (mean absolute percentage error): 2.10%
Plug Power Modeling
[Figure] MAPE (mean absolute percentage error): 1.97%
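For reference, MAPE averages the absolute prediction error relative to the measured value and expresses it as a percentage. The sketch below uses made-up numbers, not values from the dataset, and assumes the measured plug power is never zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every value in `actual` is non-zero (true for plug power in watts).
    """
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Illustrative measured vs. predicted plug power in watts.
measured = [200.0, 250.0, 100.0]
estimated = [196.0, 255.0, 102.0]
print(mape(measured, estimated))  # each prediction is off by 2%, so about 2.0
```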
Clustering
[Figure: clustering of nodes based on OS counter and RAPL values]
Conclusion
• Estimating plug power from utilization metrics is promising
• RAPL adds to the accuracy of the models by providing real-time power consumption data
• Considering interactions among RAPL variables reduces the error to 1.87%
• Unsuccessful jobs can consume significant resources and power
• In the future, we aim to use such data center logs to produce job-specific power consumption models
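One common way to let a model consider interactions among the RAPL variables is to add pairwise products of them as extra features. The slides do not specify how the interactions were encoded, so the sketch below is an assumption. The column names come from the variable list in the plug power slides; the function name and example values are hypothetical.

```python
from itertools import combinations

# RAPL columns as named in the variable list of the plug power slides.
RAPL_COLUMNS = ("CPU1", "DRAM1", "CPU2", "DRAM2")


def add_pairwise_interactions(row, columns=RAPL_COLUMNS):
    """Return a copy of `row` with one product feature per pair of columns.

    Pairwise products are an assumed encoding of "interactions".
    """
    out = dict(row)
    for a, b in combinations(columns, 2):
        out[f"{a}:{b}"] = row[a] * row[b]
    return out


# Made-up RAPL power readings in watts.
row = {"CPU1": 40.0, "DRAM1": 5.0, "CPU2": 30.0, "DRAM2": 4.0}
print(add_pairwise_interactions(row))
```

With four RAPL variables this adds six interaction features to each measurement row.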
Thank You!