LPGPU. Low-Power Parallel Computing on GPUs. Ben Juurlink. Technische Universität Berlin. EPoPPEA workshop



Critical Questions We Seek to Ask
- Power consumption has become the critical limiting factor in the performance of processors (both CPUs and GPUs)
- GPUs are becoming the vanguard of parallel programming, delivering increasingly greater performance and programmability
- But the critical issue for power consumption is bandwidth and hierarchical memory architectures, about which we have very little reliable information
- Questions we seek to obtain answers to:
  - How do we compare the huge range of memory architecture choices?
  - What are the bandwidth requirements for performance-critical software on hierarchical memory architectures?
  - How can we optimize software for new memory architectures?
  - What tools do we need to bring performance-critical software onto GPUs?

Partners
- To answer these questions we have brought together a set of complementary groups
- To analyse the software on different architectures, we have:
  - A commercial tools provider: Codeplay
  - An academic tools and architecture research group at TU Berlin
- To produce GPU designs and memory architectures, we have:
  - Think Silicon: a GPU architecture designer
  - An academic architecture research group at Uppsala
- To produce relevant benchmark software, we have:
  - Geomerics: a producer of new real-time lighting software for games
  - AiGameDev.com: a company that researches and teaches commercial game AI techniques

Project Objectives
- To develop applications for and port applications to massively parallel, low-power GPUs
  - lighting, game AI, video coding
- To develop a set of tools that will allow analyzing and reducing power consumption
- To propose and evaluate architectural enhancements that enable the efficient execution of applications that contain a lot of conditionally executed code
  - To evaluate the trade-off of SIMD versus MIMD
- To propose and evaluate architectural techniques to reduce the power consumption of GPUs
- To develop a hardware demonstrator for the most promising architecture techniques
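The SIMD versus MIMD trade-off for conditionally executed code can be illustrated with a toy cycle model. This is only a sketch under simplifying assumptions: a SIMD group is assumed to execute every branch path that at least one of its lanes takes, one path after the other, while MIMD lanes each execute only their own path. The function names and cycle costs are hypothetical and do not come from the project.

```python
def simd_cycles(lane_takes_branch, then_cost, else_cost):
    """Cycles for one SIMD group under the assumed model.

    The whole group steps through every path that at least one lane takes,
    so a divergent branch costs the sum of both paths.
    """
    cycles = 0
    if any(lane_takes_branch):
        cycles += then_cost      # some lane needs the 'then' path
    if not all(lane_takes_branch):
        cycles += else_cost      # some lane needs the 'else' path
    return cycles


def mimd_cycles(lane_takes_branch, then_cost, else_cost):
    """Cycles for independent lanes: done when the slowest lane is done."""
    return max(then_cost if taken else else_cost for taken in lane_takes_branch)


uniform = [True] * 8           # all lanes agree on the branch
divergent = [True, False] * 4  # lanes alternate between the two paths

print(simd_cycles(uniform, 10, 6), mimd_cycles(uniform, 10, 6))
print(simd_cycles(divergent, 10, 6), mimd_cycles(divergent, 10, 6))
```

In this model the two organisations cost the same when all lanes agree, and SIMD pays for both paths once the lanes diverge. The model ignores the extra control hardware and energy that MIMD lanes need, which is the other side of the trade-off.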

Power: Where is it being used?
[Figures: from Bill Dally's presentation at SC10; John Gustafson, HPC User Forum, Seattle, September 2010]
- To deal with power, we need to control how far data has to move, right down to tiny distances on a chip
- Even different kinds of registers have massively different power consumptions
- We want to measure and investigate this

GPU Power Density
[Figure: GPU power density; original figure due to John Y. Chen, NVIDIA]

Applications
- SIMD GPUs are most suited to data-parallel workloads
- But many important application domains (e.g. advanced lighting, game AI) are control-intensive
- According to game developers, increased GPU performance is not leading to improvements in visual quality, because the way GPUs render graphics fundamentally restricts their flexibility
- Need to investigate new graphics techniques and how they impact GPU design

Applications: Graphics
- Port Enlighten real-time radiosity to mobile (in progress)
- Mobile graphics is radically different from desktop
  - PowerVR architecture: tile-based deferred renderer in hardware
- Investigate new software techniques for mobile graphics

Applications: Video Codecs
- Video coding applications require more computing power with each generation (e.g. FHD (1920x1080) to QHD (3840x2160))
- No direct match between video requirements and GPU capabilities:
  - Entropy decoding: bit-level dependencies, not appropriate for GPU
  - Inverse transform (IDCT): frame-level parallelism, regular data accesses
  - Motion compensation (MC): frame-level parallelism, non-regular data accesses, branch divergence due to multiple interpolation modes
  - Intra-prediction: wavefront parallelism, branch divergence
  - Deblocking filter: wavefront parallelism, divergence due to pixel adaptation
- Current work: H.264/AVC IDCT on GPU
- Next steps: High Efficiency Video Coding (HEVC) on GPUs
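The wavefront parallelism mentioned for intra-prediction and the deblocking filter can be sketched as a schedule over a grid of blocks. The sketch below assumes that a block at (row, col) depends only on already-processed neighbours to its left and above. With `skew=1` this covers left, top, and top-left neighbours; `skew=2` additionally respects a top-right neighbour. The function name and the `skew` parameter are illustrative choices, not taken from any codec implementation.

```python
from collections import defaultdict


def wavefronts(rows, cols, skew=1):
    """Group block coordinates into waves that can run in parallel.

    Blocks with the same value of col + skew * row have no dependencies
    on each other under the assumed neighbour pattern, so each wave can
    be processed concurrently once the previous wave has finished.
    """
    waves = defaultdict(list)
    for r in range(rows):
        for c in range(cols):
            waves[c + skew * r].append((r, c))
    return [waves[k] for k in sorted(waves)]


for step, wave in enumerate(wavefronts(2, 3)):
    print(step, wave)
```

The available parallelism ramps up and down along the waves, so the first and last waves keep few lanes busy. This is one reason wavefront kernels map less cleanly onto wide GPUs than the frame-level parallel stages do.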

Tools: Kernel Fusion
- GPU applications consist of several kernels
- If the data set is larger than on-chip memory, data must be streamed on- and off-chip
- Off-chip memory accesses consume two orders of magnitude more energy than on-chip memory accesses
- Goal is to develop a tool that fuses kernels such that kernels are iteratively applied to a data subset that can be kept on-chip
[Figure: kernel1, kernel2]
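The idea behind kernel fusion can be sketched by counting off-chip accesses for two element-wise kernels, run first separately and then fused over tiles. This is a minimal illustration, not the project's tool: `kernel1` and `kernel2` are hypothetical stand-ins, the traffic counter treats every element read or written outside a tile as one off-chip access, and the tile is assumed to fit in on-chip memory.

```python
def kernel1(x):
    return x * 2.0  # hypothetical first kernel


def kernel2(x):
    return x + 1.0  # hypothetical second kernel


def run_unfused(data):
    """Run the kernels one after the other over the whole data set."""
    traffic = 0
    tmp = [kernel1(x) for x in data]
    traffic += 2 * len(data)  # read input, write intermediate off-chip
    out = [kernel2(x) for x in tmp]
    traffic += 2 * len(data)  # read intermediate, write output off-chip
    return out, traffic


def run_fused(data, tile_size):
    """Apply both kernels to one tile at a time while it stays on-chip."""
    out, traffic = [], 0
    for start in range(0, len(data), tile_size):
        tile = data[start:start + tile_size]
        traffic += len(tile)  # load the tile from off-chip once
        tile = [kernel2(kernel1(x)) for x in tile]  # intermediate never leaves the chip
        out.extend(tile)
        traffic += len(tile)  # store the final result once
    return out, traffic


data = [float(i) for i in range(10)]
print(run_unfused(data)[1], run_fused(data, tile_size=4)[1])
```

Both versions produce the same output, but the fused version never writes or re-reads the intermediate array. The sketch only covers element-wise kernels; kernels that read neighbouring elements would need overlapping tiles.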

Tools: Offload
- Instrument Codeplay's PS3/GPU Offload C++ compiler
  - Monitor accesses to global versus local data
  - Apply the concepts to unsupported architectures
  - Visualise bandwidth and power consumption of real-world code from AiGameDev and Geomerics
- Apply the tool to the Geomerics Enlighten codebase
  - Accelerate the reference implementation on PS3 and GPU
  - Apply to Think Silicon GPU hardware designs
- Modify existing OpenCL tools for power consumption estimates

Architecture
- To improve GPU power efficiency, we will explore several directions
  - Different memory architectures: GPUs are designed with a variety of hierarchical memory architectures to reduce bandwidth
  - Redundancy: redundant computations and data movement can be omitted by transforming computation into caching
  - Slack: slack originating from unbalanced processing in each graphics pipeline stage is a major source of power inefficiency. We can exploit this slack by applying DVFS to underutilized pipeline stages
  - Accuracy (QoS): reducing computational accuracy may not have a significant impact on QoS but at the same time save considerable energy
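The slack argument can be made concrete with the standard dynamic-power relation P = C·V²·f, under which the dynamic energy of a stage that runs for a given number of cycles is roughly C·V²·cycles. The sketch below is a rough model with loud assumptions: supply voltage is taken to scale linearly with frequency, static (leakage) power and minimum-voltage limits are ignored, and the function name, stage cycle counts, and normalised units are hypothetical.

```python
def dvfs_energy(stage_cycles, f_nom=1.0, v_nom=1.0, cap=1.0):
    """Compare dynamic energy with and without per-stage DVFS.

    The frame time is set by the busiest stage at nominal frequency.
    Every other stage is slowed to the lowest frequency that still
    finishes within that frame time.
    """
    frame_time = max(stage_cycles) / f_nom
    baseline = 0.0
    scaled = 0.0
    for cycles in stage_cycles:
        baseline += cap * v_nom ** 2 * cycles
        f = cycles / frame_time   # just fast enough to use up the slack
        v = v_nom * f / f_nom     # assumption: voltage scales linearly with frequency
        scaled += cap * v ** 2 * cycles
    return baseline, scaled


baseline, scaled = dvfs_energy([100, 50])
print(baseline, scaled)
```

Because energy in this model depends on V² and not on f, the saving comes entirely from the lower voltage that the lower frequency permits. A stage with half the work of the busiest stage runs at half frequency and, under the linear-voltage assumption, a quarter of the energy per cycle.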

Expected Impact
- We will produce:
  - Prototypes of commercially licensable tools to analyse memory architecture options
  - New commercial graphics techniques for power-efficient lighting
  - Research results showing the impact on power of the various options available in designing GPUs
  - New ideas for designing more power-efficient GPUs
  - Training materials and examples to show how to take complex video game code (such as AI code) and move it onto GPU-accelerated architectures
- By working together we will achieve more than we can achieve individually!

Beyond LPGPU
- Bit too early to tell
- Research will show where additional research is needed
- DARPA study identifies four challenges for ExaScale Computing
  - Energy and Power challenge
  - Memory and Storage challenge
  - Concurrency and Locality challenge
  - Resiliency challenge