Optimization Strategies

Global Memory Access Pattern and Control Flow

Objectives
- Global memory access pattern (coalescing)
- Control flow (divergent branches)

Copyright 2013 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Global Memory Access
- Highest-latency instructions: 200-400 clock cycles
- Likely to be the performance bottleneck
- Optimizations can greatly increase performance
- Best access pattern: coalescing
- Up to 10x speedup

Coalesced Memory Access
- A coordinated read by a half-warp (16 threads)
- A contiguous region of global memory:
  - 64 bytes: each thread reads a word (int, float, ...)
  - 128 bytes: each thread reads a double-word (int2, float2, ...)
  - 256 bytes: each thread reads a quad-word (int4, float4, ...)
- Additional restrictions on G8X architecture:
  - Starting address for a region must be a multiple of region size
  - The k-th thread in a half-warp must access the k-th element in a block being read
- Exception: not all threads must be participating
  - Predicated access, divergence within a half-warp

Coalesced Access: Reading floats
[Figure: two panels, one where all threads participate and one where some threads do not participate]

Non-Coalesced Access: Reading floats
[Figure: two panels, one with permuted access by threads and one with a misaligned starting address (not a multiple of 64)]
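The three access patterns in the figures above can be sketched as small copy kernels. This is an illustrative sketch only: the kernel names, the `runCopyPattern` host helper, and the pair-swap permutation are assumptions made for the example and do not come from the slides.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Coalesced: thread k reads element k of an aligned, contiguous region.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Misaligned: every read is shifted by `offset` elements, so the region read
// by a half-warp starts on a 64-byte boundary only if offset % 16 == 0.
__global__ void copyMisaligned(const float *in, float *out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + offset < n) out[i] = in[i + offset];
}

// Permuted: each even/odd pair of threads swaps elements, so thread k
// does not read the k-th element of the region.
__global__ void copyPermuted(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i ^ 1;
    if (i < n && j < n) out[i] = in[j];
}

// Hypothetical host helper: pattern 0 = coalesced, 1 = misaligned, 2 = permuted.
// Elements that a kernel does not write are left at zero.
std::vector<float> runCopyPattern(int pattern, const std::vector<float> &h_in, int offset)
{
    assert(!h_in.empty() && offset >= 0);
    int n = static_cast<int>(h_in.size());
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    if (pattern == 0)      copyCoalesced<<<blocks, threads>>>(d_in, d_out, n);
    else if (pattern == 1) copyMisaligned<<<blocks, threads>>>(d_in, d_out, n, offset);
    else                   copyPermuted<<<blocks, threads>>>(d_in, d_out, n);

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

All three kernels move the same amount of data; only the mapping from thread index to address differs, which is what the coalescing rules are about. How large the penalty is depends on the GPU generation.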

Example: Non-coalesced float3 read

    __global__ void accessFloat3(float3 *d_in, float3 *d_out)
    {
        // One float3 per thread, indexed by global thread ID
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        float3 a = d_in[index];
        a.x += 2;
        a.y += 2;
        a.z += 2;
        d_out[index] = a;
    }

Example: Non-coalesced float3 read (Cont.)
- float3 is 12 bytes
- Each thread ends up executing 3 reads
  - sizeof(float3) ≠ 4, 8, or 16
  - Half-warp reads three 64B non-contiguous regions

Example: Non-coalesced float3 read (2)
Similarly, step 3 starts at offset 512.

Example: Non-coalesced float3 read (3)
- Use shared memory to allow coalescing
  - Need sizeof(float3)*(threads/block) bytes of SMEM
  - Each thread reads 3 scalar floats:
    - Offsets: 0, (threads/block), 2*(threads/block)
    - These will likely be processed by other threads, so sync
- Processing
  - Each thread retrieves its float3 from SMEM array
    - Cast the SMEM pointer to (float3*)
    - Use thread ID as index
  - Rest of the compute code does not change!

Example: Final Coalesced Code
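The code for this slide was not captured in the transcript. The following is a minimal sketch of the shared-memory staging steps listed above, assuming 256 threads per block (which matches the step offsets 0, 256, and 512). The kernel name, the `THREADS` constant, and the `runAccessFloat3Shared` host helper are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int THREADS = 256;  // assumed threads per block

// Input and output are treated as flat float arrays holding packed float3 data.
__global__ void accessFloat3Shared(const float *g_in, float *g_out)
{
    __shared__ float s_data[THREADS * 3];

    // Each block owns 3 * blockDim.x consecutive floats.
    int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;

    // Three coalesced scalar reads at offsets 0, THREADS, 2 * THREADS.
    s_data[threadIdx.x]               = g_in[index];
    s_data[threadIdx.x + THREADS]     = g_in[index + THREADS];
    s_data[threadIdx.x + 2 * THREADS] = g_in[index + 2 * THREADS];
    __syncthreads();  // the floats of one float3 were loaded by other threads

    // Compute code is unchanged: each thread works on its own float3.
    float3 a = ((float3 *)s_data)[threadIdx.x];
    a.x += 2.0f;
    a.y += 2.0f;
    a.z += 2.0f;
    ((float3 *)s_data)[threadIdx.x] = a;
    __syncthreads();

    // Three coalesced scalar writes.
    g_out[index]               = s_data[threadIdx.x];
    g_out[index + THREADS]     = s_data[threadIdx.x + THREADS];
    g_out[index + 2 * THREADS] = s_data[threadIdx.x + 2 * THREADS];
}

// Hypothetical host helper; the input length must be a multiple of 3 * THREADS.
std::vector<float> runAccessFloat3Shared(const std::vector<float> &h_in)
{
    assert(!h_in.empty() && h_in.size() % (3 * THREADS) == 0);
    size_t bytes = h_in.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = static_cast<int>(h_in.size() / (3 * THREADS));
    accessFloat3Shared<<<blocks, THREADS>>>(d_in, d_out);

    std::vector<float> h_out(h_in.size());
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The global loads and stores are issued as scalar floats in thread order, so each half-warp touches one contiguous 64-byte region per access; the 12-byte float3 view exists only in shared memory.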

Coalescing: Structures of Size ≠ 4, 8, or 16 Bytes
- Use a structure of arrays instead of an array of structures
- If a structure of arrays is not viable:
  - Force structure alignment: __align__(X), where X = 4, 8, or 16
  - Use SMEM to achieve coalescing
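The layout options above can be sketched as follows. The type names, kernel names, and the `runShiftSoA` host helper are hypothetical; the padded variant assumes that spending 4 extra bytes per element is acceptable.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Array of structures: 12-byte elements, not a size the hardware coalesces.
struct PointAoS { float x, y, z; };
static_assert(sizeof(PointAoS) == 12, "three packed floats");

// Forced alignment: padded to 16 bytes so one element is a single quad-word.
struct __align__(16) PointAligned { float x, y, z, pad; };
static_assert(sizeof(PointAligned) == 16, "padded to a quad-word");

// Structure of arrays: each component is its own contiguous float array.
struct PointsSoA { float *x; float *y; float *z; };

// SoA kernel: thread k reads the k-th float of each component array.
__global__ void shiftSoA(PointsSoA p, int n, float delta)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        p.x[i] += delta;
        p.y[i] += delta;
        p.z[i] += delta;
    }
}

// Aligned AoS kernel: each thread moves one 16-byte element.
__global__ void shiftAligned(PointAligned *p, int n, float delta)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        PointAligned a = p[i];
        a.x += delta;
        a.y += delta;
        a.z += delta;
        p[i] = a;
    }
}

// Hypothetical host helper that runs the SoA kernel in place on three vectors.
void runShiftSoA(std::vector<float> &x, std::vector<float> &y,
                 std::vector<float> &z, float delta)
{
    assert(!x.empty() && x.size() == y.size() && y.size() == z.size());
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);
    PointsSoA d{};
    cudaMalloc(&d.x, bytes);
    cudaMalloc(&d.y, bytes);
    cudaMalloc(&d.z, bytes);
    cudaMemcpy(d.x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d.y, y.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d.z, z.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    shiftSoA<<<blocks, threads>>>(d, n, delta);

    cudaMemcpy(x.data(), d.x, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(y.data(), d.y, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(z.data(), d.z, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d.x);
    cudaFree(d.y);
    cudaFree(d.z);
}
```

The structure-of-arrays form needs no padding, while the aligned form keeps the per-element struct at the cost of extra memory and bandwidth for the unused field.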

Control Flow Instructions in GPUs
- Main performance concern with branching is divergence
  - Threads within a single warp take different paths
  - Different execution paths are serialized
    - The control paths taken by the threads in a warp are traversed one at a time until none remain

Divergent Branch
- A common case: avoid divergence when the branch condition is a function of thread ID
  - Example with divergence:
    - if (threadIdx.x > 2) { }
    - This creates two different control paths for threads in a block
    - Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  - Example without divergence:
    - if (threadIdx.x / WARP_SIZE > 2) { }
    - Also creates two different control paths for threads in a block
    - Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path
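The two branch conditions above can be compared side by side in a small sketch. The kernel names, the `WARP_SIZE` value of 32, and the `runBranch` host helper are assumptions for illustration; both kernels produce a per-thread flag, and only the granularity of the branch differs.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int WARP_SIZE = 32;  // assumed warp size

// Divergent: within the first warp, threads 0..2 take one path and 3..31 the other.
__global__ void branchDivergent(int *out)
{
    int t = threadIdx.x;
    if (t > 2) out[t] = 1;
    else       out[t] = 0;
}

// Warp-uniform: the condition is constant across each warp, so no warp diverges.
__global__ void branchWarpUniform(int *out)
{
    int t = threadIdx.x;
    if (t / WARP_SIZE > 2) out[t] = 1;
    else                   out[t] = 0;
}

// Hypothetical host helper: runs one block of `threads` threads and returns the flags.
std::vector<int> runBranch(bool warpUniform, int threads)
{
    assert(threads > 0 && threads <= 1024);
    size_t bytes = threads * sizeof(int);
    int *d_out = nullptr;
    cudaMalloc(&d_out, bytes);

    if (warpUniform) branchWarpUniform<<<1, threads>>>(d_out);
    else             branchDivergent<<<1, threads>>>(d_out);

    std::vector<int> h_out(threads);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

With 128 threads, the first kernel splits warp 0 at thread 3, while the second switches paths exactly at thread 96, the first thread of warp 3.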

Parallel Reduction
- Given an array of values, reduce them to a single value in parallel
- Examples
  - Sum reduction: sum of all values in the array
  - Max reduction: maximum of all values in the array
- Typical parallel implementation:
  - Recursively halve the number of threads, add two values per thread
  - Takes log2(n) steps for n elements, requires n/2 threads

A Vector Reduction Example
- Assume an in-place reduction using shared memory
  - The original vector is in device global memory
  - The shared memory is used to hold a partial sum vector
  - Each iteration brings the partial sum vector closer to the final sum
  - The final solution will be in element 0

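Vector Reduction
[Figure: reduction tree over array elements 0 to 11. Iteration 1 forms the pair sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms the sums over elements 0..3, 4..7, 8..11; iteration 3 forms the sums over elements 0..7 and 8..15.]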

A simple implementation
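The code for this slide was not captured in the transcript. A minimal sketch of an interleaved in-place reduction, consistent with the setup described earlier (partial sums in shared memory, result in element 0), might look like this. It assumes one input element per thread and a power-of-two block size; the kernel name and the `sumInterleaved` host helper are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

__global__ void reduceInterleaved(const float *g_in, float *g_out)
{
    extern __shared__ float partialSum[];
    unsigned int t = threadIdx.x;
    partialSum[t] = g_in[blockIdx.x * blockDim.x + t];

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        // Divergent branch: only threads whose index is a multiple of
        // 2 * stride add, so odd-indexed threads never do any work.
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }

    // Thread 0 performed the last addition itself, so element 0 is final.
    if (t == 0) g_out[blockIdx.x] = partialSum[0];
}

// Hypothetical host helper: sums one block of n elements,
// where n is a power of two no larger than 512.
float sumInterleaved(const std::vector<float> &h_in)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    assert(n > 0 && n <= 512 && (n & (n - 1)) == 0);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Third launch parameter sizes the dynamic shared-memory array.
    reduceInterleaved<<<1, n, bytes>>>(d_in, d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The modulo test keeps the active threads spread across every warp, which is the source of the divergence discussed in the observations.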

Interleaved Reduction
[Figure: interleaved reduction tree; surviving labels by level: 2, 4, 6, 8, 10, 12, 14; then 4, 8, 12; then 8]

Some Observations
- In each iteration, two control flow paths will be sequentially traversed for each warp
  - Threads that perform addition and threads that do not
  - Threads that do not perform addition may cost extra cycles depending on the implementation of divergence

Some Observations (Cont.)
- No more than half of the threads will be executing at any time
  - All odd-index threads are disabled right from the beginning!
  - On average, fewer than ¼ of the threads will be activated for all warps over time.
  - After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence.
    - This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated until all warps retire

Optimization 1:
- Replace divergent branch
- With strided index and non-divergent branch

Optimization 1: (Cont.)
No divergence until fewer than 16 partial sums remain.
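A sketch of the strided-index variant follows; the slide's own code is not in the transcript, so the kernel name, the `sumStrided` host helper, and the one-element-per-thread setup are assumptions. Each active thread computes the index it works on from its thread ID, so the active threads are always the lowest-numbered ones.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

__global__ void reduceStrided(const float *g_in, float *g_out)
{
    extern __shared__ float partialSum[];
    unsigned int t = threadIdx.x;
    partialSum[t] = g_in[blockIdx.x * blockDim.x + t];

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        // Threads 0..k-1 are active, so whole warps drop out together
        // instead of every warp keeping a few active threads.
        unsigned int index = 2 * stride * t;
        if (index < blockDim.x)
            partialSum[index] += partialSum[index + stride];
    }

    // Thread 0 always maps to index 0, so it wrote the final value itself.
    if (t == 0) g_out[blockIdx.x] = partialSum[0];
}

// Hypothetical host helper: sums one block of n elements,
// where n is a power of two no larger than 512.
float sumStrided(const std::vector<float> &h_in)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    assert(n > 0 && n <= 512 && (n & (n - 1)) == 0);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    reduceStrided<<<1, n, bytes>>>(d_in, d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Neighbouring active threads now touch shared-memory addresses that are 2 * stride words apart, which is the strided addressing pattern behind the bank-conflict issue.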

Optimization 1: Bank Conflict Issue
Bank conflicts due to the strided addressing

Optimization 2: Sequential Addressing

Optimization 2: (Cont.)
- Replace strided indexing
- With reversed loop and threadID-based indexing
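A sketch of the sequential-addressing variant follows, again assuming one input element per thread and a power-of-two block size. The kernel name and the `sumSequential` host helper are hypothetical, since the slide's code is not in the transcript.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

__global__ void reduceSequential(const float *g_in, float *g_out)
{
    extern __shared__ float partialSum[];
    unsigned int t = threadIdx.x;
    partialSum[t] = g_in[blockIdx.x * blockDim.x + t];

    // Reversed loop: the stride starts at half the block and halves each step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        // Thread t adds the element `stride` positions away, so active
        // threads read and write consecutive shared-memory words.
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    // Thread 0 performed the last addition itself, so element 0 is final.
    if (t == 0) g_out[blockIdx.x] = partialSum[0];
}

// Hypothetical host helper: sums one block of n elements,
// where n is a power of two no larger than 512.
float sumSequential(const std::vector<float> &h_in)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    assert(n > 0 && n <= 512 && (n & (n - 1)) == 0);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    reduceSequential<<<1, n, bytes>>>(d_in, d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

For a 512-thread block the strides are 256, 128, 64, 32, 16, 8, 4, 2, 1; only the strides below the warp size leave a partially active warp.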

Some Observations About the New Implementation
- Only the last 5 iterations will have divergence
- Entire warps will be shut down as iterations progress
  - For a 512-thread block, 4 iterations to shut down all but one warp in each block
- Better resource utilization, will likely retire warps and thus blocks faster
- Recall, no bank conflicts either