A Micro-Benchmark Evaluation of Catamount and Cray Linux Environment (CLE) Performance

Jeff Larkin, Cray Inc. <larkin@cray.com> and Jeff Kuehn, ORNL <kuehn@ornl.gov>

The Big Question: Does CLE waddle like a penguin, or run like a catamount?

Overview
- Background
  - Motivation
  - Catamount and CLE
  - Benchmarks
  - Benchmark System
- Benchmark Results
  - HPCC
  - IMB
- Conclusions

BACKGROUND

Motivation
- Last year at CUG, CNL was in its infancy.
- Since CUG 2007:
  - Significant effort has been spent scaling on large machines.
  - CNL reached GA status in Fall 2007.
  - Compute Node Linux (CNL) was renamed Cray Linux Environment (CLE).
  - A significant number of sites have already made the change, and many codes have already been ported from Catamount to CLE.
- Catamount's scalability has always been touted, so how does CLE compare?
  - Fundamentals of communication performance: HPCC and IMB.
- What should sites and users know before they switch?

Background: Catamount
- Developed by Sandia for Red Storm; adopted by Cray for the XT3.
- Extremely lightweight:
  - Simple memory model: no virtual memory, no mmap.
  - Reduced set of system calls.
  - Single threaded; no Unix sockets; no dynamic libraries.
  - Few interrupts delivered to user codes.
- Virtual Node (VN) mode added for dual-core processors.

Background: CLE
- First, we tried a full SUSE Linux kernel. Then we put Linux on a diet.
- With the help of ORNL and NERSC, we began running at large scale.
- By Fall 2007, we released Linux for the compute nodes.
- What did we gain? Threading, Unix sockets, and I/O buffering.

Background: Benchmarks
- HPCC
  - Suite of several benchmarks, released as part of the DARPA HPCS program.
  - Measures MPI performance and performance across varied temporal and spatial localities.
  - Benchmarks are run in three modes:
    - SP: a single node runs the benchmark.
    - EP: every node runs a copy of the same benchmark.
    - Global: all nodes run the benchmark together.
- Intel MPI Benchmarks (IMB) 3.0
  - Formerly the Pallas benchmarks.
  - Benchmarks standard MPI routines at varying scales and message sizes.
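As a hedged illustration of how ping-pong numbers like those in the IMB results are typically derived (this is our own sketch, not IMB source code): one-way latency is taken as half the measured round-trip time, and bandwidth is the message size divided by that one-way time.

```python
# Sketch of IMB-style ping-pong metric derivation (illustrative only).
# The function name and the example numbers are ours, not from the benchmark.

def pingpong_metrics(msg_bytes, roundtrip_usec):
    """Return (one-way latency in usec, bandwidth in MB/s)."""
    one_way_usec = roundtrip_usec / 2.0
    # bytes per microsecond equals MB/s (1 MB = 1e6 B, 1 s = 1e6 usec)
    mb_per_s = msg_bytes / one_way_usec
    return one_way_usec, mb_per_s

# e.g. a 1024-byte message with a 10 usec measured round trip:
lat, bw = pingpong_metrics(1024, 10.0)
print(lat, bw)  # 5.0 usec one-way, 204.8 MB/s
```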

Background: Benchmark System
- All benchmarks were run on the same system, Shark, with the latest OS versions as of Spring 2008.
- System basics:
  - Cray XT4 with 2.6 GHz dual-core Opterons (able to run up to 1280 cores).
  - DDR2-667 memory, 2 GB/core.
- Operating systems:
  - Catamount (1.5.61)
  - CLE with MPT2 (2.0.50)
  - CLE with MPT3 (2.0.50, xt-mpt 3.0.0.10)
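For context when reading the bandwidth results, here is a back-of-envelope peak memory bandwidth for such a node, assuming dual-channel DDR2-667 with an 8-byte data bus per channel. The channel count and bus width are our assumptions; the slides state only "DDR2-667 Memory, 2GB/core".

```python
# Back-of-envelope DDR2-667 peak bandwidth (assumptions: 2 channels, 8-byte bus).

def ddr2_667_peak_gb_s(channels=2, bus_bytes=8, transfers_per_s=667e6):
    """Theoretical peak in GB/s: transfers/s * bytes/transfer * channels."""
    return transfers_per_s * bus_bytes * channels / 1e9

print(round(ddr2_667_peak_gb_s(), 2))  # ~10.67 GB/s per socket
```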

BENCHMARK RESULTS

HPCC

[Figure: Parallel Transpose (Cores) — PTRANS bandwidth in GB/s (0–140) vs. processor cores (0–1500); series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

[Figure: Parallel Transpose (Sockets) — PTRANS bandwidth in GB/s (0–120) vs. sockets (0–600); same six series]

[Figure: MPI Random Access — GUP/s (0–3) vs. processor cores (0–1500); same six series]

[Figure: MPI-FFT (Cores) — GFlops/s (0–250) vs. processor cores (0–1200); same six series]

[Figure: MPI-FFT (Sockets) — GFlops/s (0–250) vs. sockets (0–600); same six series]
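The MPI Random Access results above are reported in GUP/s (giga-updates per second). As a small, hedged sketch of what that unit means (our own illustration, not HPCC code):

```python
# GUP/s = billions of random table updates completed per second.

def gups(updates, seconds):
    """Convert an update count and elapsed time to GUP/s."""
    return updates / seconds / 1e9

# e.g. 3 billion updates completed in 1.5 seconds:
print(gups(3e9, 1.5))  # 2.0 GUP/s
```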

Naturally Ordered Latency (512 cores, usec):
  Catamount SN   6.41346
  CLE MPT2 N1    9.08375
  CLE MPT3 N1    9.41753
  Catamount VN  12.3024
  CLE MPT2 N2   13.8044
  CLE MPT3 N2    9.799

Naturally Ordered Bandwidth (512 cores, MB/s):
  Catamount SN  1.07688
  CLE MPT2 N1   0.900693
  CLE MPT3 N1   0.81866
  Catamount VN  0.171141
  CLE MPT2 N2   0.197301
  CLE MPT3 N2   0.329071

IMB

[Figure: IMB Ping Pong Latency (N1) — time in usec (0–12) vs. message size in bytes (0–1200); series: Catamount, CLE MPT2, CLE MPT3]

[Figure: IMB Ping Pong Latency (N2) — average usec (0–10) vs. message size in bytes (0–1200); same three series]

[Figure: IMB Ping Pong Bandwidth — MB/s (0–600) vs. message size in bytes (0–1200); same three series]

[Figure: MPI Barrier (Lin/Lin) — time in usec (0–160) vs. processor cores (0–1500); same three series]

[Figure: MPI Barrier (Lin/Log) — time in usec (0–160) vs. processor cores (1–10000, log scale); same three series]

[Figure: MPI Barrier (Log/Log) — time in usec (0.1–1000, log scale) vs. processor cores (1–10000, log scale); same three series]
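The roughly straight line in the log/log barrier plot is consistent with barrier cost growing logarithmically in the number of cores. A minimal sketch of why, assuming a dissemination-style barrier (our illustration; the actual MPI implementation on these systems may differ):

```python
import math

def dissemination_rounds(p):
    """Communication rounds in a dissemination barrier: ceil(log2(p))."""
    return math.ceil(math.log2(p)) if p > 1 else 0

# Doubling the core count adds only one more communication round:
print(dissemination_rounds(512), dissemination_rounds(1024))  # 9 10
```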

[Figure: SendRecv (Catamount/CLE MPT2)]

[Figure: SendRecv (Catamount/CLE MPT3)]

[Figure: Broadcast (Catamount/CLE MPT2)]

[Figure: Broadcast (Catamount/CLE MPT3)]

[Figure: Allreduce (Catamount/CLE MPT2)]

[Figure: Allreduce (Catamount/CLE MPT3)]

[Figure: AlltoAll (Catamount/CLE MPT2)]

[Figure: AlltoAll (Catamount/CLE MPT3)]

CONCLUSIONS

What we saw
- Catamount:
  - Handles single-core (SN/N1) runs slightly better.
  - Seems to handle small messages and small core counts slightly better.
- CLE:
  - Does very well on dual-core.
  - Likes large messages and large core counts.
- MPT3 helps performance and closes the gap between QK (Catamount) and CLE.

What's left to do?
- We'd really like to try this again on a larger machine: does CLE continue to beat Catamount above 1024 cores, or will the lines converge or cross?
- What about I/O? Linux adds I/O buffering; how does this affect I/O performance at scale?
- How does this translate into application performance? See "Cray XT4 Quadcore: A First Look," Richard Barrett et al., Oak Ridge National Laboratory (ORNL).

Does CLE waddle like a penguin, or run like a catamount? CLE runs like a big cat!

Acknowledgements
This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Thanks to Steve, Norm, Howard, and others for help investigating and understanding these results.