Dependability in Distributed Systems

INF 5360, spring 2014
Amir Taherkordi & Roman Vitenberg

Average Cost of Downtime
- Revenue loss, productivity loss, reputation loss
- Revenue loss + productivity loss for the IT industry in the US:
  - $4.54 billion in 1996
  - $6.6 billion in 1999
  - Over $10 billion in 2003

More Figures
- Breakdown according to branches in 1999 [figure not preserved in this transcript]
- $25,000 per minute for amazon.com in 2001
  - Amazon.com was down for 49 minutes in January 2013: $4 million or more in lost sales
- $125,000 per hour for a typical US enterprise (2004)

Why does it happen?
- Hardware failures
- Unreliable networks
  - Chances of dropping a message for a UDP ping
- Software bugs
  - Reproducible problems
  - Side effects of problems that occurred much earlier in an execution, e.g., overrunning an array
- Human error
  - Sysadmins and appadmins (misconfigurations)
  - Other

The unexpected happens (from amazon.com)
- A fuse blows and darkens a set of racks
- Chillers die in a datacenter and a fraction of the servers go down
- The electric plug of a rack bursts into flames
- A telco severs connectivity to a datacenter
- Tornadoes and lightning strike a datacenter
- A datacenter floods from the roof down
- Simultaneous infant mortality occurs among servers newly deployed in multiple datacenters
- Power generation doesn't start because the ambient temperature is too high
- The DNS provider creates a black hole
- Load

Can we really expect to depend on computer systems?
- "The only secure computer is one that's unplugged, locked in a safe, and buried 20 feet under the ground in a secret location... and I'm not even too sure about that one"
  - Dennis Hughes, FBI

Main Aspects and Concerns of Dependability
- Availability
  - The probability that the system is available at any given time
  - Affected by the Mean Time to Failure (MTTF), the failure detection time, and the recovery time
  - Typically expressed as a series of 9s, e.g., 0.9999999
- Reliability
  - The property of running continuously without failures
- Safety
  - A temporary failure leads to no calamity
  - Graceful degradation, e.g., of service
- Maintainability
  - Ease of repairing failures as well as short detection and recovery time
  - Self-stabilization (self-* property)
- Security (beyond the scope of this course)
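
As a rough illustration of how MTTF, detection time, and recovery time combine into an availability figure, the sketch below uses a common steady-state approximation: uptime divided by uptime plus downtime. The slides do not give a formula, so both the formula and the names `availability`, `mttf_hours`, `detection_hours`, and `recovery_hours` are assumptions made for this example.

```python
def availability(mttf_hours: float, detection_hours: float, recovery_hours: float) -> float:
    """Steady-state availability: expected uptime over one full failure cycle.

    Assumption: each failure costs detection time plus recovery time,
    and the system runs for MTTF hours between failures.
    """
    downtime = detection_hours + recovery_hours
    return mttf_hours / (mttf_hours + downtime)


# Example: one failure every 1000 hours, 0.1 h to detect it, 0.9 h to recover.
a = availability(1000.0, 0.1, 0.9)
print(f"{a:.5f}")  # prints 0.99900
```

Under this approximation, shortening detection or recovery time raises availability just as lengthening the MTTF does, which is why maintainability appears alongside reliability in the list above.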

Classes of High Availability

Availability   Total accumulated downtime per year   Class
90%            More than a month                     1
99%            Less than 4 days                      2
99.9%          Less than 9 hours                     3
99.99%         About an hour                         4
99.999%        A little over 5 minutes               5
99.9999%       About 30 seconds                      6

- Standard computers with normal system administration achieve Class 2
- Clusters usually achieve Class 3 or 4
- Mainframes typically provide Class 3 or 4; new ones are claimed to provide Class 5 in a well-managed environment
- Phone switches require Class 5
- In-flight aircraft computers are required to provide Class 6
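
The downtime figures for each class follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name `downtime_seconds_per_year` is made up for this example):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Accumulated downtime per year implied by an availability percentage."""
    return (1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR


# Classes 1 to 6 correspond to one to six nines.
for cls, pct in enumerate([90.0, 99.0, 99.9, 99.99, 99.999, 99.9999], start=1):
    secs = downtime_seconds_per_year(pct)
    print(f"Class {cls}: {pct}% -> {secs / 3600:.3f} hours ({secs:.1f} s)")
```

For 99.999% this gives roughly 5.3 minutes per year, and for 99.9999% roughly 31.5 seconds per year.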

Failure Models
- Failure types that may occur in a given system
- Failure model: how the system behaves when it doesn't behave properly
- Motivation for considering failure models
  - Different solutions for different models
  - Different possibility limits and expectations
  - Part of the underlying context
  - How to adapt to dynamic changes in the model?
- Consists of the following parts
  - Dependency, failure classification, failure semantics, failure masking

Failure Models
- Fail-stop: a process crashes and remains halted
- Send-omission: a process completes a send, but the message is not in the outgoing buffer
- Receive-omission: a message is put into a process's incoming buffer, but that process does not receive it
- Omission (channel): a message is lost
- Arbitrary (malicious, Byzantine): anything can happen
- Other failure model concepts:
  - Process failure: generates incorrect results; e.g., deadlock, protection fault, divide by zero
  - Software or hardware fault
  - Network partition

Failure Detection
- FD model: what FD precision is guaranteed?
  - Perfect FD is impossible in asynchronous systems
- The evergreen "I am alive" mechanism
  - Send messages periodically
  - If I do not hear from a node, I assume that it has failed
- Why should we ever want something different?
  - It does not distinguish between network and node failures
  - And then, a garbage collector kicked in
  - At what level and using which communication stack
  - Ultimately crude
- Propagating knowledge about failures
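
The "I am alive" mechanism can be sketched as a timeout-based detector. The class below is an illustrative assumption and not code from the course: the name `HeartbeatDetector`, the fixed `timeout`, and the explicit `now` arguments (used instead of a real clock so that the example is deterministic) are all made up for this sketch.

```python
class HeartbeatDetector:
    """Suspects a node when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # node id -> time of the last heartbeat received

    def heartbeat(self, node: str, now: float) -> None:
        """Record an "I am alive" message from `node` at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        """Nodes that have been silent for longer than the timeout."""
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


fd = HeartbeatDetector(timeout=3.0)
fd.heartbeat("a", now=0.0)
fd.heartbeat("b", now=0.0)
fd.heartbeat("a", now=2.0)
print(fd.suspected(now=4.0))  # prints {'b'}: "b" has been silent for 4 s
fd.heartbeat("b", now=4.5)    # "b" was only slow (e.g. paused), not dead
print(fd.suspected(now=5.0))  # prints set(): the earlier suspicion was wrong
```

The second half of the example shows the crudeness the slide points at: the detector cannot tell a crashed node from a slow node or a slow network, so a suspicion may later have to be withdrawn.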

Key elements of dependable systems
- Data consistency (integrity and freshness)
  - Transactions and the ACID properties
  - Checkpointing and recovery
- Data, service, and computation availability through redundancy
  - Redundancy types: physical, computational, and data (communication is rarely redundant)
  - Techniques: replication, membership monitoring and maintenance, group communication
- Overcoming unreliable communication (unicast, multicast)
  - Techniques: omission discovery via ACKs and timeouts, retransmissions
  - The dream of exactly-once message delivery
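
To make the ACK, timeout, and retransmission techniques concrete, here is a small simulation. It is a sketch under stated assumptions: the lossy channel is simulated with a seeded random generator, a "timeout" is modelled simply as moving on to the next attempt, and the names `Receiver`, `lossy`, and `send_reliably` are hypothetical. The receiver discards duplicates by sequence number, which is one common way to approximate exactly-once delivery on top of at-least-once retransmission.

```python
import random


class Receiver:
    """Delivers each sequence number at most once, but ACKs every copy."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, seq: int, payload: str) -> int:
        if seq not in self.seen:  # duplicate copies are not delivered again
            self.seen.add(seq)
            self.delivered.append(payload)
        return seq  # the ACK carries the sequence number


def lossy(rng: random.Random, loss: float) -> bool:
    """True if the simulated channel drops this transmission."""
    return rng.random() < loss


def send_reliably(receiver, seq, payload, rng, loss=0.5, max_tries=1000) -> int:
    """Retransmit until an ACK gets through; returns the number of tries."""
    for attempt in range(1, max_tries + 1):
        if lossy(rng, loss):
            continue  # message lost: sender times out and retransmits
        ack = receiver.receive(seq, payload)
        if lossy(rng, loss):
            continue  # ACK lost: sender times out and retransmits
        if ack == seq:
            return attempt
    raise TimeoutError("no ACK received after max_tries attempts")


rng = random.Random(1)
r = Receiver()
for seq, msg in enumerate(["x", "y", "z"]):
    send_reliably(r, seq, msg, rng)
print(r.delivered)
```

When an ACK is lost, the receiver sees the same message more than once; without the `seen` set the payload would be delivered twice, which is why retransmission alone gives at-least-once rather than exactly-once semantics.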

The CAP Conjecture (by Eric Brewer)

Forfeit partitions

Forfeit availability

Forfeit consistency

The tradeoffs are real
- The whole space is useful
- Real internet systems are a careful mixture

Another Dimension in the Equation: Scale & Dynamicity
- Classical dependability
  - Static universe and small scale
  - Full-mesh communication
  - Pessimistic replication, typically with strong consistency
  - Little need for autonomic self-organization and recovery
- Modern and groovy dependability
  - Dynamic, large-scale, and mobile
  - Explicitly probabilistic consistency guarantees
  - Scalable membership and failure detection
  - Scalable update propagation (epidemic dissemination)
  - Autonomic recovery and self-organization in the presence of churn
  - Optimistic replication
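
Epidemic dissemination can be illustrated with a toy push-gossip simulation. The slide only names the technique, so the push variant, the `fanout` parameter, the synchronous rounds, and the function name `push_gossip_rounds` are assumptions chosen for this sketch.

```python
import random


def push_gossip_rounds(n: int, fanout: int, rng: random.Random) -> int:
    """Rounds until all n nodes hold an update that starts at node 0."""
    informed = {0}
    rounds = 0
    while len(informed) < n:
        newly = set()
        for _ in informed:
            # Each informed node pushes the update to `fanout` random peers.
            newly.update(rng.sample(range(n), fanout))
        informed |= newly
        rounds += 1
    return rounds


print(push_gossip_rounds(n=1000, fanout=2, rng=random.Random(42)))
```

No node needs to know the full membership state or talk to every other node, yet the number of rounds grows slowly with `n`. The price is that delivery within a given number of rounds is only probabilistic, in line with the "explicitly probabilistic" guarantees listed above.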

Dependability in Industrial Middleware
- Google's Chubby and Yahoo's ZooKeeper
  - Paxos-based
- Highly available cluster technologies from IBM and Microsoft
- Reliable storage solutions in all major companies
- Transaction monitors
  - An old but still relevant technology
  - Oracle, BEA, IBM
- SOA and Web Services
  - Developed messaging reliability standards (WS-Reliability)
  - Dire need for service composition, SLAs, and practical models for dependability evaluation
- Service Availability Forum (SAF)
  - Mostly focuses on telephony, embedded applications, and mission-critical systems
- Many solutions embedded in apps and other middleware

Other research directions in dependability
- Measuring and assessing dependability
  - Fault injection
- Making dependability more adaptive
  - Switching between active and passive replication on the fly
  - Taking advantage of componentization and self-awareness
- Practical Byzantine fault tolerance

Textbooks
- Not needed for the course but highly recommended for an interested reader
- "Distributed Systems" by Tanenbaum and van Steen, chapter 8
  - Parts of chapter 7 are also relevant
- "Reliable Distributed Systems" by Ken Birman
- "Distributed Systems" collection by Sape Mullender, chapter 16