Dependability in Distributed Systems

Dependability in Distributed Systems INF 5360 spring 2014 INF5360, Amir Taherkordi & Roman Vitenberg 1

Average Cost of Downtime
- Revenue loss, productivity loss, reputation loss
- Revenue loss + productivity loss for the IT industry in the US:
  - $4.54 billion in 1996
  - $6.6 billion in 1999
  - Over $10 billion in 2003

More Figures
- Breakdown according to branches in 1999
- $25,000 per minute for amazon.com in 2001
  - Amazon.com was down for 49 minutes in January 2013: $4 million or more in lost sales
- $125,000 per hour for a typical US enterprise (2004)

Why does it happen?
- Hardware failures
- Unreliable networks
  - Chances of dropping a message for a UDP ping
- Software bugs
  - Reproducible problems
  - Side effects of problems that occurred much earlier in an execution, e.g., overrunning an array
- Human error
  - Sysadmins and appadmins (misconfigurations)
  - Other

The unexpected happens (from amazon.com)
- A fuse blows and darkens a set of racks
- Chillers die in a datacenter and a fraction of the servers go down
- The electric plug of a rack bursts into flames
- A telco severs connectivity to a datacenter
- Tornadoes and lightning strike a datacenter
- A datacenter floods from the roof down
- Simultaneous infant mortality occurs among servers newly deployed in multiple datacenters
- Power generation doesn't start because the ambient temperature is too high
- The DNS provider creates a black hole
- Load

Can we really expect to depend on computer systems?
- "The only secure computer is one that's unplugged, locked in a safe, and buried 20 feet under the ground in a secret location... and I'm not even too sure about that one" (Dennis Hughes, FBI)

Main Aspects and Concerns of Dependability
- Availability
  - The probability that the system is available at any given time
  - Affected by the Mean Time To Failure (MTTF), the failure detection time, and the recovery time
  - Typically expressed as a series of 9s, e.g., 0.9999999
- Reliability
  - The property of running continuously without failures
- Safety
  - A temporary failure leads to no calamity
  - Graceful degradation, e.g., of service
- Maintainability
  - Ease of repairing failures, as well as short detection and recovery times
  - Self-stabilization (a self-* property)
- Security (beyond the scope of this course)
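
The relationship between availability, MTTF, detection time, and recovery time can be made concrete with a small calculation. The sketch below assumes the common steady-state approximation availability = MTTF / (MTTF + detection time + recovery time); the slide does not state this formula, and the function name and the sample numbers are purely illustrative.

```python
def availability(mttf_hours, detection_hours, recovery_hours):
    """Steady-state availability over one failure cycle (an assumed model).

    Uptime per cycle is the MTTF; downtime per cycle is the time needed
    to detect the failure plus the time needed to recover from it.
    """
    downtime = detection_hours + recovery_hours
    return mttf_hours / (mttf_hours + downtime)


# Hypothetical system: one failure per 1000 hours of operation,
# 6 minutes to detect it and 54 minutes to recover.
print(f"{availability(1000.0, 0.1, 0.9):.5f}")  # 0.99900

# Halving detection and recovery time has the same effect on downtime
# as doubling the MTTF would have.
print(f"{availability(1000.0, 0.05, 0.45):.5f}")
```

Under this model, shortening detection and recovery is often the cheaper way to add a "9", which is why maintainability appears next to availability in the list above.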

Classes of High Availability

Availability   Total accumulated downtime per year   Class
90%            More than a month                     1
99%            Less than 4 days                      2
99.9%          Less than 9 hours                     3
99.99%         About an hour                         4
99.999%        A little over 5 minutes               5
99.9999%       About 32 seconds                      6

- Standard computers with normal system administration achieve Class 2
- Clusters usually achieve Class 3 or 4
- Mainframes typically provide Class 3 or 4; new ones are claimed to provide Class 5 in a well-managed environment
- Phone switches require Class 5
- In-flight aircraft computers are required to provide Class 6
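
The downtime column follows from simple arithmetic: a class-k system may be unavailable for a fraction 10^-k of the year. The following sketch recomputes the yearly downtime budget per class, assuming a 365-day year; the function and constant names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a non-leap year


def downtime_seconds_per_year(nines):
    """Yearly downtime budget for availability class `nines`.

    Class k corresponds to an availability of 1 - 10**(-k),
    e.g. class 3 is 99.9%.
    """
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


for k in range(1, 7):
    s = downtime_seconds_per_year(k)
    print(f"class {k}: {s / 86400:9.4f} days = {s / 60:10.2f} minutes = {s:11.1f} seconds")
```

For instance, class 3 yields 8.76 hours per year and class 5 yields about 5.26 minutes per year.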

Failure Models
- Failure types that may occur in a given system
- Failure model: how the system behaves when it doesn't behave properly
- Motivation for considering failure models
  - Different solutions for different models
  - Different possibility limits and expectations
  - Part of the underlying context
  - How to adapt to dynamic changes in the model?
- Consists of the following parts
  - Dependency, failure classification, failure semantics, failure masking

Failure Models
- Fail-stop: a process crashes and remains halted
- Send-omission: a process completes a send, but the message is not put in its outgoing buffer
- Receive-omission: a message is put into a process's incoming buffer, but that process does not receive it
- Omission (channel): a message is lost
- Arbitrary (malicious, Byzantine): anything can happen
- Other failure model concepts:
  - Process failure: generates incorrect results, e.g., deadlock, protection fault, divide by zero
  - Software or hardware fault
  - Network partition

Failure Detection
- FD model: what failure detection (FD) precision is guaranteed?
  - Perfect FD is impossible in asynchronous systems
- The evergreen "I am alive" mechanism
  - Send messages periodically
  - If I do not hear from a node, I assume that it has failed
- Why should we ever want something different?
  - It does not distinguish between network and node failures
  - And then, a garbage collector kicked in (a paused node looks like a failed one)
  - At what level and using which communication stack
  - Ultimately crude
- Propagating knowledge about failures
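
The "I am alive" mechanism can be sketched in a few lines. This is a minimal, single-process illustration under stated assumptions, not a production detector: the class name, the `timeout` parameter, and the injectable `clock` are all hypothetical, and real heartbeats would arrive over a network rather than through direct method calls.

```python
import time


class HeartbeatFailureDetector:
    """Suspects a node when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock        # injectable so the example is deterministic
        self.last_seen = {}       # node id -> time of the latest heartbeat

    def heartbeat(self, node):
        """Record an "I am alive" message from `node`."""
        self.last_seen[node] = self.clock()

    def suspected(self):
        """Nodes that have been silent for longer than the timeout."""
        now = self.clock()
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


# Deterministic demo with a fake clock (a one-element list used as a cell).
now = [0.0]
fd = HeartbeatFailureDetector(timeout=3.0, clock=lambda: now[0])
fd.heartbeat("a")
fd.heartbeat("b")
now[0] = 2.0
fd.heartbeat("a")        # "a" keeps sending, "b" goes silent
now[0] = 4.0
print(fd.suspected())    # {'b'}
```

Note that the detector only ever suspects: node "b" may have crashed, may sit behind a broken link, or may merely be paused, and the timeout alone cannot tell these cases apart.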

Key elements of dependable systems
- Data consistency (integrity and freshness)
  - Transactions and the ACID properties
  - Checkpointing and recovery
- Data, service, and computation availability through redundancy
  - Redundancy types: physical, computational, and data (communication is rarely redundant)
  - Techniques: replication, membership monitoring and maintenance, group communication
- Overcoming unreliable communication (unicast, multicast)
  - Techniques: omission discovery via ACKs and timeouts, retransmissions
  - The dream of exactly-once message delivery
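
Omission discovery via ACKs, timeouts, and retransmissions can be illustrated with a small simulation. The sketch below is only a model under stated assumptions: the channel is simulated by a seeded random generator that drops messages and ACKs independently, a "timeout" is simply a loop iteration without an ACK, and all names (`Receiver`, `send_reliably`, `loss_rate`, `max_tries`) are hypothetical.

```python
import random


class Receiver:
    """Delivers each sequence number at most once, but ACKs every copy."""

    def __init__(self):
        self.delivered = []   # payloads handed to the application
        self.seen = set()     # sequence numbers already delivered

    def receive(self, seq, payload):
        if seq not in self.seen:          # filter retransmitted duplicates
            self.seen.add(seq)
            self.delivered.append(payload)
        return seq                        # the ACK carries the sequence number


def lossy(rng, loss_rate):
    """True if the simulated channel drops this transmission."""
    return rng.random() < loss_rate


def send_reliably(receiver, seq, payload, rng, loss_rate=0.3, max_tries=50):
    """Retransmit until an ACK gets through; return the number of attempts."""
    for attempt in range(1, max_tries + 1):
        if lossy(rng, loss_rate):
            continue                      # message lost: timeout, retransmit
        ack = receiver.receive(seq, payload)
        if lossy(rng, loss_rate):
            continue                      # ACK lost: timeout, send a duplicate
        if ack == seq:
            return attempt
    raise TimeoutError(f"no ACK for message {seq} after {max_tries} tries")


rng = random.Random(42)
receiver = Receiver()
for seq, payload in enumerate(["a", "b", "c"]):
    tries = send_reliably(receiver, seq, payload, rng)
    print(f"message {seq} acknowledged after {tries} attempt(s)")
print(receiver.delivered)
```

Retransmission alone gives at-least-once delivery; the duplicate filter at the receiver turns it into something that looks like exactly-once. It remains a "dream" in the sense that the sender may still give up (the `TimeoutError` branch) without knowing whether the message was delivered.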

The CAP Conjecture (by Eric Brewer)

Forfeit partitions

Forfeit availability

Forfeit consistency

The tradeoffs are real
- The whole space is useful
- Real internet systems are a careful mixture

Another Dimension in the Equation: Scale & Dynamicity
- Classical dependability
  - Static universe and small scale
  - Full-mesh communication
  - Pessimistic replication, typically with strong consistency
  - Little need for autonomic self-organization and recovery
- Modern and groovy dependability
  - Dynamic, large-scale, and mobile
  - Explicitly probabilistic consistency guarantees
  - Scalable membership and failure detection
  - Scalable update propagation (epidemic dissemination)
  - Autonomic recovery & self-organization in the presence of churn
  - Optimistic replication
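
Epidemic dissemination is easiest to appreciate in a toy simulation. The sketch below models push gossip in synchronous rounds: every node that already holds the update forwards it to `fanout` peers chosen uniformly at random. The synchronous rounds, the uniform peer selection over a fully known membership, and the absence of failures are simplifying assumptions, and the function name and parameters are illustrative.

```python
import random


def push_gossip_rounds(n, fanout=2, seed=0, max_rounds=1000):
    """Rounds until an update that starts at node 0 has reached all n nodes."""
    rng = random.Random(seed)
    infected = {0}                 # nodes that already hold the update
    rounds = 0
    while len(infected) < n:
        if rounds >= max_rounds:
            raise RuntimeError("update did not spread within max_rounds")
        targets = set()
        for _ in infected:         # every infected node pushes `fanout` times
            for _ in range(fanout):
                targets.add(rng.randrange(n))   # may hit an infected node
        infected |= targets
        rounds += 1
    return rounds


for n in (10, 100, 1000, 10000):
    print(f"{n:6d} nodes: {push_gossip_rounds(n)} rounds")
```

In runs of this model, the number of rounds grows far more slowly than the number of nodes, which is the property that makes epidemic protocols attractive at large scale. The price is that the guarantee is probabilistic: in a real deployment with a bounded number of rounds, some node may miss the update.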

Dependability in Industrial Middleware
- Google's Chubby and Yahoo's ZooKeeper
  - Paxos-based
- Highly available cluster technologies from IBM and Microsoft
- Reliable storage solutions in all major companies
- Transaction monitors
  - An old but still relevant technology
  - Oracle, BEA, IBM
- SOA and Web Services
  - Developed messaging reliability standards (WS-Reliability)
  - Dire need for service composition, SLAs, and practical models for dependability evaluation
- Service Availability Forum (SAF)
  - Mostly focuses on telephony, embedded applications, and mission-critical systems
- Many solutions embedded in apps and other middleware

Other research directions in dependability
- Measuring and assessing dependability
  - Fault injection
- Making dependability more adaptive
  - Switching between active and passive replication on the fly
  - Taking advantage of componentization and self-awareness
- Practical Byzantine fault tolerance

Textbooks
- Not needed for the course but highly recommended for an interested reader
- Distributed Systems by Tanenbaum and van Steen, chapter 8
  - Parts of chapter 7 are also relevant
- Reliable Distributed Systems by Ken Birman
- Distributed Systems, collection edited by Sape Mullender, chapter 16