A Bloom Filter Based Scalable Data Integrity Check Tool for Large-scale Dataset


Sisi Xiong*, Feiyi Wang+ and Qing Cao*
* University of Tennessee Knoxville, Knoxville, TN, USA
+ Oak Ridge National Laboratory, Oak Ridge, TN, USA

Data integrity check
- Motivations
  - Silent data corruption
  - Data movement
- Checksumming
  - Generate a signature for a dataset
[Figure: Dataset 1 has signature 0x23ac and, after 3 months, still 0x23ac; after movement, Dataset 1 (0x23ac) versus Dataset 2 (0x4abf).]

Scalability
- Traditional approaches
  - Serial and file based
- File distribution on Spider 2*
  - 50 million directories, half a billion files
* Feiyi Wang, Veronica G. Vergara Larrea, Dustin Leverman, Sarp Oral. FCP: A Fast and Scalable Data Copy Tool for High Performance Parallel File Systems.

Design goals
- Develop a parallel checksumming tool
  - Use multiple processes/hosts to achieve horizontal scaling
- Generate signatures for large-scale datasets
  - A two-step task:
    - Generate a signature for each file
    - Aggregate all the file-level signatures into a dataset-level signature
- The resulting design: fsum

File-level signature vs chunk-level signature
- Increase parallelism: break files into chunks
- Chunk size: 4MiB
- Chunking of File A:
  - Chunk0(fileA, 0)
  - Chunk1(fileA, 4MiB)
  - Chunk2(fileA, 8MiB)
  - Chunk3(fileA, 12MiB)
  - Chunk4(fileA, 16MiB)
  - Chunk5(fileA, 20MiB)
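The chunking step can be sketched as follows. This is a minimal illustration, not fsum's code: the function name `chunk_offsets`, the 24 MiB file size, and the handling of empty files are assumptions made for the example. Each work item is a (file, offset) pair, matching the Chunk0 to Chunk5 items on the slide.

```python
from typing import Iterator, Tuple

MIB = 1024 * 1024


def chunk_offsets(name: str, file_size: int,
                  chunk_size: int = 4 * MIB) -> Iterator[Tuple[str, int]]:
    """Yield one (file name, byte offset) work item per chunk.

    Hypothetical helper: an empty file still yields one item so that it
    contributes to the dataset signature (an assumption of this sketch).
    """
    if file_size == 0:
        yield (name, 0)
        return
    for offset in range(0, file_size, chunk_size):
        yield (name, offset)


# A 24 MiB file (assumed size) splits into six 4 MiB chunks.
chunks = list(chunk_offsets("fileA", 24 * MIB))
```

Because every item carries its own offset, any process can checksum any chunk without coordinating with the others.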

Workload distribution
- Work stealing pattern
  - An idle process sends out a work request
  - A busy process splits its work queue equally
[Figure: work queues of processes P0 to P3 before and after each work queue split, followed by work processing.]
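A single-process sketch of one work queue split is shown below. It assumes that a busy process hands half of its pending items to one idle requester; the function name `split_queue` is hypothetical, and the slides do not show how the real tool exchanges requests and work between processes.

```python
from collections import deque


def split_queue(busy: deque) -> deque:
    """Move half of the busy process's pending items into a new queue.

    The returned queue stands in for the work sent to the idle requester.
    """
    stolen = deque()
    for _ in range(len(busy) // 2):
        stolen.append(busy.pop())  # take items from the tail of the queue
    return stolen


busy_queue = deque(range(8))           # eight pending work items
idle_queue = split_queue(busy_queue)   # the idle process receives four
```

Which items end up on which process depends on timing, so the processing order differs from run to run.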

Work stealing pattern
- Random work distribution
  - Same/different number of processes

Aggregation of chunk-level signatures
- Two possible solutions:
  - Hash list
  - Merkle tree: hash tree
- Sorting is necessary!
[Figure: hash list 1 holds s0 s1 s2 s3 s4 s5 s6 s7 and hash list 2 holds s4 s1 s3 s7 s2 s5 s6 s0; hash tree 1 (internal nodes a0 to a6) and hash tree 2 (internal nodes b0 to b6) are built over those two leaf orders.]
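Why sorting is needed for a hash list can be seen in a short sketch. The chunk contents, the use of SHA-1, and sorting by raw digest bytes are assumptions made for illustration; the slide does not say which sort key or hash function is used.

```python
import hashlib


def hash_list_signature(chunk_sigs):
    """Hash the concatenation of chunk signatures in the order given."""
    h = hashlib.sha1()
    for sig in chunk_sigs:
        h.update(sig)
    return h.hexdigest()


# Eight made-up chunk signatures s0..s7, then the arrival order of hash list 2.
sigs = [hashlib.sha1(f"chunk{i}".encode()).digest() for i in range(8)]
shuffled = [sigs[i] for i in (4, 1, 3, 7, 2, 5, 6, 0)]

unsorted_match = hash_list_signature(sigs) == hash_list_signature(shuffled)
sorted_match = (hash_list_signature(sorted(sigs))
                == hash_list_signature(sorted(shuffled)))
```

The same set of chunk signatures yields different hash-list signatures in different arrival orders, and agrees only after both lists are put into a canonical order.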

Bloom filter based signature aggregation approach
- Bloom filter
  - An array of bits, initialized to all 0s
  - Check membership
  - Two operations: insert and query
- Configuration parameters
  - Size: m
  - Number of hash functions: k
[Figure: example with m = 10 and k = 3. Insert {a, b}, then query {a, c}; the outcomes shown are "In the set!", "Not in the set!", and a false positive error.]
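A minimal Bloom filter with the two operations could look like the sketch below. The class name and the way the k hash functions are derived (SHA-256 over an index byte plus the item) are assumptions for illustration and are not taken from the slides.

```python
import hashlib


class BloomFilter:
    """Bit array of size m with k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = 0  # a Python int used as an array of m bits, all 0

    def _positions(self, item: bytes):
        # Assumed construction: one SHA-256 per hash index, reduced mod m.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def query(self, item: bytes) -> bool:
        # True means "possibly in the set"; False means "definitely not".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.insert(b"a")
bf.insert(b"b")
```

A query for an inserted item always returns True. A query for an item that was never inserted usually returns False, but returns True when all k of its positions happen to be set by other items, which is the false positive case.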

Bloom filter based signature aggregation approach
- Probabilistic nature: errors
  - A false negative error never happens
  - A false positive error occurs with probability p:
    - m: Bloom filter size, n: number of elements
    - Given n, a larger m results in a smaller p.
    - Given p, m increases linearly with n.

$$p = e^{-\frac{m}{n}(\ln 2)^2}, \qquad m = -\frac{n \ln p}{(\ln 2)^2}$$
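The two formulas can be checked numerically with a short sketch. The function names are hypothetical; the formulas correspond to the standard choice of the optimal number of hash functions, k = (m/n) ln 2, and the 1% target is an arbitrary example value.

```python
import math


def bloom_size(n: int, p: float) -> int:
    """Bits m needed to hold n elements at false positive probability p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def false_positive(m: int, n: int) -> float:
    """False positive probability p for m bits and n elements."""
    return math.exp(-(m / n) * math.log(2) ** 2)


n = 1_000_000
m = bloom_size(n, 0.01)           # bits for a 1% false positive target
m_double = bloom_size(2 * n, 0.01)  # doubling n roughly doubles m
p_check = false_positive(m, n)
```

For a fixed p the ratio m/n is constant, which is the linear growth stated on the slide.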

Bloom filter based signature aggregation approach
- Features
  - Independent of insertion order: inserting {A, B, C, D, E} and inserting {E, D, C, B, A} give the same filter
  - The OR of multiple filters represents the union of their sets: insert {A, B, C} OR insert {D, E} equals insert {A, B, C, D, E}
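Both features follow from the fact that insertion only sets bits. The sketch below demonstrates them with a compact, function-style filter; `bloom_bits`, the filter size, and the SHA-256 based hashing are assumptions for illustration.

```python
import hashlib


def bloom_bits(items, m: int = 1 << 16, k: int = 3) -> int:
    """Return the Bloom filter bit array (as an int) after inserting items."""
    bits = 0
    for item in items:
        for i in range(k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bits


forward = bloom_bits([b"A", b"B", b"C", b"D", b"E"])
backward = bloom_bits([b"E", b"D", b"C", b"B", b"A"])
union = bloom_bits([b"A", b"B", b"C"]) | bloom_bits([b"D", b"E"])
```

Here `forward`, `backward`, and `union` are the same integer, so each process can fill its own filter in whatever order work arrives and the partial filters can be combined with a bitwise OR.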

Bloom filter based signature aggregation approach
- P0: {(file1chunk1, s), (file2chunk2, s), (file2chunk3, s), (file3chunk2, s)}
- P1: {(file1chunk2, s), (file1chunk3, s), (file2chunk1, s), (file3chunk4, s)}
- P2: {(file1chunk4, s), (file1chunk6, s), (file2chunk6, s), (file3chunk3, s)}
- P3: {(file1chunk5, s), (file2chunk4, s), (file2chunk5, s), (file3chunk1, s)}
- OR result

Bloom filter based signature aggregation approach
- Parameter setting and error probability
  - Error case: two different datasets have the same signature
  - Suppose there are r different signatures
    - $C_1 = \{x_1, x_2, \ldots, x_r, c_r, c_{r+1}, \ldots, c_n\}$
    - $C_2 = \{y_1, y_2, \ldots, y_r, c_r, c_{r+1}, \ldots, c_n\}$
    - Given r, the error probability is $p^{2r}$, where p is the false positive probability of the Bloom filter
  - Worst case: r = 1
    - The number of errors follows a binomial distribution: $e \sim B(n, p^2)$
    - $P(e = 0) = (1 - p^2)^n$
    - $p = e^{-\frac{m}{n}(\ln 2)^2}$
[Figure: the relationship between P(e=0) and m/n.]
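The worst-case relationship between P(e=0) and m/n can be evaluated directly from the two formulas. The function name is hypothetical, and the example values of m/n (16 and 32 bits per element) are arbitrary choices; n is the chunk count of Dataset 1 from the evaluation.

```python
import math


def p_no_error(bits_per_element: float, n: int) -> float:
    """P(e = 0) = (1 - p^2)^n with p = exp(-(m/n) * (ln 2)^2)."""
    p = math.exp(-bits_per_element * math.log(2) ** 2)
    # log1p keeps precision when p^2 is very small.
    return math.exp(n * math.log1p(-p * p))


n = 28_343_725                 # number of chunks in Dataset 1
low = p_no_error(16, n)        # m/n = 16 bits per element
high = p_no_error(32, n)       # m/n = 32 bits per element
```

Under these formulas, P(e=0) rises steeply with m/n: 16 bits per element is far too small for this n, while 32 bits per element already pushes P(e=0) above 99.99%.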

Evaluation
- Two datasets
  - Scientific datasets generated on Spider 2

                     Dataset 1     Dataset 2
Total size           5.39 TiB      14.74 TiB
Number of files      28,114,281    15,590
Average file size    205.83 KiB    1.19 GiB
Chunk size           16 MiB        64 MiB
Number of chunks     28,343,725    251,629

Evaluation
- Scalability test
  - The processing rate increases with the number of processes until it is bounded by I/O bandwidth
  - Handling large files is more efficient than handling small files, which is potentially bounded by metadata retrieval in the Lustre file system
[Figures: processing time; processing rate.]

Evaluation
- Function verification
  - Corrupt the dataset
    - 100 times, corrupting a single byte in a single file each time
  - Compare the signatures of the original and the corrupted dataset
  - Results: all the signatures are different
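The verification procedure can be imitated on a tiny in-memory dataset. Everything in this sketch is an assumption for illustration: the 4-byte chunk size, the element format (file, offset, chunk SHA-1), the filter parameters, and the final step of hashing the bit array into a dataset-level signature. It is not fsum's implementation.

```python
import hashlib

CHUNK = 4        # tiny chunk size in bytes, for illustration only
M = 1 << 20      # Bloom filter size in bits (assumed)
K = 3            # number of hash functions (assumed)


def dataset_signature(files: dict) -> str:
    """Insert every (file, offset, chunk hash) into a filter, then hash it."""
    bits = 0
    for name, data in files.items():
        for offset in range(0, len(data), CHUNK):
            chunk_sig = hashlib.sha1(data[offset:offset + CHUNK]).hexdigest()
            element = f"{name}:{offset}:{chunk_sig}".encode()
            for i in range(K):
                digest = hashlib.sha256(bytes([i]) + element).digest()
                bits |= 1 << (int.from_bytes(digest[:8], "big") % M)
    return hashlib.sha1(bits.to_bytes(M // 8, "big")).hexdigest()


original = {"file1": b"abcdefghijkl", "file2": b"0123456789"}

# Corrupt a single byte in a single file.
corrupted = dict(original)
damaged = bytearray(corrupted["file1"])
damaged[5] ^= 0xFF
corrupted["file1"] = bytes(damaged)

# Same content, files visited in the opposite order.
reordered = dict(reversed(list(original.items())))
```

Comparing `dataset_signature(original)` with `dataset_signature(corrupted)` mirrors the test on the slide, while `reordered` checks that the visiting order of files does not change the signature.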

Evaluation
- Compare with related approaches
[Figures: memory usage compared with sorting (P(e=0) = 99.99%); runtime compared with sha1 (a 500GiB file, 4 processes). Annotations on the figures: 5.8x, 2.6x, 7.1x, 4.4x.]

Conclusions
- Present the design and implementation of a scalable parallel checksumming tool, fsum, for large-scale datasets
- Design a Bloom filter based signature aggregation approach and analyze the relationship between error probability and parameter selection
- Test on representative, real production datasets and demonstrate that fsum exhibits near-linear scalability and is able to detect data corruption
- fsum is more memory- and time-efficient than the other approaches compared
Code available at github.com/olcf/pcircle

Thanks!