Sector Discrimination: Sector Identification with Similarity Digest Fingerprints


Sector Discrimination: Sector Identification with Similarity Digest Fingerprints
Vassil Roussev (vassil@cs.uno.edu)

Problem: given a set of fragments, identify the original artifact.
- Source objects (files)
- Disk fragments (sectors)
- Network fragments (packets)
Fragments of interest are 1-4 KB in size; fragment alignment is arbitrary, and fragment size may vary.

Key idea: generate a similarity digest to enable approximate matching. SD fingerprint features:
- Accuracy: >99% identification (@ 4 KB)
- Efficiency: ~3% of the original data
- Scalability: compare objects of any size
- Performance: expected to be I/O-bound (100 MB/s)

SD fingerprint: local representation using statistically improbable features. Each Bloom filter (BF) represents, on average, 8 KB of data.
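As a concrete illustration of how features end up in a filter, here is a minimal Python sketch that packs 64-byte features into a Bloom filter held in a plain integer bitmask; the filter width, the number of bit positions per feature, and the SHA-1 slicing are illustrative assumptions, not sdhash's actual parameters.

```python
import hashlib

BF_BITS = 2048  # one 256-byte filter, as described in the talk

def bf_insert(bf: int, feature: bytes, k: int = 5) -> int:
    """Set k bit positions derived from a SHA-1 digest of the feature."""
    d = hashlib.sha1(feature).digest()
    for i in range(k):
        pos = int.from_bytes(d[2 * i:2 * i + 2], "big") % BF_BITS
        bf |= 1 << pos
    return bf

def bf_from_features(features) -> int:
    """Fold a collection of selected features into one Bloom filter."""
    bf = 0
    for f in features:
        bf = bf_insert(bf, f)
    return bf

# Toy usage: treat consecutive 64-byte windows of a buffer as features.
data = bytes(range(256)) * 16
features = [data[i:i + 64] for i in range(0, len(data), 64)]
print(bin(bf_from_features(features)).count("1"), "bits set")
```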

SD comparison is based on all-pairs comparison of BFs. For fragments up to 8 KB, the SD fingerprint is a single Bloom filter.

SD comparison is approximate: a fragment's features may be represented in two BFs. Bloom filters are compared bitwise; greater bit overlap signifies greater overlap between the respective data sources.
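A correspondingly minimal comparison for two such integer bitmask filters: the popcount of the bitwise AND gives the raw overlap. The actual sdhash score additionally normalizes for how full each filter is and subtracts the overlap expected by chance, which this sketch omits.

```python
def bf_overlap(bf1: int, bf2: int) -> int:
    """Bits set in both filters: popcount of the bitwise AND."""
    return bin(bf1 & bf2).count("1")

# Toy example: two filters that happen to share five bit positions.
a = 0b1011_0110_0101
b = 0b0011_0100_1101
print(bf_overlap(a, b))  # -> 5
```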

Improved feature selection is achieved by filtering out low-entropy content, i.e., data with low information content.

Results: Success & Error Rates. Fragments are either correctly classified, misclassified (FP + FN), or not classified. The algorithm favors rejection of weak data (low-entropy fragments) over false positives.

Test Cases
- 7 x 100 MB sets: doc, html, jpg, rnd, pdf, txt, xls
- Four fragment sizes: 512, 1024, 2048, and 4096 bytes

Detection Rates: > 0.999 for C_t = 20

Non-classification Rates

Typical Misclassification (MC) Behavior: 0 < MC <= 0.005 for C_t = 20

Summary of Misclassification Rate Ranges

Conclusions
- Developed a robust, scalable fragment identification methodology
- Accuracy >99%, due to filtering of weak features
- Implementation: http://www.cs.uno.edu/~vassil/sdhash
- The same tool can be used to detect file versions, such as updated libraries/executables

Future Work
- Performance optimization: 100 MB/s
- Hash the NSRL and other corpora
- Evaluate effectiveness of version detection
- Combine with sector discrimination
- Multi-resolution implementation

An Evaluation of Forensic Similarity Hashes
Vassil Roussev (vassil@cs.uno.edu)
Digital Forensic Research Conference (DFRWS), Aug 1-3, 2011, New Orleans, LA

Agenda
- Intro: motivation, problems, goals, requirements
- High-level tool design
- Evaluation studies
- Current/planned sdhash infrastructure
- Quick demo (time permitting)
- Q & A

Motivation: Traditional Filtering Approaches Fail
- Known file filtering:
  - Crypto-hash known files, store in a library (e.g., NSRL)
  - Hash files on the target
  - Filter in/out depending on interest
- Challenges:
  - Static libraries are falling behind: dynamic software updates and trivial artifact transformations mean we need version correlation
  - Need to find embedded objects: block/file in file/volume/network trace
  - Need higher-level correlations: disk-to-RAM, disk-to-network

Similarity Hash Requirements/Scenarios
- Identification of embedded/trace evidence: needle in a haystack
- Identification of code versions: file-to-file correlation
- Identification of related documents: file-to-file correlation
- Correlation of RAM and disk sources: different representations of the same objects
- Correlation of network and disk sources: fragmentation/alignment issues, no flow reconstruction

Existing Similarity Hashing: ssdeep
- Context-triggered piecewise hashing (CTPH)
  - Developed by Jesse Kornblum (2006)
  - An adaptation of an early spam-filtering algorithm
- General idea:
  - Break up the file into chunks
  - Generate a 6-bit hash for each chunk
  - Concatenate the hashes to obtain the file signature, e.g.:
    24576:fBovHm8YnR/tDn7uSt8P8SRLAD/5Qvhfpt8P8SRLm:mvHKnx5C868MAD/5uz68Mm,"file.pdf"
  - Treat the signatures as strings; use edit distance to estimate similarity (see the sketch below)
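For intuition only, a rough Python sketch of the CTPH idea described above: a rolling hash over a small window decides chunk boundaries, each chunk contributes one base64 (6-bit) symbol, and signatures are compared by edit distance. The rolling hash, trigger condition, window size, and chunk hash below are simplified stand-ins, not Kornblum's actual algorithm.

```python
import hashlib
from collections import deque

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def ctph_signature(data: bytes, block_size: int = 64, window: int = 7) -> str:
    """Emit one 6-bit (base64) symbol per context-triggered chunk."""
    sig, chunk = [], hashlib.md5()
    win, rolling = deque(maxlen=window), 0
    for byte in data:
        if len(win) == window:
            rolling -= win[0]                        # slide the window
        win.append(byte)
        rolling += byte                              # toy rolling hash: windowed byte sum
        chunk.update(bytes([byte]))
        if rolling % block_size == block_size - 1:   # context trigger
            sig.append(B64[chunk.digest()[0] % 64])  # 6-bit chunk hash
            chunk = hashlib.md5()
    sig.append(B64[chunk.digest()[0] % 64])
    return "".join(sig)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two signature strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# A small localized change leaves most chunk symbols intact.
base = bytes(range(256)) * 16
mod = base[:2000] + b"XXXX" + base[2004:]
print(edit_distance(ctph_signature(base), ctph_signature(mod)))
```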

ssdeep: Problems
- Methodology (random polynomial fingerprinting):
  - Works well on mid-/high-entropy data (text, compressed data)
  - Degenerates on lower-entropy data: uneven coverage, many false positives
  - Difficult to fix
- Design:
  - Fixed-size signature (does not scale)
  - Distance metric choice (edit distance) is questionable
=> Fixes essentially require a new tool

sdhash: Similarity Digests
- Terminology: a feature is a 64-byte sequence (other variations are possible)
- Idea:
  - Consider all features: compute a rolling entropy measure
  - Filter out low-entropy and extreme high-entropy features
  - From each neighborhood, pick the rarest ones, based on entropy score and empirical observations
  - Hash the selected features and put them into a Bloom filter (a probabilistic, compressed set representation)
  - Create more filters as necessary
  - The signature is a sequence of Bloom filters (a condensed sketch follows)
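Pulling those steps together, here is a condensed sketch of the selection pipeline, assuming Shannon entropy over each 64-byte window and a naive "best score per neighborhood" rule; the entropy thresholds, neighborhood size, and number of bit positions per feature are placeholder values, not sdhash's calibrated tables.

```python
import hashlib
import math
from collections import Counter

WIN = 64          # feature size, per the talk
BF_BITS = 2048    # one 256-byte Bloom filter
MAX_PER_BF = 128  # features per filter, per the talk

def entropy(window: bytes) -> float:
    """Shannon entropy (bits/byte) of one window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

def select_features(data: bytes, low: float = 2.0, high: float = 7.9,
                    hood: int = 64) -> list:
    """Score every window, drop low/extreme-entropy ones, and keep the
    best-scoring window from each neighborhood of `hood` positions."""
    scores = []
    for i in range(len(data) - WIN + 1):
        h = entropy(data[i:i + WIN])
        scores.append(h if low < h < high else -1.0)
    picked = []
    for start in range(0, len(scores), hood):
        block = scores[start:start + hood]
        best = max(range(len(block)), key=block.__getitem__)
        if block[best] > 0:
            picked.append(data[start + best:start + best + WIN])
    return picked

def similarity_digest(data: bytes) -> list:
    """Hash selected features into a sequence of Bloom filters."""
    filters, bf, count = [], 0, 0
    for feat in select_features(data):
        d = hashlib.sha1(feat).digest()
        for i in range(5):  # 5 bit positions per feature (placeholder)
            bf |= 1 << (int.from_bytes(d[2 * i:2 * i + 2], "big") % BF_BITS)
        count += 1
        if count == MAX_PER_BF:
            filters.append(bf)
            bf, count = 0, 0
    if count:
        filters.append(bf)
    return filters

# Toy usage: digest a 4 KB synthetic buffer.
print(len(similarity_digest(bytes(range(256)) * 16)), "Bloom filter(s)")
```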

Feature Selection (figure: data stream with selected features vs. filtered-out features)

Similarity Digest Signature: up to 128 features are selected from each ~7-8 KB chunk of data and hashed into one 256-byte Bloom filter (f_1, f_2, f_3, ...). On average, a 256-byte filter represents a 7-8 KB chunk of the original data; the digest size is ~3% of the original data (and could be smaller). No original data is stored.
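The ~3% figure follows directly from these numbers: each 256-byte filter stands in for roughly 8 KB of source data, so

```latex
\frac{\text{filter size}}{\text{data represented}}
  \approx \frac{256~\text{B}}{8192~\text{B}} \approx 3.1\%
```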

Similarity Digest Comparison
Given digests F = (f_1, ..., f_n) and G = (g_1, ..., g_m), each filter f_i is compared against every g_j; the best match max_{j=1..m} D(f_i, g_j) is kept, and the overall score is the average of these per-filter maxima:
S = Avg_{i=1..n} [ max_{j=1..m} D(f_i, g_j) ]
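In code, the comparison reduces to a few lines; a sketch assuming `D` is any filter-to-filter score, for instance a normalized version of the raw overlap shown earlier:

```python
def digest_score(F, G, D):
    """S = avg_i max_j D(f_i, g_j), averaged over the shorter digest."""
    if len(F) > len(G):
        F, G = G, F
    best = [max(D(f, g) for g in G) for f in F]
    return sum(best) / len(best)

# Toy usage with int-bitmask filters and raw popcount overlap.
overlap = lambda x, y: bin(x & y).count("1")
print(digest_score([0b1010, 0b1100], [0b1000, 0b0110], overlap))  # -> 1.0
```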

ssdeep vs. sdhash, Round 1: Controlled Study
- Controlled study:
  - All targets generated using random data
  - Allows for precise control of common data
  - Provides a baseline for the tools' capabilities
  - Best-case scenario
- Scenarios:
  - Embedded object detection
  - Single-common-block file correlation
  - Multiple-common-blocks file correlation

Embedded Object Detection
- Scenario implementation:
  - Generate a target and an object
  - Place the object randomly in the target
  - Run the tools on <object, target>
  - Do 1,000 runs, changing target, object, and placement
- Evaluation criterion:
  - Given a target of fixed size, what is the smallest embedded object that can be reliably detected?
  - Reliable detection == 95%+ successful correlations
A sketch of one such trial follows.
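A sketch of how one such trial could be scripted; the `compare` callable is a placeholder for invoking ssdeep or sdhash on the two buffers (the slides do not give the exact command lines), and the detection threshold is an illustrative assumption.

```python
import os
import random

def make_trial(target_size: int, object_size: int,
               rng: random.Random = random.Random(0)):
    """One <object, target> pair: a random object embedded at a random
    offset inside an otherwise random target."""
    obj = os.urandom(object_size)
    target = bytearray(os.urandom(target_size))
    off = rng.randrange(0, target_size - object_size + 1)
    target[off:off + object_size] = obj
    return obj, bytes(target)

def detection_rate(compare, target_size: int, object_size: int,
                   runs: int = 1000, threshold: int = 21) -> float:
    """Fraction of runs in which compare(object, target) meets the
    threshold; compare() stands in for running the similarity tool."""
    hits = 0
    for _ in range(runs):
        obj, target = make_trial(target_size, object_size)
        hits += compare(obj, target) >= threshold
    return hits / runs
```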

Min Embedded Block Correlation (KB): smaller is better (* = max values tested)

Single-Common-Block Correlation
- Scenario implementation:
  - Generate targets T1 and T2 and an object
  - Place the object randomly in both targets
  - Run the tools on <T1, T2>
  - Do 1,000 runs, changing targets, object, and placement
- Evaluation criterion:
  - Given targets of fixed size, what is the smallest embedded object that can be reliably detected?
  - Reliable detection == 95%+ successful correlations

Min Common Block Correlation (KB), ssdeep vs. sdhash, for target sizes 256-4096 (smaller is better) [bar chart]

Multiple-Common-Blocks Correlation
- Scenario implementation:
  - Generate targets and an object; split the object into 4/8 pieces
  - Place the pieces randomly in both targets
  - Run the tools on <T1, T2>
  - Do 1,000 runs, changing targets, object, and placement
- Evaluation criterion:
  - Given targets of fixed size and an object size of 1/2 the target size, what is the probability that a tool will detect it?

Multiple Common Block Correlation (fraction): BIGGER is better

ssdeep vs. sdhash, Round 2: Real Data Study
- Real files from the NPS GovDocs1 corpus; fundamentally, a user study
- Q: How does byte-level correlation map to human-perceived artifact correlation? Not all commonality is reflected at the semantic level.
- Related files defined as:
  - Versions of the same file
  - Shared format/content (e.g., web layout, JPEG)
  - Flash evaluation: similarity obvious within 30 seconds

Real Data Study
- The T5 set:
  - GovDocs1 sample: 000-004
  - 4,457 files, 1.8 GB total
  - File sizes: 4 KB - 16.4 MB
- Evaluation: for all unique pairs (~10 million), run ssdeep, run sdhash, and evaluate positive results manually

Evaluation Statistics

The Raw Numbers

Recall Rates: TP/Total

Precision Rates: TP/(TP+FP)

ssdeep: FP & TP scores substantially overlap; thresholds cannot be used for a ROC trade-off.

sdhash: FP & TP scores are separable (the figure marks the threshold used in the study); thresholding is effective in cheaply eliminating FPs.

Example ssdeep false positives (score: 54-86)

Evaluation Summary
- New hashing scheme based on similarity digests:
  - Scalable, robust, parallelizable
  - Evaluated under controlled & realistic conditions
  - Outperforms existing work by a wide margin: recall 95% vs. 55%, precision 94% vs. 68%
  - Graceful behavior at the margin: the similarity score behaves intuitively, and scores drop gradually as detection limits are approached
  - Meets at least three of the requirements; more evaluation is needed for disk/network & disk/RAM

Current Throughput (ver 1.3)
- Hash generation rate:
  - Six-core Intel Xeon X5670 @ 2.93 GHz: ~27 MB/s per core
  - Quad-core Intel Xeon @ 2.8 GHz: ~20 MB/s per core
- Hash comparison: 1 MB vs. 1 MB in 0.5 ms
- T5 corpus (4,457 files, all pairs): ~10 million file comparisons in ~15 min on a single core (~667K file comparisons per minute)

The Envisioned Architecture (diagram): a libsd core library; CLI tools for files, disk, and network; a service with cluster and client components; API bindings for C/C++, C#, and Python.

The Current State (diagram): the subset of the envisioned architecture implemented so far.

Todo List (1)
- libsdbf:
  - Ver 2.0 rewrite
  - Full parallelization (TBB?)
  - Compression (?)
- sdhash-file:
  - More command-line options / compatibility with ssdeep
  - Parallel processing
  - Service-based processing (with sdbf_d)
- sdhash-pcap:
  - Pcap-aware processing: payload extraction, file discovery, timelining

Todo List (2)
- sdhash-dd: block-aware processing, compression
- sdbf_d:
  - Persistence: XML
  - Service interface: JSON
  - Server clustering
- sdbfweb: browser-based management/query
- sdbfviz: large-scale visualization & clustering

Further Development
- Integration with RDS:
  - sdhash-set: construct SDBFs from existing SHA1 sets
  - Compare/identify whole folders, distributions, etc.
- Structural feature selection (e.g., exe/dll, pdf, zip, ...)
- Optimizations:
  - Sampling
  - Skipping (under a minimum-continuous-block assumption)
  - Cluster core extraction/comparison
  - GPU acceleration
- Representation:
  - Multi-resolution digests
  - New crypto hashes
  - Data offsets

Thank you!
- http://roussev.net/sdhash
  - wget http://roussev.net/sdhash/sdhash-1.3.zip
  - make
  - ./sdhash
- References:
  - V. Roussev, "Data Fingerprinting with Similarity Digests," in K.-P. Chow, S. Shenoi (Eds.), Advances in Digital Forensics VI, IFIP AICT 337, pp. 207-225, 2010
  - V. Roussev, "An Evaluation of Forensic Similarity Hashes," DFRWS 2011
- Contact: Vassil Roussev, vassil@roussev.net
- Q & A