Sector Discrimination: Sector Identification with Similarity Digest Fingerprints
Vassil Roussev
vassil@cs.uno.edu
Problem: given a set of fragments, identify the original artifact.
Source objects (files) vs. disk fragments (sectors) and network fragments (packets)
Fragments of interest are 1-4KB in size
Fragment alignment is arbitrary; fragment size may vary.
Key idea: generate a similarity digest to enable approximate matching.
SD fingerprint features:
Accuracy: >99% identification (@4KB)
Efficiency: ~3% of the original data
Scalability: compare objects of any size
Performance: expected to be I/O-bound (100 MB/s)
SD fingerprint: local representation using statistically improbable features.
Each Bloom filter (BF) represents, on average, 8KB.
SD comparison is based on all-pairs comparison of BFs.
For fragments up to 8KB, the SD fingerprint is a single Bloom filter.
SD comparison is approximate: a fragment may be represented across two BFs.
Bloom filters are compared bitwise; greater bit overlap indicates greater commonality between the respective data sources.
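The bitwise comparison idea can be sketched as follows. This is a simplified illustration only: the hash function, filter size, and number of bit positions are stand-ins, not sdhash's actual parameters.

```python
# Sketch of bitwise Bloom filter comparison: insert hashed features into
# fixed-size filters, then count overlapping bits with AND + popcount.
import hashlib

FILTER_BITS = 2048  # a 256-byte filter, as in the slides

def make_filter(features, k=5):
    """Insert each feature into the filter at k hash-derived bit positions."""
    bf = 0
    for feat in features:
        digest = hashlib.sha1(feat).digest()
        for i in range(k):
            # Take 2 bytes of the digest per index; modulo maps into the filter.
            idx = int.from_bytes(digest[2 * i:2 * i + 2], "big") % FILTER_BITS
            bf |= 1 << idx
    return bf

def overlap(bf1, bf2):
    """Number of bits set in both filters: a crude commonality signal."""
    return bin(bf1 & bf2).count("1")

a = make_filter([b"feature-one", b"feature-two", b"feature-three"])
b = make_filter([b"feature-one", b"feature-two", b"unrelated"])
c = make_filter([b"totally", b"different", b"content"])
assert overlap(a, b) > overlap(a, c)  # shared features => more common bits
```

Shared features set identical bit positions in both filters, so filters built from overlapping data sources share substantially more set bits than filters built from unrelated data.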
Improved feature selection is achieved by filtering out low-entropy content (data with low information content).
Results: Success & Error Rates
All fragments fall into one of three outcomes: correctly classified; misclassified (FP + FN); not classified.
The algorithm favors rejection of weak data (low-entropy fragments) over false positives.
Test Cases
7 x 100MB sets: doc, html, jpg, rnd, pdf, txt, xls
Four fragment sizes: 512, 1024, 2048, & 4096 bytes
Detection Rates
> 0.999 for Ct = 20
Non-classification Rates
Typical Misclassification (MC) Behavior
0 < MC <= 0.005 for Ct = 20
Summary of Misclassification Rate Ranges
Conclusions
Developed a robust, scalable fragment identification methodology.
Accuracy >99%, due to filtering of weak features
Implementation: http://www.cs.uno.edu/~vassil/sdhash
The same tool can be used to detect file versions, such as updated libraries/executables.
Future Work
Performance optimization: 100 MB/s
Hash the NSRL and other corpora
Evaluate effectiveness of version detection
Combine with sector discrimination
Multi-resolution implementation
Digital Forensic Research Conference
Aug 1-3, 2011, New Orleans, LA
An Evaluation of Forensic Similarity Hashes
Vassil Roussev
vassil@cs.uno.edu
Agenda
Ø Intro
o Motivation, problems, goals, requirements
Ø High-level tool design
Ø Evaluation studies
Ø Current/planned sdhash infrastructure
Ø Quick demo (time permitting)
Ø Q & A
Motivation: Traditional Filtering Approaches Fail
Ø Known file filtering:
o Crypto-hash known files, store in library (e.g. NSRL)
o Hash files on target
o Filter in/out depending on interest
Ø Challenges
o Static libraries are falling behind: dynamic software updates and trivial artifact transformations è we need version correlation
o Need to find embedded objects: block/file in file/volume/network trace
o Need higher-level correlations: disk-to-RAM, disk-to-network
Similarity Hash Requirements/Scenarios
Ø Identification of embedded/trace evidence
o Needle in a haystack
Ø Identification of code versions
o File-to-file correlation
Ø Identification of related documents
o File-to-file correlation
Ø Correlation of RAM and disk sources
o Different representations of the same objects
Ø Correlation of network and disk sources
o Fragmentation/alignment issues
o No flow reconstruction
Existing Similarity Hashing: ssdeep
Ø Context-triggered piecewise hashing
o Developed by Jesse Kornblum (2006)
o An adaptation of an early spam-filtering algorithm
Ø General idea
o Break up the file into chunks
o Generate a 6-bit hash for each chunk
o Concatenate the hashes to obtain the file signature:
24576:fBovHm8YnR/tDn7uSt8P8SRLAD/5Qvhfpt8P8SRLm:mvHKnx5C868MAD/5uz68Mm,"file.pdf"
o Treat the signatures as strings; use edit distance to estimate similarity
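The general idea above can be sketched as follows. This is a simplified stand-in, not ssdeep itself: the real tool uses a fast rolling hash over a 7-byte window, an FNV chunk hash, and two block sizes derived from the file length.

```python
# Minimal sketch of context-triggered piecewise hashing: a context value
# computed over a sliding window triggers chunk boundaries; each chunk
# contributes one base64 character (6 bits) to the signature.
import hashlib

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def ctph(data: bytes, block_size: int = 64, window: int = 7) -> str:
    sig = []
    start = 0
    for i in range(len(data)):
        # Context over the last `window` bytes (md5 here stands in for a
        # true rolling hash, which would be O(1) per byte).
        ctx = int.from_bytes(
            hashlib.md5(data[max(0, i - window + 1):i + 1]).digest()[:4], "big")
        if ctx % block_size == block_size - 1:  # context-triggered boundary
            # Keep 6 bits of the chunk hash, encoded as one character.
            sig.append(B64[hashlib.md5(data[start:i + 1]).digest()[0] & 0x3F])
            start = i + 1
    if start < len(data):
        sig.append(B64[hashlib.md5(data[start:]).digest()[0] & 0x3F])
    return "".join(sig)

s1 = ctph(b"The quick brown fox jumps over the lazy dog. " * 50)
s2 = ctph(b"The quick brown fox jumps over the lazy dog. " * 50)
assert s1 == s2  # deterministic signature
```

Because boundaries depend only on local context, a local edit changes only the signature characters for the chunks it touches, which is what makes edit-distance comparison of signatures meaningful.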
ssdeep: Problems
Ø Methodology (random polynomial fingerprinting)
o Works well on mid-/high-entropy data: text, compressed data
o Degenerates on lower-entropy data: uneven coverage, many false positives
o Difficult to fix
Ø Design
o Fixed-size signature (does not scale)
o Distance metric choice (edit distance) is questionable
è Fixes essentially require a new tool
sdhash: Similarity Digests
Ø Terminology:
o Feature: a 64-byte sequence (other variations are possible)
Ø Idea:
o Consider all features: compute a rolling entropy measure
o Filter out low-entropy and extreme high-entropy ones
o From each neighborhood, pick the rarest ones, based on entropy score and empirical observations
o Hash selected features and put them into a Bloom filter (a probabilistic, compressed set representation)
o Create more filters as necessary
o The signature is a sequence of Bloom filters
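The selection steps above can be sketched like this. The 64-byte window matches the slide's feature length, but the entropy thresholds and the "pick the rarest survivor per neighborhood" rule are illustrative stand-ins for sdhash's empirically tuned precedence-rank logic.

```python
# Sketch of entropy-based feature selection: score every 64-byte window,
# drop low-entropy and near-maximal-entropy windows, then keep one
# "rarest" survivor per neighborhood.
import math
from collections import Counter

FEATURE_LEN = 64

def entropy(window: bytes) -> float:
    """Shannon entropy of a byte window, in bits per byte."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())

def select_features(data: bytes, neighborhood: int = 64,
                    low: float = 1.0, high: float = 5.9):
    """Return (offset, feature) pairs: one surviving feature per neighborhood."""
    scored = []
    for off in range(len(data) - FEATURE_LEN + 1):
        w = data[off:off + FEATURE_LEN]
        h = entropy(w)
        if low < h < high:  # filter out low / extreme-high entropy windows
            scored.append((h, off, w))
    selected = []
    for i in range(0, len(scored), neighborhood):
        hood = scored[i:i + neighborhood]
        # "Rarest" stand-in: lowest surviving entropy score in the neighborhood.
        _, off, w = min(hood)
        selected.append((off, w))
    return selected

# A maximal-entropy body followed by an all-zero tail: windows fully inside
# either region are filtered out, so only the mixed boundary region survives.
data = bytes(range(256)) * 8 + b"\x00" * 512
feats = select_features(data)
assert all(1900 < off < 2048 for off, _ in feats)
```

The test input illustrates both filters at once: pure zero windows score 0 bits (too low), windows of 64 distinct bytes score exactly 6 bits (extreme high), and only windows straddling the boundary fall in between.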
Feature Selection
(figure: data stream with selected features vs. features filtered out)
Similarity Digest Signature
(figure: each ~7-8KB of data yields up to 128 features, hashed into a 256-byte Bloom filter; the digest is the filter sequence f1, f2, f3, f4, ...)
On average, a 256-byte filter represents a 7-8KB chunk of the original data.
Digest size is ~3% of the original data (could be smaller). No original data is stored.
Similarity Digest Comparison
Given digests F = f1..fn and G = g1..gm (n <= m), every filter fi is compared against every gj:
Di = max over j=1..m of D(fi, gj)
S(F, G) = Avg over i=1..n of Di
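In code, the comparison amounts to a max over one digest's filters and an average over the other's. The D() below is a simple normalized bit-overlap stand-in, not sdhash's actual filter-distance function.

```python
# Sketch of the all-pairs digest comparison: best match per filter of the
# shorter digest, averaged into the final score.

def popcount(x: int) -> int:
    return bin(x).count("1")

def D(f: int, g: int) -> float:
    """Normalized overlap between two Bloom filters (0..1, a stand-in)."""
    denom = min(popcount(f), popcount(g))
    return popcount(f & g) / denom if denom else 0.0

def sd_score(F, G):
    """S = Avg over f_i of (max over g_j of D(f_i, g_j)); F is the shorter digest."""
    if len(F) > len(G):
        F, G = G, F
    return sum(max(D(f, g) for g in G) for f in F) / len(F)

# Toy digests of two filters each: one filter matches exactly, one barely.
F = [0b101100, 0b011010]
G = [0b101100, 0b000001]
assert sd_score(F, F) == 1.0          # identical digests score 1.0
assert abs(sd_score(F, G) - 2/3) < 1e-9  # (1.0 + 1/3) / 2
```

Taking the max per filter makes the score robust to reordering and to extra unrelated content in the larger digest; averaging over the shorter digest keeps it in a fixed 0..1 range.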
ssdeep vs. sdhash Round 1: Controlled Study
Ø Controlled study
o All targets generated using random data
o Allows for precise control of common data
o Provides a baseline for the tools' capabilities
o Best-case scenario
Ø Scenarios
o Embedded object detection
o Single-common-block file correlation
o Multiple-common-blocks file correlation
Embedded Object Detection
(figure: object embedded inside a larger target)
Ø Scenario implementation
o Generate target & object
o Place object randomly in target
o Run tools on <object, target>
o Do 1,000 runs, changing target, object, and placement
Ø Evaluation criterion
o Given: target of fixed size
o Q: What is the smallest embedded object that can be reliably detected?
o Reliable detection == 95%+ successful correlations
Min Embedded Block Correlation (KB) (smaller is better)
(* max values tested)
Single-Common-Block Correlation
(figure: one common block shared by targets T1 and T2)
Ø Scenario implementation
o Generate targets & object
o Place object randomly in both targets
o Run tools on <T1, T2>
o Do 1,000 runs, changing targets, object, and placement
Ø Evaluation criterion
o Given: targets of fixed size
o Q: What is the smallest common block that can be reliably detected?
o Reliable detection == 95%+ successful correlations
Min Common Block Correlation (KB) (smaller is better)
(bar chart over target sizes 256-4096; chart labels: ssdeep 80-624KB, sdhash 16-96KB)
Multiple-Common-Blocks Correlation
(figure: object split into pieces scattered across T1 and T2)
Ø Scenario implementation
o Generate targets & object; split object into 4/8 pieces
o Place pieces randomly in both targets
o Run tools on <T1, T2>
o Do 1,000 runs, changing targets, object, and placement
Ø Evaluation criterion
o Given: targets of fixed size, object size = ½ target size
o Q: What is the probability that a tool will detect it?
Multiple Common Block Correlation (Fraction) (BIGGER is better)
ssdeep vs. sdhash Round 2: Real Data Study
Ø Real files from the NPS GovDocs1 corpus
o Fundamentally, a user study
Ø Q: How does byte-level correlation map to human-perceived artifact correlation?
o Not all commonality is reflected at the semantic level
Ø Related files defined as:
o Versions of the same file
o Shared format/content (e.g. web layout, JPEG)
o Flash evaluation: similarity obvious within 30 sec
Real Data Study
Ø The T5 set
o GovDocs1 sample: 000-004
o 4,557 files, 1.8GB total
o File sizes: 4KB-16.4MB
Ø Evaluation
o For all unique pairs (~10 mln.):
Run ssdeep
Run sdhash
Evaluate positive results manually
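As a quick sanity check on the "~10 mln." figure, the number of unique unordered pairs of 4,557 files is n(n-1)/2:

```python
# Unique file pairs in the T5 set.
n = 4557
pairs = n * (n - 1) // 2
print(pairs)  # 10380846 -- i.e., ~10.4 million comparisons
```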
Evaluation Statistics
The Raw Numbers
Recall Rates: TP/Total
Precision Rates: TP/(TP+FP)
ssdeep: FP & TP scores substantially overlap
Ø Cannot use thresholds for an ROC trade-off
sdhash: FP & TP scores are separable
(chart annotation: threshold used in the study)
Ø Thresholding is effective in cheaply eliminating FPs
Example ssdeep false positives (score: 54-86)
Evaluation Summary
Ø New hashing scheme based on similarity digests
o Scalable, robust, parallelizable
o Evaluated under controlled & realistic conditions
o Outperforms existing work by a wide margin: recall 95% vs. 55%; precision 94% vs. 68%
o Graceful behavior at the margin: intuitive similarity score; scores drop gradually as detection limits are approached
o Meets at least three requirements; more evaluation needed for disk/network & disk/RAM
Current Throughput (ver 1.3)
Ø Hash generation rate
o Six-core Intel Xeon X5670 @ 2.93GHz: ~27MB/s per core
o Quad-core Intel Xeon @ 2.8GHz: ~20MB/s per core
Ø Hash comparison
o 1MB vs. 1MB: 0.5ms
Ø T5 corpus (4,557 files, all pairs)
o 10 mln file comparisons in ~15 min (~667K file comparisons per minute, single core)
The Envisioned Architecture
(diagram: libsd core; CLI front-ends for files, disk, and network; service with cluster and client nodes; API bindings for C/C++, C#, Python)
The Current State
(diagram: the subset of the envisioned architecture implemented so far)
Todo List (1)
Ø libsdbf
o Ver 2.0 rewrite
o Full parallelization (TBB?)
o Compression (?)
Ø sdhash-file
o More command-line options / compatibility w/ ssdeep
o Parallel processing
o Service-based processing (w/ sdbf_d)
Ø sdhash-pcap
o Pcap-aware processing: payload extraction, file discovery, timelining
Todo List (2)
Ø sdhash-dd
o Block-aware processing, compression
Ø sdbf_d
o Persistence: XML
o Service interface: JSON
o Server clustering
Ø sdbfweb
o Browser-based management/query
Ø sdbfviz
o Large-scale visualization & clustering
Further Development
Ø Integration w/ RDS
o sdhash-set: construct SDBFs from existing SHA1 sets
o Compare/identify whole folders, distributions, etc.
Ø Structural feature selection
o E.g., exe/dll, pdf, zip, ...
Ø Optimizations
o Sampling
o Skipping (under a min-continuous-block assumption)
o Cluster core extraction/comparison
o GPU acceleration
Ø Representation
o Multi-resolution digests
o New crypto hashes
o Data offsets
Thank you!
Ø http://roussev.net/sdhash
o wget http://roussev.net/sdhash/sdhash-1.3.zip
o make
o ./sdhash
Ø References
o V. Roussev, "Data Fingerprinting with Similarity Digests", in K.-P. Chow, S. Shenoi (Eds.): Advances in Digital Forensics VI, IFIP AICT 337, pp. 207-225, 2010
o V. Roussev, "An Evaluation of Forensic Similarity Hashes", DFRWS 2011
Ø Contact: Vassil Roussev, vassil@roussev.net
Ø Q & A