Data Structures for Big Data: Bloom Filter
Transcrição
Data Structures for Big Data: Bloom Filter
Data Structures for Big Data: Bloom Filter Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014. is relative is not defined by a specific number of TB, PB, EB is when it becomes big for you is when your solutions become inefficient/impractical 2 / 30 Data Structures for Big Data Traditional DSs are subject to the same problems e.g., lists, trees (e.g., YARN, NoSQL) or (e.g., index, metadata) reached the point of thinking in new DSs for BD 3 / 30 Outline Bloom Filter Use Cases Implementations Other Filters Other Data Structures for Big Data 4 / 30 Bloom Filter Membership testing Does my collection contain this element? 5 / 30 Bloom Filter City Coimbra Leiria 6 / 30 Bloom Filter Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 http://billmill.org/bloomfilter-tutorial/ 0 7 / 30 Bloom Filter City Hash Function Coimbra Fnv Leiria Murmur Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 / 30 Bloom Filter City Hash Function Coimbra Fnv i=4 Leiria Murmur i=7 Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 / 30 Bloom Filter City Hash Function Coimbra Fnv i=4 Leiria Murmur i=7 Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 10 / 30 Bloom Filter City Hash Function Coimbra Fnv Leiria Murmur Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 11 / 30 Bloom Filter City Hash Function Coimbra Fnv i=2 Leiria Murmur i=9 Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 12 / 30 Bloom Filter City Hash Function Coimbra Fnv i=2 Leiria Murmur i=9 Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 13 / 30 Bloom Filter City Hash Function Coimbra Fnv Leiria Murmur Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 14 / 30 Bloom Filter City Braga Guarda Coimbra Lisboa 15 / 30 Bloom Filter City Hash Function Braga Fnv i=10 Guarda Murmur i=14 Coimbra Lisboa Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 1 0 1 0 0 1 0 1 0 Result: false 0 0 0 0 16 / 30 Bloom Filter City Hash Function Braga Fnv i=2 Guarda Murmur i=12 Coimbra Lisboa Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 1 0 1 0 0 1 0 1 0 Result: false 0 0 0 0 17 / 30 Bloom Filter City Hash Function Braga Fnv i=4 Guarda Murmur i=7 Coimbra Lisboa Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 1 0 1 0 0 1 0 1 0 Result: true 0 0 0 0 18 / 30 Bloom Filter City Hash Function Braga Fnv i=7 Guarda Murmur i=9 Coimbra Lisboa Index i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bf[i] 0 0 1 0 1 0 0 1 0 1 0 Result: true (but it is a false positive) 0 0 0 0 19 / 30 Bloom Filter DS proposed by Burton Howard Bloom in 1970 Design principles Space-efficient Smaller than the original dataset Time-efficient Low latency R/W O(k), which is much smaller than O(n) High throughput Probabilistic E.g., myCollection.mightContain(myObject) False positives happen (but in a configurable way) 20 / 30 Bloom Filter Important variables City = Expected collection size Coimbra Leiria = False positive rate (e.g., 0.0001% or 1 in 1M) = Bitmap size 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Hash Function = Optimal number of hash functions Fnv Murmur 21 / 30 Bloom Filter Important variables 22 / 30 Bloom Filter Users define two of them (normally n and any other) The other two are calculated with those equations Interesting relations: Bigger collection ( ) Larger bitmap ( Bigger collection ( ) More false positives ( Larger bitmap ( Larger bitmap ( ) Less hash functions ( ) ) Less false positives ( ) Less hash functions ( ) ) 23 / 30 Bloom Filter Bloom filter size vs. False positive rate 24 / 30 Use Cases Reducing unnecessary disk reads Client 1? 1 No 2? 2 BloomFilter Dataset F F T T 2 3? 3 necessary read(2) T unnecessary read(3) F No RAM Hard Disk 25 / 30 Use Cases Google BigTable, Apache Cassandra and HBase Reducing disk lookups Google Chrome Lookup a list of known malicious URLs Bitcoin Get only the transactions relevant to your wallet Others In my Ph.D. work Lookup a list of known privacy-sensitive DNA sequences 26 / 30 Implementations -libraries https://code.google.com/p/guava-libraries/ Orestes-Bloomfilter https://github.com/Baqend/Orestes-Bloomfilter java-bloomfilter https://github.com/magnuss/java-bloomfilter java-longfastbloomfilter https://code.google.com/p/java-longfastbloomfilter/ 27 / 30 Other Filters Counting Bloom filters Allow deletions (use a 4-bit counter instead of 1 bit) Buffered Bloom filters Sub-filters in SSD with buffered R/W exploring bit locality Quotient and Cascade filters Uses an SSD, instead of the main memory, for scalability 28 / 30 Other DSs (and techniques) for Big Data Locality-sensitive hashing (LSH) Hashing similar elements into the same bucket with high probability HyperLogLog for computing cardinality Counting the number of distinct elements in a collection Log Structured Merge (LSM) trees Indexed access to files with high insert volume and background batch synchronization 29 / 30 Thank you! Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014.