Data Structures for Big Data: Bloom Filter

Transcrição

Data Structures for Big Data: Bloom Filter
Data Structures for Big Data:
Bloom Filter
Vinicius Vielmo Cogo
Smalltalks, DI, FC/UL. October 16, 2014.
is relative
is not defined by a specific number of TB, PB, EB
is when it becomes big for you
is when your solutions become inefficient/impractical
2 / 30
Data Structures for Big Data
Traditional DSs are subject to the same problems
e.g., lists, trees
(e.g., YARN, NoSQL)
or
(e.g., index, metadata)
reached the point of thinking in new DSs for BD
3 / 30
Outline
Bloom Filter
Use Cases
Implementations
Other Filters
Other Data Structures for Big Data
4 / 30
Bloom Filter
Membership testing
Does my collection contain this element?
5 / 30
Bloom Filter
City
Coimbra
Leiria
6 / 30
Bloom Filter
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
0
0
0
0
0
0
0
0
0
0
0
0
http://billmill.org/bloomfilter-tutorial/
0
7 / 30
Bloom Filter
City
Hash Function
Coimbra
Fnv
Leiria
Murmur
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
8 / 30
Bloom Filter
City
Hash Function
Coimbra
Fnv
i=4
Leiria
Murmur
i=7
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
9 / 30
Bloom Filter
City
Hash Function
Coimbra
Fnv
i=4
Leiria
Murmur
i=7
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
10 / 30
Bloom Filter
City
Hash Function
Coimbra
Fnv
Leiria
Murmur
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
11 / 30
Bloom Filter
City
Hash Function
Coimbra
Fnv
i=2
Leiria
Murmur
i=9
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
12 / 30
Bloom Filter
City
Hash Function
Coimbra
Fnv
i=2
Leiria
Murmur
i=9
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
1
0
1
0
0
1
0
1
0
0
0
0
0
13 / 30
Bloom Filter
City
Hash Function
Coimbra
Fnv
Leiria
Murmur
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
1
0
1
0
0
1
0
1
0
0
0
0
0
14 / 30
Bloom Filter
City
Braga
Guarda
Coimbra
Lisboa
15 / 30
Bloom Filter
City
Hash Function
Braga
Fnv
i=10
Guarda
Murmur
i=14
Coimbra
Lisboa
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
1
0
1
0
0
1
0
1
0
Result: false
0
0
0
0
16 / 30
Bloom Filter
City
Hash Function
Braga
Fnv
i=2
Guarda
Murmur
i=12
Coimbra
Lisboa
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
1
0
1
0
0
1
0
1
0
Result: false
0
0
0
0
17 / 30
Bloom Filter
City
Hash Function
Braga
Fnv
i=4
Guarda
Murmur
i=7
Coimbra
Lisboa
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
1
0
1
0
0
1
0
1
0
Result: true
0
0
0
0
18 / 30
Bloom Filter
City
Hash Function
Braga
Fnv
i=7
Guarda
Murmur
i=9
Coimbra
Lisboa
Index i
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14
bf[i]
0
0
1
0
1
0
0
1
0
1
0
Result: true (but it is a false positive)
0
0
0
0
19 / 30
Bloom Filter
DS proposed by Burton Howard Bloom in 1970
Design principles
Space-efficient
Smaller than the original dataset
Time-efficient
Low latency R/W
O(k), which is much smaller than O(n)
High throughput
Probabilistic
E.g., myCollection.mightContain(myObject)
False positives happen (but in a configurable way)
20 / 30
Bloom Filter
Important variables
City
= Expected collection size
Coimbra
Leiria
= False positive rate (e.g., 0.0001% or 1 in 1M)
= Bitmap size
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Hash Function
= Optimal number of hash functions
Fnv
Murmur
21 / 30
Bloom Filter
Important variables
22 / 30
Bloom Filter
Users define two of them (normally n and any other)
The other two are calculated with those equations
Interesting relations:
Bigger collection (
)
Larger bitmap (
Bigger collection (
)
More false positives (
Larger bitmap (
Larger bitmap (
)
Less hash functions (
)
)
Less false positives (
)
Less hash functions (
)
)
23 / 30
Bloom Filter
Bloom filter size vs. False positive rate
24 / 30
Use Cases
Reducing unnecessary disk reads
Client
1?
1
No
2?
2
BloomFilter
Dataset
F
F
T
T
2
3?
3
necessary
read(2)
T
unnecessary
read(3)
F
No
RAM
Hard Disk
25 / 30
Use Cases
Google BigTable, Apache Cassandra and HBase
Reducing disk lookups
Google Chrome
Lookup a list of known malicious URLs
Bitcoin
Get only the transactions relevant to your wallet
Others
In my Ph.D. work
Lookup a list of known privacy-sensitive DNA
sequences
26 / 30
Implementations
-libraries
https://code.google.com/p/guava-libraries/
Orestes-Bloomfilter
https://github.com/Baqend/Orestes-Bloomfilter
java-bloomfilter
https://github.com/magnuss/java-bloomfilter
java-longfastbloomfilter
https://code.google.com/p/java-longfastbloomfilter/
27 / 30
Other Filters
Counting Bloom filters
Allow deletions (use a 4-bit counter instead of 1 bit)
Buffered Bloom filters
Sub-filters in SSD with buffered R/W exploring bit locality
Quotient and Cascade filters
Uses an SSD, instead of the main memory, for scalability
28 / 30
Other DSs (and techniques) for Big Data
Locality-sensitive hashing (LSH)
Hashing similar elements into the same bucket with high probability
HyperLogLog for computing cardinality
Counting the number of distinct elements in a collection
Log Structured Merge (LSM) trees
Indexed access to files with high insert volume and background batch synchronization
29 / 30
Thank you!
Vinicius Vielmo Cogo
Smalltalks, DI, FC/UL. October 16, 2014.