Locality Sensitive Hashing (1)

Transcrição

Locality Sensitive Hashing (1)
technology
from seed
Locality Sensitive Hashing
José Portêlo
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
1
technology
Outline
from seed
• Nearest Neighbor
–
–
–
–
Motivation
Nearest Neighbor Search
c-Approximate Near Neighbor Search
Solutions
• Locality Sensitive Hashing
–
–
–
–
Basic idea
Definition
Algorithm
Applications
• References
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
2
technology
Nearest Neighbor (1)
from seed
• Motivation
–
–
–
–
Learning: nearest neighbour rule
Database retrieval
Vector quantization (compression)
etc.
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
3
technology
from seed
Nearest Neighbor (2)
• Nearest Neighbor Search
– Problem
• given a set P of points in a metric space M of dimension d and a query
point q ∈ M, find the point p ∈ P closest to q
– Metric space
• usually the Euclidean space
– Distance function
• Euclidean
• Hamming
q
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
4
technology
from seed
Nearest Neighbor (3)
• c-Approximate Near Neighbor Search
– Problem:
• given a set P of points in a metric space M of dimension d and a query
point q ∈ M, find the points p' ∈ P such that:
D(q,p') ≤ c * D(q,p), c > 1
where d(q,p) is the distance of q to its closest point in P
– Metric space
• usually the Euclidean space
– Distance function
• Euclidean
• Hamming
D(q,p)
q
c*D(q,p)
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
5
technology
Nearest Neighbour (4)
from seed
• Solutions
– Linear search
• Compute the distance from query q to every other point in P
• Naive approach, search time is O(n)
– Trees (space partitioning)
• Iteratively partition the search space into two regions, each containing
half of the points of the parent region
• Queries performed via transversal of the tree, from root to leafs
• d=1: search time is O(log n), d >> 1: search time grows to O(n)
– Hashes
• Use a data structure to quickly map points into values into a table
• Points that are close together are mapped into different values
• If O(n) memory is used, the search time is O(1)
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
6
technology
Locality Sensitive Hashing (1)
from seed
• Basic idea
– If two points are close together, they will remain close together for
most projections; if two points are far apart, they will remain far
apart for most projections
image adapted from “Locality-Sensitive Hashing for Finding Nearest Neighbors”,
Malcolm Slaney, Michael Casey, 2008
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
7
technology
Locality Sensitive Hashing (2)
from seed
• Definition
– Method for performing probabilistic dimension reduction of highdimensional data
– Hashing functions are the scalar projections (dot products)
– Family of hashes H is (R,cR,P1,P2)-sensitive for any p,q ∈ P:
• B(q,r) = {p: d(q,p) ≤ R}, ball of radius R
• If p ∈ B(q,R) then Pr[h(q) = h(p)] ≥ P1
• If p ∉ B(q,cR) then Pr[h(q) = h(p)] ≤ P2
• H is useful if P1 > P2 and R < cR
– Example: Hamming distance
• LSH functions: h(p) = pi, i.e., the i-th bit of p
• Probabilities: Pr[h(q) = h(p)] = 1 – D(q,p)/d
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
8
technology
Locality Sensitive Hashing (3)
from seed
• Algorithm
– Consider functions of the form g(p) = <h1(p),h2(p),...,hk(p)>
– Preprocessing
• Choose g1,...,gL
• For all p ∈ P, hash p to buckets g1(p),...,gL(p)
– Query
• Retrieve the points from buckets g1(p),...,gL(p)
• Answer the query based on the retrieved points
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
9
technology
from seed
Locality Sensitive Hashing (4)
• Algorithm (cont)
Reference
Distance
function
Space
Time
Comments
[PR99],
[APR99]
Hamming
dn + n1+1/c
dn1/c
-
[MNPV04]
Euclidean
dn + n1+ρ'(c)
dnρ'(c)
ρ'(c) < 1/c
[R05]
Euclidean
dn + n
dnρ''(c)
ρ''(c)/c → 2.09
[AP06]
Euclidean
dn + n1+1/c^2+o(1)
dn1/c^2+o(1)
-
[AP06]
Euclidean
dn + n*log n
dnO(1/c^2)
-
n: number of points in P
d: dimension of P
c: approximation factor
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
10
technology
Locality Sensitive Hashing (5)
from seed
• Applications
– Audio related
• musical similarity measurement [MM06]
• similar acoustic events search [CD09]
• speaker search (work in progress by Woojay Jeon)
– Non-audio related
• content-based video identification [ZWQ04]
• image search [BK09]
• biological network analysis [L08]
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
11
technology
References (1)
from seed
• Algorithms
– [PR99] “Approximate Nearest Neighbors: Towards Removing The
Curse Of Dimensionality”; Piotr Indyk, Rajeev Motwani; 1999
– [APR99] “Similarity Search In High Dimensions Via Hashing”;
Aristides Gionis, Piotr Indyk, Rajeev Motwani; 1999
– [MNPV04] “Locality-Sensitive Hashing Scheme Based On p-Stable
Distributions”; Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab
Mirrokni; 2004
– [R05] “Entropy Based Nearest Neighbor Search In High
Dimensions”; Rina Panigrahy; 2005
– [AP06] “Near-Optimal Hashing Algorithms For Approximate
Nearest Neighbor In High Dimensions”; Alexandr Andoni, Piotr
Indyk; 2006
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
12
technology
References (2)
from seed
• Applications – audio related
– [MM06] "Audio Shingling For Measuring Musical Similarity";
Michael Casey, Malcolm Slaney; 2006
– [CD09] "Finding Similar Acoustic Events Using Matching Pursuit
And Locality-Sensitive Hashing"; Courtenay Cotton, Daniel P. W.
Ellis; 2009
– "Efficient Speaker Search Over Large Populations Using
Kernelized Locality-Sensitive Hashing"; Woojay Jeon et al.;
working paper
– "Large Population Speaker Search Using Kernelized LocalitySensitive Hashing"; Woojay Jeon et al.; working paper
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
13
technology
References (3)
from seed
• Applications – non-audio related
– [ZWQ04] "Hierarchical, Non-Uniform Locality Sensitive Hashing
And Its Application To Video Identification"; Zixiang Yang, Wei
Tsang Ooi, Qibin Sun; 2004
– [BK09] "Kernelized Locality-Sensitive Hashing For Scalable Image
Search"; Brian Kulis, Kristen Grauman; 2009
– [L08] "Locality-Sensitive Hashing And Biological Network
Alignment"; Laura LeGault; 2008
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
14
technology
from seed
technology
from seed
L2 F - Spoken Language Systems Laboratory
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2 F - Spoken Language Systems Laboratory
15

Documentos relacionados

Caravela: A Distributed Stream-based Computing

Caravela: A Distributed Stream-based Computing methods toward a uniform programming interface for GPU-based applications”, ACM International conference of Computing Frontier, May 2007 4. Shinichi Yamagiwa, Leonel Sousa, Tomas Brandao, "Meta-Pip...

Leia mais