Locality Sensitive Hashing (1)
Transcrição
Locality Sensitive Hashing (1)
technology from seed Locality Sensitive Hashing José Portêlo Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 1 technology Outline from seed • Nearest Neighbor – – – – Motivation Nearest Neighbor Search c-Approximate Near Neighbor Search Solutions • Locality Sensitive Hashing – – – – Basic idea Definition Algorithm Applications • References Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 2 technology Nearest Neighbor (1) from seed • Motivation – – – – Learning: nearest neighbour rule Database retrieval Vector quantization (compression) etc. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 3 technology from seed Nearest Neighbor (2) • Nearest Neighbor Search – Problem • given a set P of points in a metric space M of dimension d and a query point q ∈ M, find the point p ∈ P closest to q – Metric space • usually the Euclidean space – Distance function • Euclidean • Hamming q Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 4 technology from seed Nearest Neighbor (3) • c-Approximate Near Neighbor Search – Problem: • given a set P of points in a metric space M of dimension d and a query point q ∈ M, find the points p' ∈ P such that: D(q,p') ≤ c * D(q,p), c > 1 where d(q,p) is the distance of q to its closest point in P – Metric space • usually the Euclidean space – Distance function • Euclidean • Hamming D(q,p) q c*D(q,p) Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 5 technology Nearest Neighbour (4) from seed • Solutions – Linear search • Compute the distance from query q to every other point in P • Naive approach, search time is O(n) – Trees (space partitioning) • Iteratively partition the search space into two regions, each containing half of the points of the parent region • Queries performed via transversal of the tree, from root to leafs • d=1: search time is O(log n), d >> 1: search time grows to O(n) – Hashes • Use a data structure to quickly map points into values into a table • Points that are close together are mapped into different values • If O(n) memory is used, the search time is O(1) Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 6 technology Locality Sensitive Hashing (1) from seed • Basic idea – If two points are close together, they will remain close together for most projections; if two points are far apart, they will remain far apart for most projections image adapted from “Locality-Sensitive Hashing for Finding Nearest Neighbors”, Malcolm Slaney, Michael Casey, 2008 Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 7 technology Locality Sensitive Hashing (2) from seed • Definition – Method for performing probabilistic dimension reduction of highdimensional data – Hashing functions are the scalar projections (dot products) – Family of hashes H is (R,cR,P1,P2)-sensitive for any p,q ∈ P: • B(q,r) = {p: d(q,p) ≤ R}, ball of radius R • If p ∈ B(q,R) then Pr[h(q) = h(p)] ≥ P1 • If p ∉ B(q,cR) then Pr[h(q) = h(p)] ≤ P2 • H is useful if P1 > P2 and R < cR – Example: Hamming distance • LSH functions: h(p) = pi, i.e., the i-th bit of p • Probabilities: Pr[h(q) = h(p)] = 1 – D(q,p)/d Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 8 technology Locality Sensitive Hashing (3) from seed • Algorithm – Consider functions of the form g(p) = <h1(p),h2(p),...,hk(p)> – Preprocessing • Choose g1,...,gL • For all p ∈ P, hash p to buckets g1(p),...,gL(p) – Query • Retrieve the points from buckets g1(p),...,gL(p) • Answer the query based on the retrieved points Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 9 technology from seed Locality Sensitive Hashing (4) • Algorithm (cont) Reference Distance function Space Time Comments [PR99], [APR99] Hamming dn + n1+1/c dn1/c - [MNPV04] Euclidean dn + n1+ρ'(c) dnρ'(c) ρ'(c) < 1/c [R05] Euclidean dn + n dnρ''(c) ρ''(c)/c → 2.09 [AP06] Euclidean dn + n1+1/c^2+o(1) dn1/c^2+o(1) - [AP06] Euclidean dn + n*log n dnO(1/c^2) - n: number of points in P d: dimension of P c: approximation factor Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 10 technology Locality Sensitive Hashing (5) from seed • Applications – Audio related • musical similarity measurement [MM06] • similar acoustic events search [CD09] • speaker search (work in progress by Woojay Jeon) – Non-audio related • content-based video identification [ZWQ04] • image search [BK09] • biological network analysis [L08] Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 11 technology References (1) from seed • Algorithms – [PR99] “Approximate Nearest Neighbors: Towards Removing The Curse Of Dimensionality”; Piotr Indyk, Rajeev Motwani; 1999 – [APR99] “Similarity Search In High Dimensions Via Hashing”; Aristides Gionis, Piotr Indyk, Rajeev Motwani; 1999 – [MNPV04] “Locality-Sensitive Hashing Scheme Based On p-Stable Distributions”; Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab Mirrokni; 2004 – [R05] “Entropy Based Nearest Neighbor Search In High Dimensions”; Rina Panigrahy; 2005 – [AP06] “Near-Optimal Hashing Algorithms For Approximate Nearest Neighbor In High Dimensions”; Alexandr Andoni, Piotr Indyk; 2006 Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 12 technology References (2) from seed • Applications – audio related – [MM06] "Audio Shingling For Measuring Musical Similarity"; Michael Casey, Malcolm Slaney; 2006 – [CD09] "Finding Similar Acoustic Events Using Matching Pursuit And Locality-Sensitive Hashing"; Courtenay Cotton, Daniel P. W. Ellis; 2009 – "Efficient Speaker Search Over Large Populations Using Kernelized Locality-Sensitive Hashing"; Woojay Jeon et al.; working paper – "Large Population Speaker Search Using Kernelized LocalitySensitive Hashing"; Woojay Jeon et al.; working paper Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 13 technology References (3) from seed • Applications – non-audio related – [ZWQ04] "Hierarchical, Non-Uniform Locality Sensitive Hashing And Its Application To Video Identification"; Zixiang Yang, Wei Tsang Ooi, Qibin Sun; 2004 – [BK09] "Kernelized Locality-Sensitive Hashing For Scalable Image Search"; Brian Kulis, Kristen Grauman; 2009 – [L08] "Locality-Sensitive Hashing And Biological Network Alignment"; Laura LeGault; 2008 Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 14 technology from seed technology from seed L2 F - Spoken Language Systems Laboratory Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa L2 F - Spoken Language Systems Laboratory 15
Documentos relacionados
Caravela: A Distributed Stream-based Computing
methods toward a uniform programming interface for GPU-based applications”, ACM International conference of Computing Frontier, May 2007 4. Shinichi Yamagiwa, Leonel Sousa, Tomas Brandao, "Meta-Pip...
Leia mais