Machine Learning: Clustering
Transcrição
Machine Learning: Clustering
Clustering k-Means Agglomerative Clustering Use Case Summary Machine Learning: Clustering Steffen Rendle Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Wintersemester 2007 / 2008 Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Clustering Overview Examples Clustering Tasks k-Means Overview Algorithm Agglomerative Clustering Overview Algorithm Use Case Task Method Summary Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Overview The objective of clustering is to group similar data D. I the groups are called clusters I clustering is unsupervised, i.e. neither training data nor classes are given in advance I grouping/ clustering depends on the algorithm Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Clustering of Search Results Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Clustering of Search Results Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Clustering of Search Results Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Wafer Analysis Fertigungsprozess 1 Fertigungsprozess 2 Fehlerursache 1 Fehlerursache 2 ... Test 1..n Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Wafer Analysis Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Wafer Analysis Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Object Identification DB Shop T-Online Amazon Cyberport Mediamarkt Mediamarkt Amazon Steffen Rendle Product name Fuji FinePix S5600 FujiFilm FinePix S5600 Digitalkamera (5 Megapixel, 10fach Zoom) Fuji FinePix S5600 Fine Pix S 5600 Fine Pix S 9500 Fuji FinePix S5500 Digitalkamera (4 Megapixel, 10x opt. Zoom) Price 279,00 254,90 259,90 245,00 515,00 349,99 Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Clustering Tasks I Hard-Clustering: find a partition of the data I Soft-Clustering/ Fuzzy-Clustering: find propabilities of group membership for each item I Hierarchical Clustering: find a dendrogram (tree) of the data Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Hard-Clustering Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Hard-Clustering Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Soft-Clustering Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Soft-Clustering Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Hierarchical-Clustering A B D C H I K J E G F A Steffen Rendle B C D E F G H I J K Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Hierarchical-Clustering A B D C H I K J E G F A Steffen Rendle B C D E F G H I J K Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Hierarchical-Clustering A B D C H I K J E G F A Steffen Rendle B C D E F G H I J K Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Hierarchical-Clustering A B D C H I K J E G F A Steffen Rendle B C D E F G H I J K Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Hierarchical-Clustering A B D C H I K J E G F A Steffen Rendle B C D E F G H I J K Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary k-Means I I partitional clustering given I I I I to find I Steffen Rendle Data D = {d1 , ..., dn } ∈ P(Rm ) with di = (xi,1 , . . . , xi,m ) ∈ Rm Number of clusters k Similarity sim : Rm × Rm → R+ Partition of the data f : D → {1, . . . , k} Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary k-Means Algorithm function k-Means(D, k, sim) for all j ∈ {1, . . . , k} do yj ← random D end for repeat f0 ←f f (d) ← argmax sim(yj , d) j∈{1,...,k} for all j ∈ {1, . . . , k} do yj ← avg d d∈{d|f (d)=j} end for until f’ = f return f end function Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means I Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means I Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means I Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means I Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means I Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means II Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means II Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means II Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means II Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means II Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problems of k-Means I I k-Means is run several times and the best“ result is returned. ” for determing the best“ partition heuristic measures like intra ” cluster variance can be used: 2 k X X ICV(f , D) = avg d 0 d − d 0 ∈{d 0 |f (d 0 )=j} j=1 d∈{d|f (d)=j} Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Properties of k-Means I easy to implement I in practice often fast data must be present in a metric space (e.g. euclidian space: Rn with k·k) so that centroids can be calculated. I I Steffen Rendle Counter example: strings Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Agglomerative Clustering Agglomerative Clustering can solve several tasks: I partitional clustering with given number of clusters k or similarity threshold θ I hierarchical clustering Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Greedy Agglomerative Clustering I I partitional clustering given I I I I to find I Steffen Rendle Data D = {d1 , ..., dn } Similarity sim : D × D → R+ Number of clusters k or threshold θ on similarities Partition of the data f : D → N Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Hierarchical Agglomerative Clustering I I hierarchical clustering given I I I to find I Steffen Rendle Data D = {d1 , ..., dn } Similarity sim : D × D → R+ Series fi of partitions of the data fi : D → N with img fi ⊂ img fi+1 Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Agglomerative Clustering Algorithm function AgglomerativeClustering(D, sim) m←0 for all i ∈ {1, . . . , n} do fm (di ) ← i end for repeat (i, j) = argmax sim? (fm , i, j) i,j∈img(fm ),i6=j fm+1 ← fm for all d ∈ {d 0 |fm (d 0 ) = j} do fm+1 (d) ← i end for m ←m+1 until convergence(fm ) return fm end function Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Convergence convergence(f ) depends on the task: I if k given: convergence(f ) ⇔ | img(f )| ≤ k I if θ given: convergence(f ) ⇔ max i,j∈img(f ),i6=j I Steffen Rendle simX (f , i, j) ≤ θ in case of hierarchical clustering: convergence(f ) ⇔ | img(f )| = 1 Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Similarity between Clusters A 0.9 0.63 0.6 B 0.7 0.82 0.2 C Steffen Rendle A D E B ? D E C Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Similarity between Clusters Several possibilities for similarity sim? (f , i, j) between clusters I I I single linkage: simSL (f , i, j) = complete linkage: simCL (f , i, j) = average linkage: simAL (f , i, j) = max sim(d, d 0 ) min sim(d, d 0 ) avg sim(d, d 0 ) (d,d 0 )∈f −1 (i)×f −1 (j) (d,d 0 )∈f −1 (i)×f −1 (j) (d,d 0 )∈f −1 (i)×f −1 (j) Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Single Linkage A 0.9 0.63 0.6 B 0.7 0.82 0.2 C Steffen Rendle A D E B 0.9 D E C Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Complete Linkage A 0.9 0.63 0.6 B 0.7 0.82 0.2 C Steffen Rendle A D E B 0.2 D E C Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Average Linkage A 0.9 0.63 0.6 B 0.7 0.82 0.2 C Steffen Rendle A D E B 0.64 D E C Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage A B D C H I K J E G F Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage A B C D E F G H I J K A B D C H A B C D E F G H I J K .8 .8 .5 .2 .1 .1 .2 .2 .0 .0 A .9 .9 .2 .1 .2 .2 .2 .1 .1 B .8 .3 .2 .3 .2 .2 .1 .1 C .2 .1 .2 .3 .3 .2 .2 D .9 .9 .1 .2 .2 .1 E .8 .0 .1 .1 .0 F .2 .3 .3 .3 G .9 .8 .8 H .9 .9 I .9 J K I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage A BC D E F G H I J K A B D C H A BC D E F G H I J K .8 .5 .2 .1 .1 .2 .2 .0 .0 A .85 .25 .15 .25 .20 .20 .10 .10 BC .2 .1 .2 .3 .3 .2 .2 D .9 .9 .1 .2 .2 .1 E .8 .0 .1 .1 .0 F .2 .3 .3 .3 G .9 .8 .8 H .9 .9 I .9 J K I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage A BC D E F G H I JK A B D C A BC D E F G H I JK .8 .5 .2 .1 .1 .2 .2 .0 A .85 .25 .15 .25 .20 .20 .10 BC .2 .1 .2 .3 .3 .2 D .9 .9 .1 .2 .15 E .8 .0 .1 .05 F .2 .3 .3 G .9 .8 H .9 I JK H I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage A BC D E F G HI JK A B A BC D E F G HI JK .8 .5 .2 .1 .1 .2 .0 A .85 .25 .15 .25 .20 .10 BC .2 .1 .2 .3 .2 D .9 .9 .15 .15 E .8 .05 .05 F .25 .3 G .85 HI JK D C H I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage A BC D EF G HI JK A A BC D EF G HI JK .8 .5 .15 .1 .2 .0 A .85 .20 .25 .20 .10 BC .15 .2 .3 .2 D .85 .10 .10 EF .25 .3 G .85 HI JK B D C H I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage A BCD EF G HI JK A BCD EF G HI JK .7 .15 .1 .2 .0 A .18 .23 .23 .13 BCD .85 .10 .10 EF .25 .3 G .85 HI JK A B D C H I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage A BCD EFG HI JK A BCD EFG HI JK .7 .13 .2 .0 A .20 .23 .13 BCD .15 .16 EFG .85 HI JK A B D C H I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage A BCD EFG HIJK A BCD EFG HIJK .7 .13 .1 A .20 .18 BCD .16 EFG HIJK A B D C H I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage ABCD EFG HIJK ABCD EFG HIJK .18 .16 ABCD .16 EFG HIJK A B D C H I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage ABCDEFG HIJK ABCDEFG HIJK .16 ABCDEFG HIJK A B D C H I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Example: Agglomerative Clustering with Average Linkage ABCDEFGHIJK ABCDEFGHIJK ABCDEFGHIJK A B D C H I K J E G F A B C D E F G H I J K Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Properties of Agglomerative Clustering I several tasks can be solved: partitional clustering with number of clusters or threshold and hierarchical clustering I no metric space is necessary I runtime complexity O(n2 log(n)) Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Use Case: Object Identification I Object Identification (OI) finds identical items for information integration. I OI tasks are semi-supervised. I OI models use both clustering and classification techniques. Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering Steffen Rendle k-Means Agglomerative Clustering Use Case Summary Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary DB Shop T-Online Amazon Cyberport Mediamarkt Mediamarkt Amazon Steffen Rendle Product name Fuji FinePix S5600 FujiFilm FinePix S5600 Digitalkamera (5 Megapixel, 10fach Zoom) Fuji FinePix S5600 Fine Pix S 5600 Fine Pix S 9500 Fuji FinePix S5500 Digitalkamera (4 Megapixel, 10x opt. Zoom) Price 279,00 254,90 259,90 245,00 515,00 349,99 Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Object Identification Problem Problem Solution D A B G Steffen Rendle H I C B E F D A C E F G H I Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Adaptive Setting Problem D A G Training Set J L M N Steffen Rendle L2 H I Q P L1 K E F I H C B E F D A C B G Solution L3 R O Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Types of Labels Often some parts of the data provide information about identities: I Some offers are labeled by a unique identifier – e.g. an EAN, UPC, ISBN. I New offers should be merged into an already integrated database – e.g. new products, new shops should be integrated. I Some offers are known to be identical / different – e.g. provided by a supervisor. I N databases should be merged and each database contains no duplicates. Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Iterative Problem Citer A Consistent Solution Iterative Problem D A L1 C B C B E F I L3 L2 E F I H G D A H G Unknown class label Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Constrained Problem Cconstr A Consistent Solution Constrained Problem D A D A C B C B E F I F I H G E H G Must-Link Constraint Cannot-Link Constraint Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Problem Classes Problem classes are defined by their preconditions, that restrict the space E ⊆ X 2 of consistent solutions: I Iterative Problems Citer given: EY with Y ⊆ X E = {E |EY = E ∩ Y 2 } I Constrained Problems Cconstr given: Rml ⊆ X 2 , Rcl ⊆ X 2 E = {E |E ⊇ Eml ∧ E ∩ Rcl = ∅} I Matching Problems Cmatch S given: X = Ai with A = (A1 , . . . , An ) S E = {E |E ∩ (X 2 \ ( A2i \ {x, x|x ∈ Ai })) = ∅} Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Hierarchy of Problem Classes One can show: Cclassic ⊂ Citer ⊂ Cconstr Cclassic ⊂ Cmatch ⊂ Cconstr Citer 6⊆ Cmatch Cmatch 6⊆ Citer Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary There are constrained problems that cannot be expressed as an iterative problem: Iterative Problem A L1 Constrained Problem B A H G L2 H B G Iterative Problem A H Must-Link Constraint B Steffen Rendle G L1 Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Generic Object Identification Model I Feature Extraction ffeature : X 2 → Rn I Probabilistic pairwise decision model fpairwise : X 2 → [0, 1] I Collective decision model fglobal : P(X ) × P(X 2 ) × P(X 2 ) → E Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering Data Object x1 x2 x3 k-Means Brand Hewlett Packard HP Canon Agglomerative Clustering Use Case Product Name Photosmart 435 Digital Camera HP Photosmart 435 16MB memory Canon EOS 300D black 18-55 Camera Feature Extraction Object Pair TFIDF-Cosine Similarity (Product Name) (x1 , x2 ) 0.6 (x1 , x3 ) 0.1 (x2 , x3 ) 0.0 FirstNumberEqual (Product Name) 1 0 0 Summary Price 118.99 110.00 786.00 Rel. Difference (Price) 0.076 0.849 0.860 Probabilistic Pairwise Decision Model Object Pair P[xi ≡ xj ] (x1 , x2 ) 0.8 (x1 , x3 ) 0.2 (x2 , x3 ) 0.1 Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Learning and Constraints Information provided by constraints can be used for training an identification model: I Probabilistic pairwise decision model: trained classifier (e.g. SVM) I Collective decision model: constrained clustering algorithm (e.g. constrained HAC) using the pairwise decision model as a learned similarity measure. Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Constrained Agglomerative Clustering Algorithm function ConstrainedAgglClustering(X , Rml , Rcl , sim) m←0 for all i ∈ {1, . . . , n} do fm (xi ) ← i end for fm ← ApplyMustLink(f , Rml ) repeat (i, j) = argmax sim? (fm , i, j) i,j∈img(fm ),i6=j,not HasCannotLink(fm ,i,j,Rcl ) fm+1 ← fm for all x ∈ {y |fm (y ) = j do fm+1 (x) ← i end for m ←m+1 until convergence(fm ) return fm end function Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Constrained Agglomerative Clustering Algorithm function ApplyMustLink(f , Rml ) for all (x, y ) ∈ Rml do for all x 0 : f (x 0 ) = f (x) do f (x 0 ) ← f (y ) end for end for return f end function function HasCannotLink(f , i, j, Rcl ) return ∃x ∈ f −1 (i), y ∈ f −1 (j) : (x, y ) ∈ Rcl end function Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Summary I Clustering groups data I Groups depend on the similarity and the clustering method I Clustering is an unsupervised task I Semi-supervised clustering can use labels (e.g. on relations) to learn the similarity measure and to enhance clustering. Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Outlook I Fuzzy / Soft clustering, e.g. Fuzzy C-Means I I I I I similarity matrix Sij := sim(di , dj ) use spectral methods on Sij – e.g. eigenvectors – to compute clusters Constrained / Semi-supervised clustering I I Steffen Rendle cluster membership is a probability distribution Spectral clustering constraints on objects, pairs, etc. are present example: object identification Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim Clustering k-Means Agglomerative Clustering Use Case Summary Literature A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264–323, 1999. S. Rendle and L. Schmidt-Thieme. Object identification with constraints. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006), Hong Kong, 2006. Steffen Rendle Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim