Machine Learning: Clustering

Transcrição

Clustering
k-Means
Agglomerative Clustering
Use Case
Summary
Machine Learning: Clustering
Steffen Rendle
Information Systems and Machine Learning Lab (ISMLL)
University of Hildesheim
Wintersemester 2007 / 2008
Steffen Rendle
Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim
Clustering
k-Means
Use Case
Summary
Clustering
Overview
Examples
Clustering Tasks
k-Means
Overview
Algorithm
Overview
Algorithm
Use Case
Task
Method
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Overview
The objective of clustering is to group similar data D.
I
the groups are called clusters
I
clustering is unsupervised, i.e. neither training data nor classes
are given in advance
I
grouping/ clustering depends on the algorithm
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example: Clustering of Search Results
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example: Wafer Analysis
Fertigungsprozess 1
Fertigungsprozess 2
Fehlerursache 1
Fehlerursache 2
...
Test 1..n
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example: Object Identification
DB
Shop
T-Online
Amazon
Cyberport
Mediamarkt
Mediamarkt
Amazon
Steffen Rendle
Product name
Fuji FinePix S5600
FujiFilm FinePix S5600 Digitalkamera (5 Megapixel, 10fach Zoom)
Fuji FinePix S5600
Fine Pix S 5600
Fine Pix S 9500
Fuji FinePix S5500 Digitalkamera (4 Megapixel, 10x opt. Zoom)
Price
279,00
254,90
259,90
245,00
515,00
349,99
Clustering
k-Means
Use Case
Summary
Clustering Tasks
I
Hard-Clustering: find a partition of the data
I
Soft-Clustering/ Fuzzy-Clustering: find propabilities of group
membership for each item
I
Hierarchical Clustering: find a dendrogram (tree) of the data
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Hard-Clustering
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Hard-Clustering
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Soft-Clustering
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Soft-Clustering
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Hierarchical-Clustering
A
B
D
C
H
I
K
J
E
G
F
A
Steffen Rendle
B
C
D
E
F
G
H
I
J
K
Clustering
k-Means
Use Case
Summary
A
B
D
C
H
I
K
J
E
G
F
A
Steffen Rendle
B
C
D
E
F
G
H
I
J
K
Clustering
k-Means
Use Case
Summary
A
B
D
C
H
I
K
J
E
G
F
A
Steffen Rendle
B
C
D
E
F
G
H
I
J
K
Clustering
k-Means
Use Case
Summary
A
B
D
C
H
I
K
J
E
G
F
A
Steffen Rendle
B
C
D
E
F
G
H
I
J
K
Clustering
k-Means
Use Case
Summary
A
B
D
C
H
I
K
J
E
G
F
A
Steffen Rendle
B
C
D
E
F
G
H
I
J
K
Clustering
k-Means
Use Case
Summary
k-Means
I
I
partitional clustering
given
I
I
I
I
to find
I
Steffen Rendle
Data D = {d1 , ..., dn } ∈ P(Rm ) with di = (xi,1 , . . . , xi,m ) ∈ Rm
Number of clusters k
Similarity sim : Rm × Rm → R+
Partition of the data f : D → {1, . . . , k}
Clustering
k-Means
Use Case
Summary
k-Means Algorithm
function k-Means(D, k, sim)
for all j ∈ {1, . . . , k} do
yj ← random
D
end for
repeat
f0 ←f
f (d) ← argmax sim(yj , d)
j∈{1,...,k}
for all j ∈ {1, . . . , k} do
yj ←
avg
d
d∈{d|f (d)=j}
end for
until f’ = f
return f
end function
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Example
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Problems of k-Means I
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Problems of k-Means II
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Problems of k-Means
I
I
k-Means is run several times and the best“ result is returned.
”
for determing the best“ partition heuristic measures like intra
”
cluster variance can be used:
2
k
X
X
ICV(f , D) =
avg
d 0
d −
d 0 ∈{d 0 |f (d 0 )=j} j=1 d∈{d|f (d)=j}
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Properties of k-Means
I
easy to implement
I
in practice often fast
data must be present in a metric space (e.g. euclidian space:
Rn with k·k) so that centroids can be calculated.
I
I
Steffen Rendle
Counter example: strings
Clustering
k-Means
Use Case
Summary
Agglomerative Clustering can solve several tasks:
I
partitional clustering with given number of clusters k or
similarity threshold θ
I
hierarchical clustering
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Greedy Agglomerative Clustering
I
I
partitional clustering
given
I
I
I
I
to find
I
Steffen Rendle
Data D = {d1 , ..., dn }
Similarity sim : D × D → R+
Number of clusters k or threshold θ on similarities
Partition of the data f : D → N
Clustering
k-Means
Use Case
Summary
Hierarchical Agglomerative Clustering
I
I
hierarchical clustering
given
I
I
I
to find
I
Steffen Rendle
Data D = {d1 , ..., dn }
Similarity sim : D × D → R+
Series fi of partitions of the data fi : D → N with
img fi ⊂ img fi+1
Clustering
k-Means
Use Case
Summary
Agglomerative Clustering Algorithm
function AgglomerativeClustering(D, sim)
m←0
for all i ∈ {1, . . . , n} do
fm (di ) ← i
end for
repeat
(i, j) = argmax sim? (fm , i, j)
i,j∈img(fm ),i6=j
fm+1 ← fm
for all d ∈ {d 0 |fm (d 0 ) = j} do
fm+1 (d) ← i
end for
m ←m+1
until convergence(fm )
return fm
end function
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Convergence
convergence(f ) depends on the task:
I
if k given:
convergence(f ) ⇔ | img(f )| ≤ k
I
if θ given:
convergence(f ) ⇔
max
i,j∈img(f ),i6=j
I
Steffen Rendle
simX (f , i, j) ≤ θ
in case of hierarchical clustering:
convergence(f ) ⇔ | img(f )| = 1
Clustering
k-Means
Use Case
Summary
Similarity between Clusters
A
0.9
0.63
0.6
B 0.7
0.82
0.2
C
Steffen Rendle
A
D
E
B
?
D
E
C
Clustering
k-Means
Use Case
Summary
Similarity between Clusters
Several possibilities for similarity sim? (f , i, j) between clusters
I
I
I
single linkage:
simSL (f , i, j) =
complete linkage:
simCL (f , i, j) =
average linkage:
simAL (f , i, j) =
max
sim(d, d 0 )
min
sim(d, d 0 )
avg
sim(d, d 0 )
(d,d 0 )∈f −1 (i)×f −1 (j)
(d,d 0 )∈f −1 (i)×f −1 (j)
(d,d 0 )∈f −1 (i)×f −1 (j)
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Single Linkage
A
0.9
0.63
0.6
B 0.7
0.82
0.2
C
Steffen Rendle
A
D
E
B
0.9
D
E
C
Clustering
k-Means
Use Case
Summary
Complete Linkage
A
0.9
0.63
0.6
B 0.7
0.82
0.2
C
Steffen Rendle
A
D
E
B
0.2
D
E
C
Clustering
k-Means
Use Case
Summary
Average Linkage
A
0.9
0.63
0.6
B 0.7
0.82
0.2
C
Steffen Rendle
A
D
E
B
0.64
D
E
C
Clustering
k-Means
Use Case
Summary
Example: Agglomerative Clustering with Average Linkage
A
B
D
C
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
A
B
C
D
E
F
G
H
I
J
K
A
B
D
C
H
A
B
C
D
E
F
G
H
I
J
K
.8
.8
.5
.2
.1
.1
.2
.2
.0
.0
A
.9
.9
.2
.1
.2
.2
.2
.1
.1
B
.8
.3
.2
.3
.2
.2
.1
.1
C
.2
.1
.2
.3
.3
.2
.2
D
.9
.9
.1
.2
.2
.1
E
.8
.0
.1
.1
.0
F
.2
.3
.3
.3
G
.9
.8
.8
H
.9
.9
I
.9
J
K
I
K
J
E
G
F
A B C D E F G H I J K
Steffen Rendle
Clustering
k-Means
Use Case
Summary
A
BC
D
E
F
G
H
I
J
K
A
B
D
C
H
A
BC
D
E
F
G
H
I
J
K
.8
.5
.2
.1
.1
.2
.2
.0
.0
A
.85
.25
.15
.25
.20
.20
.10
.10
BC
.2
.1
.2
.3
.3
.2
.2
D
.9
.9
.1
.2
.2
.1
E
.8
.0
.1
.1
.0
F
.2
.3
.3
.3
G
.9
.8
.8
H
.9
.9
I
.9
J
K
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
A
BC
D
E
F
G
H
I
JK
A
B
D
C
A
BC
D
E
F
G
H
I
JK
.8
.5
.2
.1
.1
.2
.2
.0
A
.85
.25
.15
.25
.20
.20
.10
BC
.2
.1
.2
.3
.3
.2
D
.9
.9
.1
.2
.15
E
.8
.0
.1
.05
F
.2
.3
.3
G
.9
.8
H
.9
I
JK
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
A
BC
D
E
F
G
HI
JK
A
B
A
BC
D
E
F
G
HI
JK
.8
.5
.2
.1
.1
.2
.0
A
.85
.25
.15
.25
.20
.10
BC
.2
.1
.2
.3
.2
D
.9
.9
.15
.15
E
.8
.05
.05
F
.25
.3
G
.85
HI
JK
D
C
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
A
BC
D
EF
G
HI
JK
A
A
BC
D
EF
G
HI
JK
.8
.5
.15
.1
.2
.0
A
.85
.20
.25
.20
.10
BC
.15
.2
.3
.2
D
.85
.10
.10
EF
.25
.3
G
.85
HI
JK
B
D
C
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
A
BCD
EF
G
HI
JK
A
BCD
EF
G
HI
JK
.7
.15
.1
.2
.0
A
.18
.23
.23
.13
BCD
.85
.10
.10
EF
.25
.3
G
.85
HI
JK
A
B
D
C
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
A
BCD
EFG
HI
JK
A
BCD
EFG
HI
JK
.7
.13
.2
.0
A
.20
.23
.13
BCD
.15
.16
EFG
.85
HI
JK
A
B
D
C
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
A
BCD
EFG
HIJK
A
BCD
EFG
HIJK
.7
.13
.1
A
.20
.18
BCD
.16
EFG
HIJK
A
B
D
C
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
ABCD
EFG
HIJK
ABCD
EFG
HIJK
.18
.16
ABCD
.16
EFG
HIJK
A
B
D
C
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
ABCDEFG
HIJK
ABCDEFG
HIJK
.16
ABCDEFG
HIJK
A
B
D
C
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
ABCDEFGHIJK
ABCDEFGHIJK
ABCDEFGHIJK
A
B
D
C
H
I
K
J
E
G
F
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Properties of Agglomerative Clustering
I
several tasks can be solved: partitional clustering with number
of clusters or threshold and hierarchical clustering
I
no metric space is necessary
I
runtime complexity O(n2 log(n))
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Use Case: Object Identification
I
Object Identification (OI) finds identical items for information
integration.
I
OI tasks are semi-supervised.
I
OI models use both clustering and classification techniques.
Steffen Rendle
Clustering
Steffen Rendle
k-Means
Use Case
Summary
Clustering
k-Means
Use Case
Summary
DB
Shop
T-Online
Amazon
Cyberport
Mediamarkt
Mediamarkt
Amazon
Steffen Rendle
Product name
Fuji FinePix S5600
FujiFilm FinePix S5600 Digitalkamera (5 Megapixel, 10fach Zoom)
Fuji FinePix S5600
Fine Pix S 5600
Fine Pix S 9500
Fuji FinePix S5500 Digitalkamera (4 Megapixel, 10x opt. Zoom)
Price
279,00
254,90
259,90
245,00
515,00
349,99
Clustering
k-Means
Use Case
Summary
Object Identification Problem
Problem
Solution
D
A
B
G
Steffen Rendle
H
I
C
B
E
F
D
A
C
E
F
G
H
I
Clustering
k-Means
Use Case
Summary
Adaptive Setting
Problem
D
A
G
Training Set
J
L
M
N
Steffen Rendle
L2
H
I
Q
P
L1
K
E
F
I
H
C
B
E
F
D
A
C
B
G
Solution
L3
R
O
Clustering
k-Means
Use Case
Summary
Types of Labels
Often some parts of the data provide information about identities:
I
Some offers are labeled by a unique identifier
– e.g. an EAN, UPC, ISBN.
I
New offers should be merged into an already integrated
database
– e.g. new products, new shops should be integrated.
I
Some offers are known to be identical / different
– e.g. provided by a supervisor.
I
N databases should be merged and each database contains no
duplicates.
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Iterative Problem Citer
A Consistent Solution
Iterative Problem
D
A
L1
C
B
C
B
E
F
I
L3
L2
E
F
I
H
G
D
A
H
G
Unknown class label
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Constrained Problem Cconstr
A Consistent Solution
Constrained Problem
D
A
D
A
C
B
C
B
E
F
I
F
I
H
G
E
H
G
Must-Link Constraint
Cannot-Link Constraint
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Problem Classes
Problem classes are defined by their preconditions, that restrict the
space E ⊆ X 2 of consistent solutions:
I
Iterative Problems Citer
given: EY with Y ⊆ X
E = {E |EY = E ∩ Y 2 }
I
Constrained Problems Cconstr
given: Rml ⊆ X 2 , Rcl ⊆ X 2
E = {E |E ⊇ Eml ∧ E ∩ Rcl = ∅}
I
Matching Problems
Cmatch
S
given: X = Ai with
A = (A1 , . . . , An )
S
E = {E |E ∩ (X 2 \ ( A2i \ {x, x|x ∈ Ai })) = ∅}
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Hierarchy of Problem Classes
One can show:
Cclassic ⊂ Citer ⊂ Cconstr
Cclassic ⊂ Cmatch ⊂ Cconstr
Citer 6⊆ Cmatch
Cmatch 6⊆ Citer
Steffen Rendle
Clustering
k-Means
Use Case
Summary
There are constrained problems that cannot be expressed as an
iterative problem:
Iterative Problem
A
L1
Constrained Problem
B
A
H
G
L2
H
B
G
Iterative Problem
A
H
Must-Link Constraint
B
Steffen Rendle
G
L1
Clustering
k-Means
Use Case
Summary
Generic Object Identification Model
I
Feature Extraction
ffeature : X 2 → Rn
I
Probabilistic pairwise decision model
fpairwise : X 2 → [0, 1]
I
Collective decision model
fglobal : P(X ) × P(X 2 ) × P(X 2 ) → E
Steffen Rendle
Clustering
Data
Object
x1
x2
x3
k-Means
Brand
Hewlett Packard
HP
Canon
Use Case
Product Name
Photosmart 435 Digital Camera
HP Photosmart 435 16MB memory
Canon EOS 300D black 18-55 Camera
Feature Extraction
Object Pair TFIDF-Cosine Similarity
(Product Name)
(x1 , x2 )
0.6
(x1 , x3 )
0.1
(x2 , x3 )
0.0
FirstNumberEqual
(Product Name)
1
0
0
Summary
Price
118.99
110.00
786.00
Rel. Difference
(Price)
0.076
0.849
0.860
Probabilistic Pairwise Decision Model
Object Pair P[xi ≡ xj ]
(x1 , x2 )
0.8
(x1 , x3 )
0.2
(x2 , x3 )
0.1
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Learning and Constraints
Information provided by constraints can be used for training an
identification model:
I
Probabilistic pairwise decision model: trained classifier (e.g.
SVM)
I
Collective decision model: constrained clustering algorithm
(e.g. constrained HAC) using the pairwise decision model as a
learned similarity measure.
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Constrained Agglomerative Clustering Algorithm
function ConstrainedAgglClustering(X , Rml , Rcl , sim)
m←0
for all i ∈ {1, . . . , n} do
fm (xi ) ← i
end for
fm ← ApplyMustLink(f , Rml )
repeat
(i, j) =
argmax
sim? (fm , i, j)
i,j∈img(fm ),i6=j,not HasCannotLink(fm ,i,j,Rcl )
fm+1 ← fm
for all x ∈ {y |fm (y ) = j do
fm+1 (x) ← i
end for
m ←m+1
until convergence(fm )
return fm
end function
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Constrained Agglomerative Clustering Algorithm
function ApplyMustLink(f , Rml )
for all (x, y ) ∈ Rml do
for all x 0 : f (x 0 ) = f (x) do
f (x 0 ) ← f (y )
end for
end for
return f
end function
function HasCannotLink(f , i, j, Rcl )
return ∃x ∈ f −1 (i), y ∈ f −1 (j) : (x, y ) ∈ Rcl
end function
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Summary
I
Clustering groups data
I
Groups depend on the similarity and the clustering method
I
Clustering is an unsupervised task
I
Semi-supervised clustering can use labels (e.g. on relations) to
learn the similarity measure and to enhance clustering.
Steffen Rendle
Clustering
k-Means
Use Case
Summary
Outlook
I
Fuzzy / Soft clustering, e.g. Fuzzy C-Means
I
I
I
I
I
similarity matrix Sij := sim(di , dj )
use spectral methods on Sij – e.g. eigenvectors – to compute
clusters
Constrained / Semi-supervised clustering
I
I
Steffen Rendle
cluster membership is a probability distribution
Spectral clustering
constraints on objects, pairs, etc. are present
example: object identification
Clustering
k-Means
Use Case
Summary
Literature
A. K. Jain, M. N. Murty, and P. J. Flynn.
Data clustering: a review.
ACM Comput. Surv., 31(3):264–323, 1999.
S. Rendle and L. Schmidt-Thieme.
Object identification with constraints.
In Proceedings of the 6th IEEE International Conference on
Data Mining (ICDM-2006), Hong Kong, 2006.
Steffen Rendle

Machine Learning: Clustering

Transcrição

Documentos relacionados

AV550 - Scanstore

Smart and compact machine for hot chocolate

TO2015_EQ_C93_EQ0000000

Z 2.2 S

+ Peças de reposição

DATI TECNICI della TC 450 TC 450 TECHNICAL FEATURES

Sharpener machine

m32 bistro - class

Untitled

Scarica il catalogo in PDF