Brandenburgische Technische Universität Cottbus
Institut für Informatik
Lehrstuhl Software-Systemtechnik

Diploma Thesis

A Multi-Level Algorithm for Modularity Graph Clustering

Randolf Rotta
June 30, 2008

First reviewer: Prof. Claus Lewerentz
Second reviewer: Prof. Klaus Meer
Advisor: Dr. Andreas Noack

Declaration of Authorship

I hereby declare that I have written this diploma thesis independently and without unauthorized assistance. All tools and sources used are listed completely in the bibliography, and all passages taken verbatim or in substance from these sources are marked as such.

Cottbus, June 2008

Acknowledgements

My special thanks go to my advisor Dr. Andreas Noack; only his valuable, focused comments made this thesis possible. Further thanks go to the chairs of Software Systems Engineering and Theoretical Computer Science for providing the computing equipment for the evaluation of the algorithms. I am also greatly indebted to my parents and to Dr. Romain Gengler for their patience and motivating support. Furthermore, I thank my many fellow students for numerous stimulating and insightful discussions. Finally, thanks are due to the many authors of helpful tools and programs, among them the GNU Autotools, the GNU Compiler Collection, and the C++ Boost libraries, in particular the Spirit parser and scanner framework for file processing, the IOStreams package for transparent compression, and the Boost Graph Library for practical algorithms and many ideas. The graphs for the evaluation come from collections by Mark Newman, Alex Arenas, Uri Alon, and the Pajek Project around Vladimir Batagelj. The analysis of the experimental data was made possible by the GNU R Project for Statistical Computing, the iGraph library, and the GGobi Data Visualization System.

Contents

List of Figures
List of Tables
1 Introduction
  1.1 An Introductory Example
  1.2 Objectives and Outline
2 Graph Clustering
  2.1 Basic Definitions
    2.1.1 Graphs
    2.1.2 Attributed Graphs
    2.1.3 Graph Clusterings and Quality Measures
  2.2 The Modularity Measure of Newman
  2.3 Density-Based Clustering Quality Measures
    2.3.1 Volume and Density
    2.3.2 Bias and the Null Model
    2.3.3 Derived Quality Measures
    2.3.4 Volume Models
  2.4 Fundamental Clustering Strategies
    2.4.1 Directing Components
    2.4.2 Constructing Components
    2.4.3 Meta-Strategies
3 The Multi-Level Refinement Algorithm
  3.1 The Multi-Level Scheme
  3.2 Graph Coarsening
    3.2.1 Greedy Grouping
    3.2.2 Greedy Matching
  3.3 Merge Selectors
    3.3.1 Local Merge Selectors
    3.3.2 Random Walks
    3.3.3 Spectral Methods
  3.4 Cluster Refinement
    3.4.1 Design Space of Local Refinement Algorithms
    3.4.2 Greedy Refinement
    3.4.3 Kernighan-Lin Refinement
  3.5 Further Implementation Notes
    3.5.1 Index Spaces and Graphs
    3.5.2 Data Management
4 Evaluation
  4.1 Methods and Data
    4.1.1 Configuration Space
    4.1.2 Effectiveness
    4.1.3 Efficiency and Scalability
  4.2 Effectiveness of the Graph Coarsening
    4.2.1 Match Fraction
    4.2.2 Coarsening Methods
    4.2.3 Reduction Factor
  4.3 Effectiveness of the Merge Selectors
    4.3.1 Random Walk Distance
    4.3.2 Random Walk Reachability
    4.3.3 Comparison of Merge Selectors
  4.4 Effectiveness of the Cluster Refinement
    4.4.1 Greedy Refinement
    4.4.2 Kernighan-Lin Refinement
  4.5 Scalability
  4.6 Comparison to Reference Algorithms
  4.7 Comparison to Published Results
    4.7.1 The Graphs and Clusterings
    4.7.2 Summary
5 Results and Future Work
  5.1 Results of this Work
  5.2 By-Products
  5.3 Directions for Future Work
    5.3.1 Pre-Coarsening
    5.3.2 Study of Merge Selectors
    5.3.3 Linear Programming
    5.3.4 Multi-Pass Clustering and Randomization
    5.3.5 High-Level Refinement Search
Bibliography
A The Benchmark Graph Collection
B Clustering Results

List of Figures

1.1 Graph of the Mexican political elite
2.1 Example Volume Models
2.2 Graph of US Airports
2.3 Recursive Subdivision and Hierarchical Clusterings
3.1 Operations for local cluster modification
3.2 The Multi-Level Scheme
3.3 The Multi-Level Algorithm
3.4 Coarsening Method: Greedy Grouping
3.5 Merging two Vertices and their Edges
3.6 Coarsening Method: Greedy Matching
3.7 Contribution of Neighbor Vertices to the Visit Probability
3.8 Spectral vertex vectors and two cluster vectors
3.9 Dependencies when Moving a Vertex
3.10 Refinement Method: Complete Greedy
3.11 Refinement Method: Sorted Greedy
3.12 Refinement Method: basic Kernighan-Lin
3.13 Kernighan-Lin Refinement Creating Clusters on Graph Epa main
3.14 Effective Search Depth of Kernighan-Lin Refinement
3.15 Index Spaces: Class and Concept Diagram
3.16 Index Maps: Class and Concept Diagram
3.17 C++ Example: Calculation of Vertex Degrees
4.1 Mean Modularity by Match Fraction (reduced set)
4.2 Modularity and Runtime by Reduction Factor
4.3 The Random Walk Distance
4.4 The Random Walk Reachability
4.5 Mean Modularity of the Merge Selectors (large set)
4.6 Runtime of the Merge Selectors
4.7 Mean Modularity by Refinement Method (reduced set)
4.8 Mean Modularity by Refinement Method (large set)
4.9 Runtime by Graph Size
4.10 Clustering Results and Runtime of the Reference Algorithms
4.11 Reference Graphs (karate and dolphins)
4.12 Reference Graphs (polBooks and afootball)
4.13 Reference Graphs (polBooks and celegans metabolic)
4.14 Reference Graphs (circuit s838 and email)

List of Tables

2.1 Overview of Modularity Graph Clustering Algorithms
2.2 Overview of Clustering and Partitioning Algorithms
3.1 Proposed Merge Selection Qualities
3.2 Classification of Refinement Algorithms
3.3 Hierarchical Naming Convention for File Names
4.1 The Configuration Space
4.2 Mean Modularity by Match Fraction
4.3 Mean Modularity by Reduction Factor
4.4 Runtime by Reduction Factor
4.5 Mean Modularity with Random Walk Distance
4.6 Mean Modularity with Random Walk Reachability
4.7 Mean Modularity of Merge Selectors
4.8 Mean Modularity by Refinement Method (reduced set)
4.9 Mean Modularity by Refinement Method (large set)
4.10 Clustering Results of Reference Algorithms
4.11 Comparison to Published Results
A.1 The Benchmark Graph Collection
A.2 References to the Graph Sources
B.1 Random Walk Distance by Graph
B.2 Random Walk Reachability, Length 2
B.3 Random Walk Reachability, Length 3
B.4 Clustering Results from the Refinement Phase
B.5 Runtime Measurements
B.6 Runtime of the Reference Algorithms

1 Introduction

Since the rise of computers and related information technologies it has become easy to collect vast amounts of data. Combined with technological advances in linking persons, discussions, and documents, a wide variety of relational information is readily available for analysis. Such networks can provide insight into the functioning of whole societies. Network analysis has therefore become an important tool also outside the scientific community, for example in market analysis and politics. The exploration and utilization of the collected data requires elaborate analysis strategies supported by computational methods.

A common abstraction for relational networks is the graph. In a graph, entities such as persons are represented by vertices, and the observed relationships between entities are described by edges connecting related vertices. In addition, the edges may be attributed with weights describing the strength of the connections.

Graph clustering is the problem and process of grouping vertices into disjoint sets called clusters. Discovering such clusters makes it possible to analyze the shared properties within and the differences between groups. Certainly many different clusterings are possible, so a quality measure is necessary in order to find practically useful ones. This measure guides optimization algorithms towards interesting clusterings. A few years ago Mark Newman introduced a specific quality measure called modularity [60]. It was developed for the analysis of social networks but has also proved useful in other scientific areas.

1.1 An Introductory Example

An example drawing of such a network is given in Figure 1.1. It is based on a study by Jorge Gil-Mendieta and Samuel Schmidt [30] of the network of influential Mexican politicians between 1910 and 1994. In the graph each politician is represented by a vertex, and edges connect pairs of politicians based on friendships, business partnerships, marriages, and the like. In the figure vertices are displayed as boxes and circles with a size proportional to their number of connections. The placement of the vertices was computed with the LinLog energy model [62], and edges are drawn as straight lines connecting related vertices.

In the figure the vertices are colored according to a clustering of the politicians. The shown clustering was constructed automatically by the algorithm developed in this work by optimizing the modularity measure. Three major clusters are visible. They roughly correspond to three generational changes and to differences between military and civil backgrounds. Similar groups were also discovered by Gil-Mendieta and Schmidt, who derived groups manually by studying the importance and influence of single persons in the network using various special-purpose centrality indices.
Figure 1.1: Graph of the Mexican political elite [30] from 1910 to 1994. Each box represents a Mexican politician, labeled with the family name and the main year of political activity. The names of those politicians who became president are printed in red. The lines connect related politicians based on friendships, business partnerships, marriages, and the like. The color of the boxes shows the best clustering computed with the algorithms presented in this work.

1.2 Objectives and Outline

The main objective of this work is the implementation and evaluation of effective and efficient clustering algorithms for the modularity quality measure of Newman [60]. Optimizing the modularity is an NP-complete problem [13]. It is therefore unlikely that a fast, polynomial-time algorithm exists that finds a clustering of optimal modularity in all graphs; relatively good clusterings have to suffice. The major design goal for the clustering algorithm is effectiveness: the developed algorithm shall be able to find clusterings of very good modularity compared to alternative methods. Efficiency is a secondary concern: complicated and expensive algorithms should be replaced by simpler alternatives in case they are not able to produce significantly better clusterings.

This work focuses on multi-level refinement heuristics. In the past these were successfully applied to similar problems like minimum cut partitioning [41, 43, 78]. However, modularity clustering differs significantly from these problems. For example, the optimization aim is more complex because the number of clusters is not known in advance but has to be discovered by the algorithm. To date no detailed study exists on the adaptation of multi-level refinement to modularity clustering. In this context a comparison of the effectiveness and efficiency with other clustering algorithms is also necessary.

Chapter 2 introduces the mathematical formulation of the modularity clustering aim and discusses related concepts. This analysis provides the basis for the later development of strategies specific to the modularity. The chapter concludes in Section 2.4 with a study of existing clustering algorithms, from which fundamental strategies and components are derived. The developed multi-level algorithm will combine several particularly useful components to benefit from their advantages.

Based on the insights into the mathematical structure and the fundamental clustering strategies, Chapter 3 presents the developed family of algorithms. The algorithms are based on multi-level refinement methods adapted to the special needs of modularity clustering. The algorithm operates in two phases: first a hierarchy of coarsened graphs is produced, then this hierarchy is used in the refinement of an initial clustering. The coarsening phase involves the selection of pairs of clusters to be merged. A sub-goal of this work is the exploration of alternative selection criteria in order to improve the effectiveness and the overall clustering results. This includes criteria derived from random walks and spectral methods, as presented in Section 3.3. In addition, the implementation should incorporate appropriate data structures and algorithms to support these merge selectors efficiently. The second major component of the algorithm are cluster refinement heuristics. These try to improve an initial clustering by exploring similar clusterings and selecting one of them. Here such neighbor clusterings are generated by moving single vertices.
The objective of Section 3.4 is the exploration of heuristics for the search of local optima, i.e. clusterings without any neighbor clusterings of better modularity. Here efficiency is an important concern because many traditional heuristics, like Kernighan-Lin refinement [45, 24], cannot be implemented efficiently for the modularity measure.

In order to assess the practical usefulness of the developed algorithm, an empirical evaluation is carried out in Chapter 4. The experimental measurements are based on a set of benchmark graphs collected from a wide range of application domains. First the different configurations of the algorithm are compared regarding their efficiency and effectiveness in finding good clusterings. Based on this evaluation an efficient and effective default configuration and a few effective variations are chosen. These are then compared against a range of reference algorithms and clustering results found in the literature.

2 Graph Clustering

This chapter introduces graph clustering with the focus on Newman's modularity as optimization aim. It provides the foundations needed to develop and compare clustering algorithms. For this purpose a mathematical analysis of the modularity and related models is as necessary as the study of existing algorithms. This chapter is organized as follows. First graphs and clusterings are formally defined together with helpful mathematical notations. The second section introduces the modularity measure in the form used throughout this work. In order to substantiate its relationship to other quality measures, the third section presents an axiomatic derivation from the density concept. The chapter concludes with a review and summary of fundamental clustering strategies.

2.1 Basic Definitions

This section formally introduces graphs, weights, and clusterings. In the first subsection different types of graphs are discussed: undirected graphs will be used in the derivation of quality measures, while symmetric, directed graphs are used as the internal representation in the implementation. The second subsection on attributed graphs defines vertex and edge properties and basic computations on edge weights. The final subsection formally introduces clusterings and quality measures.

2.1.1 Graphs

An undirected graph G = (V, E) is a set of vertices V combined with a set of edges E. Two vertices u, v ∈ V are said to be adjacent if they are connected by an edge e ∈ E. In that case u and v are called end-vertices of the edge e, and the edge is incident to both vertices. Each edge connects at most two different vertices and is called a self-edge if it connects a single vertex just to itself.

Let E(X, Y) ⊆ E be the subset of edges connecting vertices in X ⊆ V with vertices in Y ⊆ V in an undirected graph. The edges incident to vertex v are E(v) := E({v}, V), and two vertices u, v are adjacent if E(u, v) is non-empty. Conversely, let V(e) ⊆ V be the set of end-vertices incident to edge e ∈ E. For example, the neighborhood of a vertex is the set of all adjacent vertices including the vertex itself and is expressed by N(v) := V(E(v)) ∪ {v}. The subgraph induced by the vertices X ⊆ V is G[X] := (X, E(X, X)). A graph is bipartite if the vertices can be dissected into two disjoint sets A, B such that all edges lie between both sets, i.e. E(A, B) = E.

In directed graphs all edges are oriented and connect a start-vertex with an end-vertex. A directed graph is symmetric if for each edge an edge in the opposing direction also exists. In that case the graph topology is equal to an undirected graph. In symmetric, directed graphs an enumeration of the incident edges is stored for each vertex, and these edges point to the other end-vertex. This form is used as the internal representation for undirected graphs because it is easier to traverse.
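To make this representation concrete, a minimal C++ sketch of such a symmetric adjacency structure is given below. It only illustrates the idea; the type and member names are chosen for this example and are not taken from the implementation described in Chapter 3.

    #include <cstddef>
    #include <vector>

    // Sketch: an undirected graph stored as a symmetric, directed graph.
    // Every undirected edge {u,v} becomes the two half-edges (u,v) and
    // (v,u); a self-edge is stored once.
    struct HalfEdge {
        std::size_t target;  // the other end-vertex
        double      weight;  // edge weight f(e), shared by both directions
    };

    struct Graph {
        // adjacency[v] enumerates the half-edges leaving vertex v
        std::vector<std::vector<HalfEdge>> adjacency;

        explicit Graph(std::size_t vertexCount) : adjacency(vertexCount) {}

        void addEdge(std::size_t u, std::size_t v, double weight) {
            adjacency[u].push_back({v, weight});
            if (u != v) adjacency[v].push_back({u, weight});
        }
    };

Enumerating all neighbors of a vertex is then a linear scan over adjacency[v], which is what makes this form easier to traverse than a plain edge list.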
An undirected graph is called simple if each pair of vertices is connected by at most one edge. Otherwise several parallel edges may exist connecting the same end-vertices; such graphs are called multigraphs. Similarly, a directed graph is simple if it has at most one edge per direction. Throughout this work only simple graphs are used.

2.1.2 Attributed Graphs

Commonly vertices and edges are attributed with additional properties. A vertex property is a function mapping each vertex to its property value; edge properties are defined analogously. The simplest properties are textual labels like VertexName : V → Σ*. They can be used in visual graph representations to identify and describe vertices and edges in more detail. Another frequently used type are numerical properties like size, volume, weight, color, or the spatial position of vertices and the distance between adjacent vertices.

Throughout this work mainly the abstract vertex weight w : V → Q and edge weight f : E → Q are used. When not otherwise noted these weights are non-negative. To simplify calculations, the edge weight between a pair of vertices is denoted by

$$f(u,v) := \sum_{e \in E(u,v)} f(e)$$

and between vertex sets X, Y ⊆ V by

$$f(X,Y) := \sum_{e \in E(X,Y)} f(e)$$

Therefore the edge weight between non-adjacent vertices (no connecting edge) is zero. In undirected graphs all edge weights are symmetric with f(u, v) = f(v, u). In case edge weights ω(u, v) between arbitrary vertex pairs are required, the weight-induced graph (V, ω) with E(u, v) ≠ ∅ ⇔ ω(u, v) ≠ 0 can be used. Note, however, that while non-zero edge weights indicate connected vertices, the opposite is not necessarily true: the edges of a graph define neighborhood relationships in which some edges might nevertheless have a zero weight. In cases where the edge weights of the graph are more important than the single edges, the notation (V, f) will be used.

The total edge weight f(X) of a set of vertices X ⊆ V is the weight of all edges having both end-vertices in X. Some care is necessary not to count edges twice in undirected graphs. For disjoint subsets X ∩ Y = ∅ the equation f(X, Y) = f(X ∪ Y) − f(X) − f(Y) holds. Therefore the total edge weight in X is:

$$f(X) = \frac{1}{2} \sum_{u \neq v \in X} f(u,v) + \sum_{v \in X} f(v,v) \qquad (2.1)$$

The degree of a vertex is the number of incident edges |E({v}, V)|. The weighted degree deg_f(v) is the sum over the weights f(e) of the incident edges. Self-edges are counted twice in order to obtain Σ_v deg_f(v) = 2f(V). From here on only the weighted degree is used and hence the subscript is dropped. In summary, the degree is calculated by:

$$\deg(v) = \sum_{x \in V} f(v,x) + f(v,v) \qquad (2.2)$$
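As an illustration, the weighted degree of equation (2.2) can be computed in a single pass over the incident half-edges. The sketch reuses the hypothetical HalfEdge/Graph types from Section 2.1.1 and is not the thesis implementation (the actual C++ code is discussed in Section 3.5).

    // Weighted degree per equation (2.2): the sum of incident edge
    // weights, with self-edges counted twice so that the degrees of all
    // vertices add up to 2 f(V).
    double weightedDegree(const Graph& g, std::size_t v) {
        double degree = 0.0;
        for (const HalfEdge& e : g.adjacency[v]) {
            degree += e.weight;
            if (e.target == v) degree += e.weight;  // self-edge counts twice
        }
        return degree;
    }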
2.1.3 Graph Clusterings and Quality Measures

A clustering C partitions the vertices of a graph into non-empty, disjoint subsets covering the whole vertex set. The clustering consists of the clusters i ∈ C with their vertices C_i ⊆ V; hence the number of clusters is |C|. The cluster containing vertex v is denoted by C(v) ∈ C, thus C(v) = i ⇔ v ∈ C_i. To improve readability, C[v] := C_{C(v)} directly refers to the vertices of the cluster containing vertex v. The term clustering is mostly used in the context of (social) network analysis because there the aim is to group vertices together. The load-balancing community mostly uses the term partitioning to emphasize the search for balanced dissections.

A clustering quality measure Q(C) maps the clusterings of a graph to rational numbers, called the clustering quality. This way the quality measure defines a semi-order on the set of clusterings, with clustering C being at least as good as clustering D if Q(C) ≥ Q(D). This enables the easy comparison of clusterings and the search for high-quality clusterings, although several different clusterings may be of the same quality.

Two quality measures Q1, Q2 are said to be ranking equivalent when they order all clusterings equally. This is the case when for all pairs of clusterings C, D it holds that Q1(C) ≤ Q1(D) ⇔ Q2(C) ≤ Q2(D). Ranking equivalence is invariant under the addition of constants α ∈ Q and multiplication with positive constants 0 < β ∈ Q:

$$Q(C) \le Q(D) \iff \alpha + \beta\, Q(C) \le \alpha + \beta\, Q(D)$$

This allows the normalization of quality measures against other graph properties, which is necessary before comparing qualities between different graphs. Often the maximum quality over all clusterings or upper limits are used for normalization.

2.2 The Modularity Measure of Newman

This section shortly introduces the clustering quality measure called modularity, which is used throughout this work. The modularity was developed by Newman for community detection in social networks [58, 59]. Here the modularity definition from [57] is used, and it is connected to a volume and density model. The section concludes with basic properties of this quality measure. An axiomatic derivation of the modularity is presented in the next section. Proofs for the NP-completeness of modularity optimization and other properties can be found in [13]. The modularity has a resolution limit: on growing graphs small clusters become invisible [25, 5].

Looking at the edges covered by a cluster, Newman observes the necessity to make a "judgment about when a particular density of edges is significant enough to define a community" [57]. Hence more edges should be in clusters than expected from a random redistribution of all edges. This is achieved by maximizing Σ_{C(u)=C(v)} (f(u,v) − P(u,v)), where f(u,v) is the actual edge weight and P(u,v) the expected weight, called the null model. The model chosen by Newman has the following properties: the weights are symmetric with P(u,v) = P(v,u), the total weight equals the total weight of the original graph with P(V) = f(V), and the vertex degree is reproduced for all vertices by deg_P(v) = deg_f(v). Putting everything together yields P(u,v) := deg(u) deg(v)/(2f(V)). As a last step Newman normalized the modularity against the total edge weight f(V) and obtained:

$$Q(C) := \frac{1}{f(V)} \sum_{C(u)=C(v)} \left( f(u,v) - \frac{\deg(u)\deg(v)}{2 f(V)} \right) \qquad (2.3)$$
$$= \sum_{C(u)=C(v)} \left( \frac{f(u,v)}{f(V)} - \frac{\deg(u)\deg(v)}{2 f(V)^2} \right) \qquad (2.4)$$

More insight into this measure can be gained by using the volume Vol(u,v) with Vol(u,v) = deg(u) deg(v) and Vol(v,v) = deg(v)²/2. Then the null model equals the volume scaled to the total edge weight: P(u,v) = f(V) Vol(u,v)/Vol(V). Here Vol(V) is the total volume and equals (2f(V))²/2 because Σ_v deg(v) = 2f(V). As the above scaling factor looks similar to a density (mass per volume), it is called the connection density and denoted by ρ(V) = f(V)/Vol(V). In this context the null model is the local volume multiplied by the global density, and the modularity is rewritten as:

$$Q(C) = \frac{1}{f(V)} \sum_{C(u)=C(v)} \bigl( f(u,v) - \rho(V)\,\mathrm{Vol}(u,v) \bigr) \qquad (2.5)$$
$$= \sum_{C(u)=C(v)} \left( \frac{f(u,v)}{f(V)} - \frac{\mathrm{Vol}(u,v)}{\mathrm{Vol}(V)} \right) \qquad (2.6)$$

Now it is visible that the modularity is optimized by maximizing the fraction of edges lying in clusters while minimizing the fraction of expected edges. As a consequence the modularity of a clustering with just one cluster is zero. Moreover, in a graph where the actual edge weights are distributed exactly like the null model, all clusterings have zero modularity. A further useful property of the modularity is its additivity, as it can be written as a sum over edge weights or over the contributions of vertices or clusters:

$$Q(C) = \sum_{C(u)=C(v)} \left( \frac{f(u,v)}{f(V)} - \frac{\mathrm{Vol}(u,v)}{\mathrm{Vol}(V)} \right) \qquad (2.7)$$
$$= \sum_{v \in V} \sum_{u \in C[v]} \left( \frac{f(u,v)}{f(V)} - \frac{\mathrm{Vol}(u,v)}{\mathrm{Vol}(V)} \right) \qquad (2.8)$$
$$= \sum_{i \in C} \left( \frac{f(C_i)}{f(V)} - \frac{\mathrm{Vol}(C_i)}{\mathrm{Vol}(V)} \right) \qquad (2.9)$$
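The cluster-wise form (2.9) translates directly into code: one pass over the half-edges collects f(C_i) and deg(C_i), and a second pass over the clusters sums the contributions. The following is a hedged sketch reusing the hypothetical Graph type from Section 2.1.1; cluster[v] denotes C(v).

    #include <cstddef>
    #include <vector>

    // Modularity per equation (2.9) with the degree-based null model:
    //   Q(C) = sum_i [ f(C_i)/f(V) - (deg(C_i) / (2 f(V)))^2 ]
    double modularity(const Graph& g,
                      const std::vector<std::size_t>& cluster,
                      std::size_t clusterCount) {
        std::vector<double> twiceInterior(clusterCount, 0.0);  // 2 f(C_i)
        std::vector<double> degreeSum(clusterCount, 0.0);      // deg(C_i)
        double twiceTotal = 0.0;                               // 2 f(V)

        for (std::size_t v = 0; v < g.adjacency.size(); ++v) {
            for (const HalfEdge& e : g.adjacency[v]) {
                // a self-edge is stored once but counts twice
                const double w = (e.target == v) ? 2.0 * e.weight : e.weight;
                degreeSum[cluster[v]] += w;
                twiceTotal += w;
                if (cluster[e.target] == cluster[v])
                    twiceInterior[cluster[v]] += w;
            }
        }

        double q = 0.0;
        for (std::size_t i = 0; i < clusterCount; ++i) {
            const double a = degreeSum[i] / twiceTotal;  // volume fraction
            q += twiceInterior[i] / twiceTotal - a * a;  // edge fraction - a^2
        }
        return q;
    }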
2.3 Density-Based Clustering Quality Measures

The popular modularity measure of Newman was introduced in the previous section. However, algorithms proposed in the literature often optimize different but related quality measures. In order to gain some insight into these, an axiomatic derivation is presented in this section, and further connections, differences, and possible interpretations are outlined. The whole discussion underlines why the modularity is used and where its properties come from.

The main purpose of the clustering quality measure is to direct the search for good vertex clusters. The measures of interest here are based on the following properties:

• The optimization should lead to clusters with strong internal connectivity and sparse connections between clusters. This is the intuition behind "a group of people" and "a subsystem".
• Secondly, the strength of connectivity should be relative to an expected, average connectivity. This allows adapting the search to application-specific questions by providing a reference graph of expected edge weights.
• In structureless graphs similar to the expected average, all clusterings should have only low quality. Otherwise the measure is biased towards discovering structure where none is expected.
• Finally, the number of clusters is not known in advance. It is the task of the quality measure to determine this number. Thus the measure has to avoid trivial solutions. For example, measures always leading to bisections are of no use when more than two clusters are possible.

The axiomatic derivation of quality measures is a popular approach. Here, as in most cases, the desired properties (axioms) are more or less designed to fit the target quality measures. In contrast, sometimes axioms are sought that directly lead to impossible quality measures. For example, in [46] three properties for similarity clustering based on vertex distances are defined: scale-invariance (scaling all distances does not change the cluster structure), richness (for any possible clustering a matching distance function exists), and consistency (increasing inter-cluster distances or decreasing intra-cluster distances does not change the clustering). No quality measure can fulfill all three properties at the same time. Fortunately, the properties chosen above for graph clustering are not that restrictive.
The next subsection introduces the connection density and the basic volume model. Then the relation between biased measures and the density is analyzed. The third subsection explores density-based clustering qualities, starting from the inter-cluster edge weight. This discussion also shows how the widely accepted modularity embeds into this framework. All quality measures will be independent of concrete volume models; some specific volume models are discussed in the last subsection.

2.3.1 Volume and Density

How can the strength of connectivity in and between clusters be measured? One possibility arises by transferring the notion of density from physics to graph clustering. Then the connection density is the proportion of actual edges (mass) to "possible" edges (volume). The actual edges are measured by the sum of their weights f(u,v), and the "possible" edge weight is given by the reference graph (V, Vol(u,v)), called the volume model from here on. Hence the connection density is defined as:

$$\rho(A, B) := \frac{f(A,B)}{\mathrm{Vol}(A,B)} \qquad (2.10)$$
$$\rho(A) := \frac{f(A)}{\mathrm{Vol}(A)} \qquad (2.11)$$

As only undirected graphs are considered, it is reasonable to require the volume to be symmetric, Vol(u,v) = Vol(v,u), and positive, Vol(u,v) ≥ 0. Because edges can also be expected where no real edges are, the volume weights are defined on all vertex pairs. The volume has to be non-zero wherever the edge weight is above zero.

2.3.2 Bias and the Null Model

Given a graph (V, f), a uniform density graph (V, f0) is derived by evenly distributing the global density over all vertex pairs using ρ(V) = f0(u,v)/Vol(u,v). The uniform density weights f0(u,v) = ρ(V) Vol(u,v) are called the null model. The scaling by ρ(V) transforms the volume into average edge weights.

A quality measure is called biased when it does not assign the same quality to all clusterings of the graph (V, f0). This bias is unfavorable because it signals structure where no structure is expected. Conversely, a quality measure is unbiased if applying it to the null model yields a constant quality. The bias can be removed from a quality measure by either dividing by its null model quality (scaling) or subtracting the null model quality (shifting) [28, 62]. Let Q_{f0}(C) be the quality of clustering C in the null model. Then unbiased measures are derived as in the equations below. In uniform density graphs the scaled measure has the same quality 1.0 for all clusterings, and shifting yields constant zero quality.

$$Q^{\text{scaled}}(C) := Q(C) / Q_{f_0}(C) \qquad (2.12)$$
$$Q^{\text{shifted}}(C) := Q(C) - Q_{f_0}(C) \qquad (2.13)$$

2.3.3 Derived Quality Measures

Reformulating the first clustering aim in terms of density, vertex partitions with low connection density between clusters and high inner density are searched for. This aim may be approached directly by maximizing the difference or the ratio between intra-cluster density and inter-cluster density:

$$Q^{\text{diff}}(C) := \frac{\sum_{i \in C} f(C_i)}{\sum_{i \in C} \mathrm{Vol}(C_i)} - \frac{\sum_{i \neq j \in C} f(C_i, C_j)}{\sum_{i \neq j \in C} \mathrm{Vol}(C_i, C_j)} \qquad (2.14)$$
$$Q^{\text{ratio}}(C) := \frac{\sum_{i \in C} f(C_i)}{\sum_{i \in C} \mathrm{Vol}(C_i)} \left( \frac{\sum_{i \neq j \in C} f(C_i, C_j)}{\sum_{i \neq j \in C} \mathrm{Vol}(C_i, C_j)} \right)^{-1} \qquad (2.15)$$

Both quality measures are not well studied and probably rarely used. In practice it proved sufficient to concentrate on either the inter- or the intra-cluster edges. To improve readability, let Cut_ω(C) be the total inter-cluster weight ω(u,v) of clustering C and Int_ω(C) the total intra-cluster weight (the interior):

$$\mathrm{Cut}_\omega(C) := \sum_{i \neq j \in C} \omega(C_i, C_j)/2 \qquad (2.16)$$
$$\mathrm{Int}_\omega(C) := \sum_{i \in C} \omega(C_i) \qquad (2.17)$$

Interior and cut are related by Cut_ω(C) = ω(V) − Int_ω(C), where ω(V) is the total weight. The edge cut Cut_f refers to the edge weights f, whereas the cut volume is Cut_Vol.
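A sketch of how these four totals can be accumulated in one pass follows. The volume terms anticipate the multiplicity model derived in Section 2.3.4, where Int_Vol(C) = Σ_i w(C_i)²/2 and Vol(V) = w(V)²/2; the names and the weight array are illustrative, not the thesis implementation, and the Graph type is the hypothetical sketch from Section 2.1.1.

    #include <cstddef>
    #include <vector>

    // Edge and volume versions of interior and cut, per equations
    // (2.16)/(2.17) and the relation Cut(C) = total weight - Int(C).
    struct ClusteringTotals {
        double edgeCut, edgeInterior;  // Cut_f(C), Int_f(C)
        double volCut, volInterior;    // Cut_Vol(C), Int_Vol(C)
    };

    ClusteringTotals cutAndInterior(const Graph& g,
                                    const std::vector<double>& weight,  // w(v)
                                    const std::vector<std::size_t>& cluster,
                                    std::size_t clusterCount) {
        double twiceTotal = 0.0, twiceInterior = 0.0;
        std::vector<double> clusterWeight(clusterCount, 0.0);  // w(C_i)
        double totalWeight = 0.0;                              // w(V)

        for (std::size_t v = 0; v < g.adjacency.size(); ++v) {
            clusterWeight[cluster[v]] += weight[v];
            totalWeight += weight[v];
            for (const HalfEdge& e : g.adjacency[v]) {
                const double w = (e.target == v) ? 2.0 * e.weight : e.weight;
                twiceTotal += w;
                if (cluster[e.target] == cluster[v]) twiceInterior += w;
            }
        }

        ClusteringTotals t{};
        t.edgeInterior = twiceInterior / 2.0;
        t.edgeCut      = twiceTotal / 2.0 - t.edgeInterior;
        t.volInterior  = 0.0;
        for (double wc : clusterWeight) t.volInterior += wc * wc / 2.0;
        t.volCut       = totalWeight * totalWeight / 2.0 - t.volInterior;
        return t;
    }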
Using this short notation, for example, both measures from above can be written as:

$$Q^{\text{diff}}(C) = \frac{\mathrm{Int}_f(C)}{\mathrm{Int}_{\mathrm{Vol}}(C)} - \frac{\mathrm{Cut}_f(C)}{\mathrm{Cut}_{\mathrm{Vol}}(C)} \qquad (2.18)$$
$$Q^{\text{ratio}}(C) = \frac{\mathrm{Int}_f(C)}{\mathrm{Int}_{\mathrm{Vol}}(C)} \left( \frac{\mathrm{Cut}_f(C)}{\mathrm{Cut}_{\mathrm{Vol}}(C)} \right)^{-1} \qquad (2.19)$$

Traditionally the edge cut is used in load balancing problems to minimize the communication overhead represented by the edge weights. But without further constraints, like a predefined number of clusters or balance requirements, edge cut minimization has trivial solutions; for example, placing all vertices in the same partition yields zero cut. Additionally, the edge cut is biased with Cut_{f0}(C) = ρ(V) Cut_Vol(C), and so is the interior with Int_{f0}(C) = ρ(V) Int_Vol(C). Removing this bias yields the scaled cut and scaled interior discussed in the next subsection, and the shifted cut and shifted interior discussed in the subsection after that.

Scaled Cut and Interior

Removing the bias from cut and interior by scaling yields the scaled cut and scaled interior:

$$Q^{\text{scaled cut}}(C) := \frac{1}{\rho(V)} \frac{\mathrm{Cut}_f(C)}{\mathrm{Cut}_{\mathrm{Vol}}(C)} = \frac{\mathrm{Cut}_f(C)}{\mathrm{Cut}_{\mathrm{Vol}}(C)} \left( \frac{f(V)}{\mathrm{Vol}(V)} \right)^{-1} \qquad (2.20)$$
$$Q^{\text{scaled int}}(C) := \frac{1}{\rho(V)} \frac{\mathrm{Int}_f(C)}{\mathrm{Int}_{\mathrm{Vol}}(C)} = \frac{\mathrm{Int}_f(C)}{\mathrm{Int}_{\mathrm{Vol}}(C)} \left( \frac{f(V)}{\mathrm{Vol}(V)} \right)^{-1} \qquad (2.21)$$

Recalling the definition of density, it is visible that the scaled cut minimizes the inter-cluster density. The complete measure is the inter-cluster density normalized against the global density. It can also be interpreted as the ratio between the cut edges in the input graph and in the null model. The same holds for the scaled interior, which maximizes the intra-cluster density.

In order to optimize the separation between clusters, the scaled cut is minimized. However, this prefers coarse clusterings and bisections. Merging two clusters i ≠ j ∈ C does not increase the inter-cluster density when the following condition holds:

$$\frac{\mathrm{Cut}_f(C)}{\mathrm{Cut}_{\mathrm{Vol}}(C)} \ge \frac{\mathrm{Cut}_f(C) - f(C_i, C_j)}{\mathrm{Cut}_{\mathrm{Vol}}(C) - \mathrm{Vol}(C_i, C_j)} \qquad (2.22)$$

This is the case for ρ(C_i, C_j) ≥ Cut_f(C)/Cut_Vol(C), i.e. when C_i, C_j have a higher inter-cluster density than the overall inter-cluster density.
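To make this step explicit, cross-multiplying (both cut terms are positive for nontrivial clusterings) turns condition (2.22) into the density condition:

$$\frac{\mathrm{Cut}_f(C)}{\mathrm{Cut}_{\mathrm{Vol}}(C)} \ge \frac{\mathrm{Cut}_f(C) - f(C_i,C_j)}{\mathrm{Cut}_{\mathrm{Vol}}(C) - \mathrm{Vol}(C_i,C_j)} \iff \mathrm{Cut}_{\mathrm{Vol}}(C)\,f(C_i,C_j) \ge \mathrm{Cut}_f(C)\,\mathrm{Vol}(C_i,C_j)$$
$$\iff \rho(C_i,C_j) = \frac{f(C_i,C_j)}{\mathrm{Vol}(C_i,C_j)} \ge \frac{\mathrm{Cut}_f(C)}{\mathrm{Cut}_{\mathrm{Vol}}(C)}$$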
Intuitively there should always be at least one pair with an inter-density higher than the current "average" inter-density, and therefore at least one nontrivial clustering of minimal inter-density exists that has only two clusters. Another good reason to distrust the scaled cut is that the lowest inter-cluster density, 0.0, can easily be obtained by putting all vertices into the same cluster.

Emphasizing the inner cohesion of clusters by maximizing the scaled interior also tends towards trivial solutions: splitting a cluster C_i into C_{i,1}, C_{i,2} does not decrease the overall intra-density in case ρ(C_{i,1}, C_{i,2}) ≤ Int_f(C)/Int_Vol(C). Finding a cluster that can be cut with a lower inter-density than the current overall intra-cluster density is very likely. Optimal quality thus leads to very fine-grained clusterings: placing each vertex into a singleton cluster, except for the vertex pair of highest intra-density, yields an optimal clustering, as adding other clusters can only lower the overall intra-density.

Although the scaled cut and interior are not biased, they have trivial solutions. Hence both are not well suited for graph clustering. The scaled cut is related to random walks and is known as conductance in that context [49].

Shifted Cut and Interior

Shifting the edge cut and interior by their null model bias produces the shifted cut and shifted interior:

$$Q^{\text{shifted cut}}(C) := \mathrm{Cut}_f(C) - \rho(V)\,\mathrm{Cut}_{\mathrm{Vol}}(C) \qquad (2.23)$$
$$Q^{\text{shifted int}}(C) := \mathrm{Int}_f(C) - \rho(V)\,\mathrm{Int}_{\mathrm{Vol}}(C) \qquad (2.24)$$

Minimizing the shifted cut searches for clusters where the weight of the cut edges is lower than the expected weight. Similarly, shifted interior maximization prefers clusters with more internal edge weight than expected. Because of Int_f(C) = f(V) − Cut_f(C), maximizing the shifted interior is ranking equivalent to minimizing the shifted cut. Thus both produce identical clusterings and share the same properties.

Applying the analysis method of the previous subsection to the shifted cut yields the condition ρ(C_i, C_j) ≥ ρ(V) for successfully merging two clusters and ρ(C_{i,1}, C_{i,2}) ≤ ρ(V) for successfully splitting a cluster C_i into C_{i,1} ∪ C_{i,2} = C_i. Hence the shifted cut tends towards cuts with an inter-cluster density smaller than the global "average" density. The parameter ρ(V) was introduced through the null model and is independent of the current clustering. It is possible to replace it by αρ(V) in order to control the granularity of the clusterings. But with α going to zero, the shifted cut would, for example, degenerate towards the plain edge cut, which ignores the volume.

A further question is how the quality measure handles clusters composed of disconnected components. Let C′ be the components of a cluster not connected by any edge. By definition the cut edge weight between them is zero, Cut_f(C′) = 0, so splitting the components into separate clusters does not change the cut and interior edge weight. But the interior volume decreases and the cut volume increases. Therefore, conforming with intuition, the quality measured with shifted interior and cut improves through this separation.

As already mentioned, the shifted cut minimizes the cut of actual edges while maximizing the cut of expected edges. Normalizing against the total edge weight f(V) allows another interpretation related to the scaled cut: the normalized shifted cut Q^{n-shifted cut}(C) minimizes the fraction of inter-cluster edges while maximizing this fraction in the reference graph. Similarly, the normalized shifted interior maximizes the fraction of edges in clusters while minimizing the fraction of reference edges. This is consistent with the desired property of finding groups with stronger connections than expected.

$$Q^{\text{n-shifted cut}}(C) := \frac{\mathrm{Cut}_f(C)}{f(V)} - \frac{\mathrm{Cut}_{\mathrm{Vol}}(C)}{\mathrm{Vol}(V)} \qquad (2.25)$$
$$Q^{\text{n-shifted int}}(C) := \frac{\mathrm{Int}_f(C)}{f(V)} - \frac{\mathrm{Int}_{\mathrm{Vol}}(C)}{\mathrm{Vol}(V)} = -Q^{\text{n-shifted cut}}(C) \qquad (2.26)$$

Here it is visible that the normalized version of the shifted interior looks very similar to the modularity measure introduced in Section 2.2. In fact, just the matching volume model needs to be inserted.
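Indeed, substituting the degree multiplicity of Section 2.3.4, with Vol(C_i) = deg(C_i)²/2 and Vol(V) = 2f(V)², into the cluster-wise form recovers equation (2.6) and hence the modularity (2.4):

$$Q^{\text{n-shifted int}}(C) = \sum_{i \in C} \left( \frac{f(C_i)}{f(V)} - \frac{\mathrm{Vol}(C_i)}{\mathrm{Vol}(V)} \right) = \sum_{C(u)=C(v)} \left( \frac{f(u,v)}{f(V)} - \frac{\mathrm{Vol}(u,v)}{\mathrm{Vol}(V)} \right) = Q(C)$$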
The condition for successful merges, ρ(C_i, C_j) ≥ ρ(V), also helps to explain the observed resolution limit of the modularity [25]: increasing the size of a graph by inserting vertices and edges most likely lowers the global density. Optimizing the modularity or the shifted interior will then make it necessary to merge several previous clusters into one as long as their inter-density is still above the new global density. Recently Arenas et al. proposed a parametric correction for this problem by adding weighted self-edges [5]. Increasing the self-edge weights mostly increases the vertex degrees without changing the graph topology. This results in additional terms in the volume and thus indirectly lowers the global density ρ(V).

Altogether the preceding analysis shows that shifted cut and interior meet the desired properties for clustering qualities: stronger than expected internal connectivity is preferred by both measures, they are unbiased on uniform density graphs, and they do not tend to trivial solutions.

2.3.4 Volume Models

Community structure results from the uneven distribution of edges in the input graph. The purpose of the volume model is to describe an even distribution as reference. The simplest method is to assign each possible edge unit volume by Vol(u,v) := 1. But then vertices with a higher degree dominate the found cluster structure, as they have many more edges than the volume model expects. However, it is quite common that, for example, important vertices have a higher number of incident edges than less important vertices. Thus for many applications it is also necessary to reproduce chosen structural properties.

Figure 2.1: Example Volume Models. Fig. (a) shows two vertices containing three and two atomic vertices, respectively (w(u) = 3, w(v) = 2, Vol(u,v) = 6). In (b) the vertices contain three and two atomic sources, respectively, but just one atomic target (w+(u) = 3, w+(v) = 2, Vol(u,v) = 5).

In this section the basic volume model used in this work, called degree multiplicity, is derived from simple principles. The derivation is based on atomic vertices and coarsening invariance. Both are introduced first, and then the degree model is presented as a special case. Finally, the discussion of related models highlights important aspects for flexible implementations.

A volume model for the graph (V, f) is a function Vol : V × V → Q. As the graph is undirected, the volume should be symmetric, Vol(u,v) = Vol(v,u). No negative amounts of vertices can exist, thus the volume is positive, Vol(u,v) ≥ 0, for all vertex pairs. A zero volume would produce infinite density; therefore the volume has to be nonzero, Vol(u,v) > 0, wherever the edge weight f(u,v) is nonzero. With the normalized quality measures from above, the volume is scale invariant because constant positive factors are canceled out by the normalization and the global density term.

Given vertex weights w(v) ∈ N, each vertex can be interpreted as a collection of w(v) atomic vertices. This is like vertices representing whole subsystems, as shown in Figure 2.1a. Effectively a vertex with weight n is accounted with n times more edges than it would have with unit weight. Thus the vertex weight describes the importance of the vertex. Counting all possible unit edges between these atomic vertices leads to the family of multiplicity volume models with Vol(u,v) := w(u)w(v) for all vertex pairs u ≠ v. Thanks to the scale invariance, the relaxation to positive rational "multiplicities" w(v) ∈ Q+ is possible.

The internal volume of a vertex is Vol(v,v). Several reasonable alternatives exist; for example, in case no self-edges are expected anywhere it can be zero. But this work also handles self-edges, and Vol(v,v) := w(v)²/2 is used. This also greatly simplifies the computations, as will be visible below.

Total volumes in the multiplicity model can be computed efficiently because the sum over weight products can be replaced by a product of two sums. The inter-cluster volume is computed by summing the vertex weights separately and just multiplying the two sums. A similar approach is used for the intra-cluster volume: here each vertex pair is counted twice, and all vertices get a fixed self-edge volume which needs to be compensated. However, inserting the self-edge volume Vol(v,v) := w(v)²/2 removes this compensation and yields:

$$\mathrm{Vol}(A, B) = \sum_{u \in A,\, v \in B} w(u)\, w(v) = w(A)\, w(B) \qquad (2.27)$$
$$\mathrm{Vol}(A) = \frac{1}{2} \left( \sum_{v \in A} w(v) \right)^{\!2} - \frac{1}{2} \sum_{v \in A} w(v)^2 + \sum_{v \in A} \mathrm{Vol}(v,v) = \frac{w(A)^2}{2} \qquad (2.28)$$

This will prove to be an important property for the optimization algorithms of this work, which are based on graph coarsening. A volume model is coarsening invariant if vertices can be merged without changing the volume: let v′ be the vertex produced by merging v1 and v2; then Vol(v′, x) = Vol(v1, x) + Vol(v2, x) shall hold for all other vertices x, and Vol(v′, v′) = Vol(v1, v1) + Vol(v1, v2) + Vol(v2, v2). For example, the multiplicity model is coarsening invariant by using w(v′) = w(v1) + w(v2) for merging the vertex weights.
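Equations (2.27)/(2.28) and the coarsening invariance make the multiplicity model cheap to implement: all volumes reduce to sums, products, and squares of vertex weights. The following C++ sketch illustrates this; the names are chosen for the example only.

    // Sketch of the multiplicity volume model, equations (2.27)/(2.28).
    struct MultiplicityVolume {
        // Vol(A, B) = w(A) w(B) for disjoint vertex sets A, B
        static double between(double weightA, double weightB) {
            return weightA * weightB;
        }
        // Vol(A) = w(A)^2 / 2, using the self-edge volume w(v)^2 / 2
        static double within(double weightA) {
            return weightA * weightA / 2.0;
        }
        // Coarsening invariance: a merged vertex carries the summed
        // weight, so no volume needs recomputing after a merge.
        static double merge(double weight1, double weight2) {
            return weight1 + weight2;
        }
    };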
The degree multiplicity uses the above model to reproduce the degree centrality by using the weighted vertex degree deg(v) as the vertex weight. In this special case the global volume is Vol(V) = w(V)²/2 = 2f(V)², and the global density is ρ(V) = (2f(V))⁻¹. Therefore the null model derived from this volume is f0(u,v) = deg(u) deg(v)/(2f(V)). This is exactly the null model used for the modularity, derived from expected values of random graphs with connection probabilities proportional to the vertex degrees [57].

Other volume models based on the multiplicity are possible. In this work just the degree multiplicity is used, simply because it allows the comparison with other algorithms and implementations. Nevertheless, the study of alternative models highlights demands on the implementation. The following list points out some possibilities:

• Other centrality indices (importance measures) can be used as vertex multiplicity. For example, in the analysis of software systems, the size of the represented components in number of source lines (SLOC) is often preferred, as it produces subjectively better results.¹
• When using another self-edge volume it is necessary to store it for each vertex in order to correctly implement the coarsening invariance. Then the vertex weights and the self-edge volumes are merged together with the vertices.
• In some graphs vertex classes exist with no edges between vertices of the same class. The volume between vertices of the same class should be zero because no edges can be expected there. This can be achieved by storing separate vertex weights w1, ..., wn for each class and replacing the product w(u)w(v) by a sum of products for each pair of classes with expected edges. When merging vertices, their weight vectors are simply summed. For example, bipartite graphs have two classes and w(u)w(v) = w1(u)w2(v) + w2(u)w1(v).
• For directed graphs the atomic vertex model could be extended by splitting the atomic vertices into two classes, as shown in Fig. 2.1b: atomic sources and targets for directed edges, with multiplicities based on the original in- and out-degrees. This leads to volume models similar to the one proposed for bipartite graphs. This way the graph and volume used for the graph clustering can still be undirected and symmetric, and existing algorithms can be reused. For example, for software systems the source multiplicity of a component could be the number of contained function calls and the target multiplicity the number of provided functions.

¹ from private communication with Ralf Kopsch at the Software Systems Engineering Research Group in Cottbus
Reviewing the above possibilities indicates that the multiplicity model has a mathematical structure similar to inner product spaces. Let K be the set of possible vertex weights w(v) ∈ K. Then calculating the volume w(u) · w(v) ∈ Q is an operation similar to the inner product. The symmetry of the volume corresponds to the commutativity of the implemented operator. Merging vertices uses the addition w(u) + w(v) ∈ K. The coarsening invariance is satisfied by the distributivity of addition and multiplication. However, compared to standard vector spaces, the inner products used for volume models have much more complex semantics.

A completely different class of volume models is based on geographic information about the vertices. In such cases vertices lying far away from each other are expected to have weaker connectivity. For example, the number of direct flights between airports is mainly inversely proportional to their distance. Now consider a graph of airports connected by edges when they have direct flight connections, as shown in Figure 2.2. The edges are weighted by the frequency of direct flights. Graph clustering methods would then indirectly just discover the distance as the building component of clusters. However, some airports might have exceptionally strong connections, for example because they have economic ties or belong to the same country. In order to study such structures it would be necessary to remove the influence of the geographical distance. This might be achieved by integrating this distance into the volume model.

Figure 2.2: Graph of US Airports connected by direct flights. Airports lying far away were removed from the plot. The graph is called USAir97 in the graph collection.

More abstractly, the volume is derived from a suitably chosen vertex (dis)similarity measure. Then weak connections are expected between dissimilar, high-distance vertices; conversely, similar, low-distance vertices are expected to be strongly connected. In this context density-based graph clustering can be interpreted as a search for vertex groups that are actually stronger connected than their similarity suggests.

2.4 Fundamental Clustering Strategies

This section gives an overview of clustering methods. Many methods combine several basic approaches, and these components provide useful ideas for the design of new algorithms. From a short review of the literature, commonly used components were identified. They can be broadly classified into constructing components, directing components, and meta-strategies. Constructing components generate a clustering; they comprise, for example, dissection, subdivision, agglomeration, and refinement algorithms. Directing components control how the clustering is generated; this includes selection criteria and randomization methods. Finally, meta-strategies combine constructing components; examples are recursive subdivision, multi-level refinement, basin hopping, and genetic algorithms. In the following, the literature is shortly summarized. The next three subsections describe basic directing and constructing components, and meta-strategies.
Table 2.1 summarizes the graph clustering algorithms for the modularity measure. Most information was derived from the reviews of Danon et al. [18] and Newman [56]. The entries are grouped by their basic strategy, and the second column summarizes some details. A few publications appear several times as they incorporate different strategies.

Agglomeration strategies:
  [15, 60]  greedy cluster merging by maximal modularity increase
  [17]      greedy merging by a modified selector accounting for the cluster size
  [76]      greedy merging by modularity increase and cluster size (consolidation ratio)
  [65]      greedy merging by modularity increase; pre-coarsening of the graph using random walks
  [4]       pre-coarsening by directly merging single-edge vertices with their neighbor
  [57]      vector clustering using vertex vectors derived from eigenvectors of the modularity matrix
  [80]      k-means clustering or recursive bisection on the random walk matrix
  [64]      greedy merging by similarity of visit probability distributions from short random walks
  [21]      greedy merging by distance of vertex vectors from the Laplace matrix
  [20]      improves [21] by using a modified matrix similar to the random walk matrix

Subdivision strategies:
  [58]      recursive bisection using the first eigenvector of the modularity matrix
  [80]      recursive bisection based on a few eigenvectors of the random walk matrix
  [81]      bisection at the voltage gap, similar to random walk betweenness; replaces matrix inversion by a truncated power series
  [69]      recursive spectral k-way subdivision with the normalized Laplacian matrix; mixed with refinement using vertex moves and cluster merges

Dissection strategies:
  [59]      removes edges by highest random walk betweenness
  [26]      removes edges by highest information centrality
  [66]      removes edges by lowest edge-clustering coefficient

Refinement methods:
  [57]      Kernighan-Lin refinement of clusterings from spectral methods; recommends greedy refinement
  [23]      extremal optimization: moves vertices with bad contribution to the modularity; combined with recursive bisection
  [67, 68]  simulated annealing with fuzzy clustering (co-appearance in the same cluster), null model derived from a q-Potts spin model
  [39, 38]  simulated annealing, spin glass system
  [52]      simulated annealing combined with greedy refinement (quenches); basin hopping (move several vertices, apply greedy refinement, then accept the new clustering as with simulated annealing)
  [19]      reduction to minimum cut partitioning with negative edge weights; adapted multi-level refinement from the METIS library
  [5]       tabu search: move vertices and prohibit reversing any of the last few moves

Linear programming:
  [13]      modularity optimization formulated as an integer linear program
  [2]       formulated as a fractional linear program; produces vertex distances in polynomial time, then rounds the distances to {0, 1}

Table 2.1: Overview of Modularity Graph Clustering Algorithms

Other algorithms exist that were not designed to optimize the modularity measure. These include algorithms for the minimum cut partitioning problem, which is an important topic in the load balancing of, for example, computational fluid dynamics simulations. In this context a lot of research on graph clustering algorithms has already been done, so this literature might contain useful ideas. A historic review of graph partitioning methods can be found in [14]. Table 2.2 summarizes some related methods.

  [7]       grows a cluster as a shortest-path neighborhood around a start vertex until only a few edges go outside this cluster
  [83]      greedy merging of clusters using the mean first-passage time of random walks; the transition probability is biased by the number of common neighbors
  [40]      iteratively modifies the edge weights based on short random walks; the weight of "inter-cluster" edges declines while edge weights in clusters rise; developed for clustering spatial data sets
  [32]      removes edges by highest shortest-path betweenness
  [75]      clusters from flows in the graph; a generalized form of random walks and other Markov processes
  [6]       approximation algorithm for a quality measure similar to the inter-cluster density; applies spectral methods and linear programming
  [45, 24]  moves vertices to neighbor clusters, with escaping from local optima
  [43, 1]   greedy merging by inter-cluster edge weight; already minimizes the edge cut during graph coarsening
  [43, 1, 78, 41]  multi-level refinement based on Kernighan-Lin
  [40]      uses snapshots of the agglomeration, called multi-scale; applies greedy smoothing as refinement
  [74, 71, 73, 72]  evolutionary (genetic) search combined with multi-level min-cut partitioning

Table 2.2: Overview of Clustering and Partitioning Algorithms
2.4.1 Directing Components

Modularity clustering is a computationally difficult problem, and a polynomial-time algorithm will unlikely be able to find the optimum clustering in all cases. Actually most algorithms are heuristics which try to construct a relatively good clustering from many small decisions. Unbiased heuristics use completely random decisions. In contrast, biased heuristics use a selection criterion to direct the algorithm: in each step this selector chooses one of several available alternatives using information inferred from the graph. The next paragraphs present commonly used deterministic selection criteria, then show how linear programming can be used to construct selectors, and finally discuss randomized selectors.

Selection Criteria

Selection criteria can be derived from the global optimization aim. For clustering by modularity, simple selectors directly follow from the mathematical formulation. A more global view might be attained using spectral methods [3, 21, 6, 80, 57]. Other heuristics use centrality or betweenness indices to select single edges or vertices. These are, for example, derived from shortest paths [32], the information centrality [26], clustering coefficients [66], and random walks [40, 59, 64, 81, 83].

Unfortunately, sometimes just the selector alone is defined as the optimization aim while the global reference measure is ignored. This leaves unclear what the final clustering really describes and makes it difficult to compare different selectors for a task. Another important question is to which global aim the selector really leads: it is possible that the local decisions optimize a property other than the desired one. Here a small analogy to statistical mechanics is visible: for example, in Lattice-Boltzmann fluid simulations, statistical methods are necessary to prove that the simple discrete model with microscopic (local) collision rules results in the desired macroscopic (global) flows.

Spectral methods replace the graph by a matrix and clusters by vectors. Then the clustering quality measure is reformulated with matrix-vector products. This allows applying matrix diagonalization and mapping the problem into the eigenvector space. Similar to principal component analysis, a few strongest eigenvalues and their eigenvectors are used to derive vertex vectors. Selection criteria are then computed from their length, distance, or direction.
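As a concrete instance, for a simple, unweighted graph with adjacency matrix A and m edges, Newman's modularity matrix [58] gives such a reformulation, consistent with equation (2.4); the x_i ∈ {0,1}^|V| are cluster indicator vectors:

$$Q(C) = \frac{1}{2m} \sum_{i \in C} x_i^{\mathsf{T}} B\, x_i, \qquad B_{uv} = A_{uv} - \frac{\deg(u)\deg(v)}{2m}$$

The strongest eigenvectors of B then provide the vertex vectors mentioned above.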
Similar to principal component analysis, a few strongest eigenvalues and their eigenvectors are used to derive vertex vectors. Selection criteria are then computed from their length, distance, or direction.

The vertex vectors actually position the vertices in a vector space. Other methods to compute such positions exist. For example, in graph drawing spring embedder algorithms minimizing an energy model are often used. With an appropriate model, vertex distances are related to the cluster structure. This transforms graph clustering into a geometric clustering problem. For the modularity measure such layouts are computable from the LinLog energy model [62]. In energy minima of this model the distance between two clusters is roughly inversely proportional to their inter-cluster density. However, despite the provided global information, this method is practically unusable as computing layouts is very expensive.

Linear Programming
Linear programming methods reformulate the clustering quality measure as a linear target function. As input to this function a vector of pairwise vertex distances is given. Vertices of the same cluster have for example zero distance, and vertices in different clusters have distance one. The transitivity, reflexivity, and symmetry of clusterings are enforced through linear constraints similar to triangle inequalities. Additional constraints limit the vertex distances to the range from zero to one. This is known as fractional linear programming. Using interior-point methods, exact fractional solutions can be found in polynomial time. Agarwal and Kempe [2] proposed an agglomeration algorithm to round the distances to zero and one. Thus, basically, the distances are used as a merge selector. As a side product, the fractional solution provides an upper bound for the highest possible modularity of a graph: because the set of possible clusterings is a subset of all allowed distance vectors, no clustering can have a better modularity than the fractional solution.

Brandes et al. [13] used a similar method to derive a binary linear program. There the vertex distances are restricted to zero and one. Finding an optimum solution still is NP-complete and requires exponential time. But quite good algorithms exist implementing this search given the target function and constraints. This allowed finding the globally optimal clusterings of graphs with up to 100 vertices.

Randomization
All optimization heuristics may become trapped in suboptimal clusterings. Randomizing the decisions a little bit is one method to avoid or escape such situations. For example, this can be achieved by using random selection criteria or adding a little bit of randomness to the selector. One often used method is simulated annealing [67, 68, 39, 38]. Typically, a clustering similar to the current clustering is constructed, e.g. by randomly moving a vertex to another cluster. The main idea is to direct the search by not accepting each generated clustering. Better clusterings are always accepted, but worse clusterings are accepted only with a probability that depends on the modularity change $\Delta Q$. Often the Metropolis criterion $\exp(\Delta Q / T)$ is applied. The parameter $T$ controls the temperature of the algorithm. In the beginning the temperature is high and many modularity-decreasing moves are accepted. This enables a widespread search. Later the temperature is lowered, increasing the portion of modularity-increasing moves. With zero temperature only better clusterings are accepted, and thus the algorithm finishes in a local optimum.
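As an illustration, the following C++ sketch shows one way such a Metropolis acceptance step could look. It is a minimal sketch, not the implementation of any of the cited algorithms, and all names are illustrative.

#include <cmath>
#include <random>

// Sketch of the Metropolis acceptance rule described above. deltaQ is the
// modularity change Q(new) - Q(current); improvements are always accepted,
// deteriorations only with probability exp(deltaQ / T).
bool metropolisAccept(double deltaQ, double temperature, std::mt19937& rng) {
    if (deltaQ >= 0.0)
        return true;                       // better (or equal) clusterings: always accept
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    return uniform(rng) < std::exp(deltaQ / temperature);
}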
Duch and Arenas [23] proposed a recursive bisection algorithm. The two clusters are computed using extremal optimization applied to an initial random bisection. Extremal optimization can be interpreted as a biased simulated annealing. The moved vertices are not randomly chosen but selected based on their current contribution to the modularity. This selection criterion is called vertex fitness. The idea is that moving low-fitness vertices also improves the clustering quality.

2.4.2 Constructing Components

This subsection presents various heuristics to construct clusterings. They use the selection criteria presented above to direct the construction. Dissection is one of the oldest strategies, often used in the analysis of social networks; it observes how graphs and clusters fall apart. More details are presented in the next paragraph. The second paragraph describes agglomeration methods, which observe how large clusters can be put together from smaller clusters. The last paragraph presents refinement methods that try to improve given clusterings by moving vertices between clusters.

Dissection
Dissection algorithms repeatedly remove vertices or edges from the graph and observe the developing connectivity components. Because clusterings of the vertex set are searched, it is more common to remove edges. Removing an edge can split a cluster into at most two parts. Thus the produced hierarchical clustering is a binary tree. Girvan [32] proposed a dissection algorithm which removes edges that lie between clusters and are least central to any cluster. These are identified by counting how many shortest paths from arbitrary vertex pairs pass through each edge. Fewer edges are expected between clusters, and thus more paths should pass through them. After each removal all betweenness values are recalculated. Later, Girvan and Newman [59] proposed the random walk betweenness as an improved selector. In the same paper the modularity was introduced as a measure for the optimal number of clusters.

Agglomeration
Agglomeration methods grow clusters by merging smaller clusters. The process begins with each vertex placed in a separate cluster. In each step a pair of clusters is selected and merged. Greedy methods only merge pairs which increase the global quality, until no such pair is left. The merges are directed by a selector that tries to predict the optimal cluster structure. As the method alone is unable to correct wrong merges later, good predictors are necessary. These should lead to a faithful cluster structure with a high quality. As a side product, the merge history provides a hierarchical clustering. One of the most used agglomeration methods is due to Ward [79], who developed a hierarchical clustering method to assign US Air Force personnel to tasks while optimizing multiple objectives. By design, agglomeration is very similar to the graph coarsening used in multi-level graph partitioning algorithms [43, 78, 41].

Mark Newman was the first to use agglomeration to directly optimize the modularity measure [60]. He called it greedy joining and selected clusters by the maximal increase of modularity. Later the algorithm was accelerated by using sorted trees for faster lookups [15]. Recently, Wakita and Tsurumi [76] proposed a different implementation. Like in [17], they observed an unbalanced growth behavior with the original selector.
They introduced various consolidation ratios as the relative size of cluster pairs. Then the quotient of modularity increase and consolidation ratio was applied as selection criterion.

Cluster Refinement
Refinement methods improve an initial clustering by applying local modifications. They are an ideal way to correct small errors in the clustering. Generally, refinement methods construct several clusterings similar to the current one. The most common type of such a neighborhood is formed by moving single vertices into other clusters. As the next clustering the best from this neighborhood is selected. Greedy refinement accepts only new clusterings with a better quality than the current one. Thus the refinement ends in a local optimum where all clusterings of the neighborhood have worse quality. Kernighan and Lin [45] proposed a very popular method that tries to escape such local optima by accepting small sequences of slightly worse clusterings. Fiduccia and Mattheyses [24] simplified this strategy by just considering single vertex moves instead of constructing and moving whole groups. Most of these methods were developed for minimum cut graph partitioning. In the context of modularity clustering, simple refinement methods based on greedy and Kernighan-Lin refinement were proposed recently [57, 5, 69]. Unfortunately, no detailed study exists of how these refinements perform in modularity clustering, although the objective is quite different from minimum cut partitioning.

2.4.3 Meta-Strategies

Finally, meta-strategies combine construction methods from the previous subsection. The idea is to exploit the advantages of the single components while compensating their disadvantages. The strategies discussed in the next paragraphs are recursive subdivisions, multi-level approaches, basin hopping, and genetic algorithms.

Recursive Subdivision and Hierarchical Clusterings
Some algorithms can only find a fixed number of clusters, for example [80, 81, 58]. In order to find more clusters and better clusterings, the algorithms are applied recursively on each previously computed cluster. Subdivisions into exactly two groups are called bisection, and an example recursive bisection is shown in Figure 2.3a. The recursive application results in a hierarchical clustering.

Figure 2.3: Recursive Subdivision and Hierarchical Clusterings. (a) Recursive Bisection (first and second bisection), (b) Hierarchical Clustering and Dendrogram. The red dashed line marks the cut through the dendrogram; the corresponding clustering is drawn with red circles.

Like agglomeration methods, recursive subdivision is unable to correct errors. Moreover, the subdivision methods work with a fixed number of clusters. This imposes a structure on the clusterings which may hide the correct structure [57]. For example, a bisection of a graph with three clusters might falsely split one of the clusters into two parts; further bisections will not correct this.

Other algorithms producing hierarchical clusterings also exist. For example, the merge steps in agglomeration methods and the connectivity components of dissection methods provide a cluster hierarchy. Such hierarchies are often drawn as a tree called dendrogram. The root is a cluster containing all vertices, and each leaf contains just a single vertex. To derive a normal, flat clustering, a cut through this tree is searched, as shown in Figure 2.3b. The simplest variants select the level with the best modularity [59].
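A minimal sketch of this simplest variant, assuming the modularity reached after every merge step was recorded during the agglomeration (names are illustrative):

#include <cstddef>
#include <vector>

// Pick the agglomeration step with the highest recorded modularity; replaying
// the merge history up to this step yields the flat clustering of the cut.
std::size_t bestCutLevel(const std::vector<double>& modularityAfterStep) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < modularityAfterStep.size(); ++i)
        if (modularityAfterStep[i] > modularityAfterStep[best])
            best = i;
    return best;
}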
More elaborate implementations search for better clusterings by shifting the level up and down in subtrees [63]. Of course this search is inherently limited by the underlying tree.

Multi-Level Approaches
Multi-level methods were developed to improve the search range and speed of partition refinement methods by moving small vertex groups at once. These groups and the connections between the groups are represented as a coarse graph. In this graph it is sufficient for the refinement to move just single coarse vertices. Applying this idea recursively leads to multi-level refinement methods like [41, 78]. Karypis and Kumar [43] were the first to propose using agglomeration methods to optimize the clustering quality already during the construction of the coarse graphs. However, up to now no multi-level refinement algorithm exists that directly optimizes the modularity.

Basin Hopping
The basin hopping of Massen and Doye [52] is a high-level search algorithm. The algorithm proceeds by jumping between local optima. Like in simulated annealing, a neighboring local optimum is accepted as the new clustering based on the Metropolis criterion. These neighbor clusterings are found by first performing some random vertex moves to leave the current local optimum. Then greedy refinement, called quenching, is applied to move into a (hopefully new) local optimum. In summary, this strategy combines unbiased search methods with strongly directed local optimization.

Genetic Algorithms
Another form of high-level randomization are genetic algorithms. These store a pool of candidate clusterings. New clusterings are created by combining several old clusterings (crossing) or modifying a clustering (mutation). The clusterings are rated according to a quality measure, which is called fitness in this context. Just the best, fittest clusterings survive while old ones are removed. For minimum cut load balancing, Soper et al. [74, 71, 73, 72] developed a particularly good genetic algorithm called evolutionary search. In order to push the pool to high-quality clusterings, they improve the genetically generated clusterings using greedy multi-level refinement. Crossing and mutation are implemented by temporarily modifying the edge weights of the graph based on the input clusterings. Afterwards the quality of the refined clustering is evaluated on the original graph and used as the fitness of the clustering.

3 The Multi-Level Refinement Algorithm

This chapter presents the design of a family of algorithms for finding solutions to the modularity clustering problem. The input is an undirected graph with weighted edges, and the algorithms search for a clustering (partitioning) of the vertex set with high modularity. Finding clusterings with maximum modularity is known to be NP-complete [13]. Therefore the presented algorithms are heuristics with polynomially bounded runtime and are unlikely to find global optima.

Nowadays, agglomeration heuristics are quite often employed for modularity clustering. These operate by successively merging adjacent clusters until the clustering quality cannot be improved further. Most of them share one strong disadvantage: the merge decisions are based on insufficient local information, but wrong decisions are not corrected later. On the other hand, refinement heuristics have the potential to correct these merge errors. In general, refinement heuristics construct a neighborhood of problem solutions similar to the current solution and select the one with the best quality out of these.
A situation in which all neighbor solutions have worse quality is called a local optimum. In the context of graph clustering, the problem solutions are clusterings of the graph, and local cluster refinement searches for clusterings with higher modularity by moving single vertices between clusters. Unfortunately, all refinement methods require an initial clustering, and it is difficult to find good clusterings to start with. Furthermore, refinement is not effective in huge graphs where each vertex has only a tiny effect on the modularity. Many very small movements would be necessary to walk in and out of local optima. This makes it very unlikely to find other local optima by walking through a series of low-quality clusterings.

Both refinement problems are tackled by the multi-level approach presented in the next sections. It is based on the min-cut partitioning methods of Karypis [1] and basically widens and accelerates the refinement search by moving whole groups of vertices at once. These groups are constructed using agglomeration methods, which are presented in the section about graph coarsening. They are similar to Newman's greedy joining [60, 15] but use other merge selection criteria. The fourth section explores the available refinement methods. Those are mostly derived from Kernighan, Lin, Fiduccia, and Mattheyses [45, 24] but are adapted to the dynamic number of clusters present in modularity optimization. The employed local operations are summarized in Figure 3.1: merging clusters is exclusively used during graph coarsening, and during refinement vertex movements are applied. The last section completes the chapter with additional insight into key concepts of the implementation.

Figure 3.1: Operations for local cluster modification (move to other cluster, move to new cluster, merge clusters). The operations move the cluster boundaries, displayed as dashed lines. This changes which edges are in the edge cut; these are highlighted by red lines.

Figure 3.2: The Multi-Level Scheme (input graph, coarsening, recursion or initial clustering, refinement).

3.1 The Multi-Level Scheme

As already mentioned, the refinement heuristic requires a good initial clustering. Unfortunately, also the number of clusters is not known in advance. Assigning all vertices to one big cluster cannot be good because the number of clusters is much too small. Similarly, assigning each vertex to a separate cluster starts with too many clusters, and there is no good reason why a random clustering should be better. On the other hand, agglomerative clustering methods perform relatively well at finding good clusterings.

To accelerate and widen the refinement, the algorithm initially trades search range against accuracy by moving whole groups of vertices at once (cf. Figure 3.2). Afterwards, smaller groups or single vertices are moved for local improvements. The underlying data structures are unified by contracting these vertex groups into a coarse graph. This approach is applied recursively to produce a hierarchy of bigger groups and coarser graphs. Refinement is applied at each of these coarsening levels, beginning on the coarsest. The question of good vertex groups is coupled to the initial clustering: the coarsest groups already are a clustering and can be used as starting point. Inversely, the coarsening can be interpreted as an agglomerative method which regularly stores snapshots as coarsening levels.
Data: original graph, coarsener, refiner
Result: clustering
create level[1] from original graph;
for l from 1 to max_levels do                          // coarsening phase
    clustering ← coarsener(level[l]);
    create level[l+1] from clustering;
    reduce edge and vertex weights level[l] → level[l+1];
    lmax ← l;
    if not successful then break;
clustering ← vertices of level[lmax];                  // initial clustering phase
for l from lmax down to 2 do                           // refinement phase
    project clustering onto level[l−1];
    clustering ← refiner(clustering);
project clustering onto original graph;

Figure 3.3: The Multi-Level Algorithm

This approach is also shown in pseudo-code in Figure 3.3. It is known as multi-level partitioning and is a special form of the Algebraic Multi-Grid method (AMG). To simplify the implementation, the method is reformulated in three phases. This makes it easier to centrally select the implementations used for the coarsener and refiner. In the coarsening phase, vertex groups are created by merging clusters, beginning with singleton clusters, until the number of groups is satisfyingly reduced. From these groups the coarse graph is built. This is repeated until the graph cannot be coarsened further. The initial clustering phase simply uses the vertices of the coarsest graph as initial clustering. Finally, during the refinement phase, the clustering is successively projected onto the next finer graph and refined with vertex movements until the original graph is reached.

Implementation Notes
The vertex groups built during the coarsening phase are clusterings. They are stored as a mapping of vertices to their current cluster. To efficiently update this mapping on cluster merges, the disjoint-set data structure with path compression is used, like in many union-find algorithms. Thus, for all practical graph sizes, updates and lookups take constant time on average. After each coarsening pass, the contracted graph is generated by registering a coarse vertex for each cluster. For each edge in the fine graph, an edge between the coarse vertices of its end-vertices is added if not already existing. With the necessary edge lookups this costs $O(|V| + |E|^2)$, or $O(|V| + |E| \log |E|)$ using a temporary search tree. Because in practice this proved to be a significant difference, the implementation uses a tree for each vertex to quickly search its incident edges.

Simultaneously, a mapping of fine-graph vertices to their coarse vertex and of fine edges to their coarse edge is built. Structurally this is a graph homomorphism respecting vertex adjacency. The homomorphism is stored together with the coarse graph in level[l+1] and allows transferring vertex and edge properties between both graphs. Transferring weights from the fine to the coarse graph requires adding up the weights and is therefore called reduction or restriction. Some care is necessary to transfer the weight of edges reduced to self-edges correctly, because two directed edges (one in each direction) are mapped onto one self-edge. Later, clusterings are transferred back from the coarse to the fine graph by assigning all fine vertices to the same cluster as their coarse vertex. This is called projection or prolongation. Thanks to the stored homomorphism, these transfers cost linear time in the size of the finer graph, i.e. $O(|V| + |E|)$.
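A minimal C++ sketch of the disjoint-set structure mentioned above (illustrative; the actual implementation may differ, e.g. in the exact path compression scheme):

#include <vector>

// Minimal disjoint-set with path compression for the vertex-to-cluster mapping.
class DisjointSets {
    std::vector<int> parent;
public:
    explicit DisjointSets(int n) : parent(n) {
        for (int v = 0; v < n; ++v) parent[v] = v;    // each vertex is its own cluster
    }
    int find(int v) {                                  // root = cluster representative
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];             // path compression by halving
            v = parent[v];
        }
        return v;
    }
    void merge(int a, int b) { parent[find(a)] = find(b); }  // merge two clusters
};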
3.2 Graph Coarsening

Graph coarsening is the first phase of multi-level refinement methods. Its task is to group vertices together and build condensed, smaller graphs from these groups. The overall aim is to find a good initial clustering, which is indirectly given through the coarsest graph, and to build vertex groups that assist the refinement heuristics later. This section presents the implemented coarsening algorithms.

The coarsening is accomplished in two steps: first, an agglomeration heuristic produces a clustering of the vertex set. Then the next coarsening level is built from it by contracting all vertices of each cluster into a coarse-graph vertex. The agglomeration heuristic successively merges pairs of clusters starting from one-vertex clusters. Each merge decreases the number of future coarse-graph vertices by one. The maximal number of merges is defined by the reduction factor multiplied with the number of initial vertices. Thus, ideally, each coarsening pass reduces the number of vertices by a constant factor, leading to a logarithmic number of levels. To find good coarse graphs, the process is controlled by a selection strategy: only pairs of adjacent clusters increasing the modularity are considered for merges, and these pairs are further ranked by a selection quality. Various selection qualities are discussed in the subsequent section about merge selectors.

The next subsections discuss the two implemented agglomeration methods, called greedy grouping and greedy matching. The first is based on Newman's greedy joining [60]; using matchings is a more traditional approach from multi-level graph partitioning [43]. The following list summarizes all parameters controlling the coarsening. Their use is explained together with both algorithms in the next two sections.

coarsening method: The agglomeration method, either "grouping" or "matching".
reduction factor: Multiplied with the number of vertices; gives the number of merge steps to perform on each coarsening level. Indirectly defines the number of coarsening levels.
match fraction: Multiplied with the number of good vertex pairs; excludes the worst ranked pairs from the matching.
selector: The merge selection quality to use.

Data: graph, selector, reduction factor
Result: clustering
merge_count ← reduction_factor * num_vertices(graph) + 1;
while merge_count > 0 do
    // (a, b) ← selector.best_edge() in two steps:
    a ← selector.best_vertex();
    b ← selector.best_neighbor(a);
    if found (a, b) then
        clustering.merge(a, b), update graph and selector;
        merge_count ← merge_count − 1;
    else break;

Figure 3.4: Coarsening Method: Greedy Grouping

3.2.1 Greedy Grouping

In each step, greedy grouping searches the globally best ranked pair of clusters and merges both without compromises. Figure 3.4 summarizes the algorithm in pseudo-code. To quickly find the best ranked pair, the clustering is also represented by a temporary coarse graph. There, each cluster is a vertex, and the best neighbor of each vertex is stored by the selector. Thus just the vertex with the best ranked neighbor is searched, and then its internally stored best neighbor is retrieved. The coarsening ends if either no good pair is found or enough clusters were merged.

During each merge, the best-neighbor information has to be updated. This requires visiting the neighbors of the two merged clusters to update the selection qualities between them. Thus the temporary coarse graph is used to cache merged vertex and edge weights and the merged adjacency lists. Incrementally updating this graph leads to complex data structures and interactions, but still is much faster than collecting the information in the original fine graph each time.
Implementation Notes
Computing the selection quality and updating the modularity requires access to edge and vertex weights. With each merge these weights have to be updated. This is achieved by informing the container data structures during the merge operation about elementary operations like merging and moving edges. Additionally, the merge selector is informed in order to update the best merge neighbor as described in [76]: for finding the best merge neighbor of a vertex, the selection quality to all adjacent vertices has to be evaluated. But a new best neighbor has to be searched only when the selection quality to the old best neighbor decreases. In all other cases it is sufficient to compare the changed selection quality of a vertex pair to the current best neighbor.

With greedy grouping, the next best merge pair has to be found repeatedly. This is efficiently done with a binary heap stored in a supplementary array. The heap is updated when the best merge neighbor of a vertex changes. In this case the merge selector informs the heap about the new neighbor and selection quality, and the vertex is moved up and down through the heap into its correct position. Empirical tests suggested that the runtime is already acceptable without the heap. Thus it is disabled by default to reduce possible sources of error and regression.

Figure 3.5: Merging two Vertices and their Edges (edges to a common neighbor are merged, the edge ab becomes a new self-edge, and the remaining edges of the removed vertex are moved).

A temporary coarse graph is maintained to store edge weights and adjacency information for the merge selector. In this graph each cluster is represented by a vertex. Merging two clusters equals merging their vertices and incident edges. This is known as edge contraction and is the most expensive operation of each merge step, because the incident edges of both vertices have to be processed. Let the two involved vertices be called the removed vertex and the target vertex; for example, in Figure 3.5 the vertex b is to be removed and vertex a is the target. Edges to a common neighbor are merged by simply dropping the superfluous edge of the removed vertex. Edges to a vertex not adjacent to the target vertex have to be moved to it.

The merge data structures proposed by Wakita and Tsurumi [76] were implemented. The outgoing edges of each vertex are stored in a double-linked list sorted by the end-vertex index. Both lists can then be merged in one pass without searching the matching out-edges of the target vertex. Some special cases arise, however: updating self-edges is difficult because their position is not predictable, thus they are always stored as the first out-edge. Unfortunately, the original paper leaves open how to correct the position of moved edges in the lists of their end-vertices. The lists should be sorted by the end-vertex index, but moving an edge changes the index from the removed to the target vertex. The simplest fix is to reinsert such edges into the list, which costs linear time in the size of the list. Therefore the worst-case runtime $O(|V|^2)$ of this coarsening method is reached in case all $|V|$ edges of the removed vertex have to be moved to the target vertex without merging edges.
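The following C++ sketch illustrates the described one-pass merge of two sorted out-edge lists. It is a simplified reading of the scheme of Wakita and Tsurumi, with self-edges assumed to be handled separately; all types and names are illustrative.

#include <iterator>
#include <list>

struct OutEdge { int endVertex; double weight; };  // lists sorted by endVertex

// Merge the out-edges of the removed vertex into the target vertex in one pass.
void mergeEdgeLists(std::list<OutEdge>& target, std::list<OutEdge>& removed) {
    auto t = target.begin();
    auto r = removed.begin();
    while (r != removed.end()) {
        while (t != target.end() && t->endVertex < r->endVertex)
            ++t;                                    // advance to the insertion point
        if (t != target.end() && t->endVertex == r->endVertex) {
            t->weight += r->weight;                 // common neighbor: merge both edges
            r = removed.erase(r);
        } else {
            auto next = std::next(r);
            target.splice(t, removed, r);           // move the edge in front of t
            r = next;
        }
    }
}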
The overall worst-case time complexity for a complete merge step is $O(|V|^2)$: selecting the best pair costs $O(|V|)$ with a linear search, or $O(\log |V|)$ using the heap. The two clusters are merged in constant time using union-find. Combining the edge lists requires quadratic time $O(|V|^2)$ in the worst case. Updating the best-neighbor information also requires a linear search at each adjacent vertex in the worst case; this adds quadratic time $O(|V|^2)$. In addition, updating the best-pair heap may cost up to $O(|V| \log |V|)$. However, in sparse graphs the number of incident edges often is much smaller than the total number of vertices, and in dense graphs nearly no edges require movement because they point to a common neighbor. Hence the average merge time should be nearly linear instead of quadratic.

Related Work
Clauset et al. use search trees to store the incident edges [15]. Each edge of the removed vertex is removed from the tree of its end-vertex. Its entry is searched in the tree of the target vertex, and in case target and end-vertex are not already adjacent, the edge is inserted into both trees. Each vertex can have at most $|V|$ adjacent vertices, thus merging two vertices with this method would cost $O(|V| \log |V|)$ time in the worst case.

Greedy grouping is structurally very similar to the minimum spanning tree problem with the selection quality as edge weights. But most selection qualities change their values dynamically depending on the merge steps. Still, some alternative techniques that do not directly contract edges but defer some operations to later steps might be applicable, cf. [27].

3.2.2 Greedy Matching

Data: graph, selector, match fraction, reduction factor
Result: clustering
foreach pair of adjacent vertices (a, b) do            // collect possible merges
    if merging a, b would improve quality then
        sq ← selector_quality(a, b);
        add (a, b, sq) to pair_list;
        num_pairs ← num_pairs + 1;
sort pair_list by decreasing sq;
good ← match_fraction * num_pairs + 1;
merge_count ← reduction_factor * num_vertices(graph) + 1;
foreach pair (a, b) in pair_list do
    if a and b not marked then
        mark a and b;
        merge clusters of a and b;
        merge_count ← merge_count − 1;
        good ← good − 1;
    if merge_count = 0 or good = 0 then break;

Figure 3.6: Coarsening Method: Greedy Matching

The greedy matching method operates by first constructing a matching, i.e. a list of independent vertex pairs, from which the next coarsening level can be directly generated. The pseudo-code is shown in Figure 3.6. In the first phase, all pairs of adjacent vertices are collected and sorted by their selection quality. Pairs not improving the modularity are filtered out in advance. This list is processed beginning with the best ranked pairs. Merged vertices are marked, and pairs with at least one already marked vertex are ignored. When a pair of unmatched vertices is found, they are merged.

Unfortunately, the matching also selects many badly ranked pairs, even when their vertices have much better but already matched neighbors. Especially at the end of a long coarsening pass, bad pairs would be grouped together just because all good pairs are already used up. The purpose of the "good" counter is to avoid this by considering only a fraction of the best ranked pairs.

The matching method also has a second downside: it is possible to miss the reduction aim when most vertices have only a few possible merge partners which are already matched. For example, if a vertex has many neighbor vertices which have no further neighbors besides this central vertex, many non-independent merges are necessary. In that case many coarsening levels will be produced. This situation also occurs in practice, for example with graphs that feature a power-law vertex degree distribution [1].
Constructing the matching adds some complexity to the algorithm compared to the simple-looking greedy grouping method. But thanks to the independence of the merged pairs, no updated selection qualities are required, and the dynamic coarse graph can be omitted in favor of static data structures. However, for simplicity this implementation reuses most data structures from the grouping method.

Producing all coarsening levels has time complexity $O(\log |V| \cdot |E| \log |E|)$ on well-structured graphs: there are around $\log |V|$ coarsening levels, and in each level the edges have to be sorted by the selection quality, which costs $O(|E| \log |E|)$. The actual merge operations are done in constant time because no complicated updates are necessary.

Related Work
For a theoretical analysis of approximation guarantees, other matchings than the implemented greedy strategy might be more useful. For example, Monien et al. [54] developed more elaborate greedy matching strategies for min-cut partitioning and were able to derive approximation factors. Also, optimal matchings are computable in polynomial time using Edmonds' algorithm [55].

3.3 Merge Selectors

The merge selector is used by the coarsening algorithms to select cluster pairs for merging. The clusters and their relations are encoded in a coarsened graph in which each cluster is represented by a vertex. Each edge in this graph (except self-edges), together with its two end-vertices, embodies such a candidate merge pair. The merge selector ranks these pairs by computing a selection quality, and pairs with high selection quality are selected first.

The selector's task is to lead to a good hierarchy of coarse graphs supporting the subsequent cluster refinement. In a good hierarchy the coarsest graph provides an initial clustering of high modularity. Secondly, all coarsening levels shall provide evenly grown vertex groups with relatively similar influence on the modularity of the clustering. Otherwise the refinement will only have move options changing the modularity either too strongly or too weakly, which would obstruct the search.

Extent, Name: Description
local, modularity increase: highest increase of modularity
local, weight density: highest density between both clusters
semi-global, random walk distance (t): best neighborhood similarity with t random walk steps
semi-global, random walk reachability (t, i): highest weight density with reachability of neighbor vertices as weight; with t steps and i iterations
global, spectral length: approx. best modularity contribution in the spectral decomposition
global, spectral length difference: approx. highest modularity increase in the spectral decomposition
global, spectral angle: most similar direction of the spectral vertex vectors

Table 3.1: Proposed Merge Selection Qualities

Both aims are pursued by trying to predict which vertex pairs lie in the same cluster of the unknown optimum clustering. In this setting the selection qualities are the predictors. Their prediction quality (success probability) can be roughly measured by comparing the initial ranking of pairs against the best known clustering. The second aim may be fulfilled by proposing pairs well spread over the graph and not concentrated in a single region. But other factors may be more important, like for example the question of when merge errors (merging vertices of different optimum clusters) occur during the coarsening. Ultimately, the only reliable measures of success are the modularity clustering qualities reached after coarsening and after refinement.
The presented selection qualities are classifiable by the extent of incorporated information into local, semi-global, and global measures. Local measures use only information about the two adjacent vertices and their connecting edge. Semi-global measures try to incorporate more information from the neighborhood of both vertices, and global measures consider the topology of the whole graph at once. Non-local selectors can differ in how they are updated during merge steps: dynamically updated selectors recompute the selection qualities after each merge, whereas static selectors just use simple approximations on the intermediate coarse graphs.

All presented selection qualities are static. The key properties are computed once at the beginning of each coarsening pass; during single merge steps, values are merged and updated just locally. Table 3.1 summarizes the proposed selection qualities. They are discussed in more detail in the following sections, beginning with local selectors, followed by random walks and spectral methods. Each of these sections is subdivided into a short introduction of the key ideas and a presentation of the theoretical background, and concludes with the discussion of the derived selection qualities.

3.3.1 Local Merge Selectors

Studying the change of modularity during merges in the context of greedy agglomeration leads to a constraint depending on local properties. Two selection qualities, modularity increase and weight density, are derived from this constraint. The following sections present the theoretical background and discuss both selection qualities.

Theoretical Background
As mentioned earlier, clusters are represented by temporary coarse vertices. As only adjacent clusters are considered for merging, both vertices of each merge pair are connected by an edge. Therefore the available local information about each pair is the edge weight $f(a,b)$, the volume $\mathrm{Vol}(a,b)$, and the expected edge weight $\rho(V)\,\mathrm{Vol}(a,b)$. In addition, properties of both clusters like the interior edge weight $f(a,a)$ and the interior volume $\mathrm{Vol}(a,a)$ are locally available.

Merging two clusters $a, b$ changes the modularity $Q(C)$ to $Q(C')$, and greedy agglomeration requires an increasing quality, i.e. $Q(C') > Q(C)$. Inserting the modularity definition (unnormalized, which is simpler to read and ranking-equivalent) $Q(C) = \mathrm{Int}_f(C) - \rho(V)\,\mathrm{Int}_{\mathrm{Vol}}(C)$ gives the constraint $f(a,b) - \rho(V)\,\mathrm{Vol}(a,b) > 0$, because the merge adds $f(a,b)$ to the interior edge weights. As visible, this constraint depends just on local properties and may be used to derive selection qualities.

Modularity Increase
In order to reach a good final modularity, it sounds reasonable to merge the pair of highest quality increase $Q(C') - Q(C)$ in each step. The calculation is simplified by removing the clusters where nothing changes. This leaves the contributions of both clusters before and after the merge, which differ just by the removed edge weight and volume between both. Thus the modularity increase is:

$Q(C') - Q(C) = f(a,b) - \rho(V)\,\mathrm{Vol}(a,b)$    (3.1)

This selection quality is also a form of the greedy constraint from above and was already proposed by Newman [60] for his agglomerative clustering algorithm. Afterwards, several authors (including himself) observed problems with the growth behavior triggered by this selector and proposed various fixes [17, 76].
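As a small illustration, the modularity increase of Equation (3.1) reduces to a one-line computation. The sketch below assumes the multiplicative volume model $\mathrm{Vol}(a,b) = \mathrm{Vol}(a)\,\mathrm{Vol}(b)$ and uses illustrative names.

// Unnormalized, ranking-equivalent modularity increase of merging clusters a
// and b (Equation 3.1): edge weight between both minus expected edge weight.
double modularityIncrease(double f_ab,                 // edge weight f(a,b)
                          double vol_a, double vol_b,  // cluster volumes
                          double rho) {                // global density rho(V)
    return f_ab - rho * vol_a * vol_b;                 // > 0 is the greedy merge constraint
}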
The dynamic behavior can be described as follows: the cluster with the largest contribution to modularity is merged with its best neighbor. This further improves the strength of this cluster and the improvements possible by merges with small neighbors. The process goes on until all good neighbors of the optimum clustering are used up, but then continues with bad neighbors until the influence of this huge cluster is low enough again to leave other merge pairs a chance. The whole cycle repeats several times, producing clusters with an exponential distribution of the cluster size.

The proposed corrections involve scaling the selection quality by the inverse cluster sizes to prefer merging either small clusters or pairs of similar size. But the real problem are the merge errors introduced in the second phase of shrinking influence, and in many situations vertices and clusters belong together in spite of their very different size. Therefore these modifications were not implemented, in favor of the selection quality presented in the next section.

Weight Density
The greedy constraint for merge pairs was $f(a,b) - \rho(V)\,\mathrm{Vol}(a,b) > 0$. Selectors are derived from such inequalities by using the left side as selection quality and the right side as filter condition. As the greedy filtering is already done for all selectors, it can be ignored here. Other selection qualities are obtained by transforming the inequality. Adding the expected edge weight on both sides yields $f(a,b) > \rho(V)\,\mathrm{Vol}(a,b)$. This resembles heavy edge matching [43] and ignores the volume model completely; hence it most probably is not suitable for modularity clustering. But dividing the constraint by the volume $\mathrm{Vol}(a,b)$ results in:

$f(a,b) / \mathrm{Vol}(a,b) > \rho(V)$    (3.2)

This suggests selecting pairs by highest inter-density and is implemented as the weight density selector. A nice advantage of this selector is its consistency with the overall density-based approach. In each merge step these local selection qualities are recalculated from the merged edge and vertex weights: the weights of merged edges are summed, and the coarse-vertex weight is the sum over both vertices. From the definition it follows that only the values to adjacent vertices can change.
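A companion sketch for the weight density of Equation (3.2), under the same multiplicative volume model assumption as above:

// Weight density selection quality; the greedy filter keeps only pairs whose
// density exceeds the global density rho(V).
double weightDensity(double f_ab, double vol_a, double vol_b) {
    return f_ab / (vol_a * vol_b);
}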
3.3.2 Random Walks

In addition to using direct information from the vertices, their neighborhoods can also be compared. Ideally, the neighborhoods of vertices in the same optimum cluster are more similar than those of vertices in different clusters, because more edges lie within clusters than between clusters. But the unweighted, graph-theoretic neighborhood ignores edge weights; in extreme cases the neighborhood is the whole graph and thus provides no information. Note that already the assumption of more edges lying within optimum clusters than between them does not always hold, because with density-based clustering only the difference to the expected edges matters. A cluster border could lie along a cut where the absolute number of edges is higher than everywhere else but still much lower than expected.

A better alternative is to compute weighted neighborhoods using random walks. Two random walks are started at both vertices and aborted after a small number of steps. The probabilities to visit another vertex are computed deterministically and result in two probability distributions. Basically, the selection quality is then defined as the Euclidean distance of both distributions. The random walk distance selector is based on this idea, refined by ideas from Latapy and Pons [64] as described further below. On the other hand, the random walk reachability selector based on [40] does not directly compare the neighborhoods. It uses random walks to measure how well a vertex is reachable from another one walking over short paths. This is driven by the assumption that vertices of the same optimum cluster are connected over several paths, while only few paths connect vertices of different clusters.

Figure 3.7: Contribution of Neighbor Vertices to the Visit Probability.

Theoretical Background
This paragraph introduces a formal definition of random walks and discusses basic properties. The most important one is the connection between the connection density from Section 2.3.3 and the convergence of random walks: random walks converge to a stationary distribution of visit probabilities, but the so-called mixing rate of this convergence is coupled to the minimum scaled cut. This is the lowest inter-cluster density over all bisections of the graph, and a low density relative to the global density leads to slow convergence. Thus it can be deduced that random walks pass local barriers of lower density less often than high-density areas. Unfortunately, this connection is far from ideal. Firstly, the results apply just to the degree multiplicity volume model; there is no apparent way to adapt random walks to other volume models. And finally, the minimum inter-density (scaled cut) is not the modularity quality measure (shifted cut); the optimum clusterings of both may differ significantly.

A random walk starts at a vertex and walks from vertex to vertex by randomly choosing a neighbor. The transition probability for moving from vertex $u$ to $v$ is proportional to the edge weight connecting both vertices, i.e. $p(u,v) = f(u,v)/\deg(u)$. Instead of actually performing walks, just the probability to visit certain vertices is calculated. Let $P_u^t(v)$ be the probability of visiting vertex $v$ after $t$ steps starting from $u$. It is calculated incrementally from the initial distribution $P_u^0(v) = \delta_{u,v}$ with steps $P_u^t(v) = \sum_{w \in V} P_u^{t-1}(w)\, p(w,v)$. For example, Figure 3.7 depicts the contributions of the neighbor vertices $w_1$, $w_2$ and of $v$ itself ($P^{t-1}(w_1)p(w_1,v)$, $P^{t-1}(w_2)p(w_2,v)$, and $P^{t-1}(v)p(v,v)$) to the visit probability of the central vertex $v$; edges leaving the vertices in other directions are not shown.

The visit probabilities are not symmetric, i.e. $P_{v_0}^t(v_t) \neq P_{v_t}^t(v_0)$. But both are related by the start and end vertex degrees with $\deg(v_0)\, P_{v_0}^t(v_t) = \deg(v_t)\, P_{v_t}^t(v_0)$. To show this relation, the visit probability is expanded into single steps:

$P_{v_0}^t(v_t) = p(v_0, v_1) \cdots p(v_{t-1}, v_t)$    (3.3)

The single transition probabilities contain edge weights, which are symmetric, and factors $1/\deg(v_i)$, which can be shifted to the previous transition. The ratio $\deg(v_t)/\deg(v_0)$ accounts for the inserted and the removed factor at both ends.
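A C++ sketch of one such deterministic step on an adjacency-list representation (illustrative names; the vectors are indexed by vertex):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Edge { int to; double weight; };

// One deterministic random walk step: each vertex v distributes its probability
// mass over its incident edges with transition probability p(v,w) = f(v,w)/deg(v).
void randomWalkStep(const std::vector<std::vector<Edge>>& adjacency,
                    const std::vector<double>& degree,   // deg(v): total incident weight
                    const std::vector<double>& prob,     // P^(t-1)
                    std::vector<double>& next) {         // receives P^t
    std::fill(next.begin(), next.end(), 0.0);
    for (std::size_t v = 0; v < adjacency.size(); ++v) {
        if (prob[v] == 0.0) continue;                    // skip vertices not yet reached
        for (const Edge& e : adjacency[v])
            next[e.to] += prob[v] * e.weight / degree[v];
    }
}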
The probability distribution $P^t$ is also a vector over all vertices. In this context a random walk step is the matrix-vector product $P^t = M P^{t-1}$ with the transition matrix $M_{v,w} = p(w,v)$. The distribution $P^*(v) = \deg(v)/(2f(V))$ is the unique stationary distribution with $P^* = M P^*$. The proof uses $[M P^*](v) = \sum_w P^*(w)\, p(w,v)$; inserting the definitions, the vertex degrees $\deg(w)$ cancel out. This leaves $\sum_w f(w,v)/(2f(V))$, which equals $P^*(v)$ thanks to the symmetry of the edge weights.

In non-bipartite, connected graphs, long random walks converge to the stationary distribution regardless of the starting point. Thus the visit probability of a vertex becomes proportional to its degree. The mixing rate measures the speed of convergence to this limit. The evolution of the probability distribution can be expressed in terms of the symmetric matrix $N = D^{-1/2} M D^{1/2}$ using eigenvalue decomposition. The eigenvalues are $1 = \lambda_1 > \lambda_2 \geq \ldots \geq \lambda_n > -1$ with eigenvectors $v_1, \ldots, v_n$. Then the visit probability in spectral form is:

$P_i^t(j) = P^*(j) + \sqrt{\deg(j)/\deg(i)}\, \sum_{k=2}^{n} \lambda_k^t\, v_{k,i}\, v_{k,j}$    (3.4)

If all eigenvalues except the first are near zero, i.e. $\max(|\lambda_2|, |\lambda_n|)$ is small, the second term quickly disappears. In that case the distribution converges fast and the mixing rate is high. The next step uses a known relation between the eigenvalue gap $1 - \lambda_2$ and the graph conductance. The conductance of a bisection $S, V-S$ of a graph $G = (V,E)$ is defined as:

$\Phi(S) := \dfrac{f(S, V-S)}{2 f(V)\, P^*(S)\, P^*(V-S)}$    (3.5)

The numerator is the probability of switching between both sets using a random walk started on the stationary distribution, and the denominator is the same probability in a sequence of independently, randomly chosen vertices. Inserting the definition of $P^*$ and using the clustering $C = \{S, V-S\}$, this formula equals:

$\Phi(S) = \dfrac{\mathrm{Cut}_f(C)}{\rho(V)\, \mathrm{Cut}_{\mathrm{Vol}}(C)}$    (3.6)

The conductance of a graph is defined as the minimum over all possible bisections. This, however, is nothing else than the minimum normalized inter-density, which is also called scaled cut:

$\Phi = \min_C \; \rho(\mathrm{Cut}(C)) / \rho(V)$    (3.7)

Now the discrete version of Cheeger's inequality from differential geometry states that $\Phi^2/8 \leq 1 - \lambda_2 \leq \Phi$; for a proof see [49]. Therefore low inter-density leads to a small eigenvalue gap and thus to slow mixing.

In bipartite graphs the vertices are divided into two groups and all edges run between both groups. Thus with each step the visit probability also completely swaps between both. This is visible in the spectral form as well, where the most negative eigenvalue $\lambda_n$ is $-1$. Taking two adjacent vertices, their visit probability distributions will not overlap for a fixed number of steps. Hence, to obtain comparable distributions, the probability to visit a vertex in at most $t$ steps, $P_u^{\leq t}(v) = \sum_{i=1}^{t} P_u^i(v)$, is used like in [40].

Random Walk Distance
Latapy and Pons [64] propose short random walks to define a measure of vertex distance. They postulate that if two vertices lie in the same community (optimum cluster), the probabilities of reaching each other, $P_u^t(v)$ and $P_v^t(u)$, are high, and both vertices have a similar neighborhood with $P_u^t(w) \simeq P_v^t(w)$. Thus a low distance is expected between both distributions. They observe from the stationary distribution that high-degree vertices are preferred by random walkers. To remove this influence they propose the following Euclidean distance weighted by $\deg(w)$:

$d(u,v) = \sqrt{\sum_w \dfrac{(P_u(w) - P_v(w))^2}{\deg(w)}}$    (3.8)

To account for bipartite situations, the implemented random walk distance selector uses $d^{\leq 3}(u,v)$, the distance of the sums over walks of length at most 3.
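Given two accumulated visit probability vectors, the distance of Equation (3.8) can be sketched as follows (illustrative names):

#include <cmath>
#include <cstddef>
#include <vector>

// Degree-weighted Euclidean distance between the visit probability vectors of
// two start vertices; the vectors are indexed by target vertex.
double randomWalkDistance(const std::vector<double>& pu,
                          const std::vector<double>& pv,
                          const std::vector<double>& degree) {
    double sum = 0.0;
    for (std::size_t w = 0; w < pu.size(); ++w) {
        const double d = pu[w] - pv[w];
        sum += d * d / degree[w];        // damp the influence of high-degree vertices
    }
    return std::sqrt(sum);
}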
During merge operations, the distance of merged edges is updated by taking the maximum. This is also known as complete-linkage in the literature. It corresponds to using the largest distance between all vertices of both vertex groups and is consistent with the aim to avoid merge errors. The extreme opposite, for example, is using the shortest distance (single-linkage). There it is possible to merge very distant vertices by chaining a sequence of intermediate vertices of much lower distance.

Random Walk Reachability
The random walk reachability selection quality is similar to the weight density $f(a,b)/\mathrm{Vol}(a,b)$ but uses modified edge weights. Random walks are used to compute weights which become weaker for edges crossing low-density cuts and stronger for edges in high-density groups. It is assumed that vertices in the same optimum cluster are reachable over many paths. Thus, as a starting point, the probability to visit one end-vertex starting from the other end-vertex without returning back to the start vertex is used. This is very similar to the escape probability of Harel and Koren [40]. To obtain actual edge weights, these probabilities are scaled with the start vertex degree. Applying the weighted symmetry of inverse paths yields:

$r^t(u,v) = \dfrac{\deg(u)\, P_u^t(v) + \deg(v)\, P_v^t(u)}{2}$    (3.9)

Thus after the first step the weight $r^1(u,v) = \deg(u)\, f(u,v)/\deg(u)$ equals the original edge weight. Additional random walk steps will add weight to the edge proportional to the strength of alternative paths to this neighbor. The paths returning back to the start vertex are suppressed by setting $P_u^{\geq 1}(u) = 0$ in each step. To account for local bipartite situations, the probabilities are again summed over a small number of steps with $r^{\leq t}(u,v)$. Because this reachability is so similar to normal edge weights, the whole calculation can be applied to the result again, leading to a reinforcing feedback [40].

As selection quality, the density $r^{t,i}(u,v)/\mathrm{Vol}(u,v)$ is used. Here $r^{t,i}$ is the weight from $i$ iterative applications of the reachability computation with random walks of at most $t$ steps. During vertex merges, the edge weights $r^{t,i}$ of merged edges are summed and the densities of all incident vertex pairs are recomputed.

Implementation Notes
The visit probabilities $P_u^{\leq t}(v)$ are computed separately for each source vertex $u$ using two vectors storing $P_u^t(v)$ and the new $P_u^{t+1}(v)$ of each target vertex $v$. A random walk step requires iterating over all edges and transferring a portion of the visit probability from the edge's start-vertex in the old vector to the end-vertex in the new vector. The selection qualities are computed from the intermediate vectors between the steps. In order to avoid expensive memory management, the two vectors are reused by swapping source and target vector between the steps.

The worst-case time complexity for the reachability selector is $O(|E||V|it)$, as for each vertex and each step all edges have to be visited. The implementation tries to improve the speed by only processing edges whose start-vertex has a non-zero visit probability. Thus the computation time depends on the typical size of the neighborhoods. Still, on complete graphs the worst case is reached after one step. The complexity of the random walk distance selector is even worse. Here the worst case is $O(|E|^2|V|t)$, as for each pair of adjacent vertices two probability vectors have to be computed. On large graphs it is impossible to store and reuse these vectors for all vertices. Hence the implementation tries to save time by reusing the vector of one vertex and processing all its neighbor vertices in a row.
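For illustration, the edge weight of Equation (3.9) and the derived density are simple combinations of precomputed quantities; the sketch assumes the multiplicative volume model and accumulated visit probabilities $P^{\leq t}$, with all names illustrative.

// Symmetrized reachability weight of Equation (3.9); pu_v is P_u^{<=t}(v)
// and pv_u the reverse direction.
double reachabilityWeight(double deg_u, double pu_v, double deg_v, double pv_u) {
    return 0.5 * (deg_u * pu_v + deg_v * pv_u);
}

// Derived selection quality, ranked like the weight density selector.
double reachabilityDensity(double r_uv, double vol_u, double vol_v) {
    return r_uv / (vol_u * vol_v);
}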
3.3.3 Spectral Methods

Random walks as described above collect semi-global information about the density structure of the graph. But the relation to modularity clustering is very indirect and holds just for one specific volume model. Some applications may require other volume models, and random walks do not work well on some kinds of graphs. Thus the search for non-local selection qualities which are fully compatible with the modularity continues.

The approach of this section is based on the spectral methods described by Newman [57]. The modularity computation is rewritten in a matrix-vector form, and the matrix is replaced by its decomposition into eigenvalues and eigenvectors. The eigenvectors of the strongest positive eigenvalues are then used to define vertex vectors. These describe the contribution of each vertex in a higher-dimensional space, where the modularity is improved by maximizing the length of the vector sum in each cluster. The eigenvalues and eigenvectors are calculated at the beginning of each coarsening level. Given the vectors of two adjacent vertices, the spectral length and spectral length difference selectors analyze the length of the vertex vectors. On the other hand, the spectral angle selector uses the directions of these vectors. The following subsections present the mathematical derivation of the vertex vectors and conclude with a more detailed description of the selection qualities and implementation notes.

Figure 3.8: Spectral vertex vectors $\tilde{x}_u$, $\tilde{x}_v$ and two cluster vectors $\tilde{X}_1$ and $\tilde{X}_2 = \tilde{x}_u + \tilde{x}_v$.

Theoretical Background
The modularity can also be described as a sum over all vertex pairs using the Kronecker delta $\delta(C(u), C(v))$ as a filter, with $\delta(C(u), C(v)) = 1$ only for pairs of the same cluster and zero elsewhere. This representation of the modularity is:

$Q(C) = \sum_{u,v} \left( \dfrac{f(u,v)}{f(V)} - \dfrac{\mathrm{Vol}(u,v)}{\mathrm{Vol}(V)} \right) \delta(C(u), C(v))$    (3.10)

The inner part of the sum is replaced by the modularity matrix $M_{u,v} = f(u,v)/f(V) - \mathrm{Vol}(u,v)/\mathrm{Vol}(V)$. And for each cluster $i \in C$, the delta is separately replaced by a binary vector $s_i$ over all vertices with $[s_i]_v = \delta(C(v), i)$. The contribution of the cluster $i \in C$ thus is $s_i^T M s_i$. The spectral decomposition of the modularity matrix is $M = U D U^T$, using the matrix of eigenvectors $U = (u_1 | \ldots | u_n)$ and the diagonal matrix of eigenvalues $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. This is inserted into the modularity formula, and the calculation is separated for each eigenvalue:

$Q = \sum_{i \in C} s_i^T M s_i = \sum_{i \in C} (U^T s_i)^T D\, (U^T s_i) = \sum_{i \in C} \sum_{j=1}^{n} \lambda_j (u_j^T s_i)^2$    (3.11)

The previous step exploits that $u_j^T s_i = \sum_{v \in C_i} U_{v,j}$ is a scalar. As the next step, the eigenvalues shall be moved into the sums contained in the squares by using the square roots $\sqrt{\lambda_j}$. Unfortunately, this is difficult with negative eigenvalues. Thus the eigenvalues and the inner sum are split up into positive and negative eigenvalues. Without loss of generality $\lambda_1 \geq \cdots \geq \lambda_p > 0 \geq \lambda_{p+1} \geq \cdots \geq \lambda_n$, and thus:

$Q = \sum_{i \in C} \left[ \sum_{j=1}^{p} \left( \sqrt{\lambda_j}\, u_j^T s_i \right)^2 - \sum_{j=p+1}^{n} \left( \sqrt{-\lambda_j}\, u_j^T s_i \right)^2 \right] = \sum_{i \in C} \|X_i\|_2^2 - \|Y_i\|_2^2$    (3.12)

The last step replaced the sums by squared Euclidean norms of the cluster vectors $X_i$ and $Y_i$. This is achieved by moving the eigenvalues further down into the eigenvectors, forming vertex vectors $x_v = (\ldots, \sqrt{\lambda_j}\, U_{v,j}, \ldots)^T$ with $j = 1, \ldots, p$. The vertex vectors $y_v$ are analogously constructed from all negative eigenvalues and their eigenvectors. The former scalar product with $s_i$ is transformed into $X_i = \sum_{v \in C_i} x_v$
The former scalar product with s is transformed into X = i i v∈Ci xv P and Yi = v∈Ci yv respectively. Therefore the cluster vectors Xi are formed by adding the vectors xv of all vertices contained in the cluster. It becomes visible that the modularity is maximized by grouping vertices together with maximal Xi and minimal Yi lengths. In the vertex 40 3.3 Merge Selectors vectors the eigenvectors are weighted with the square-root of their eigenvalues. Thus only eigenvectors with a large eigenvalue have a significant influence. Moreover it is assumed that just using the positive eigenvalues suffices for the maximization. Thus all negative and weak positive eigenvalues are dropped from now on. This is controlled by the two parameters in the list below. The resulting approximate vertex vectors are denoted by x̃v . An example of the relation between vertex and cluster vectors is shown in Figure 3.8. x̃v , X̃v spectral ratio Multiplied with the largest eigenvalue. Defines the lower limit for accepted eigenvalues. spectral max ev The maximal number of eigenvalues and -vectors to compute. Another important property besides the length of the vertex and cluster vectors is their direction because in the optimum clustering the cluster vectors have to be orthogonal. This follows from the increase of vector length (and modularity) when merging two adjacent clusters with vectors which are separated by an angle below 90 degree. Thus in m dimensions at most m + 1 clusters are representable. Hence in practice it is necessary to compute as many eigenvectors as expected clusters. Spectral Length and Length Difference From the spectral analysis of the previous section follows that P the modularity is improved by maximizing the length of each cluster vector X̃i = v∈Ci x̃v . The first idea was to select cluster pairs a, b with the longest vector sum ||Xa + Xb ||2 . Hence this selection quality is called spectral length. spectral length On second sight this selection quality is related to the modularity increase selector as was already pointed out by Newman in [57]. The modularity increase in spectral decomposition is the contribution of the new cluster ||Xa + Xb ||22 − ||Ya + Yb ||22 minus the previous contribution of both clusters ||Xa ||22 + ||Xb ||22 − ||Ya ||22 − ||Yb ||22 . Using just the positive eigenvalues results in the spectral length difference selector: 2X̃aT X̃b = ||X̃a + X̃b ||22 − (||X̃a ||22 + ||X̃b ||22 ) (3.13) The relation of both selection qualities to the modularity increase selector also inherits its disadvantages in the growth behavior. Looking at a long vector an adjacent long vector is ranked higher than all short vectors even when pointing into a nearly orthogonal direction. Since the vector length roughly correlates with the vertex degree, high-degree vertices will be grouped together early leading to merge errors. Measuring the change in vector length instead reduces this domination but is nevertheless similar to the modularity increase selector. It is not directly obvious why dropping negative and weak eigenvalues should provide a better insight into the selection problem. Spectral Angle Instead of the misleading vector length the direction of the vectors can be used. The spectral angle selector prefers pairs of adjacent clusters where the cluster vectors have the same direction, i.e. small angles. 
3.4 Cluster Refinement

Cluster refinement is the key ingredient of the third multi-level phase. Its task is to improve the initial clustering by searching for better clusterings using sequences of local modifications. This section explores the design space of refinement methods and discusses the implemented algorithms.

Two fundamental types of refinement can be distinguished: Greedy refinement only allows operations improving the quality of the clustering. In contrast, search refinement crosses phases of low-quality clusterings in the hope of finding better clusterings later. A clustering in which a chosen greedy heuristic cannot improve the quality further is called a local optimum. A local optimum with the best quality of all local optima is called a global optimum; more than one global optimum may exist.

Searching global optima commonly faces some problems. Again the NP-completeness of modularity clustering hides in these problems and enforces that not all of them are solvable. The following observations were identified from the literature [22, 77, 52]:

foothill problem  The search gets stuck in a low-quality local optimum, as is always the case with pure greedy heuristics. Some mechanism is necessary to leave the local optimum, for example sequences of quality-decreasing moves or crossing different locally optimal clusterings.

basin problem  Many low-quality local optima lie in a bigger valley, with better optima being far away. In order to reach them, several local optima of low quality have to be passed. This is also known as funnels in the context of simulated annealing and similar methods.

plateau problem  The search space has nearly the same quality everywhere except for some very local peaks. Thus there is no local information about the right direction of search, and most heuristics would be as good as random search. Monte-Carlo methods or a mathematical derivation of search directions might help here.
ridge problem  The direction of motion in the search space differs from the true direction of quality improvements. This might happen, for example, with heuristic selection criteria chosen as a trade-off against computational effort. Heuristics for leaving local optima can also exhibit this property. It may be detected by cross-checking with other selection criteria.

multi-level reduction problem  The globally optimal clustering of the original graph is most likely not a local optimum in coarsened graphs, because it is not representable when vertices of different clusters are merged together. Inversely, a global optimum of a coarse graph does not necessarily lead to the real global optimum of the original graph after projection and refinement. Improvements might be possible by considering multiple candidate clusterings or by applying several multi-level passes with different coarsening hierarchies.

This leads to apparently conflicting objectives. As there is no reason to accept obviously improvable clusterings, it is necessary to fully exploit local optima without stopping halfway. On the other hand, exploring a wider range of local optima requires leaving local optima and even avoiding already explored optima. Therefore the first important step towards reliable refinement algorithms is the design of greedy refinement and search algorithms. Later, search and greedy refinement may be combined to effectively explore a wide search space without missing local optima.

The next section explores the design space of refinement methods based on local operations like moving single vertices. After classifying concrete algorithms, the second section discusses the implemented greedy algorithms. The third section presents the adaptation of the well-known Kernighan-Lin refinement to modularity clustering.

3.4.1 Design Space of Local Refinement Algorithms

This section explores components of simple refinement strategies with just two basic operations: moving a single vertex to another cluster, and moving a vertex into a new cluster. The analysis is based on frequently used implementations like greedy refinement [44], Kernighan-Lin refinement [45, 24, 58], simulated annealing [39, 38, 68], and extremal optimization [23, 11, 10], looking for common components and differences.

Substantial differences to the classic k-way partition refinement (k equally sized clusters, minimum edge cut) exist: The number of clusters is not predetermined and no balance constraints on their size are given. Instead, determining the right number of clusters has to be achieved by the algorithm; clusters may be created and emptied as necessary. The modularity clustering quality is more complex than the minimum edge cut, because the volume introduces global dependencies. Due to the multi-level approach it suffices to move single vertices instead of constructing vertex groups like in the original Kernighan-Lin algorithm.

The algorithms are dissected into the components listed below; a configuration sketch follows after the list. The next paragraphs discuss these components in more detail and comment on their dependencies. The section concludes with an overview of how basic refinement algorithms fit into this design space.

Vertex Ranking  Which vertices should be selected for movement?
Target Ranking  To where should selected vertices be moved?
Search Mode  Which vertices and targets are evaluated before selecting one of them?
Ranking Updates  Is the vertex ranking updated after each move or fixed?
On Modularity Decrease  What to do with selected moves not improving modularity?
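To make the design space tangible, the following sketch encodes a refinement algorithm as one combination of these components. The enumerators mirror the lists above; the type itself is purely illustrative and not part of the actual implementation.

    // Sketch: a refinement algorithm as one point in the design space.
    enum class SearchMode    { Complete, BestVertex, BestUnmoved, AllRandom };
    enum class VertexRanking { MaxMod, ModFitness, EoFitness, DensityFitness };
    enum class OnDecrease    { Skip, Abort, Accept };

    struct RefinementConfig {
        SearchMode    search;
        VertexRanking ranking;
        bool          updateRanking;  // re-rank after each move, or fixed order
        OnDecrease    onDecrease;
    };

    // Two points in the space, anticipating the classification in Table 3.2:
    const RefinementConfig sortedDens{SearchMode::BestUnmoved,
                                      VertexRanking::DensityFitness,
                                      true, OnDecrease::Skip};
    const RefinementConfig klMaxmod{SearchMode::BestUnmoved,
                                    VertexRanking::MaxMod,
                                    true, OnDecrease::Accept};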
Vertex and Target Ranking

The decision which vertices are moved to which cluster is split into two aspects: the vertex ranking directs the selection of vertices, and the target ranking selects the cluster. The vertex ranking is used as a predictor, similar to the merge selector in the coarsening phase, and should direct the refinement into good, modularity-increasing directions.

Given a vertex v, its current cluster C[v], and a target cluster j ∈ C ∪ {∅}, the change of modularity Δ_{v,j} Q(C) = Q(C') - Q(C) is calculated as shown in the equations below. The cluster ∅ is used for new, empty clusters; by definition, edge weights and volumes involving this cluster are zero. The maxmod vertex ranking sorts the vertices by their best increase of modularity, maxmod(v) = max_j Δ_{v,j} Q(C). The single implemented target ranking is the maxmod ranking, which selects the neighbor cluster with maximal increase of modularity, arg max_j Δ_{v,j} Q(C).

    \Delta_{v,j} Q(C) = [ f(v, C_j - v) - f(v, C[v] - v) ] / f(V) - [ Vol(v, C_j - v) - Vol(v, C[v] - v) ] / Vol(V)    (3.15)

    \Delta_{v,\emptyset} Q(C) = - f(v, C[v] - v) / f(V) + Vol(v, C[v] - v) / Vol(V)    (3.16)

[Figure 3.9: Dependencies when moving a vertex v from C[v] to j. The current cluster contributes f(v, C[v] - v), the target cluster f(v, C_j - v); all other clusters are unchanged.]

The first term depends only on the adjacent vertices, as shown in Figure 3.9. To compute the edge weights f(v, C_j - v), all incident edges of v are visited and their weight is summed in an array grouped by the cluster of the end-vertex. Thus computing the maxmod ranking costs time linear in the number of incident edges and the number of clusters, on average O(|E|/|V| + |C|).

As visible in the equations, the change of modularity depends through the volume term on the global clustering structure and changes with nearly every vertex move. The volume is computed according to the volume model with Vol(v, C_j - v) = w(v) · w(C_j - v), where the latter value is the sum over all vertex weights in the cluster. To evaluate the volume in constant time, this value is stored per cluster and iteratively updated each time a vertex moves out of or into the cluster. This global dependency is the key point that makes modularity refinement more expensive than the standard min-cut refinement.

Instead of ranking vertices by the expensive-to-compute modularity increase, good candidate vertices may also be identified by their current contribution to the modularity, f(v, C[v]) - ρ(V) Vol(v, C[v]). Vertices with a low or negative contribution have a good chance of not belonging to their current cluster. Moving a vertex does not modify the contribution of its self-edge; omitting self-edges therefore leaves mod(v) = f(v, C[v] - v) - ρ(V) Vol(v, C[v] - v). This computation can be done in constant time by storing f(v, C[v] - v) for each vertex and iteratively updating it when adjacent vertices are moved.

Based on the contribution to modularity, several vertex rankings were derived. It can be assumed that the placement of high-degree vertices has a higher influence on the modularity. To suppress this effect, the modularity contribution can be divided by the vertex degree. Or, to stay consistent with the density model, it can be scaled by the inter-volume between the vertex and its cluster. Since these vertex rankings are structurally and semantically similar to the vertex fitness used in extremal optimization [23], they are called fitness measures. In contrast to other vertex rankings, here lower values are preferred. In summary these are:

mod-fitness  The modularity contribution mod(v) = f(v, C[v] - v) - ρ(V) Vol(v, C[v] - v).

eo-fitness  The vertex fitness used in extremal optimization, eof(v) = mod(v)/deg(v).

density-fitness  The density of connections to the current cluster, ρ(v, C[v] - v) = f(v, C[v] - v)/Vol(v, C[v] - v). This is ranking-equivalent to mod(v)/[deg(v) deg(C[v] - v)] in the degree multiplicity volume model.
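A sketch of the central computation is given below. It evaluates Δ_{v,j} Q per equation (3.15) for the degree multiplicity volume model, assuming the per-cluster weight sums and per-vertex incidence lists described above are available; all names are illustrative, not those of the implementation.

    #include <cstddef>
    #include <vector>

    struct Edge { std::size_t u, v; double w; };

    // Sketch of Eq. (3.15): modularity change for moving vertex v to cluster j.
    // incident[v] lists the edges at v, clusterOf maps vertices to clusters,
    // clusterWgt[c] caches the weight sum w(C_c), fV = f(V), volV = Vol(V).
    double deltaQ(std::size_t v, std::size_t j,
                  const std::vector<std::vector<Edge>>& incident,
                  const std::vector<std::size_t>& clusterOf,
                  const std::vector<double>& vertexWgt,
                  const std::vector<double>& clusterWgt,
                  double fV, double volV)
    {
        double fOwn = 0.0, fTarget = 0.0;   // f(v, C[v]-v) and f(v, C_j-v)
        for (const Edge& e : incident[v]) {
            std::size_t u = (e.u == v) ? e.v : e.u;
            if (u == v) continue;           // self-edges do not change the move
            if (clusterOf[u] == clusterOf[v]) fOwn    += e.w;
            if (clusterOf[u] == j)            fTarget += e.w;
        }
        // Vol(v, C - v) = w(v) * w(C - v) in the multiplicity model.
        double wOwn    = clusterWgt[clusterOf[v]] - vertexWgt[v];
        double wTarget = clusterWgt[j] - (clusterOf[v] == j ? vertexWgt[v] : 0.0);
        return (fTarget - fOwn) / fV - vertexWgt[v] * (wTarget - wOwn) / volV;
    }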
Search Mode

The search mode controls how vertices are selected based on the vertex ranking. It plays a role similar to the grouping and matching algorithms in graph coarsening. The most accurate search method is to always consider all possible operations: In each step all possible moves of all vertices are evaluated and the globally best modularity increase is selected. This is a costly operation because evaluating all moves requires visiting all edges in each move step. For k vertex moves, the k|V| maxmod evaluations lead to O(k|E| + k|V| max|C|) average run-time, where max|C| is the highest number of clusters that occurred during the moves.

Fortunately, this method can be divided into two stages by first selecting the best vertex according to a vertex ranking and then evaluating further operations only for that vertex (mode: best vertex). In combination with the maxmod vertex ranking this is equal to the previous method. Instead, the vertex ranking should be replaced by a constant-time predictor. Then performing k vertex moves requires at most k|V| evaluations of the vertex ranking and one maxmod target ranking per moved vertex. Thus the average run-time will be in O(k|V| + k|E|/|V| + k max|C|).

Most predictors are likely to also select vertices which cannot be moved with a modularity increase. In greedy refinement these vertices must be skipped. In order not to select these vertices again, all processed vertices are marked and ignored later (mode: best unmoved). This search mode thus selects one of the remaining vertices until no vertex is left. The whole process is restarted in case some early selected vertices became movable after later moves.

Ranking Updates

An algorithm could simply iterate over a fixed ordering of the vertices or update the ranking after each move. The updated ranking is implemented by simply visiting all vertices in each move step. Alternatively, a heap can be used to efficiently retrieve the best-ranked vertex. However, empirical comparisons suggested that all variants are similarly fast. Therefore visiting all vertices is the preferred method.

On Modularity Decrease

In case the selected operation does not improve the modularity, the following three choices are possible:

skip  Mark the vertex as moved but leave it in the current cluster.
abort  Abort the refinement.
accept  Perform the operation anyway.

Greedy heuristics are characterized by not accepting non-improving operations. Whether to directly abort the refinement depends on the search method: in case other vertices still allow modularity-increasing moves, the currently selected vertex should be skipped; otherwise the refinement can be aborted.
Classification of Basic Refinement Algorithms

Table 3.2 provides an overview and classification of refinement methods building on the components described in the previous subsections. Specific observations about greedy and Kernighan-Lin refinement are discussed in the next section. The implemented algorithms are marked in the table; other algorithms are included to show how they align within this design space.

Table 3.2: Classification of Refinement Algorithms

  Algorithm               Search Mode    Vertex Ranking   Mod. Decrease   Impl.
  greedy refinement
    complete greedy       best           maxmod           abort           X
    sorted-maxmod         best unmoved   maxmod           abort
    sorted-mod            best unmoved   mod-fitness      skip            X
    sorted-eo             best unmoved   eo-fitness       skip            X
    sorted-dens           best unmoved   dens-fitness     skip            X
  deterministic search
    KL-maxmod             best unmoved   maxmod           accept          X
    KL-mod                best unmoved   mod-fitness      accept
    KL-eo                 best unmoved   eo-fitness       accept
    KL-dens               best unmoved   dens-fitness     accept
  randomized search
    simulated annealing   all random     random target ranking; accepts with a
                                         probability depending on the modularity increase
    spinglass system      all random     random target ranking with an energy model;
                                         accepts like simulated annealing
    extremal optimization best+random    eo-fitness       accept

Sorted greedy refinement with maxmod vertex ranking (sorted-maxmod) is not implemented because it is as slow as complete greedy but also restricted in its search; thus it has no advantages over the other greedy algorithms. The three fitness-based Kernighan-Lin algorithms KL-mod, KL-eo, and KL-dens are not considered further because they do not reliably find local optima: their vertex ranking combined with accepting quality-decreasing moves prevents this. Finally, randomized algorithms are generally excluded for the same reason. They are interesting only in combination with greedy refinement, but then too many combinations exist to be discussed in the scope of this work.

3.4.2 Greedy Refinement

Greedy refinement algorithms are characterized by accepting only vertex moves that increase modularity. The complete greedy algorithm, as displayed in Figure 3.10, uses the maximal modularity increase maxmod as vertex selector. This enforces the globally best move in each iteration. Selecting and moving vertices is repeated until the modularity is not improved further.

  Data: graph, clustering, selector
  Result: clustering
  repeat
      v ← selector: find best maxmod vertex;
      j ← selector: find best target cluster for v;
      if move v → C_j is improving modularity then
          move v to cluster j and update selector;
  until move was not improving;

Figure 3.10: Refinement Method: Complete Greedy

In each iteration, |V| vertex rankings have to be evaluated, each requiring O(|E|/|V| + max|C|) time on average. Selecting the best target cluster and updating the cluster weights w(C[v]), w(C_j) costs time linear in the number of incident edges. Moving the vertex is done in constant time by updating the mapping C(v), which is stored in an array. Thus k moves require O(k|E| + k|V| max|C|) time.

To improve search speed, the sorted greedy algorithm splits vertex and move selection into separate steps. Figure 3.11 displays the algorithm in pseudo-code. The inner loop selects all vertices once, sorted by their vertex ranking. Since the ranking changes with each move, all vertices are re-visited in each inner iteration. With constant-time rankings like mod-fitness this costs O(|V|^2) time for the complete inner loop.
Again the most expensive part is selecting the target cluster for the current vertex. The inner loop does this exactly once for each vertex, which costs O(|E| + |V| max|C|). Moving a vertex and updating cluster weights and vertex fitness costs O(|E|/|V|) on average. Therefore the worst-case time for all iterations of the inner loop is in O(|V|^2 + |E| + |V| max|C|).

  Data: graph, clustering, selector
  Result: clustering
  repeat                                                  // outer loop
      mark all vertices as unmoved;
      while unmoved vertices exist do                     // inner loop
          v ← selector: find best ranked, unmoved vertex;
          j ← selector: find best target cluster for v;
          mark v as moved;
          if move v → C_j is improving modularity then
              move v to cluster j and update selector;
  until no improving move found;

Figure 3.11: Refinement Method: Sorted Greedy

In case vertices were moved, the outer loop restarts the refinement. This processes vertices which were visited early but became movable with modularity increase only after later moves of other vertices. In practice, only a small number of outer iterations is necessary. These restarts ensure that sorted greedy always finds a local optimum: if at least one improving move exists, its vertex will be visited and moved even if it is not the best-ranked vertex; higher-ranked vertices are simply skipped. When improvements were found, the refinement is restarted until not a single improving move exists. Nevertheless, the found optimum depends on the vertex ranking; the variants may end up in different nearby local optima.

3.4.3 Kernighan-Lin Refinement

The central idea of Kernighan-Lin refinement is to escape local optima by moving vertices with the least modularity decrease in case no improvements are possible. Selecting the least modularity-decreasing moves is like a careful depth-first walk into the surrounding clustering space along a ridge while avoiding clusterings of very low modularity. The basic algorithm is presented in the following subsection. Unfortunately, it is not very effective considering its long run-time. To improve this, the next subsections analyze two aspects of the dynamic behavior: the creation of clusters and the effective search depth. Based on the results, the algorithm is improved by restricting the search depth.

Basic Algorithm

The basic algorithm is shown in Figure 3.12. The best clustering found during the refinement is called the peak clustering and is stored separately. A new peak clustering would have to be stored whenever the clustering after a modularity-increasing move is better than the last peak. To save some work, only the last clustering in a series of modularity-increasing moves is stored. This situation is checked at modularity-decreasing moves in the inner loop.

  Data: graph, clustering, selector
  Result: clustering
  current ← clustering;
  repeat                                                  // outer loop
      start ← current; peak ← current;
      mark all vertices as unmoved;
      while unmoved vertices exist do                     // inner loop
          v ← selector: find best ranked, unmoved vertex;
          j ← selector: find best target cluster for v;
          mark v as moved;
          if modularity is decreasing ∧ current better than peak then
              peak ← current;
          move v to cluster j and update selector;
      if peak better than current then
          current ← peak;
  until current not better than start;
  clustering ← current;

Figure 3.12: Refinement Method: Basic Kernighan-Lin

In order not to directly revert modularity-decreasing moves, processed vertices are marked and ignored in later inner iterations.
In case the inner loop found an improved clustering, it is contained in the current or the peak clustering. This is checked after the inner loop and the best is returned to the outer loop. Altogether, the outer loop restarts the refinement on the best intermediate clustering until no further improvements are found.

Only the Kernighan-Lin algorithm with maxmod vertex ranking fully moves into local optima. Its inner loop starts like greedy refinement and performs modularity-increasing moves until reaching a local optimum, and suppressed moves of marked vertices are performed by the second outer iteration. Unfortunately, this property is not shared by the other vertex rankings: when vertices with no modularity-increasing moves are proposed before reaching a local optimum, they are moved anyway. Thus it becomes unlikely to reach local optima.

Similar to greedy refinement with maxmod vertex ranking, the time required for a complete run of the inner loop is O(|V| (|E| + |V| max|C|)). But in contrast to the complete greedy refinement, this algorithm does not abort in local optima. This makes the basic algorithm very expensive, which is also confirmed by empirical experience.

Creation of Clusters

The creation of clusters is an aspect specific to modularity clustering, because the optimal number of clusters is not known. Initial observations with the basic algorithm discovered an excessive creation of new clusters in the inner loop. For example, the scatter plot in Figure 3.13a shows the relation between modularity and the number of clusters at the end of inner loops; the position of peak clusterings is shown by blue circles. Moreover, Figure 3.13b shows the beginning of multi-level Kernighan-Lin refinement in the same graph. The black line plots the modularity and the red line the number of clusters; again blue circles mark cluster counts at peak clusterings.

[Figure 3.13: Kernighan-Lin Refinement Creating Clusters on Graph Epa main. (a) Modularity and Cluster Count; (b) Dynamics of Kernighan-Lin Refinement.]

Both graphics underline that all peak clusterings are found with a small number of clusters, while the algorithm tends to create a huge number of clusters. This behavior can be explained as follows: After all vertices fitting well into another cluster have been moved, the algorithm continues with vertices very well connected to their current cluster. Often, moving them into a new cluster decreases the modularity less than moving them into other existing clusters. Thus a cluster is created for that vertex.

The excessive creation of clusters brings two big problems. Firstly, the search for better clusterings is quickly directed to low-modularity clusterings and explores only few with cluster counts near the optimum; this wastes time in uninteresting parts of the clustering space. But more importantly, the time complexity of the whole algorithm is coupled to the number of clusters. Thus high cluster counts significantly impair the performance.
Effective Search Depth

The moves performed by the algorithm are interpretable as a depth-first search into the space of possible clusterings, with the vertex and target ranking determining the direction. The complete execution of the inner loop is very time-consuming. However, already after a small number of moves the search is far away from optimal clusterings and cannot return because of the marked vertices. In order to safely abort the inner loop earlier, the number of moves between peak clusterings is analyzed. For this purpose, let the effective search depth be the maximal number of moves from a peak clustering to a clustering with equal or better modularity that may occur in real-world graphs. Of course this depth depends on the number of vertices.

[Figure 3.14: Effective Search Depth of Kernighan-Lin Refinement. (a) Detail of the Dynamics on graph Epa main; (b) Observed Effective Search Depth.]

A typical situation with about 1500 vertices is displayed in Figure 3.14a. The modularity of the clusterings is shown by the black line; blue vertical lines mark the moves where peak clusterings were found. At the beginning of the inner loop, all 12 moves directly increasing modularity were executed. From this first peak clustering, a series of 62 modularity-decreasing and -increasing moves was necessary to find better clusterings and the second peak. This example represents an effective search depth of around 60 moves.

In order to find a simple lower bound for the effective search depth, the basic algorithm was applied with multi-level refinement to a selection of 17 real-world graphs. At each peak clustering, the observed search depth together with the number of vertices at the current coarsening level was recorded. The scatter plot in Figure 3.14b visualizes the dependency between vertex count and search depth, using a logarithmic scale for the vertex count. At extreme values the depth grows nearly linearly with the logarithm of the vertex count. A lower bound is now obtained by taking the maximal quotient, which yielded a factor of about 20. Thus it is moderately safe to abort the inner loop after around 20 log10 |V| moves with a modularity below the last peak clustering; in the example above, this gives 20 log10 1500 ≈ 64 moves, just above the observed depth of 62. This bound is also shown by the dashed line.

Based on these observations, the basic algorithm is improved by restricting the search depth. The parameter search depth controls the accepted search depth. The inner loop is terminated after log10 |V| times search depth moves with a modularity lower than at the last peak clustering or the beginning of the inner loop. Factors around 20 should be used based on the measurements. Due to this early termination, it is unnecessary to also restrict the creation of clusters.
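The resulting termination test itself is cheap; a minimal sketch with hypothetical variable names:

    #include <cmath>
    #include <cstddef>

    // Sketch: early termination of the Kernighan-Lin inner loop. movesBelowPeak
    // counts the moves since the last peak clustering, searchDepth is the
    // parameter (factors around 20-25), nVertices is the vertex count of the
    // current coarsening level.
    bool abortInnerLoop(std::size_t movesBelowPeak, double searchDepth,
                        std::size_t nVertices)
    {
        return movesBelowPeak > searchDepth * std::log10(double(nVertices));
    }

With a search depth of 20 and |V| = 10000, the inner loop would thus be aborted after 80 moves below the last peak.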
3.5 Further Implementation Notes

This section provides additional hints on the implementation. These details are not directly important for the functioning of the clustering algorithm, but describe crucial components enabling and driving its implementation. Class diagrams and sample code give a small insight into its structure and appearance. The next subsection introduces the concept of index spaces and shows how graphs and data are managed in the program. The second subsection describes how the data is stored externally.

3.5.1 Index Spaces and Graphs

Graph algorithms commonly deal with two different aspects: the adjacency structure of the graph, and properties like vertex and edge weights. For navigation through the graph the structure is used. On the other hand, calculations typically involve weights retrieved from the current vertex or edge. For example, the implementation of the volume model stores vertex weights, while at the same time the merge selector stores the selection quality of each pair of adjacent vertices. These properties represent different aspects of the algorithm and should be managed separately. Often it is necessary to construct temporary properties in sub-algorithms that are not visible to the outside. In addition, a noteworthy observation is that graph algorithms handle huge numbers of small objects, 100,000 and more, for vertices, edges, and temporary properties. All of these have no own life-cycle and do not need any garbage collection, as their existence is coupled to the graph and the algorithm phases.

As a solution, the implementation of the presented algorithms uses the concept of index spaces and index maps; for similar approaches see [29, 9]. Index spaces carry the structure without considering additional properties. This is similar to a collection of array indices: each index abstractly represents an object. Specific properties of these objects are managed by index maps. These allow reading and writing the data stored for an object, using its index as access key. This approach basically is an external variant of the decorator pattern; it is external because it does not wrap the single objects when extending them.

[Figure 3.15: Index Spaces: Class and Concept Diagram. Concepts IndexSpace, GrowableSpace, and DynamicSpace; implementations RangeSpace, RecycleSpace with BitSubspace, and FilterSubspace.]

Figure 3.15 contains a class diagram of the implemented index spaces and their interfaces. Basic spaces allow iterating over the indices using the methods begin and end, like in the C++ Standard Template Library (STL). The size in number of entries can be queried using size. Growable and dynamic spaces allow creating and removing indices. Subspaces mask parts of an index space; this can be used, for example, to blend out removed entries or to iterate over all entries with a common property. For this purpose, all index spaces hold a reference called super space to their superset. The super space of a root space points to itself.

The class RangeSpace is the simplest, most compact implementation. It describes a continuous range of indices from a lower to an upper bound and is used when this growing behavior suffices. In case it is necessary to remove indices from the space, the class RecycleSpace is used. It recycles removed indices later when new ones are created; internally a BitSubspace is used, which marks removed indices in a bitset. The class FilterSubspace allows to easily operate with subsets defined by a filter function; the filter is provided at instantiation as a function object.
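A drastically simplified sketch conveys the core of the index space idea: a growable range of integer indices whose growth can be observed by attached maps. The real classes additionally provide iteration, key lookup, and subspaces; this fragment is illustrative only and does not reproduce the actual interfaces.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Sketch: a minimal growable index space in the spirit of RangeSpace.
    // Indices form the continuous range [0, size()); index maps register an
    // observer so they can allocate storage when new keys are created.
    class SimpleIndexSpace {
    public:
        std::size_t size() const { return upper_; }
        std::size_t new_key() {                    // create a new index
            std::size_t key = upper_++;
            for (auto& notify : observers_)        // inform attached index maps
                notify(upper_);
            return key;
        }
        void register_upper_bound(std::function<void(std::size_t)> f) {
            observers_.push_back(std::move(f));
        }
    private:
        std::size_t upper_ = 0;
        std::vector<std::function<void(std::size_t)>> observers_;
    };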
[Figure 3.16: Index Maps: Class and Concept Diagram. Concepts IndexMapReadable and IndexMap; implementations ChunkIndexMap and ConstantDummyMap.]

Index maps, as shown in Figure 3.16, provide access to specific data stored for the objects referred to by indices. Several maps can be created on the same index space and independently destroyed when they are no longer in use. The implementations use C++ templates to define the type of data to store. The index maps supply automatic memory management and are resized when the underlying index space grows. To implement this, the map holds a reference to its index space; an observer is registered at the space to inform the map when new storage has to be allocated.

In order to implement the access to data efficiently, only continuous ranges of integer indices are used in this implementation. Index maps provide storage space for indices from a lower to an upper bound. The lower bound is always zero and the upper bound grows with the creation of new entries. The only real implementation is ChunkIndexMap. It stores the data not in a single huge array but in a sequence of fixed-size arrays called chunks. Access to entries is performed in two steps by splitting the numeric index into a chunk number and the inner offset. In practice this does not cost much, but makes memory allocation much easier: when the index space grows, no resizing of arrays and moving of data is necessary, just a new chunk is allocated when needed. In addition, it is easier to find memory space for fixed-size chunks than for huge continuous arrays. The other implementation, ConstantDummyMap, provides read-only access to a fixed value. It is used in some specializations of algorithms to represent unit edge weights.
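The two-step chunked access can be sketched as follows; the chunk size and names are illustrative, not those of the actual ChunkIndexMap. The grow method is what an index space observer, as sketched above, would call.

    #include <cstddef>
    #include <memory>
    #include <vector>

    // Sketch of the chunked storage idea: data lives in fixed-size chunks and
    // an index is split into a chunk number and an inner offset. Growing
    // allocates one new chunk instead of resizing one big array.
    template <typename T, std::size_t ChunkSize = 4096>
    class ChunkedMap {
    public:
        void grow(std::size_t upperBound) {        // called when the space grows
            while (chunks_.size() * ChunkSize < upperBound)
                chunks_.push_back(std::make_unique<T[]>(ChunkSize));
        }
        T&       at(std::size_t i)        { return chunks_[i / ChunkSize][i % ChunkSize]; }
        const T& get(std::size_t i) const { return chunks_[i / ChunkSize][i % ChunkSize]; }
    private:
        std::vector<std::unique_ptr<T[]>> chunks_;
    };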
Graphs are implemented using a combination of index spaces and maps. They contain a separate space for the vertices and one for the edges. A map over the edges is used to store the start- and end-vertex indices. Separate implementations for directed and undirected (symmetric) graphs exist. Both allow iteration over the outgoing edges of a vertex as well as insertion and removal of edges. For this purpose, two additional index maps provide access to sibling edges like in a doubly-linked list and a reference from each vertex to the start of its edge list. The undirected variant is used during graph coarsening and enforces for each edge the existence of its inverse edge. This enables the implementation of a vertex merge operation similar to edge contraction.

Access to the graphs is mainly provided in the style of the Boost Graph Library (BGL) [70]. Besides basic graph algorithms, it provides abstract concepts for accessing and manipulating graphs. The key point is the use of external adapters. Instead of defining a single interface to graphs, the concepts define separate, external functions. With this strategy the functionality can be extended without the necessity to extend existing interfaces and replace proxy classes: simply new functions are added which receive the graph object as argument. Based on the argument types, the compiler then exploits function overloading to select the correct implementation. Another advantage of this strategy is better readability, since many small details can be hidden behind the adapter functions. Therefore such adapters are provided for the own graph implementations and index maps.

The C++ code in Figure 3.17 shows a small example that calculates the weighted degree of all vertices. Using a for-each construct, the outer loop iterates over all vertices of the graph. At each vertex the outgoing edges are visited by the inner loop and their weights are summed up.

  BOOST_FOREACH(Vertex v, vertices(g)) {
      put(deg, v, 0.0);
      BOOST_FOREACH(Edge e, out_edges(v, g)) {
          if (v == target(e, g))            // self-edge
              at(deg, v) += 2 * get(ewgt, e);
          else
              at(deg, v) += get(ewgt, e);
      }
  }

Figure 3.17: C++ Example: Calculation of Vertex Degrees. Input are the graph g and the edge weights ewgt. The index map deg will contain the computed degrees.

In summary, the index spaces and maps with their abstract access by indices are conceptually very similar to graphs and property maps in Boost. Here the concept is extended by the explicit integration of index spaces; without them it is not possible to directly describe the dependencies between property maps and the structures they attribute. The Boost Graph Library already supplies the user with basic implementations for various graph types. However, these use the simple collections (vectors, maps) of the Standard Template Library, which do not scale well to big graphs. Here the chunk-based implementation of index maps provides a scalable alternative.

3.5.2 Data Management

The management of graphs, clusterings, and other data faces some special requirements in this project. Most importantly, the evaluation will handle many graphs and produce huge amounts of clusterings.
Thus no duplicate information like the graph structure and vertex names should be stored with each clustering. Besides the actual graph data, quite some structured meta data emerges. This includes information about the used algorithms, parameters, source references, and comments. This data should be easily accessible to computers and humans. And finally, the data should be easy to import into other software like GNU R for post-processing.

In consequence, simple text files similar to the CSV (comma-separated values) format are used. Each file stores exactly one table. The columns are separated by tabs (\t) and rows by a single newline (\n) character. Strings may be delimited by quotation marks. The value NA may be used for missing values. The first row contains the table header with column titles, and the first column is always used as index. To save storage space, all files are transparently compressed using the gzip format.

Separate files are used for different data. For example, graphs are stored in three files containing vertex data, edge data, and meta data. For each vertex its name or original identifier is stored; the edge table contains the pairs of end-vertex indices and the edge weights. Additional files are used to store clusterings and the like. In order to retrieve data easily, a hierarchical naming convention is employed. The dot is universally used as separator in filenames. The first component names the graph and the following components differentiate data sets. The naming scheme is best explained by example, as in Table 3.3 below. Meta data is organized in a similar fashion; here character strings are used as row indices and the dot is used as hierarchical separator.

Table 3.3: Hierarchical Naming Convention for File Names

  Filename                        Description
  gname.meta.gz                   meta-information about the graph gname (source, content, type, ...)
  gname.vertices.gz               the vertex data (vertex name)
  gname.edges.gz                  the edge data (start-vertex, end-vertex, weight)
  gname.cluster.algname.gz        the clustering computed with algorithm algname
  gname.cluster.algname.log       free-form log-file of the algorithm
  gname.cluster.algname.meta.gz   configuration of the algorithm (parameters, cluster count, modularity, ...)
  gname.cluster_best.gz           copy of the best found clustering
  gname.cluster_info.gz           table comparing the clusterings (algorithm, modularity, runtime, ...)
  gname.vertex-degree.gz          weighted and unweighted vertex degrees

4 Evaluation

This chapter evaluates the family of algorithms developed in the previous chapter. The algorithms are composed of different components like the coarsening method, the merge selector, and so on; specific algorithms are configured using a set of parameters. The key aspects of the evaluation are effectiveness, reliability, simplicity, and efficiency. An algorithm is effective when it finds better clusterings in terms of the modularity measure compared to the alternatives. This is strongly related to reliability, i.e. finding these clusterings systematically and not just sometimes by random chance. Simplicity and efficiency are concerned with the question whether complicated or expensive components of the algorithm significantly improve the results. This includes the study of scalability, checking how the runtime of the components develops with the graph size. Expensive components with just small or no improvements could be omitted to simplify future implementations. At the same time, it might be possible to gain similar improvements with less effort using other heuristics.

Not all of these aspects can be treated in full depth in the scope of this work. Instead, the evaluation mostly concentrates on the effectiveness compared to other configurations and reference algorithms.

The chapter is organized as follows. The next section summarizes the configuration space and introduces the main evaluation methods used in this work. The following three sections study the effectiveness of the graph coarsening, the merge selectors, and the refinement; efficiency aspects are also discussed where appropriate. The fourth section discusses experimental results on the scalability of the algorithms.
And the last two sections compare the presented multi-level refinement method against other clustering algorithms and against results collected from the literature.

4.1 Methods and Data

This section centrally discusses various aspects of the employed evaluation methods. The first subsection summarizes the algorithm components and describes the configuration space spanned by the parameters. The second subsection introduces the mean modularity method used to study the effectiveness of clustering algorithms; in this context the collection of benchmark graphs is also presented. Finally, the last subsection contains some notes on the evaluation of computation times and scalability.

4.1.1 Configuration Space

The multi-level algorithm is configured by the parameters listed in Table 4.1. This includes, for example, the coarsening and refinement methods and the configuration of the merge selectors. The parameters are grouped into a coarsening and a refinement phase depending on what they directly influence. Some merge selectors are further configurable; their parameters are placed in the component called selector. Note that not all parameters are meaningful in all combinations. For example, the match fraction is used only by greedy matching. The last column contains the default values used throughout the evaluation where nothing different is stated.

Table 4.1: The Configuration Space

  parameter           component    description and values                                       default
  coarsening method   coarsening   method to merge cluster pairs: greedy grouping,              greedy grouping
                                   greedy matching
  reduction factor    coarsening   number of clusters to merge in each coarsening               10%
                                   level: 5%-50%
  match fraction      coarsening   number of best ranked pairs to consider in matching:         50%
                                   50%-100%
  merge selector      coarsening   ranking of cluster pairs: modularity increase, weight        weight density
                                   density, RW distance, RW reachability, spectral length,
                                   spectral length difference, spectral angle
  RW steps            selector     length of random walks                                       2
  RW iterations       selector     iterative applications of reachability                       3
  spectral ratio      selector     number of eigenvectors to use; cut-off value for accepted    20%
                                   eigenvalues relative to the largest eigenvalue (λ_j/λ_1)
  spectral max ev     selector     maximal number of eigenvalues and -vectors to compute        30
  refinement method   refinement   method to move vertices: complete greedy, sorted greedy,     sorted greedy
                                   Kernighan-Lin
  vertex selector     refinement   ranking of vertices in sorted greedy refinement:             density-fitness
                                   mod-fitness, eo-fitness, density-fitness
  search depth        refinement   when to abort the inner-loop search in Kernighan-Lin,        25
                                   multiplied by log10 |V|

The parameters controlling the spectral merge selectors and the search depth are fixed, and their influence will not be evaluated further, for the following reasons. The vertex vectors used in the spectral merge selectors are always approximations. Using more eigenvectors should improve the accuracy, but requires more memory and time to compute them. Thus the maximal number of eigenvectors is limited to 30; in graphs with fewer than 30 clusters, fewer eigenvectors contain meaningful information anyway. By assuming that the positive eigenvalues quickly decay to zero, only the few eigenvectors with eigenvalues larger than 20% of the largest eigenvalue are used. The search depth is only used by the Kernighan-Lin refinement. The experiments in Section 3.4.3 on refinement algorithms showed that it can be restricted to 20 log10 |V| moves without missing better clusterings; for safety, the factor 25 will be used.

4.1.2 Effectiveness

The main tool used in this evaluation is the comparison of mean modularities.
For this purpose, clusterings produced by specific algorithm configurations on a set of graphs are collected. The mean modularity of a configuration over the graphs is calculated using the arithmetic mean. Of course, graphs with high modularity influence the absolute value of the mean modularity the most. But the modularity measure is normalized to the range from zero to one, so all modularity values lie in the same range regardless of differing graph properties. In addition, the absolute values are unimportant here, as algorithms and configurations are compared against each other. The differences between mean modularities are most influenced by graphs with high modularity variations between the algorithms.

In many places, the results gained with and without refinement will be compared. Without refinement it becomes visible how much the parameters influence the raw graph coarsening. On the other hand, results with refinement indicate how well coarsening with the observed parameters supports the refinement algorithms later. Kernighan-Lin refinement is expected to be more resistant against such difficulties, as it can escape from local optima. Hence just sorted greedy refinement with the density-fitness vertex selector is used for these comparisons. This reference configuration will be abbreviated as sgrd in most places.

As a useful side effect, the modularity improvement of refinement over no refinement provides a significance scale. Fluctuations of the modularity much smaller than this scale are probably not of much interest. Similarly, improvements of different coarsening configurations which are easily compensated by greedy refinement are not important either.

For many questions of the evaluation, tables containing mean modularities will be produced. However, it is difficult to compare these raw values. Therefore, where possible, the mean modularity results will be plotted and graphically visualized. The significance scale also becomes visible in such figures by including results gained with and without refinement.

It can be expected that not all graphs are equally well suited for the study and comparison of clustering algorithms. Obviously, some may not contain any meaningful cluster structure. Structural properties like the number of clusters, inhomogeneous cluster sizes, and the degree of separation will influence the algorithms. Most importantly, approximation algorithms and heuristics directly or implicitly exploit structural properties of the graphs to make their local decisions. Therefore it is advisable to not blindly use a few graphs for evaluation, but to also study their structural properties. Unfortunately, still only little is known about the influence of various structural properties on the modularity clustering problem. To avoid pitfalls, the effectiveness of clustering methods is studied here using a large collection of real-world graphs from different application domains. The next paragraph shortly comments on why no randomly generated graphs were used in this study. The second paragraph then introduces the benchmark graphs.

On Random Graphs

Using graph generators, it is very easy to acquire huge amounts of benchmark graphs with roughly known structural properties. In the past, quite some studies on clustering algorithms, for example [18], used simple generated graphs. Still, random graphs should be used with caution for the following reasons:

• Depending on the generator, the expected and the actual number of edges in the generated graph do not match. In that case, density and optimal quality vary even with identical parameters.
• The random graph model determines the vertex degree distribution and many other structural properties. For example, placing edges with equal probability between vertices (Bernoulli random graph) produces a binomial degree distribution; certainly only few graphs in real-world applications have such a structure. At the same time, in some application domains it is still difficult to generate graphs with properties close to the real structures.

• Some volume models depend on the vertex degree, which in turn depends on where the edges are placed. This interdependency makes it very difficult to produce random graphs with specific intra- and inter-cluster densities.

• The intended clustering of a generated graph appears to be known, but another clustering might well have a better quality for the chosen quality measure. On smaller random graphs the fluctuations may dominate the clusterings.

The Real-World Benchmark Graphs

Many researchers and institutions have published typical graphs of their research areas on the Internet. In the context of modularity clustering, for example, Mark Newman, Alex Arenas, and Andreas Noack collected and published some interesting graphs. Besides these small, individual collections, the Pajek project [8] brought together quite many graphs from different disciplines. These collections were used to compile a set of benchmark graphs. The complete set is listed in Table A.1 in the appendix. For each graph the number of vertices and edges, the weighted mean vertex degree (mean wdeg), and the global connection density in the degree volume model (wgt. density) are printed. In addition, the source of the graph is indicated in the last column, and Table A.2 provides web addresses for this data.

For various reasons, subsets of the collection will also be used. For example, some single evaluation steps compare a big number of configurations; to keep computation times feasible, a reduced graph set is employed there. The second column of Table A.1 marks the graphs of the reduced set with R. In addition, the graphs from the scalability analysis are marked with S and those from the comparison to reference algorithms with C.

Many graphs of the collection are directed and contain parallel edges, but the modularity measure is defined for simple, undirected graphs. Therefore the graphs were pre-processed with the following aim: between each adjacent pair of vertices lies exactly one edge in each direction, and its weight equals the sum of the original edge weights between both vertices. This pre-processing is permissible, as the focus of this work lies on the evaluation of algorithms and not on the interpretation of single clustering results; however, other publications may have used different normalization strategies. The pre-processing is accomplished in four steps: First, parallel edges are removed and their weight is added to the first edge. Then missing inverse edges are added with zero weight; self-edges are used as their own inverse edge. In the third pass, the edge weights are made symmetric by taking the sum of each edge and its inverse edge, ignoring self-edges. Finally, in disconnected graphs the largest connected component is chosen. These graphs are labeled with the suffix main.
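The effect of the first three steps can be sketched as follows; the component extraction of the fourth step is omitted and all names are illustrative.

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <vector>

    struct Edge { std::size_t u, v; double w; };

    // Sketch of the edge normalization: merge parallel edges and give every
    // adjacent pair one edge in each direction whose weight is the sum of all
    // original weights between the two vertices. Self-edges keep their weight.
    std::vector<Edge> normalize(const std::vector<Edge>& in)
    {
        std::map<std::pair<std::size_t, std::size_t>, double> sum;
        for (const Edge& e : in) {                   // collapse parallels and
            auto key = std::minmax(e.u, e.v);        // both directions at once
            sum[{key.first, key.second}] += e.w;
        }
        std::vector<Edge> out;
        for (const auto& kv : sum) {
            out.push_back({kv.first.first, kv.first.second, kv.second});
            if (kv.first.first != kv.first.second)   // add the inverse edge
                out.push_back({kv.first.second, kv.first.first, kv.second});
        }
        return out;
    }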
4.1.3 Efficiency and Scalability

In order to compare the efficiency of different configurations, information about the computational expenses is necessary. Trade-offs between quality and runtime are then identified in combination with the measured differences in the mean modularity. Unfortunately, it is not advisable to compare average computation times: there is no shared time scale between the graphs, and the average values would be strongly influenced by the few largest graphs. Therefore only the measured times from a single graph are compared. Throughout the evaluation, the graph DIC28 main is used; for some aspects also the graph Lederberg main is considered. All timings were measured on a 3.00 GHz Intel(R) Pentium(R) 4 CPU with 1 GB main memory. Similar to the mean modularity, comparing the runtime with and without refinement provides a significance scale.

Computation times measured on different graphs can be used to study the scalability of the algorithms. By nature, the runtime depends on the number of vertices and edges; plotting the measured times against the number of vertices will visualize these dependencies. In practice, the absolute times are uninteresting, as they also depend on startup time, implementation style, and the runtime environment. Instead, the progression of the runtime curves of different algorithms is compared.

4.2 Effectiveness of the Graph Coarsening

This section has three aims. First, the best value for the parameter match fraction used by the matching algorithm is searched. Then greedy grouping is compared against the best greedy matching variant. And finally, the influence of the reduction factor on coarsening and refinement is studied.

4.2.1 Match Fraction

The match fraction controls how many of the best-ranked cluster pairs are considered when constructing the matchings. All other pairs are simply ignored, in order to avoid selecting pairs ranked very low just because all better pairs are blocked by previous matches. The greedy matching is evaluated in the default configuration. Two different reduction factors, 10% and 30%, and the match fraction values 10%, 25%, 50%, 75%, and 100% are used; with 100% no pairs are ignored. The parameters are compared with and without refinement. In summary, the two basic algorithms are greedy matching with refinement (M-sgrd) and without (M-none).

Table 4.2 compares the mean modularities gained with the 20 configurations. As benchmark graphs, the reduced set is used. The rows contain the two algorithms combined with both reduction factors, the columns list the match fractions, and the maximum value of each row is marked to highlight the best match fraction. In addition, Figure 4.1 plots the four rows.

Table 4.2: Mean Modularity by Match Fraction. Columns contain the match fraction, rows contain the algorithm and reduction factor; the best value per row is marked with *.

              10%       25%       50%       75%       100%
  M-none 10%  0.56318   0.56436   0.56595*  0.56590   0.56572
  M-sgrd 10%  0.59582   0.59573   0.59592*  0.59566   0.59567
  M-none 30%  0.56415   0.56336   0.56431   0.56441*  0.56252
  M-sgrd 30%  0.59516*  0.59513   0.59453   0.59432   0.59432

[Figure 4.1: Mean Modularity by Match Fraction (reduced set)]

As visible in the table and the figure, on average the best modularities are gained with the 50% match fraction. Thus this value is used as default in the following evaluations. Altogether, the influence of this parameter seems to be negligible.
4.2.2 Coarsening Methods

Next, greedy grouping and greedy matching are compared. To stay independent of specific reduction factors, the values 5%, 10%, 30%, 50%, and 100% are used. The factor controls the number of coarsening levels and thus indirectly influences the refinement later. The value 100% produces just one coarsening level; factors above 50%, however, cannot be used with greedy matching. Again the default parameters are used. In summary, the four algorithms are greedy grouping with refinement (G-sgrd) and without refinement (G-none), and greedy matching with refinement (M-sgrd) and without (M-none).

Table 4.3 compares the mean modularities gained with the 18 configurations on the reduced graph set. The rows contain the four algorithms and the columns the reduction factors; the maximum value of each column is marked to highlight the best algorithm. Additionally, Figure 4.2a visualizes these values. Table 4.4a compares the runtime of the configurations on the largest graph of the set (DIC28 main); the number of actually produced coarsening levels is printed in parentheses.

Table 4.3: Mean Modularity by Reduction Factor. Columns are the reduction factor, rows list the algorithms; the best value per column is marked with *. Empty cells are impossible combinations.

            5%        10%       30%       50%       100%
  G-none    0.56849   0.56849   0.56849   0.56846   0.56846
  G-sgrd    0.59689*  0.59677*  0.59617*  0.59557*  0.58852*
  M-none    0.56616   0.56595   0.56431   0.55364
  M-sgrd    0.59613   0.59592   0.59453   0.59154

[Figure 4.2: Modularity and Runtime by Reduction Factor. (a) Modularity by Reduction Factor (reduced set); (b) Reduction Factor vs. Modularity and Runtime on DIC28 main.]

Table 4.4: Runtime by Reduction Factor. Columns are the reduction factor, rows list the coarsening method. The cells contain the runtime in seconds and the number of coarsening levels. Empty cells are impossible combinations.

(a) Runtime and coarsening levels on DIC28 main

            5%              10%            30%            50%            100%
  G-none    50.084s (123)   26.579s (60)   17.285s (19)   16.785s (10)   17.152s (2)
  G-sgrd    136.239s (123)  80.033s (60)   40.683s (19)   33.463s (10)   30.903s (2)
  M-none    45.77s (123)    22.2s (60)     11.614s (20)   11.235s (17)
  M-sgrd    139.139s (123)  72.939s (60)   36.83s (20)    32.077s (17)

(b) Runtime and coarsening levels on Lederberg main

            5%              10%            30%            50%            100%
  G-none    30.289s (129)   13.518s (61)   6.85s (18)     5.604s (10)    4.558s (2)
  G-sgrd    40.514s (129)   19.513s (61)   9.851s (18)    8.261s (10)    7.656s (2)
  M-none    56.541s (200)   47.313s (200)  41.586s (200)  39.505s (200)
  M-sgrd    62.994s (200)   49.97s (200)   43.598s (200)  43.89s (200)

As another extreme, the times measured on the graph Lederberg main are listed in Table 4.4b. While the mean degree of this graph is around 10, it contains some high-degree vertices, with an extreme vertex having 1103 incident edges. Here the matching algorithm reached the maximal number of coarsening levels (200) regardless of the reduction factor.

Concerning grouping vs. matching, the table and plot of mean modularities show that greedy grouping always performed slightly better than matching. Moreover, the behavior of the matching algorithm on the graph Lederberg main confirmed that it is unreliable. Hence greedy grouping should be used as the coarsening method.
4.2.3 Reduction Factor

In order to find the best reduction factor, the mean modularities of Table 4.3 from the previous subsection are reused. In addition, Figure 4.2b plots the mean modularity of G-sgrd and the runtime on the graph DIC28 main against the reduction factor. Without refinement the reduction factor has no real influence on greedy grouping (G-none); only a small influence is visible with greedy matching (M-none), presumably because with more coarsening levels it selects fewer bad cluster pairs. Combined with greedy refinement, smaller reduction factors improve the clusterings. However, halving the reduction factor from 10% to 5% doubles the runtime and the number of coarsening levels while the difference in mean modularity is very small. Therefore, as a trade-off between quality and runtime, the reduction factor 10% is a good choice.

4.3 Effectiveness of the Merge Selectors

The merge selector is used by the coarsening methods to rank cluster pairs in the selection process. This section experimentally compares the developed merge selectors to find the best one. First the best parameters for the random walk distance and reachability selectors are determined; then these are compared with the other merge selectors.

4.3.1 Random Walk Distance

The random walk distance selector has one parameter, RW length, which controls the length of the random walks. Here the lengths 1–5 are studied. The clusterings are compared using greedy grouping with refinement, named RWdist-sgrd, and without, named RWdist-none. All other parameters are set to their default values.

Table 4.5 summarizes the mean modularities obtained with the 10 configurations on the reduced graph set. The columns contain the length of the random walks and the rows the two basic algorithm combinations; the best mean modularity of each row is marked (*). The same values are also plotted in Figure 4.3a. As reference, the dashed lines mark the mean modularity produced by the standard weight density selector with refinement (WD-sgrd) and without (WD-none). Appendix B.1 contains a detailed table of the clusterings produced on each graph.

                  1         2         3         4         5
RWdist-none    0.48510   0.54798   0.54424   0.55213   0.55495*
RWdist-sgrd    0.53202   0.57045   0.56610   0.57051   0.57226*

Table 4.5: Mean modularity with Random Walk Distance. Columns contain the length of random walks and rows contain the algorithm.

[Figure 4.3: The Random Walk Distance. (a) Mean modularity vs. length of random walks for RWdist-sgrd and RWdist-none, with WD-sgrd and WD-none as dashed reference lines; (b) runtime [s] vs. random walk length on DIC28 main.]
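The dominant cost of this selector is the computation of t-step visit probabilities for the random walks. The following sketch shows this core computation; the dense probability vectors and the concrete data types are simplifying assumptions for illustration (the selector itself is defined in Section 3.3.2), but they make the cost visible: every additional step is another pass over all edges, and one probability vector is needed per start vertex.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Edge { std::size_t target; double weight; };
    using Graph = std::vector<std::vector<Edge>>;  // weighted adjacency lists

    // Visit probabilities of a random walk of length t starting at vertex s.
    // Each step is one pass over all edges (a sparse matrix-vector product
    // with the transition matrix), so the cost is O(t * |E|) per start vertex.
    std::vector<double> visitProbabilities(const Graph& g, std::size_t s, unsigned t) {
        std::vector<double> p(g.size(), 0.0), next(g.size());
        p[s] = 1.0;
        for (unsigned step = 0; step < t; ++step) {
            std::fill(next.begin(), next.end(), 0.0);
            for (std::size_t u = 0; u < g.size(); ++u) {
                if (p[u] == 0.0) continue;
                double wu = 0.0;                  // weighted degree of u
                for (const Edge& e : g[u]) wu += e.weight;
                if (wu == 0.0) continue;          // isolated vertex: walk cannot continue
                for (const Edge& e : g[u])        // distribute p[u] over the neighbors
                    next[e.target] += p[u] * e.weight / wu;
            }
            p.swap(next);
        }
        return p;
    }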
Finally, Figure 4.3b compares the runtime on the graph DIC28 main. The values with and without refinement are plotted to provide a time scale relative to the refinement costs.

The first table and figure show that the random walk distance selector becomes slightly better with longer random walks. However, the mean modularities are very low compared to the weight density selector: with refinement, RWdist is barely as good as the weight density already is without any refinement. In fact, RWdist found better clusterings only on the two smallest graphs (SouthernWomen and dolphins). At the same time, computing this selector is extremely expensive compared to the refinement. Therefore it is removed from the further evaluation.

4.3.2 Random Walk Reachability

The random walk reachability modifies the edge weights using short random walks. This modification operator can be applied to its own results, and the parameter RW iterations controls the number of applications; here the values 1, 3, and 5 are considered. Again the parameter RW length defines the length of the random walks, and the values 1–4 are used. The combination of length one with a single iteration is equal to the original weight density selector; because iterated application does not change the weights in this case, the other iteration counts were excluded for this combination. The remaining setup is identical to the previous evaluations, and the two algorithms are named RWreach-sgrd and RWreach-none.

Table 4.6 summarizes the mean modularities produced by the 20 configurations on the reduced set of graphs. The columns contain the length of the walks and the rows the two algorithms RWreach-sgrd and RWreach-none combined with the number of iterations.

                     1         2         3         4
RWreach-none 1    0.56852   0.57150   0.57218   0.57380
RWreach-sgrd 1    0.59678   0.59616   0.59663   0.59676
RWreach-none 3       -      0.57530   0.57672   0.57468
RWreach-sgrd 3       -      0.59658   0.59609   0.59760
RWreach-none 5       -      0.57720   0.57102   0.55942
RWreach-sgrd 5       -      0.59732   0.59464   0.59016

Table 4.6: Mean modularity with Random Walk Reachability. Columns contain the length of random walks and rows contain the algorithm combined with the number of iterations. Empty cells are impossible combinations.

Figure 4.4a shows a plot of the mean modularities vs. the random walk length. Again the dashed lines mark the mean modularity produced by the weight density selector. Appendices B.2 and B.3 contain two detailed tables of the modularities obtained on each graph with walks of length 2 and 3. In addition, Figure 4.4b shows the runtime measured on the graph DIC28 main; again the runtime difference between the sorted greedy refinement and no refinement provides a visual relation to the other components of the algorithm.

[Figure 4.4: The Random Walk Reachability. (a) Mean modularity vs. length of random walks for RWreach-sgrd and RWreach-none with 1, 3, and 5 iterations, with WD-sgrd and WD-none as dashed reference lines; (b) runtime [s] vs. length of random walks on DIC28 main.]

In the mean modularity plot some small improvements are visible without refinement; with refinement, however, these disappear completely. Three iterations seem to be slightly better than just one, while five iterations have a negative impact when combined with longer walks. In summary, no clear tendency or improvement was measurable. Maybe the benchmark graphs used are structurally unsuited for these merge selectors. For the further evaluation, walks of length two with three iterations (RW reachability (2,3)) are chosen as default. This combination can still be computed quickly and does not really degrade the mean modularity.
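Structurally, the iterated application is just a loop over a weight-modification operator. The sketch below (reusing Graph, Edge, and visitProbabilities from the previous listing) illustrates this structure; the concrete weight formula is only an assumed stand-in for the operator defined in Section 3.3.2, and a symmetric variant would average the two walk directions.

    // One application of a reachability-style weight modification (sketch):
    // the new weight of an edge (u,v) is taken from the probability that a
    // short random walk starting at u visits v.
    Graph reachabilityWeights(const Graph& g, unsigned walkLength) {
        Graph out = g;
        for (std::size_t u = 0; u < g.size(); ++u) {
            const std::vector<double> p = visitProbabilities(g, u, walkLength);
            for (Edge& e : out[u]) e.weight = p[e.target];
        }
        return out;
    }

    // Iterated application as controlled by the RW iterations parameter:
    // the operator is applied to its own output.
    Graph applyReachability(Graph g, unsigned walkLength, unsigned iterations) {
        for (unsigned i = 0; i < iterations; ++i)
            g = reachabilityWeights(g, walkLength);
        return g;
    }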
4.3.3 Comparison of Merge Selectors

To find the best merge selector in terms of effectiveness and efficiency, this section compares the mean modularities produced with each merge selector and their runtime. The larger graph collection is used to obtain more reliable mean values. As a side effect this allows verifying whether the results from the reduced graph set are representative. For the same reason greedy matching is considered again, to check whether it works better with merge selectors other than the weight density.

The studied configurations are composed of three components (merge selector, coarsening method, and refinement method), using the naming scheme Selector Coarsening-Refinement. Cluster pairs are selected either by the modularity increase as used in the greedy joining algorithm of Newman (MI), or by the weight density (WD), random walk reachability (RWR), spectral length (SPL), spectral length difference (SPLD), or spectral angle (SPA). As coarsening methods greedy grouping (G) and greedy matching (M) are used. The third component is the refinement algorithm; as before, no refinement (none) and sorted greedy refinement (sgrd) are used.

Table 4.7 displays the mean modularities of all 24 configurations on the large graph set. The rows contain the six merge selectors and the columns the four algorithm variants; the rows are sorted by their overall mean modularity. Figure 4.5 shows a plot of the same data.

         G-none    M-none    G-sgrd    M-sgrd
SPL     0.43653   0.44014   0.48521   0.51962
SPLD    0.45479   0.48348   0.49926   0.53231
SPA     0.52405   0.52252   0.54041   0.54341
MI      0.52963   0.51985   0.53968   0.54369
WD      0.52305   0.51866   0.54608   0.54529
RWR     0.52971   0.52768   0.54718   0.54744

Table 4.7: Mean Modularity of Merge Selectors. Columns contain the algorithms and rows the merge selectors.

[Figure 4.5: Mean Modularity of the Merge Selectors (large set). Mean modularity per merge selector (SPL, SPLD, SPA, MI, WD, RWR) for G-none, M-none, G-sgrd, and M-sgrd.]

In order to compare the runtime of the configurations, the times measured on the graphs DIC28 main and Lederberg main are shown in Figures 4.6a and 4.6b. A logarithmic scale is used to differentiate the runtimes more easily.

[Figure 4.6: Runtime of the Merge Selectors. (a) Runtime [s] by merge selector on DIC28 main; (b) runtime [s] by merge selector on Lederberg main; logarithmic scale, series G-none, M-none, G-sgrd, and M-sgrd.]

The first two spectral merge selectors, SPL and SPLD, did not perform well in terms of runtime and clustering results. This could also be expected from their mathematical derivation as approximations of the modularity increase. However, the modularity values of these two show that the matching (M) method is less sensitive to bad selectors than grouping (G). All other merge selectors (SPA, WD, MI, and RWR) yielded very similar clustering results when compared to the modularity improvement possible by refinement. This is an interesting observation because the modularity increase selector (MI) was expected to produce worse results than the weight density (WD). Very small modularity improvements over the other selectors were gained by the random walk reachability (RWR). With the better merge selectors, greedy matching (M) turned out to be as good as greedy grouping (G) when refinement was used; without refinement, grouping produced slightly better results.
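For reference, the quantity ranked by the MI selector has a simple closed form. Writing w(A,B) for the total edge weight between the vertex sets A and B, with each undirected edge counted in both directions (so that w(V,V) is twice the total weight), merging two clusters C_i and C_j changes the modularity under the standard degree-based null model by

    \Delta Q(C_i, C_j) = \frac{2\, w(C_i, C_j)}{w(V,V)} - \frac{2\, w(C_i, V)\, w(C_j, V)}{w(V,V)^2}

This is a well-known identity up to notational conventions; the exact notation of this work is fixed in Chapter 2. Density-based selectors build on the same inter-cluster weight w(C_i, C_j), only normalized differently, which makes the similar behavior of MI and WD observed above less surprising in hindsight.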
On DIC28 main, matching with refinement was slightly faster than grouping; but on the graph Lederberg main, with its extremely skewed distribution of vertex degrees, greedy grouping was always considerably faster. Altogether the weight density selector (WD) often had the lowest runtime while still producing good mean modularities. Therefore it will be used as the default merge selector. On the large graph set the grouping and matching methods were comparably good in terms of mean modularity; this result differs slightly from the measurements on the reduced graph set.

4.4 Effectiveness of the Cluster Refinement

This section compares the multi-level refinement algorithms against raw graph coarsening without refinement. The first subsection concentrates on the greedy refinement variants to select the best one in terms of modularity improvement and search speed. The second subsection then compares the best greedy refinement against the presented Kernighan-Lin method.

4.4.1 Greedy Refinement

The complete greedy refinement method moves the vertex with the best modularity increase in each move step until no further improvement is possible. However, finding the best vertex is computationally expensive because it requires evaluating all possible moves of all vertices in each step. For this reason the sorted greedy refinement was introduced. It saves search time by trying to move the vertices in a specific order according to a vertex selector. In the following evaluation the best of the available vertex selectors is selected, and finally it is studied whether the complete greedy refinement can be replaced by the sorted variant without degrading the results too much.

The three sorted greedy refinement methods are SGR-density using the density-fitness vertex selector, SGR-mod using the mod-fitness, and SGR-eo using the eo-fitness. The complete greedy refinement is named CGR. To provide a significance scale, Kernighan-Lin refinement (KL) and no refinement (none) are included. All other parameters are set to their default values, i.e. greedy grouping by weight density is used for the coarsening.

Table 4.8 summarizes the mean modularity produced by each refinement algorithm on the reduced graph set, together with the runtime and the modularity on the graph DIC28 main. The rows are sorted by their mean modularity. In addition to the table, the bar plot in Figure 4.7a visually compares the mean modularities; the runtime is plotted in Figure 4.7b.

               mean modularity   time DIC28 main [s]   mod. DIC28 main
none                0.56849             24.119              0.80154
SGR-mod             0.59661             75.935              0.84754
SGR-eo              0.59664             74.894              0.84727
CGR                 0.59672            798.178              0.84758
SGR-density         0.59677             76.716              0.84747
KL                  0.59792           4672.121              0.84781

Table 4.8: Mean Modularity by Refinement Method (reduced set). The first column contains the mean modularities; the second and third columns list the runtime and the modularity on the graph DIC28 main.

[Figure 4.7: Mean Modularity by Refinement Method (reduced set). (a) Mean modularity for none, SGR-mod, SGR-eo, CGR, SGR-density, and KL; (b) runtime [s] by refinement method on DIC28 main.]

Between the three sorted greedy variants (SGR-density, SGR-mod, and SGR-eo) no significant differences in modularity are visible. Their clustering results are also comparable to complete greedy refinement (CGR). On the graph DIC28 main the complete greedy refinement was about 10 times slower than any sorted greedy refinement, in agreement with its higher worst-case complexity. Therefore sorted greedy refinement with the density-fitness vertex selector (SGR-density) is chosen as the default greedy refinement method.
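To illustrate the difference between the two variants, the following sketch shows the skeleton of the sorted approach: the vertices are ordered once per pass by a cheap fitness key, and each vertex is then moved to its best neighboring cluster if that increases the modularity, instead of searching for the globally best move in every single step. The names and the two callbacks are assumptions for illustration, not the interface of the actual implementation.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // One run of sorted greedy refinement (sketch). `fitness` orders the
    // vertices (e.g. a density-based key); `tryBestMove` moves a vertex to
    // its best neighboring cluster if that increases modularity and reports
    // success. Passes are repeated until no vertex could be moved anymore.
    void sortedGreedyRefinement(
        std::size_t numVertices,
        const std::function<double(std::size_t)>& fitness,
        const std::function<bool(std::size_t)>& tryBestMove) {
        std::vector<std::size_t> order(numVertices);
        for (std::size_t v = 0; v < numVertices; ++v) order[v] = v;
        bool improved = true;
        while (improved) {
            improved = false;
            std::sort(order.begin(), order.end(),
                      [&](std::size_t a, std::size_t b) { return fitness(a) > fitness(b); });
            for (std::size_t v : order)
                if (tryBestMove(v)) improved = true;  // only local candidate moves per vertex
        }
    }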
4.4.2 Kernighan-Lin Refinement

This subsection analyzes how much the refinement variants improve the clustering results compared to no refinement. This includes the question whether Kernighan-Lin refinement performs significantly better than greedy refinement. The algorithms are configured as in the previous subsection; considered are the variants none, SGR-density, CGR, and KL. The evaluation is applied to the large graph set to obtain more reliable mean modularity values. Table 4.9 lists the produced mean modularity values and Figure 4.8 provides a bar plot of the same values. Appendix B.4 contains a table of the single modularity values obtained by the algorithms on each graph.

               mean modularity
none               0.52305
SGR-density        0.54608
CGR                0.54610
KL                 0.54810

Table 4.9: Mean Modularity by Refinement Method (large set)

[Figure 4.8: Mean Modularity by Refinement Method (large set). Bar plot of the mean modularity for none, SGR-density, CGR, and KL.]

On the large graph set the mean modularity was improved by sorted greedy refinement (SGR-density) by 4.4% compared to no refinement.¹ This range provides the significance scale already used in the previous evaluations. In comparison, Kernighan-Lin refinement (KL) improved the results by 4.79%; the mean improvement of Kernighan-Lin refinement over sorted greedy refinement was 0.37%. The runtime and modularity values of the graph DIC28 main from the previous subsection show that the Kernighan-Lin refinement was about 60 times slower than sorted greedy refinement, while improving the modularity there by just 0.04%. As on the reduced set, sorted greedy refinement (SGR-density) was as good as the complete greedy refinement (CGR) on the large graph set.

¹ max/min · 100%

Altogether, Kernighan-Lin refinement reliably improves the clusterings by a small amount but requires considerably more runtime. Therefore it should be used in place of sorted greedy refinement only when the best clusterings are sought. Similar or better modularity improvements might easily be achievable by other refinement methods in less time.
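A Kernighan-Lin-style pass, in contrast to pure greedy refinement, also accepts temporarily worsening moves: it builds a sequence of best single-vertex moves in which each vertex is moved at most once, remembers the prefix of this sequence with the highest modularity, and rolls back everything after it. The following compressed sketch of one pass uses placeholder callbacks and illustrates only this classic scheme, not the tuned variant evaluated above.

    #include <cstddef>
    #include <functional>

    // One Kernighan-Lin style pass (sketch). `bestUnlockedMove` performs the
    // best move among the not-yet-moved vertices (possibly decreasing the
    // modularity) and returns the resulting modularity; `rollbackTo(k)`
    // undoes all moves after the k-th one. Returns true if the pass improved
    // the clustering; passes are repeated while this holds.
    bool kernighanLinPass(std::size_t numVertices,
                          double startModularity,
                          const std::function<double()>& bestUnlockedMove,
                          const std::function<void(std::size_t)>& rollbackTo) {
        double bestQ = startModularity;
        std::size_t bestPrefix = 0;
        for (std::size_t step = 1; step <= numVertices; ++step) {
            const double q = bestUnlockedMove();  // each vertex is moved at most once
            if (q > bestQ) { bestQ = q; bestPrefix = step; }
        }
        rollbackTo(bestPrefix);                   // keep only the best prefix of moves
        return bestPrefix > 0;
    }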
4.5 Scalability

The purpose of this section is to experimentally study how well the runtime of the algorithms scales with the graph size. The considered configurations are the multi-level Kernighan-Lin refinement (KL), sorted greedy refinement by density-fitness (SGR-density), complete greedy refinement (CGR), and the raw graph coarsening without refinement (none). For all other parameters the default values are used. The runtime of all graphs and algorithms has to be measured on the same computer; thus only a subset of 24 graphs from the large graph collection is used. These graphs are also marked in the graph table in the appendix.

Figure 4.9a shows the total runtime of the algorithms versus the number of vertices. In addition, Figure 4.9b shows the computation time of the coarsening phase, which equals the configuration none; the time spent on sorted greedy refinement alone is included as well. In order to show the influence of the edges, the names and edge counts of three larger graphs are inserted into the figure. The complete runtime measurements for each graph can be found in Table B.5 in the appendix.

[Figure 4.9: Runtime by Graph Size. (a) Total runtime [s] vs. number of vertices for none, SGR-density, CGR, and KL, with eatRS (305501 edges), hep-th-new main (352059 edges), and DIC28 main (71014 edges) annotated; (b) coarsening and greedy refinement time vs. number of vertices.]

Of course the runtime depends not only on the number of vertices but also on the edges. On the other hand, the number of edges mostly scales with the vertices because nearly all graphs have a similar mean vertex degree. The graphs eatRS and hep-th-new main are the two largest graphs of the collection; both have more than three times as many edges as other graphs of similar vertex count (for example DIC28 main). In the second figure it is visible that coarsening and refinement also take around three times longer on these graphs. Kernighan-Lin and complete greedy refinement depend even more strongly on the number of edges. Both coarsening and sorted greedy refinement were equally fast on most graphs and scale much better with the graph size than complete greedy and Kernighan-Lin refinement.

4.6 Comparison to Reference Algorithms

This section compares the developed algorithm to other publicly available algorithms. Many of these are implemented as proof of concept and only work on unweighted graphs without self-edges. Therefore the evaluation is based on the 19 graphs of the collection usable with all implementations; information about the graphs can be found in Appendix A.1. First the reference algorithms are introduced shortly, then the modularity of the clusterings produced by the algorithms and their runtime are compared.

The following algorithms are compared. For each a short name is given to be used in tables and summaries. One of the most popular algorithms is the fast greedy joining (fgj) of Clauset, Newman, and Moore [15].² It is an agglomeration method selecting cluster pairs by the highest modularity increase. Recently the modified greedy joining of Wakita and Tsurumi [76] also became available.³ It selects cluster pairs by the highest modularity increase multiplied with a consolidation ratio min(|Ci|/|Cj|, |Cj|/|Ci|); for example, merging clusters of sizes 2 and 10 scales the modularity increase by min(2/10, 10/2) = 0.2, penalizing uneven merges.
The wakita HN variant uses the number of vertices as the cluster size |Ci|, and the wakita HE variant uses the number of edges leaving the cluster.

² available at http://cs.unm.edu/~aaron/research/fastmodularity.htm
³ available at http://www.is.titech.ac.jp/~wakita/en/

A few other algorithms are available through the igraph library of Csárdi and Nepusz [16].⁴ The following were used. The spinglass algorithm of Reichardt and Bornholdt [67] optimizes modularity by simulated annealing on a physical energy model; the implementation requires an upper bound on the number of clusters, and in this evaluation the upper bound 120 was used for all graphs. A spectral method is Newman's [58] recursive bisection based on the first eigenvector of the modularity matrix (leadingev). Finally, an agglomeration method based on random walks is the walktrap algorithm of Pons and Latapy [64].

⁴ available at http://igraph.sourceforge.net/

For comparison, the standard configuration with three refinement variants was chosen. All use greedy grouping by weight density with 10% reduction factor. The variant without any refinement (ML-none) is similar to the other pure agglomeration methods, namely fgj, wakita HN, and wakita HE. The two other refinement variants are sorted greedy refinement by density-fitness (ML-sgrd) and the improved Kernighan-Lin refinement (ML-KL).

Table 4.10 lists the modularity of the clusterings found by the implementations. The graphs are sorted by the number of vertices and the algorithms by mean modularity; the last row shows the arithmetic mean modularity for each algorithm. For each graph the algorithms with the best result are marked (*). A similar table comparing the runtime of the algorithms is included in Appendix B.6.

graph           walktrap  leadingev  wakita HE  wakita HN     fgj     ML-none  spinglass  ML-sgrd    ML-KL
Chain8           0.31633   0.31633    0.37755*   0.31633    0.35714   0.35714   0.37755*  0.35714   0.37755*
Star9           -0.21875   0.00000*  -0.078125  -0.078125  -0.00781  -0.00781  -0.00781   0.00000*  0.00000*
K55              0.00000   0.30000*   0.30000*   0.30000*   0.00000   0.30000*  0.30000*  0.30000*  0.30000*
Tree15           0.47449   0.50510    0.51276*   0.50510    0.50510   0.50510   0.50510   0.50510   0.51276*
ModMath main     0.34641   0.42287    0.31712    0.35488    0.41051   0.42018   0.44880*  0.42905   0.44880*
SouthernWomen    0.00000   0.21487    0.22775    0.24157    0.31467   0.31530   0.33121   0.31972   0.33601*
karate           0.39908   0.37763    0.35947    0.41880    0.38067   0.40869   0.41979*  0.41979*  0.41979*
mexican power    0.22869   0.29984    0.33772    0.31949    0.33089   0.32811   0.35119   0.34776   0.35952*
Grid66           0.47833   0.55000*   0.42222    0.49236    0.49597   0.51319   0.53250   0.55000*  0.54125
Sawmill          0.40153   0.44745    0.49102    0.48660    0.55008*  0.55008*  0.53863   0.55008*  0.55008*
dolphins         0.44043   0.48940    0.48687    0.44838    0.49549   0.51786   0.52852*  0.52587   0.52587
polBooks         0.49927   0.39922    0.46585    0.49138    0.50197   0.50362   0.52640   0.52724*  0.52561
adjnoun          0.17640   0.22153    0.24365    0.25766    0.29349   0.28428   0.31336*  0.31078   0.31078
sandi main       0.78157   0.78667    0.80995    0.82054    0.82729   0.82083   0.81619   0.82773*  0.82773*
USAir97          0.28539   0.26937    0.34439    0.31526    0.32039   0.34353   0.36597   0.36824*  0.36824*
circuit s838     0.70256   0.70215    0.76499    0.79488    0.80472   0.79041   0.81481   0.81551*  0.81551*
CSphd main       0.88971   0.74219    0.91382    0.92001    0.92470   0.92484   0.90584   0.92558*  0.92558*
Erdos02          0.62534   0.59221    0.66991    0.67756    0.67027   0.68484   0.70985   0.71592   0.71611*
DIC28 main       0.70285   0.70923    0.73570    0.74179    0.78874   0.80154   0.81747   0.84747   0.84781*
average          0.39630   0.43927    0.45803    0.46445    0.47180   0.49272   0.50502   0.50753   0.51100

Table 4.10: Clustering Results of Reference Algorithms
Figure 4.10 summarizes the mean modularity achieved by each algorithm and the measured runtime.

[Figure 4.10: Clustering Results and Runtime of the Reference Algorithms. (a) Mean modularity by algorithm; (b) runtime [s] of the algorithms vs. graph size.]

It is visible that on average greedy grouping by density (ML-none) is better than all other agglomeration methods (fgj, wakita HE, wakita HN) even without refinement; yet on a few graphs the other agglomeration methods perform slightly better. Looking at the multi-level refinement methods (ML-sgrd, ML-KL), only the spinglass algorithm of Reichardt and Bornholdt produces comparably good clusterings. However, it is much slower: on nearly all graphs it is outperformed by the multi-level Kernighan-Lin refinement (ML-KL) in terms of both modularity and runtime. The three slowest implementations were spinglass, ML-KL, and leadingev; all other algorithms had a relatively constant, low runtime.

4.7 Comparison to Published Results

This section compares the presented multi-level refinement method against other clustering algorithms. For many of these algorithms clustering results are published in the papers presenting them. Because of their size, clusterings are printed directly only for very small graphs; commonly just the modularity value of the clusterings is published. The following discussion is based on the algorithms listed in Section 2.4 about fundamental clustering methods. First some general problems of this evaluation method are discussed, then for each graph the modularity values found in the articles are presented. The section concludes with a small summary table comparing the best found values to the own results. Only modularity values published together with the original algorithms were considered, and for each value the source article is cited.

The modularity values are compared against two multi-level Kernighan-Lin refinement algorithms. Both use greedy grouping with 10% reduction factor and Kernighan-Lin refinement. As merge selector, the weight density is employed by the ML-KL-density variant and the random walk reachability (2,3) by the ML-KL-rw variant. The first variant is the default configuration identified by the previous evaluation; the other variant was chosen in addition because, for the degree volume model used here, it is in many cases able to find better clusterings.

The comparison of printed modularity values entails some general problems. Often only a few digits are printed to save space; with just three digits it is difficult to check whether the same or merely similar clusterings were found. In this regard, also printing the number of clusters would be helpful. Furthermore, the calculation of modularity may be done slightly differently: several variants of the modularity measure exist, small variations in the handling of self-edges are possible, and unweighted edges might have been used instead of the available weighted version. Finally, small diversities can arise in preprocessing and graph conversion through different strategies to obtain undirected, symmetric graphs. Here just the last problem is addressed, by excluding graphs whose published numbers of vertices and edges differ from the own version. This applies to most graphs in [64, 25, 69].
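For orientation, the variant most commonly underlying the cited values can be written as follows, where w(A,B) denotes the total edge weight between the vertex sets A and B with every undirected edge counted in both directions; the self-edge and weighting conventions mentioned above all enter through w:

    Q(\mathcal{C}) = \sum_{C \in \mathcal{C}} \left( \frac{w(C,C)}{w(V,V)} - \left( \frac{w(C,V)}{w(V,V)} \right)^{2} \right)

This matches Newman's formulation up to such conventions; the precise definition used in this work is given in Chapter 2.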
4.7.1 The Graphs and Clusterings

In the following paragraphs each graph is shortly presented, and thereafter the clustering results are discussed. For the smaller graphs example pictures are included; the layout was computed with the LinLog energy model [62].⁵ The best clustering computed by any configuration of the own multi-level algorithm is shown as vertex color.

⁵ available at http://www.informatik.tu-cottbus.de/~an/GD/

[Figure 4.11: Reference Graphs. (a) karate; (b) dolphins]

karate  This is Zachary's karate club study [82], maybe the most studied social network. A picture of the graph colored with the optimum clustering is shown in Figure 4.11a. During the 2-year-long study, 34 members of a karate club were observed and the number of contacts outside of the club were counted. As in most other studies of this network, the simple unweighted version is used here. The instructor of the club (vertex #1) resigned because of conflicts with the club administrator (vertex #34) about lesson fees. His supporters followed him, forming a new organization with the members 1 to 9, 11 to 14, 17, 18, 20, and 22. Person 9 joined this faction although he was a supporter of the club president, because he was only three weeks away from the test for black belt when the split occurred and had to follow the instructor to retain this rank.

The greedy joining algorithm of Newman found 2 clusters with modularity 0.381 [60]. By modifying the merge selector, Danon found a clustering with modularity 0.4087 [17]. The spectral clustering algorithms of Donetti et al. yielded modularity 0.412 [21] and 0.419 [20] in the improved version; also Newman's recursive spectral bisection found modularity 0.419 [58]. Surprisingly, the algorithms based on random walks found only clusterings with 0.3937 [65] and 0.38 [64]. Extremal Optimization, with 4 clusters of modularity 0.4188 [23], fell short of the optimum. As verified by [2], the global optimum clustering has 4 clusters with modularity 0.4197. The same clustering was found by both multi-level algorithms.

dolphins  The dolphin social network is an undirected graph of frequent associations between 62 dolphins in a community residing in Doubtful Sound, New Zealand [51, 50]. For a picture see Figure 4.11b. This graph appeared in the context of modularity clustering only recently, so very few published results exist. With Fractional Linear Programming, a clustering of modularity 0.529 and the upper bound 0.531 were found [2]; the exact Integer Linear Programming also gave an optimum clustering with modularity 0.529 [13]. Using multi-level refinement, 5 clusters with modularity 0.52586 (ML-KL-density) and 0.52772 (ML-KL-rw) were found. The best clustering found with experimental configurations had modularity 0.52851, and the same clustering was found by the spinglass algorithm in the previous section.

polBooks  A network of books about recent US politics sold by the online bookseller Amazon.com. Edges between books represent frequent co-purchasing of books by the same buyers (Fig. 4.12a).
The network was compiled by Valdis Krebs [47]. With Fractional Linear Programming, a clustering of modularity 0.5272 and an upper bound of 0.528 were found [2]; this result was confirmed by the exact Integer Linear Programming method with modularity 0.527 [13]. The multi-level algorithm ML-KL-rw and very many other configurations found 5 clusters with modularity 0.52723. Surprisingly, the ML-KL-density variant found just 0.52560 with 6 clusters, although greedy refinement in the same configuration found the better clustering.

[Figure 4.12: Reference Graphs. (a) polBooks; (b) afootball]

afootball  A real-world graph with known group structure is the games schedule of Division I of the United States college football for the 2000 season [31]. The 115 teams are grouped into conferences of 8–12 teams plus some independent teams. On average, seven intra- and four inter-conference games are played, and inter-conference games between geographically close teams are more likely. A picture of the graph is shown in Figure 4.12b.

Newman's greedy joining algorithm found 6 clusters with modularity 0.546 [60], whereas the spectral methods of White et al. found 11 clusters with 0.602 [80]. Similarly, the random walk based agglomeration of Pons found modularity 0.60 [64]. With Fractional Linear Programming and rounding, a clustering of 0.6046 and the upper bound 0.606 were found [2]. The random walk based multi-level clustering ML-KL-rw comes close to this bound with 10 clusters and modularity 0.60582; the simpler variant ML-KL-density got just modularity 0.59419 with 7 clusters.

[Figure 4.13: Reference Graphs. (a) jazz; (b) celegans metabolic]

jazz  A network of 198 jazz musicians compiled by Gleiser and Danon [33]. Such graphs are interesting for the social sciences, as jazz musicians very often work together in many different small groups (Fig. 4.13a). This graph also became very popular for the comparison of clustering methods. The best clustering was reported for Extremal Optimization with 5 clusters of modularity 0.4452 [23]. Fractional Linear Programming with rounding gave modularity 0.445 [2] and an upper bound of 0.446. Spectral clustering methods yielded modularity 0.437 [21], 0.444 [20], and 0.442 [58]. A modularity of 0.4409 [17] was produced by Danon's greedy joining with modified merge selector. Both multi-level algorithms found 4 clusters, with modularity 0.44514 (ML-KL-rw) and 0.44487 (ML-KL-density).

celegans metabolic  A metabolic network for the nematode Caenorhabditis elegans [42, 23]. The graph describes chemical reactions as well as the regulatory interactions that guide these reactions (Fig. 4.13b). The first published clustering result for this graph was found by Extremal Optimization, with 12 clusters of modularity 0.4342 [23]. Greedy joining with pre-coarsening by short random walks found 7 clusters with modularity 0.4164 [65], and Newman's recursive spectral bisection yielded 0.435 [58]. The recursive bisection algorithm with Quadratic Programming of Agarwal and Kempe [2] found the to date best published clustering with modularity 0.450. The two multi-level refinement algorithms gave similar results with 0.44992 (ML-KL-rw) and 0.45090 (ML-KL-density); the best clustering with experimental configurations had modularity 0.45136.

circuit s838  The electric circuit s838 st.txt of Uri Alon's collection of complex networks [53].
This graph is an undirected version with 512 vertices connected by 819 edges (Fig. 4.14a). The only published clustering has modularity 0.815 with 13 clusters [69]. Both multi-level configurations found clusterings of similar modularity: the random walk based greedy grouping ML-KL-rw is, with 0.815507, slightly worse than ML-KL-density with the weight density selector, which reached 0.815513. However, better clusterings exist; for example, with experimental configurations a clustering with 12 clusters and modularity 0.81719 was found.

[Figure 4.14: Reference Graphs. (a) circuit s838; (b) email]

email  The network of e-mail interchanges between members of the University Rovira i Virgili in Tarragona, Spain [36, 37]. The graph contains 1133 users of the university e-mail system, including faculty, researchers, technicians, managers, administrators, and graduate students. Two users are connected by an edge if both exchanged e-mails with each other (Fig. 4.14b); thus the graph is undirected and unweighted. In recent years this graph, too, became popular for the evaluation of graph clustering methods. The first published result is for Extremal Optimization, which found 15 clusters and modularity 0.5738 [23]. Danon's greedy joining with modified merge selector yielded just modularity 0.5569 [17], and the recursive spectral bisection of Newman found a clustering with modularity 0.572 [58]. The best published clustering, with modularity 0.579, was found by recursive bisection with Quadratic Programming [2]. The multi-level algorithm ML-KL-density produced 11 clusters with modularity 0.57767; a better clustering with 9 clusters and modularity 0.58137 was obtained from the random walk variant ML-KL-rw.

hep-th main  A scientist collaboration network in high-energy physics. The graph contains the collaborations of scientists posting preprints on the high-energy theory archive at http://www.arxiv.org between 1995 and 1999. The 5835 vertices represent authors and edges connect co-authors; the edges are weighted as described in [61]. The only published clustering is from a recursive spectral bisection method, which produced 114 clusters with modularity 0.707 [21]. The multi-level algorithm ML-KL-density found 57 clusters with 0.85523; a better clustering with 59 clusters and modularity 0.85629 was found by the random walk variant ML-KL-rw.

Erdos02  This is the year 2002 version of the collaboration network around Erdös [34]. The vertices represent authors and the unweighted edges connect co-authors; only authors with a co-author distance from Erdös of up to 2 are included. The only published clustering was found with greedy joining combined with random walk based pre-coarsening; it contains 20 clusters and has modularity 0.6817 [65]. The multi-level algorithm ML-KL-density produced 39 clusters with modularity 0.71611. A slightly better clustering with the same number of clusters and modularity 0.716396 was found by the random walk variant ML-KL-rw.

PGPgiantcompo  A giant component of the network of users of the Pretty-Good-Privacy algorithm for secure information interchange, compiled by Arenas et al. [12, 35]. The component contains 10680 vertices, and edges connect users trusting each other. The modified greedy joining of Danon found the worst clustering with modularity 0.7462 [17]. Extremal Optimization found 365 clusters with modularity 0.8459 [23], and the best published modularity, 0.855, was achieved by Newman's spectral bisection [58]. The best of the two multi-level algorithms was ML-KL-rw, which found 96 clusters with modularity 0.88462.
The greedy grouping by density, ML-KL-density, yielded 102 clusters with 0.88392. Finally, the best clustering found in own experiments so far has 94 clusters and modularity 0.885450.

cond-mat-2003 main  A collaboration network of scientists posting preprints on the condensed matter archive at http://www.arxiv.org. This version contains data from 1995 to 2003. Vertices represent authors and edges connect co-authors; the edges are weighted as described in [61]. Here only the largest component, containing 27519 vertices, is used. Extremal Optimization discovered 647 clusters with modularity 0.6790 [23]. Later, much better clusterings were found: Newman's recursive spectral bisection yielded 0.723 [58], and greedy joining combined with pre-coarsening found 44 clusters with modularity 0.7251 [65]. Of the own algorithms, the multi-level refinement ML-KL-density produced 77 clusters with 0.81377, and a better clustering with 78 clusters and modularity 0.81586 was found by ML-KL-rw. However, even better clusterings were discovered in experiments, for example one with 82 clusters and modularity 0.81718.

4.7.2 Summary

Table 4.11 summarizes the clustering results. For each graph, the best modularity found in the literature and the modularity computed with the two multi-level algorithms ML-KL-density and ML-KL-rw are given. The graphs are sorted by the number of vertices, the best modularity on each graph is marked (*), and the number of clusters, where known, is given in parentheses.

graph                  |V|      |E|    article  modularity      ML-KL-density    ML-KL-rw
karate                  34       78    [2]      0.4197 (4)      0.41978 (4)*     0.41978 (4)*
dolphins                62      159    [2]      0.529*          0.52586 (5)      0.52772 (5)
polBooks               105      441    [2]      0.5272          0.52560 (6)      0.52723 (5)*
afootball              115      613    [2]      0.6046*         0.59419 (7)      0.60582 (10)
jazz                   198     2742    [23]     0.4452 (5)*     0.44487 (4)      0.44514 (4)
celegans metabolic     453     2040    [2]      0.450           0.45090 (9)*     0.44992 (8)
circuit s838           512      819    [69]     0.815 (13)      0.81551 (15)*    0.81550 (15)
email                 1133     5451    [2]      0.579           0.57767 (11)     0.58137 (9)*
hep-th main           5835    13815    [21]     0.707 (114)     0.85523 (57)     0.85629 (59)*
Erdos02               6927    11850    [65]     0.6817 (20)     0.71611 (39)     0.71639 (39)*
PGPgiantcompo        10680    24316    [58]     0.855           0.88392 (102)    0.88462 (96)*
cond-mat-2003 main   27519   116181    [65]     0.7251 (44)     0.81377 (77)     0.81586 (78)*

Table 4.11: Comparison to Published Results

On the smaller graphs, optimal or at least very good clusterings were obtained with the Linear Programming methods from [2] and [13]; in these cases the multi-level algorithms found comparably good clusterings. Only few published results exist for big graphs, and there the multi-level algorithm ML-KL-rw performed much better than the other methods. Still, the clusterings found by experimental configurations show that on some graphs better results are possible. But these were found more by fortunate circumstances than by a reliable, systematic search.

5 Results and Future Work

The objective of this work was to develop and evaluate a multi-level refinement algorithm for the modularity graph clustering problem. The main focus was on the derivation and evaluation of better merge selection criteria for the graph coarsening, the implementation of more efficient data structures for coarsening and refinement, and the research and evaluation of applicable refinement algorithms. The following sections discuss the results of this work. First, immediate results, like the default configuration derived from the evaluation, are presented.
The second section shortly presents by-products developed during the course of this work, and the last section comments on possible directions for future work.

5.1 Results of this Work

Merge selection criteria direct the selection of cluster pairs in the coarsening phase of the multi-level algorithm; the selected clusters are merged until the clustering is not improved further. This merge process produces the initial clustering and the vertex groups for the later refinement. Therefore merge selectors play a crucial role in the overall algorithm. In this work a range of semi-global and global merge selectors was presented along with simple local selectors. The first group of selectors uses the visit probabilities of short random walks, and the second group applies spectral methods to analyze the global structure of the graph. The implemented variants were experimentally compared using mean modularities over a large collection of graphs. Of the seven developed selectors, four performed nearly equally well in terms of mean modularity. These are, ordered from best to worst results, the random walk reachability, the weight density, the modularity increase, and the spectral angle. Comparing runtime and simplicity, the weight density proved to produce good results in short time while being very simple to calculate.

Two coarsening algorithms were implemented. The greedy grouping method works like the greedy joining of Newman [60] by directly merging the selected pair of clusters in each step. Efficient implementations of this method require a dynamic graph data structure to quickly retrieve and update selection qualities; the implemented structure is similar to the algorithm of Wakita [76]. The second method is greedy matching and resembles the graph coarsening methods traditionally used with multi-level graph algorithms: it constructs a matching of some vertices using only the best-ranked pairs. The experimental comparison showed that, when combined with refinement, greedy matching is less sensitive to bad merge selectors than greedy grouping. However, the grouping method proved to be more reliable and produced on average slightly better results.

For developing effective and efficient refinement algorithms, the design space of simple methods was analyzed, with the main focus on algorithms moving single vertices between clusters. In the domain of greedy refinement methods this analysis allowed the development of a more efficient greedy algorithm. It is named sorted greedy refinement and selects the vertices to move using a simple move selector instead of the expensive-to-calculate modularity increase; three move selectors based on the vertex fitness concept were proposed. In addition, a hill-climbing refinement algorithm in the style of Kernighan and Lin [45] was implemented.

All refinement algorithms were able to improve the clustering results considerably compared to the raw coarsening phase. The sorted greedy refinement produced results as good as the traditional greedy refinement while being much faster, and its runtime scaled better with the graph size; the move selectors had an insignificant influence on the results. The Kernighan-Lin refinement was improved in its runtime, and it also produced slightly better results than greedy refinement. However, its bad scalability still makes it unusable for large graphs.
For example, on graphs with about 25,000 vertices it was around 60 times slower than the sorted greedy refinement.¹

¹ Measured on the graph DIC28 main, cf. Table B.5

The presented algorithm reflects the state of the art of multi-level refinement methods. In comparison to other clustering algorithms, the multi-level refinement was competitive in both clustering quality and runtime. However, studying clusterings produced with other configurations shows that improvements in reliability are still possible, as sometimes small variations of the parameters lead to better clusterings.

Summarizing the results, the following default configuration was identified: for the graph coarsening phase, greedy grouping by the weight density merge selector should be used with a reduction factor of 10%; for the refinement phase, the sorted greedy refinement with density-fitness should be used. When computation time is not a concern, Kernighan-Lin refinement can alternatively be used to find slightly better clusterings.

5.2 By-Products

As a by-product, the study of the mathematical properties of the modularity measure presented in Chapter 2 shed light on the relationship between modularity and other clustering quality measures. In this context the concept of connection density relative to a volume model was introduced, and a mathematical connection between density and modularity was established. The volume model describes expected edge weights derived from a vertex similarity measure. This enables the quick and easy adaptation of the modularity measure, together with the clustering algorithms, to specific application domains.

The implementation of the presented clustering algorithm is relatively flexible and configurable: the algorithm's components, like merge selectors and refinement algorithms, can be replaced without much effort. At the same time, the use of index spaces and maps supports the generic implementation of graph algorithms. This strategy turned out to be usable, and it is possible to transfer the key concepts into other programming languages.

Over the course of this project a small website was built up, collecting information about clustering algorithms, benchmark graphs, and related topics.² The site contains a rich set of data about the graph collection, including various eigenvalue spectra and other graph statistics. Clustering results are summarized for each graph, and clustering methods can be compared globally. Such a website could be used in the future as a reference for the evaluation of algorithms by providing a common collection of graphs and a tool chain to compare clusterings to older results. For example, in and around the load-balancing community, the collection of Walshaw et al. [74], containing graphs and partitioning results, helped to track the current state of the art and provided inspiration for new combined optimization strategies.

² presently at http://goya.informatik.tu-cottbus.de/~clustering/

5.3 Directions for Future Work

In this section a few ideas for potential improvements of the clustering algorithm are proposed. The first three subsections are mostly related to the graph coarsening phase, and the last two discuss high-level meta-strategies.

5.3.1 Pre-Coarsening

Performance and quality might be improved by applying a pre-coarsening to the initial graph: the vertex degree of many real-world networks follows a power-law distribution, and in that case most vertices have only a single edge attaching them to a more central vertex.
These vertices are called hairs and will certainly be grouped into the same cluster as their single neighbor; particularly in unweighted graphs, maximum modularity clusterings cannot contain clusters consisting of a single one-edge vertex [13]. Thus a special first coarsening pass should merge them in advance. This will improve the information available to the merge selector. Additionally, the greedy matching may become more effective and reliable because it is no longer obstructed by single-edge vertices. Similar preprocessing methods were already proposed by Arenas et al. [4] for the size reduction of graphs.

An implementation would collect the unweighted vertex degrees first. All hair vertices have unit degree and thus are easy to find; they are merged with their single neighbor vertex if this increases the modularity. In addition, some two-edge vertices might be reduced: these bridge vertices lie between two neighbors and could be merged with the better of the two.

Furthermore, this pre-coarsening is a strong counter-example to merge selectors based on vertex sizes or consolidation ratios. These try to prefer vertex pairs of similar size in order to grow clusters more evenly. But merging hairs with their much bigger neighbor is expected and necessary behavior in spite of the huge size difference.

5.3.2 Study of Merge Selectors

During the course of this work a measure for the prediction quality of merge selectors was developed. It counts the percentage of the vertex pairs with a high selection quality that would actually select a pair of vertices lying in the same final cluster; for this purpose a reference clustering is necessary. However, it turned out that this prediction quality alone does not say much about the general performance of a merge selector. In practice it is also important to study when and where coarsening errors, i.e. putting together vertices of different clusters, occur. For example, it might be that the spectral angle merge selector performs well on big, fine-grained graphs but fails on the coarser coarsening levels: on the fine levels all information is non-locally stretched, while the coarsening joins information about the structure and may produce good local information later on. In that case the weight density selector should be used on the coarser levels instead.

5.3.3 Linear Programming

It is possible to formulate the clustering problem as a linear or quadratic program [13, 2]. Instead of classic rounding techniques, the computed distances could be used as merge selection quality; this would enable multi-level refinement on the rounding results. The fractional linear program is solvable in polynomial time but requires |V|² space for the distance matrix and cubic space for the transitivity constraints. Thus it becomes impracticable already for medium-sized graphs unless the constraints are replaced by a more compact implicit representation. However, for the presented multi-level refinement, approximate distances between adjacent vertices would suffice. Maybe such approximations could be computed faster using other representations of the optimization aim; for example, in [6] an embedding into higher-dimensional unit spheres under the squared Euclidean norm was used for similar quality measures.
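For reference, the fractional relaxation used in [2, 13] has roughly the following shape, where x_uv is the relaxed indicator that the vertices u and v lie in different clusters and q_uv is the modularity contribution of placing them together (the exact coefficients depend on the formulation):

    \max \sum_{u<v} q_{uv}\,(1 - x_{uv}) \quad \text{subject to} \quad x_{uw} \le x_{uv} + x_{vw} \;\;\forall\, u,v,w, \qquad 0 \le x_{uv} \le 1

The O(|V|³) transitivity constraints are the cubic space cost mentioned above, and the values 1 - x_uv, read as a relaxed "same cluster" degree, are what could serve as merge selection qualities.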
5.3.4 Multi-Pass Clustering and Randomization

A meta-strategy similar to evolutionary search [72] is multi-pass clustering. Because the refinement corrects coarsening errors, the computed clustering contains valuable information. In a second pass this can be fed back into the coarsening by ignoring all vertex pairs crossing previous cluster boundaries. This effectively produces a corrected coarsening hierarchy and allows further improvements by the refinement heuristics. Coarsening and refinement are repeated until the clustering no longer improves. The multi-pass search may be widened by applying some kind of randomization during the coarsening.

5.3.5 High-Level Refinement Search

In the domain of cluster refinement several improvements might be possible. For example, restarting the Kernighan-Lin search on intermediate clusterings allows improving the search depth. In this context a slight randomization would help to break ties and to avoid cycling repeatedly through the same movement sequence; too strong a randomization, however, prevents the detection of local optima again. Still, the Kernighan-Lin approach is very slow because in modularity clustering the quality improvements of vertex moves are difficult to compute. On the other hand, in this work a very fast greedy refinement method was developed. Similarly, a fast randomized method to leave local optima could be developed; combining both would allow walking between local optima as in the basin hopping method [52]. However, some open questions remain as to how this can be effectively combined with the multi-level strategy. For example, currently a lot of information about the graph is lost between the coarsening levels because just the last best clustering is projected to the finer level.

Bibliography

[1] A. Abou-Rjeili and G. Karypis. Multilevel algorithms for partitioning power-law graphs. 20th International Parallel and Distributed Processing Symposium, page 10, 2006.

[2] Gaurav Agarwal and David Kempe. Modularity-maximizing network communities via mathematical programming. arXiv:0710.2533, October 2007.

[3] Charles J. Alpert, Andrew B. Kahng, and So-Zen Yao. Spectral partitioning with multiple eigenvectors. Discrete Applied Mathematics, 90:3–26, 1999.

[4] A. Arenas, J. Duch, A. Fernandez, and S. Gomez. Size reduction of complex networks preserving modularity. New Journal of Physics, 9:176, June 2007.

[5] A. Arenas, A. Fernández, and S. Gómez. Analysis of the structure of complex networks at different resolution levels. New Journal of Physics, 10:053039, 2008.

[6] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 222–231, 2004.

[7] James P. Bagrow and Erik M. Bollt. Local method for detecting communities. Physical Review E, 72:046108, October 2005.

[8] Vladimir Batagelj and Andrej Mrvar. Pajek datasets, 2006.

[9] M. Besch and H.W. Pohl. Dependence-free clustering of shift-invariant data structures. Proceedings of the Third International Euro-Par Conference on Parallel Processing, pages 338–341, 1997.

[10] Stefan Boettcher and Allon G. Percus. Extremal optimization for graph partitioning. Physical Review E, 64:026114, July 2001.

[11] Stefan Boettcher and Allon G. Percus. Optimization with extremal dynamics. Physical Review Letters, 86:5211, June 2001.

[12] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas. Models of social networks based on social distance attachment. Physical Review E, 70(5):56122, 2004.
[13] Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Görke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20:172–188, 2008.

[14] B.L. Chamberlain. Graph partitioning algorithms for distributing workloads of parallel computations. University of Washington Technical Report UW-CSE-98-10, 1998.

[15] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70:066111, December 2004.

[16] Gábor Csárdi and Tamás Nepusz. The igraph software package for complex network research. InterJournal Complex Systems, 1695, 2006.

[17] L. Danon, A. Díaz-Guilera, and A. Arenas. Effect of size heterogeneity on community identification in complex networks. arXiv preprint physics/0601144, 2006.

[18] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005:P09008, 2005.

[19] Hristo Djidjev. A scalable multilevel algorithm for graph clustering and community structure detection. Workshop on Algorithms and Models for the Web Graph, Lecture Notes in Computer Science, LA-UR-06-6261, 2006.

[20] L. Donetti and M. A. Muñoz. Improved spectral algorithm for the detection of network communities. physics/0504059, April 2005.

[21] Luca Donetti and Miguel A. Muñoz. Detecting network communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics: Theory and Experiment, 2004:P10012, 2004.

[22] Jonathan Doye and David Wales. On the thermodynamics of global optimization. cond-mat/9709019, September 1997. Phys. Rev. Lett. 80, 1357 (1998).

[23] Jordi Duch and Alex Arenas. Community detection in complex networks using extremal optimization. Physical Review E, 72:027104, 2005.

[24] C.M. Fiduccia and R.M. Mattheyses. A linear-time heuristic for improving network partitions. In Proceedings of the 19th Design Automation Conference, pages 175–181, 1982.

[25] Santo Fortunato and Marc Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36, 2007.

[26] Santo Fortunato, Vito Latora, and Massimo Marchiori. Method to find community structures based on information centrality. Physical Review E, 70:056104, November 2004.

[27] H. N. Gabow, Z. Galil, and T.H. Spencer. Efficient implementation of graph algorithms using contraction. Journal of the Association for Computing Machinery, 36(3):540–572, 1989.

[28] M. Gaertler, R. Görke, and D. Wagner. Significance-driven graph clustering. Proceedings of the 3rd International Conference on Algorithmic Aspects in Information and Management (AAIM'07), Lecture Notes in Computer Science, June 2007.

[29] J. Gerlach and P. Gottschling. A generic C++ framework for parallel mesh-based scientific applications. Proceedings of the 6th International Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 45–54, 2001.

[30] Jorge Gil-Mendieta and Samuel Schmidt. The political network in Mexico. Social Networks, 18:355–381, October 1996.

[31] M. Girvan and J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821, 2002.

[32] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99:7821–7826, June 2002.
[33] P. Gleiser and L. Danon. Community structure in jazz. Advances in Complex Systems, 6(4):565–573, 2003.

[34] Jerry Grossman. The Erdős Number Project, 2002.

[35] X. Guardiola, R. Guimerà, A. Arenas, A. Díaz-Guilera, D. Streib, and L. A. N. Amaral. Macro- and micro-structure of trust networks. arXiv:cond-mat/0206240, 2002.

[36] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas. Self-similar community structure in a network of human interactions. Physical Review E, 68:065103, December 2003.

[37] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas. The real communication network behind the formal chart: Community structure in organizations. Journal of Economic Behavior & Organization, 61:653–667, December 2006.

[38] Roger Guimerà and Luis A. Nunes Amaral. Functional cartography of complex metabolic networks. arXiv:q-bio/0502035, February 2005. Nature 433, 895–900 (2005).

[39] Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral. Modularity from fluctuations in random graphs and complex networks. Physical Review E, 70:025101, 2004.

[40] David Harel and Yehuda Koren. On clustering using random walks. Lecture Notes in Computer Science, 2245:18–41, 2001.

[41] Bruce Hendrickson and Robert W. Leland. A multilevel algorithm for partitioning graphs. Proc. Supercomputing '95, page 285, 1995.

[42] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. arXiv:cond-mat/0010278, October 2000. Nature 407, 651–654 (2000).

[43] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359, 1998.

[44] George Karypis and Vipin Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Review, 41(2):278–300, 1999.

[45] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2):291–307, 1970.

[46] J. Kleinberg. An impossibility theorem for clustering. Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference, 2003.

[47] Valdis Krebs. A network of books about recent US politics sold by the online bookseller Amazon.com. http://www.orgnet.com/.

[48] R. Lehoucq, K. Maschhoff, D. Sorenson, and C. Yang. ARPACK: An efficient portable large scale eigenvalue package. http://www.caam.rice.edu/software/arpack/, 1996.

[49] László Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty, 2:1–46, 1993.

[50] David Lusseau. Evidence for social role in a dolphin social network. arXiv:q-bio/0607048, July 2006.

[51] David Lusseau, Karsten Schneider, Oliver J. Boisseau, Patti Haase, Elisabeth Slooten, and Steve M. Dawson. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54:396–405, 2003.

[52] Claire P. Massen and Jonathan P. K. Doye. Identifying communities within energy landscapes. Physical Review E, 71:046101, April 2005.

[53] Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal Ayzenshtat, Michal Sheffer, and Uri Alon. Superfamilies of evolved and designed networks. Science, 303:1538–1542, March 2004.

[54] Burkhard Monien, Robert Preis, and Ralf Diekmann. Quality matching and local improvement for multilevel graph-partitioning. Parallel Computing, 26:1609–1634, November 2000.
[55] M. Müller-Hannemann and A. Schwartz. Implementing weighted b-matching algorithms: towards a flexible software design. Journal of Experimental Algorithmics, 4:7, 1999.

[56] M. E. J. Newman. Detecting community structure in networks. The European Physical Journal B – Condensed Matter and Complex Systems, 38:321–330, March 2004.

[57] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[58] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577, 2006.

[59] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

[60] M. E. J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69:066133, June 2004.

[61] M. E. J. Newman. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2):404, 2001.

[62] Andreas Noack. Unified quality measures for clusterings, layouts, and orderings of graphs, and their application as software design criteria. PhD thesis, Brandenburgische Technische Universität Cottbus, 2007.

[63] P. Pons. Post-processing hierarchical community structures: Quality improvements and multi-scale view. arXiv:cs.DS/0608050, 2006.

[64] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks (long version). arXiv:physics/0512106, December 2005.

[65] Josep M. Pujol, Javier Bejar, and Jordi Delgado. Clustering algorithm for determining community structure in large networks. Physical Review E, 74:016107, July 2006.

[66] Filippo Radicchi, Claudio Castellano, Federico Cecconi, Vittorio Loreto, and Domenico Parisi. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences, 101:2658–2663, March 2004.

[67] Jörg Reichardt and Stefan Bornholdt. Statistical mechanics of community detection. Physical Review E, 74(1):016110, 2006.

[68] Jörg Reichardt and Stefan Bornholdt. Detecting fuzzy community structures in complex networks with a Potts model. Physical Review Letters, 93:218701, November 2004.

[69] Jianhua Ruan and Weixiong Zhang. Identifying network communities with a high resolution. Physical Review E, 77:016104, 2008.

[70] J. Siek, L.-Q. Lee, and A. Lumsdaine. The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley, 2002.

[71] A. J. Soper and C. Walshaw. A generational scheme for partitioning graphs. In L. Spector et al., editor, Proc. Genetic & Evolutionary Comput. Conf. (GECCO-2001), pages 607–614. Morgan Kaufmann, San Francisco, 2001.

[72] A. J. Soper, C. Walshaw, and M. Cross. A combined evolutionary search and multilevel approach to graph partitioning. In D. Whitley et al., editor, Proc. Genetic & Evolutionary Comput. Conf. (GECCO-2000), pages 674–681. Morgan Kaufmann, San Francisco, 2000.

[73] A. J. Soper, C. Walshaw, and M. Cross. A combined evolutionary search and multilevel optimisation approach to graph partitioning. Tech. Rep. 00/IM/58, Comp. Math. Sci., Univ. Greenwich, London SE10 9LS, UK, April 2000.

[74] A. J. Soper, C. Walshaw, and M. Cross. A combined evolutionary search and multilevel optimisation approach to graph partitioning. Journal of Global Optimization, 29(2):225–241, 2004.
[75] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.

[76] Ken Wakita and Toshiyuki Tsurumi. Finding community structure in mega-scale social networks. arXiv:cs/0702048, February 2007.

[77] David Wales and Jonathan Doye. Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms. arXiv:cond-mat/9803344, March 1998. J. Phys. Chem. A 101, 5111–5116 (1997).

[78] C. Walshaw and M. Cross. Mesh partitioning: a multilevel balancing and refinement algorithm. SIAM Journal on Scientific Computing, 22(1):63–80, 2000.

[79] J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

[80] Scott White and Padhraic Smyth. A spectral clustering approach to finding communities in graphs. SIAM Data Mining Conference, 2005.

[81] F. Wu and B. A. Huberman. Finding communities in linear time: a physics approach. The European Physical Journal B – Condensed Matter and Complex Systems, 38:331–338, March 2004.

[82] W. W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452–473, 1977.

[83] H. Zhou and R. Lipowsky. Network Brownian motion: A new method to measure vertex-vertex proximity and to identify communities and subcommunities. Computational Science – ICCS 2004, 2004.

A The Benchmark Graph Collection

Table A.1 lists the benchmark graphs used in the evaluation, sorted by their number of vertices. For each graph the number of vertices and edges, the weighted mean vertex degree (mean wdeg), and the global connection density in the degree volume model (wgt. density) are given. Graphs with the suffix "main" are the largest connected component of the original graph. The second column marks the graphs used in the reduced set (R), in the scalability analysis (S), and in the comparison to reference algorithms (C). The last column gives a short source reference; links to the corresponding websites can be found in Table A.2.
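As a reading aid, the two derived columns are consistent with the following degree-based quantities, where deg_w(v) denotes the weighted vertex degree and w(E) the total edge weight. This is an inference from the printed values (checked on the simple unweighted graphs, e.g. karate with vol(V) = 156: 156/34 ≈ 4.58824 and 156/156² ≈ 0.0064103), not a formula quoted from Chapter 2; graphs with self-loops, such as morse, may follow a slightly different convention:

\[
\text{mean wdeg} = \frac{\operatorname{vol}(V)}{|V|},
\qquad
\text{wgt. density} = \frac{2\,w(E)}{\operatorname{vol}(V)^2},
\qquad
\operatorname{vol}(V) = \sum_{v \in V} \deg_w(v).
\]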
Table A.1: The Benchmark Graph Collection

graph               subset  vertices   edges     mean wdeg      wgt. density  source
Chain8              SC             8       7        1.75000     0.071429      ANoack-generated
Star9               C              9       8        1.77778     0.06250       ANoack-generated
K55                 SC            10      25        5.00000     0.02000       RRotta
Student government                11      32        7.45455     0.012195      pajek-social
Tree15              C             15      14        1.86667     0.035714      ANoack-generated
ModMath main        C             30      61        4.06667     0.0081967     pajek-social
SouthernWomen       RC            32      89        5.56250     0.005618      pajek-social
Hi-tech main        S             33      91        8.90909     0.0034014     pajek-social
karate              C             34      78        4.58824     0.0064103     Newman
football                          35     118       16.85714     0.0016949     pajek
mexican power       C             35     117        6.68571     0.0042735     pajek
Grid66              C             36      60        3.33333     0.0083333     ANoack-generated
morse                             36     666     1324.38889     2.2390e-05    anoack
Sawmill             C             36      62        3.44444     0.0080645     pajek-social
Food                              45     990      507.82222     4.376e-05     ANoack
dolphins            RC            62     159        5.12903     0.0031447     pajek-social
WorldImport1999                   66    2145   132361.51897     1.1447e-07    ANoack
prison alon         R             67     142        5.43284     0.0027473     UriAlon
lesmis              S             77     254       21.29870     0.00060976    Newman
world trade                       80     875  1644039.85000     7.6032e-09    pajek-econ
polBooks            RC           105     441        8.40000     0.0011338     pajek-mixed
adjnoun             RC           112     425        7.58929     0.0011765     Newman-ling
afootball                        115     613       10.71304     0.00081169    Newman
baywet                           128    2075       54.05277     0.00014453    pajek-bio
jazz                R            198    2742       55.39394     9.1174e-05    Arenas
A99m main           S            233     325        2.79828     0.0015337     pajek-social
SmallW main         S            233     994       17.06438     0.00025151    pajek-citation
A01 main            S            249     635        5.15663     0.00077882    pajek-citation
sandi main          RC           253     307        2.42688     0.0016287     pajek-citation
celegansneural      RS           297    2148       59.37374     5.6709e-05    Newman-bio
USAir97             RC           332    2126       12.80723     0.00023518    pajek
8Clusters                        400   24063      238.63000     1.0476e-05    ANoack-generated
WorldCities main                 413    7518       81.80145     2.96e-05      pajek-mixed
celegans metabolic  R            453    2040       20.24283     0.00010931    Arenas-bio
circuit s838        RC           512     819        3.19922     0.0006105     UriAlon
Roget main          S            994    3641       10.17807     9.8853e-05    pajek-ling
CSphd main          SC          1025    1043        2.03512     0.00047939    pajek-social
email                           1133    5451       19.24448     4.5863e-05    Arenas
polBlogs main       R           1222   16717       31.23977     2.6197e-05    pajek-mixed
NDyeast main        S           1458    1993        2.70302     0.00025664    NotreDame-bio
Yeast main          R           2224    7049        6.14119     7.5576e-05    pajek-bio
SciMet main         S           2678   10369        7.75541     4.8151e-05    pajek-citation
ODLIS main          S           2898   16381       12.70842     2.7156e-05    pajek-ling
DutchElite main                 3621    4310        2.38111     0.00011598    pajek-social
geom main                       3621    9461       10.91964     2.5291e-05    pajek-citation
Epa main            RS          4253    8897        4.21020     5.5847e-05    pajek-web
eva main                        4475    4654        2.08402     0.00010725    pajek-econ
USpowerGrid         R           4941    6594        5.33819     3.7913e-05    pajek-mixed
hep-th main         S           5835   13815        4.68711     3.6564e-05    Newman-coauth
Erdos02             SC          6927   11850        3.42139     4.2194e-05    pajek-social
Lederberg main      RS          8212   41436       10.10801     1.2048e-05    pajek-citation
PairsP              RS         10617   63786      115.39098     8.1627e-07    pajek-ling
PGPgiantcompo       S          10680   24316        4.55805     2.0542e-05    Arenas-social
astro-ph main       RS         14845  119652        4.49610     1.4982e-05    Newman-coauth
eatRS               S          23219  305501       67.90641     6.3464e-07    pajek-ling
DIC28 main          RSC        24831   71014        5.71979     7.0409e-06    pajek-ling
hep-th-new main     S          27400  352059       25.73161     1.4184e-06    pajek-citation
cond-mat-2003 main  S          27519  116181        4.41826     8.2246e-06    Newman-coauth

Table A.2: References to the Graph Sources

source            web address
ANoack-generated  http://www-sst.informatik.tu-cottbus.de/~an/GD/
pajek-citation    http://vlado.fmf.uni-lj.si/pub/networks/data/
pajek-social      http://vlado.fmf.uni-lj.si/pub/networks/data/
Newman-ling       http://www-personal.umich.edu/~mejn/netdata/
Newman            http://www-personal.umich.edu/~mejn/netdata/
Newman-coauth     http://www-personal.umich.edu/~mejn/netdata/
pajek-bio         http://vlado.fmf.uni-lj.si/pub/networks/data/bio/foodweb/foodweb.htm
Arenas-bio        http://deim.urv.cat/~aarenas/data/welcome.htm
Newman-bio        http://www-personal.umich.edu/~mejn/netdata/
UriAlon           http://www.weizmann.ac.il/mcb/UriAlon/
pajek-ling        http://vlado.fmf.uni-lj.si/pub/networks/data/
Arenas            http://deim.urv.cat/~aarenas/data/welcome.htm
pajek-web         http://vlado.fmf.uni-lj.si/pub/networks/data/mix/mixed.htm
pajek-econ        http://vlado.fmf.uni-lj.si/pub/networks/data/econ/Eva/Eva.htm
ANoack            http://www-sst.informatik.tu-cottbus.de/~an/GD/
pajek             http://vlado.fmf.uni-lj.si/pub/networks/data/sport/football.htm
RRotta            http://goya.informatik.tu-cottbus.de/~clustering/
anoack            http://www.informatik.tu-cottbus.de/~an/GD/
NotreDame-bio     http://vlado.fmf.uni-lj.si/pub/networks/data/ND/NDnets.htm
Arenas-social     http://deim.urv.cat/~aarenas/data/welcome.htm
pajek-mixed       http://vlado.fmf.uni-lj.si/pub/networks/data/mix/mixed.htm

B Clustering Results
Table B.1: Random Walk Distance by Graph

graph               1        2        3        4        5        WD-sgrd
SouthernWomen       0.33449  0.31530  0.31530  0.33184  0.33449  0.31972
dolphins            0.46220  0.52680  0.46642  0.46642  0.46642  0.52587
prison alon         0.56098  0.61317  0.61739  0.61667  0.60873  0.62077
polBooks            0.52694  0.52694  0.52694  0.52694  0.52694  0.52724
adjnoun             0.24411  0.27143  0.30637  0.28173  0.26923  0.31078
jazz                0.44403  0.44403  0.44424  0.44403  0.44447  0.44468
sandi main          0.70997  0.80704  0.80215  0.81677  0.80646  0.82773
celegansneural      0.48148  0.47932  0.47936  0.48499  0.50224  0.50295
USAir97             0.35050  0.35093  0.35246  0.35804  0.36082  0.36824
celegans metabolic  0.42627  0.43558  0.42427  0.42970  0.43405  0.45070
circuit s838        0.67174  0.81058  0.76911  0.77549  0.80461  0.81551
polBlogs main       0.43162  0.43219  0.43219  0.43219  0.43219  0.43237
Yeast main          0.54016  0.57924  0.58575  0.59310  0.58996  0.62179
Epa main            0.58271  0.63287  0.63498  0.63672  0.64941  0.66843
USpowerGrid         0.84679  0.90077  0.86988  0.90234  0.87756  0.93795
Lederberg main      0.61540  0.64291  0.66511  0.66392  0.68129  0.70339
PairsP              0.50587  0.58476  0.58308  0.58951  0.58975  0.65067
astro-ph main       0.65154  0.71070  0.70995  0.71056  0.71071  0.76261
DIC28 main          0.72153  0.77391  0.77093  0.77864  0.78366  0.84747

Table B.2: Random Walk Reachability, Length 2

graph               RWreach-sgrd 1  RWreach-sgrd 3  RWreach-sgrd 5  WD-sgrd
SouthernWomen       0.31972         0.31972         0.33449         0.31972
dolphins            0.52852         0.52852         0.52852         0.52587
prison alon         0.62077         0.61464         0.61288         0.62077
polBooks            0.52724         0.52724         0.52724         0.52724
adjnoun             0.30194         0.30359         0.30750         0.31078
jazz                0.44487         0.44452         0.44487         0.44468
sandi main          0.82760         0.82779         0.82779         0.82773
celegansneural      0.50295         0.50382         0.50353         0.50295
USAir97             0.36824         0.36603         0.36824         0.36824
celegans metabolic  0.44966         0.44989         0.44182         0.45070
circuit s838        0.80524         0.81521         0.81551         0.81551
polBlogs main       0.43237         0.43228         0.43228         0.43237
Yeast main          0.62207         0.62328         0.62593         0.62179
Epa main            0.66783         0.66795         0.66568         0.66843
USpowerGrid         0.93717         0.93882         0.93919         0.93795
Lederberg main      0.70267         0.70323         0.70402         0.70339
PairsP              0.65476         0.65361         0.65386         0.65067
astro-ph main       0.76248         0.76329         0.76409         0.76261
DIC28 main          0.85097         0.85162         0.85158         0.84747

Table B.3: Random Walk Reachability, Length 3

graph               RWreach-sgrd 1  RWreach-sgrd 3  RWreach-sgrd 5  WD-sgrd
SouthernWomen       0.31972         0.32635         0.33184         0.31972
dolphins            0.52852         0.52680         0.52680         0.52587
prison alon         0.61288         0.61464         0.61898         0.62077
polBooks            0.52724         0.52724         0.52694         0.52724
adjnoun             0.30122         0.29690         0.27580         0.31078
jazz                0.44468         0.44487         0.44424         0.44468
sandi main          0.82779         0.82779         0.82779         0.82773
celegansneural      0.50295         0.49997         0.49783         0.50295
USAir97             0.36824         0.36603         0.36094         0.36824
celegans metabolic  0.45105         0.44075         0.43418         0.45070
circuit s838        0.81377         0.81520         0.81447         0.81551
polBlogs main       0.43242         0.43233         0.43225         0.43237
Yeast main          0.62524         0.62475         0.62710         0.62179
Epa main            0.67258         0.66818         0.66667         0.66843
USpowerGrid         0.93768         0.93921         0.93937         0.93795
Lederberg main      0.70331         0.70382         0.70362         0.70339
PairsP              0.65233         0.65434         0.65346         0.65067
astro-ph main       0.76328         0.76365         0.76440         0.76261
DIC28 main          0.85106         0.85285         0.85152         0.84747

Table B.4: Clustering Results from the Refinement Phase

graph               none             SGR-density      CGR              KL
Chain8              0.3571429 (2)    0.3571429 (2)    0.3571429 (2)    0.3775510 (3)
Star9               -0.0078125 (2)   0.0000000 (1)    0.0000000 (1)    0.0000000 (1)
K55                 0.3000000 (2)    0.3000000 (2)    0.3000000 (2)    0.3000000 (2)
Student government  0.1299822 (3)    0.1817371 (2)    0.1817371 (2)    0.1817371 (2)
Tree15              0.5051020 (5)    0.5051020 (5)    0.5051020 (5)    0.5127551 (4)
ModMath main        0.4201827 (4)    0.4290513 (4)    0.4290513 (4)    0.4488041 (4)
SouthernWomen       0.3153011 (2)    0.3197197 (3)    0.3197197 (3)    0.3360056 (3)
Hi-tech main        0.2932112 (4)    0.3164654 (3)    0.3164422 (4)    0.3167893 (3)
karate              0.4086949 (4)    0.4197896 (4)    0.4197896 (4)    0.4197896 (4)
football            0.3560299 (4)    0.3602930 (5)    0.3602930 (5)    0.3603160 (5)
mexican power       0.3281102 (4)    0.3477610 (5)    0.3477610 (5)    0.3595222 (4)
Grid66              0.5131944 (4)    0.5500000 (4)    0.5500000 (4)    0.5412500 (5)
morse               0.2238226 (5)    0.2260019 (5)    0.2260019 (5)    0.2276623 (4)
Sawmill             0.5500780 (4)    0.5500780 (4)    0.5500780 (4)    0.5500780 (4)
Food                0.3989169 (4)    0.4022119 (4)    0.4022119 (4)    0.4022119 (4)
dolphins            0.5178593 (4)    0.5258692 (5)    0.5258692 (5)    0.5258692 (5)
WorldImport1999     0.2746734 (3)    0.2750381 (3)    0.2750381 (3)    0.2750381 (3)
prison alon         0.6158375 (9)    0.6207735 (9)    0.6207735 (9)    0.6207735 (9)
lesmis              0.5662983 (6)    0.5666880 (6)    0.5666880 (6)    0.5666880 (6)
world trade         0.3416073 (3)    0.3458804 (3)    0.3458804 (3)    0.3458804 (3)
polBooks            0.5036225 (4)    0.5272366 (5)    0.5272366 (5)    0.5256066 (6)
adjnoun             0.2842824 (7)    0.3107820 (6)    0.3107820 (6)    0.3107820 (6)
afootball           0.5768034 (7)    0.5878718 (7)    0.5878718 (7)    0.5941993 (7)
baywet              0.3616026 (4)    0.3630110 (4)    0.3630111 (4)    0.3630111 (4)
jazz                0.3915476 (4)    0.4446760 (4)    0.4446760 (4)    0.4448713 (4)
A99m main           0.6994806 (12)   0.7093229 (12)   0.7093229 (12)   0.7139006 (11)
SmallW main         0.4444636 (3)    0.4489310 (3)    0.4489310 (3)    0.4494608 (3)
A01 main            0.5760183 (10)   0.6163954 (11)   0.6195992 (11)   0.6321622 (12)
sandi main          0.8208257 (15)   0.8277276 (15)   0.8277276 (15)   0.8277276 (15)
celegansneural      0.4740477 (5)    0.5029479 (5)    0.5029479 (5)    0.5029479 (5)
USAir97             0.3435344 (6)    0.3682440 (6)    0.3682440 (6)    0.3682440 (6)
8Clusters           0.2844054 (8)    0.2940509 (8)    0.2940509 (8)    0.2940509 (8)
WorldCities main    0.1162580 (5)    0.1772768 (4)    0.1773516 (4)    0.1789302 (4)
celegans metabolic  0.4006559 (9)    0.4507027 (9)    0.4507027 (9)    0.4509054 (9)
circuit s838        0.7904052 (15)   0.8155133 (15)   0.8155118 (15)   0.8155133 (15)
Roget main          0.5371181 (14)   0.5800064 (14)   0.5801273 (14)   0.5868690 (16)
CSphd main          0.9248402 (33)   0.9255843 (33)   0.9255843 (33)   0.9255843 (33)
email               0.5293695 (11)   0.5771250 (12)   0.5771250 (12)   0.5776780 (11)
polBlogs main       0.4144859 (3)    0.4323714 (13)   0.4323714 (13)   0.4324173 (13)
NDyeast main        0.8135195 (34)   0.8209006 (32)   0.8211389 (33)   0.8225031 (31)
Yeast main          0.5718771 (23)   0.6217924 (23)   0.6202976 (22)   0.6232274 (22)
SciMet main         0.5485112 (14)   0.6206436 (18)   0.6210574 (18)   0.6224638 (16)
ODLIS main          0.4607628 (12)   0.5134514 (12)   0.5128609 (12)   0.5137963 (12)
DutchElite main     0.8389162 (49)   0.8493005 (48)   0.8482894 (48)   0.8492394 (47)
geom main           0.7250140 (36)   0.7470261 (43)   0.7470261 (43)   0.7472464 (42)
Epa main            0.6270148 (28)   0.6684264 (27)   0.6685486 (26)   0.6709372 (27)
eva main            0.9333784 (53)   0.9357163 (52)   0.9357164 (52)   0.9360090 (50)
USpowerGrid         0.9328997 (41)   0.9379491 (39)   0.9382380 (39)   0.9381809 (39)
hep-th main         0.8425086 (58)   0.8551498 (59)   0.8548411 (57)   0.8552368 (57)
Erdos02             0.6848450 (31)   0.7159228 (38)   0.7159229 (38)   0.7161150 (39)
Lederberg main      0.6508633 (20)   0.7033853 (22)   0.7033853 (22)   0.7033671 (22)
PairsP              0.6145969 (35)   0.6506679 (35)   0.6510081 (33)   0.6520954 (31)
PGPgiantcompo       0.8631499 (66)   0.8832731 (94)   0.8832732 (94)   0.8839220 (102)
astro-ph main       0.7300281 (49)   0.7623166 (58)   0.7621238 (58)   0.7631902 (58)
eatRS               0.4466266 (26)   0.4999899 (25)   0.4998523 (26)   0.5005763 (22)
DIC28 main          0.8015399 (58)   0.8474654 (105)  0.8475762 (102)  0.8478136 (104)
hep-th-new main     0.5798309 (18)   0.6663347 (31)   0.6663266 (31)   0.6667373 (30)
cond-mat-2003 main  0.7891245 (70)   0.8135988 (76)   0.8134295 (75)   0.8137741 (77)
average             0.52305183903    0.54608087341    0.54609771377    0.54810369879

Table B.5: Runtime Measurements. All times are in seconds.
graph               vertices   edges  none       SGR-density  CGR         KL
Chain8                     8       7    0.28800      0.32700     0.29000     0.35100
K55                       10      25    0.42300      0.40100     0.37800     0.43800
Hi-tech main              33      91    0.90900      0.93300     0.87800     1.12900
lesmis                    77     254    1.16100      1.20900     1.15600     1.28100
A99m main                233     325    1.39100      1.52100     1.43500     2.49300
SmallW main              233     994    1.83700      1.96800     1.89600     3.08400
A01 main                 249     635    1.55100      1.76400     1.71000     3.35600
celegansneural           297    2148    2.02100      2.21100     2.18600     5.16600
Roget main               994    3641    2.75000      3.20600     3.85200    24.25000
CSphd main              1025    1043    1.66000      2.04300     1.96100    22.72700
NDyeast main            1458    1993    2.03800      2.80200     3.32300    37.48900
SciMet main             2678   10369    4.75600      6.44700    14.37800    84.12400
ODLIS main              2898   16381    6.35300      8.02400    13.80900    79.36200
Epa main                4253    8897    4.54300      7.70500    30.41500   201.82000
hep-th main             5835   13815    5.27600     11.26100    27.64100   423.34600
Erdos02                 6927   11850    6.04900     11.31000    34.10100   359.59600
Lederberg main          8212   41436   14.61700     20.20700    83.13300   392.03800
PairsP                 10617   63786   22.13300     33.34600   127.07900   769.91300
PGPgiantcompo          10680   24316    9.00400     22.69000    55.56700  1255.66100
astro-ph main          14845  119652   29.20600     52.03700   365.39200  1917.59600
eatRS                  23219  305501  192.21500    236.66900  2115.29200  4003.69600
DIC28 main             24831   71014   24.79100     76.99900   816.51200  4755.91700
hep-th-new main        27400  352059  143.37100    186.87100  1608.20900  4553.72100
cond-mat-2003 main     27519  116181   33.39300     76.34600  1019.13700  4158.93400

Table B.6: Runtime of the Reference Algorithms. All time values are printed in seconds.

graph          walktrap   leadingev  wakita HE  wakita HN  fgj       ML-none  spinglass   ML-sgrd   ML-KL
Chain8          1e-05       0.00100   0.12300    0.12400    0.00600  0.25300     1.85300   0.27100     0.31200
Star9           0.00100     0.05700   0.12400    0.12300    0.00500  0.29000     3.02000   0.30900     0.36000
K55             1e-05       0.00100   0.17300    0.12300    0.00500  0.32800     1.26400   0.34900     0.38000
Tree15          1e-05       0.00700   0.17400    0.12300    0.00600  0.40500     5.02400   0.43400     0.50100
ModMath main    0.00100     0.02200   0.17300    0.17400    0.00700  0.78900     6.92700   0.84800     1.00800
SouthernWomen   0.00100     0.02900   0.17300    0.17300    0.00700  0.90400     7.43200   0.97100     1.16200
karate          0.00100     0.03300   0.17400    0.17300    0.00700  0.82900     7.73500   0.89600     1.06900
mexican power   0.00200     0.03000   0.17300    0.17300    0.00700  0.88300     6.25000   0.94300     1.12900
Grid66          0.00100     0.04400   0.17400    0.17300    0.00700  0.86500     5.37300   0.93300     1.10300
Sawmill         0.00100     0.03500   0.17400    0.17300    0.00700  0.86600     6.41200   0.93100     1.09600
dolphins        0.00200     0.06000   0.17400    0.17300    0.00800  1.11300    11.57400   1.19900     0.93000
polBooks        0.00300     0.13100   0.17500    0.17400    0.01300  1.34800    18.45100   1.46500     1.39700
adjnoun         0.00500     0.22600   0.17400    0.17500    0.01700  1.25500    30.18600   1.36300     1.40800
sandi main      0.00500     1.13800   0.22300    0.17400    0.01900  1.22000    48.13700   1.34900     2.37400
USAir97         0.03400     1.10400   0.22300    0.22500    0.04200  2.04600   116.79000   2.23900     5.03300
circuit s838    0.01500     1.84800   0.22300    0.22400    0.05100  1.61200   137.68800   1.84300     5.88600
CSphd main      0.04300     2.51300   0.22400    0.22400    0.08900  1.59400   219.04600   1.99000    22.37300
Erdos02         4.66500    54.29700   2.22800    2.18700    3.52000  5.71400  1959.89200  10.89800   351.05700
DIC28 main     28.73900   810.58300   2.56000    1.54400   27.60200 24.19300  7213.12900  75.86800  4581.66900
average         0.0029568   0.14823   0.23281    0.21555    0.02285  1.10230    22.44932   1.30958     2.71063