wiki:Clustering

About clustering

The clustering algorithm sorts entities into groups, that are closely related from different point of view. The entities can be modules or functions and the relations between them are function calls and record usages. The created groups are called clusters.

Clustering in RefactorErl

Before using the clustering functionality of the tool, the source files have to be loaded to the database. The clustering algorithm will cluster all the modules and functions which are in the database. In the clustering options the user can choose the modules and functions that should be ignored during the clustering.

The clustering module can also provide a decomposition of the modules base on a module clustering. This option takes the library modules and header files in the database, and distributes their functions, records and macros among the clusters. Parts used by more than one cluster stay in their place.

Types of clustering

There are two implemented clustering algorithms in RefactorErl:

  • Hierarchical algorithm (agglomerative)
  • Genetic algorithm

Agglomerative clustering

In the beginning, each entity forms a separate cluster. Then, in each step, the two closest clusters are selected and unified. This process continues until there is only one cluster. The intermediate states contain a possible clustering of the entities. The output of the algorithm is the list of these possible clusterings.

Genetic clustering

Genetic algorithms simulate the evolution of species. There are iterations of populations in which every entity fights for survival or for the survival of its genes. A fitness function is defined to determine the value of an entity. The fitter an entity is, the more likely it survives. The algorithm is expected to converge to the fittest possible entity, like evolution does.

Using the clustering functionality

The clustering functionality is available from Emacs, the console interface and the scriptable interface.

Parameters for agglomerative clustering

  • Modules to skip: The list of module names (separated by space or comma characters) that should be ignored in the clustering process
  • Functions to skip: The list of function names (separated by space or comma characters) that should be ignored in the clustering process
  • Transform function: The function that transforms the attribute matrix before running the clustering. There are two options for the transformation: zero_one and none.
    • zero_one: The option zero_one means that the weights that are positive in the attribute matrix will be transformed to 1.
    • none: The option none means that no transformation will be performed in the attribute matrix.
  • Distance function: It can be call_sum, weight or a function reference to user-defined function.
    • call_sum: Distance function based on function call structure, sums call weights.
    • weight: The distance function is based on function call structure and record usage. It is weighted by the anti-gravity factor.
    • User-defined function: TODO
  • Anti-gravity: The anti-gravity factor for distance calculating function, like weight.
  • Merge function: The cluster attribute calculator functions are used in the attribute matrix user algorithm. This function calculates the new attributes of the created clusters.
    • smart: The size attributes are summed, the entities attributes are merged, average is calculated from the function, record and macro attributes, and the other attributes are undefined.
    • User-defined function: TODO

Parameters for genetic clustering

  • Population size: The number of chromosomes in every iteration of the algorithm. At the beginning of the algorithm a random population is generated.
  • Iterations: The number of iteration in the algorithm. For default type 10.
  • Mutation rate: The probability of mutation.* For default type 0.9.
  • Crossover rate: The probability that a crossover will be performed on two selected chromosomes.* For default type 0.7.
  • Elite count: The number of chromosomes that are transferred to the next generation without change. For default type: 2
  • Maximum cluster size: Maximum number of clusters allowed.
  • Maximum start cluster size: Maximum number of clusters allowed at startup.

Parameters for decomposition

  • Decomposition: It shows whether the user wants a possible decomposition of the modules or not. Only available with module clustering.
  • Library limit: The minimum number of function calls for library modules. If a module is called by at least this many other modules, it is considered a library modules.
  • Headers: The format of header files. It is a string which is matched to the end of the file names.

Parameters for clustering

  • Algorithm: The used clustering algorithm.
  • Entities: The entities of the clustering. Modules and functions available for both algorithms.
  • Show Goodness: Yes/No? question. If enabled, the tool shows the goodness values for each of the clusterings.
  • Only best: Yes/No? question. If enabled, the tool shows the best clustering result only.
  • Store results: Yes/No? question. If enabled, the tool stores the results in 3 different format describes bellow.

Output formats

As mentioned above, the tool can store clustering results in 3 different formats:

  • Dets table: It is used by the tool itself.
  • Scriptable file: It is format which can easily be used by other programs. It is a list of pairs, every pair contains a keyword and a result. ([{clusterings, cluster1],[cluster2],...,[clusterN?},{goodnesses,[goodness1,...,goodnessN]},...])
  • Readable file: It creates a report "readable to the human eye". It shows the resulting clusterings and the decomposition offered by the tool.

Important notes about the module

There are some things which are important to know, when using the clustering module of the tool.

Distance functions for agglomerative algorithm

Currently, there are two distance functions available for the agglomerative clustering module. (Weight and Call Sum) The weight distance function recognizes two entities similar, if they use the same functions. It uses the idea, that the modules' behavior is largely dependent on the functions they call. It means that if two modules call the same functions, they must be similar in behavior. The call sum distance function considers two modules similar if they call each others' functions. The idea here is that these modules must work similarly, because use each other more. It is very important to choose the distance function wisely, because the two distance function can generate very different results, so the goal of the clustering must be considered before running the algorithm. The call sum distance function can be better, if the user wants to discover connections between the modules, and unite modules, while weight distance function can show, which modules have the same behavior.

Choosing the parameters of the genetic algorithm

As mentioned above, the genetic algorithm starts from a randomly generated population. This makes the algorithm very non-deterministic. This makes choosing the parameters very important. We don't have proven method for choosing the parameters right, but we have some results which can make this decision easier. We examined the effect of population size and iteration number on the precision and run time of the algorithm. The results are the following:

Iterations / Population size 10 20 30 40 50
10 {1.712,24} {1.853,37} {1.955,49} {2.012,63} {2.063,79}
20 {1.844,36} {1.953,64} {2.048,94} {2.088,119} {2.141,152}
30 {1.9,51} {1.973,91} {2.069,123} {2.074,160} {2.135,203}
40 {1.906,61} {1.977,111} {2.064,160} {2.092,223} {2.16,261}
50 {1.942,72} {1.982,134} {2.09,203} {2.091,264} {2.151,330}

Every cell shows the results' average weighted fitness and the run time of the algorithms, with parameters {iterations,X}, {population_size,Y} for cell {X,Y}. The fitness of the best clustering of this database is 2.404, so this is the maximum fitness achievable. The algorithm is run 500 times with every pair of parameters on this database. The run time is shown in seconds.

It shows that while run time is symmetrical, the average fitness of the result is not. The population size affects the result more than the iteration number. Actually, increasing the iteration number (over 30) doesn't really change the result. This does not apply to the population size.

Last modified 11 years ago Last modified on May 31, 2013, 12:03:49 PM