= About clustering =

The clustering algorithm sorts entities into groups whose members are closely related from some point of view. The entities can be '''modules''' or '''functions''', and the relations between them are function calls and record usages. The resulting groups are called '''''clusters'''''.

= Clustering in !RefactorErl =

Before using the clustering functionality of the tool, the source files have to be loaded into the database. The clustering algorithm clusters all modules and functions that are in the database. In the clustering options, the user can choose modules and functions to be ignored during clustering.

== Types of clustering ==

There are two implemented clustering algorithms in !RefactorErl:
* Hierarchical algorithm (agglomerative)
* Genetic algorithm

=== Agglomerative clustering ===

In the beginning, each entity forms a separate cluster. Then, in each step, the two closest clusters are selected and merged. This process continues until only one cluster remains. Each intermediate state is a possible clustering of the entities, and the output of the algorithm is the list of these clusterings.
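
The merging loop described above can be sketched in a few lines of Python. This is only an illustration, not the code used by !RefactorErl: the `agglomerate` function, the single-linkage rule for comparing clusters, the `calls` table and the `call_distance` function are all assumptions made for the example.

```python
from itertools import combinations

def agglomerate(entities, distance):
    """Return every intermediate clustering, from singletons down to one cluster."""
    clusters = [frozenset([e]) for e in entities]
    history = [list(clusters)]

    def linkage(pair):
        # Single linkage (an assumption): the distance of two clusters is
        # the smallest distance between any two of their members.
        a, b = pair
        return min(distance(x, y) for x in a for y in b)

    while len(clusters) > 1:
        # Select the two closest clusters and merge them.
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append(list(clusters))
    return history

# Hypothetical call counts between four modules.
calls = {("a", "b"): 5, ("c", "d"): 4, ("b", "c"): 1}

def call_distance(x, y):
    # More calls between two entities means a smaller distance (illustrative only).
    n = calls.get((x, y), 0) + calls.get((y, x), 0)
    return 1.0 / (1 + n)

steps = agglomerate(["a", "b", "c", "d"], call_distance)
for step in steps:
    print(sorted(sorted(c) for c in step))
```

With four entities the sketch produces four clusterings: the singletons, two intermediate states, and the final single cluster. The user would then pick the intermediate state that suits the code base best.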

=== Genetic clustering ===

Genetic algorithms simulate the evolution of species. The algorithm iterates over populations in which every individual fights for its own survival or for the survival of its genes. A fitness function is defined to determine the value of an individual. The fitter an individual is, the more likely it is to survive. The algorithm is expected to converge to the fittest possible individual, as evolution does.
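
As a rough illustration of this idea, the Python sketch below evolves a population of candidate clusterings. Everything in it is an assumption made for the example: the `CALLS` graph, the encoding of an individual as a list of cluster indices, the density-style `fitness` function, and the selection scheme. It does not reproduce the fitness function or the operators used by !RefactorErl.

```python
import random

# Hypothetical call graph: (caller, callee) pairs between six entities.
CALLS = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
N_ENTITIES = 6
N_CLUSTERS = 2

def fitness(chromosome):
    # Illustrative fitness: for each cluster, the number of internal calls
    # divided by the squared cluster size, so dense clusters score higher.
    score = 0.0
    for cluster in range(N_CLUSTERS):
        members = {i for i, g in enumerate(chromosome) if g == cluster}
        if not members:
            continue
        internal = sum(1 for a, b in CALLS if a in members and b in members)
        score += internal / len(members) ** 2
    return score

def crossover(p1, p2, rng):
    # One-point crossover: the head of one parent and the tail of the other.
    cut = rng.randrange(1, N_ENTITIES)
    return p1[:cut] + p2[cut:]

def mutate(chromosome, rng, rate):
    # Each gene is reassigned to a random cluster with probability `rate`.
    return [rng.randrange(N_CLUSTERS) if rng.random() < rate else g
            for g in chromosome]

def evolve(pop_size=20, iterations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randrange(N_CLUSTERS) for _ in range(N_ENTITIES)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(iterations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        survivors = population[: pop_size // 2]   # the fitter half may reproduce
        children = [population[0]]                # the best individual is kept unchanged
        while len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            children.append(mutate(crossover(p1, p2, rng), rng, rate=0.1))
        population = children
    population.sort(key=fitness, reverse=True)
    best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation

best, history = evolve()
print(best, history[-1])
```

Because the best individual of each generation is copied unchanged into the next one, the best fitness in `history` never decreases from one generation to the next.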

= Using the clustering functionality =

The clustering functionality is available from Emacs and the console interface.

== Parameters for agglomerative clustering ==

* '''''Modules to skip:''''' The list of module names (separated by space or comma characters) that should be ignored in the clustering process
* '''''Functions to skip:''''' The list of function names (separated by space or comma characters) that should be ignored in the clustering process
* '''''Transform function:''''' The function that transforms the attribute matrix before the clustering is run. There are two options: '''zero_one''' and '''none'''.
    * '''zero_one:''' Every positive weight in the attribute matrix is replaced by 1.
    * '''none:''' The attribute matrix is left unchanged.
* '''''Distance function:''''' It can be '''call_sum''', '''weight''', or a reference to a user-defined function.
    * '''call_sum:''' A distance function based on the function call structure; it sums the call weights.
    * '''weight:''' A distance function based on the function call structure and record usage, weighted by the anti-gravity factor.
    * '''User-defined function:''' TODO
* '''''Anti-gravity:''''' The anti-gravity factor used by distance functions such as '''weight'''.
* '''''Merge function:''''' The function that calculates the attributes of a newly created cluster. Such cluster attribute calculator functions are used by the algorithm that operates on the attribute matrix.
    * '''smart:''' The size attributes are summed, the entity attributes are merged, the function, record and macro attributes are averaged, and all other attributes are left undefined.
    * '''User-defined function:''' TODO
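
To make two of the parameters above more concrete, here is a small Python sketch of how a skip list could be parsed and what the two transform options do to an attribute matrix. The function names `parse_skip_list`, `zero_one` and `no_transform`, and the representation of the attribute matrix as nested dictionaries (entity, then attribute, then weight), are assumptions made for the illustration and are not taken from the tool.

```python
import re

def parse_skip_list(text):
    """Split a skip list on runs of whitespace and/or commas."""
    return [name for name in re.split(r"[\s,]+", text.strip()) if name]

def zero_one(matrix):
    """Replace every positive weight by 1; other weights stay as they are."""
    return {entity: {attr: (1 if weight > 0 else weight)
                     for attr, weight in row.items()}
            for entity, row in matrix.items()}

def no_transform(matrix):
    """The 'none' option: the matrix is returned unchanged."""
    return matrix

TRANSFORMS = {"zero_one": zero_one, "none": no_transform}

# Hypothetical attribute matrix: how often each module uses each function.
matrix = {
    "mod_a": {"f/1": 3, "g/0": 0},
    "mod_b": {"f/1": 0, "g/0": 7},
}

print(parse_skip_list("lists, ordsets  gen_server"))
print(TRANSFORMS["zero_one"](matrix))
print(TRANSFORMS["none"](matrix))
```

After the `zero_one` transformation only the fact that a relation exists is kept, not how strong it is, so heavily used functions no longer dominate the distance calculation.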

== Parameters for genetic clustering ==

* '''''Population size:''''' The number of chromosomes in every iteration of the algorithm. At the beginning of the algorithm a random population is generated.
* '''''Iterations:''''' The number of iterations the algorithm performs. To use the default, type 10.
* '''''Mutation rate:''''' The probability of mutation. To use the default, type 0.9.
* '''''Crossover rate:''''' The probability that a crossover is performed on two selected chromosomes. To use the default, type 0.7.
* '''''Elite count:''''' The number of chromosomes that are transferred to the next generation unchanged. To use the default, type 2.
* '''''Maximum cluster size:''''' Maximum number of clusters allowed.
* '''''Maximum start cluster size:''''' Maximum number of clusters allowed at startup.
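
The parameters above can be collected in one structure, as in the following Python sketch. The defaults for iterations, mutation rate, crossover rate and elite count are the ones listed above; the class name `GeneticOptions`, its field names, the sanity checks in `problems`, and the `elite` helper are hypothetical and only illustrate how the values relate to each other. The toy fitness used in the example (`sum`) is likewise just a placeholder.

```python
from dataclasses import dataclass

@dataclass
class GeneticOptions:
    population_size: int          # no default is documented
    max_clusters: int             # "Maximum cluster size"
    max_start_clusters: int       # "Maximum start cluster size"
    iterations: int = 10
    mutation_rate: float = 0.9
    crossover_rate: float = 0.7
    elite_count: int = 2

    def problems(self):
        """Return a list of problems found; an empty list means the options are usable."""
        found = []
        for name in ("mutation_rate", "crossover_rate"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                found.append(f"{name} must be a probability in [0, 1]")
        for name in ("population_size", "iterations",
                     "max_clusters", "max_start_clusters"):
            if getattr(self, name) < 1:
                found.append(f"{name} must be at least 1")
        if not 0 <= self.elite_count <= self.population_size:
            found.append("elite_count must be between 0 and population_size")
        return found

def elite(population, fitness, elite_count):
    """Pick the chromosomes that pass to the next generation unchanged."""
    return sorted(population, key=fitness, reverse=True)[:elite_count]

opts = GeneticOptions(population_size=12, max_clusters=5, max_start_clusters=3)
print(opts.problems())

# Toy population and toy fitness (the sum of the genes), for illustration only.
population = [[0, 0, 1], [0, 1, 1], [1, 1, 1]]
print(elite(population, fitness=sum, elite_count=opts.elite_count))
```

With an elite count of 2, the two fittest chromosomes of the toy population are carried over as they are, while the rest of the next generation would be produced by crossover and mutation.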