Version 1 (modified by manualwiki, 11 years ago)
Clone IdentifiErl
Overview
Clone IdentifiErl is a prototype duplicate code detector that provides two separate ways to identify code clones.
- The matrix algorithm is an AST/metric-based detector that can find clones containing at least one expression.
- The sw_metrics algorithm is a software-metrics-based detector that can identify similar function pairs.
Currently, it is only available via the ri module: ri:clone_identifierl/0, ri:clone_identifierl/1.
Algorithms
There are two separate algorithms, which serve different purposes. The matrix algorithm works at a finer granularity (a top-level expression) and is more sensitive to syntactic modifications, so it is better suited to detailed detection. The sw_metrics algorithm works with function pairs and uses software metrics to point out duplicates. It is faster and is not greatly influenced by the syntactic structure of the examined source code.
Both algorithms can examine the entire database (this is the default), or they can match user-given functions against the database.
The common properties of the algorithms are summed up in the table below. All the properties are optional.
Property | Description | Type | Default value | Example |
---|---|---|---|---|
algorithm | Detection is performed by the chosen algorithm. | sw_metrics \| matrix | sw_metrics | {algorithm, sw_metrics} |
func_list | The given functions are matched against the database, if this property is present. | [{Module::atom(), Function::atom(), Arity::integer()}] | - | [{my_mod, my_fun, 1}, {my_mod2, f, 0}] |
subject | The entities identified by the given IDs are matched against the database. (advanced option) | [{atom(), atom(), integer()}] | - | [{'$gn',func,54}] |
matrix algorithm
This is a two-phase algorithm that uses string similarity metrics to detect code clones during the first phase. The normalised Levenshtein distance and the Dice-Sorensen metric are available out of the box, but a user-defined metric (passed as a two-arity anonymous function via the options) can also be supplied. The user can also control the maximum deviation, above which an examined pair is not considered a clone. We advise using 0.1 as the maximum deviation with the Dice-Sorensen metric and 0.2 with the Levenshtein metric.
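To illustrate what a two-arity metric over strings can look like, here is a minimal sketch of a normalised Levenshtein distance. The module and function names are hypothetical and this is not the implementation used by Clone IdentifiErl. It assumes the metric is expected to return a deviation, where 0.0 means the two strings are identical; if the tool expects a similarity instead, the result would need to be inverted.

```erlang
-module(clone_metric_sketch).
-export([norm_levenshtein/2]).

%% Hypothetical helper, not part of RefactorErl.
%% Normalised Levenshtein distance between two strings (character lists):
%% 0.0 for identical strings, 1.0 when every position has to be edited.
norm_levenshtein([], []) ->
    0.0;
norm_levenshtein(A, B) ->
    levenshtein(A, B) / max(length(A), length(B)).

%% Row-by-row dynamic programming; only the previous row is kept.
levenshtein(A, B) ->
    FirstRow = lists:seq(0, length(B)),
    LastRow = lists:foldl(fun(CharA, Row) -> next_row(CharA, B, Row) end,
                          FirstRow, A),
    lists:last(LastRow).

%% Builds the row belonging to CharA from the previous row.
next_row(CharA, B, [Diag | PrevRest]) ->
    First = Diag + 1,
    next_row(CharA, B, Diag, PrevRest, First, [First]).

%% Diag = cell up-left, Up = cell above, Left = cell just computed.
next_row(_CharA, [], _Diag, [], _Left, Acc) ->
    lists:reverse(Acc);
next_row(CharA, [CharB | RestB], Diag, [Up | RestPrev], Left, Acc) ->
    Cost = case CharA =:= CharB of true -> 0; false -> 1 end,
    Cell = lists:min([Up + 1, Left + 1, Diag + Cost]),
    next_row(CharA, RestB, Up, RestPrev, Cell, [Cell | Acc]).
```

Under the stated assumption, such a function could presumably be supplied through the options as {metric, fun clone_metric_sketch:norm_levenshtein/2}.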
A clone is likely to be split into sub-clones by insertions, deletions or other kinds of modifications. To gather the full clone in such cases, the algorithm has a further parameter, the invalid sequence length: the maximum length of a sequence whose middle elements may differ too much from each other. By setting the invalid sequence length, one can customise the allowable maximum deviation of a clone.
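The effect of the invalid sequence length can be sketched as follows. This is only one reading of the parameter, not the tool's actual implementation: it assumes the input is a list of deviations for consecutive expression pairs, treats a pair as valid when its deviation does not exceed the limit, and merges neighbouring valid runs when the invalid gap between them is short enough. The module name, function names and data layout are hypothetical.

```erlang
-module(invalid_seq_sketch).
-export([clones/3]).

%% Hypothetical sketch, not part of RefactorErl.
%% Deviations: deviation of each consecutive expression pair.
%% Returns [{StartIndex, Length}] (1-based) of the merged clone candidates.
clones(Deviations, DiffLimit, MaxInvalid) ->
    Indexed = lists:zip(lists:seq(1, length(Deviations)), Deviations),
    %% Positions whose deviation is within the allowed limit.
    ValidIdx = [I || {I, D} <- Indexed, D =< DiffLimit],
    Ranges = lists:foldl(fun(I, Acc) -> extend(I, MaxInvalid, Acc) end,
                         [], ValidIdx),
    [{S, E - S + 1} || {S, E} <- lists:reverse(Ranges)].

%% Extend the most recent range when the run of invalid elements between
%% its end and I is at most MaxInvalid long; otherwise start a new range.
extend(I, MaxInvalid, [{S, E} | Rest]) when I - E - 1 =< MaxInvalid ->
    [{S, I} | Rest];
extend(I, _MaxInvalid, Acc) ->
    [{I, I} | Acc].
```

With the deviations [0.0, 0.05, 0.9, 0.0, 0.8, 0.7, 0.1] and a limit of 0.1, a maximum invalid sequence length of 0 yields three separate candidates, 1 merges the first two across the single invalid element, and 2 merges everything into one candidate.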
Due to the high computational requirements of the algorithm, two variants exist: a caching and a non-caching one. If the computer has plenty of free memory, the caching variant should be chosen.
The concepts introduced above are the parameters of the algorithm, which can be given in a proplist. The possible properties are summed up in the table below. All the properties are optional.
Property | Description | Type | Default value | Example |
---|---|---|---|---|
metric | The metric used to measure the similarity | leveinstein \| dice_sorensen \| fun((A::string(),B::string())->0.0 .. 1.0) | dice_sorensen | {metric, fun(A,A)->1.0; (A,B)->0.0 end} |
diff_limit | The allowed maximum deviation | 0.0 .. 1.0 | 0.1 | {diff_limit, 0.0} |
max_invalid_seq_length | The allowed maximum length of an invalid sequence | non_neg_integer() | 1 | {max_invalid_seq_length, 0} |
cache | Whether the caching variant is used | boolean() | false | {cache, true} |
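Because every property is optional, a partial proplist is enough. The following sketch shows how the defaults from the table above combine with user-given options. The module and function names are hypothetical, and this only illustrates the proplist semantics; it is not how the tool processes its options internally.

```erlang
-module(matrix_opts_sketch).
-export([normalise/1]).

%% Hypothetical helper, not part of RefactorErl.
%% Fills in the documented default for every property that is not given.
normalise(Options) ->
    [{Key, proplists:get_value(Key, Options, Default)}
     || {Key, Default} <- defaults()].

%% Default values as listed in the property table.
defaults() ->
    [{metric, dice_sorensen},
     {diff_limit, 0.1},
     {max_invalid_seq_length, 1},
     {cache, false}].
```

For example, matrix_opts_sketch:normalise([{diff_limit, 0.2}]) keeps the given diff_limit and takes the other three values from the defaults.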
sw_metrics algorithm
This algorithm is a two-phase algorithm that points out duplicates by using software similarity metrics. Here, no extra parameter can be given.
Examples
Clone IdentifiErl is only available through the ri interface. In this section, we show some illustrative examples to help you get familiar with this feature.
- The simplest case.
ri:clone_identifierl().
- We are looking for all the clones that can be found in the database. We have plenty of memory and a wide interest in any detectable clones, and we also allow a greater deviation between clones.
ri:clone_identifierl([{algorithm, matrix}, {cache, true}, {max_invalid_seq_length, 3}, {diff_limit, 0.2}]).
- We want to find the duplicates of a specific, newly introduced library function (lib_module:new_fun/2).
ri:clone_identifierl([{func_list, [{lib_module, new_fun, 2}]}]).
- We want to find either the whole function or even its subsequences as duplicates of a specific, newly introduced library function (lib_module:new_fun/2).
ri:clone_identifierl([{algorithm, matrix}, {func_list, [{lib_module, new_fun, 2}]}]).
- We want to find the duplicates of every function located in a library module (lib_module).
{_,_,QFuns} = ris:q("mods[name=lib_module].funs"), Funs = ris:unpack(QFuns), ri:clone_identifierl([{subject, Funs}]).
- We want to find either the whole function or even its subsequences as duplicates of every function located in a library module (lib_module).
{_,_,QTLEs} = ris:q("mods[name=lib_module].funs.body"), TLEs = ris:unpack(QTLEs), ri:clone_identifierl([{algorithm, matrix},{subject, TLEs}]).