Clone IdentifiErl
Overview
Clone IdentifiErl is a prototype duplicated code detector that provides several ways to identify code clones. The result is presented as groups of clones.
- The algorithm called matrix is an AST/metric based detector that finds clones containing at least one top-level expression. (The clauses of functions are built up from either one top-level expression or a sequence of top-level expressions.) For details about the theoretical background, refer to the following link and search for the article titled Identifying Code Clones with RefactorErl.
- The algorithm called sw_metrics is a software metrics based detector that identifies similar functions. For details about the theoretical background, refer to the following link.
- The algorithm called matrixfilter is the combination of the matrix and sw_metrics algorithms.
- The algorithm called suffix_tree is a token based detector that uses a suffix tree to find similar functions or expressions. The reported clones match one of the following syntactic structures: functions or arbitrary expressions.
- The algorithm called filtered_suffix_tree is based on the suffix_tree algorithm but reports only clones that are straightforward to eliminate.
Clone IdentifiErl is available via the following interfaces: ErlangShellInterface, ScriptableInterface, EmacsInterface, WxInterface and Web2.
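As a minimal illustration, a detection over the whole database can be started from the Erlang shell interface like this (the options are explained in detail in the sections below):
Command: ri:clone_identifierl([{algorithm, sw_metrics}, {format, prettyprint}]).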
Optional software dependencies
Igraph is a collection of network analysis tools. Some of these tools are used by Clone IdentifiErl because Igraph provides an efficient way to group results. Installing Igraph is recommended because it can decrease the runtime of duplicated code detection, but Clone IdentifiErl still works properly if Igraph is not installed.
Igraph source code can be downloaded from here. Install guide for Igraph can be found here.
If RefactorErl was compiled before Igraph was installed, RefactorErl has to be recompiled to take advantage of Igraph. This can be done as follows.
- Quit RefactorErl
- Type: bin/referl -build clean
- Type: bin/referl -build tool -igraph Path (where Path is the absolute path of the installed Igraph library. For example: /usr/local/lib/igraph)
If no compiled RefactorErl instance is present, only the last step is required.
Algorithms and provided services
Each of the algorithms serves a different purpose. The matrix algorithm works with a smaller granularity (a top-level expression) and is more sensitive to syntactic modifications, thus it is more suitable for a detailed detection. The sw_metrics algorithm works with function pairs and utilises software metrics to point out duplicates. This algorithm is faster and is not greatly influenced by the syntactic structure of the examined source code. The suffix_tree algorithm uses a suffix tree to detect clones. It can detect functions, top-level expressions and expressions, but only finds parametric clones, that is, clones whose instances differ only in identifiers or literals. The filtered_suffix_tree algorithm is based on the suffix_tree algorithm, but reports only clones that are straightforward to eliminate; it ensures that the instances of the reported clones use similar records and refer to similar functions. The sw_metrics, matrix and matrixfilter algorithms offer an efficient way to match modules against the whole database, while the suffix_tree based algorithms can narrow the entire search space: only the given set of modules is analysed to report duplicates, and the rest of the database is not considered.
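For example, the suffix_tree detection below is restricted to a single placeholder module, mymod; matching given functions against the whole database is illustrated after the option tables below:
Command: ri:clone_identifierl([{algorithm, suffix_tree}, {files, [mymod]}]).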
When a clone detection finishes, the result is saved under a key and a name. The key is derived from the algorithm and the given parameters. If a clone detection is started with a key that has already been saved, the previous result associated with this key is loaded instead of running the requested algorithm. The saved results can be queried at any later time, as long as the database has not been changed. All names must be unique, but it is possible to overwrite a result associated with a name.
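For example, a result can be saved under an explicit name and, on a later run, overwritten by passing the enforce option (my_search is a placeholder name):
Command: ri:clone_identifierl([{algorithm, sw_metrics}, {name, my_search}, {enforce, true}]).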
A saved result can be queried using different output formats, for example prettyprint or file_and_loc. Regardless of the chosen algorithm, all of the output formats illustrated below can be requested.
- prettyprint: Returns the textual representation of the clones.
Command: ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,prettyprint}]).
--------------------------------------------------------------------------------
Clone group: 1
Clone element: (found in ucl_alg_dm):
q_gen(QueryETS) ->
    fun(F, Ps) ->
        [{F, Ps1}] = ets:lookup(QueryETS, F),
        length(list_sort_intersect(Ps1, Ps))
    end.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Clone element: (found in ucl_alg_dm):
u_gen(UseETS) ->
    fun(P, Fs) ->
        [{P, Fs1}] = ets:lookup(UseETS, P),
        length(list_sort_intersect(Fs1, Fs))
    end.
--------------------------------------------------------------------------------
- file_and_loc: Returns file and position information related to the clones. When using this output format it is possible to set the position type.
The linecol position type returns the row and column where the found clone starts and ends. The scalar position type returns the character position where the clone starts and ends. This format also includes the number of the respective clone group.
Command:
ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,file_and_loc}, {postype,linecol}]).
[{1,[[{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, {startpos,{12,1}}, {endpos,{16,8}}], [{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, {startpos,{41,1}}, {endpos,{45,8}}]]}]
Command:
ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,file_and_loc}, {postype,scalar}]).
[{1,[[{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, {startpos,{47}}, {endpos,{121}}], [{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, {startpos,{520}}, {endpos,{576}}]]}]
- nodes: Returns the internal identifiers of the found clone groups. It is a good choice if you wish to further process the result by scripting using the ris interface.
Command: ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,nodes}]).
[[[{'$gn',form,2}],[{'$gn',form,3}]]]
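Since the shell output above suggests that ri:clone_identifierl/1 returns the clone groups as its value when the nodes format is requested, a minimal post-processing sketch under that assumption could simply bind and count the groups:
Command:
Groups = ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,nodes}]),
length(Groups).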
Tuning the algorithms
The common properties of the algorithms are summed up in the table below. All the properties are optional.
Property | Description | Type | Default value | Example |
---|---|---|---|---|
algorithm | Detection takes place using the chosen algorithm. | sw_metrics \| matrix \| matrixfilter \| suffix_tree \| filtered_suffix_tree | sw_metrics | {algorithm, sw_metrics} |
name | The name under which the result will be saved. | atom() | the atom 'temp' concatenated with the timestamp | {name, temp201405101829} |
enforce | If the name is already in use and this option is given, the previously stored result is overwritten. | bool() | false | {enforce, true} |
format | The output format. (Only relevant in the ri interface.) | prettyprint \| file_and_loc \| nodes | prettyprint | {format, file_and_loc} |
output | The name of the output file; the results are written into this file in the data directory. | atom() | - | {output, mysearch} |
postype | The preferred position type. | linecol \| scalar | linecol | {postype, linecol} |
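As an illustration, several of these common options can be combined in a single call (my_matrix_run and my_matrix_result are placeholder names):
Command:
ri:clone_identifierl([{algorithm, matrix}, {name, my_matrix_run}, {enforce, true}, {format, file_and_loc}, {postype, scalar}, {output, my_matrix_result}]).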
Options only available for the matrix, sw_metrics and matrixfilter algorithms
The subject of these algorithms can be the entire database (this is the default), or user-given functions can be matched against the database; an example invocation follows the table below.
The common properties of these algorithms are summed up in the table below. All the properties are optional.
Property | Description | Type | Default value | Example |
---|---|---|---|---|
func_list | The given functions are matched against the database. | [{Module::atom(), Function::atom(), Arity::integer()}] | - | {func_list, [{my_mod, my_fun, 1}, {my_mod2, f, 0}]} |
subject | The entities identified by the given IDs are matched against the database. (Advanced option.) | [{atom(), atom(), integer()}] | - | {subject, [{'$gn',func,54}]} |
files | The contents of the given files are matched against the database. | [Module::atom() \| Filepath::string() \| RegExp::string() \| File::string()] | all files from the database | {files,["/usr/local/lib/erlang/lib/mnesia-4.5/src/mnesia_log.erl", mymod]} |
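For example, to match two given functions against the whole database (the module and function names are placeholders):
Command: ri:clone_identifierl([{algorithm, matrixfilter}, {func_list, [{my_mod, my_fun, 1}, {my_mod2, f, 0}]}]).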
Options only available for the matrix algorithm
This is a two-phase algorithm which uses string similarity metrics to detect code clones during the first phase. Although the normalized Levenshtein distance and the Dice-Sorensen metric are ready to use for the detection, a user-defined metric (passed as a two-arity anonymous function via the options) can also be requested; an example with a user-defined metric follows the table below. The maximum deviation, above which the examined pair is not considered a clone, is also controllable by the user. We advise using 0.1 as the maximum deviation with the Dice-Sorensen metric and 0.2 with the Levenshtein metric.
It is likely that a clone is divided into sub-clones due to insertions, deletions or other kinds of modifications. Since it would be practical to gather such a full clone, a further parameter is introduced, called the invalid sequence length: the maximum length of a sequence whose middle elements may differ too much from each other. By setting the invalid sequence length, one can customise the allowed maximum deviation of a clone.
Due to the high computational requirements of the algorithm, two variants exist: a caching and a non-caching one. If the computer has plenty of free memory, the caching version should be chosen.
The previously introduced concepts are the parameters of the algorithm which can be defined by a proplist. The possible properties are summed up in the table below. All the properties are optional.
Property | Description | Type | Default value | Example |
---|---|---|---|---|
metric | The metric used to measure similarity. | leveinstein \| dice_sorensen \| fun((A::string(),B::string())->0.0 .. 1.0) | dice_sorensen | {metric, fun(A,A)->1.0; (A,B)->0.0 end} |
diff_limit | The allowed maximum deviation. | 0.0 .. 1.0 | 0.1 | {diff_limit, 0.0} |
max_invalid_seq_length | The allowed maximum length of an invalid sequence. | non_neg_integer() | 1 | {max_invalid_seq_length, 0} |
cache | Whether the algorithm is allowed to use a cache. | boolean() | false | {cache, true} |
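As a sketch of passing a custom metric, the following call reuses the anonymous function from the table above, which treats only identical strings as fully similar (an illustration rather than a practically useful metric):
Command:
ri:clone_identifierl([{algorithm, matrix}, {metric, fun(A, A) -> 1.0; (_A, _B) -> 0.0 end}, {diff_limit, 0.1}, {cache, true}]).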
Options only available for the suffix_tree algorithm
The available parameters are summed up in the table below.
Property | Description | Type | Default value | Example |
---|---|---|---|---|
files | Narrows the search space of the algorithm. If this option is given, only the given files are the subject of the detector. | [Module::atom() \| Filepath::string() \| RegExp::string() \| File::string()] | all files from the database | {files,["/usr/local/lib/erlang/lib/mnesia-4.5/src/mnesia_log.erl", mymod]} |
minlen | The minimal number of tokens that a clone must contain | integer() | 32 | {minlen,50} |
minnum | The minimal number of instances in one clone group | integer() | 2 | {minnum,3} |
overlap | The maximum number of tokens by which duplicates may overlap each other | integer() | 0 | {overlap,2} |
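For instance, to look for longer clones with at least three instances within a single placeholder module, mymod:
Command: ri:clone_identifierl([{algorithm, suffix_tree}, {files, [mymod]}, {minlen, 50}, {minnum, 3}]).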
Options only available for the filtered_suffix_tree algorithm
The available parameters include the parameters of the suffix_tree algorithm and the following:
Property | Description | Type | Default value | Example |
---|---|---|---|---|
max_invalid_seq_length | The allowed maximum length of an invalid sequence | non_neg_integer() | 0 | {max_invalid_seq_length, 1} |
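An example run of the filtered_suffix_tree algorithm that also tolerates a short invalid sequence (mymod is a placeholder module):
Command: ri:clone_identifierl([{algorithm, filtered_suffix_tree}, {files, [mymod]}, {max_invalid_seq_length, 1}]).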