Changes between Version 1 and Version 2 of CloneIdentifiErl


Ignore:
Timestamp:
Sep 17, 2014, 3:53:40 PM (10 years ago)
Author:
manualwiki
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • CloneIdentifiErl

    v1 v2  
    22 
    33== Overview == 
    4 Clone !IdentifiErl is a prototype duplicated code detector software, which provides two separate ways to identify code clones. 
     4Clone !IdentifiErl is a prototype duplicated code detector software, which provides several ways to identify code clones. Groups of clones are present in the result. 
    55 
    6 * The algorithm, called {{{matrix}}}, is an AST/metric based detector with which clones containing at least one expression can be found. 
    7 * The algorithm, called {{{sw_metrics}}}, is a software metrics based detector with which similar function pairs can be identified. 
     6* The algorithm, called {{{matrix}}}, is an AST/metric based detector with which clones containing at least one top-level expression can be found. (The clauses of functions built up from either one top-level expression or the sequences of top-level expressions.) To read further details about the theoretical background refer to the following [http://www.inf.u-szeged.hu/splst13/splst13proc.pdf link] and search for the ''Identifying Code Clones with !RefactorErl'' titled article. 
     7* The algorithm, called {{{sw_metrics}}}, is a software metrics based detector with which similar functions can be identified. To read further details about the theoretical background refer to the following [http://www.cs.ubbcluj.ro/~studia-i/2014-macs/08Fordos.pdf link]. 
     8* The algorithm, called {{{matrixfilter}}}, is the combination of the {{{matrix}}} and {{{sw_metrics}}} algorithms. 
     9* The algorithm, called {{{suffix_tree}}}, is a token based detector with which similar functions or expressions can be found. Its a token based algorithm which uses a suffix tree to identify clones. Clones match one of the following syntactic structures: functions or arbitrary expressions. 
     10* The algorithm, called {{{filtered_suffix_tree}}}, is based on the {{{suffix_tree}}} algorithm but provides only clones that are straightforward to be eliminated. 
    811 
    9 Actually, It is only available via the ri module: ri:clone_identifierl/0, ri:clone_identifierl/1. 
     12Clone !IdentifiErl is available via the following interfaces: ErlangShellInterface, ScriptableInterface, EmacsInterface, WxInterface and [wiki:Web2  Web2]. 
    1013 
    11 == Algorithms == 
    12 There are two separate algorithms which serve different purposes. The {{{matrix}}} algorithm works with smaller granularity (a top-level expression), and it is more sensitive to syntactic modficiations, thus it is more suitable for a detailed detection. The {{{sw_metrics}}} algorithm works with function pairs, and it utilises software metrics to point out duplicates. This algorithm is more fast and is not greatly influenced by the syntactic structures of the examined source code.  
     14== Optional software dependencies == 
    1315 
     16[http://igraph.org/ Igraph] is a collection of network analysis tools. Some of these tools are used by Clone !IdentifiErl, because Igraph provides an efficient way to group results.  
     17Although it is suggested to install Igraph because it can decrease the runtime of the duplicated code detection, Clone !IdentifiErl still works properly if Igraph has not been installed. 
     18 
     19Igraph source code can be downloaded from [http://igraph.org/c/#downloads here]. Install guide for Igraph can be found [http://igraph.org/c/#creq here]. 
     20 
     21If !RefactorErl had been compiled after that Igraph have been installed, !RefactorErl have to be recompiled to take the advantages of Igraph. This can be done as follows. 
     22 
     231. Quit !RefactorErl 
     242. Type: bin/referl -build clean 
     253. Type: bin/referl -build tool -igraph Path (where Path is the absolute path of the installed Igraph library. For example: /usr/local/lib/igraph) 
     26If no compiled !RefactorErl instance is present, then only the last step is required to be done. 
     27 
     28 
     29== Algorithms and provided services== 
     30Each of the algorithms serves different purposes. The {{{matrix}}} algorithm works with smaller granularity (a top-level expression), and it is more sensitive to syntactic modificiations, thus it is more suitable for a detailed detection. The {{{sw_metrics}}} algorithm works with function pairs, and it utilises software metrics to point out duplicates. This algorithm is more fast and is not greatly influenced by the syntactic structures of the examined source code. The {{{suffix_tree}}} algorithm uses a [http://en.wikipedia.org/wiki/Suffix_tree Suffix tree] to detect clones. This algorithm can be used to detect functions, top-level expressions and expressions, but only finds parametric clones. Parametric clones are clones whose instances differ only in the identifiers or literals. The {{{filtered_suffix_tree}}} is based on the {{{suffix_tree}}} algorithm,  but provides only clones that are straightforward to be eliminated. This algorithm ensures that the instances of the reported clones use similar records and refer to similar functions.  
     31The {{{sw_metrics, matrix, matrixfilter}}} algorithms offer an efficient way to match modules against the whole database, while the {{{suffix_tree}}} based algorithms have the ability to decrease the entire search space. That is, only the given set of modules will be analysed to report duplicates whilst the rest of the database will not be considered. 
     32 
     33If a clone detection finished then the result will be saved using a key, and a name. The key is made of the algorithm and the given parameters. If a clone detection is started with a key that has been already saved, then the previous result associated with this key will be loaded instead of running the requested algorithm. The saved results can be queried later at any time, until the database has not been changed. All names must be unique. It is possible to overwrite a result associated with a name. [[BR]] 
     34 
     35A saved result can be queried using different output formats, for example: {{{prettyprint, file_and_loc}}}.  
     36Regardless of the chosen algorithm all output format, which are illustrated below, can be requested. 
     37* prettyprint: Returns the textual representation of the clones. 
     38Command: {{{ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,prettyprint}]).}}} 
     39{{{ 
     40--------------------------------------------------------------------------------  
     41Clone element: (found in ucl_alg_dm): 
     42q_gen(QueryETS) -> 
     43    fun(F, Ps) -> 
     44        [{F, Ps1}] = ets:lookup(QueryETS, F), 
     45        length(list_sort_intersect(Ps1, Ps)) 
     46    end. 
     47 
     48--------------------------------------------------------------------------------  
     49--------------------------------------------------------------------------------  
     50Clone element: (found in ucl_alg_dm): 
     51u_gen(UseETS) -> 
     52    fun(P, Fs) -> 
     53        [{P, Fs1}] = ets:lookup(UseETS, P), 
     54        length(list_sort_intersect(Fs1, Fs)) 
     55    end. 
     56--------------------------------------------------------------------------------  
     57}}} 
     58 
     59* file_and_loc: Returns file and position information related to the clones. When using this output format it is possible to set the position type. 
     60The {{{linecol}}} position type returns the row and column where the found clone starts and ends. The scalar position type returns the character position where the clone starts and ends. 
     61 
     62Command:[[BR]] 
     63{{{ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,file_and_loc}, {postype,linecol]).}}} 
     64 
     65 
     66{{{ 
     67[[[{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, 
     68   {startpos,{12,1}}, 
     69   {endpos,{16,8}}], 
     70  [{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, 
     71   {startpos,{41,1}}, 
     72   {endpos,{45,8}}]]] 
     73}}} 
     74 
     75Command: [[BR]] 
     76{{{ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,file_and_loc}, {postype,scalar]).}}} 
     77 
     78{{{ 
     79[[[{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, 
     80   {startpos,{47}}, 
     81   {endpos,{121}}], 
     82  [{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, 
     83   {startpos,{520}}, 
     84   {endpos,{576}}]]] 
     85}}} 
     86 
     87* nodes: Returns the internal identifiers of the found clone groups. It is a good choice if you wish to further process the result by scripting using the ris interface. 
     88Command: {{{ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,nodes}]).}}} 
     89{{{ 
     90[[[{'$gn',form,2}],[{'$gn',form,3}]]] 
     91}}} 
     92 
     93 
     94 
     95== Tuning the algorithms ==  
     96The common properties of the algorithms are summed up in the table below. All the properties are optional. 
     97||=Property=||=Description=||=Type=||=Default value=||=Example=|| 
     98|| algorithm || Detection takes places by  the chosen algorithm. || sw_metrics | matrix | matrixfilter | suffix_tree | filtered_suffix_tree || sw_metrics || {algorithm, sw_metrics} || 
     99|| name || The name with the result will be saved. || atom() || the atom 'temp' concatenated with the timestamp || {name, temp201405101829} || 
     100|| enforce || If the name is already used and this option is given, then this enables to overwrite previous stored results. || bool() || false || {enforce, true} || 
     101|| format || The output format. (Only relevant in the ri interface) || prettyprint | file_and_loc | nodes || prettyprint || {format, file_and_loc} || 
     102|| output || The output file name with the results will be written in the [http://pnyf.inf.elte.hu/trac/refactorerl/wiki/StartUp#Options data dir]. || atom() || - || {output, mysearch} || 
     103|| postype || The preferred position type. || linecol | scalar || linecol || {postype, linecol} || 
     104 
     105 
     106=== Options only available for the {{{matrix,sw_metrics,matrixfilter}}} algorithms === 
    14107The subject of both algorithms can be the entire database (this is the default option) or one can start such examination in which user-given functions can be matched against the database.  
    15108 
    16 The common properties of the algorithms are summed up in the table below. All the properties are optional. 
     109The common properties of these algorithms are summed up in the table below. All the properties are optional. 
    17110||=Property=||=Description=||=Type=||=Default value=||=Example=|| 
    18 || algorithm || Detection takes places by  the chosen algorithm. || sw_metrics | matrix || sw_metrics || {algorithm, sw_metrics} || 
    19 || func_list   || The given functions are matched against the database, if this property is present. || [{Module::atom(), Function::atom(), Arity::integer()}] || - || [{my_mod, My_fun, 1}, {my_mod2, f, 0}] || 
    20 || subject     || The entities identified by the given IDs are matched against the database. (advanced option) || [{atom(), atom(), integer()}] || - || [{'$gn',func,54}] || 
     111|| func_list   || The given functions are matched against the database. || [{Module::atom(), Function::atom(), Arity::integer()}] || - || {func_list, [{my_mod, My_fun, 1}, {my_mod2, f, 0}]} || 
     112|| subject     || The entities identified by the given IDs are matched against the database. (advanced option) || [{atom(), atom(), integer()}] || - || {subject, [{'$gn',func,54}]} || 
     113|| files || The contents of the given files are matched against the database. || [Module::atom() | Filepath::string() | RegExp::string() | File::string()] || all files from the database || {files,["/usr/local/lib/erlang/lib/mnesia-4.5/src/mnesia_log.erl", mymod]} || 
    21114 
    22 === {{{matrix}}} algorithm === 
    23 This algorithm is a two-phase algorithm, which uses ''string similarity metrics'' to detect code clones during the first phase. Although, the normalised [http://en.wikipedia.org/wiki/Levenshtein_distance Levenshtein distance] or the [http://en.wikipedia.org/wiki/Sørensen_similarity_index Dice-Sorensen metric] are ready-to-use for the detection, user-defined metric (passed as a two-arity, anonymous function via options) can be requested. The ''maximum deviation'' is also controllable by the user, upper that the examined pair is not considered as a clone. We advice you to use 0.1 as a maximum deviation to use with Dice-Sorensen metric and 0.2 with Levenshtein metric. 
     115=== Options only available for the {{{matrix}}} algorithm === 
     116This algorithm is a two-phase algorithm, which uses ''string similarity metrics'' to detect code clones during the first phase. Although, the normalized [http://en.wikipedia.org/wiki/Levenshtein_distance Levenshtein distance] or the [http://en.wikipedia.org/wiki/Sørensen_similarity_index Dice-Sorensen metric] are ready-to-use for the detection, user-defined metric (passed as a two-arity, anonymous function via options) can be requested. The ''maximum deviation'' is also controllable by the user, upper that the examined pair is not considered as a clone. We advice you to use 0.1 as a maximum deviation to use with Dice-Sorensen metric and 0.2 with Levenshtein metric. 
    24117 
    25 It is likely to happen that a clone is divided into sub clones due to insertions, deletions or other kinds of modifications. It would be practical if a full clone could be gathered somehow, therefore we need to add a new parameter, called the ''invalid sequence length''. An invalid sequence length is the maximum length of a sequence whose middle elements can differ too much from each other. By introducing invalid sequence length, one can customise the allowable maximum deviation of a clone. 
     118It is likely to happen that a clone is divided into sub clones due to insertions, deletions or other kinds of modifications. It would be practical if a full clone could be gathered somehow, therefore we need to add a new parameter, called the ''invalid sequence length''. An invalid sequence length is the maximum length of a sequence whose middle elements can differ too much from each other. By introducing invalid sequence length, one can customize the allowable maximum deviation of a clone. 
    26119 
    27120Due to the high computational requirements of the algorithm, two variants exist: a caching and a non-caching one. If the computer has much free memory, the caching version should be chosen. 
     
    34127|| cache        || It is allowed to use cache or not || boolean() || false || {cache, true} || 
    35128 
    36 === {{{sw_metrics}}} algorithm === 
    37 This algorithm is a two-phase algorithm that points out duplicates by using software similarity metrics. Here, no extra parameter can be given. 
    38129 
    39 == Exemplars  == 
    40 Clone !IdentifiErl is only available through the ri interface. In this section, we show some illustrative examples to get familiar with this feature. 
    41 * Simpliest case. 
    42 {{{#!erlang 
    43 ri:clone_identifierl(). 
    44 }}} 
    45 * We are looking for all of the clones can be found in the database. We have much memory and a wide interest in any detectable clones, and we also allow a greater deviation of clones. 
    46 {{{#!erlang 
    47 ri:clone_identifierl([{algorithm, matrix}, {caching, true}, {max_invalid_seq_length, 3}, {diff_limit, 0.2}]). 
    48 }}} 
     130=== Options only available for the {{{suffix_tree}}} algorithm === 
     131The available parameters are summed up in the table below. 
     132||=Property=||=Description=||=Type=||=Default value=||=Example=|| 
     133|| files || Weaken the search space of the algorithm. By giving these option, only the given files will be the subject of the detector. || [Module::atom() | Filepath::string() | RegExp::string() | File::string()] || all files from the database || {files,["/usr/local/lib/erlang/lib/mnesia-4.5/src/mnesia_log.erl", mymod]} || 
     134|| minlen || The minimal number of tokens that a clone must contains || integer() || 32 || {minlen,50} || 
     135|| minnum || The minimal number of instances in one clone group || integer() || 2 || {minnum,3} || 
     136|| overlap || The allowed maximum number of tokens that duplicates can overlap each other || integer() || 0 || {overlap,2} || 
    49137 
    50 * We want to find the duplicates of a specific, newly introduced library function (lib_module:new_fun/2). 
    51 {{{#!erlang 
    52 ri:clone_identifierl([{func_list, [{lib_module, new_fun, 2}]}]). 
    53 }}} 
    54  
    55 * We want to find either the whole function or even its subsequences as duplicates of a specific, newly introduced library function (lib_module:new_fun/2). 
    56 {{{#!erlang 
    57 ri:clone_identifierl([{algorithm, matrix}, {func_list, [{lib_module, new_fun, 2}]}]). 
    58 }}} 
    59  
    60 * We want to find the duplicates of every function located in a library module (lib_module). 
    61 {{{#!erlang 
    62 {_,_,QFuns} = ris:q("mods[name=lib_module].funs"), 
    63 Funs = ris:unpack(QFuns), 
    64 ri:clone_identifierl([{subject, Funs}]). 
    65 }}} 
    66  
    67 * We want to find either the whole function or even its subsequences as duplicates of every function located in a library module (lib_module). 
    68 {{{#!erlang 
    69 {_,_,QTLEs} = ris:q("mods[name=lib_module].funs.body"), 
    70 TLEs = ris:unpack(QTLEs), 
    71 ri:clone_identifierl([{algorithm, matrix},{subject, TLEs}]). 
    72 }}} 
     138=== {{{filtered_suffix_tree}}} algorithm === 
     139The available parameters include the parameters of the {{{suffix_tree}}} algorithm and the following: 
     140||=Property=||=Description=||=Type=||=Default value=||=Example=|| 
     141|| max_invalid_seq_length || The allowed maximum length of an invalid sequence || non_neg_integer() || 1 || {max_invalid_seq_length, 0} ||