| 16 | [http://igraph.org/ Igraph] is a collection of network analysis tools. Some of these tools are used by Clone !IdentifiErl, because Igraph provides an efficient way to group results. |
| 17 | Although it is suggested to install Igraph because it can decrease the runtime of the duplicated code detection, Clone !IdentifiErl still works properly if Igraph has not been installed. |
| 18 | |
| 19 | Igraph source code can be downloaded from [http://igraph.org/c/#downloads here]. Install guide for Igraph can be found [http://igraph.org/c/#creq here]. |
| 20 | |
| 21 | If !RefactorErl had been compiled after that Igraph have been installed, !RefactorErl have to be recompiled to take the advantages of Igraph. This can be done as follows. |
| 22 | |
| 23 | 1. Quit !RefactorErl |
| 24 | 2. Type: bin/referl -build clean |
| 25 | 3. Type: bin/referl -build tool -igraph Path (where Path is the absolute path of the installed Igraph library. For example: /usr/local/lib/igraph) |
| 26 | If no compiled !RefactorErl instance is present, then only the last step is required to be done. |
| 27 | |
| 28 | |
| 29 | == Algorithms and provided services== |
| 30 | Each of the algorithms serves different purposes. The {{{matrix}}} algorithm works with smaller granularity (a top-level expression), and it is more sensitive to syntactic modificiations, thus it is more suitable for a detailed detection. The {{{sw_metrics}}} algorithm works with function pairs, and it utilises software metrics to point out duplicates. This algorithm is more fast and is not greatly influenced by the syntactic structures of the examined source code. The {{{suffix_tree}}} algorithm uses a [http://en.wikipedia.org/wiki/Suffix_tree Suffix tree] to detect clones. This algorithm can be used to detect functions, top-level expressions and expressions, but only finds parametric clones. Parametric clones are clones whose instances differ only in the identifiers or literals. The {{{filtered_suffix_tree}}} is based on the {{{suffix_tree}}} algorithm, but provides only clones that are straightforward to be eliminated. This algorithm ensures that the instances of the reported clones use similar records and refer to similar functions. |
| 31 | The {{{sw_metrics, matrix, matrixfilter}}} algorithms offer an efficient way to match modules against the whole database, while the {{{suffix_tree}}} based algorithms have the ability to decrease the entire search space. That is, only the given set of modules will be analysed to report duplicates whilst the rest of the database will not be considered. |
| 32 | |
| 33 | If a clone detection finished then the result will be saved using a key, and a name. The key is made of the algorithm and the given parameters. If a clone detection is started with a key that has been already saved, then the previous result associated with this key will be loaded instead of running the requested algorithm. The saved results can be queried later at any time, until the database has not been changed. All names must be unique. It is possible to overwrite a result associated with a name. [[BR]] |
| 34 | |
| 35 | A saved result can be queried using different output formats, for example: {{{prettyprint, file_and_loc}}}. |
| 36 | Regardless of the chosen algorithm all output format, which are illustrated below, can be requested. |
| 37 | * prettyprint: Returns the textual representation of the clones. |
| 38 | Command: {{{ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,prettyprint}]).}}} |
| 39 | {{{ |
| 40 | -------------------------------------------------------------------------------- |
| 41 | Clone element: (found in ucl_alg_dm): |
| 42 | q_gen(QueryETS) -> |
| 43 | fun(F, Ps) -> |
| 44 | [{F, Ps1}] = ets:lookup(QueryETS, F), |
| 45 | length(list_sort_intersect(Ps1, Ps)) |
| 46 | end. |
| 47 | |
| 48 | -------------------------------------------------------------------------------- |
| 49 | -------------------------------------------------------------------------------- |
| 50 | Clone element: (found in ucl_alg_dm): |
| 51 | u_gen(UseETS) -> |
| 52 | fun(P, Fs) -> |
| 53 | [{P, Fs1}] = ets:lookup(UseETS, P), |
| 54 | length(list_sort_intersect(Fs1, Fs)) |
| 55 | end. |
| 56 | -------------------------------------------------------------------------------- |
| 57 | }}} |
| 58 | |
| 59 | * file_and_loc: Returns file and position information related to the clones. When using this output format it is possible to set the position type. |
| 60 | The {{{linecol}}} position type returns the row and column where the found clone starts and ends. The scalar position type returns the character position where the clone starts and ends. |
| 61 | |
| 62 | Command:[[BR]] |
| 63 | {{{ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,file_and_loc}, {postype,linecol]).}}} |
| 64 | |
| 65 | |
| 66 | {{{ |
| 67 | [[[{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, |
| 68 | {startpos,{12,1}}, |
| 69 | {endpos,{16,8}}], |
| 70 | [{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, |
| 71 | {startpos,{41,1}}, |
| 72 | {endpos,{45,8}}]]] |
| 73 | }}} |
| 74 | |
| 75 | Command: [[BR]] |
| 76 | {{{ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,file_and_loc}, {postype,scalar]).}}} |
| 77 | |
| 78 | {{{ |
| 79 | [[[{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, |
| 80 | {startpos,{47}}, |
| 81 | {endpos,{121}}], |
| 82 | [{filepath,"/home/laptop/refactorerl/src/ucl_alg_dm.erl"}, |
| 83 | {startpos,{520}}, |
| 84 | {endpos,{576}}]]] |
| 85 | }}} |
| 86 | |
| 87 | * nodes: Returns the internal identifiers of the found clone groups. It is a good choice if you wish to further process the result by scripting using the ris interface. |
| 88 | Command: {{{ri:clone_identifierl([{algorithm,sw_metrics}, {files,[ucl_alg_dm]}, {format,nodes}]).}}} |
| 89 | {{{ |
| 90 | [[[{'$gn',form,2}],[{'$gn',form,3}]]] |
| 91 | }}} |
| 92 | |
| 93 | |
| 94 | |
| 95 | == Tuning the algorithms == |
| 96 | The common properties of the algorithms are summed up in the table below. All the properties are optional. |
| 97 | ||=Property=||=Description=||=Type=||=Default value=||=Example=|| |
| 98 | || algorithm || Detection takes places by the chosen algorithm. || sw_metrics | matrix | matrixfilter | suffix_tree | filtered_suffix_tree || sw_metrics || {algorithm, sw_metrics} || |
| 99 | || name || The name with the result will be saved. || atom() || the atom 'temp' concatenated with the timestamp || {name, temp201405101829} || |
| 100 | || enforce || If the name is already used and this option is given, then this enables to overwrite previous stored results. || bool() || false || {enforce, true} || |
| 101 | || format || The output format. (Only relevant in the ri interface) || prettyprint | file_and_loc | nodes || prettyprint || {format, file_and_loc} || |
| 102 | || output || The output file name with the results will be written in the [http://pnyf.inf.elte.hu/trac/refactorerl/wiki/StartUp#Options data dir]. || atom() || - || {output, mysearch} || |
| 103 | || postype || The preferred position type. || linecol | scalar || linecol || {postype, linecol} || |
| 104 | |
| 105 | |
| 106 | === Options only available for the {{{matrix,sw_metrics,matrixfilter}}} algorithms === |
18 | | || algorithm || Detection takes places by the chosen algorithm. || sw_metrics | matrix || sw_metrics || {algorithm, sw_metrics} || |
19 | | || func_list || The given functions are matched against the database, if this property is present. || [{Module::atom(), Function::atom(), Arity::integer()}] || - || [{my_mod, My_fun, 1}, {my_mod2, f, 0}] || |
20 | | || subject || The entities identified by the given IDs are matched against the database. (advanced option) || [{atom(), atom(), integer()}] || - || [{'$gn',func,54}] || |
| 111 | || func_list || The given functions are matched against the database. || [{Module::atom(), Function::atom(), Arity::integer()}] || - || {func_list, [{my_mod, My_fun, 1}, {my_mod2, f, 0}]} || |
| 112 | || subject || The entities identified by the given IDs are matched against the database. (advanced option) || [{atom(), atom(), integer()}] || - || {subject, [{'$gn',func,54}]} || |
| 113 | || files || The contents of the given files are matched against the database. || [Module::atom() | Filepath::string() | RegExp::string() | File::string()] || all files from the database || {files,["/usr/local/lib/erlang/lib/mnesia-4.5/src/mnesia_log.erl", mymod]} || |
39 | | == Exemplars == |
40 | | Clone !IdentifiErl is only available through the ri interface. In this section, we show some illustrative examples to get familiar with this feature. |
41 | | * Simpliest case. |
42 | | {{{#!erlang |
43 | | ri:clone_identifierl(). |
44 | | }}} |
45 | | * We are looking for all of the clones can be found in the database. We have much memory and a wide interest in any detectable clones, and we also allow a greater deviation of clones. |
46 | | {{{#!erlang |
47 | | ri:clone_identifierl([{algorithm, matrix}, {caching, true}, {max_invalid_seq_length, 3}, {diff_limit, 0.2}]). |
48 | | }}} |
| 130 | === Options only available for the {{{suffix_tree}}} algorithm === |
| 131 | The available parameters are summed up in the table below. |
| 132 | ||=Property=||=Description=||=Type=||=Default value=||=Example=|| |
| 133 | || files || Weaken the search space of the algorithm. By giving these option, only the given files will be the subject of the detector. || [Module::atom() | Filepath::string() | RegExp::string() | File::string()] || all files from the database || {files,["/usr/local/lib/erlang/lib/mnesia-4.5/src/mnesia_log.erl", mymod]} || |
| 134 | || minlen || The minimal number of tokens that a clone must contains || integer() || 32 || {minlen,50} || |
| 135 | || minnum || The minimal number of instances in one clone group || integer() || 2 || {minnum,3} || |
| 136 | || overlap || The allowed maximum number of tokens that duplicates can overlap each other || integer() || 0 || {overlap,2} || |
50 | | * We want to find the duplicates of a specific, newly introduced library function (lib_module:new_fun/2). |
51 | | {{{#!erlang |
52 | | ri:clone_identifierl([{func_list, [{lib_module, new_fun, 2}]}]). |
53 | | }}} |
54 | | |
55 | | * We want to find either the whole function or even its subsequences as duplicates of a specific, newly introduced library function (lib_module:new_fun/2). |
56 | | {{{#!erlang |
57 | | ri:clone_identifierl([{algorithm, matrix}, {func_list, [{lib_module, new_fun, 2}]}]). |
58 | | }}} |
59 | | |
60 | | * We want to find the duplicates of every function located in a library module (lib_module). |
61 | | {{{#!erlang |
62 | | {_,_,QFuns} = ris:q("mods[name=lib_module].funs"), |
63 | | Funs = ris:unpack(QFuns), |
64 | | ri:clone_identifierl([{subject, Funs}]). |
65 | | }}} |
66 | | |
67 | | * We want to find either the whole function or even its subsequences as duplicates of every function located in a library module (lib_module). |
68 | | {{{#!erlang |
69 | | {_,_,QTLEs} = ris:q("mods[name=lib_module].funs.body"), |
70 | | TLEs = ris:unpack(QTLEs), |
71 | | ri:clone_identifierl([{algorithm, matrix},{subject, TLEs}]). |
72 | | }}} |
| 138 | === {{{filtered_suffix_tree}}} algorithm === |
| 139 | The available parameters include the parameters of the {{{suffix_tree}}} algorithm and the following: |
| 140 | ||=Property=||=Description=||=Type=||=Default value=||=Example=|| |
| 141 | || max_invalid_seq_length || The allowed maximum length of an invalid sequence || non_neg_integer() || 1 || {max_invalid_seq_length, 0} || |