Changes between Initial Version and Version 1 of CloneIdentifiErl


Timestamp: Jan 2, 2014, 5:54:18 PM
Author: manualwiki
= Clone !IdentifiErl =

== Overview ==
Clone !IdentifiErl is a prototype duplicate-code detector that provides two separate ways to identify code clones.

* The {{{matrix}}} algorithm is an AST/metric-based detector that finds clones consisting of at least one expression.
* The {{{sw_metrics}}} algorithm is a software-metrics-based detector that identifies similar function pairs.

Currently, it is only available via the ri module: ri:clone_identifierl/0, ri:clone_identifierl/1.

== Algorithms ==
There are two separate algorithms, which serve different purposes. The {{{matrix}}} algorithm works at a finer granularity (a top-level expression) and is more sensitive to syntactic modifications, so it is better suited to detailed detection. The {{{sw_metrics}}} algorithm works with function pairs and uses software metrics to point out duplicates. It is faster and is not greatly influenced by the syntactic structure of the examined source code.

Both algorithms can examine the entire database (the default), or they can match user-given functions against the database.

The common properties of the algorithms are summed up in the table below. All the properties are optional.
||=Property=||=Description=||=Type=||=Default value=||=Example=||
|| algorithm || Detection is performed by the chosen algorithm. || sw_metrics | matrix || sw_metrics || {algorithm, sw_metrics} ||
|| func_list || If this property is present, the given functions are matched against the database. || [{Module::atom(), Function::atom(), Arity::integer()}] || - || {func_list, [{my_mod, my_fun, 1}, {my_mod2, f, 0}]} ||
|| subject || The entities identified by the given IDs are matched against the database. (advanced option) || [{atom(), atom(), integer()}] || - || {subject, [{'$gn',func,54}]} ||

=== {{{matrix}}} algorithm ===
This is a two-phase algorithm that uses ''string similarity metrics'' to detect code clones during the first phase. The normalised [http://en.wikipedia.org/wiki/Levenshtein_distance Levenshtein distance] and the [http://en.wikipedia.org/wiki/Sørensen_similarity_index Dice-Sorensen metric] are available out of the box, and a user-defined metric (passed as a two-arity anonymous function via the options) can also be used. The user can also control the ''maximum deviation'': a pair whose deviation exceeds it is not considered a clone. We advise a maximum deviation of 0.1 with the Dice-Sorensen metric and 0.2 with the Levenshtein metric.

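As an illustration of what a user-defined metric might look like, the sketch below computes the Dice-Sorensen coefficient over character bigrams. The module {{{metric_sketch}}} and its functions are hypothetical and are not part of the tool. The function returns a similarity (1.0 for identical strings), mirroring the example fun given for the {{{metric}}} property; whether the detector expects a similarity or a deviation from a custom metric is not stated on this page, so that direction is an assumption.

{{{#!erlang
%% Hypothetical helper module, not part of the tool.
-module(metric_sketch).
-export([dice/2, bigrams/1]).

%% Adjacent character pairs of a string, e.g. "abc" -> ["ab", "bc"].
bigrams([A, B | T]) -> [[A, B] | bigrams([B | T])];
bigrams(_) -> [].

%% Dice-Sorensen coefficient over bigram multisets:
%% 2 * |X intersect Y| / (|X| + |Y|), in the range 0.0 .. 1.0.
dice(S, S) -> 1.0;
dice(S1, S2) ->
    B1 = bigrams(S1),
    B2 = bigrams(S2),
    case length(B1) + length(B2) of
        0 -> 0.0;  % both strings are too short to have bigrams
        Total ->
            %% B1 -- B2 drops one occurrence per element of B2,
            %% so the difference in length is the multiset intersection size.
            Common = length(B1) - length(B1 -- B2),
            2 * Common / Total
    end.
}}}

Under these assumptions, the metric could be wrapped in an anonymous function and passed via the options:

{{{#!erlang
Metric = fun(A, B) -> metric_sketch:dice(A, B) end,
ri:clone_identifierl([{algorithm, matrix}, {metric, Metric}]).
}}}
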
A clone is likely to be split into sub-clones by insertions, deletions or other kinds of modifications. To gather the full clone, a further parameter is needed, called the ''invalid sequence length''. It is the maximum length of a sequence whose middle elements can differ too much from each other. By setting the invalid sequence length, one can customise how much deviation is tolerated inside a clone.

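One plausible reading of the invalid sequence length can be sketched as follows. This is not the tool's implementation: the module {{{invalid_seq_sketch}}}, the function {{{clones/3}}} and the input shape (one deviation value per aligned expression pair) are all assumptions made for illustration. Positions whose deviation is within the limit count as matching, and matching runs are merged when at most {{{MaxInvalid}}} consecutive non-matching positions lie between them.

{{{#!erlang
%% Hypothetical illustration, not part of the tool.
-module(invalid_seq_sketch).
-export([clones/3]).

%% clones(Deviations, DiffLimit, MaxInvalid) -> [{First, Last}]
%% Returns 1-based index ranges of merged clone candidates.
clones(Devs, Limit, MaxInvalid) ->
    Indexed = lists:zip(lists:seq(1, length(Devs)), Devs),
    Valid = [I || {I, D} <- Indexed, D =< Limit],
    merge(Valid, MaxInvalid).

merge([], _Max) -> [];
merge([I | Rest], Max) -> merge(Rest, Max, I, I).

%% Extend the current range while the gap of invalid positions is small enough.
merge([I | Rest], Max, First, Last) when I - Last - 1 =< Max ->
    merge(Rest, Max, First, I);
%% Otherwise close the current range and start a new one.
merge([I | Rest], Max, First, Last) ->
    [{First, Last} | merge(Rest, Max, I, I)];
merge([], _Max, First, Last) ->
    [{First, Last}].
}}}

With the deviations {{{[0.0, 0.05, 0.9, 0.0, 0.9, 0.9, 0.0]}}} and a limit of 0.1, an invalid sequence length of 0 yields three separate ranges, while a length of 1 bridges the single outlier at position 3 and reports positions 1 to 4 as one range.
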
Because the algorithm is computationally expensive, two variants exist: a caching and a non-caching one. If the computer has plenty of free memory, the caching variant should be chosen.

The concepts introduced above are the parameters of the algorithm, which can be given in a proplist. The possible properties are summed up in the table below. All the properties are optional.
||=Property=||=Description=||=Type=||=Default value=||=Example=||
|| metric || The metric used to measure similarity || leveinstein | dice_sorensen | fun((A::string(),B::string())->0.0 .. 1.0) || dice_sorensen || {metric, fun(A,A)->1.0; (A,B)->0.0 end} ||
|| diff_limit || The allowed maximum deviation || 0.0 .. 1.0 || 0.1 || {diff_limit, 0.0} ||
|| max_invalid_seq_length || The allowed maximum length of an invalid sequence || non_neg_integer() || 1 || {max_invalid_seq_length, 0} ||
|| cache || Whether the caching variant is used || boolean() || false || {cache, true} ||

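Since every property is optional, it can help to see how a partial proplist combines with the documented defaults. The sketch below is only an illustration: the module {{{matrix_opts_sketch}}} and the function {{{with_defaults/1}}} are hypothetical, and the way the tool itself reads its options is not shown on this page. Only the property names and default values are taken from the table.

{{{#!erlang
%% Hypothetical helper, not part of the tool.
-module(matrix_opts_sketch).
-export([with_defaults/1]).

%% Default values as listed in the property table.
defaults() ->
    [{metric, dice_sorensen},
     {diff_limit, 0.1},
     {max_invalid_seq_length, 1},
     {cache, false}].

%% Return every matrix property, using the caller's value where one is given.
with_defaults(Opts) ->
    [{Key, proplists:get_value(Key, Opts, Default)}
     || {Key, Default} <- defaults()].
}}}

For example, {{{matrix_opts_sketch:with_defaults([{cache, true}])}}} keeps the default metric, deviation limit and invalid sequence length, and only switches caching on.
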
=== {{{sw_metrics}}} algorithm ===
This is a two-phase algorithm that points out duplicates by using software similarity metrics. It takes no extra parameters.

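The page does not say which metrics are used or how they are compared, so the following is only a generic sketch of the idea of metric-based pairing, not the tool's method. Each function is assumed to be described by a list of numeric metric values, and two functions are paired when every metric differs by at most a tolerance. The module {{{sw_metrics_sketch}}}, the input shape and the tolerance are all hypothetical.

{{{#!erlang
%% Hypothetical illustration, not part of the tool.
-module(sw_metrics_sketch).
-export([similar_pairs/2]).

%% Funs :: [{Name, [number()]}], one metric vector per function.
%% Returns each unordered pair once, using term order on the names.
similar_pairs(Funs, Tol) ->
    [{N1, N2} || {N1, V1} <- Funs, {N2, V2} <- Funs,
                 N1 < N2, close(V1, V2, Tol)].

%% Two vectors are close when they have the same length and
%% every pair of corresponding values differs by at most Tol.
close(V1, V2, Tol) when length(V1) =:= length(V2) ->
    lists:all(fun({A, B}) -> abs(A - B) =< Tol end, lists:zip(V1, V2));
close(_, _, _) ->
    false.
}}}

Because only the metric vectors are compared, two functions with different syntax but similar measurements would still be paired, which matches the description of this algorithm as less sensitive to syntactic structure.
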
== Examples ==
Clone !IdentifiErl is only available through the ri interface. This section shows some illustrative examples to help you get familiar with the feature.
* The simplest case.
{{{#!erlang
ri:clone_identifierl().
}}}
* We are looking for all the clones that can be found in the database. We have plenty of memory, we are interested in any detectable clone, and we also allow a greater deviation between clones.
{{{#!erlang
ri:clone_identifierl([{algorithm, matrix}, {cache, true}, {max_invalid_seq_length, 3}, {diff_limit, 0.2}]).
}}}

* We want to find the duplicates of a specific, newly introduced library function (lib_module:new_fun/2).
{{{#!erlang
ri:clone_identifierl([{func_list, [{lib_module, new_fun, 2}]}]).
}}}

* We want to find duplicates of a specific, newly introduced library function (lib_module:new_fun/2), matching either the whole function or subsequences of it.
{{{#!erlang
ri:clone_identifierl([{algorithm, matrix}, {func_list, [{lib_module, new_fun, 2}]}]).
}}}

* We want to find the duplicates of every function located in a library module (lib_module).
{{{#!erlang
{_,_,QFuns} = ris:q("mods[name=lib_module].funs"),
Funs = ris:unpack(QFuns),
ri:clone_identifierl([{subject, Funs}]).
}}}

* We want to find duplicates of every function located in a library module (lib_module), matching either whole functions or subsequences of them.
{{{#!erlang
{_,_,QTLEs} = ris:q("mods[name=lib_module].funs.body"),
TLEs = ris:unpack(QTLEs),
ri:clone_identifierl([{algorithm, matrix},{subject, TLEs}]).
}}}