Development of an Efficient Semantic Code Clone Detection Technique
Abstract
Over the last few years, code clones have emerged as an active area of research because
of their wide range of applications in different domains of software engineering. Similar
code fragments that exist at different locations are called code clones; they are typically
the result of copy-and-paste activities. Code clones are reported in the form of clone pairs,
and clone pairs are further clustered to form code clone groups. Code clones are broadly
categorized into four types, Type 1 to Type 4. Numerous code clone detection techniques
exist in the literature to find different types of code clones. Knowledge extraction from
existing software resources for maintenance, re-engineering and bug removal through
code clone detection is an integral part of software systems. Code clone detection techniques
are mainly classified into text-based, token-based, tree-based, metric-based and
semantic techniques.
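The clustering of clone pairs into clone groups mentioned above can be illustrated with a minimal Python sketch. It assumes that the clone relation may be treated as transitive, so that clone groups are the connected components of the clone-pair graph; the function name group_clone_pairs and the fragment labels are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def group_clone_pairs(pairs):
    """Cluster clone pairs into clone groups (connected components).

    Assumption: if A clones B and B clones C, all three belong to one group.
    """
    parent = {}

    def find(x):
        # Locate the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for fragment in parent:
        groups[find(fragment)].add(fragment)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical fragments, written as file:start-end.
pairs = [
    ("A.java:10-20", "B.java:5-15"),
    ("B.java:5-15", "C.java:1-11"),
    ("D.java:3-9", "E.java:7-13"),
]
clone_groups = group_clone_pairs(pairs)
```

Here the first two pairs share a fragment and collapse into one group of three, while the third pair forms a group of its own. Near-miss clones are not always transitive in practice, so a real detector may need a stricter grouping criterion.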
Most of the existing semantic code clone detection techniques in the literature are based
on the comparison of program dependence graphs through subgraph isomorphism,
which is NP-complete. Moreover, these techniques are unable to provide a heuristic
solution for problems such as statement reordering, inversion of control predicates and
insertion of irrelevant statements, which may cause a performance bottleneck. To address
these issues, we proposed a novel approach that finds semantic code clones between code
fragments using data flow analysis on the basis of reaching definition and liveness analysis.
The algorithm is designed to find similar code fragments that are structurally divergent
but semantically equivalent. The results obtained demonstrate that the proposed approach
is effective in the detection of semantic code clones for various applications. Results
obtained on subject systems taken from the DaCapo benchmark confirm the effectiveness
of the proposed approach.
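To make the role of the two data flow analyses concrete, the following Python sketch compares straight-line code fragments. It is an illustration under simplifying assumptions, not the algorithm developed in the thesis: fragments contain no branches or loops, each statement is a hypothetical (target, expression, uses) triple, and a definition is identified by its statement text. A backward liveness pass discards statements whose result is never used, and a forward reaching-definitions pass links every use to the definition that reaches it.

```python
def live_statements(block, live_out):
    """Backward liveness pass: keep statements whose target is live afterwards."""
    live = set(live_out)
    kept = []
    for target, expr, uses in reversed(block):
        if target in live:
            live.discard(target)   # the definition kills the variable
            live.update(uses)      # its operands become live before it
            kept.append((target, expr, uses))
    kept.reverse()
    return kept


def def_use_signature(block, live_out):
    """Forward reaching-definitions pass over the live statements.

    Returns a set of (reaching definition, variable, using statement) edges.
    """
    reaching = {}  # variable -> text of the definition that currently reaches
    edges = set()
    for target, expr, uses in live_statements(block, live_out):
        stmt = f"{target} = {expr}"
        for var in uses:
            edges.add((reaching.get(var, "<input>"), var, stmt))
        reaching[target] = stmt
    for var in live_out:
        edges.add((reaching.get(var, "<input>"), var, "<exit>"))
    return frozenset(edges)


fragment_a = [
    ("s", "x + y", ("x", "y")),
    ("p", "x * y", ("x", "y")),
    ("r", "s - p", ("s", "p")),
]
# Same computation with two statements reordered and an irrelevant one inserted.
fragment_b = [
    ("p", "x * y", ("x", "y")),
    ("tmp", "x + 1", ("x",)),
    ("s", "x + y", ("x", "y")),
    ("r", "s - p", ("s", "p")),
]
# A fragment that computes something different.
fragment_c = [
    ("s", "x + y", ("x", "y")),
    ("p", "x * y", ("x", "y")),
    ("r", "s + p", ("s", "p")),
]

same_ab = def_use_signature(fragment_a, {"r"}) == def_use_signature(fragment_b, {"r"})
same_ac = def_use_signature(fragment_a, {"r"}) == def_use_signature(fragment_c, {"r"})
```

With r as the only live-out variable, fragment_a and fragment_b yield the same signature despite the reordering and the inserted statement, while fragment_c does not. Inversion of control predicates would require a control flow graph and is outside the scope of this sketch.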
Further, code clone groups are extracted among different versions of program
files distributed over thousands of commit hashes in a distributed version control system
(DVCS). Code clone group extraction has many software applications that help in refactoring
and maintenance of code in open source software systems. The evolution of code
clone groups across the history of a software system is termed code clone genealogy.
Most of the existing solutions for code clone group extraction are based on text similarity
among different versions of program files stored in a centralized version control system
(CVS). However, existing proposals in the literature fail to extract code clone groups
among different versions of program files stored in a distributed version control system.
To address these issues, we presented a novel Git code clone group extraction model
based on transitive closure computation on directed acyclic graphs using Big Data technologies.
Our insight is to extract clone pairs from thousands of commits on a software
system in Git by transitive closure computation, and to map clone pair parameters
in the genealogy to extract code clone evolution patterns in a graph database (Neo4j). We
efficiently detected code clone genealogies on a Git-based e-health care system and created
a scalable solution. We performed evaluations on OpenMRS, an open source e-health
system on Git, and presented interesting code clone evolution relationships in the code clone
genealogy. The performance of the proposed approach is evaluated using parameters
such as transitive depth, ratio of similarity and count of clones.
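The transitive closure step can be sketched in a few lines of Python. This is an in-memory illustration only and does not reproduce the Big Data or Neo4j pipeline described above. It assumes that each node is a hypothetical commit:fragment label, that each edge links a fragment to its clone in a later commit, and that "transitive depth" means the number of edges on the shortest path between two fragments.

```python
from collections import deque


def transitive_closure(edges):
    """Return {node: {reachable node: depth}} for a directed acyclic graph."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
        successors.setdefault(dst, set())

    closure = {}
    for start in successors:
        depths = {}
        queue = deque((nxt, 1) for nxt in successors[start])
        while queue:
            node, depth = queue.popleft()
            if node in depths:
                continue  # already reached along a path that is no longer
            depths[node] = depth
            queue.extend((nxt, depth + 1) for nxt in successors[node])
        closure[start] = depths
    return closure


# Hypothetical clone pairs between consecutive commits c1 -> c2 -> c3.
edges = [
    ("c1:f", "c2:f"),
    ("c2:f", "c3:f"),
    ("c2:f", "c3:g"),
]
closure = transitive_closure(edges)
```

In this example the fragment c1:f reaches c2:f at depth 1 and both c3:f and c3:g at depth 2, so all four fragments fall into one genealogy even though c1:f was never compared directly with the fragments in c3.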
