A Plagiarism Detection System in Computer Source Code

International Journal of Computer Science Research and Application 2013, Vol. 03, Issue 01(Special Issue), pp. 22-30 ISSN 2012-9564 (Print) ISSN 2012-...



International Journal of Computer Science Research and Application 2013, Vol. 03, Issue 01(Special Issue), pp. 22-30 ISSN 2012-9564 (Print) ISSN 2012-9572 (Online) © Daniela Marinescu. Authors retain all rights. IJCSRA has been granted the right to publish and share, Creative Commons 3.0

INTERNATIONAL JOURNAL OF COMPUTER SCIENCE RESEARCH AND APPLICATION www.ijcsra.org

A Plagiarism Detection System in Computer Source Code

Daniela Marinescu1, Alexandra Băicoianu2, Sebastian Dimitriu3

1 Transilvania University of Braşov, E-mail: [email protected]
2 Transilvania University of Braşov, E-mail: [email protected]
3 iQuest Technologies, Braşov, E-mail: [email protected]

Author Correspondence: Transilvania University of Brasov, Department of Mathematics and Computer Science, Iuliu Maniu Str., No. 50, Braşov, 500091, ROMANIA, Telephone/Fax +40 268 414016, E-mail: [email protected]

Abstract

This paper is an extended version of our earlier paper on plagiarism detection (Marinescu et al., 2012). Plagiarism is a serious problem, especially in education and research. Because of the huge amount of information available on the Internet, plagiarism is very tempting, especially for students. The available software for checking plagiarism is too expensive, and open-source alternatives are few and unreliable. This is a difficult problem for the education sector, not only for document plagiarism but also for computer source code plagiarism. The objective of this paper is to describe an application for detecting plagiarism in computer source code, because this kind of plagiarism is very frequent in education. The tool analyses several projects at the same time and, using an efficient algorithm, decides whether the code is unique or not. The application was developed to detect plagiarism disguised by code refactoring, not other kinds of plagiarism. It looks for evidence of cheating and works at different levels of similarity, from low-level to high-level similarities.

Keywords: Plagiarism detection, Source code, Source code similarity, Software, University education.

1. Introduction

Plagiarism is not a new problem, but it is more pressing today because digital documents are easily copied. Plagiarism can be found in virtually any field, including literature, scientific papers, art designs and source code. Plagiarism detection is the process of locating instances of plagiarism within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others. Most cases of plagiarism are found in academia, where documents are typically essays or reports. For computer science students there are two sources of solved assignments in programming courses: the Internet and other students. Of course, assignments can be formulated in such a way that the solution is difficult to find on the Internet. But it is difficult to set completely different assignments for every student, so plagiarizing source code from other students is very tempting. The easiest way of cheating is to make some changes to a working program: changing the names of the variables, changing the declarations (e.g. changing the positions of declared variables or declaring extra constants), changing statement spacing, changing comments, or adding redundant statements or variables.


Manually comparing all pairs of programs for evidence of plagiarism becomes infeasible for large classes, so an automated tool for detecting plagiarism in computer source code is necessary in education. Plagiarism detection can be either manual or computer-assisted. Manual detection requires substantial effort and excellent memory, and it is impractical when too many documents must be compared or when the original documents are not available for comparison. Computer-assisted detection allows vast collections of documents to be compared to each other, making successful detection much more likely.

plagiarism.org is an online service for plagiarism detection. (Clough, 2000) gives more examples of tools used for plagiarism detection in natural and programming languages: IntegriGuard, EVE2, CopyCatch, Glatt Plagiarism Services Inc, WordCheck Keyword Software, YAP3, JPlag, MOSS, SIM, and Siff.

Different techniques are used for plagiarism detection. A string-based approach was proposed by Baker (Baker, 1995), which compares lists of tokens to find the longest common sequence of tokens. Wise proposed the YAP algorithm in (Wise, 1996). Using a dynamic programming string alignment technique, the authors of (Gitchell&Tran, 1998) presented the SIM plagiarism detection system. In academia, three systems used for detecting plagiarism in programs, JPlag (Prechelt et al, 2002), MOSS (Schleimer et al, 2003) and SID (Chen et al, 2004), are token-based tools. CCFinder, proposed in (Kamiya et al, 2002), is a clone detection technique with transformation rules and a token-based comparison. In (Li et al, 2004) a tool is developed for finding copy-paste and related defects in operating system code, and in (Qu et al, 2010) tokenization techniques are used to detect cloned code in software systems.

Another problem is source code plagiarism by translating a program from one programming language into another. This problem is called cross-language plagiarism, and it is frequently found in natural languages too. Solving it is more complicated because the resulting program is not an exact translation of the original one, owing to implementation differences. To detect this kind of cross-language plagiarism, Arwin and Tahaghoghi (Arwin&Tahaghoghi, 2006) first translate the two programs into the intermediate language produced by a compiler and then compare the two intermediate programs. Another technique for detecting cross-language source code reuse, presented in (Flores et al. 2011), is based on comparing character n-grams; it was developed at Universidad Politecnica de Valencia.

In this paper we present an application designed to effectively detect, and thereby prevent, plagiarism. The goal of this program is to reduce the impact of plagiarism on education and educational institutions. At present, it is used in various laboratories at the Faculty of Mathematics and Computer Science of Transilvania University of Braşov. The application finds duplicated code blocks and can be used to detect plagiarism in software written in Java, C++ and C#, but it can be extended to more programming languages. The algorithm behind this application is the winnowing algorithm (Schleimer et al, 2003), developed by Schleimer S., Wilkerson D., and Aiken A. at Stanford University, and it is a significant improvement over other cheating detection algorithms (at least, over those known to us).

1.1. The plagiarism problem: Institutional context

Plagiarism is not a new problem, but given the range of easily accessible electronic resources in recent times, it has become much easier for students to 'cut and paste' and to substitute their own signature. This can lead to assignments being submitted that are inadequately referenced, or that are largely or entirely the work of someone else. Plagiarism is not only copying the words, code or ideas of another person, which can be avoided by paraphrasing the sentences; it is also using the words, ideas, data, organization and original thoughts of another person without giving credit to that individual. Such activities can have serious consequences, sometimes including legal penalties. The growing problem of plagiarism in schools, colleges and even universities has led the authorities concerned to take it quite seriously. Its consequences in the academic field depend on the nature of the offence and the number of times it is committed. The penalties can range from failing an assignment, loss of privileges or a low grade to not being promoted to a higher standard, academic probation and even expulsion from an educational institution. Academics can develop various strategies to police plagiarism, ranging from simple Web search techniques used by individual lecturers to the development of software capable of tracking plagiarism between students. The latter strategy is the one we mostly use.


2. The Winnowing algorithm

Generally, a copy-detection algorithm should have three main properties:
1. Whitespace insensitivity: in matching text files, matches should be unaffected by such things as extra whitespace, capitalization, punctuation, etc. In other domains the notion of what strings should be equal is different; for example, in matching software text it is desirable to make matching insensitive to variable names.
2. Noise suppression: discovering short matches, such as the fact that the word "the" appears in two different documents, is uninteresting. Any match must be large enough to imply that the material has been copied and is not simply a common word or idiom of the language in which the documents are written.
3. Position independence: coarse-grained permutation of the contents of a document (e.g., scrambling the order of paragraphs) should not affect the set of discovered matches. Adding to a document should not affect the set of matches in the original portion of the new document. Removing part of a document should not affect the set of matches in the portion that remains.

In (Schleimer et al, 2003) the authors try to find the most important property of a document. For this purpose they introduce the class of local document fingerprinting algorithms for detecting copies and propose an efficient algorithm, named winnowing. For our Source Code Plagiarism Detector we used the winnowing algorithm to select fingerprints from hashes. It was developed in 2003 at Stanford University by Schleimer S., Wilkerson D. and Aiken A., and it is a local document fingerprinting algorithm that is efficient and guarantees that matches of a certain length are detected. A series of experiments showing the effectiveness of winnowing on real data is available in (Schleimer et al, 2003).

In short, given a set of documents, we want to find substring matches between them that satisfy two properties:
1. If there is a substring match at least as long as the guarantee threshold, t, then this match is detected, and
2. We do not detect any matches shorter than the noise threshold, k.

The constants t and k ≤ t are chosen by the user. We avoid matching strings below the noise threshold by considering only hashes of k-grams, that is, of contiguous substrings of length k. The larger k is, the more confident we can be that matches between documents are not coincidental. On the other hand, larger values of k also limit the sensitivity to reordering of document contents, as we cannot detect the relocation of any substring of length less than k. Thus, it is important to choose k to be the minimum value that eliminates coincidental matches. For our detector we implemented this algorithm, which we consider a significant improvement over other cheating detection algorithms.
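The two thresholds can be made concrete with a minimal sketch of winnowing in Python. It follows the description in (Schleimer et al, 2003): hash every k-gram, slide a window of w = t - k + 1 hashes over the sequence, and keep the minimum hash of each window (the rightmost one on ties). The function names, the choice of BLAKE2 as hash function and the sample thresholds are assumptions made for illustration; this is not the code of Source Code Plagiarism Detector.

```python
import hashlib


def kgram_hashes(text, k):
    """Return one 64-bit hash per k-gram (substring of length k) of text."""
    return [
        int.from_bytes(
            hashlib.blake2b(text[i:i + k].encode("utf-8"), digest_size=8).digest(),
            "big",
        )
        for i in range(len(text) - k + 1)
    ]


def winnow(hashes, w):
    """Pick the minimum hash of every window of w consecutive hashes.

    Ties go to the rightmost minimum, and a position already selected by
    the previous window is not recorded again. Returns (hash, position)
    pairs so that a match can be traced back to its place in the text.
    """
    selected = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = w - 1 - window[::-1].index(smallest)  # rightmost minimum
        pos = start + offset
        if pos != last_pos:
            selected.append((smallest, pos))
            last_pos = pos
    return selected


def fingerprint(text, k, t):
    """Fingerprint text with noise threshold k and guarantee threshold t."""
    if not 0 < k <= t:
        raise ValueError("thresholds must satisfy 0 < k <= t")
    return winnow(kgram_hashes(text, k), t - k + 1)
```

Any substring of length t contains exactly w consecutive k-grams, so two documents that share such a substring select the same minimum hash from that window; this is where the detection guarantee comes from. In this sketch a text shorter than t yields no complete window and therefore an empty fingerprint.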

2.1 Source Code Plagiarism Detector

Source Code Plagiarism Detector is an application for determining the similarity of programs. Its main objective is the detection of plagiarism in programming classes. Since its development (beginning of 2012), the detector has been very effective in this role. The algorithm behind this detector is a significant improvement over other detection algorithms. Source Code Plagiarism Detector can currently analyze code written in Java, C++ or C#, and the list of supported programming languages can be extended. This is done by adding an XML file with a specific format, shown in Figure 1, to the definitions folder. The file contains the name of the programming language, the file extension associated with it, files that should be excluded even if they have the right extension, patterns for blocks that should be ignored (comments, imports, etc.), keywords, operators and delimiters. By default the application works for all the languages mentioned above, but it offers support for other programming languages too. This can be changed from the language definition window, depending on the user's needs. For a Java assignment, for example, it does not matter that an HTML file is similar. The application is internationalized so that it is easy for everyone to use.


Figure 1. The XML file for the definitions folder.
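As a rough illustration of the fields such a definition file carries (language name, file extension, excluded files, ignored patterns, keywords, operators and delimiters), the Python sketch below embeds a hypothetical definition for Java and reads it with the standard library. The element and attribute names, the sample patterns and the `load_definition` helper are assumptions and are not taken from the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical definition file: every element and attribute name here is
# an assumption made for illustration, not the application's real schema.
JAVA_DEFINITION = r"""<language name="Java">
  <extension>.java</extension>
  <exclude>package-info.java</exclude>
  <ignore>/\*.*?\*/</ignore>
  <ignore>//[^\n]*</ignore>
  <ignore>import\s[^;]*;</ignore>
  <keywords>if else for while return class</keywords>
  <operators>+ - * / = == &lt; &gt;</operators>
  <delimiters>( ) { } ; ,</delimiters>
</language>
"""


def load_definition(xml_text):
    """Parse one language definition into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("name"),
        "extensions": [e.text for e in root.findall("extension")],
        "excluded_files": [e.text for e in root.findall("exclude")],
        "ignore_patterns": [e.text for e in root.findall("ignore")],
        "keywords": root.findtext("keywords", default="").split(),
        "operators": root.findtext("operators", default="").split(),
        "delimiters": root.findtext("delimiters", default="").split(),
    }


definition = load_definition(JAVA_DEFINITION)
```

Keeping the language description in data rather than in code is what allows a new language to be supported by dropping one more file into the definitions folder.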

The main steps for using this application are:

1. Select Configuration from the File menu and set the desired values, as in Figure 2.

Figure 2. Setting the desired values


2. Select Language definitions from the File menu to choose the programming language, as in Figure 3. Currently there are definitions for Java, C++ and C#, but more programming languages can be defined.

Figure 3: Select Language definitions from File menu

3. Select Language to choose the desired natural language (Romanian or English), as in Figure 4.

Figure 4: Select Language (Romanian or English) from menu


4. Load the desired projects into the application, as in Figure 5. This analyses the directories and detects the code files and the language they are written in. To load the projects:
• Push "Add projects" (the third button with the "+" sign) and select the folder that contains the projects.
• Then use the first "+" button, "Add project", for each selected project.
• Press "Remove project" to remove a project from the list.

Figure 5: Loading projects to compare with the application

5. Push the "Start analysis" button (Figure 6). This parses the content of the files detected in the previous step and generates a fingerprint for each of them.

Figure 6: Starting the analysis


6. Select the desired filter for the displayed results and push the "Show reports" button, as in Figure 7. This generates project pair reports based on the similarity of the fingerprints. More precisely, it counts the identical hashes detected in the fingerprints.

Figure 7: Select the desired filter

7. Select one or more of the project pairs for a more detailed inspection, as in Figure 8. Usually, projects with a high similarity factor need detailed inspection. This generates file pair reports for the selected project pair, which indicate the similarity factor at file level.

Figure 8: Select one of the file pairs for inspection


8. Push "Highlight all sections" and then "Highlight next sections". The application displays a side-by-side view in which the chunks of code detected as similar are highlighted (Figure 9).

Figure 9: The similarity factor at file level
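Steps 6 and 7 rank pairs of projects (and then pairs of files) by how many hashes their fingerprints share. The paper does not give the exact formula for the similarity factor, so the Python sketch below assumes one plausible choice: the percentage of distinct hashes in common, relative to the smaller fingerprint. The names `similarity` and `pair_reports`, and the threshold used as a filter, are hypothetical.

```python
from itertools import combinations


def similarity(fp_a, fp_b):
    """Percentage of shared distinct hashes, relative to the smaller set.

    This normalisation is an assumption; the application may use another.
    """
    a, b = set(fp_a), set(fp_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


def pair_reports(fingerprints, threshold=0.0):
    """Compare every pair of projects and keep those at or above threshold.

    fingerprints maps a project name to its collection of hashes.
    Returns (name_a, name_b, score) tuples, most similar pair first.
    """
    reports = []
    for name_a, name_b in combinations(sorted(fingerprints), 2):
        score = similarity(fingerprints[name_a], fingerprints[name_b])
        if score >= threshold:
            reports.append((name_a, name_b, score))
    return sorted(reports, key=lambda report: report[2], reverse=True)
```

Dividing by the smaller fingerprint means that a short program copied whole into a much longer one still receives a high score; dividing by the size of the union of the two sets would penalise that case.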

The role of the parsing module is to reduce the amount of text that needs to be indexed (fingerprinted), which improves performance and accuracy. The parsing step excludes redundant text (comments, imports, etc.) and replaces each token (keyword, operator or delimiter) detected in the file with a unique key for that token type. The keys have a fixed length (1 character) so that the amount of fingerprinted text is minimal and each token is equally important; otherwise, for example, similar long keywords would carry more weight than similar operators. File parsing is done in parallel, so the application scales well on more powerful machines.

The fingerprinting module generates a fingerprint (a collection of hashes) for each file, based on the content obtained after parsing. Each hash is indexed so that the similarities can be traced back to where they occur in the original files. Just like parsing, fingerprinting is done in parallel. It is based on the winnowing algorithm developed at Stanford University (Schleimer et al, 2003).
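The parsing step can be sketched in Python as follows. The token list, the ignore patterns and the helper names are hypothetical, and the sketch assumes that identifiers and literals are simply dropped, which the paper does not state explicitly. A thread pool stands in for the parallel parsing; it only illustrates the structure, since threads do not give CPU parallelism in standard Python.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Text to ignore: block comments, line comments, import lines (assumed set).
IGNORED = [
    re.compile(r"/\*.*?\*/", re.S),
    re.compile(r"//[^\n]*"),
    re.compile(r"^\s*import\b[^\n]*", re.M),
]

# A tiny assumed token set; each token type gets a one-character key.
TOKENS = ["if", "else", "for", "while", "return", "int",
          "==", "+=", "=", "+", "<", "(", ")", "{", "}", ";"]
KEY = {token: chr(ord("A") + i) for i, token in enumerate(TOKENS)}

WORDS = [t for t in TOKENS if t.isalpha()]
# Longer symbols first, so that "+=" is not read as "+" followed by "=".
SYMBOLS = sorted((t for t in TOKENS if not t.isalpha()), key=len, reverse=True)
TOKEN_RE = re.compile(
    r"\b(?:" + "|".join(WORDS) + r")\b|" + "|".join(re.escape(s) for s in SYMBOLS)
)


def parse(source):
    """Strip ignored text and replace each known token by its key.

    Identifiers and literals match no token and are dropped (assumption).
    """
    for pattern in IGNORED:
        source = pattern.sub(" ", source)
    return "".join(KEY[m.group()] for m in TOKEN_RE.finditer(source))


def parse_all(sources):
    """Parse several files concurrently; sources maps file name to text."""
    with ThreadPoolExecutor() as pool:
        return dict(zip(sources, pool.map(parse, sources.values())))


original = "int total = 0; // sum\nfor (i = 0; i < n; i += 1) { total += v[i]; }"
disguised = "int s=0;\nfor(k=0;k<m;k+=1){s+=arr[k];} /* renamed */"
same_shape = parse(original) == parse(disguised)
```

Under these assumptions, renaming variables, changing comments and changing spacing leave the key string unchanged, so the two fragments above would produce identical fingerprints.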

4. Conclusion

Because plagiarism in computer source code is very frequent, in this paper we described an application for detecting this kind of plagiarism. The tool analyses several projects at the same time and, using an efficient algorithm, decides whether the code is unique or not. The application was designed to help teachers at Transilvania University of Braşov to effectively detect, and thereby prevent, plagiarism between students. Source Code Plagiarism Detector has proved to be effective and of great help to teachers. Of course, the lab tutor has an important role in formulating the exercises in such a way that it is difficult for a student to find the solution on the Internet. The possibility of copying from another student remains, and this application is intended to prevent this kind of plagiarism. By reducing the impact of plagiarism, this program plays an important role in the education of the new generation, not only during their studies but also in their future life.


References

Arwin C., Tahaghoghi S. M. M., 2006, Plagiarism Detection across Programming Languages, In: Proceedings of the 29th Australasian Computer Science Conference, vol. 48, pp. 277–286.
Baker B. S., 1995, On finding duplication and near duplication in large software systems, In: Proceedings of the Second Working Conference on Reverse Engineering, Toronto, Canada, pp. 86–95.
Chen X., Francia B., Li M., Mckinnon B., Seker A., 2004, Shared information and program plagiarism detection, IEEE Transactions on Information Theory, 50 (7), pp. 1545–1551.
Clough P., 2000, Plagiarism in natural and programming languages: An overview of current tools and technologies, Technical report CS-00-05, Department of Computer Science, University of Sheffield, United Kingdom.
Flores E., Barrón-Cedeño A., Rosso P., Moreno L., 2011, Towards the Detection of Cross-Language Source Code Reuse, LNCS, 6716, pp. 250–253.
Gitchell D., Tran N., 1998, A utility for detecting similarity in computer programs, In: Proceedings of the 30th ACM Special Interest Group on Computer Science Education Technical Symposium, New Orleans, LA, USA, pp. 266–270.
Kamiya T., Kusumoto S., Inoue K., 2002, CCFinder: a multilinguistic token-based code clone detection system for large scale source code, IEEE Transactions on Software Engineering, 28 (7), pp. 654–670.
Li Z., Lu S., Myagmar S., Zhou Y., 2004, CP-Miner: A tool for finding copy-paste and related bugs in operating system code, In: Proceedings of the Sixth Symposium on Operating System Design and Implementation, pp. 289–302.
Marinescu D., Băicoianu A., Dimitriu S., 2012, Software for Plagiarism Detection in Computer Source Code, In: Proc. of the 7th International Conference on Virtual Learning ICVL 2012, pp. 373–379.
Prechelt L., Malpohl G., Philippsen M., 2002, Finding plagiarisms among a set of programs with JPlag, Journal of Universal Computer Science, 8 (11), pp. 1016–1038.
Qu W., Jia Y., Jiang M., 2010, Pattern mining of cloned codes in software systems, Information Sciences, http://dx.doi.org/10.1016/j.ins.2010.04.022
Schleimer S., Wilkerson D., Aiken A., 2003, Winnowing: local algorithms for document fingerprinting, In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76–85, http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
Wise M.J., 1996, YAP3: improved detection of similarities in computer program and other texts, In: Proceedings of the 27th SIGCSE Technical Symposium on Computer Science Education, Philadelphia, PA, USA, pp. 130–134.

A Brief Author Biography

Daniela Marinescu – Professor, PhD, in the Department of Mathematics and Computer Science of Transilvania University of Braşov, Romania, and Vice Dean of the Faculty of Mathematics and Computer Science of the same university. She graduated in computing machines from the Faculty of Mathematics, Babeş-Bolyai University, Cluj-Napoca, Romania, and obtained her PhD at the same university with the thesis "Contributions regarding some optimisation problems in formal languages". Research interests: formal languages and their applications in cutting-stock/bin packing problems (two- or three-dimensional), compilation techniques, optimisation and mathematical modelling. Publishing activity: 5 books in the fields of theoretical computer science, compilation techniques and engineering modelling, 6 teaching books and more than 70 research papers.

Alexandra Băicoianu – PhD student in Computer Science at the Faculty of Mathematics and Computer Science, Babes Bolyai University, Cluj-Napoca. She received her M.S. degree in Algorithms and Software Products (2007) from the Faculty of Mathematics and Computer Science, Transilvania University of Brasov. She is currently also an assistant professor at Transilvania University of Brasov, where she teaches courses in the Department of Mathematics and Computer Science on Object-Oriented Programming, Logic Programming, Automata Theory and Formal Languages, and Artificial Intelligence. She has also published papers in the research areas of data mining, regular expressions and mathematical modelling.

Sebastian Dimitriu – Graduate of the Faculty of Mathematics and Informatics of Transilvania University of Brasov. At present he works at iQuest Technologies as a Java Developer in the Telecom Division, in parallel with his studies for a master's degree. He is interested in Java-based solutions in the telecom industry.

Copyright for articles published in this journal is retained by the authors, with first publication rights granted to the journal. By the appearance in this open access journal, articles are free to use with the required attribution. Users must contact the corresponding authors for any potential use of the article or its content that affects the authors’ copyright.