Development of a Plagiarism Detection Software

Bachelor’s Thesis

Development of a Plagiarism Detection Software

Meichelbeck Julien

2012

Abstract of thesis

University: Turku University of Applied Sciences (TUAS), Finland
Degree program: Information Technology (Embedded Systems)
Author: Meichelbeck Julien
Title: Development of a Plagiarism Detection Software
Instructor: Paalassalo Jari-Pekka
Date: June 19, 2012

Total number of pages: 28

This thesis gives a working example of how to design and implement plagiarism detection software in Java using various libraries. The software uses websites as data sources to determine whether a text or a file is plagiarized. The first chapter of the report describes how the software works and which tools were used, with a focus on the different Java libraries. This part also covers the technical requirements. The main part describes how the program was designed (class diagrams, design decisions, etc.). It also analyzes the implementation, beginning with a description of the Levenshtein algorithm. The final part presents the user interface, especially the different windows of the software, and discusses possible improvements.

Keywords: Plagiarism, detection software, parsing, online data harvester.

Deposit at: Library of Turku University of Applied Sciences

Acknowledgments

I would like to express my gratitude to my supervisor, Jari-Pekka Paalassalo, whose expertise and understanding added considerably to my graduate experience this year. I would also like to thank my family for their support and my friends, with whom I had a lot of fun during this year. Finally, I thank the Erasmus program for giving me the opportunity to study abroad in this wonderful country.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

Metz, June 19, 2012 _________________ Meichelbeck Julien


Foreword

Revision history:

18.06.2012  Final and last modifications
17.06.2012  Styles changed and two pages schema added
10.06.2012  Conclusion
03.06.2012  Project requirements, implementation and class diagrams
02.06.2012  Project testing, abstract
30.05.2012  Introduction, Technology description, Tools
20.05.2012  Project design
15.05.2012  Styles, presentation, and variables created
13.05.2012  First version of this document

Printing and copying (hard copies or pdf-files) of this document is allowed.


Contents

ABSTRACT OF THESIS
ACKNOWLEDGMENTS
DECLARATION
FOREWORD
CONTENTS
INTRODUCTION
1. TECHNOLOGY DESCRIPTION
2. TOOLS
    2.1 Java
    2.2 Java libraries
    2.3 Google Search
3. PROJECT REQUIREMENTS
4. PROJECT DESIGN
    4.2 Plagiarism detection class diagram
    4.3 User Interface class diagram
    4.4 Configuration file and constants
5. IMPLEMENTATION
    5.1 Damerau - Levenshtein Algorithm
    5.2 Substring matching
    5.3 Parsers
6. PROJECT TESTING
    6.1 Main Window
    6.2 Online plagiarism check
    6.3 File to file comparison
    6.4 Possible enhancements
CONCLUSION
TABLE OF FIGURES
BIBLIOGRAPHY


Introduction

Nowadays, plagiarism is a serious issue in professional environments and in the education system. Since the Internet is accessible to everyone, it is easy to use it as a source of information. However, copying documents from the Internet can be considered plagiarism: what can be found on the Internet may come from a book, a research document or an article. It can even lead to legal problems, such as copyright infringement. Although many of the laws and concepts are not new, the concept of intellectual property is relatively recent, dating from the 19th century, and it has been reassessed in recent years with the creation of new online data sources such as Wikipedia and the development of advanced search engines such as Google. Since then, a new kind of software has emerged: plagiarism detection software. There are several types; they can rely on databases (of theses, books, or articles), on the Internet, or on comparisons between files. This project focuses on the development of an application that mainly extracts data from the Internet to check for plagiarism and also supports file-to-file comparison. Because the project is a typical software development work, the thesis focuses on implementing this software; it is therefore mainly a technical report.


1. Technology description

This part describes how the software works without going into details. Two features are available: importing a single file for an online plagiarism check, or importing two files for a comparative check. The online plagiarism check is the biggest part of the project, because the comparative check reuses the classes created for it. For the online check, the user imports a file (doc, txt or pdf) to check for plagiarism and can then configure the plagiarism detection through the configuration window. The file analysis can then start; it proceeds in several steps.

STEP 1

The text is extracted from the file, ignoring pictures, diagrams and other figures, and stored in a variable.

STEP 2

The text is divided into many groups of words. By default, each group corresponds to a sentence.

STEP 3

The software submits each group of words to a search engine.

STEP 4

The page containing the search engine results is loaded, and then the page behind one of the results is loaded.

STEP 5

When the website has been loaded, the page is parsed, to extract the text from the HTML code. All the text contained on this page is isolated.

STEP 6

The software searches for the sentence in the extracted text.

STEP 7

If a similar sentence has been found, the source is added to the source list and the analysis moves on to the next sentence (back to STEP 3).

STEP 8

If the sentence has not been found, another website is loaded (back to STEP 4) until χ results have been analyzed, where χ is a number defined by the user (default value: 3).
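The control flow of steps 2 to 8 can be sketched as follows. This is only an illustrative outline, not the implementation developed in this thesis: the class name `PlagiarismCheckSketch`, the method names and the two function parameters `search` and `fetchText` are hypothetical stand-ins for the search engine query and the page parser, the sentence comparison is reduced to an exact `contains` test instead of an approximate match, and `java.util.function.Function` assumes Java 8 or later.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical outline of the analysis loop; all names are illustrative.
final class PlagiarismCheckSketch {

    // STEP 2: split the text into groups of words (here: sentences).
    static List<String> splitIntoSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // STEPS 3 to 8: for each sentence, inspect at most maxResults pages.
    // search    (assumption): maps a query to an ordered list of result URLs.
    // fetchText (assumption): maps a URL to the plain text of that page.
    // Returns a map from each sentence that was found to the URL of its source.
    static Map<String, String> findSources(String text, int maxResults,
            Function<String, List<String>> search,
            Function<String, String> fetchText) {
        Map<String, String> sources = new LinkedHashMap<>();
        for (String sentence : splitIntoSentences(text)) {
            List<String> urls = search.apply(sentence);          // STEP 3
            int limit = Math.min(maxResults, urls.size());
            for (int i = 0; i < limit; i++) {                    // STEP 4, repeated by STEP 8
                String pageText = fetchText.apply(urls.get(i));  // STEP 5
                if (pageText.contains(sentence)) {               // STEP 6 (exact match only)
                    sources.put(sentence, urls.get(i));          // STEP 7
                    break;                                       // next sentence
                }
            }
        }
        return sources;
    }
}
```

Passing the search and page-loading logic as parameters keeps the loop independent of any particular search engine or HTML parser, so it can be tried with in-memory data before real network access is added.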

The schema below illustrates these steps.


[Figure 1: flowchart of the plagiarism check]
File to analyze
STEP 1: Text extraction
STEP 2: Divided into groups of words
STEP 3: Each group is searched on a search engine.
STEP 4: The first χ results are loaded by the program (χ = number defined by the user).
STEP 5: The page is parsed, and the code is cleaned to isolate the text from the HTML document.
STEP 6: Search for the group of words in the page using the Damerau-Levenshtein algorithm.
STEP 7: The group of words comes from this source. The source is added to the sources list and the group of words is declared plagiarized. Return to STEP 3 to analyze the next group of words.
STEP 8: Return to STEP 4 to check another page.

Figure 1 – Schema showing the steps for a plagiarism check
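STEP 6 relies on the Damerau-Levenshtein algorithm, which counts the insertions, deletions, substitutions and transpositions of adjacent characters needed to turn one string into another. The sketch below shows the common "optimal string alignment" variant of that distance. It is a generic textbook version given for illustration and is not necessarily the implementation used in this project; the class name `EditDistanceSketch` and the `similarity` helper are hypothetical.

```java
// Generic sketch of the optimal string alignment variant of the
// Damerau-Levenshtein distance; names are illustrative.
final class EditDistanceSketch {

    // Minimum number of insertions, deletions, substitutions and
    // transpositions of two adjacent characters needed to turn a into b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters of a
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1,   // deletion
                                 d[i][j - 1] + 1),  // insertion
                        d[i - 1][j - 1] + cost);    // substitution or match
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1); // transposition
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    // Hypothetical helper: normalizes the distance to a score in [0, 1],
    // where 1 means the two strings are identical.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }
}
```

A normalized score of this kind makes it possible to accept a sentence as "similar" above a chosen threshold, which tolerates small edits such as a swapped pair of letters or a changed word ending.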


2. Tools

2.1 Java

This software was programmed in Java. Java is a programming language and computing platform first released by Sun Microsystems in 1995. It is the underlying technology that powers state-of-the-art programs including utilities, games, and business applications. Java runs on more than 850 million personal computers and on billions of devices worldwide, including mobile and TV devices (Oracle Technology Network, 2010). The version chosen for this project is JRE System 1.7, which contains important enhancements to the performance, stability and security of Java applications. Java is known for its large number of libraries: Sun provides many frameworks and APIs that support a wide variety of uses. This is why Java was probably the most suitable language for the implementation of this project.

2.2 Java libraries

Jsoup 1.6.2

Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods. Jsoup implements the WHATWG HTML5 specification (Web Hypertext Application Technology Working Group, http://www.whatwg.org/) and parses HTML to the same DOM as modern browsers do. Its main features are:

- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list, to prevent XSS attacks
- output tidy HTML

Jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating to invalid tag-soup, jsoup will create a sensible parse tree.


Jsoup example

Code: Java

// Imports needed by this snippet
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Connect to the website and fetch its source code
// USER_AGENT = Chrome/14.0.835.186 (example)
// REFERRER = previous page
Document gDoc = Jsoup.connect(URL).referrer(REFERRER).userAgent(USER_AGENT).get();
// Select the DOM path that leads to the elements to extract
Elements titles = gDoc.select("h3.r > a");
// Display the text of the first result
System.out.println(titles.get(0).text());

Apache Poi 3.8

The Apache POI library is used to manipulate various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). In short, it allows reading and writing MS Excel, MS Word, or MS PowerPoint files using Java.

Apache Poi example

Code: Java

// Creation of the variable which will contain the text
String parsedText = new String();
POIFSFileSystem fs = null;
try {
    fs = new POIFSFileSystem(new FileInputStream(PATH));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String[] paragraphs = we.getParagraphText();
    for( int i=0; i [...]
