Intrinsic Plagiarism Detection in Digital Data

International Journal of Innovative and Emerging Research in Engineering Volume 2, Issue 3, 2015 Available online at www.ijiere.com International Jou...

Author: Shauna Caldwell


International Journal of Innovative and Emerging Research in Engineering Volume 2, Issue 3, 2015 Available online at www.ijiere.com

International Journal of Innovative and Emerging Research in Engineering e-ISSN: 2394 – 3343

p-ISSN: 2394 – 5494

Intrinsic Plagiarism Detection in Digital Data

Netra Charya, Kushagra Doshi, Smit Bawkar and Dr. Radha Shankarmani
Information Technology Dept., Sardar Patel Institute of Technology, Andheri (W), Mumbai, India
HOD, Information Technology Dept., Sardar Patel Institute of Technology, Andheri (W), Mumbai, India

ABSTRACT: The steady growth of research is leading to the publication of many research papers and articles on the World Wide Web, and the chances of content being repeated are high. This leads to plagiarism in research papers, which undermines the authenticity of the achievements in the research field concerned. Much progress has been made in creating tools that detect content plagiarized from web sources. We present novel software that can identify plagiarized sections in digital data taken from sources that are unavailable on the internet. The central idea behind this software is the analysis of the grammar usage and sentence constructions of the author. A grammar tree is formed for every sentence in the submitted text, pq-gram distances are computed between pairs of grammar trees, and mathematical calculations on these distances determine how far the sentences deviate from each other. In this way the possibly plagiarized sentences in the digital data are determined. For a thorough examination of the authenticity of digital data on the World Wide Web, the proposed system can be used as a complement to the available online tools.

Keywords: Intrinsic plagiarism detection, grammar analysis, pq-gram distance, matrix generation

I. INTRODUCTION

As research in various fields is increasing many-fold, a large number of research papers and articles are being published online in journals and presented at conferences. Many aspects of a research topic therefore overlap, and the data included in newer papers often appears similar to the data already present in previously published papers.
The source of information may not always be available digitally. This overlapping of data leads to plagiarism in digital data. It is essential to detect as well as prevent the ever increasing cases of plagiarism in the digital data published in research papers and articles, in order to maintain the integrity of research content.

This paper is divided into four major parts. The first part introduces the definition of plagiarism, gives a detailed insight into the available tools, and explains how the proposed system differs from them. The second part classifies the types of plagiarism and narrows down to the types the proposed system aims to handle. The third part classifies the available plagiarism detection methods, explains how these methods work, and narrows down to the method used in the proposed system. The fourth part describes the proposed system and its suggested utility for prospective users.

II. THE ISSUE - PLAGIARISM

A. Definition

By definition, plagiarism means "an act or instance of using or closely imitating the language and thoughts of another author without authorization and the presentation of that author's work as one's own, without crediting the original author." [1] According to the Merriam-Webster online dictionary, to "plagiarize" means "to steal and pass off the ideas or words of another person's as one's own, to use another person's production without crediting the source, to commit literary theft or to present an idea as new and original derived from an existing source." [2][5]

Under Indian law, the expression of an idea in any tangible form falls within the scope of copyright law. [3] Indian copyright law does not recognize property rights in abstract ideas.
Until one expresses an idea in a tangible form, one has no protection for it. When an idea is expressed in a tangible form it automatically becomes a subject of common law property rights, which are protected by the courts at least when it can be said to be novel and new (for clarity, see [4]). Stealing ideas that were originally created by others is unethical, and there is thus an urgent need to detect and prevent plagiarism as far as possible. Plagiarism prevents one from establishing one's own ideas on a large platform in one's field of interest.

B. Background of available tools

In general, there are two methods of detecting plagiarism in research papers: external and intrinsic plagiarism detection. The online tools and offline software currently available all detect plagiarism in textual documents using the external detection method only. The following are some of the popular tools available for checking plagiarism in text documents:

1. PaperRater (http://www.paperrater.com/): PaperRater is a free online tool developed and maintained by linguistics professionals and graduate students. PaperRater splits the supplied text into smaller sections, which are then compared against more than 20 billion pages found in the books, journals, research articles, and web pages indexed by search engines such as Google, Yahoo, and Bing. An originality score is calculated within seconds. A low originality score indicates that all or a portion of the submitted document can be found in some of the above sources. Documents that may match the input text are displayed so that the user can verify whether potential matches represent plagiarism or a false positive. A high originality score indicates that none, or almost none, of the submitted document was found on the Internet via search. It should also be noted that quotations are not removed before checking for plagiarism, which may skew the originality score.

2. Turnitin (http://turnitin.com/): Turnitin is an annually licensed product. Turnitin checks submitted papers against 24+ billion web pages, 300+ million student papers and 110,000+ publications. Turnitin accepts submissions for originality checking via file upload or cut-and-paste. The software is regularly updated with new content acquired through new partnerships and also includes archived Internet data that is no longer available online. For instance, Turnitin partner CrossRef has 500-plus members, including publishers such as Elsevier and the IEEE, and has already added hundreds of millions of pages of new content to the database.

3. PlagTracker (http://www.plagtracker.com/): PlagTracker is a user-friendly, popular, freely available online plagiarism detection tool for students, teachers, publishers and website owners. It does not need to be downloaded. One can simply upload the research paper or cut and paste parts of the paper and obtain a complete report. Papers are thoroughly checked for originality once uploaded. PlagTracker uses its own checking algorithm to scan content for plagiarism. It is fast and easy to use. The system finds all the content that may have been plagiarized, along with a list of all the sources, to make correction easier. Different types of dashboards and utility options are provided for different users.

4. EVE2 (http://www.canexus.com/): EVE2 is a powerful paid tool that requires registered users to download the software for their operating system. It does not offer online features. It allows professors at all levels of the education system to determine whether students have plagiarized material from the World Wide Web. EVE2 accepts essays in plain text, Microsoft Word, or Corel Word Perfect format and returns links to web pages from which a student may have copied content. EVE2 is claimed not to produce false positives. EVE2 performs a large number of complex searches to find material from any Internet site. It employs advanced searching tools to locate suspected sites and performs direct comparisons to identify instances of plagiarism. If it finds evidence of plagiarism, the URL is recorded. Once the search has completed, a full report is provided for each paper that contained plagiarism, including the percentage of the essay plagiarized and an annotated copy of the paper showing all plagiarism highlighted in red.

5. Viper (http://www.scanmyessay.com/): Viper is a fast plagiarism detection tool that can scan a document against more than 10 billion resources, such as academic essays and other online sources, as well as the essays and documents available on the user's system. It offers side-by-side comparisons for plagiarism. It is free of cost and can be downloaded very easily. It places no limit on document length or on the number of submissions. It cannot be used online. It is available for download for Microsoft Windows users only.

A tabular comparison of the above tools makes their features clear and also shows that all of them lack one common feature:

Table 1. A comparison of various available plagiarism detection tools

Feature | PaperRater | Turnitin | PlagTracker | EVE2 | Viper
Reference corpora? | Yes | Yes | Yes | Yes | Yes
Online tool? | Yes | Yes | Yes | No | No
Free tool? | Yes | No | Yes | No | Yes
Multilingual? | | | | |
Performs online database checking? | Yes | Yes | Yes | Yes | Yes
Performs offline database checking? | No | No | No | No | Yes
Text acceptance method? | Copy-paste of text required | File upload & copy-paste | File upload & copy-paste | File upload & copy-paste | File upload
Database scalable? | Yes | Yes | Yes | Yes | Yes
File format support? | - | PDF or Word or plain text | PDF or Word or plain text | Corel Word or MS Word or plain text | PDF or MS Word
Requires download? | No | No | No | Yes | Yes
Operating system support? | - | - | - | Open Source | Microsoft Windows only
Report quality? | Good but very brief | Very good with details | Good but very brief | Extremely good with apt details | Extremely good
False positives? | Possible | Possible | Very unlikely | Highly impossible | Very unlikely
Detects plagiarized text from nondigital and nonpublic sources? | No | No | No | No | No

It is thus evident that the external plagiarism detection method relies on reference corpora drawn from various digital sources and databases available on the World Wide Web. The last feature in the above table also shows that none of these tools can detect sentences plagiarized from sources that are not publicly available on the internet. This loophole allows authors to copy content from journals and other sources that are unavailable on the World Wide Web. At present, there are no available tools that detect plagiarism in research papers intrinsically (see section IV).

C. Project Insight

The proposed software uses the intrinsic plagiarism detection method, which means it aims at detecting plagiarized sections in research papers without using reference corpora. It analyses the recognisable and distinguishable grammar and sentence constructions used by the author of the paper. Each sentence is compared with the rest of the sentences in the paper, suspicious sentence constructions are marked as plagiarized, and finally a plagiarism score is provided (see section V of this paper).

III. TYPES OF PLAGIARISM

Plagiarism appears in various forms in research papers or, for that matter, in any official document that requires grading or publication. The two major types of plagiarism encountered specifically in research papers are [6]:

1. Textual Plagiarism
2. Source Code Plagiarism

The scope of this paper is textual plagiarism, and hence the following list explains the sub-types of textual plagiarism [7][8]:

1. Deliberate copy-paste / Clone plagiarism: This refers to taking content directly from a source and putting it in one's own research paper, with or without citing the source appropriately or crediting the original author, and declaring the work to be one's own.

2. Paraphrasing plagiarism: Paraphrasing generally refers to using the idea from one specific source while switching words, changing sentence constructions, altering grammar styles and using synonyms wherever possible so that the work looks like one's own.

3. Metaphor plagiarism: "Metaphors are used either to make an idea clearer or give the reader an analogy that touches the senses or emotions better than a plain description of the object or process. Metaphors, then, are an important part of an author's creative style." [7][9] Thus, if one cannot provide suitable examples or metaphors to make one's idea clear, one picks up the metaphor used by the author of the source document.

4. Idea plagiarism: Idea plagiarism occurs when one copies an innovative idea or a solution provided by another author in a source document instead of providing an idea or solution of one's own. Authors of research papers often have a hard time distinguishing the ideas and solutions provided by the author of the source paper from public domain information. Public domain information is any idea or solution that people in the field accept as general knowledge.

5. Mosaic / Hybrid / Patchwork paraphrasing plagiarism: Patchwork paraphrasing refers to obtaining content from various sources on the same topic and rephrasing the sentences, switching words, using synonyms and altering the grammar styles to produce one's own research paper without citing the sources.

6. Self / Recycled plagiarism: Here the author of the research paper reuses his or her own previous work to produce a new work.

7. 404 Error / Illegitimate Source plagiarism: In this type of textual plagiarism, the author cites sources throughout the paper, but the sources are invalid, improperly cited, missing or simply do not exist.

8. Retweet plagiarism: The author may cite all the sources correctly, but the work relies too closely on the wording, sentence structures and/or grammar usage of the original content.

Although the above sub-types are all forms of textual plagiarism, detecting each of them requires a separate mechanism. The proposed software, i.e. the Intrinsic Plagiarism Detection Software, can detect the Paraphrasing, Idea, Mosaic and 404 Error sub-types of textual plagiarism. The external plagiarism detection method can detect Clone, Metaphor, Retweet and possibly (with low probability or none) Self plagiarism.

IV. CLASSIFICATION OF PLAGIARISM DETECTION METHODS

The various types of plagiarism require different methods and algorithms for detection. A large number of methods are used to detect as much plagiarism in a text document, and as accurately, as possible. It must be noted, however, that none of the methods is 100% accurate. The major classes of plagiarism detection methods are:

1. Manual Detection: Not recommended for large scale use
2. Computer Automated Detection: Much easier and faster to use

The main interest lies in computer automated detection, which is further classified as follows:

1. External Detection Methods: Plagiarism is detected by comparing the contents of the submitted research paper with content that is already published and publicly available in various databases, on the World Wide Web and in digital journals. This requires a reference corpus.

2. Intrinsic Detection Methods: Plagiarism is detected without using any reference corpus. These methods analyse the grammar used by the author, the similarity between words, the topic under consideration, etc.

No definitive classification of the various techniques is available. However, a basic description and an overview of the internal working of each technique make it possible to place each one under a major category.
The following sub-section briefly describes the techniques used in Computer Automated External Detection methods, although the scope of this paper lies in the detection techniques used within Computer Automated Intrinsic Detection methods, which are explained in detail in the sub-section after it. Developing intrinsic plagiarism detection techniques is a relatively difficult task, and getting such software into working condition is quite a challenge.

There are six popular techniques under the Computer Automated External Detection method:

1. Grammar Based Plagiarism Detection: This technique uses a string-based matching approach to detect and measure similarity between the documents in the database under consideration. The grammar-based technique is suitable for detecting clone documents but fails to detect plagiarism in paraphrased documents. [6]

2. Semantics Based Plagiarism Detection: This technique focuses on determining similarities in the use of words between documents stored in the given database using a vector space model. It is also capable of calculating the redundancy count of the words used in the document under review. It does not give accurate results for partially paraphrased documents, as it cannot actually locate the plagiarized section in the submitted research paper. [6]

3. Clustering Based Plagiarism Detection: This method relies largely on the grammar-based technique and proceeds in three steps. The first step, pre-selecting, narrows the scope of detection using successive identical fingerprints. The second step, locating, finds and merges all matching fragments between two documents using a clustering method. The third step, post-processing, deals with merging errors. Two traditional clustering algorithms have been implemented with a document representation based on winnowing fingerprints, by adapting the similarity measures to work with multi-sets and designing a new way of computing centroids. [14]

4. Cross Lingual Plagiarism Detection: This technique is used for detecting suspected documents plagiarized from sources in other languages. The similarity between a suspected and an original document is evaluated using statistical models that estimate the probability that the suspected document is related to the original document, regardless of the order in which the terms appear in the two documents. This approach necessitates the construction of a cross-lingual corpus. [10]

5. Citation Based Plagiarism Detection: Proposed in [11], this technique is used for identifying academic documents that were read and used without being cited. It belongs to the semantic plagiarism detection techniques because it focuses on the semantics contained in the citations used in academic documents. It identifies similar patterns in the citation sequences of academic works and uses them for similarity computation. [10]

6. Character Based Plagiarism Detection: Character based plagiarism detection has two sub-types, namely Fingerprinting and String Matching. In the fingerprinting technique, the pre-processing step creates representative digests of documents by selecting a set of multiple substrings (n-grams) from them. These digests are referred to as fingerprints. The passages of a suspicious document are compared to the reference corpus based on their computed fingerprints. Fingerprints that match those of other documents indicate shared text segments and suggest potential plagiarism if they exceed a certain similarity threshold. Duplicate and near-duplicate passages are assumed to have similar fingerprints [13][14]. In the string matching technique, the documents are compared for exact text overlaps with other documents in the database. This requires computing and storing efficiently comparable representations of all documents in the reference corpus for pairwise comparison, obtained using suffix document models such as suffix trees or suffix vectors. [13]

There are also three popular techniques under the Computer Automated Intrinsic Detection method:

1. Grammar Semantics Hybrid Plagiarism Detection: This technique is based on Natural Language Processing (NLP), which makes it a good choice for intrinsic plagiarism detection. It can detect the Paraphrasing and Mosaic types of plagiarism in research papers. By calculating similarity measures between the words written, it can locate the plagiarized sections in the document. [6]

2. Structure Based Plagiarism Detection: This technique focuses on structural features of the text in the document, such as headers, sections, paragraphs, and references. "Chow and Rahman [12] adopt a tree-structured representation with Multi Layer Self Organizing Maps (ML-SOM) for information retrieval and plagiarism detection. They have built their idea based on two layers - a top layer and a bottom layer whereby the top layer presents the clustering and retrieving of documents while the bottom layer utilizes a Cosine similarity coefficient to capture similar and plagiarized text." [10]

3. Syntax Similarity Based Detection: This technique is currently attracting growing research interest. Syntactical features are manifested in the parts of speech (POS) of phrases and words in different statements. Basic POS tags include verbs, nouns, pronouns, adjectives, adverbs, prepositions, conjunctions and interjections. POS tagging marks the words in the given sentence. Sentence-based representation works by splitting the text into statements using end-of-sentence delimiters such as the period, exclamation mark and question mark.
After splitting the text into sentences, POS and phrase structures can be constructed using POS taggers, typically after tokenization. Another way is to create chunks, the so-called windowing or sliding windows, to characterize units of text larger than phrases or sentences. POS is further used in windowing to generate more expressive POS chunks. The word order in a sentence or a chunk is combined as a feature and used as a comparison scheme between sentences. This technique is important in plagiarism detection, where POS tag features are combined with other string similarity metrics in the analysis and calculation of similarity between sentences. The more POS tags are used, the more reliable the features produced for measuring similarity. [15]

The proposed system narrows down its internal working to Computer Automated Intrinsic Syntax Similarity Based Detection. This methodology aims at detecting the Paraphrasing, Idea, Mosaic and 404 Error types of textual plagiarism that may be observed in the submitted research paper.

V. PROPOSED SYSTEM

A. Motivation

The sole reason for proposing this system is that none of the available plagiarism detection tools can identify sentences within digital data that are plagiarized from sources not publicly available on the internet. Authors can take advantage of this drawback when creating new articles or research papers digitally. This is why a detailed description and comparison of the available plagiarism detection tools has been provided in section II. The proposed system can be paired with one of those tools to thoroughly check the authenticity of digital data in e-articles and e-research papers.

B. An Overview

The Intrinsic Plagiarism Detection in Digital Data (IPDDD) system allows the examiner of research papers or the editor of a digital journal to determine whether there are plagiarized sentences in a submitted research paper, which must be in text format.
IPDDD attempts to detect plagiarized sentences in digital text data without using a reference corpus. It does so through grammar analysis of the sentences written by the author. The software computes the similarity distance between the grammar tree of each sentence and the grammar trees of the other sentences in the document, and then derives mathematical values and a threshold from the computed distances. Sentences whose distances exceed the threshold are identified as suspicious. These sentences are recorded and their total count is stored. Using the count of plagiarized sentences and the total number of sentences in the paper, an authenticity ratio is calculated. If this ratio, expressed as a percentage, is more than a prescribed value, the paper is judged to violate the accepted limits for plagiarism.

IPDDD targets research papers and articles that are about to be published. Authors use recognizable and distinguishable grammar syntax to construct sentences in their papers, even when they copy content from sources that are generally unavailable in digitized format. The IPDDD software parses the sentences written in the digital data source, including the spaces between the words, and forms a grammar tree for each parsed sentence that captures how the author constructed it. Sentences found to be suspicious, by computing the similarity distances between grammar trees and comparing those distances with statistical parameters such as the mean value, are declared potentially plagiarized.
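The similarity distance between grammar trees mentioned in this overview is, according to the abstract, a pq-gram distance. The following is a minimal Python sketch of such a distance, not the IPDDD implementation. It assumes that each grammar tree is given as a nested `(label, children)` tuple, uses p = 2 and q = 3 as illustrative values that the paper does not specify, and uses the commonly cited normalized form 1 - 2|P1 ∩ P2| / (|P1| + |P2|) over bags of pq-grams. The names `pq_profile` and `pq_distance` and the toy trees are hypothetical.

```python
from collections import Counter

def pq_profile(tree, p=2, q=3):
    """Return the bag (multiset) of pq-grams of a tree.

    A tree is a (label, children) tuple; missing nodes are padded with "*".
    """
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        # Stem: the current node preceded by its p-1 nearest ancestors.
        stem = ancestors[1:] + (label,)
        base = ("*",) * q
        if not children:
            # A leaf contributes one pq-gram with an all-"*" base.
            profile[stem + base] += 1
            return
        for child in children:
            # Slide a window of width q over the children.
            base = base[1:] + (child[0],)
            profile[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):
            # Pad on the right so every child appears in q windows.
            base = base[1:] + ("*",)
            profile[stem + base] += 1

    visit(tree, ("*",) * p)
    return profile

def pq_distance(tree_a, tree_b, p=2, q=3):
    """Normalized pq-gram distance in [0, 1]; 0 means identical profiles."""
    a = pq_profile(tree_a, p, q)
    b = pq_profile(tree_b, p, q)
    shared = sum((a & b).values())          # bag intersection
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total

# Two toy grammar trees that differ only in the verb tag (hypothetical data).
t1 = ("S", [("NP", [("DT", []), ("NN", [])]), ("VP", [("VBZ", [])])])
t2 = ("S", [("NP", [("DT", []), ("NN", [])]), ("VP", [("VBD", [])])])

print(pq_distance(t1, t1))  # identical trees give 0.0
print(pq_distance(t1, t2))
```

Because the distance depends only on tree shape and node labels, two sentences with the same grammatical construction but different words would score as close, which is the property the overview relies on.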


C. Major factor in aid: the author's writing style

Intrinsic plagiarism detection in digital data can be made operational only when the writing style of the author can be quantified appropriately. It is necessary to divide the data into natural parts such as words, sentences, paragraphs and sections. Each author develops an individual writing style; he or she employs patterns, consciously or subconsciously, to construct sentences and uses an individual vocabulary. Stylometric features quantify aspects of style, and some of them have been used successfully in the past to discriminate between texts with respect to authorship. The following features contribute to identifying patterns in the author's style of writing digital text data [17]:

1. Text Statistics: Operate at character level, e.g. word length, number of commas, etc.
2. Syntactic Features: Measure writing style at sentence level, e.g. use of keywords, sentence length, etc.
3. Parts-of-Speech Features: Quantify the use of word classes, e.g. use of adjectives, pronouns, etc.
4. Closed-class word sets: Count special words, e.g. use of stop words, foreign words
5. Structural Features: Reflect text organization, e.g. paragraph lengths, etc.

D. Parsing sentences and Pre-Processing

Every sentence in the digital data document is parsed individually. To decide which sentences should undergo the intrinsic plagiarism check, all the sentences must first undergo pre-processing, which handles the exceptional cases in the given data. In our study we recognized the following exceptions that require explicit consideration:

1. Many special characters are used, such as commas, colons, semi-colons, question marks, etc.
2. Data within quotes is always cited and must lie outside the scope of the plagiarism check.
3. Data carrying cross references to its sources cannot be considered plagiarized, since the sources are cited by the author.
4. Acronyms within the data are discarded.

E. Construction of Trees and Matrix Generation

Each sentence of the text document is parsed by its syntax, which results in a set of grammar trees. Each tree is then compared with the trees of all the other sentences: the first sentence with the rest of the sentences in the document, the second with the rest, and so on. Having the grammar trees of all sentences, the distance between each pair of trees is calculated and stored in a triangular distance matrix D. Each entry d(i,j) is the distance between the grammar trees of sentences i and j, where d(i,j) = d(j,i). The distance itself is calculated using the pq-gram distance algorithm. A pq-gram is defined through a stem of length p and a base of width q, which give the number of nodes taken into account vertically (p) and horizontally (q). Missing nodes, e.g. when there are fewer than q horizontal neighbors, are marked with *. [16]

F. Mathematical Analysis to determine plagiarized sentences

Once the matrix containing the pq-gram distances between all possible pairs of trees in the document has been obtained, the possibly plagiarized sentences must be identified. First, the median value of every row in the distance matrix is calculated, and these values are stored in a list. Then the mean of all the values in the median list is computed using the formula below:

$$\mu = \frac{\sum X}{N}$$

Further, the standard deviation is computed by using the following formula:

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

Here "X" represents each element of the median list, "µ" represents the mean of the median values in the list as calculated above, and "N" represents the total number of elements in the median list. A threshold value, t, is computed using the µ and σ values. The threshold helps determine the possibly plagiarized sentences: each individual value in the distance matrix obtained in the previous stage is compared with it. Values greater than t are marked suspicious, and the others pass the plagiarism test. Having compared the distance values with t, we must determine which sentence of a given suspicious pair is actually plagiarized. To achieve this, all the suspicious values are stored in a list and the frequency with which each sentence occurs among them is checked. A sentence that is common to several suspicious pairs is detected as plagiarized.

G. Suggested Utility

Although the proposed system is capable of detecting the trickier types of plagiarism used in digital data when writing a research paper, using this system alone is not enough to check a submitted paper thoroughly for plagiarism. It is therefore strongly suggested to pair the proposed system with one of the available external plagiarism detection tools mentioned in section II.
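The analysis of section V.F, together with the authenticity ratio of section V.B, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the IPDDD implementation. The paper does not give the formula for t, so the sketch assumes t = µ + k·σ with a hypothetical factor `k`. It also assumes that a sentence is flagged when it occurs in at least `min_pairs` suspicious pairs, and that D is supplied as a full symmetric matrix with a zero diagonal. The function name `flag_sentences` and the sample matrix are hypothetical.

```python
import statistics
from collections import Counter

def flag_sentences(D, k=1.0, min_pairs=2):
    """Flag suspicious sentences from a symmetric distance matrix D.

    Returns (threshold, suspicious_pairs, flagged_sentences, ratio_percent).
    """
    n = len(D)
    # Median of every row, ignoring the diagonal (distance to itself).
    medians = [
        statistics.median([D[i][j] for j in range(n) if j != i])
        for i in range(n)
    ]
    mu = statistics.fmean(medians)           # mean of the row medians
    sigma = statistics.pstdev(medians, mu)   # population std. dev. (divides by N)
    t = mu + k * sigma                       # ASSUMED threshold formula

    # Every pair whose distance exceeds the threshold is suspicious.
    suspicious = [
        (i, j) for i in range(n) for j in range(i + 1, n) if D[i][j] > t
    ]
    # Count how often each sentence occurs among the suspicious pairs.
    freq = Counter(s for pair in suspicious for s in pair)
    flagged = sorted(s for s, c in freq.items() if c >= min_pairs)

    ratio = 100.0 * len(flagged) / n         # share of flagged sentences
    return t, suspicious, flagged, ratio

# Hypothetical data: sentences 0-3 resemble each other, sentence 4 differs.
near, far = 0.2, 0.9
D = [[0.0 if i == j else (far if 4 in (i, j) else near) for j in range(5)]
     for i in range(5)]

t, suspicious, flagged, ratio = flag_sentences(D)
print(t, suspicious, flagged, ratio)
```

In this toy matrix only the pairs involving sentence 4 exceed the threshold, so sentence 4 is the one sentence common to several suspicious pairs. Whether a paper fails would then depend on comparing `ratio` with the prescribed value, which the paper leaves unspecified.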

H. Block Diagram of the proposed system
The proposed system considers the following ten steps:

Figure 1. Block Diagram – IPDDD

I. Architecture Diagram of the proposed system
The proposed system uses a layered architecture with four major layers, as follows:

Figure 2. Architecture Diagram – IPDDD

VI. CONCLUSION
The proposed system aims at detecting possibly plagiarized sentences in a research paper or article newly submitted for approval. It detects almost all the plagiarized sections in the paper, not by using a reference corpus but by analyzing the grammar used by the author of the paper. However, the software does not give 100% accurate results: some sentences declared plagiarized by the software may not actually be plagiarized, and vice versa. The development of this system is a continual process.

VII. FUTURE WORK
The proposed system currently detects plagiarism in papers available in .txt format. It can be extended to cover other file formats such as .doc, .docx and .pdf, as well as data published on websites. The mathematical parameters can be further optimized to increase the accuracy of plagiarism detection, which can help reduce the number of false positives in the plagiarism detection report.

REFERENCES
[1] Dictionary definition of plagiarism, http://dictionary.reference.com/browse/plagiarism
[2] http://www.plagiarism.org/
[3] Copyright laws in India, http://www.legalserviceindia.com/article/l195-Copyright-Law-in-India.html
[4] http://blog.inolyst.com/2013/12/04/protect-ideas-through-copyrights/
[5] http://akirchner.hubpages.com/hub/Fair-Use-And-Recipe-Copyright-Tips-And-Options & http://www.copyrightlaw.co.za/plagiarism.html
[6] Asim M. El Tahir Ali, Hussam M. Dahwa Abdulla, and Vaclav Snasel, “Overview and Comparison of Plagiarism Detection Tools”, Department of Computer Science, VSB Technical University of Ostrava, 17. listopadu 15, Ostrava - Poruba, Czech Republic, ceur-ws.org, Vol-706
[7] Barnbaum, C., “Plagiarism: A Student's Guide to Recognizing It and Avoiding It”, Valdosta State University, http://www.valdosta.edu/~cbarnbau/personal/teaching_MISC/plagiarism.htm (Accessed 23 January 2006).
[8] http://blogs.uoregon.edu/casitblog/files/2012/05/plagiarism.png
[9] Liles, Jeffrey A. and Michael E. Rozalski, “It's a Matter of Style: A Style Manual Workshop for Preventing Plagiarism”, College & Undergraduate Libraries, 11 (2), 2004, pp. 91-101.
[10] Ahmed Hamza Osman, Naomie Salim and Albaraa Abuobieda, “Survey of Text Plagiarism Detection”, Computer Engineering and Applications Journal, 2012
[11] B. Gipp and J. Beel, “Citation based plagiarism detection: a new approach to identify plagiarized work language independently”, 2010, pp. 273-274.
[12] T. W. S. Chow and M. K. M. Rahman, “Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection”, vol. 20, pp. 1385-1402, 2009.
[13] http://en.wikipedia.org/wiki/Plagiarism_detection
[14] http://www.ukessays.com/essays/information-technology/a-survey-of-plagiarism-detection-methods-informationtechnology-essay.php
[15] Salha Alzahrani, Naomie Salim, and Ajith Abraham, “Understanding Plagiarism Linguistic Patterns, Textual Features and Detection Methods”, SMIEEE
[16] Michael Tschuggnall, Gunther Specht, “Detecting plagiarism in text documents through Grammar Analysis of authors”, BTW 2013, 15. GI-Fachtagung Datenbanksysteme für Business, Technologie und Web, 11. März – 15. März 2013, Magdeburg, LNI, pp. 241-259
[17] Sven Meyer zu Eissen, Benno Stein, and Marian Kulig, “Plagiarism Detection without Reference Collections”, Advances in Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, 2007, pp. 359-366
