Extrinsic Plagiarism Detection in Text Combining Vector Space Model and Fuzzy Semantic Similarity Scheme

IRACST – International Journal of Advanced Computing, Engineering and Application (IJACEA), ISSN: 2319-281X, Vol. 2, No. 6, December 2013

Rasia Naseem, Sheena Kurian
Department of Computer Science and Engineering, KMEA Engineering College, Aluva, Kerala, India
[email protected]; [email protected]

Abstract— The proposed work combines the Vector Space Model with a fuzzy similarity measure to detect plagiarism cases in documents. For a given suspicious document, the aim is to identify the set of source documents from which the suspicious document is copied. In the first step, all documents are pre-processed by tokenization, stop word removal, stemming, etc. In the next step, a subset of documents that may possibly be the sources of plagiarism is selected. The Vector Space Model (VSM) is used for this candidate selection: the similarity between a suspicious document and a source document is computed with the cosine similarity measure over document vectors weighted by tf-idf scoring. Thirdly, a sentence-wise in-depth analysis using a fuzzy semantic approach locates the plagiarized parts in the suspicious documents. This step can detect similar, yet not necessarily identical, statements based on the degree of similarity between the words in the statements and the fuzzy set. Adjacent sentences regarded as plagiarism are joined together, and the final plagiarism cases are reported. The similarity index (SI) and overall similarity index (OSI) are used to report the similarity between a source document and a suspicious document. The system combining the vector space model and fuzzy semantic similarity was evaluated in terms of precision, recall and F-measure, and showed improved performance over a pure vector space model for extrinsic plagiarism detection.


Keywords-Plagiarism detection, extrinsic plagiarism, Vector space model, Fuzzy similarity, Cosine similarity, F-measure, similarity index.


I. INTRODUCTION

Plagiarism is the reuse of someone else's prior ideas, processes, results, or words without explicitly acknowledging the original author and source. The problem of plagiarism has grown in the digital era because of the resources available on the World Wide Web. Plagiarism detection in natural languages by statistical or computerized methods began in the 1990s, pioneered by studies of copy detection mechanisms in digital documents [1]. During the last decade, research on automated plagiarism detection in natural languages has actively evolved, taking advantage of recent developments in related fields such as information retrieval (IR), cross-language information retrieval (CLIR), natural language processing, computational linguistics, artificial intelligence, and soft computing [1].

Automatic plagiarism detection has drawn significant attention from the information retrieval (IR) research community owing to its potential commercial applications. There are mainly two types of plagiarism detection systems: external and intrinsic. In external plagiarism detection, a set of source documents is provided, and each suspicious document must be compared against this reference collection. This requires a document model and a proper similarity metric; the detection task amounts to retrieving all documents that contain texts similar, to a degree above a chosen threshold, to texts in the suspicious document. In intrinsic plagiarism detection there is no need to supply any source documents; the suspicious document is analysed on its own, with changes in the author's unique writing style used as an indicator of potential plagiarism. Such analysis is difficult and still needs human judgement to identify plagiarism reliably [2].



The proposed work is an extrinsic monolingual plagiarism detection approach which identifies the set of plagiarised documents for an input query document from a collection of source documents. The approach consists of four stages, namely pre-processing, feature weighting and extraction of candidate documents, verification and detection of plagiarised passages, and post-processing. The documents are pre-processed by tokenising them into words, removing stop words and stemming. In the second step, the tokens are weighted using the tf-idf weighting scheme, so that each source and suspicious document forms a weighted vector, and the similarity between a suspicious document and a source document is computed using the cosine similarity measure [2]. Suspicious documents are then compared sentence-wise with the associated candidate documents. This stage entails the computation of a fuzzy degree of similarity that ranges between two edges: 0 for completely different sentences and 1 for exactly identical sentences. Two sentences are marked as similar (i.e. plagiarised) if they obtain a fuzzy similarity score above a certain threshold [1]. The last step is post-processing, where consecutive plagiarised sentences are joined to form single passages [14].

II. RELATED WORKS

Some of the existing methods for external plagiarism detection are based on string matching [3], the Vector Space Model [4] and fingerprinting [5]. The string matching procedure for plagiarism detection [3] aims to identify the longest pairs of identical text strings, and suffix-based document models are mostly used for the task. The strength of this procedure is its detection accuracy with respect to lexical overlaps. One drawback is the relative difficulty of detecting disguised plagiarism; the method also requires a large computational effort.

Fingerprinting [5] is one of the most widely applied approaches to plagiarism detection. A representative digest of each document is formed by selecting a set of substrings (n-grams) from it; these sets represent the fingerprints and their elements are called minutiae (a minimal illustrative sketch is given at the end of this section). For a given suspicious document, its fingerprint is first computed and the minutiae are compared with a pre-computed index of fingerprints for all documents of a reference collection. The inherent challenge of fingerprinting is finding a trade-off between document dimension and detection accuracy: reducing the dimension of the document representation causes information loss, which in turn affects system performance. The method requires tuning a number of parameters such as the chunking strategy, chunk size (granularity of the fingerprint) and number of minutiae (resolution of the fingerprint) [5]. The best parameter combination depends strongly on the nature and size of the document collection as well as on the amount and forms of plagiarism.

Citation-based plagiarism detection is a computer-assisted approach that can be used in academic settings. It does not rely on the text of the documents but on the references given in a research paper, and it identifies similar patterns in the citation sequences of two academic works [6]. Subsequences that non-exclusively contain citations shared by both compared documents are represented as citation patterns, identified on the basis of similar order and proximity of citations within the text. To quantify a pattern's degree of similarity, further factors are considered, for example the absolute number or relative fraction of shared citations in the pattern, as well as the probability that the citations co-occur in a document.

Stylometry [6] applies statistical methods to determine an author's unique writing style. It is mainly used for authorship attribution and intrinsic plagiarism detection.
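For illustration, a minimal fingerprinting scheme of the kind described above might be sketched as follows; the chunk size, hashing choice and resolution are arbitrary example parameters and are not taken from the cited works.

```python
# Illustrative n-gram fingerprinting: hash every word n-gram of a document and
# keep a fixed number of minutiae as its fingerprint. Parameter values are
# arbitrary examples, not taken from the cited works.
import hashlib

def fingerprint(tokens, n=4, resolution=50):
    """Return up to `resolution` minutiae (the smallest n-gram hash values)."""
    hashes = set()
    for i in range(len(tokens) - n + 1):
        chunk = " ".join(tokens[i:i + n])
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        hashes.add(int(digest[:8], 16))        # 32-bit minutia taken from the digest
    return set(sorted(hashes)[:resolution])

def fingerprint_overlap(fp_a, fp_b):
    """Fraction of shared minutiae; a high value flags a suspicious document pair."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))
```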

III. PROPOSED ALGORITHM

The performance of the vector space model depends greatly on the individual plagiarism incident being analysed and on the parameter configuration. The global similarity assessment of VSMs sometimes hampers overall performance: many documents that are globally similar may not actually contain plagiarized paragraphs or sentences [2]. Many plagiarism detection methods focus on copied text with or without minor modification of words and grammar. In fact, most existing systems fail to detect plagiarism by paraphrasing, by summarizing the text while retaining the same idea, or by stealing the ideas and contributions of others. This is why most current methods do not account for the overlap when a plagiarized text is presented in different words [1].

The vector space model also represents long documents poorly, because they obtain poor similarity values (a small scalar product and a large dimensionality). Search keywords must precisely match document terms, so word substrings may produce a false positive match. Another important drawback is the lack of semantic sensitivity: documents with similar context but different term vocabulary are not associated, resulting in a false negative match. In addition, the order in which the terms appear in a document is lost in the vector space representation, and the model theoretically assumes that terms are statistically independent.

The proposed system tries to overcome the limitations of the vector space model by combining it with a fuzzy semantic similarity measure. Plagiarism can be fuzzier than clear-cut, more complex than trivial copy and paste. Methods for plagiarism detection mostly track verbatim plagiarism, whereas detecting excessive paraphrasing is a difficult task. Many current techniques rely on exactly matched substrings or some kind of textual fingerprinting, but that may not be sufficient, since rephrased and reworded content is treated as different (i.e. not plagiarized). Therefore, this work considers the problem of finding suspected fragments that have the same semantics with the same or different syntax. In this setting, matching fragments of text becomes approximate or vague and can be implemented as a spectrum of values between 1 (exactly matched) and 0 (entirely different).

For a given input query document d and a set of source documents S, the objective is to compare d with each document sx in S. This requires a document model and a proper similarity metric.



The detection task is equivalent to retrieving all documents that contain texts similar, to a degree above a chosen threshold, to texts in the suspicious document. The passages in the retrieved documents that are plagiarized in the query document are then merged to form the output. A four-stage algorithm is proposed by combining the vector space model with a fuzzy semantic similarity measure; the stages are pre-processing, feature weighting and extraction of candidate documents, verification and detection of plagiarised passages, and post-processing.

A. Pre-processing

In the first phase, standard text pre-processing methods such as tokenization, removal of stop words and stemming are applied to the input document d and all the documents in the source collection S. For each document a feature set F of m tokens {x1, x2, …, xm} is retrieved. The parser ignores stop words and other unwanted characters such as scripts in a page; this is referred to as "stop listing" [15]. Normally, the connectives and prepositions of the English language are treated as stop words. The tokens remaining after stop word removal are then stemmed, that is, each word is stripped of its derivational and inflectional suffixes [10]. For example, the words "correspondent", "corresponding", "corresponded" and "corresponds" are all reduced to the root word "correspond". Stemming is done using Porter's stemming algorithm, which consists of five phases of word reductions applied sequentially. Within each phase there are various conventions for selecting rules, such as choosing, from each rule group, the rule that applies to the longest suffix [10].
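As an illustration of this stage, the following sketch uses the NLTK library (its tokenizer, English stop word list and Porter stemmer); it is an assumed implementation, not the authors' code.

```python
# Minimal pre-processing sketch (assumes NLTK with the 'punkt' and 'stopwords'
# resources downloaded). Illustrative only, not the authors' implementation.
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Tokenize, drop stop words and punctuation, and Porter-stem the rest."""
    features = []
    for token in word_tokenize(text.lower()):
        if token in STOP_WORDS or all(ch in string.punctuation for ch in token):
            continue          # "stop listing": connectives, prepositions, symbols
        features.append(STEMMER.stem(token))
    return features

# "corresponding" and "corresponded" both reduce to the stem "correspond"
tokens = preprocess("The corresponding author corresponded with the editor.")
```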

B. Feature weighting and extraction of candidate documents

Each term of the vector is weighted with the term frequency-inverse document frequency, more widely known as tf-idf, so that each source and suspicious document forms a weighted vector; the similarity between a suspicious document and a source document is then computed using the cosine similarity measure. The weight tf-idf is calculated for each term $t_i$ in a document $d_j$ as [2]:

$$tf\text{-}idf_{i,j} = tf_{i,j} \cdot idf_i \qquad (1)$$

The term frequency of $t_i$ in document $d_j$ is [2]:

$$tf_{i,j} = n_{i,j} \qquad (2)$$

where $n_{i,j}$ is the number of occurrences of the term $t_i$ in document $d_j$. The inverse document frequency of term $t_i$ is [2]:

$$idf_i = \frac{1}{|\{d : t_i \in d\}|} \qquad (3)$$

where $|\{d : t_i \in d\}|$ is the number of documents in the collection of source documents that contain the term $t_i$. The system uses the cosine similarity measure to retrieve the set of candidate documents which could be the sources of plagiarism in the query document.

$\vec{V}(d)$ denotes the vector derived from document d, with one component for each dictionary term. The set of documents in a collection may then be viewed as a set of vectors in a vector space with one axis per term. The standard way of quantifying the similarity between two documents $d_1$ and $d_2$ is to compute the cosine similarity of their vector representations $\vec{V}(d_1)$ and $\vec{V}(d_2)$:

$$sim(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|} \qquad (4)$$

where the numerator is the dot product of the vectors $\vec{V}(d_1)$ and $\vec{V}(d_2)$ and the denominator is the product of their Euclidean lengths. The cosine similarity of two documents ranges from 0 to 1. All documents whose cosine similarity with the query document is above a threshold are regarded as candidate documents after this phase.
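The candidate selection described by equations (1)-(4) can be sketched as follows. The function names and the candidate threshold of 0.3 are illustrative assumptions, since the paper does not state the threshold value it uses.

```python
# Candidate selection sketch following equations (1)-(4). Threshold and helper
# names are illustrative assumptions, not values taken from the paper.
import math
from collections import Counter

def tfidf_vectors(docs_tokens):
    """docs_tokens: one pre-processed token list per document.
    Returns a tf-idf weighted vector (term -> weight) per document."""
    df = Counter()                                   # document frequency per term
    for tokens in docs_tokens:
        df.update(set(tokens))
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)                         # eq. (2): tf_ij = n_ij
        vectors.append({t: tf[t] / df[t] for t in tf})   # eqs. (1) and (3)
    return vectors

def cosine(v1, v2):
    """Eq. (4): dot product over the product of Euclidean lengths."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def select_candidates(query_vec, source_vecs, threshold=0.3):
    """Indices of source documents whose cosine similarity exceeds the threshold."""
    return [i for i, vec in enumerate(source_vecs)
            if cosine(query_vec, vec) > threshold]
```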

C. Verification and detection of plagiarized passages

At this stage, a sentence-wise detailed analysis between the suspicious document dq and each of its candidate documents dx ∈ Dx is performed. First, dq and dx are segmented into sentences Sq and Sx respectively using end-of-sentence delimiters. To obtain the degree of similarity between two sentences (Sq, Sx), a term-to-sentence correlation factor is computed for each term $w_q$ in Sq and the sentence Sx [1][14]:

$$\mu_{q,x} = 1 - \prod_{w_k \in S_x} (1 - F_{q,k}) \qquad (5)$$

where $w_k$ are the words in Sx and $F_{q,k}$ is the term-to-term correlation factor between $w_q$ and $w_k$, defined as [1][14]:

$$F_{q,k} = \begin{cases} 1 & \text{if } w_k = w_q \\ 0.5 & \text{if } w_k \in \mathrm{synset}(w_q) \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

The synset of $w_q$ is extracted by querying the WordNet lexical database [12]. Based on the $\mu$-value of each word in a sentence $S_i$, computed against a sentence $S_j$, the degree of similarity of $S_i$ with respect to $S_j$ is defined as [14]:

$$sim(S_i, S_j) = \frac{\mu_{1,j} + \mu_{2,j} + \mu_{3,j} + \dots + \mu_{n,j}}{n} \qquad (7)$$

where n is the total number of words in $S_i$, so that $sim(S_i, S_j)$ is a normalized value. Likewise, $sim(S_j, S_i)$, the degree of similarity of $S_j$ with respect to $S_i$, is defined accordingly. Using these definitions, two sentences $S_i$ and $S_j$ are treated as the same, i.e. equal (EQ), as follows [14]:



$$EQ(S_i, S_j) = \begin{cases} 1 & \text{if } \min\big(sim(S_i, S_j),\, sim(S_j, S_i)\big) \ge p \ \wedge\ \big|sim(S_i, S_j) - sim(S_j, S_i)\big| \le v \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

where p, the permission threshold, was set to 0.65, and v, the variation threshold, was set to 0.15 [1]. The permission threshold is the minimal similarity between any two sentences $S_i$ and $S_j$ and is used in part to determine whether $S_i$ and $S_j$ should be treated as equal (EQ). The variation threshold, on the other hand, is used to decrease false positives (statements that are treated as equal but are not) and false negatives (statements that are equal but treated as different).
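A possible implementation of equations (5)-(8), assuming NLTK's WordNet interface and sentences given as pre-processed token lists, is sketched below; it illustrates the scheme rather than reproducing the authors' code.

```python
# Fuzzy semantic sentence comparison following equations (5)-(8). Assumes NLTK
# with the WordNet corpus installed; sentences are pre-processed token lists.
# p = 0.65 and v = 0.15 are the threshold values given in the text.
from nltk.corpus import wordnet as wn

def synonyms(word):
    """Lemma names sharing a WordNet synset with the given word."""
    return {lemma.lower() for syn in wn.synsets(word) for lemma in syn.lemma_names()}

def term_correlation(wq, wk):
    """Eq. (6): 1 for identical words, 0.5 for synonyms, 0 otherwise."""
    if wq == wk:
        return 1.0
    return 0.5 if wk in synonyms(wq) else 0.0

def mu(wq, sentence_x):
    """Eq. (5): term-to-sentence correlation of word wq with sentence Sx."""
    prod = 1.0
    for wk in sentence_x:
        prod *= 1.0 - term_correlation(wq, wk)
    return 1.0 - prod

def sentence_sim(si, sj):
    """Eq. (7): mean mu-value of the words of Si computed against Sj."""
    return sum(mu(w, sj) for w in si) / len(si) if si else 0.0

def equal_sentences(si, sj, p=0.65, v=0.15):
    """Eq. (8): Si and Sj are marked as plagiarised when both directional
    similarities reach the permission threshold p and differ by at most v."""
    s_ij, s_ji = sentence_sim(si, sj), sentence_sim(sj, si)
    return min(s_ij, s_ji) >= p and abs(s_ij - s_ji) <= v
```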

Finally, the output of this algorithm is a list of sentences from the source document marked as similar/plagiarized. Because sentences are used as the comparison unit, post-processing is required to merge subsequent sentences marked as plagiarized into passages. Candidate documents whose plagiarized passages are shorter than a threshold are discarded, as these are likely to be false positives. At the end of this stage, two parameters are calculated at document level to indicate the extent of plagiarism: the similarity index and the overall similarity index [16]. The similarity index (SI) of a query document q with respect to a source document d is the length of all plagiarism cases found in q that are taken from the original document d:

$$SI(q, d) = \frac{|\{p : p \in q \wedge p \in d\}|}{|q|} \qquad (9)$$

The overall similarity index (OSI) of q is the percentage of all plagiarism cases found in q:

$$OSI(q) = \frac{|\{p : p \in q\}|}{|q|} \qquad (10)$$

where |q| is the total length (in terms) of q and p is the length of a plagiarism case in q plagiarized from d.
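The post-processing and reporting step might be sketched as follows; the minimum passage length is an assumed parameter, and the SI/OSI helpers simply apply equations (9) and (10) to term counts computed elsewhere.

```python
# Post-processing and reporting sketch. The minimum passage length of two
# sentences is an assumed parameter; the text only states that very short
# detections are discarded as likely false positives.

def merge_passages(flags, min_len=2):
    """flags[i] is True when sentence i of the query is marked plagiarised.
    Runs of consecutive flagged sentences are merged into (start, end) passages."""
    passages, start = [], None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_len:
                passages.append((start, i - 1))
            start = None
    if start is not None and len(flags) - start >= min_len:
        passages.append((start, len(flags) - 1))
    return passages

def similarity_index(plagiarised_terms_from_d, total_terms_in_q):
    """Eq. (9): share of q (in terms) plagiarised from a single source d."""
    return plagiarised_terms_from_d / total_terms_in_q

def overall_similarity_index(plagiarised_terms_total, total_terms_in_q):
    """Eq. (10): share of q (in terms) covered by any plagiarism case."""
    return plagiarised_terms_total / total_terms_in_q
```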





IV. EXPERIMENTAL ANALYSIS

A. Data Set

The dataset used for the experiments is the Corpus of Plagiarized Short Answers by Paul Clough and Mark Stevenson (Clough09) [13], which contains a diverse set of artificial and simulated plagiarism cases. The corpus consists of answers to five short-answer questions, named A, B, C, D and E, related to different areas of computer science. Each question has 19 answers in the corpus. Some of these are non-plagiarized, while others are plagiarized through cut-and-paste, paraphrasing, word shuffling, etc.

Fig. 1 shows the framework of the proposed system.

Fig. 1: Framework of the proposed system

B. Evaluation metrics

The system is evaluated in terms of precision, recall and F-measure. Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search:

$$precision = \frac{\text{number of correct results}}{\text{number of all returned results}} \qquad (11)$$

$$recall = \frac{\text{number of correct results}}{\text{total number of actual results}} \qquad (12)$$

The F-measure combines precision and recall as their harmonic mean:

$$F\text{-}measure = 2 \times \frac{precision \times recall}{precision + recall} \qquad (13)$$

C. Results and discussions

We performed experiments with the benchmark setup of the Clough09 corpus [13]. Several experiments were conducted with different query documents, and the precision, recall and F-measure of the output were calculated. Table I shows the experimental results for the different query documents given as input to the system. From the table, it is observed that the average precision is much higher for the fuzzy semantic similarity measure than for the vector space model, indicating that the results of fuzzy semantic similarity contain a larger proportion of true cases than those returned by the vector space model. This large improvement in precision compensates for the very small drop in the recall value.



TABLE I: EXPERIMENTAL ANALYSIS RESULTS

| Query   | VSM Precision | VSM Recall | VSM F-measure | Fuzzy Precision | Fuzzy Recall | Fuzzy F-measure |
|---------|---------------|------------|---------------|-----------------|--------------|-----------------|
| QA      | 75            | 60         | 66.67         | 100             | 60           | 75              |
| QB      | 61            | 80         | 69.22         | 100             | 70           | 82.35           |
| QC      | 58            | 83         | 68.28         | 90              | 83           | 86.35           |
| QD      | 68            | 84         | 75.16         | 91              | 85           | 87.89           |
| QE      | 68            | 91         | 77.83         | 73              | 92           | 81.40           |
| Average | 66            | 79.6       | 71.43         | 90.8            | 78           | 82.6            |

Moreover, the F-measure, which combines precision and recall, is much higher for the fuzzy semantic similarity measure than for the vector space model.

V. CONCLUSION

In this paper, we have proposed an extrinsic plagiarism detection system which uses the vector space model in combination with a fuzzy semantic similarity measure so as to overcome the drawbacks of the vector space model. A four-stage algorithm is proposed in which candidate documents are retrieved through the vector space model, and the verification of the candidate documents as well as the detection of plagiarised passages is done through the fuzzy semantic similarity measure. The use of fuzzy similarity enables the system to detect intelligent forms of plagiarism such as paraphrasing and shuffling, as well as literal plagiarism. The system also reports the extent of plagiarism detected through two parameters, the similarity index and the overall similarity index. When analysed using metrics such as precision, recall and F-measure, the system shows improved performance over similar approaches that use a pure vector space model. From the experimental results, it is observed that the number of false positives, or incorrect detections, decreases considerably with the fuzzy semantic similarity measure when compared to the vector space model. The average F-measure is also much higher for the fuzzy semantic similarity measure than that obtained with the vector space model. These results show that fuzzy semantic similarity performs better for extrinsic plagiarism detection than a pure vector space model.

REFERENCES

[1] S. M. Alzahrani, N. Salim, and A. Abraham, "Understanding Plagiarism Linguistic Patterns, Textual Features and Detection Methods," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. PP, pp. 1-17, 2011.
[2] A. Ekbal, S. Saha, and G. Choudhary, "Plagiarism Detection in Text using Vector Space Model," in Proc. 12th International Conference on Hybrid Intelligent Systems (HIS), 2012.
[3] Z. Su, B.-R. Ahn, K.-Y. Eom, M.-K. Kang, J.-P. Kim, and M.-K. Kim, "Plagiarism detection using the Levenshtein distance and Smith-Waterman algorithm," in Proc. 3rd International Conference on Innovative Computing, Information and Control (ICICIC '08), Washington, DC, USA: IEEE Computer Society, 2008, p. 569. [Online]. Available: http://dx.doi.org/10.1109/ICICIC.2008.422
[4] J. Kasprzak and M. Brandejs, "Improving the reliability of the plagiarism detection system - lab report for PAN at CLEF 2010," in Notebook Papers of CLEF 2010 LABs and Workshops, 2010.
[5] T. C. Hoad and J. Zobel, "Methods for identifying versioned and plagiarized documents," Journal of the American Society for Information Science and Technology, vol. 54, no. 3, pp. 203-215, Feb. 2003.
[6] S. Meyer zu Eissen and B. Stein, "Intrinsic plagiarism detection," in Advances in Information Retrieval: 28th European Conference on IR Research (ECIR 2006), 2006, pp. 565-569.
[7] B. Gipp and N. Meuschke, "Citation pattern matching algorithms for citation-based plagiarism detection: greedy citation tiling, citation chunking and longest common citation sequence," in Proc. ACM Symposium on Document Engineering, M. R. B. Hardy and F. W. Tompa, Eds. ACM, 2011, pp. 249-258.
[8] J. Ropero, A. Gómez, C. León, and A. Carrasco, "Information Extraction in a Set of Knowledge Using a Fuzzy Logic Based Intelligent Agent," in Proc. Computational Science and Its Applications (ICCSA 2007), Kuala Lumpur, Malaysia, Aug. 2007.
[9] A. H. Osman, N. Salim, M. S. Binwahlan, R. Alteeb, and A. Abuobieda, "An improved plagiarism detection scheme based on semantic role labeling," Applied Soft Computing, vol. 12, no. 5, May 2012.
[10] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.
[11] R. Yerra and Y.-K. Ng, "A Sentence-Based Copy Detection Approach for Web Documents," in Fuzzy Systems and Knowledge Discovery, 2005, pp. 557-570.
[12] G. A. Miller, "WordNet: A Lexical Database for English," Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.
[13] P. Clough and M. Stevenson, "Developing a corpus of plagiarised short answers," Language Resources and Evaluation, vol. 45, no. 1, pp. 5-24, March 2011.
[14] S. Alzahrani and N. Salim, "Fuzzy semantic-based string similarity for extrinsic plagiarism detection: Lab Report for PAN at CLEF'10," in Proc. 4th Int. Workshop PAN-10, Padua, Italy, 2010.
[15] G. Pant, P. Srinivasan, and F. Menczer, "Crawling the Web," in Web Dynamics: Adapting to Change in Content, Size, Topology and Use, M. Levene and A. Poulovassilis, Eds. Springer-Verlag, 2004, pp. 153-178.
[16] S. Alzahrani, N. Salim, A. Abraham, and V. Palade, "iPlag: Intelligent Plagiarism Reasoner in Scientific Publications," in Proc. World Congress on Information and Communication Technologies (WICT 2011), Dec. 2011, pp. 1-6.

