Technologies for Reusing Text from the Web

Author: Elinor Jackson
The Oral Exam of Martin Potthast

To Obtain the Academic Degree of Dr. rer. nat.

Web Technology & Information Systems Group Bauhaus-Universität Weimar

www.uni-weimar.de

www.webis.de

www.potthast.net

© www.webis.de 2011

Technologies for Reusing Text from the Web

- Plagiarism
- Quotation
- Boilerplate
- Translation
- Summarization
- Metaphrase
- Paraphrase

Contributions of Technologies for Reusing Text from the Web

1. Models & Algorithms
- Unifying fingerprinting framework
- Cross-language ESA
- Comment cross-media similarity
- Query segmentation algorithms

2. Surveys
- Fingerprinting
- Plagiarism detection
- Web comment retrieval
- Query segmentation

3. Evaluation Resources
- Wikipedia as near-duplicate corpus
- Wikipedia as cross-language corpus
- 3 measures for plagiarism detection
- 3 plagiarism corpora
- Query segmentation corpus

4. Comparative Evaluations
- 5 fingerprint algorithms
- 3 cross-language models
- 32 plagiarism detectors within 3 PAN evaluation competitions
- 8 query segmentation algorithms

5. Tools
- Netspeak
- Picapica
- OpinionCloud
- AItools lib

Detecting Cross-Language Text Reuse


Measuring Cross-language Similarity

Alan Turing was conceived at Chatrapur, Orissa, India. His father was a member of the Indian Civil Service. He and his wife wanted Alan to be brought up in England, so they returned to Maida Vale, London, where Alan Turing was born on 23 June 1912. He had an elder brother, John. His father's civil service commission was still active, and during Turing's childhood years his parents travelled between Hastings, England and India, leaving their two sons to stay with a retired Army couple. Very early in life, Turing showed signs of the genius he was to later prominently display.

Alan Mathison Turing was born on 23 June 1912. His father was Julius Mathison Turing, member of the civil service in India, and his mother Ethel Sara Turing, the daughter of Edward Waller Stoney. Alan's childhood was spent with his elder brother John, living with a retired Army couple near Hastings, England. His parents returned to India until the end of his father's civil service commission, and visited when they could. Signs of Turing's genius showed early in his life. It is reported that he taught himself reading in less than three weeks.

Measuring Cross-language Similarity

Term vectors of the two texts (Text 1: "Alan Turing was conceived ...", Text 2: "Alan Mathison Turing was born ..."):

Term      Text 1   Text 2
turing    4        5
travel    1        0
teach     0        1
army      1        1
alan      3        2
active    1        0
...       ...      ...

ϕ, the similarity of the two vectors:
- Euclidean distance
- scalar product
- cosine similarity
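The three candidate measures for ϕ named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the function names are chosen here, and the two vectors are the term counts that appear to annotate the example texts on the slide (terms turing, travel, teach, army, alan, active).

```python
from math import sqrt


def scalar_product(u, v):
    """Sum of element-wise products of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Straight-line distance; smaller means more similar."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Scalar product normalized by both vector lengths."""
    norm = sqrt(scalar_product(u, u)) * sqrt(scalar_product(v, v))
    return scalar_product(u, v) / norm if norm else 0.0


# Assumed term order: turing, travel, teach, army, alan, active
text_1 = [4, 1, 0, 1, 3, 1]
text_2 = [5, 0, 1, 1, 2, 0]

if __name__ == "__main__":
    print(scalar_product(text_1, text_2))
    print(euclidean_distance(text_1, text_2))
    print(cosine_similarity(text_1, text_2))
```

Unlike the scalar product and the Euclidean distance, cosine similarity is independent of vector length, which is why it is a common choice when the compared texts differ in size.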

Measuring Cross-language Similarity

Alan Turing was conceived at Chatrapur, Orissa, India. His father was a member of the Indian Civil Service. He and his wife wanted Alan to be brought up in England, so they returned to Maida Vale, London, where Alan Turing was born on 23 June 1912. He had an elder brother, John. His father's civil service commission was still active, and during Turing's childhood years his parents travelled between Hastings, England and India, leaving their two sons to stay with a retired Army couple. Very early in life, Turing showed signs of the genius he was to later prominently display.

Turings Vater Julius Mathison Turing, ein britischer Staatsdiener in Chatrapur, Indien, und dessen Frau Ethel Sara wollten, dass ihr Kind in Großbritannien geboren wird. Deshalb kehrten sie nach London-Paddington zurück, wo Alan Turing am 23. Juni 1912 zur Welt kam. Da der Staatsdienst seines Vaters noch nicht beendet war, pendelte dieser während Turings Kindheit zwischen England und Indien. Seine Familie ließ er aus Furcht vor Gefahren in der britischen Kolonie bei Freunden in England zurück. Schon in frühester Kindheit zeigte sich die hohe Begabung und Intelligenz Turings.

Measuring Cross-language Similarity

Term vectors of the English and the German text:

Term       English   German
turing     4         5
travel     1         0
two        1         0
britisch   0         2
beendet    0         1
alan       3         1
...        ...       ...

ϕ is not meaningful across languages, unless using
- syntax overlaps
- translations
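The "translations" option can be illustrated with a small Python sketch: the German term vector is mapped into English vocabulary before ϕ (here cosine similarity) is applied. The dictionary `DE_EN` and both toy vectors are hypothetical and only serve the illustration; a real system would rely on a full bilingual dictionary or a machine translation step.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy German-to-English dictionary (an assumption for this sketch).
DE_EN = {"vater": "father", "kindheit": "childhood", "geboren": "born"}


def cosine(u, v):
    """Cosine similarity of two sparse term vectors (term -> weight)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def translate_vector(vec, dictionary):
    """Map each term through the dictionary; unknown terms are kept as they are."""
    translated = Counter()
    for term, weight in vec.items():
        translated[dictionary.get(term, term)] += weight
    return translated


english = Counter({"father": 1, "childhood": 1, "born": 1})
german = Counter({"vater": 1, "kindheit": 1, "geboren": 1})

if __name__ == "__main__":
    print(cosine(english, german))                           # no shared terms
    print(cosine(english, translate_vector(german, DE_EN)))  # after translation
```

Terms that are missing from the dictionary pass through unchanged, so names shared by both languages (such as "turing") still contribute to the similarity.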

Cross-language Explicit Semantic Analysis

[Figure: the English and the German example text are each compared by ϕ with a collection of documents in their own language, which yields one vector of similarity values (e.g. 0.2, 0.1, ...) per text. Applying ϕ to these two vectors gives the cross-language similarity.]
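The general idea behind CL-ESA can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in the talk: the two index collections are tiny hypothetical stand-ins for aligned document collections (position i in one language covers the same topic as position i in the other), plain term frequencies are used as weights, and ϕ is taken to be cosine similarity.

```python
from collections import Counter
from math import sqrt


def tf(text):
    """Term-frequency vector of a whitespace-tokenized, lower-cased text."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def esa_vector(text, index_docs):
    """Represent a text by its similarities to each index document."""
    d = tf(text)
    return {i: cosine(d, tf(doc)) for i, doc in enumerate(index_docs)}


def cl_esa_similarity(text_a, index_a, text_b, index_b):
    """Compare two texts of different languages via their aligned ESA vectors."""
    return cosine(esa_vector(text_a, index_a), esa_vector(text_b, index_b))


# Hypothetical aligned index collections (same topic at the same position).
INDEX_EN = ["turing computer science machine", "india civil service colony", "football match goal"]
INDEX_DE = ["turing informatik maschine computer", "indien staatsdienst kolonie", "fussball spiel tor"]

if __name__ == "__main__":
    print(cl_esa_similarity("turing worked on the machine", INDEX_EN,
                            "turing baute eine maschine", INDEX_DE))
    print(cl_esa_similarity("turing worked on the machine", INDEX_EN,
                            "das spiel endete ohne tor", INDEX_DE))
```

Because each text is only ever compared with documents of its own language, no translation step is needed; the alignment of the two index collections carries the cross-language link.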

Cross-language Explicit Semantic Analysis

Experiments
1. cross-language ranking
2. bilingual rank correlation
3. cross-language similarity distribution
4. quality vs. dimensionality of CL-ESA
5. multilingualism (number of possible simultaneous languages)
6. runtime

- comparison to two other state-of-the-art models
- usage of 2 multilingual test collections
- comparison on 6 pairs of languages
- more than 100 000 documents in each of several dozen runs
- > 100 million similarities computed

[Figure: results of Experiments 1 to 3 on the JRC-Acquis and Wikipedia collections. Dimensions: 10, 10^2, 10^3, 10^4, 10^5.
- Experiment 1, Cross-language Ranking: Recall over Rank (1 to 50)
- Experiment 2, Bilingual rank correlation
- Experiment 3, Cross-language Similarity Distribution: Ratio of Similarities over Similarity Interval (0 to 1)
Values annotated in the plots: JRC-Acquis 0.81, 0.46, 0.20, 0.09, 0.04; Wikipedia 0.72, 0.61, 0.44, 0.22, 0.07.]
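One plausible reading of the Recall over Rank plots is recall at rank k: for each query document, the documents of the other language are ranked by similarity, and the query counts as a hit if its aligned counterpart appears among the top k. The sketch below follows that reading; it is an assumption about the evaluation setup, and the similarity matrix is made up for illustration.

```python
def recall_at_k(similarities, k):
    """Fraction of queries whose aligned document is ranked within the top k.

    similarities[i][j] is the similarity of query i to candidate j; the
    aligned (correct) candidate of query i is assumed to be candidate i.
    """
    if not similarities:
        return 0.0
    hits = 0
    for i, row in enumerate(similarities):
        # Candidate indices ordered from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarities)


# Made-up similarities of 3 queries to 3 candidates.
SIMS = [
    [0.9, 0.1, 0.2],  # aligned document at rank 1
    [0.3, 0.2, 0.8],  # aligned document at rank 3
    [0.1, 0.7, 0.6],  # aligned document at rank 2
]

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, recall_at_k(SIMS, k))
```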

Evaluating Plagiarism Detectors


Detection Performance Measures

Suspicious Document $d_{plg}$
(Taken from http://en.wikipedia.org/wiki/Alan_Turing and post-edited to include material from the source document.)

Alan Mathison Turing, OBE, FRS (23 June 1912 – 7 June 1954), was an English mathematician, logician, cryptanalyst, and computer scientist. He was highly influential in the development of computer science, providing a formalisation of the concepts of "algorithm" and "computation" with the Turing machine, which played a significant role in the creation of the modern computer. Turing is widely considered to be the father of computer science and artificial intelligence. He was stockily built, had a high-pitched voice, and was talkative, witty, and somewhat donnish.

During the Second World War, Turing worked for the Government Code and Cypher School at Bletchley Park, Britain's codebreaking centre. For a time he was head of Hut 8, the section responsible for German naval cryptanalysis. He devised a number of techniques for breaking German ciphers, including the method of the bombe, an electromechanical machine that could find settings for the Enigma machine.

After the war he worked at the National Physical Laboratory, where he created one of the first designs for the stored-program computer ACE. In 1949, he went to Manchester University where he directed the computing laboratory and developed a body of work that helped to form the basis for the field of artificial intelligence. In 1951 he was elected a fellow of the Royal Society. Turing's homosexuality resulted in a criminal prosecution in 1952, when homosexual acts were still illegal in the United Kingdom. He accepted treatment with female hormones (chemical castration) as an alternative to prison. He died in 1954, just over two weeks before his 42nd birthday, from cyanide poisoning. An inquest determined it was suicide; his mother and some others believed his death was accidental. On 10 September 2009, following an Internet campaign, British Prime Minister Gordon Brown made an official public apology on behalf of the British government for the way in which Turing was treated after the war.

Source Document $d_{src}$
(Taken from http://www.bbc.co.uk/history/people/alan_turing)

Alan Turing was born on 23 June, 1912, in London. His father was in the Indian Civil Service and Turing's parents lived in India until his father's retirement in 1926. Turing and his brother stayed with friends and relatives in England. Turing studied mathematics at Cambridge University, and subsequently taught there, working in the burgeoning world of quantum mechanics. It was at Cambridge that he developed the proof which states that automatic computation cannot solve all mathematical problems. This concept, also known as the Turing machine, is considered the basis for the modern theory of computation.

In 1936, Turing went to Princeton University in America, returning to England in 1938. He began to work secretly part-time for the British cryptanalytic department, the Government Code and Cypher School. On the outbreak of war he took up full-time work at its headquarters, Bletchley Park. After the war, Turing turned his thoughts to the development of a machine that would logically process information. He worked first for the National Physical Laboratory (1945-1948). His plans were dismissed by his colleagues and the lab lost out on being the first to design a digital computer. It is thought that Turing's blueprint would have secured them the honour, as his machine was capable of computation speeds higher than the others. In 1949, he went to Manchester University where he directed the computing laboratory and developed a body of work that helped to form the basis for the field of artificial intelligence. In 1951 he was elected a fellow of the Royal Society.

In 1952, Turing was arrested and tried for homosexuality, then a criminal offence. To avoid prison, he accepted injections of oestrogen for a year, which were intended to neutralise his libido. In that era, homosexuals were considered a security risk as they were open to blackmail. Turing's security clearance was withdrawn, meaning he could no longer work for GCHQ, the post-war successor to Bletchley Park. He committed suicide on 7 June, 1954.

- Plagiarism: $s = \langle s_{plg}, d_{plg}, s_{src}, d_{src} \rangle$
- Detection: $r = \langle r_{plg}, d_{plg}, r_{src}, d'_{src} \rangle$
- What is the detection quality?
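One way to approach the question of detection quality is to compare a plagiarism case s with a detection r at the character level. The sketch below is a simplification made for illustration and not necessarily the measures the talk goes on to define: it considers only the suspicious-document side, a single case and a single detection, and it assumes that passages are given as half-open character intervals (start, end) within the suspicious document.

```python
def overlap(a, b):
    """Number of characters shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def precision_recall(s_plg, r_plg):
    """Character-level precision and recall of detection r_plg for case s_plg."""
    shared = overlap(s_plg, r_plg)
    detected_len = r_plg[1] - r_plg[0]
    case_len = s_plg[1] - s_plg[0]
    precision = shared / detected_len if detected_len else 0.0
    recall = shared / case_len if case_len else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical offsets: the case spans characters 100-300,
    # the detector reports characters 150-250.
    print(precision_recall((100, 300), (150, 250)))
```

In this toy setting, a detection that lies entirely inside the plagiarized passage is fully precise but recalls only the part of the passage it covers.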

Detection Performance Measures Suspicious Document dplg

splg

Source Document dsrc

Alan Mathison Turing, OBE, FRS (23 June 1912 – 7 June 1954), was an English mathematician, logician, cryptanalyst, and computer scientist. He was highly influential in the development of computer science, providing a formalisation of the concepts of "algorithm" and "computation" with the Turing machine, which played a significant role in the creation of the modern computer. Turing is widely considered to be the father of computer science and artificial intelligence. He was stockily built, had a high-pitched voice, and was talkative, witty, and somewhat donnish.

Alan Turing was born on 23 June, 1912, in London. His father was in the Indian Civil Service and Turing's parents lived in India until his father's retirement in 1926. Turing and his brother stayed with friends and relatives in England. Turing studied mathematics at Cambridge University, and subsequently taught there, working in the burgeoning world of quantum mechanics. It was at Cambridge that he developed the proof which states that automatic computation cannot solve all mathematical problems. This concept, also known as the Turing machine, is considered the basis for the modern theory of computation.

During the Second World War, Turing worked for the Government Code and Cypher School at Bletchley Park, Britain's codebreaking centre. For a time he was head of Hut 8, the section responsible for German naval cryptanalysis. He devised a number of techniques for breaking German ciphers, including the method of the bombe, an electromechanical machine that could find settings for the Enigma machine.

In 1936, Turing went to Princeton University in America, returning to England in 1938. He began to work secretly part-time for the British cryptanalytic department, the Government Code and Cypher School. On the outbreak of war he took up full-time work at its headquarters, Bletchley Park.

After the war he worked at the National Physical Laboratory, where he created one of the first designs for the stored-program computer ACE. In 1949, he went to Manchester University where he directed the computing laboratory and developed a body of work that helped to form the basis for the field of artificial intelligence. In 1951 he was elected a fellow of the Royal Society.

ssrc

Turing's homosexuality resulted in a criminal prosecution in 1952, when homosexual acts were still illegal in the United Kingdom. He accepted treatment with female hormones (chemical castration) as an alternative to prison. He died in 1954, just over two weeks before his 42nd birthday, from cyanide poisoning. An inquest determined it was suicide; his mother and some others believed his death was accidental. On 10 September 2009, following an Internet campaign, British Prime Minister Gordon Brown made an official public apology on behalf of the British government for the way in which Turing was treated after the war.

After the war, Turing turned his thoughts to the development of a machine that would logically process information. He worked first for the National Physical Laboratory (1945-1948). His plans were dismissed by his colleagues and the lab lost out on being the first to design a digital computer. It is thought that Turing's blueprint would have secured them the honour, as his machine was capable of computation speeds higher than the others. In 1949, he went to Manchester University where he directed the computing laboratory and developed a body of work that helped to form the basis for the field of artificial intelligence. In 1951 he was elected a fellow of the Royal Society. In 1952, Turing was arrested and tried for homosexuality, then a criminal offence. To avoid prison, he accepted injections of oestrogen for a year, which were intended to neutralise his libido. In that era, homosexuals were considered a security risk as they were open to blackmail. Turing's security clearance was withdrawn, meaning he could no longer work for GCHQ, the post-war successor to Bletchley Park. He committed suicide on 7 June, 1954.

Taken from http://en.wikipedia.org/wiki/Alan_Turing and post-edited to include material from the right hand text.

Taken from http://www.bbc.co.uk/history/people/alan_turing

- Plagiarism: $s = \langle s_{plg}, d_{plg}, s_{src}, d_{src} \rangle$
- Detection: $r = \langle r_{plg}, d_{plg}, r_{src}, d'_{src} \rangle$
- What is the detection quality?

Detection Performance Measures

Suspicious Document $d_{plg}$

$s_{plg}$

Source Document $d_{src}$

Alan Mathison Turing, OBE, FRS (23 June 1912 – 7 June 1954), was an English mathematician, logician, cryptanalyst, and computer scientist. He was highly influential in the development of computer science, providing a formalisation of the concepts of "algorithm" and "computation" with the Turing machine, which played a significant role in the creation of the modern computer. Turing is widely considered to be the father of computer science and artificial intelligence. He was stockily built, had a high-pitched voice, and was talkative, witty, and somewhat donnish.

Alan Turing was born on 23 June, 1912, in London. His father was in the Indian Civil Service and Turing's parents lived in India until his father's retirement in 1926. Turing and his brother stayed with friends and relatives in England. Turing studied mathematics at Cambridge University, and subsequently taught there, working in the burgeoning world of quantum mechanics. It was at Cambridge that he developed the proof which states that automatic computation cannot solve all mathematical problems. This concept, also known as the Turing machine, is considered the basis for the modern theory of computation.

During the Second World War, Turing worked for the Government Code and Cypher School at Bletchley Park, Britain's codebreaking centre. For a time he was head of Hut 8, the section responsible for German naval cryptanalysis. He devised a number of techniques for breaking German ciphers, including the method of the bombe, an electromechanical machine that could find settings for the Enigma machine.

In 1936, Turing went to Princeton University in America, returning to England in 1938. He began to work secretly part-time for the British cryptanalytic department, the Government Code and Cypher School. On the outbreak of war he took up full-time work at its headquarters, Bletchley Park.

$s_{src}$

After the war he worked at the National Physical Laboratory, where he created one of the first designs for the stored-program computer ACE. In 1949, he went to Manchester University where he directed the computing laboratory and developed a body of work that helped to form the basis for the field of artificial intelligence. In 1951 he was elected a fellow of the Royal Society. Turing's homosexuality resulted in a criminal prosecution in 1952, when homosexual acts were still illegal in the United Kingdom. He accepted treatment with female hormones (chemical castration) as an alternative to prison. He died in 1954, just over two weeks before his 42nd birthday, from cyanide poisoning. An inquest determined it was suicide; his mother and some others believed his death was accidental. On 10 September 2009, following an Internet campaign, British Prime Minister Gordon Brown made an official public apology on behalf of the British government for the way in which Turing was treated after the war.

$r_{plg}$

After the war, Turing turned his thoughts to the development of a machine that would logically process information. He worked first for the National Physical Laboratory (1945-1948). His plans were dismissed by his colleagues and the lab lost out on being the first to design a digital computer. It is thought that Turing's blueprint would have secured them the honour, as his machine was capable of computation speeds higher than the others. In 1949, he went to Manchester University where he directed the computing laboratory and developed a body of work that helped to form the basis for the field of artificial intelligence. In 1951 he was elected a fellow of the Royal Society. In 1952, Turing was arrested and tried for homosexuality, then a criminal offence. To avoid prison, he accepted injections of oestrogen for a year, which were intended to neutralise his libido. In that era, homosexuals were considered a security risk as they were open to blackmail. Turing's security clearance was withdrawn, meaning he could no longer work for GCHQ, the post-war successor to Bletchley Park.

$r_{src}$

He committed suicide on 7 June, 1954.

Taken from http://en.wikipedia.org/wiki/Alan_Turing and post-edited to include material from the right hand text.




- Plagiarism: $s = \langle s_{plg}, d_{plg}, s_{src}, d_{src} \rangle$
- Detection: $r = \langle r_{plg}, d_{plg}, r_{src}, d'_{src} \rangle$
- What is the detection quality?
- $r$ detects $s$ iff $r_{plg} \cap s_{plg} \neq \emptyset$, $r_{src} \cap s_{src} \neq \emptyset$, and $d'_{src} = d_{src}$
- $|s \sqcap r| :=$ number of overlapping characters if $r$ detects $s$, else $0$
- $precision(s, r) = \frac{|s \sqcap r|}{|r|} = 0.38$
- $recall(s, r) = \frac{|s \sqcap r|}{|s|} = 0.45$
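The single-pair measures can be illustrated with a minimal Python sketch. It assumes that passages are given as character offset ranges and that the size of a case or detection is the number of characters in both of its passages; the class `PassagePair`, the file names, and the offsets are hypothetical and do not reproduce the example on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassagePair:
    """A plagiarism case s or a detection r (hypothetical layout)."""
    plg_doc: str   # d_plg
    plg: range     # s_plg or r_plg as character offsets in d_plg
    src_doc: str   # d_src or d'_src
    src: range     # s_src or r_src as character offsets in the source

    def size(self) -> int:
        # Assumption: |s| counts the characters of both passages.
        return len(self.plg) + len(self.src)


def detects(r: PassagePair, s: PassagePair) -> bool:
    """r detects s iff both passage pairs overlap and the documents match."""
    return (r.plg_doc == s.plg_doc and r.src_doc == s.src_doc
            and len(set(r.plg) & set(s.plg)) > 0
            and len(set(r.src) & set(s.src)) > 0)


def overlap(s: PassagePair, r: PassagePair) -> int:
    """Number of overlapping characters if r detects s, else 0."""
    if not detects(r, s):
        return 0
    return len(set(s.plg) & set(r.plg)) + len(set(s.src) & set(r.src))


def precision(s: PassagePair, r: PassagePair) -> float:
    return overlap(s, r) / r.size()   # assumes a non-empty detection


def recall(s: PassagePair, r: PassagePair) -> float:
    return overlap(s, r) / s.size()   # assumes a non-empty case


# Made-up offsets for illustration only.
s = PassagePair("suspicious.txt", range(0, 100), "source.txt", range(0, 100))
r = PassagePair("suspicious.txt", range(50, 150), "source.txt", range(50, 100))
print(round(precision(s, r), 3), round(recall(s, r), 3))
```

A detection that points to the wrong source document has an overlap of 0 under this definition, no matter how well its passage in the suspicious document matches.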


Possible patterns: [figure: patterns of overlap between plagiarism cases and detections], plus combinations thereof, plus combinations regarding pairs of suspicious and source documents

- no 1:1 correspondence between plagiarism cases and detections
- deal with sets of detections $R$ and plagiarism cases $S$
- avoid double-counting of detection overlaps (inclusion-exclusion principle)
- measure precision for each detection and recall for each plagiarism case, averaging the results:

  $precision(S, R) = \frac{1}{|R|} \sum_{r \in R} \frac{\left| \bigcup_{s \in S} (s \sqcap r) \right|}{|r|}$

  $recall(S, R) = \frac{1}{|S|} \sum_{s \in S} \frac{\left| \bigcup_{r \in R} (s \sqcap r) \right|}{|s|}$
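The set-based measures can be sketched as follows. This is an illustrative Python sketch under the assumption that passages are character offset ranges; the `Case` tuple, the document names, and the offsets are hypothetical. Taking the union of the overlaps before counting is what avoids double-counting characters covered by several detections.

```python
from typing import List, NamedTuple, Set, Tuple


class Case(NamedTuple):
    """A plagiarism case s or a detection r (hypothetical layout)."""
    plg_doc: str
    plg: range
    src_doc: str
    src: range


def chars(c: Case) -> Set[Tuple[str, int]]:
    """All character positions of a case, tagged by passage role."""
    return {("plg", i) for i in c.plg} | {("src", i) for i in c.src}


def overlap(s: Case, r: Case) -> Set[Tuple[str, int]]:
    """The characters shared by s and r; empty unless r detects s."""
    plg = set(s.plg) & set(r.plg)
    src = set(s.src) & set(r.src)
    same_docs = s.plg_doc == r.plg_doc and s.src_doc == r.src_doc
    if not (same_docs and plg and src):
        return set()
    return {("plg", i) for i in plg} | {("src", i) for i in src}


def precision(S: List[Case], R: List[Case]) -> float:
    if not R:
        return 0.0  # assumed convention for an empty set of detections
    total = 0.0
    for r in R:
        covered = set().union(*(overlap(s, r) for s in S))
        total += len(covered) / len(chars(r))
    return total / len(R)


def recall(S: List[Case], R: List[Case]) -> float:
    if not S:
        return 0.0  # assumed convention for an empty set of cases
    total = 0.0
    for s in S:
        covered = set().union(*(overlap(s, r) for r in R))
        total += len(covered) / len(chars(s))
    return total / len(S)


# Made-up example: two overlapping detections and one false detection.
S = [Case("d1", range(0, 100), "d2", range(0, 100))]
R = [Case("d1", range(0, 50), "d2", range(0, 50)),
     Case("d1", range(25, 75), "d2", range(25, 75)),
     Case("d1", range(200, 300), "d9", range(0, 100))]
print(round(precision(S, R), 3), recall(S, R))
```

In the example, the two correct detections share 25 characters in each document; summing their overlaps naively would report a recall of 1.0, whereas the union yields 0.75.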


$s_{plg}$:

After the war he worked at the National Physical Laboratory, where he created one of the first designs for the stored-program computer ACE. In 1949, he went to Manchester University where he directed the computing laboratory and developed a body of work that helped to form the basis for the field of artificial intelligence. In 1951 he was elected a fellow of the Royal Society.

$s_{src}$:

After the war, Turing turned his thoughts to the development of a machine that would logically process information. He worked first for the National Physical Laboratory (1945-1948). His plans were dismissed by his colleagues and the lab lost out on being the first to design a digital computer. It is thought that Turing's blueprint would have secured them the honour, as his machine was capable of computation speeds higher than the others. In 1949, he went to Manchester University where he directed the computing laboratory and developed a body of work that helped to form the basis for the field of artificial intelligence. In 1951 he was elected a fellow of the Royal Society.

- undesirable fragmentation of the detection
- measure the average number of times a plagiarism case is detected:

  $granularity(S, R) = \frac{1}{|S_R|} \sum_{s \in S_R} |R_s|$

  where $S_R \subseteq S$ are detected cases, and $R_s \subseteq R$ are detections of $s$
- precision, recall, and granularity allow only for a partial order
- combination of the three measures into one score:

  $plagdet(S, R) = \frac{F_1}{\log_2(1 + granularity(S, R))}$

  where $F_1$ is the harmonic mean of precision and recall
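Granularity and the combined score can be sketched in a few lines of Python. The sketch below is illustrative only: cases and detections are simplified to hypothetical `(start, end)` character intervals in a single document, the `detects` predicate is passed in by the caller, and returning a granularity of 1.0 when nothing is detected is an assumed convention, not something the slides state.

```python
import math
from typing import Callable, Sequence, Tuple

Interval = Tuple[int, int]  # hypothetical (start, end) character offsets


def granularity(S: Sequence[Interval], R: Sequence[Interval],
                detects: Callable[[Interval, Interval], bool]) -> float:
    """Average number of detections per detected plagiarism case."""
    detected = [s for s in S if any(detects(r, s) for r in R)]  # S_R
    if not detected:
        return 1.0  # assumed convention when no case is detected
    counts = [sum(1 for r in R if detects(r, s)) for s in detected]  # |R_s|
    return sum(counts) / len(detected)


def plagdet(precision: float, recall: float, gran: float) -> float:
    """F1 of precision and recall, discounted by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + gran)


def intervals_overlap(r: Interval, s: Interval) -> bool:
    """Toy stand-in for 'r detects s' on half-open intervals."""
    return r[0] < s[1] and s[0] < r[1]


# Made-up example: one case is reported in three fragments, one is missed.
S = [(0, 100), (200, 300)]
R = [(0, 30), (40, 70), (80, 100)]
g = granularity(S, R, intervals_overlap)
print(g, plagdet(0.5, 0.5, 1.0), plagdet(0.5, 0.5, g))
```

With a granularity of 1 the denominator is $\log_2 2 = 1$, so plagdet equals $F_1$; a case reported in three fragments halves the score in this example.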

Evaluation Competitions at PAN 2009-2011


[Figure: four bar chart panels per competition year showing plagdet, precision, recall, and granularity for each participant. Plagdet, precision, and recall range from 0 to 1; granularity ranges from 1 to 2.]

Participants as listed in the figure:

2009: Grozea, Kasprzak, Basile, Palkovskii, Zechner, Shcherbinin, Pereira, Vallés Balaguer, Malcolm, Allen

2010: Kasprzak, Zou, Muhr, Grozea, Oberreuter, Torrejón, Pereira, Palkovskii, Sobha, Gottron, Micol, Costa-jussà, Nawab, Gupta, Vania, Suàrez, Alzahrani, Iftene

2011: Grman, Grozea, Oberreuter, Cooke, Torrejón, Rao, Palkovskii, Nawab, Ghosh

Reusing the Web for Writing Assistance


- writing is not so much about what to write, but how
- finding the right words is essential to maximize understanding
- Netspeak is a search engine for words in context

Technical details:
- > 3 billion phrases and their usage frequencies as of 2006
- > 120 GB inverted index data structure (scalable)
- < 1 second response time
- > 4300 users / month
- wildcard query processor
- instant search
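The idea of a wildcard search over phrases and their usage frequencies can be illustrated with a toy Python sketch. It is not Netspeak's implementation: the `PhraseIndex` class, the positional posting lists, the single-word wildcard `?`, and all phrase counts are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


class PhraseIndex:
    """Toy inverted index from (position, word) to phrases (hypothetical)."""

    def __init__(self, phrase_counts: Dict[str, int]):
        self.counts = dict(phrase_counts)
        self.postings = defaultdict(set)
        for phrase in self.counts:
            for pos, word in enumerate(phrase.split()):
                self.postings[(pos, word)].add(phrase)

    def search(self, query: str, k: int = 10) -> List[str]:
        """Return up to k matching phrases, most frequent first.

        '?' stands for exactly one arbitrary word (an assumed syntax).
        """
        terms = query.split()
        candidates = None
        for pos, term in enumerate(terms):
            if term == "?":
                continue
            hits = self.postings.get((pos, term), set())
            candidates = hits if candidates is None else candidates & hits
        if candidates is None:  # the query consists of wildcards only
            candidates = set(self.counts)
        matches = [p for p in candidates if len(p.split()) == len(terms)]
        return sorted(matches, key=lambda p: (-self.counts[p], p))[:k]


# Made-up phrases and frequencies for illustration only.
index = PhraseIndex({
    "waiting for response": 100,
    "waiting for a response": 500,
    "waiting on response": 20,
    "looking for response": 50,
})
print(index.search("waiting ? response"))
```

Ranking the matches by frequency is what turns the index into a writing aid: the most common wording for the queried context appears first.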

Contributions of Technologies for Reusing Text from the Web

1. Models & Algorithms
   - Unifying fingerprinting framework
   - Cross-language ESA
   - Comment cross-media similarity
   - Query segmentation algorithms

2. Surveys
   - Fingerprinting
   - Plagiarism detection
   - Web comment retrieval
   - Query segmentation

3. Evaluation Resources
   - Wikipedia as near-duplicate corpus
   - Wikipedia as cross-language corpus
   - 3 measures for plagiarism detection
   - 3 plagiarism corpora
   - Query segmentation corpus

4. Comparative Evaluations
   - 5 fingerprint algorithms
   - 3 cross-language models
   - 32 plagiarism detectors within 3 PAN evaluation competitions
   - 8 query segmentation algorithms

5. Tools
   - Netspeak
   - Picapica
   - OpinionCloud
   - AItools lib

Benno Stein; Maik Anderka; Steven Burrows; Tim Gollub; Matthias Hagen; Dennis Hoppe; Nedim Lipka; Sven Meyer zu Eißen; Peter Prettenhofer; Patrick Riehmann; Bernd Fröhlich; Alberto Barrón-Cedeño; Paolo Rosso; Paul Clough; Steffen Becker; Christof Bräutigam; Andreas Eiselt; Robert Gerling; Teresa Holfeld; Alexander Kümmel; Fabian Loose; Martin Trenkmann; Dietmar Bratke; Jürgen Eismann; Nadin Glaser; Maria-Theresa Hansens; Melanie Hennig; Dana Horch; Antje Klahn; Hildegard Kühndorf; Tina Meinhardt; Christin Oehmichen; Ursula Schmidt; Katja Schöllner; Nils Rethmeier; Tsvetomira Palakarska; Steven Reinisch; Hagen-Christian Tönnies; Michael Völske; Anita Schilling; Michael Blersch; Christoph Lössnitz; Dennis Braunsdorf; Alexander Kleppe; Franz Coriand; Verena Skuk; Anne Köpsel; Marcel Heunemann; Stefan Knoblauch; Klaus Krämer; Christian Fricke; Denis Kreis; Clement Welsch; Maximilian Michel; Jan Grassegger; Jan Dittrich; Fabian Vogelsteller; Felicitas Höbelt; Carsten Tetens; Jan Hühne; Nils Gründl; André Zölitz; Michael Hengst; Yunlu Ai; Markus Riedel; Bjarne-Vanja Melani; Henning Gründl; Stephan Bongartz; Daniel, Wiebke, Marc and Merle Potthast; Steffi, Leonie and Louisa Daniel; Gabi and Günter Aab; Georg Potthast and Hildegard Knoke; Ellinor Pfützner; Martin Weitert; Daniel Warner; Christian Ederer

Thank you!

Appendix

- Detecting Plagiarism and Evaluating Detectors
- Survey of Plagiarism Detection Evaluations
- Plagiarism Corpus Construction
- Netspeak Experiments


Detecting Plagiarism

[Figure: plagiarism detection pipeline with the elements suspicious document, document collection, heuristic retrieval, candidate documents, detailed comparison, knowledge-based post-processing, and suspicious passages; part of the diagram is labeled "Thesis".]

Evaluating Plagiarism Detectors

Simulate inputs, measure output quality, repeat.

What's required:
- corpus of plagiarism cases
- performance measures
- alternative implementations

Survey of Plagiarism Detection Evaluations

Evaluation Aspect              Text    Code
Experiment Task
  local collection              80%     95%
  Web retrieval                 15%      0%
  other                          5%      5%
Performance Measure
  precision, recall             43%     18%
  manual, similarity            35%     69%
  runtime only                  15%      1%
  other                          7%     12%
Comparison
  none                          51%     46%
  parameter settings             9%     19%
  other algorithms              40%     35%
Corpus Acquisition
  existing corpus               20%     18%
  homemade corpus               80%     82%
Corpus Size [# documents]
  [1, 10)                       11%     10%
  [10, 10^2)                    19%     30%
  [10^2, 10^3)                  38%     33%
  [10^3, 10^4)                   8%     11%
  [10^4, 10^5)                  16%      4%
  [10^5, 10^6)                   8%      0%

- more than 200 papers were reviewed
- many struggle with proper evaluation


Plagiarism Corpus Construction

Corpus overview:
- real plagiarism cases not available on a large scale
- plagiarism was generated automatically using heuristics
- plagiarism was also crowdsourced via Amazon’s Mechanical Turk
- the corpus was compiled 3 years in a row, improving it each time
- ∼ 27 000 documents (obtained from Project Gutenberg)
- ∼ 61 000 plagiarism cases

Corpus parameters:
1. document length
2. document purpose
3. plagiarism per document
4. plagiarism case length
5. plagiarism case obfuscation


Corpus Parameters

100%: 26 939 documents

Document length:
- 50%: 1-10 pages
- 35%: 10-100 pages
- 15%: 10^2-10^3 pages

Document purpose:
- 50%: source documents
- 50%: suspicious documents

Plagiarism per suspicious document:
- 50%: none
- 50%: range from little to entirely

100%: 61 064 plagiarism cases

Plagiarism case length: 35% 1150 words

Plagiarism case obfuscation:
- 18%: none
- 32%: automatic (weak)
- 31%: automatic (strong)
- 11%: translation, via Google Translate from de→en and es→en
- 8%: manual paraphrases, via Amazon’s Mechanical Turk

Netspeak Experiments

[Figure: two panels showing macro-averaged recall and micro-averaged recall (each from 0 to 1) as a function of the quantile (0 to 1), with curves for 1-word, 2-word, 3-word, and 4-word queries and their average, and the Netspeak quantile marked. Additional axes give the percentage of a postlist evaluated (0 to 100) and the retrieval time in seconds (0 to 10.03).]