Oracle, Where Shall I Submit My Precious Papers?

Author: Willis Gardner
Oracle, Where Shall I Submit My Precious Papers?
IST Faculty Brown Bag, Sep. 22, 2006
Dongwon Lee

Credits

Students
- Ergin Elmacioglu (CSE, Penn State)
- Su Yan (IST, Penn State)
- Ziming Zhuang (IST, Penn State)

Colleagues
- Lee Giles (Penn State)
- Min-Yen Kan (NUS, Singapore)
- Jaewoo Kang (Korea U., Korea)
- Divesh Srivastava (AT&T Labs – Research)

What do I do?

Databases / Data Mining

Digital Libraries / Info. Retrieval

XML / Web


What projects do I do?

Today’s Talk

Databases / Data Mining

Digital Libraries / Info. Retrieval
- Microsoft SciData 2005
- NSF OISE 2006

XML / Web
- IBM Eclipse, 2004 & 2006
- Penn State eBRC, 2005

Outline
- Motivation
- Simple Study
- Results
- Summary

MIT’s Prank

http://pdos.csail.mit.edu/scigen/

The World Multi-Conference on Systemics, Cybernetics and Informatics (SCI)

Annoyance…


“Dong-Won Lee” as PC?

WMSCI 2006


Some Known Questionable Venues

From http://www.inesc-id.pt/~aml/trash.html:
- IMCSE: International Multiconference in Computer Science and Computer Engineering
- WMSCI or SCI: World Multiconference on Systemics, Cybernetics and Informatics
- ICCCT: International Conference on Computing, Communications and Control Technologies
- PISTA: Conference on Politics and Information Systems: Technologies and Applications
- SSCCII: Symposium of Santa Caterina on Challenges in the Internet and Interdisciplinary Research
- CITSA: International Conference on Cybernetics and Information Technologies, Systems and Applications
- ISAS: International Conference on Information Systems Analysis and Synthesis
- CISCI: Conferencia Iberoamericana en Sistemas, Cibernética e Informática
- SIECI: Simposium Iberoamericano de Educación, Cibernética e Informática
- WCAC: World Congress in Applied Computing
- Any IPSI International Conference or journal
- Any GESTS international conference or journal
- KCPR: International Conference on Knowledge Communication and Peer Reviewing
- International e-Conference on Computer Science
- …

http://fakeconferences.org => taken down after a threat

Fakes Everywhere Microsoft HoneyMonkey


Fake Venues

- According to fakeconferences.org,
  - “… fake venues are ones that are organized for the revenue, not for the advancement of science…They share a lot in common…an abundance of varying, vaguely connected topics, high frequency of conference, spam mailings, obscure organizers and sponsors, and poor peer reviewing and randomly accepting papers …”
- WMSCI has listed close to 300 research topics as relevant in its Call-For-Paper (CFP), and reportedly accepted 2,165 and 2,904 papers in 2003 and 2004, respectively

Differences in Disciplines

Computer Science
- Peer-reviewed conferences
- Top conferences have 5-15% acceptance rate
- Specialized and small conferences (attendance of 500+)
- Often value conferences > journals

Pure Sciences (eg, Math, Physics)
- Pre-print at arXiv.org
- Rigorous reviews for journals
- Huge flagship conference (ICM 98 attracted ~4000)

Social Sciences
- Often value journals > conferences
- Conferences are mostly for gathering or short abstract based screening
- Rigorous reviews for journals

Outline
- Motivation
- Simple Study
- Results
- Summary

Research Question

Can we detect the so-called “fake venues” automatically?

Desiderata
- Large number of venues per year => scalable
- Automatic detection => no human involvement
- False positives >> false negatives

[Figure: Histogram of CFPs in dbworld, by year from 1999 to 2006; y-axis from 0 to 800]

Candidate Features

Good vs. bad venues
- Citation counting (eg, Impact Factor)
- Acceptance rate
- Reputation (eg, society)
- History
- …

In the end, none satisfies our desiderata. Need something else…

Research Hypothesis

Qualities of venues are closely correlated with those of PC members of the venues
- PC member list is readily available from the CFP => data extraction + data cleaning
- Each CFP has only a finite number of PC members => scalability
- Examine quality of PC w.r.t. heuristics:
  - Citation counting, productivity, centrality, betweenness, impact, …

Data Mining Models
- Outlier detection
- Clustering
- Classification

[Figure: a classifier built from a training set labels a new venue as “Fake?”]

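Classification w. Decision Tree

[Figure: decision tree built from the training set. Internal nodes ask “PC has feature A?”, “PC has feature B?”, “PC has feature C?”, “PC has feature D?” with Yes/No branches; leaves are “Regular venue” or “Fake venue”; a new venue (“Fake?”) is classified by the tree.]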
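To make the decision-tree idea concrete, here is a minimal sketch of how a single tree node could choose a threshold on one numeric PC feature by information gain. The feature values, labels, and function names are hypothetical toy inputs for illustration, not data from the study, and C4.5 itself uses gain ratio plus pruning rather than this bare criterion.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split
    of the form `value <= threshold` on a single numeric feature."""
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        remainder = (
            len(left) * entropy(left) + len(right) * entropy(right)
        ) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical toy data: average publication count of each venue's PC.
avg_pubs = [2.0, 3.5, 4.0, 20.0, 25.0, 31.0]
is_fake = ["fake", "fake", "fake", "regular", "regular", "regular"]
threshold, gain = best_threshold(avg_pubs, is_fake)
```

A full tree would apply this choice recursively to each side of the split and across all features, stopping when a node is pure or too small.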

5 Classification Features
- # of PC
- # of publications of PC
- # of co-authors of PC
- Closeness centrality of PC
- Betweenness centrality of PC
- …
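As a rough illustration of how the first three features might be computed, the sketch below builds co-authorship statistics from article author lists and averages them over a PC. The `articles` and PC member inputs and the function names are hypothetical; the talk does not say how per-member values were aggregated into a venue-level feature, so averaging is an assumption.

```python
from collections import defaultdict


def build_author_stats(articles):
    """Return (publication_count, coauthors) computed from a list of
    articles, where each article is a list of author names."""
    pub_count = defaultdict(int)
    coauthors = defaultdict(set)
    for authors in articles:
        unique = set(authors)
        for a in unique:
            pub_count[a] += 1
            coauthors[a].update(unique - {a})
    return pub_count, coauthors


def pc_features(pc_members, pub_count, coauthors):
    """Aggregate per-member statistics into venue-level features.
    Averaging is an assumption made for this sketch."""
    n = len(pc_members)
    if n == 0:
        return {"num_pc": 0, "avg_pubs": 0.0, "avg_coauthors": 0.0}
    return {
        "num_pc": n,
        "avg_pubs": sum(pub_count.get(m, 0) for m in pc_members) / n,
        "avg_coauthors": sum(len(coauthors.get(m, ())) for m in pc_members) / n,
    }


# Hypothetical toy input, not data from the study.
articles = [["A", "B"], ["A", "C"], ["B"], ["D", "A", "B"]]
pub_count, coauthors = build_author_stats(articles)
features = pc_features(["A", "B", "X"], pub_count, coauthors)
```

PC members who never appear in the bibliographic data (here `"X"`) contribute zeros, which pulls the averages down for a committee of unknown names.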

Set-Up
- ACM DL: downloaded data of 1950-2004
  - 0.6M authors, 0.7M articles
  - 1.2M edges (ie, collaboration)
- Dbworld: 2,979 CFPs (free text formats)
  - 16,147 distinct PC names
- Hand-selected 20 fake venues => Q
- Laborious cleaning process for venue, PC names, and citations (Another Talk):
  - Entity resolution
  - Name disambiguation
  - Record linkage
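The cleaning step matters because the same person can appear under several spellings, such as “Dong-Won Lee” versus “Dongwon Lee”. The sketch below shows one naive normalization key for grouping name variants. It is an illustrative assumption, not the entity-resolution method used in the study, and on real data it would both over-merge and under-merge.

```python
import re
import unicodedata
from collections import defaultdict


def name_key(name):
    """Map a personal name to a crude blocking key:
    lowercase ASCII, hyphens and punctuation removed, tokens sorted."""
    # Strip accents, e.g. "é" -> "e".
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Handle "Last, First" by swapping into "First Last" order.
    if "," in ascii_name:
        last, _, first = ascii_name.partition(",")
        ascii_name = f"{first} {last}"
    # Join hyphenated parts ("Dong-Won" -> "DongWon"), drop other punctuation.
    ascii_name = ascii_name.replace("-", "")
    ascii_name = re.sub(r"[^A-Za-z\s]", " ", ascii_name)
    tokens = ascii_name.lower().split()
    return " ".join(sorted(tokens))


def group_variants(names):
    """Group raw name strings that share the same key."""
    groups = defaultdict(list)
    for n in names:
        groups[name_key(n)].append(n)
    return dict(groups)


variants = group_variants(
    ["Dongwon Lee", "Dong-Won Lee", "Lee, Dongwon", "Lee Giles"]
)
```

Here the three spellings of one name fall into a single group while “Lee Giles” stays separate. Distinguishing two different people who share a name needs extra evidence such as co-authors or venues, which this key ignores.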

Outline
- Motivation
- Simple Study
- Results
- Summary

# of PC

[Figure: fraction of conferences and probability of Q, plotted against number of PC (x-axis from 10 to 410)]

# of publications of PC

[Figure: histograms for “LQC Q” and “C”; x-axis from 0.0 to 52.5, y-axis “Percent” from 0 to 80]

Closeness centrality of PC

[Figure: fraction of conferences and probability of LQC Q, plotted against average closeness (x-axis from 5.00E-03 to 0.095)]

$$C_C(v) = \frac{n - 1}{\sum_{w \in G} d(v, w)}, \qquad v, w \in G$$
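A minimal sketch of closeness centrality, (n − 1) divided by the sum of shortest-path distances from v, on an unweighted collaboration graph, using breadth-first search for the distances d(v, w). The adjacency-dict representation and the toy graph are assumptions for illustration. On a disconnected graph this version sums distances over reachable nodes only while keeping n as the total node count, which is one of several conventions.

```python
from collections import deque


def closeness(graph, v):
    """Closeness centrality of node v in an unweighted, undirected graph
    given as {node: set_of_neighbors}: (n - 1) / sum of distances from v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    if total == 0:
        return 0.0  # isolated node: no reachable collaborators
    n = len(graph)
    return (n - 1) / total


# Hypothetical path graph a - b - c - d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = {node: closeness(graph, node) for node in graph}
```

The two middle nodes score higher than the endpoints, matching the intuition that well-connected researchers sit closer to everyone else in the collaboration network.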

Combining All Features
- Naïve (C4.5)
  - Precision: 0.877
  - Recall: 0.965
- Bagging
  - Precision: 0.899
  - Recall: 0.979
- Boosting
  - Precision: 0.938
  - Recall: 0.964

[Figure: decision tree with nodes “PC has feature A?”, “PC has feature B?”, “PC has feature C?”, “PC has feature D?”]
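For reference, precision and recall for one class can be computed from true and predicted labels as sketched below. The label lists are hypothetical toy values, not the study's data, and the talk does not state which class or evaluation protocol the reported numbers refer to.

```python
def precision_recall(y_true, y_pred, positive="fake"):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels for six venues.
y_true = ["fake", "fake", "regular", "regular", "fake", "regular"]
y_pred = ["fake", "regular", "regular", "fake", "fake", "regular"]
precision, recall = precision_recall(y_true, y_pred)
```

Precision answers “of the venues flagged as fake, how many really are?”, while recall answers “of the truly fake venues, how many were flagged?”.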

More than “usual suspects”

- Classification detected two:
  - The 2nd International Advanced Database Conference (IADC)
  - The 4th International Conference on Computer Science and its Applications (ICCSA)
- Not part of original Q

PSU Prank

Apr. 10, 2006, we generated 3 bogus papers using MIT SCIgen software:
- P1 by Ethan Patel
- P2 by Simon R. Hathaway
- P3 by Richard Zhang

[Figures: P1, P2]

PSU Prank

Indiana’s Inauthentic Paper Detector says:
- P1: 28.9% => inauthentic
- P2: 61.5% => authentic
- P3: 38% => inauthentic

PSU Prank

April 24 – May 1, 2006
- Submitted (1) P1 to ICCSA on April 24, (2) P2 to IADC on April 26, and (3) P3 to ICCSA on May 1.

May 15, 2006
- P1 and P2 accepted w/o reviews
- P3 rejected w/o reviews
- Asked for reviews or any rationale => no response so far

“Ethan Patel” made it!

“Richard Zhang” too!

Outline
- Motivation
- Simple Study
- Results
- Summary

Summary
- Practical setting of outlier detection
  - Semantic outlier vs. syntactic outlier
- Developing general semantic outlier detection framework
- Applying to other practical problems
  - Eg, GM counterfeit detection
- Developing general venue ranking framework
  - AppleRank project