Oracle, Where Shall I Submit My Precious Papers? IST Faculty Brown Bag Sep. 22, 2006 Dongwon Lee
Credits
- Students
  - Ergin Elmacioglu (CSE, Penn State)
  - Su Yan (IST, Penn State)
  - Ziming Zhuang (IST, Penn State)
- Colleagues
  - Lee Giles (Penn State)
  - Min-Yen Kan (NUS, Singapore)
  - Jaewoo Kang (Korea U., Korea)
  - Divesh Srivastava (AT&T Labs – Research)
What do I do?
Databases / Data Mining
Digital Libraries / Info. Retrieval
XML / Web
What projects do I do?
Today’s Talk
Databases / Data Mining
Digital Libraries / Info. Retrieval
Microsoft SciData 2005
NSF OISE 2006
XML / Web
IBM Eclipse, 2004 & 2006
Penn State eBRC, 2005
Outline
- Motivation
- Simple Study
- Results
- Summary
MIT’s Prank
http://pdos.csail.mit.edu/scigen/
The World Multi-Conference on Systemics, Cybernetics and Informatics (SCI)
Annoyance…
“Dong-Won Lee” as PC?
WMSCI 2006
Some Known Questionable Venues
- From http://www.inesc-id.pt/~aml/trash.html:
  - IMCSE: International Multiconference in Computer Science and Computer Engineering
  - WMSCI or SCI: World Multiconference on Systemics, Cybernetics and Informatics
  - ICCCT: International Conference on Computing, Communications and Control Technologies
  - PISTA: Conference on Politics and Information Systems: Technologies and Applications
  - SSCCII: Symposium of Santa Caterina on Challenges in the Internet and Interdisciplinary Research
  - CITSA: International Conference on Cybernetics and Information Technologies, Systems and Applications
  - ISAS: International Conference on Information Systems Analysis and Synthesis
  - CISCI: Conferencia Iberoamericana en Sistemas, Cibernética e Informática
  - SIECI: Simposium Iberoamericano de Educación, Cibernética e Informática
  - WCAC: World Congress in Applied Computing
  - Any IPSI International Conference or journal
  - Any GESTS international conference or journal
  - KCPR: International Conference on Knowledge Communication and Peer Reviewing
  - International e-Conference on Computer Science
  - …
http://fakeconferences.org => down because of a threat
Fakes Everywhere
Microsoft HoneyMonkey
Fake Venues
- According to fakeconferences.org,
  - “… fake venues are ones that are organized for the revenue, not for the advancement of science…They share a lot in common…an abundance of varying, vaguely connected topics, high frequency of conference, spam mailings, obscure organizers and sponsors, and poor peer reviewing and randomly accepting papers …”
- WMSCI has listed close to 300 research topics as relevant in its Call for Papers (CFP), and reportedly accepted 2,165 and 2,904 papers in 2003 and 2004, respectively
Differences in Disciplines
- Computer Science
  - Peer-reviewed conferences
  - Top conferences have 5-15% acceptance rate
  - Specialized and small conferences (attendance of 500+)
  - Often value conferences > journals
- Pure Sciences (eg, Math, Physics)
  - Pre-print at arXiv.org
  - Rigorous reviews for journals
  - Huge flagship conference (ICM 98 attracted ~4000)
- Social Sciences
  - Often value journals > conferences
  - Conferences are mostly for gathering or short-abstract-based screening
  - Rigorous reviews for journals
Outline
- Motivation
- Simple Study
- Results
- Summary
Research Question
Can we detect the so-called “fake venues” automatically?
- Desiderata
  - Large number of venues per year => scalable
  - Automatic detection => no human involvement
  - False positives >> false negatives
[Figure: Histogram of CFPs in dbworld; x-axis: year, 1999 to 2006; y-axis: count, 0 to 800]
Candidate Features
- Good vs. bad venues
  - Citation counting (eg, Impact Factor)
  - Acceptance rate
  - Reputation (eg, society)
  - History
  - …
- In the end, none of these satisfies our desiderata. We need something else…
Research Hypothesis
Qualities of venues are closely correlated with those of the PC members of the venues
- The PC member list is readily available from the CFP => data extraction + data cleaning
- Each CFP has only a finite number of PC members => scalability
- Examine the quality of the PC w.r.t. heuristics:
  - Citation counting, productivity, centrality, betweenness, impact, …
Data Mining Models
- Outlier detection
- Clustering
- Classification
[Figure: a model built from a training set is asked whether a new venue is “Fake ?”]
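Classification w. Decision Tree
[Figure: a decision tree built from a training set. Internal nodes ask “PC has feature A?”, “PC has feature B?”, “PC has feature C?”, and “PC has feature D?” with Yes/No branches; leaves are “Regular venue” or “Fake venue”. A new venue is passed through the tree to answer “Fake ?”]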
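As a rough illustration of learning a yes/no test from a training set, here is a minimal sketch of a one-level tree (a decision stump) on a single numeric PC feature. It is a simplification, not C4.5: it picks the threshold with the fewest training errors instead of using gain ratio. The feature values and labels are invented toy numbers, not data from this study, and `train_stump` and `is_fake` are hypothetical names.

```python
from typing import List, Tuple


def train_stump(values: List[float], labels: List[bool]) -> Tuple[float, bool]:
    """Learn one threshold test on one feature.

    labels[i] is True when venue i is fake. Returns (threshold, fake_if_below):
    a venue is predicted fake when (value < threshold) == fake_if_below.
    """
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct feature values")
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        for fake_if_below in (True, False):
            errors = sum(
                ((v < threshold) == fake_if_below) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, fake_if_below)
    return best[1], best[2]


def is_fake(value: float, threshold: float, fake_if_below: bool) -> bool:
    """Apply the learned test to a new venue's feature value."""
    return (value < threshold) == fake_if_below


# Invented toy training set: one numeric PC feature per venue, plus a label.
toy_values = [30.0, 25.0, 28.0, 2.0, 3.5, 1.0]
toy_labels = [False, False, False, True, True, True]
threshold, fake_if_below = train_stump(toy_values, toy_labels)
```

A full tree would repeat this split recursively on the remaining features, which is what the A/B/C/D nodes in the figure stand for.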
5 Classification Features
- # of PC
- # of publications of PC
- # of co-authors of PC
- Closeness centrality of PC
- Betweenness centrality of PC
- …
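The first three features can be sketched as simple aggregates over a venue's PC list. This is only an illustration: the slides do not say how per-member values are combined, so averaging over PC members is an assumption here, as are the function name `pc_features`, the dictionary layout, and the toy data. PC members missing from the bibliographic data are counted as having zero publications and zero co-authors, which is also an assumption.

```python
from statistics import mean
from typing import Dict, List, Set


def pc_features(
    pc: List[str],
    pub_counts: Dict[str, int],
    coauthors: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Aggregate per-member statistics into venue-level features."""
    if not pc:
        raise ValueError("PC list must not be empty")
    return {
        "num_pc": len(pc),
        # Assumption: unknown members contribute 0 publications.
        "avg_pubs": mean(pub_counts.get(member, 0) for member in pc),
        # Assumption: unknown members contribute 0 co-authors.
        "avg_coauthors": mean(len(coauthors.get(member, set())) for member in pc),
    }


# Invented toy data: three PC members, one of them absent from the library.
toy_pc = ["a", "b", "c"]
toy_pubs = {"a": 10, "b": 4}
toy_coauthors = {"a": {"b", "x"}, "b": {"a"}}
features = pc_features(toy_pc, toy_pubs, toy_coauthors)
```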
Set-Up
- ACM DL: downloaded data of 1950-2004
  - 0.6M authors, 0.7M articles
  - 1.2M edges (ie, collaboration)
- Dbworld: 2,979 CFPs (free text formats)
  - 16,147 distinct PC names
  - Hand-selected 20 fake venues => Q
- Laborious cleaning process for venue, PC names, and citations (Another Talk):
  - Entity resolution
  - Name disambiguation
  - Record linkage
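To give a flavor of the name cleaning step, here is a minimal sketch of a normalization key that lets variants such as “Dong-Won Lee” and “Dongwon Lee” match when linking PC names to author records. It is a crude illustration only, not the cleaning pipeline used in this study: the function name `name_key` is hypothetical, and a key like this will also merge different people who share a name, which is exactly the disambiguation problem the slide mentions.

```python
import re
import unicodedata


def name_key(name: str) -> str:
    """Build a rough matching key for a person name."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop hyphens so "Dong-Won" and "Dongwon" produce the same token.
    collapsed = unaccented.replace("-", "")
    # Keep lowercase alphabetic tokens; this discards periods and extra spaces.
    tokens = re.findall(r"[a-z]+", collapsed.lower())
    return " ".join(tokens)


key_hyphenated = name_key("Dong-Won Lee")
key_plain = name_key("Dongwon  Lee")
```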
Outline
- Motivation
- Simple Study
- Results
- Summary
# of PC
[Figure: x-axis: number of PC, 10 to 410; left y-axis: fraction of conferences, 0% to 35%; right y-axis: probability of Q, 0 to 1]
# of publication of PC
[Figure: x-axis: 0.0 to 52.5; y-axis: Percent, 0 to 80; series labeled “LQC Q” and “C”]
Closeness centrality of PC
[Figure: x-axis: average closeness, 0.005 to 0.095; left y-axis: fraction of conferences, 0% to 14%; right y-axis: probability of LQC Q, 0 to 0.3]
$$CC(v) = \frac{n - 1}{\sum_{w \in G} d(v, w)}, \qquad v, w \in G$$
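The closeness formula can be sketched with a breadth-first search over an unweighted collaboration graph. This is a minimal illustration under stated assumptions: n is taken to be the number of nodes in the graph, d(v, w) the shortest-path length in hops, and nodes that cannot reach every other node get 0.0, a convention chosen here because the formula does not cover disconnected graphs. The adjacency-dictionary layout and the toy graph are invented.

```python
from collections import deque
from typing import Dict, Set


def closeness(graph: Dict[str, Set[str]], v: str) -> float:
    """Closeness of v: (n - 1) divided by the sum of hop distances from v."""
    dist = {v: 0}
    queue = deque([v])
    # Breadth-first search yields shortest hop counts in an unweighted graph.
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    n = len(graph)
    # Assumption: undefined cases (single node, unreachable nodes) score 0.0.
    if n < 2 or len(dist) < n:
        return 0.0
    return (n - 1) / sum(dist.values())


# Invented toy collaboration graph: a path a - b - c.
toy_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
center_score = closeness(toy_graph, "b")
end_score = closeness(toy_graph, "a")
```

On the toy path, the middle node is one hop from both others, so it scores higher than either end node.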
Combining All Features
- Naïve (C4.5)
  - Precision: 0.877
  - Recall: 0.965
- Bagging
  - Precision: 0.899
  - Recall: 0.979
- Boosting
  - Precision: 0.938
  - Recall: 0.964
[Figure: decision tree with nodes “PC has feature A?”, “PC has feature B?”, “PC has feature C?”, “PC has feature D?”]
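For reference, precision and recall can be computed from the set of venues a classifier flags and the set that is actually fake. The sketch below uses the standard definitions; the slides do not describe the evaluation protocol, so nothing here reproduces the reported numbers, and the venue identifiers are invented.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision = TP / |predicted|, recall = TP / |actual|."""
    true_positives = len(predicted & actual)
    # Convention chosen here: an empty denominator yields 0.0.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Invented toy example: four venues flagged, four truly fake, three in common.
flagged = {"v1", "v2", "v3", "v4"}
truly_fake = {"v1", "v2", "v3", "v5"}
precision, recall = precision_recall(flagged, truly_fake)
```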
More than “usual suspects”
- Classification detected two:
  - The 2nd International Advanced Database Conference (IADC)
  - The 4th International Conference on Computer Science and its Applications (ICCSA)
- Not part of original Q
PSU Prank
- On Apr. 10, 2006, we generated 3 bogus papers using MIT SCIgen software:
  - P1 by Ethan Patel
  - P2 by Simon R. Hathaway
  - P3 by Richard Zhang
[Figure: images labeled P1 and P2]
PSU Prank
- Indiana’s Inauthentic Paper Detector says:
  - P1: 28.9% => inauthentic
  - P2: 61.5% => authentic
  - P3: 38% => inauthentic
PSU Prank
- April 24 – May 1, 2006: submitted
  - (1) P1 to ICCSA on April 24,
  - (2) P2 to IADC on April 26, and
  - (3) P3 to ICCSA on May 1.
- May 15, 2006
  - P1 and P2 accepted w/o reviews
  - P3 rejected w/o reviews
  - Asked for reviews or any rationale => no response so far
“Ethan Patel” made it!
“Richard Zhang” too!
Outline
- Motivation
- Simple Study
- Results
- Summary
Summary
- Practical setting of outlier detection
  - Semantic outlier vs. syntactic outlier
- Developing general semantic outlier detection framework
- Applying to other practical problems
  - Eg, GM counterfeit detection
- Developing general venue ranking framework
  - AppleRank project