Adrienn Szabó, Dávid Siklósi, Jácint Szabó, István Bíró, Zsolt Fekete, Miklós Kurucz, Attila Pereszlényi, Simon Rácz

Web Spam Filtering @ LiWA - Living Web Archives WP 3: Data Cleansing and Noise Filtering IWAW presentation September 19, 2008 Aarhus, Denmark András A. Benczúr Hungarian Academy of Sciences

Web Spam: a Survey with Vision for the Archivist András A. Benczúr, Dávid Siklósi, Jácint Szabó, István Bíró, Zsolt Fekete, Attila Pereszlényi, Simon Rácz, Adrienn Szabó Hungarian Academy of Sciences (MTA SZTAKI) Data Mining and Web Search Group

This talk is about …

Web spam: for (or against) engines

Web Spam vs. E-mail Spam • Web spam is not (necessarily) targeted at the end user, e.g. it improves the Google ranking of a "customer"

• Can be fought more effectively, since • no filter is available for spammers to test against • feedback is slow (crawler finds, visits, gets into the index)

• But very costly if not fought against: 10+% sites, near 20% HTML pages

Distribution of categories, 2004 .de crawl (courtesy T. Suel): Reputable 70.0%, Spam 16.5%, Non-existent 7.9%, Ad 3.7%, Weborg 0.8%, Empty 0.4%, Unknown 0.4%, Alias 0.3%

Spammers’ target is Google … •  High revenue for top SE ranking • Manipulation, “Search Engine Optimization” • Content spam Keywords, popular expressions, mis-spellings • Link spam „Farms”: densely connected sites, redirects •  Maybe indirect revenue • Affiliate programs, Google AdSense • Ad display, traffic funneling

„spam industry had a revenue potential of $4.5 billion in year 2004 if they had been able to completely fool all search engines on all commercially viable queries” [Amitay 2004]

User studies on hit position [Granka, Joachims, Gay 2004] (eye-tracking plots): time elapsed to reach each hit position, and time spent looking at each hit position

All elements of Web IR ranking are spammed • Term frequency (tf in the tf.idf, Okapi BM25 etc. ranking schemes) • Tf weighted by HTML elements: title, headers, font size, face • Heaviest weight in ranking: • URL, domain name part • Anchor text: "best Aarhus page" • URL length, depth from server root • Indegree, PageRank, link-based centrality
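Term frequency can be spammed precisely because it enters these scores more or less linearly. A minimal sketch of plain tf.idf (not the exact BM25 formula; the toy corpus and helper are purely illustrative):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one term in one document with plain tf.idf."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for d in corpus if term in d)            # document frequency
    idf = math.log(1 + len(corpus) / max(df, 1))        # smoothed idf
    return tf * idf

# A stuffed page repeats the target keyword to inflate its tf.
corpus = [
    ["cheap", "camera", "lens", "review"],              # honest page
    ["travel", "aarhus", "denmark"],
    ["cheap"] * 10 + ["camera"] * 10,                   # keyword-stuffed page
]
honest = tf_idf("cheap", corpus[0], corpus)
stuffed = tf_idf("cheap", corpus[2], corpus)
print(stuffed > honest)   # stuffing directly inflates the score
```

Since idf is shared by both pages, the spammer's tenfold repetition multiplies the score tenfold; this is what the HTML-element and anchor-text weightings above compound further.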

Web Spam Taxonomy 1.

Content spam

[Gyöngyi, Garcia-Molina, 2005]

Spammed ranking elements • Domain name adjustableloanmortgagemastersonline.com pay.dahannusaprima.co.uk buy-canon-rebel-20d-lens-case.camerasx.com

• Anchor text (title, H1, etc) free, great deals, cheap, inexpensive, cheap, free

• Meta keywords (anyone still relying on that??) Garbled excerpt from www.atangledweb.co.uk/index02.html … index04.html, a keyword-stuffed product listing: SanDisk Sansa e250 2GB MP3 player, AIGO F820+ 1GB MP3 player, Targus iPod Mini sound enhancer, Sony NWA806FP.CE7 4GB video Walkman …

Keyword stuffing, generated copies
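Keyword stuffing leaves a simple statistical footprint: a few terms dominate the token distribution. A hedged heuristic sketch (this is an illustrative feature, not one from the talk's actual pipeline):

```python
from collections import Counter

def repetition_ratio(text):
    """Fraction of tokens taken up by the single most frequent word.
    High values suggest keyword stuffing (a crude heuristic)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    top_count = Counter(tokens).most_common(1)[0][1]
    return top_count / len(tokens)

normal = "reviews of digital cameras and lenses with sample photos"
stuffed = "cheap cameras cheap lenses cheap deals cheap cheap cheap"
print(repetition_ratio(stuffed) > repetition_ratio(normal))
```

Real content-spam features in the literature are in this spirit but richer: compression ratio, fraction of popular query terms, n-gram likelihood under a language model.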

Google ads

Web Spam Taxonomy 2.

Link spam

Hyperlinks: Good, Bad, Ugly "hyperlink structure contains an enormous amount of latent human annotation that can be extremely valuable for automatically inferring notions of authority." (Chakrabarti et al. '99) • Honest links: genuine human annotation • Links with no recommendation value, e.g. "affiliate programs", navigation, ads … • Deliberate manipulation: link spam

Link farms

WWW

Entry point from honest web: •  Honey pots: copies of quality content •  Dead links to parking domain •  Blog or guestbook comment spam

Link farms Multidomain, Multi-IP
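The boost a farm gives its target can be shown with plain power-iteration PageRank on a toy graph (node names and farm size are illustrative):

```python
def pagerank(links, d=0.85, iters=50):
    """Plain power-iteration PageRank; links maps node -> out-neighbours."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for u, outs in links.items():
            for v in outs:
                new[v] += d * pr[u] / len(outs)
        pr = new
    return pr

# Honest web: a, b, c link around; nobody links to "target".
web = {"a": ["b"], "b": ["c"], "c": ["a"], "target": ["a"]}
base = pagerank(web)["target"]

# Add a farm: many throwaway pages, all pointing at the target.
farm = dict(web)
for i in range(20):
    farm[f"farm{i}"] = ["target"]
boosted = pagerank(farm)["target"]
print(boosted > base)
```

Each farm page only ever receives the teleportation mass (1-d)/N, but twenty of them funneling that mass into one target still multiplies its score; this is why farms are cheap to build and why multi-domain, multi-IP farms evade naive per-site filters.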

Honey pot: quality content copy (diagram): a ring of sites 411amusement.com, 411fashion.com, …, 411zoos.com, each a "411 sites A-Z list" copying quality content, all funneling links to the spam target

PageRank supporter distribution (plots over low to high PageRank): honest site fhh.hamburg.de, ρ=0.61; spam site radiopr.bildflirt.de (part of the www.popdata.de farm), ρ=0.97 [Benczúr, Csalogány, Sarlós, Uher 2005]

Know your neighbor • Honest pages rarely point to spam • Spam cites many, many spam pages 1. Predict spamicity p(v) for all pages 2. For a target page u, form a new feature f(u) by aggregating p(v) over its neighbors v 3. Reclassify after adding the new feature (diagram: known neighbors v1, v2, …, v7 linking to the unknown page u)
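Step 2, the neighbor aggregation, can be sketched in a few lines (names and the choice of averaging are illustrative; other aggregates such as the maximum or an in/out split work too):

```python
def neighbour_feature(page, in_links, spamicity):
    """Average predicted spamicity p(v) of pages linking to `page`."""
    neighbours = in_links.get(page, [])
    if not neighbours:
        return 0.0
    return sum(spamicity[v] for v in neighbours) / len(neighbours)

# First-round classifier scores p(v) for every page:
spamicity = {"v1": 0.9, "v2": 0.8, "v7": 0.1}
in_links = {"u": ["v1", "v2", "v7"]}   # pages pointing at u
f_u = neighbour_feature("u", in_links, spamicity)
print(round(f_u, 2))  # 0.6: a mostly-spam neighbourhood
```

Feeding f(u) back as an extra feature and reclassifying is what propagates label information along the link graph.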

Web Spam Taxonomy 3.

Cloaking and hiding

Formatting • One-pixel image • White text over white background • Color and position set from the stylesheet • … Idea: crawlers do only simplified HTML processing. It is therefore important for crawlers to run rendering and script execution!

Obfuscated JavaScript
var1=100; var3=200; var2=var1+var3;
var4=var1; var5=var4+var3;
if(var2==var5) document.location="http://umlander.info/mega/free software downloads.html";
•  Redirection through window.location
•  eval: spam content (text, link) built from random-looking static data
•  document.write
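A static pre-filter can at least flag scripts that use these redirect and code-generation idioms. A crude regex sketch (illustrative patterns only; as the slide notes, real systems must actually execute the script, since the URL above is assembled at run time):

```python
import re

SUSPICIOUS = [
    r"\beval\s*\(",                      # code built from data
    r"document\.write",                  # injected content
    r"(window|document)\.location\s*=",  # scripted redirect
]

def js_spam_signals(script):
    """Return the suspicious patterns found in a script body."""
    return [pat for pat in SUSPICIOUS if re.search(pat, script)]

script = 'var2=var1+var3; if(var2==var5) document.location="http://example.test/";'
print(len(js_spam_signals(script)) > 0)
```

Such counts are cheap crawl-time features; the definitive check remains rendering the page and observing where it actually navigates.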

HTTP level cloaking • User agent and client host filtering

• Serve different content to users and to GoogleBot • Spammers run a "collaboration service" sharing crawler IPs, user agents and behavior
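Cloaking can be detected by fetching the same URL under a browser user agent and a crawler user agent and comparing the two bodies. A sketch of the comparison step only (the fetching is left out; the threshold and token-set Jaccard measure are illustrative assumptions):

```python
def cloaking_score(body_as_browser, body_as_crawler):
    """Jaccard distance between the token sets of the two responses.
    Near 1.0 suggests cloaking; near 0.0 suggests the same page."""
    a = set(body_as_browser.split())
    b = set(body_as_crawler.split())
    if not (a | b):
        return 0.0
    return 1 - len(a & b) / len(a | b)

same = cloaking_score("buy cheap pills", "buy cheap pills")
cloaked = cloaking_score("buy cheap pills now", "scholarly essay on history")
print(same, cloaked > 0.9)
```

In practice the fetches must come from different IPs as well, since, as the slide notes, spammers share lists of known crawler IPs.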

Web Spam Taxonomy 4.

Spam in social media

New target: blogs, guest books

Fake blogs

Spam hunting • Crawl time? • Machine learning • Manual labeling • Collaboration, effort and knowledge sharing • Benchmarks (WEBSPAM-UK)

No free lunch: no fully automatic filtering •  Manual labels (black AND white lists) primarily determine quality •  Only a tiny fraction can be blacklisted •  Recall: 10% of sites are spam •  Machine learning is needed

•  Models decay quickly Measurement: train on the intersection with the WEBSPAM-UK2006 labels, test on WEBSPAM-UK2007

•  Central to the service: •  Aid manual assessment •  Aid information and label sharing •  Catch spam farms that span different TLDs
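The learning step above can be caricatured with a single-feature threshold classifier trained on labelled hosts (purely illustrative: the feature, data and learner are toy stand-ins, while the challenge systems combined hundreds of content and link features):

```python
def learn_threshold(samples):
    """Pick the feature threshold that best separates labelled hosts.
    samples: list of (feature_value, is_spam) pairs."""
    def accuracy(t):
        return sum((v >= t) == is_spam for v, is_spam in samples) / len(samples)
    return max(sorted(v for v, _ in samples), key=accuracy)

# Hypothetical feature: fraction of anchor text made of spammy keywords.
labelled = [(0.05, False), (0.10, False), (0.20, False),
            (0.60, True), (0.70, True), (0.90, True)]
t = learn_threshold(labelled)
print(all((v >= t) == s for v, s in labelled))
```

The "no free lunch" point survives even in the toy: without the manual labels there is nothing to fit the threshold to, and a threshold fitted on 2006 data need not separate the 2007 hosts.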

Crawl-time vs. post-processing •  Simple filters in the crawler • cannot handle unseen sites • need a large bootstrap crawl

•  Crawl-time feature generation and classification • needs an interface in the crawler to access content • needs a model from a bootstrap or external crawl (which may be smaller) • sounds expensive, but needs to be done only once per site

•  In both cases the hard work is done in post-processing

Architecture (diagram): local storages give access to content; a feature feed (text files) connects them to the assessment interface AND collaboration infrastructure; institutions may share features and extracts; active learning runs across institutions

Collaboration and Assessment Interface •  Automatic operation 1.  Compute features over bootstrap crawl 2.  Classify by settings from central service

•  Assessment and collaboration 1.  Register the domains of the archive in the central service (with feature vectors?) 2.  Label using active learning (local or central classification?) 3.  Share and revise labels, explanations

Managing snapshots

Assessment aid (screenshot): attributes and explanations •  Add yours •  Read others', possibly from another institute

The Web Spam Challenge •  WEBSPAM-UK2006 (UbiCrawler crawl 2006, Yahoo! Research, 2007) •  9,000 Web sites, 500,000 links •  767 spam, 7,472 honest

• WEBSPAM-UK2007 (this year's contest) • 114,000 Web sites, 3 billion links • 222 spam, 3,776 honest • 3 TB of full uncompressed data

• Future challenges? For archival needs? • Time snapshots, page history features

Questions? András A. Benczúr datamining.sztaki.hu/ [email protected]
