Adrienn Szabo, David Siklosi, Jacint Szabo, Istvan Biro, Zsolt Fekete, Miklos Kurucz, Attila Pereszlényi, Simon Racz
Web Spam Filtering @ LiWA - Living Web Archives
WP 3: Data Cleansing and Noise Filtering
IWAW presentation, September 19, 2008, Aarhus, Denmark
András A. Benczúr, Hungarian Academy of Sciences

Web Spam: a Survey with Vision for the Archivist
András A. Benczúr, Dávid Siklósi, Jácint Szabó, István Bíró, Zsolt Fekete, Attila Pereszlényi, Simon Rácz, Adrienn Szabó
Hungarian Academy of Sciences (MTA SZTAKI), Data Mining and Web Search Group
This talk is about …
Web spam: for (or against) engines
Web Spam vs. E-mail Spam
• Web spam is not (necessarily) targeted against the end user
  • E.g. improve the Google ranking for a "customer"
• More effectively fought against, since
  • No filter is available for the spammer to test against
  • Feedback is slow (crawler finds, visits, gets into index)
• But very costly if not fought against: 10+% of sites, near 20% of HTML pages
Distribution of categories, 2004 .de crawl (courtesy: T. Suel)
• Reputable 70.0%
• Spam 16.5%
• Non-existent 7.9%
• Ad 3.7%
• Weborg 0.8%
• Unknown 0.4%
• Empty 0.4%
• Alias 0.3%
Spammers' target is Google …
• High revenue for top SE ranking
  • Manipulation, "Search Engine Optimization"
  • Content spam: keywords, popular expressions, misspellings
  • Link spam: "farms" (densely connected sites), redirects
• Maybe indirect revenue
  • Affiliate programs, Google AdSense
  • Ad display, traffic funneling

"spam industry had a revenue potential of $4.5 billion in year 2004 if they had been able to completely fool all search engines on all commercially viable queries" [Amitay 2004]
User studies on hit position reveal [Granka, Joachims, Gay 2004]
[Figure: time elapsed to reach each hit position, and time spent looking at each hit position]
All elements of Web IR ranking spammed
• Term frequency (tf in the tf.idf, Okapi BM25 etc. ranking schemes)
• Tf weighted by HTML elements: title, headers, font size, face
• Heaviest weight in ranking:
  • URL, domain name part
  • Anchor text: best Aarhus page
• URL length, depth from server root
• Indegree, PageRank, link based centrality
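To see why term frequency is such an easy target, the sketch below compares a raw tf count with the saturating tf component of Okapi BM25. The parameter values k1=1.2 and b=0.75 are common defaults assumed here, not values from the talk, the idf factor is left out, and the two example pages are hypothetical.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of Okapi BM25 (idf factor omitted)."""
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)

# Hypothetical honest page: the query term occurs 3 times in 300 words.
honest = bm25_tf(tf=3, doc_len=300, avg_doc_len=300)
# Hypothetical stuffed page: the same term occurs 60 times in 300 words.
stuffed = bm25_tf(tf=60, doc_len=300, avg_doc_len=300)
```

Raw tf grows twentyfold between the two pages, while the BM25 component grows by less than a factor of 1.5 and can never exceed k1 + 1. Saturation limits the gain from stuffing but does not remove the incentive.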
Web Spam Taxonomy 1.
Content spam
[Gyöngyi, Garcia-Molina, 2005]
Spammed ranking elements
• Domain name
  adjustableloanmortgagemastersonline.com
  pay.dahannusaprima.co.uk
  buy-canon-rebel-20d-lens-case.camerasx.com
• Anchor text (title, H1, etc.)
  free, great deals, cheap, inexpensive, cheap, free
• Meta keywords (anyone still relying on that??) ->... SanDisk Sansa e250 - 2GB MP3 Pla www.atangledweb.co.uk/index02.html">->... AIGO F820+ 1GB Beach inspired M www.atangledweb.co.uk/index03.html">->... Targus I-Pod Mini Sound Enhancer index04.html">->... Sony NWA806FP.CE7 4GB video WALKMAN ... S
Keyword stuffing, generated copies
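One simple signal for keyword stuffing and generated copies is redundancy: heavily repeated text compresses far better than natural prose. The sketch below is an illustrative feature, not one named in the talk; the function name and both sample strings are hypothetical.

```python
import zlib

def compression_ratio(text):
    """Uncompressed size divided by zlib-compressed size (higher = more redundant)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))

# Hypothetical stuffed page body: the same spam phrase repeated 50 times.
stuffed = "cheap free great deals inexpensive " * 50

# Hypothetical natural prose of ordinary redundancy.
natural = (
    "The archive crawls national domains twice a year and stores each "
    "snapshot together with the metadata needed to replay it later. "
    "Curators review a sample of hosts before the data is published."
)
```

A classifier would use such a ratio as one feature among many; on its own it also flags legitimate but repetitive pages such as long tables or listings.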
Google ads
Web Spam Taxonomy 2.
Link spam
Hyperlinks: Good, Bad, Ugly
"hyperlink structure contains an enormous amount of latent human annotation that can be extremely valuable for automatically inferring notions of authority." (Chakrabarti et al. '99)
• Honest link, human annotation
• No value of recommendation, e.g. "affiliate programs", navigation, ads …
• Deliberate manipulation, link spam
Link farms
WWW
Entry point from honest web:
• Honey pots: copies of quality content
• Dead links to parking domain
• Blog or guestbook comment spam
Link farms: multidomain, multi-IP
[Figure: link farm example. Honey pot (quality content copy); 411amusement.com, 411fashion.com and 411zoos.com, each with a "411 sites A-Z list"; target]
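The effect of a link farm on link-based ranking can be illustrated with a plain power-iteration PageRank. This is a minimal sketch on a toy graph: the node names, the farm size of 20 and the damping factor 0.85 are all assumptions made for illustration, not data from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; links maps each node to the nodes it points to."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no outlinks) is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in links.items():
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        rank = new
    return rank

# Toy honest web: three pages in a cycle, plus a target page nobody links to.
honest_web = {"a": ["b"], "b": ["c"], "c": ["a"], "target": ["a"]}
# Hypothetical farm: 20 pages whose only purpose is to link to the target.
farm = {f"farm{i}": ["target"] for i in range(20)}

without_farm = pagerank(honest_web)
with_farm = pagerank({**honest_web, **farm})
```

In this toy graph the target's score rises from 0.0375 to 0.1125 once the farm is added, although the total score of 1 is now shared among six times as many pages.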
PageRank supporter distribution [Benczúr, Csalogány, Sarlós, Uher 2005]
[Figure: two panels with axes running from low to high PageRank, annotated ρ=0.61 and ρ=0.97. Honest: fhh.hamburg.de. Spam: radiopr.bildflirt.de (part of the www.popdata.de farm).]
Know your neighbor
• Honest pages rarely point to spam
• Spam cites many, many spam
1. Predicted spamicity p(v) for all pages
2. Target page u, new feature f(u) by neighbor p(v) aggregation
3. Reclassification by adding the new feature
[Figure: target page u with unknown label (?) and neighbors v1, v2, …, v7]
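Step 2 above can be sketched as follows. The talk does not say which neighbors are used or how p(v) is aggregated; taking the mean over the pages that u links to is an assumption made here, and the function name and example scores are hypothetical.

```python
def neighbor_feature(u, out_links, spamicity):
    """Mean predicted spamicity p(v) over the pages v that u links to."""
    neighbors = out_links.get(u, [])
    scores = [spamicity[v] for v in neighbors if v in spamicity]
    if not scores:
        return None  # no scored neighbors to aggregate
    return sum(scores) / len(scores)

# Hypothetical first-round predictions p(v) for the neighbors of u.
spamicity = {"v1": 0.9, "v2": 0.8, "v7": 0.1}
out_links = {"u": ["v1", "v2", "v7"]}

f_u = neighbor_feature("u", out_links, spamicity)
```

The value f(u) is then appended to the feature vector of u before the classifier is run a second time (step 3). Other aggregates, such as the maximum or a PageRank-weighted mean, or the use of in-neighbors, would fit the same pattern.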
Web Spam Taxonomy 3.
Cloaking and hiding
Formatting
• One-pixel image
• White over white
• Color, position from stylesheet
• …
Idea: crawlers do simplified HTML processing
Important for crawlers to run rendering and script execution!
Obfuscated JavaScript
  var1=100;var3=200;var2=var1 + var3;
  var4=var1;var5=var4 + var3;
  if(var2==var5) document.location="http:// umlander.info/ mega/free software downloads.html";
• Redirection through window.location
• eval: spam content (text, link) from random looking static data
• document.write
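In the script above both var2 and var5 evaluate to 300, so the condition always holds and the redirect is unconditional. A crawler that does not execute scripts can at least flag the constructs listed on the slide. The sketch below is a static first pass under that assumption; the pattern names and the function are hypothetical.

```python
import re

# Script constructs named on the slide as carriers of spam redirects or content.
SUSPICIOUS_PATTERNS = {
    "location_redirect": re.compile(r"\b(?:document|window)\.location\b"),
    "eval": re.compile(r"\beval\s*\("),
    "document_write": re.compile(r"\bdocument\.write\s*\("),
}

def script_flags(script_source):
    """Return the sorted names of suspicious constructs found in a script."""
    return sorted(
        name
        for name, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(script_source)
    )

# Example script modeled on the slide; the URL is a placeholder.
sample = 'var1=100;if(var2==var5) document.location="http://example.invalid/";'
flags = script_flags(sample)
```

Static matching is easy to evade, for example by assembling the property name from string fragments, which is why the talk stresses rendering and script execution in the crawler.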
HTTP level cloaking
• User agent, client host filtering
• Different for users and for GoogleBot
• "Collaboration service" of spammers for crawler IPs, agents and behavior
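User-agent cloaking can be probed by fetching the same URL under two identities and comparing the results. The sketch below assumes a caller-supplied fetch(url, user_agent) function returning page text; that function, the user-agent strings, the 3-word shingles and the 0.5 threshold are all illustrative assumptions.

```python
def shingles(text, k=3):
    """Set of k-word shingles of a text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def looks_cloaked(fetch, url, threshold=0.5):
    """True if the page served to a browser differs strongly from the crawler's copy."""
    browser_copy = fetch(url, "Mozilla/5.0")
    crawler_copy = fetch(url, "Googlebot/2.1")
    return jaccard(shingles(browser_copy), shingles(crawler_copy)) < threshold

# Hypothetical stand-in for a real HTTP client, used only to show the call.
def fake_fetch(url, user_agent):
    if "Googlebot" in user_agent:
        return "buy cheap pills online now best price guaranteed"
    return "welcome to our family photo album page"

cloaked = looks_cloaked(fake_fetch, "http://example.invalid/")
```

This only covers the user-agent case. Filtering by client host needs fetches from different IP ranges, and dynamic pages can differ between two fetches for legitimate reasons, so a low similarity is a hint rather than proof.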
Web Spam Taxonomy 4.
Spam in social media
New target: blogs, guest books
Fake blogs
Spam hunting
• Crawl time?
• Machine learning
• Manual labeling
• Collaboration, effort and knowledge sharing
• Benchmarks (WEBSPAM-UK)

No free lunch: no fully automatic filtering
• Manual labels (black AND white lists) primarily determine quality
• Can blacklist only a tiny fraction
  • Recall that 10% of sites are spam
• Needs machine learning
  • Models quickly decay
    Measurement: training on intersection with WEBSPAM-UK2006 labels, test on WEBSPAM-UK2007
• Central to the service:
  • Aid manual assessment
  • Aid information and label sharing
  • Catch spam farms that span different TLDs
Crawl-time vs. post-processing
• Simple filters in crawler
  • Cannot handle unseen sites
  • Need a large bootstrap crawl
• Crawl-time feature generation and classification
  • Needs an interface in the crawler to access content
  • Needs a model from a bootstrap or external crawl (may be smaller)
  • Sounds expensive but needs to be done only once per site
• The hard work is done in post-processing in both cases
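The "only once per site" point can be sketched as a per-host verdict cache in front of the classifier. The class below is a hypothetical illustration, not the LiWA crawler interface; classify_site stands for whatever model comes from the bootstrap or external crawl.

```python
from urllib.parse import urlsplit

class SiteVerdictCache:
    """Runs the (expensive) site classifier at most once per host."""

    def __init__(self, classify_site):
        self._classify_site = classify_site  # callable: host -> bool
        self._verdicts = {}

    def is_spam(self, url):
        host = urlsplit(url).hostname or ""
        if host not in self._verdicts:
            self._verdicts[host] = self._classify_site(host)
        return self._verdicts[host]

# Hypothetical classifier that records how often it is called.
calls = []

def fake_classifier(host):
    calls.append(host)
    return host.endswith(".spam.invalid")

cache = SiteVerdictCache(fake_classifier)
first = cache.is_spam("http://a.spam.invalid/page1")
second = cache.is_spam("http://a.spam.invalid/page2")
```

Both lookups return the same verdict while the classifier runs only once, so the per-page cost at crawl time is a dictionary lookup.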
Architecture
[Figure: local storages and access; feature feed as text files; assessment interface AND collaboration infrastructure (interaction: may share features and extracts); active learning across institutions]
Collaboration and Assessment Interface
• Automatic operation
  1. Compute features over bootstrap crawl
  2. Classify by settings from central service
• Assessment and collaboration
  1. Register the domains of the archive in the central service (with feature vectors?)
  2. Label using active learning (local or central classification?)
  3. Share and revise labels, explanations
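The active learning step can be sketched with uncertainty sampling: assessors are shown the domains whose predicted spamicity is closest to 0.5, where a manual label helps the model most. The talk does not name a selection strategy, so this choice, the function name and the example scores are assumptions.

```python
def pick_for_labeling(spamicity, labeled, batch_size=3):
    """Unlabeled domains whose predicted spamicity is closest to 0.5."""
    candidates = [d for d in spamicity if d not in labeled]
    # Sort by distance from 0.5; ties are broken by domain name for stability.
    candidates.sort(key=lambda d: (abs(spamicity[d] - 0.5), d))
    return candidates[:batch_size]

# Hypothetical classifier scores for domains registered by an archive.
scores = {
    "a.example": 0.02,
    "b.example": 0.48,
    "c.example": 0.95,
    "d.example": 0.55,
    "e.example": 0.50,
}
already_labeled = {"e.example"}

batch = pick_for_labeling(scores, already_labeled, batch_size=2)
```

After the assessors label the batch, the model is retrained and the selection is repeated; whether the scoring runs locally or in the central service is the open question raised on the slide.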
Managing snapshots
Attributes
Explanations
• Add yours
• Read others', maybe from another institute
Assessment aid
The Web Spam Challenge
• WEBSPAM-UK2006 (UbiCrawler crawl 2006, Yahoo Research, 2007)
  • 9000 Web sites, 500,000 links
  • 767 spam, 7472 honest
• WEBSPAM-UK2007 (this year's contest)
  • 114,000 Web sites, 3 billion links
  • 222 spam, 3776 honest
  • 3 TByte full uncompressed data
• Future challenges? For archival needs?
  • Time snapshots, page history features
Questions? András A. Benczúr datamining.sztaki.hu/
[email protected]