5/12/2014
Lecture 12
Text & Web Mining
Data Mining – Computer Science, IPB
Structured data
• So far we have dealt with structured data, for example:

Attribute | Value
Outlook | Sunny
Temperature | Hot
Windy | Yes
Humidity | High
Play | Yes

• Data mining generally works with data of this kind
Complex Data Types
• Complex data is becoming increasingly common
• Spatial data: geographic data, medical & satellite images
• Multimedia data: images, audio, & video
• Time-series data: banking data & stock exchange data
• Text data: word descriptions for objects
• World-Wide-Web: highly unstructured text & multimedia data
Text Databases
• In practice there are many text databases:
  • news articles
  • research papers
  • books
  • digital libraries
  • e-mail
  • web pages
• Growing rapidly in both volume and importance (80%)
Text Mining
• Text mining refers to data mining that uses text documents as its data
• Almost all text mining tasks use Information Retrieval (IR) methods to preprocess the text documents
• These methods differ somewhat from the data preprocessing methods used for relational tables
• Web search also has its roots in IR
CS583, Bing Liu, UIC
Definition of Text Mining
• Discover useful and previously unknown “gems” of information in large text collections
Definition of Text Mining
Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.
Text Mining = Data Mining (applied to text data) + basic linguistics
Definition
• “previously unknown”?
• Strict definition
  • Information that even the author does not know
  • Example: discovering a new method for hair growth that is a side effect of some procedure
• Loose definition
  • Rediscovering information that the author has already written in the text
  • Example: automatically extracting product names from a web page
Text Mining Tasks
• Given:
  • A source of textual documents
  • A well-defined, limited (text-based) query
• Find:
  • Sentences with relevant information
  • Extract the relevant information & ignore the irrelevant information
  • Link related information & output it in a predetermined format
Tasks addressed by TM • Search and retrieval • Semantic analysis • Clustering • Categorization • Feature extraction • Ontology building • Dynamic focusing
DM vs TM

Aspect | Data Mining | Text Mining
Object of investigation | Numerical and categorical data | Texts
Object structure | Relational databases | Free form texts
Goal | Predict outcomes of future situations | Retrieve relevant information, distill the meaning, categorize and target-deliver
Methods | Machine learning: SKAT, DT, NN, GA, MBR, MBA | Indexing, special neural network processing, linguistics, ontologies
Current market size | 100,000 analysts at large and midsize companies | 100,000,000 corporate workers and individual users
Maturity | Broad implementation since 1994 | Broad implementation starting 2000
“Search” vs “Discover”

 | Search (goal-oriented) | Discover (opportunistic)
Structured Data | Data Retrieval | Data Mining
Unstructured Data (Text) | Information Retrieval | Text Mining
Applications of Text Mining
• Marketing: discover groups of potential buyers based on users' text profiles
  • e.g., amazon
• Industry: identify the websites of competitor groups
  • Competitors' products and their prices
• Job search: identify parameters in a job search
  • www.flipdog.com
Applications of Text Mining
• Search engines
• Enterprise portals
• Knowledge management systems
• e-Business systems
• Vertical applications:
  • e-mail categorization and routing
  • Call center notes categorization
  • CRM systems
[Figure: IR system components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, INDEX, Text Database.]

Search Subsystem
[Figure: query → parse query → query tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → Boolean operations* (against the inverted file system) → retrieved document set → ranking* → ranked document set; relevant document set. *Indicates optional operation.]
Indexing Subsystem
[Figure: documents → assign document IDs → text, document numbers and *field numbers → break into tokens → tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → term weighting* → terms with weights → inverted file system. *Indicates optional operation.]
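The indexing and search steps can be sketched in a few lines of Python. This is a minimal illustration, not the system in the figure: the stop list, the sample documents, and the function names are assumptions, stemming is omitted, and the raw term frequency stands in for term weighting.

```python
import re
from collections import defaultdict

# Hypothetical stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each non-stoplist term to {document ID: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents, start=1):  # assign document IDs
        for token in tokenize(text):
            if token in STOP_WORDS:  # optional stop-list step
                continue
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def boolean_and(index, query):
    """Return IDs of documents that contain every non-stoplist query term."""
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    if not terms:
        return set()
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings)

docs = [
    "The game in London was horrific",
    "Rome is taking the lead in the football league",
    "Italian football fans were cheering",
]
index = build_inverted_index(docs)
result = boolean_and(index, "football fans")
```

Here `result` contains only document 3, the one document that holds both query terms.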
Text Mining
[Figure: text mining workflow. Labels: Sample Documents, Text document, Transformed, Representation models, Learning, Working, Domain specific templates/models, Visualizations.]
Text characteristics: Outline • Large textual data base • High dimensionality • Several input modes • Dependency • Ambiguity • Noisy data • Not well structured text
Text characteristics • Large textual data base • Efficiency consideration • over 2,000,000,000 web pages • almost all publications are also in electronic form
• High dimensionality (Sparse input) • Consider each word/phrase as a dimension • Several input modes • e.g., Web mining: information about user is generated by semantics, browse pattern and outside knowledgebase.
Text characteristics • Dependency • relevant information is a complex conjunction of words/phrases • e.g., Document categorization.
Pronoun disambiguation.
• Ambiguity • Word ambiguity • Pronouns (he, she …) • “buy”, “purchase”
• Semantic ambiguity • The king saw the rabbit with his glasses. (8 meanings)
Text characteristics • Noisy data • Example: Spelling mistakes
• Not well structured text • Chat rooms • “r u available ?” • “Hey whazzzzzz up”
• Speech
Text mining process
• Text preprocessing
  • Syntactic/Semantic text analysis
• Features Generation
  • Bag of words
• Features Selection
  • Simple counting
  • Statistics
• Text/Data Mining
  • Classification: supervised learning
  • Clustering: unsupervised learning
• Analyzing results
Syntactic / Semantic text analysis • Part of Speech (pos) tagging • Find the corresponding pos for each word
e.g., John (noun) gave (verb) the (det) ball (noun) • ~98% accurate.
• Word sense disambiguation • Context based or proximity based • Very accurate
• Parsing • Generates a parse tree (graph) for each sentence • Each sentence is a stand alone graph
Feature Generation: Bag of words
• Text document is represented by the words it contains (and their occurrences)
  • e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
  • Highly efficient
  • Makes learning far simpler and easier
  • Order of words is not that important for certain applications
• Stemming: identifies a word by its root
  • e.g., flying, flew → fly
  • Reduce dimensionality
• Stop words: The most common words are unlikely to help text mining
  • e.g., “the”, “a”, “an”, “you” …
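A small Python sketch of these three ideas follows. The stop list and the stemmer are toy assumptions: the stemmer uses a hand-made lookup table for irregular forms plus crude suffix stripping, not an established algorithm such as Porter's.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "you", "of"}  # hypothetical, tiny stop list

# Toy stemmer: lookup table for irregular forms plus crude suffix stripping.
IRREGULAR = {"flew": "fly", "flown": "fly"}
SUFFIXES = ("ing", "s")

def stem(word):
    """Reduce a word to a rough root form."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # keep at least three characters of the root
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text, remove_stop_words=True, use_stemming=True):
    """Represent a text by its words and their occurrence counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        tokens = [stem(t) for t in tokens]
    return Counter(tokens)

raw = bag_of_words("Lord of the rings", remove_stop_words=False, use_stemming=False)
bag = bag_of_words("The birds flew; a bird is flying")
```

With stop words removed and stemming applied, "birds" and "bird" collapse into one feature, as do "flew" and "flying", which is how these steps reduce dimensionality.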
Feature Generation: D2K Example

Original message:
Hi, Here is your weekly update (that unfortunately hasn't gone out in about a month). Not much action here right now. 1) Due to the unwavering insistence of a member of the group, the ncsa.d2k.modules.core.datatype package is now completely independent of the d2k application. 2) Transformations are now handled differently in Tables. Previously, transformations were done using a TransformationModule. That module could then be added to a list that an ExampleTable kept. Now, there is an interface called Transformation and a sub-interface called ReversibleTransformation.

After lowercasing and stop-word removal:
hi, weekly update (that unfortunately gone out month). much action here right now. 1) due unwavering insistence member group, ncsa.d2k.modules.core.datatype package now completely independent d2k application. 2) transformations now handled differently tables. previously, transformations done using transformationmodule. module added list exampletable kept. now, interface called transformation sub-interface called reversibletransformation.

After stemming and punctuation removal:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules core datatype package now complete independence d2k application 2 transformation now handle different table previous transformation do use transformationmodule module add list exampletable keep now interface call transformation sub-interface call reversibletransformation
Feature Generation: XML
• Current keyword-oriented search engines cannot handle rich queries like
  • Find all books authored by “Scooby-Doo”.
• XML: Extensible Markup Language
  • XML documents have a nested structure in which each element is associated with a tag.
  • Tags describe the semantics of elements.
[XML example; element values: The making of a bad movie, Scooby-Doo, Cartoons]
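To illustrate why tags make the "books authored by Scooby-Doo" query answerable, here is a hedged Python sketch using the standard library's `xml.etree.ElementTree`. The tag names (`catalog`, `book`, `title`, `author`, `category`) and the second record are assumptions made for illustration, since the slide's markup is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names and the second book are assumed.
CATALOG = """
<catalog>
  <book>
    <title>The making of a bad movie</title>
    <author>Scooby-Doo</author>
    <category>Cartoons</category>
  </book>
  <book>
    <title>Another title</title>
    <author>Someone Else</author>
    <category>Drama</category>
  </book>
</catalog>
"""

def titles_by_author(xml_text, author):
    """Return the titles of all <book> elements whose <author> matches."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")
        for book in root.iter("book")
        if book.findtext("author") == author
    ]

titles = titles_by_author(CATALOG, "Scooby-Doo")
```

A plain keyword search for "Scooby-Doo" would also match books about Scooby-Doo; the `author` tag is what lets the query restrict the match to authorship.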
Feature selection • Reduce dimensionality • Learners have difficulty addressing tasks with high dimensionality • Irrelevant features • Not all features help! • e.g., the existence of a noun in a news article is unlikely to help
classify it as “politics” or “sport”
Feature selection: D2K Example I

Before selection:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules do core datatype package complete independence application 2 transformation handle different table previous use transformationmodule add list exampletable keep interface call sub-interface reversibletransformation

After selection:
hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface
Feature selection: D2K Example II

Before selection:
hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface

After selection:
hi week update unfortunate month action right due insistence member group ncsa d2k modules core datatype package complete independence application transformation handle different table previous add list interface call sub-interface
Text Mining: Classification definition • Given: a collection of labeled records (training set) • Each record contains a set of features (attributes), and the true class (label) • Find: a model for the class as a function of the
values of the features • Goal: previously unseen records should be assigned a class as accurately as possible • A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it
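The train/test procedure can be sketched with a small multinomial Naive Bayes text classifier in plain Python. Naive Bayes is one possible learner, chosen here only for brevity; the labeled records, the class names, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_naive_bayes(records):
    """records: list of (text, label). Returns the counts that form the model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in records:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, records):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(predict(model, text) == label for text, label in records)
    return correct / len(records)

# Hypothetical labeled records, split into a training set and a test set.
training_set = [
    ("football fans cheering at the match", "sport"),
    ("the team won the football league", "sport"),
    ("parliament passed the new budget law", "politics"),
    ("the minister announced an election", "politics"),
]
test_set = [
    ("fans celebrate the league match", "sport"),
    ("election law debated in parliament", "politics"),
]
model = train_naive_bayes(training_set)
test_accuracy = accuracy(model, test_set)
```

The model is built only from `training_set`; `test_set` is held back and used solely to estimate how well the model labels previously unseen records.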
Text Mining: Clustering definition • Given: a set of documents and a similarity measure
among documents • Find: clusters such that: • Documents in one cluster are more similar to one another • Documents in separate clusters are less similar to one another
• Goal: • Finding a correct set of documents
Similarity Measures: • Euclidean Distance if attributes are continuous • Other Problem-specific Measures • e.g., how many words are common in these documents
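The two kinds of similarity measure can be sketched as follows. The Jaccard coefficient is an added assumption, shown as one common way to normalize the shared-word count; the sample documents are illustrative.

```python
import math
import re

def words(text):
    """Set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def common_words(doc_a, doc_b):
    """Problem-specific measure: number of distinct words the documents share."""
    return len(words(doc_a) & words(doc_b))

def jaccard(doc_a, doc_b):
    """Shared words divided by all distinct words (0.0 for two empty texts)."""
    a, b = words(doc_a), words(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

d1 = "Italian football fans were cheering"
d2 = "An English football fan"
d3 = "An average USA salesman earns 75K"
shared_12 = common_words(d1, d2)
shared_13 = common_words(d1, d3)
distance = euclidean([3.0, 4.0], [0.0, 0.0])
```

Note that "fans" and "fan" do not match here; applying stemming first would make the word-overlap measure less brittle.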
Example
GREAT Camera., Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.
I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital. The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out. …

Summary:
Feature1: picture
Positive: 12
• The pictures coming out of this camera are amazing.
• Overall this is a good camera with a really good picture clarity.
…
Negative: 2
• The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
• Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.
Feature2: battery life
…
CS583, Bing Liu, UIC
Visual Comparison
[Figure: two charts showing positive (+) and negative (-) opinions per feature (Picture, Battery, Zoom, Size, Weight): "Summary of reviews of Digital camera 1" and "Comparison of reviews of Digital camera 1 and Digital camera 2".]
CS583, Bing Liu, UIC
Information Extraction

Posting from Newsgroup
Telecommunications. Solaris Systems Administrator. 55-60K. Immediate need.
3P is a leading telecommunications firm in need of a energetic individual to fill the following position in the Atlanta office:
SOLARIS SYSTEM ADMINISTRATOR
Salary: 50-60K with full benefits
Location: Atlanta, Georgia
no relocation assistance provided

FILLED TEMPLATE
job title: SOLARIS SYSTEM ADMINISTRATOR
salary: 55-60K
city: Atlanta
state: Georgia
platform: SOLARIS
area: Telecommunications
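A minimal sketch of template filling with hand-written regular expressions is shown below. The patterns are hypothetical and tuned to this one posting; a real extraction system would learn or generalize such rules. The sketch reads the value on the `Salary:` line (50-60K), whereas the filled template above shows the 55-60K figure from the posting's header.

```python
import re

POSTING = (
    "Telecommunications. Solaris Systems Administrator. 55-60K. Immediate need. "
    "3P is a leading telecommunications firm in need of a energetic individual "
    "to fill the following position in the Atlanta office: "
    "SOLARIS SYSTEM ADMINISTRATOR Salary: 50-60K with full benefits "
    "Location: Atlanta, Georgia no relocation assistance provided"
)

# Hand-written, hypothetical patterns: one capture group per template slot.
PATTERNS = {
    "job title": r"position in the \w+ office: ([A-Z ]+?) Salary:",
    "salary": r"Salary: (\d+-\d+K)",
    "city": r"Location: (\w+),",
    "state": r"Location: \w+, (\w+)",
}

def fill_template(text):
    """Fill each slot with the first match of its pattern, or None."""
    template = {}
    for slot, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        template[slot] = match.group(1) if match else None
    return template

filled = fill_template(POSTING)
```

Slots such as platform and area need more than surface patterns (for example, a list of known platforms), which is why they are left out of this sketch.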
Classification: An Example
[Figure: Training Set → Learn Classifier → Model → Test Set. The training set holds 10 records with attributes Country, Marital Status, and Income and the class label Hooligan (Yes/No), for example (England, Single, 70K, Yes), (Italy, Married, 40K, No), (USA, Divorced, 95K, No). The test set holds records with the same attributes, for example (England, Single, 75K), whose Hooligan value is unknown (?).]
Text Classification: An Example
[Figure: Training Set → Learn Classifier → Model → Test Set. Each training record is a short text labeled Hooligan Yes/No:
• An English football fan …
• During a game in Italy …
• England has been beating France …
• Italian football fans were cheering …
• An average USA salesman earns 75K
• The game in London was horrific
• Manchester city is likely to win the championship
• Rome is taking the lead in the football league
The test set holds texts whose label is unknown (?):
• A Danish football fan
• Turkey is playing vs. France. The Turkish fans …]
Web Mining
Data Mining – Computer Science, IPB

Web Mining
[Figure: WWW → Web Mining → Knowledge]
Example: Web data extraction
[Figure: a web page with two data regions (Data region1, Data region2); two data records are marked in Data region1.]
CS583, Bing Liu, UIC
Align and extract data items (e.g., region1)

image1 | EN7410 17-inch LCD Monitor Black/Dark charcoal | | $299.99 | | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image2 | 17-inch LCD Monitor | | $249.99 | | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image3 | AL1714 17inch LCD Monitor, Black | | $269.99 | | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image4 | SyncMaster 712n 17inch LCD Monitor, Black | Was: $369.99 | $299.99 | Save $70 After: $70 mail-in rebate(s) | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare

CS583, Bing Liu, UIC
Ads vs. search results
Reproduced from Ullman & Rajaraman with permission
Ads vs. search results Search advertising is the revenue model • Multi-billion-dollar industry • Advertisers pay for clicks on their ads
Interesting problems • How to pick the top 10 results for a search from 2,230,000 matching pages? • What ads to show for a search? • If I’m an advertiser, which search terms should I bid on and how much to bid? Reproduced from Ullman & Rajaraman with permission
What’s Web Mining?
Discovering interesting and useful information from Web content and usage
• Web search: Google, Yahoo, MSN, Ask, …
• Specialized search: e.g. Froogle (comparison shopping), job ads (Flipdog)
• eCommerce:
  • Recommendations: e.g. Netflix, Amazon
  • improving conversion rate: next best product to offer
• Advertising, e.g. Google Adsense
• Fraud detection: click fraud detection, …
• Improving Web site design and performance
May 12, 2014
Web Mining
Web Mining • Web mining - data mining techniques to
automatically discover and extract information from Web documents/services (Etzioni, 1996). • Web mining research – integrate research from several research communities (Kosala and Blockeel, July 2000) such as: • Database (DB) • Information retrieval (IR) • The sub-areas of machine learning (ML) • Natural language processing (NLP)
Web Mining • The World Wide Web may have more opportunities
for data mining than any other area • However, there are serious challenges: • It is too huge • Complexity of Web pages is greater than any traditional
text document collection • It is highly dynamic • It has a broad diversity of users • Only a tiny portion of the information is truly useful
How big is the Web ?
Technically, infinite
Because of dynamically generated content Lots of duplication (30-40%)
Number of pages
Best estimate of “unique” static HTML pages comes from search engine claims
Google = 8 billion, Yahoo = 20 billion Lots of marketing hype
Reproduced from Ullman & Rajaraman with permission
Why Mine the Web? • Enormous wealth of textual information on the Web. • Book/CD/Video stores (e.g., Amazon) • Restaurant information (e.g., Zagats) • Car prices (e.g., Carpoint)
• Lots of data on user access patterns • Web logs contain sequence of URLs accessed by users
• Possible to retrieve “previously unknown” information • People who ski also frequently break their leg. • Restaurants that serve sea food in California are likely to be outside
San-Francisco
In May 2014, the Netcraft web server survey counted 975,262,468 sites, 16 million more than the previous month
http://news.netcraft.com/archives/category/web-server-survey/
Unique Features of the Web • The Web is a huge collection of documents
where many contain: • Hyper-link information • Access and usage information
• The Web is very dynamic • Web pages are constantly being generated (removed)
Challenge: Develop new Web mining algorithms to . . . •Exploit hyper-links and access patterns. •Be adaptable to its documents source
Web Mining vs Data Mining
Structure
• The Web is not a relation • Textual information and linkage structure
Scale
• Usage data is huge and growing rapidly • Data generated per day is comparable to largest conventional data warehouses
Speed
• Often need to react to evolving usage patterns in real-time (e.g., merchandising) • No human in the loop
Web Mining Taxonomy
Web Mining
• Web Content Mining
  • Web Page Content Mining: identify information within given web pages; distinguish personal home pages from other web pages
  • Search Result Mining: categorizes documents using phrases in titles and snippets
• Web Structure Mining: uses interconnections between web pages to give weight to pages
• Web Usage Mining
  • General Access Pattern Tracking: understand access patterns and trends to improve structure
  • Customized Usage Tracking: analyzes access patterns of a user to improve response
Mining the World Wide Web: Web Page Content Mining
(Web Mining → Web Content Mining → Web Page Content Mining)
Web Page Summarization
• WebOQL (Mendelzon et al. 1998) …: Web structuring query languages; can identify information within given web pages
• (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other web pages
• ShopBot (Etzioni et al. 1997): looks for product prices within web pages
Mining the World Wide Web: Search Result Mining
(Web Mining → Web Content Mining → Search Result Mining)
Search Engine Result Summarization
• Clustering Search Result (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets
Mining the World Wide Web: Web Structure Mining
(Web Mining → Web Structure Mining)
Using Links
• PageRank (Brin et al., 1998)
• CLEVER (Chakrabarti et al., 1998)
Use interconnections between web pages to give weight to pages.
Using Generalization
• MLDB (1994): uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
Mining the World Wide Web: General Access Pattern Tracking
(Web Mining → Web Usage Mining → General Access Pattern Tracking)
• Web Log Mining (Zaïane, Xin and Han, 1998): uses KDD techniques to understand general access patterns and trends. Can shed light on better structure and grouping of resource providers.
Mining the World Wide Web: Customized Usage Tracking
(Web Mining → Web Usage Mining → Customized Usage Tracking)
• Adaptive Sites (Perkowitz and Etzioni, 1997): analyzes access patterns of each user at a time. Web site restructures itself automatically by learning from user access patterns.
Web Content Mining Approaches
• Information Retrieval Approach
  • Assist or improve information finding, or filter information for users, usually based on either inferred or solicited user profiles.
• Database Approach
  • Model the data on the Web and integrate it so that queries more sophisticated than keyword-based search can be performed.
Web Content Mining

 | IR View | DB View
View of Data | Unstructured; Semi-structured | Semi-structured; Web site as DB
Main Data | Text documents; Hypertext documents | Hypertext documents
Representation | Bag of words, n-grams; Terms, phrases; Concepts or ontology; Relational | Edge-labeled graph; Relational
Methods | Machine Learning; Statistics | ILP; Association rules
Applications | Categorization; Clustering; Finding extraction rules; Finding patterns in text; User modeling | Finding frequent substructures; Web site schema discovery
Issues in Web Content Mining
• Developing intelligent tools for IR
  • Finding keywords & key phrases
  • Discovering grammatical rules & collocations
• Hypertext classification/categorization
  • Extracting key phrases from HTML documents
• Learning extraction models/rules
• Hierarchical clustering
  • Predicting word relationships
• Building Web query systems (WebOQL, XMLQL)
• Mining multimedia data
Web Structure Mining
View of Data: Links structure
Main Data: Links structure
Representation: Graph
Methods: Proprietary algorithms
Applications: Categorization; Clustering
Web Structure Mining
• Discovers the link structure of hyperlinks at the inter-document level in order to build a structural summary of a website
  • Direction 1: based on hyperlinks, categorize Web pages & the generated information
  • Direction 2: discover the structure of the Web documents themselves
  • Direction 3: discover the nature of the hierarchy/network of hyperlinks within a particular website
Web Structure Mining
• Finding authoritative web pages
  • Retrieve pages that are not only relevant but also of high quality/authoritative on the topic
• Hyperlinks can indicate authority
  • The Web also contains hyperlinks from one page to another
  • Hyperlinks embody a large amount of human annotation
  • A hyperlink pointing to another page can be regarded as the author's endorsement of that page
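The idea that a hyperlink acts as an endorsement can be sketched with a simplified power iteration in the style of PageRank. This is an illustrative approximation, not the production algorithm: the damping factor 0.85, the iteration count, and the four-page graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page receives a small base share regardless of links
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page splits its endorsement evenly among its out-links
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: A, B, and D all link to C; C links to D.
graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["C"]}
ranks = pagerank(graph)
best = max(ranks, key=ranks.get)
```

Page C collects links from three pages and ends up with the highest score; D ranks second because its single in-link comes from the highly ranked C.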
Web Usage Mining
View of Data: Interactivity
Main Data: Server logs; Browser logs
Representation: Relational table; Graph
Methods: Machine learning; Statistics; Association rules
Applications: Site construction, adaptation & management; Marketing; User modeling
Web Usage Mining
• Web usage mining is also called Web log mining
• A mining technique for discovering interesting usage patterns from the secondary data derived from users' interactions while browsing the web
Web Usage Mining
• Applications
  • Targeting potential customers for electronic products
  • Enhancing the quality and delivery of Internet information services to end users
  • Improving web server system performance
  • Identifying potential advertising locations
  • Facilitating personalization/adaptive sites
  • Improving site design
  • Fraud/intrusion detection
  • Predicting user actions
Log Data - Simple Analysis • Statistical analysis of users
– Length of path – Viewing time – Number of page views • Statistical analysis of site
– Most common pages viewed – Most common invalid URL
Web Log – Data Mining Applications • Association rules
– Find pages that are often viewed together • Clustering
– Cluster users based on browsing patterns – Cluster pages based on content • Classification
– Relate user attributes to patterns
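As a sketch of the first application, the following Python finds pairs of pages that are often viewed together and reports support and confidence for each direction of the rule. The sessions, the minimum support threshold, and the restriction to page pairs are assumptions made to keep the example short.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: the set of pages each visitor viewed.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

def pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(sessions)
    page_counts = Counter(p for s in sessions for p in s)
    pair_counts = Counter(
        pair for s in sessions for pair in combinations(sorted(s), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of sessions containing both pages
        if support < min_support:
            continue
        # one rule per direction: a -> b and b -> a
        rules.append((a, b, support, count / page_counts[a]))
        rules.append((b, a, support, count / page_counts[b]))
    return rules

rules = pair_rules(sessions)
```

In this toy data every session that viewed `/cart` also viewed `/products`, so the rule `/cart` → `/products` has confidence 1.0, while the reverse rule is weaker.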
Common Log Format
• Remotehost: browser hostname or IP #
• Remote log name of user (almost always "-" meaning "unknown")
• Authuser: authenticated username
• Date: date and time of the request
• "request": exact request line from the client
• Status: the HTTP status code returned
• Bytes: the content-length of the response
SERVER LOGS
Fields • Client IP: 128.101.228.20 • Authenticated User ID: - • Time/Date: [10/Nov/1999:10:16:39 -0600] • Request: "GET / HTTP/1.0" • Status: 200 • Bytes: • Referrer: “-” • Agent: "Mozilla/4.61 [en] (WinNT; I)"
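A log line carrying these fields can be parsed with a regular expression, as in the sketch below. The pattern and function name are assumptions, the byte count 1043 is a made-up value because the slide leaves that field blank, and parsing the month abbreviation with `%b` assumes an English locale.

```python
import re
from datetime import datetime

# Common Log Format fields, with optional referrer and agent at the end.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# The byte count 1043 is hypothetical; the other values come from the slide.
line = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] "GET / HTTP/1.0" '
        '200 1043 "-" "Mozilla/4.61 [en] (WinNT; I)"')
entry = parse_log_line(line)
```

Once lines are parsed into dictionaries like `entry`, the simple statistics mentioned earlier (page views per user, most common pages, invalid URLs via status 404) reduce to counting over these fields.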
Searching the Web
[Figure: The Web, content aggregators, and content consumers.]
Reproduced from Ullman & Rajaraman with permission
Web search basics
[Figure: a User submits a query to Search; a Web crawler collects pages from the Web; an Indexer builds the Indexes and Ad indexes. The sample results page for the query "miele" shows "Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)", with web results such as www.miele.com, www.miele.co.uk, www.miele.de, and www.miele.at alongside Sponsored Links such as www.cgappliance.com, www.vacuums.com, and www.best-vacuum.com.]
Reproduced from Ullman & Rajaraman with permission
Mining the Web
[Figure: a Spider collects pages from the Web into a documents source; a Query is submitted to an IR / IE System, which returns Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Data Mining: Principles and Algorithms
Search Engine Ranking based on link structure analysis
[Figure: search engine architecture. Components: Web Pages, Web Page Parser, URL Dictionary, Meta Data, Forward Index, Forward Link, Web Graph Constructor, Web Topology Graph, Anchor Text Generator, Backward Link (Anchor Text), Indexer, Inverted Index, Term Dictionary (Lexicon), Rank Functions, Relevance Ranking (similarity based on content or text), Importance Ranking (Link Analysis), Search.]
Layout Structure
• Compared to plain text, a web page is a 2D presentation
  • Rich visual effects created by different term types, formats, separators, blank areas, colors, pictures, etc.
  • Different parts of a page are not equally important
Example page elements:
• Title: CNN.com International
• H1: IAEA: Iran had secret nuke agenda
• H3: EXPLOSIONS ROCK BAGHDAD …
• TEXT BODY (with position and font type): The International Atomic Energy Agency has concluded that Iran has secretly produced small amounts of nuclear materials including low enriched uranium and plutonium that could be used to develop nuclear weapons according to a confidential report obtained by CNN…
• Hyperlink: URL: http://www.cnn.com/...; Anchor Text: AI oaeda…
• Image: URL: http://www.cnn.com/image/...; Alt & Caption: Iran nuclear …
• Anchor Text: CNN Homepage News …
Web Page Block: Better Information Unit
[Figure: a web page divided into Web Page Blocks labeled Importance = Low, Importance = Med, and Importance = High.]
Web Usage Mining Applications: Simple and Basic: • Monitor performance, bandwidth usage • Catch errors (404 errors- pages not found) • Improve web site design • (shortcuts for frequent paths, remove links not used, etc)
Advanced and Business Critical : • eCommerce: improve conversion, sales, profit • Fraud detection: click stream fraud, …
Web Usage Mining – Three Phases
Web Usage Mining Issues • Identification of exact user not possible. • Exact sequence of pages referenced by a user not
possible due to caching. • Session not well defined • Security, privacy, and legal issues
Systems Issues Web data sets can be very large
• Tens to hundreds of terabytes
Cannot mine on a single server!
• Need large farms of servers
How to organize hardware/software to mine multi-terabyte data sets
• Without breaking the bank!
Ontology Learning
[Figure: an is-a hierarchy under root, with concepts including furnishing, event, area, accomodation, region, city, hotel, youth hostel, and wellness hotel. Association Rule Mining yields the derived concept pairs (wellness hotel, area), (hotel, area), (accomodation, area), which generalize to the conceptual relation hasLocation(accomodation, area).]
[Mädche, Staab: ECAI 2000]
Semantic Web Structure/Content Mining
[Figure: an Ontology with concepts Organization, Hotel, and GolfCourse, relations cooperatesWith and belongsTo, the attribute name, and the rule fragment "FORALL X, Y Y: Hotel[cooperatesWith ->> X] > Y]."; and a Knowledge base containing Hotel: Wellnesshotel, GolfCourse: Seaview, belongsTo(Seaview, Wellnesshotel).]
ILP Based Association Rule Mining, e.g. [Dehaspe, Toivonen, J. DMKD 1998]
... Hotel(x), GolfCourse(y), belongsTo(y,x) → hasStars(x,5)
support = 0.4 %, confidence = 89 %
Complex Data Types Summary • Emerging areas of mining complex data types: • Text mining can be done quite effectively, especially if
the documents are semi-structured • Web mining is more difficult due to lack of such
structure • Data includes text documents, hypertext documents, link
structure, and logs • Need to rely on unsupervised learning, sometimes
followed up with supervised learning such as classification