Text & Web Mining Data Mining Ilmu Komputer IPB

Author: Earl Bradley
5/12/2014

Lecture 12

Text & Web Mining
Data Mining – Ilmu Komputer IPB

Structured data
• So far we have been dealing with structured data, i.e. records of attribute → value pairs:

Attribute → Value
Outlook → Sunny
Temperature → Hot
Windy → Yes
Humidity → High
Play → Yes

• Data mining generally uses data of this kind


Complex Data Types
• Complex data is on the rise:
  • Spatial data: geographic data, medical & satellite images
  • Multimedia data: images, audio, & video
  • Time-series data: banking data & stock exchange data
  • Text data: word descriptions for objects
  • World-Wide-Web: highly unstructured text & multimedia data


Text Databases
• In practice there are many text databases:
  • news articles
  • research papers
  • books
  • digital libraries
  • e-mail
  • web pages
• Growing rapidly in both volume and importance (80%)


Text Mining
• Text mining refers to data mining that uses text documents as its data
• Almost all text mining tasks use Information Retrieval (IR) methods to preprocess the text documents
• These methods differ somewhat from the preprocessing methods used for relational tables
• Web search also has its roots in IR

CS583, Bing Liu, UIC

Definition of Text Mining
• Discover useful and previously unknown “gems” of information in large text collections


Definition of Text Mining
Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.
Text Mining = Data Mining (applied to text data) + basic linguistics

Definition
• “Previously unknown”?
• Strict definition
  • Information that even the author does not know
  • Example: discovering a new method for hair growth that is a side effect of some procedure
• Loose definition
  • Rediscovering information that the author has written in the text
  • Example: automatically extracting product names from a web page


Text Mining Tasks
• Given:
  • A source of textual documents
  • A well-defined, limited (text-based) query
• Find:
  • Sentences with relevant information
  • Extract the relevant information & ignore the irrelevant information
  • Link related information & output it in a predetermined format

Tasks addressed by TM • Search and retrieval • Semantic analysis • Clustering • Categorization • Feature extraction • Ontology building • Dynamic focusing


DM vs TM

Object of investigation
  Data Mining: Numerical and categorical data
  Text Mining: Texts

Object structure
  Data Mining: Relational databases
  Text Mining: Free form texts

Goal
  Data Mining: Predict outcomes of future situations
  Text Mining: Retrieve relevant information, distill the meaning, categorize and target-deliver

Methods
  Data Mining: Machine learning: SKAT, DT, NN, GA, MBR, MBA
  Text Mining: Indexing, special neural network processing, linguistics, ontologies

Current market size
  Data Mining: 100,000 analysts at large and midsize companies
  Text Mining: 100,000,000 corporate workers and individual users

Maturity
  Data Mining: Broad implementation since 1994
  Text Mining: Broad implementation starting 2000

“Search” vs “Discover”

Structured Data
  Search (goal-oriented): Data Retrieval
  Discover (opportunistic): Data Mining

Unstructured Data (Text)
  Search (goal-oriented): Information Retrieval
  Discover (opportunistic): Text Mining


Text Mining Applications
• Marketing: finding groups of potential buyers based on users' text profiles
  • e.g. amazon
• Industry: identifying the web sites of competitor groups
  • Competitors' products and their prices
• Job search: identifying parameters in job searches
  • www.flipdog.com

Aplikasi Text Mining • Search engines • Enterprise portals • Knowledge management systems • e-Business systems • Vertical applications: • e-mail categorization and routing • Call center notes categorization • CRM systems


[Figure: IR system components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, INDEX, Text Database]

Search Subsystem
[Figure: query → parse query → query tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → Boolean operations* (over the inverted file system) → retrieved document set → ranking* → ranked document set; relevant document set]
*Indicates optional operation.


Indexing Subsystem
[Figure: documents → assign document IDs → text, document numbers and *field numbers → break into tokens → tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → term weighting* → terms with weights → inverted file system]
*Indicates optional operation.
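
The indexing steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline of any particular IR system: the stop list, the suffix-stripping `stem` function, and the sample documents are made-up stand-ins (a real system would use a proper stemmer), and raw term frequency is assumed as the term weight.

```python
import re
from collections import defaultdict

STOP_LIST = {"the", "a", "an", "of", "is", "in", "to", "and"}  # illustrative only


def stem(token):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_inverted_index(documents):
    """Map each stemmed term to {document ID: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents, start=1):      # assign document IDs
        tokens = re.findall(r"[a-z0-9]+", text.lower())     # break into tokens
        tokens = [t for t in tokens if t not in STOP_LIST]  # stop list (optional)
        for term in map(stem, tokens):                      # stemming (optional)
            # term weighting (optional): here simply the raw term frequency
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = ["The mining of text", "Mining the web", "Text and web indexes"]
index = build_inverted_index(docs)
print(index["min"])   # postings for the stem of "mining"
print(index["text"])  # documents that contain "text"
```

Looking up a query term in `index` returns the documents that contain it, which is the starting point for the Boolean operations and ranking of the search subsystem.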

Text Mining
[Figure: text mining workflow with the elements Sample Documents, Text document, Transformed, Representation models, Learning, Working, Domain specific templates/models, Visualizations]


Text characteristics: Outline • Large textual data base • High dimensionality • Several input modes • Dependency • Ambiguity • Noisy data • Not well structured text

Text characteristics • Large textual data base • Efficiency consideration • over 2,000,000,000 web pages • almost all publications are also in electronic form

• High dimensionality (Sparse input) • Consider each word/phrase as a dimension • Several input modes • e.g., Web mining: information about user is generated by semantics, browse pattern and outside knowledgebase.


Text characteristics • Dependency • relevant information is a complex conjunction of words/phrases • e.g., Document categorization.

Pronoun disambiguation.

• Ambiguity • Word ambiguity • Pronouns (he, she …) • “buy”, “purchase”

• Semantic ambiguity • The king saw the rabbit with his glasses. (8 meanings)

Text characteristics • Noisy data • Example: Spelling mistakes

• Not well structured text • Chat rooms • “r u available ?” • “Hey whazzzzzz up”

• Speech


Text mining process
• Text preprocessing
  • Syntactic/Semantic text analysis
• Features Generation
  • Bag of words
• Features Selection
  • Simple counting
  • Statistics
• Text/Data Mining
  • Classification: supervised learning
  • Clustering: unsupervised learning
• Analyzing results


Syntactic / Semantic text analysis • Part of Speech (pos) tagging • Find the corresponding pos for each word

e.g., John (noun) gave (verb) the (det) ball (noun) • ~98% accurate.

• Word sense disambiguation • Context based or proximity based • Very accurate

• Parsing • Generates a parse tree (graph) for each sentence • Each sentence is a stand alone graph

Feature Generation: Bag of words
• Text document is represented by the words it contains (and their occurrences)
  • e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
  • Highly efficient
  • Makes learning far simpler and easier
  • Order of words is not that important for certain applications
• Stemming: identifies a word by its root
  • e.g., flying, flew → fly
  • Reduce dimensionality
• Stop words: The most common words are unlikely to help text mining
  • e.g., “the”, “a”, “an”, “you” …
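
As a rough sketch of the bag-of-words idea, the snippet below counts word occurrences with optional stop-word removal and stemming. The stop-word set and the `ROOTS` lookup table are assumptions made for illustration; a real system would use a full stop list and a proper stemmer or lemmatizer.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "you", "of"}  # illustrative list
# Tiny lookup table standing in for a real stemmer (hypothetical).
ROOTS = {"flying": "fly", "flew": "fly", "rings": "ring"}


def bag_of_words(text, remove_stop_words=True, use_stemming=True):
    """Return word counts; word order is deliberately discarded."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        tokens = [ROOTS.get(t, t) for t in tokens]
    return Counter(tokens)


raw_bag = bag_of_words("Lord of the rings", remove_stop_words=False, use_stemming=False)
clean_bag = bag_of_words("The bird flew and you are flying")
print(sorted(raw_bag))   # the four words of the title, order lost
print(clean_bag["fly"])  # "flew" and "flying" collapse to one feature
```

Both preprocessing steps shrink the vocabulary, which is why the slide lists them as ways to reduce dimensionality.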


Feature Generation: D2K Example

Original text:
Hi, Here is your weekly update (that unfortunately hasn't gone out in about a month). Not much action here right now. 1) Due to the unwavering insistence of a member of the group, the ncsa.d2k.modules.core.datatype package is now completely independent of the d2k application. 2) Transformations are now handled differently in Tables. Previously, transformations were done using a TransformationModule. That module could then be added to a list that an ExampleTable kept. Now, there is an interface called Transformation and a sub-interface called ReversibleTransformation.

After lowercasing and stop-word removal:
hi, weekly update (that unfortunately gone out month). much action here right now. 1) due unwavering insistence member group, ncsa.d2k.modules.core.datatype package now completely independent d2k application. 2) transformations now handled differently tables. previously, transformations done using transformationmodule. module added list exampletable kept. now, interface called transformation sub-interface called reversibletransformation.

After stemming:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules core datatype package now complete independence d2k application 2 transformation now handle different table previous transformation do use transformationmodule module add list exampletable keep now interface call transformation sub-interface call reversibletransformation

Feature Generation: XML
• Current keyword-oriented search engines cannot handle rich queries like
  • Find all books authored by “Scooby-Doo”.
• XML: Extensible Markup Language
  • XML documents have a nested structure in which each element is associated with a tag.
  • Tags describe the semantics of elements.

The making of a bad movie Scooby-Doo Cartoons
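
To illustrate why tags make such a query answerable, here is a small sketch using Python's standard `xml.etree.ElementTree`. The tag names (`library`, `book`, `title`, `author`, `publisher`) and the second book are assumptions for illustration only; the markup of the original slide example is not shown in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names and the second record are made up.
XML_DOC = """
<library>
  <book>
    <title>The making of a bad movie</title>
    <author>Scooby-Doo</author>
    <publisher>Cartoons</publisher>
  </book>
  <book>
    <title>Another title</title>
    <author>Someone Else</author>
    <publisher>Cartoons</publisher>
  </book>
</library>
"""


def books_by_author(xml_text, author):
    """Return the titles of all <book> elements whose <author> matches."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")
        for book in root.iter("book")
        if book.findtext("author") == author
    ]


print(books_by_author(XML_DOC, "Scooby-Doo"))
```

A plain keyword search for "Scooby-Doo" could not tell an author from a title or a character name; the `author` tag carries that distinction.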


Feature selection • Reduce dimensionality • Learners have difficulty addressing tasks with high dimensionality • Irrelevant features • Not all features help! • e.g., the existence of a noun in a news article is unlikely to help

classify it as “politics” or “sport”

Feature selection: D2K Example I

Before:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules do core datatype package complete independence application 2 transformation handle different table previous use transformationmodule add list exampletable keep interface call sub-interface reversibletransformation

After:
hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface


Feature selection: D2K Example II

Input terms:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules do core datatype package complete independence application 2 transformation handle different table previous use transformationmodule add list exampletable keep interface call sub-interface reversibletransformation

After step 1:
hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface

After step 2:
hi week update unfortunate month action right due insistence member group ncsa d2k modules core datatype package complete independence application transformation handle different table previous add list interface call sub-interface

Text Mining: Classification definition
• Given: a collection of labeled records (training set)
  • Each record contains a set of features (attributes), and the true class (label)
• Find: a model for the class as a function of the values of the features
• Goal: previously unseen records should be assigned a class as accurately as possible
  • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it
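
The definition above can be made concrete with a small sketch. It assumes a multinomial Naive Bayes model with add-one smoothing, which is just one possible choice of classifier and is not prescribed by the slides; the labeled records are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(records):
    """records: list of (text, label). Returns the counts the model needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in records:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_records = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_records)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_set):
    """Fraction of test records whose predicted label equals the true label."""
    correct = sum(predict(model, text) == label for text, label in test_set)
    return correct / len(test_set)


# Made-up labeled records, split into a training set and a test set.
training_set = [
    ("football fans cheering at the game", "sport"),
    ("the team is likely to win the championship", "sport"),
    ("parliament votes on the new budget", "politics"),
    ("the minister announced an election", "politics"),
]
test_set = [
    ("fans at the championship game", "sport"),
    ("election budget debate", "politics"),
]
model = train_naive_bayes(training_set)
print(accuracy(model, test_set))
```

The model is built only from `training_set`; `test_set` is held back so that the reported accuracy reflects records the model has not seen.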


Text Mining: Clustering definition
• Given: a set of documents and a similarity measure among documents
• Find: clusters such that:
  • Documents in one cluster are more similar to one another
  • Documents in separate clusters are less similar to one another
• Goal:
  • Finding a correct set of documents

Similarity Measures:
• Euclidean Distance if attributes are continuous
• Other Problem-specific Measures
  • e.g., how many words are common in these documents
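
The two kinds of measure mentioned above can be sketched as follows. The Jaccard coefficient is added here as one common way to normalize the "words in common" count; it is an assumption of this sketch, not something the slides specify.

```python
import math


def euclidean_distance(u, v):
    """Distance between two equal-length vectors of continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def common_words(doc_a, doc_b):
    """Problem-specific measure: number of distinct words the documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


def jaccard_similarity(doc_a, doc_b):
    """Shared words divided by all distinct words in either document."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


print(euclidean_distance([0, 0], [3, 4]))
print(common_words("web mining is fun", "text mining is hard"))
print(jaccard_similarity("web mining is fun", "text mining is hard"))
```

Note that Euclidean distance is a dissimilarity (smaller means more alike), while the word-overlap measures are similarities (larger means more alike); a clustering algorithm has to be told which convention it is given.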

Example
GREAT Camera., Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.
I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital. The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out. …

Summary:
Feature1: picture
Positive: 12
• The pictures coming out of this camera are amazing.
• Overall this is a good camera with a really good picture clarity.
…
Negative: 2
• The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
• Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.
Feature2: battery life
…
CS583, Bing Liu, UIC


Visual Comparison
[Figure: two charts of positive (+) and negative (_) opinions per feature (Picture, Battery, Zoom, Size, Weight): a summary of reviews of Digital camera 1, and a comparison of reviews of Digital camera 1 and Digital camera 2]

CS583, Bing Liu, UIC

Information Extraction

Posting from Newsgroup
Telecommunications. Solaris Systems Administrator. 55-60K. Immediate need.
3P is a leading telecommunications firm in need of an energetic individual to fill the following position in the Atlanta office:
SOLARIS SYSTEM ADMINISTRATOR
Salary: 50-60K with full benefits
Location: Atlanta, Georgia
no relocation assistance provided

FILLED TEMPLATE
job title: SOLARIS SYSTEM ADMINISTRATOR
salary: 55-60K
city: Atlanta
state: Georgia
platform: SOLARIS
area: Telecommunications
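
A very rough sketch of how part of such a template could be filled is shown below. It uses hand-written regular expressions, which is an assumption of this example; practical information extraction systems usually learn their extraction rules, and the patterns here only cover four of the six template slots.

```python
import re

POSTING = """Telecommunications. Solaris Systems Administrator. 55-60K. Immediate need.
3P is a leading telecommunications firm in need of an energetic individual
to fill the following position in the Atlanta office:
SOLARIS SYSTEM ADMINISTRATOR
Salary: 50-60K with full benefits
Location: Atlanta, Georgia no relocation assistance provided"""


def fill_template(posting):
    """Fill a job template using hand-written (not learned) patterns."""
    template = {}
    # A line written entirely in capitals is taken to be the job title.
    title = re.search(r"^([A-Z ]{10,})$", posting, flags=re.MULTILINE)
    # The first salary-like range in the posting (here, from the subject line).
    salary = re.search(r"\b(\d+-\d+K)\b", posting)
    # "Location: City, State"
    location = re.search(r"Location:\s*([A-Z][a-z]+),\s*([A-Z][a-z]+)", posting)
    if title:
        template["job title"] = title.group(1).strip()
    if salary:
        template["salary"] = salary.group(1)
    if location:
        template["city"], template["state"] = location.groups()
    return template


filled = fill_template(POSTING)
for slot, value in filled.items():
    print(f"{slot}: {value}")
```

Patterns like these are brittle: a posting that writes the salary or location differently would leave the slot empty, which is the motivation for learning extraction rules from examples.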


Classification: An Example

[Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set. The training set holds 10 examples with the columns Ex#, Country, Marital Status, Income and Hooligan (class label, Yes/No); the test set has the columns Country, Marital Status, Income and Hooligan, with the Hooligan value unknown (?).]

Sample training rows (Ex#, Country, Marital Status, Income, Hooligan):
3, England, Single, 70K, Yes
4, Italy, Married, 40K, No
5, USA, Divorced, 95K, No

Sample test row (Country, Marital Status, Income, Hooligan):
England, Single, 75K, ?


Text Classification: An Example

[Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set. Each example is a text with a Hooligan label (Yes/No in the training set, ? in the test set).]

Training set texts:
• An English football fan …
• During a game in Italy …
• England has been beating France …
• Italian football fans were cheering …
• An average USA salesman earns 75K
• The game in London was horrific
• Manchester city is likely to win the championship
• Rome is taking the lead in the football league

Test set texts:
• A Danish football fan ?
• Turkey is playing vs. France. The Turkish fans … ?


Web Mining Data mining – Ilmu Komputer IPB

[Figure: Web Mining: WWW → Knowledge]


Example: Web data extraction
[Figure: a web page with two data regions (Data region1, Data region2), each made up of several data records]

CS583, Bing Liu, UIC

Align and extract data items (e.g., region1)
image1: EN7410 17-inch LCD Monitor Black/Dark charcoal; $299.99; Add to Cart; (Delivery / Pick-Up); Penny Shopping; Compare
image2: 17-inch LCD Monitor; $249.99; Add to Cart; (Delivery / Pick-Up); Penny Shopping; Compare
image3: AL1714 17inch LCD Monitor, Black; $269.99; Add to Cart; (Delivery / Pick-Up); Penny Shopping; Compare
image4: SyncMaster 712n 17inch LCD Monitor, Black; Was: $369.99; $299.99; Save $70 After: $70 mail-in-rebate(s); Add to Cart; (Delivery / Pick-Up); Penny Shopping; Compare

CS583, Bing Liu, UIC


Ads vs. search results

Reproduced from Ullman & Rajaraman with permission

Ads vs. search results Search advertising is the revenue model • Multi-billion-dollar industry • Advertisers pay for clicks on their ads

Interesting problems • How to pick the top 10 results for a search from 2,230,000 matching pages? • What ads to show for a search? • If I’m an advertiser, which search terms should I bid on and how much to bid? Reproduced from Ullman & Rajaraman with permission


What’s Web Mining?
Discovering interesting and useful information from Web content and usage
• Web search: Google, Yahoo, MSN, Ask, …
• Specialized search: e.g. Froogle (comparison shopping), job ads (Flipdog)
• eCommerce:
  • Recommendations: e.g. Netflix, Amazon
  • Improving conversion rate: next best product to offer
• Advertising, e.g. Google Adsense
• Fraud detection: click fraud detection, …
• Improving Web site design and performance

Web Mining • Web mining - data mining techniques to

automatically discover and extract information from Web documents/services (Etzioni, 1996). • Web mining research – integrate research from several research communities (Kosala and Blockeel, July 2000) such as: • Database (DB) • Information retrieval (IR) • The sub-areas of machine learning (ML) • Natural language processing (NLP)


Web Mining • The World Wide Web may have more opportunities

for data mining than any other area • However, there are serious challenges: • It is too huge • Complexity of Web pages is greater than any traditional

text document collection • It is highly dynamic • It has a broad diversity of users • Only a tiny portion of the information is truly useful

How big is the Web ?

Technically, infinite

Because of dynamically generated content Lots of duplication (30-40%)

Number of pages

Best estimate of “unique” static HTML pages comes from search engine claims

Google = 8 billion, Yahoo = 20 billion Lots of marketing hype

Reproduced from Ullman & Rajaraman with permission


Why Mine the Web? • Enormous wealth of textual information on the Web. • Book/CD/Video stores (e.g., Amazon) • Restaurant information (e.g., Zagats) • Car prices (e.g., Carpoint)

• Lots of data on user access patterns • Web logs contain sequence of URLs accessed by users

• Possible to retrieve “previously unknown” information
  • People who ski also frequently break their leg.
  • Restaurants that serve sea food in California are likely to be outside San Francisco

In May 2014: 975,262,468 sites, 16 million more than the previous month

http://news.netcraft.com/archives/category/web-server-survey/


Unique Features of the Web • The Web is a huge collection of documents

where many contain: • Hyper-link information • Access and usage information

• The Web is very dynamic • Web pages are constantly being generated (removed)

Challenge: Develop new Web mining algorithms to . . . •Exploit hyper-links and access patterns. •Be adaptable to its documents source

Web Mining vs Data Mining

Structure

• The Web is not a relation • Textual information and linkage structure

Scale

• Usage data is huge and growing rapidly • Data generated per day is comparable to largest conventional data warehouses

Speed

• Often need to react to evolving usage patterns in real-time (e.g., merchandising) • No human in the loop


Web Mining Taxonomy

Web Mining

Web Content Mining

Web Structure Mining

Web Usage Mining

Web Mining Taxonomy

Web Mining
• Web Content Mining
  • Web Page Content Mining
    • Identify information within given web pages
    • Distinguish personal home pages from other web pages
  • Search Result Mining
    • Categorizes documents using phrases in titles and snippets
• Web Structure Mining
  • Uses interconnections between web pages to give weight to pages
• Web Usage Mining
  • General Access Pattern Tracking
    • Understand access patterns and trends to improve structure
  • Customized Usage Tracking
    • Analyzes access patterns of a user to improve response


Mining the World Wide Web
Web Mining → Web Content Mining → Web Page Content Mining

Web Page Summarization
• WebOQL (Mendelzon et al. 1998) …: Web structuring query languages; can identify information within given web pages
• (Etzioni et al. 1997): Uses heuristics to distinguish personal home pages from other web pages
• ShopBot (Etzioni et al. 1997): Looks for product prices within web pages


Mining the World Wide Web
Web Mining → Web Content Mining → Search Result Mining

Search Engine Result Summarization
• Clustering Search Result (Leouski and Croft, 1996, Zamir and Etzioni, 1997): Categorizes documents using phrases in titles and snippets


Mining the World Wide Web
Web Mining → Web Structure Mining

Using Links
• PageRank (Brin et al., 1998)
• CLEVER (Chakrabarti et al., 1998)
Use interconnections between web pages to give weight to pages.

Using Generalization
• MLDB (1994)
Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.

Mining the World Wide Web
Web Mining → Web Usage Mining → General Access Pattern Tracking

• Web Log Mining (Zaïane, Xin and Han, 1998)
Uses KDD techniques to understand general access patterns and trends. Can shed light on better structure and grouping of resource providers.


Mining the World Wide Web
Web Mining → Web Usage Mining → Customized Usage Tracking

• Adaptive Sites (Perkowitz and Etzioni, 1997)
Analyzes access patterns of each user at a time. Web site restructures itself automatically by learning from user access patterns.

Web Content Mining Approaches
• Information Retrieval Approach
  • To assist or improve information finding, or to filter information for users, usually based on either inferred or solicited user profiles.
• Database Approach
  • To model the data on the Web and integrate it so that queries more sophisticated than keyword-based ones can be performed.


Web Content Mining

View of Data
  IR View: Unstructured, Semi-structured
  DB View: Semi-structured, Web site as DB

Main Data
  IR View: Text documents, Hypertext documents
  DB View: Hypertext documents

Representation
  IR View: Bag of words, n-grams; Terms, phrases; Concepts or ontology; Relational
  DB View: Edge-labeled graph; Relational

Methods
  IR View: Machine Learning, Statistics
  DB View: ILP, Association rules

Applications
  IR View: Categorization, Clustering, Finding extraction rules, Finding patterns in text, User modeling
  DB View: Finding frequent substructures, Web site schema discovery

Issues in Web Content Mining
• Developing intelligent tools for IR
  • Finding keywords & key phrases
  • Discovering grammatical rules & collocations
• Hypertext classification/categorization
  • Extracting key phrases from HTML documents
• Learning extraction models/rules
• Hierarchical clustering
  • Predicting word relationships
• Building Web query systems (WebOQL, XMLQL)
• Mining multimedia data


Web Structure Mining

View of Data: Links structure
Main Data: Links structure
Representation: Graph
Methods: Proprietary algorithms
Applications: Categorization, Clustering

Web Structure Mining
• Discovers the link structure of hyperlinks at the inter-document level in order to build a structural summary of a web site
  • Direction 1: based on hyperlinks, categorizing web pages & the generated information
  • Direction 2: discovering the structure of the web documents themselves
  • Direction 3: discovering the nature of the hierarchy/network of hyperlinks in a particular web site


Web Structure Mining
• Finding authoritative web pages
  • Retrieving pages that are not only relevant but also of high quality/authoritative on the topic
• Hyperlinks can indicate authority
  • The Web also contains hyperlinks from one page to another
  • Hyperlinks carry a large amount of human annotation
  • A hyperlink pointing to another page can be regarded as the author's endorsement of that page
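
The idea that links act as endorsements can be sketched with a simplified PageRank-style power iteration. This is a textbook simplification under stated assumptions (a damping factor of 0.85, a fixed number of iterations, dangling pages spreading their rank evenly), not the exact algorithm of any search engine; the four-page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # every page receives a small baseline share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page passes its rank on in equal parts to the pages it endorses
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # dangling page (no out-links): spread its rank over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D all link to C; C links back to A.
web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up with the highest score because three pages endorse it, and A inherits a high score from being the only page C links to, which shows that the weight of an endorsement depends on who gives it.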

Web Usage Mining

View of Data: Interactivity
Main Data: Server logs, Browser logs
Representation: Relational table, Graph
Methods: Machine learning, Statistics, Association rules
Applications: Site construction, adaptation & management; Marketing; User modeling


Web Usage Mining
• Web usage mining is also called Web log mining
• A mining technique for discovering interesting usage patterns from the secondary data derived from users' interactions while browsing the web


Web Usage Mining
• Applications
  • Targeting potential customers for electronic products
  • Enhancing the quality and delivery of Internet information services to end users
  • Improving web server system performance
  • Identifying potential advertising locations
  • Facilitating personalization/adaptive sites
  • Improving site design
  • Fraud/intrusion detection
  • Predicting user actions


Log Data - Simple Analysis • Statistical analysis of users

– Length of path – Viewing time – Number of page views • Statistical analysis of site

– Most common pages viewed – Most common invalid URL


Web Log – Data Mining Applications • Association rules

– Find pages that are often viewed together • Clustering

– Cluster users based on browsing patterns – Cluster pages based on content • Classification

– Relate user attributes to patterns
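
As a sketch of the association-rule application, the snippet below finds pairs of pages that are often viewed together and reports support and confidence for each direction. The sessions and the minimum support threshold are made up, and only page pairs are considered; full association rule mining (e.g. Apriori) also handles larger itemsets.

```python
from collections import Counter
from itertools import combinations


def page_pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent page pairs."""
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))          # each page counts once per session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                   # fraction of sessions with both pages
        if support >= min_support:
            rules.append((a, b, support, count / page_counts[a]))  # a -> b
            rules.append((b, a, support, count / page_counts[b]))  # b -> a
    return rules


# Hypothetical sessions reconstructed from a web log.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
rules = page_pair_rules(sessions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Confidence is directional: in this toy data every session that reaches `/cart` also visited `/products`, but not the other way round.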

Common Log Format
• Remotehost: browser hostname or IP #
• Remote log name of user (almost always "-" meaning "unknown")
• Authuser: authenticated username
• Date: date and time of the request
• "request": exact request line from client
• Status: the HTTP status code returned
• Bytes: the content-length of response


SERVER LOGS


Fields
• Client IP: 128.101.228.20
• Authenticated User ID: - -
• Time/Date: [10/Nov/1999:10:16:39 -0600]
• Request: "GET / HTTP/1.0"
• Status: 200
• Bytes:
• Referrer: "-"
• Agent: "Mozilla/4.61 [en] (WinNT; I)"
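
A log line with these fields can be parsed with a regular expression, after which the simple statistics mentioned earlier become one-liners. The sketch below assumes the common/combined log layout; the byte count (1043) and the second log line are made-up values, since the slide leaves the Bytes field empty.

```python
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'  # optional combined-format fields
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# The byte count and the second line are hypothetical.
lines = [
    '128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] "GET / HTTP/1.0" 200 1043 '
    '"-" "Mozilla/4.61 [en] (WinNT; I)"',
    '128.101.228.20 - - [10/Nov/1999:10:16:45 -0600] "GET /missing.html HTTP/1.0" 404 -',
]
records = [r for r in map(parse_line, lines) if r]
status_counts = Counter(r["status"] for r in records)           # e.g. to catch 404 errors
page_counts = Counter(r["request"].split()[1] for r in records)  # most common pages viewed
print(status_counts)
print(page_counts.most_common(1))
```

Once the lines are turned into records like these, they can be grouped into sessions per host and fed to the association-rule, clustering, or classification steps.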


Searching the Web

The Web

Content aggregators

Content consumers

Reproduced from Ullman & Rajaraman with permission

Web search basics
[Figure: architecture of web search. Components: User, Search, The Web, Web crawler, Indexer, Indexes, Ad indexes. The accompanying screenshot shows a results page for the query "miele" (Results 1 - 10 of about 7,310,000, 0.12 seconds) with organic results such as www.miele.com, www.miele.co.uk, www.miele.de and www.miele.at next to Sponsored Links.]

Reproduced from Ullman & Rajaraman with permission


Mining the Web
[Figure: Web → Spider → Documents source → IR / IE System (driven by a Query) → Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]

Data Mining: Principles and Algorithms

Search Engine: Ranking based on link structure analysis
[Figure: search engine architecture with the components Web Pages, Web Page Parser, Forward Index, Forward Link, Meta Data, URL Dictionary, Web Graph Constructor, Web Topology Graph, Anchor Text Generator, Backward Link (Anchor Text), Indexer, Inverted Index, Term Dictionary (Lexicon), Rank Functions, Relevance Ranking (similarity based on content or text), Importance Ranking (Link Analysis), Search]


Layout Structure
• Compared to plain text, a web page is a 2D presentation
  • Rich visual effects created by different term types, formats, separators, blank areas, colors, pictures, etc.
  • Different parts of a page are not equally important

Title: CNN.com International
H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD …
TEXT BODY (with position and font type): The International Atomic Energy Agency has concluded that Iran has secretly produced small amounts of nuclear materials including low enriched uranium and plutonium that could be used to develop nuclear weapons according to a confidential report obtained by CNN…
Hyperlink:
• URL: http://www.cnn.com/...
• Anchor Text: Al Qaeda…
Image:
• URL: http://www.cnn.com/image/...
• Alt & Caption: Iran nuclear …
Anchor Text: CNN Homepage News …


Web Page Block—Better Information Unit Web Page Blocks

Importance = Low

Importance = Med

Importance = High


Web Usage Mining Applications: Simple and Basic: • Monitor performance, bandwidth usage • Catch errors (404 errors- pages not found) • Improve web site design • (shortcuts for frequent paths, remove links not used, etc)

Advanced and Business Critical : • eCommerce: improve conversion, sales, profit • Fraud detection: click stream fraud, …

Web Usage Mining – Three Phases


Web Usage Mining Issues • Identification of exact user not possible. • Exact sequence of pages referenced by a user not

possible due to caching. • Session not well defined • Security, privacy, and legal issues

Systems Issues
• Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server!
  • Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
  • Without breaking the bank!


Ontology Learning
[Figure: is-a hierarchy over the concepts root, furnishing, event, area, accomodation, region, city, hotel, youth hostel, wellness hotel, ...]

Association Rule Mining
Derived concept pairs: (wellness hotel, area), (hotel, area), (accomodation, area)
Generalized Conceptual Relation: hasLocation(accomodation, area)

[Mädche, Staab: ECAI 2000]

Semantic Web Structure/Content Mining Ontology

name

GolfCourse

FORALL X, Y Y: Hotel[cooperatesWith ->> X] > Y].

cooperatesWith

Organization belongsTo Hotel

Knowledge base Hotel: Wellnesshotel GolfCourse: Seaview belongsTo(Seaview, Wellnesshotel)

ILP Based Association Rule Mining, eg. [Dehaspe, Toivonen, J. DMKD 1998]

... Hotel(x), GolfCourse(y), belongsTo(y,x) → hasStars(x,5)
support = 0.4 %, confidence = 89 %


Complex Data Types Summary
• Emerging areas of mining complex data types:
  • Text mining can be done quite effectively, especially if the documents are semi-structured
  • Web mining is more difficult due to lack of such structure
    • Data includes text documents, hypertext documents, link structure, and logs
    • Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification
