Data Mining: Concepts and Techniques
Web Mining
Li Xiong

Slide credits: Jiawei Han and Micheline Kamber; Anand Rajaraman and Jeffrey D. Ullman; Olfa Nasraoui; Bing Liu
4/9/2008

Web Mining

- Web mining vs. data mining:
  - Structure (or lack of it): linkage structure combined with a lack of structure in the textual information
  - Scale: data generated per day is comparable to the largest conventional data warehouses
  - Speed: often need to react to evolving usage patterns in real time (e.g., merchandising)

Web Mining

- Structure mining: extracting information from the topology of the Web (links among pages)
- Content mining: extracting information from page content (text, images, audio, video, etc.); draws on natural language processing and information retrieval
- Usage mining: extracting information from users' usage data on the Web (how users visit pages or make transactions)


Web Mining

- Web structure mining: web graph structure and link analysis
- Web text mining: text representation and IR models
- Web usage mining: collaborative filtering


Structure of the Web Graph

- Web as a directed graph: pages = nodes, hyperlinks = edges
- Problem: understand the macroscopic structure and evolution of the web graph
- Practical implications: crawling, browsing, computation of link analysis algorithms
- Power-law degree distribution (source: Broder et al., 2000)
- Bow-tie structure (Broder et al., 2000)
- The daisy structure (Donato et al., 2005)


Link Analysis

- Problem: exploit the link structure of a graph to order or prioritize the set of objects within the graph
- An application of social network analysis at the actor level: centrality and prestige
- Algorithms: PageRank, HITS

PageRank (Brin & Page '98)

- Intuition: web pages are not equally "important"
- Links as citations: a page cited often is more important
  - www.stanford.edu has 23,400 inlinks; www.joe-schmoe.com has 1 inlink
- Recursive model: links from heavily linked pages are weighted more
- PageRank is essentially eigenvector prestige in social network analysis

Simple Recursive Flow Model

- Each link's vote is proportional to the importance of its source page
- If page P with importance x has n outlinks, each link gets x/n votes
- Page P's own importance is the sum of the votes on its inlinks

Example (three pages: Yahoo = y, Amazon = a, Microsoft = m):

  y = y/2 + a/2
  a = y/2 + m
  m = a/2

Solving these equations with the constraint y + a + m = 1 gives y = 2/5, a = 2/5, m = 1/5.

Matrix Formulation

- Web link matrix M: one row and one column per web page

    M_ij = 1/|O(j)|  if page j links to page i
    M_ij = 0         otherwise

  where |O(j)| is the number of outlinks of page j
- Rank vector r: one entry per web page
- Flow equation: r = Mr, i.e., r is an eigenvector of M

Matrix Formulation Example

         y    a    m
    y [ 1/2  1/2   0 ]
    a [ 1/2   0    1 ]
    m [  0   1/2   0 ]

r = Mr is equivalent to the flow equations:

  y = y/2 + a/2
  a = y/2 + m
  m = a/2

Power Iteration Method

Solving the equation r = Mr:

- Suppose there are N web pages
- Initialize: r_0 = [1/N, ..., 1/N]^T
- Iterate: r_{k+1} = M r_k
- Stop when |r_{k+1} - r_k|_1 < ε
  - |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm
  - Any other vector norm, e.g. Euclidean, can also be used

Power Iteration Example (same Yahoo/Amazon/Microsoft graph)

         y    a    m
    y [ 1/2  1/2   0 ]
    a [ 1/2   0    1 ]
    m [  0   1/2   0 ]

  [y, a, m]: [1/3, 1/3, 1/3] → [1/3, 1/2, 1/6] → [5/12, 1/3, 1/4] → [3/8, 11/24, 1/6] → ... → [2/5, 2/5, 1/5]
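The power iteration above is straightforward to implement; here is a minimal sketch in plain Python (the function name and stopping tolerance are illustrative choices, not from the slides):

```python
def power_iteration(M, eps=1e-10):
    """Iterate r <- M r until the L1 change falls below eps."""
    n = len(M)
    r = [1.0 / n] * n  # start from the uniform vector [1/N, ..., 1/N]
    while True:
        r_new = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_new, r)) < eps:
            return r_new
        r = r_new

# The Yahoo/Amazon/Microsoft example; columns are y, a, m
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)  # converges to [2/5, 2/5, 1/5]
```

Because this example graph is strongly connected and aperiodic (y links to itself), the iteration converges without teleports; in general that is not guaranteed, as the next slides discuss.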

Random Walk Interpretation

- Imagine a random web surfer:
  - At any time t, the surfer is on some page P
  - At time t+1, the surfer follows an outlink from P uniformly at random
  - The surfer ends up on some page Q linked from P
  - The process repeats indefinitely
- p(t) is the probability distribution whose i-th component is the probability that the surfer is at page i at time t

The Stationary Distribution

- Where is the surfer at time t+1? p(t+1) = M p(t)
- Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t)
  - Then p(t) is a stationary distribution for the random walk
- Our rank vector r satisfies r = Mr, so it is such a stationary distribution

Existence and Uniqueness of the Solution

Theory of random walks (a.k.a. Markov processes): a finite Markov chain defined by a stochastic matrix has a unique stationary probability distribution if the matrix is irreducible and aperiodic.

Source: Mining and Searching Graphs in Graph Databases

M Is Not a Stochastic Matrix

- M is the transition matrix of the Web graph:

    M_ij = 1/|O(j)|  if page j links to page i
    M_ij = 0         otherwise

- A stochastic matrix would satisfy Σ_i M_ij = 1 for every column j, but M does not:
- Many web pages have no outlinks; such pages are called dangling pages, and their columns sum to 0.

Source: CS583, Bing Liu, UIC

M Is Not Irreducible

- Irreducible means that the Web graph G is strongly connected.
- Definition: a directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v ∈ V, there is a path from u to v.
- A general Web graph is not irreducible, because for some pairs of nodes u and v there is no path from u to v.

M Is Not Aperiodic

- A state i in a Markov chain is periodic if there is a directed cycle that the chain is forced to traverse.
- Definition: a state i is periodic with period k > 1 if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k.
- If a state is not periodic (i.e., k = 1), it is aperiodic.
- A Markov chain is aperiodic if all its states are aperiodic.

Solution: Random Teleports

- Conceptually, add a link from each page to every page
- At each time step, the random surfer:
  - with probability β, follows an outlink of the current page chosen at random
  - with probability 1-β, jumps to a page chosen uniformly at random
- Common values for β are in the range 0.8 to 0.9

Random Teleports Example (β = 0.8)

In this example Microsoft links only to itself (a spider trap), which is why teleports are needed:

          [ 1/2 1/2  0 ]          [ 1/3 1/3 1/3 ]     y [ 7/15 7/15  1/15 ]
    0.8 * [ 1/2  0   0 ]  + 0.2 * [ 1/3 1/3 1/3 ]  =  a [ 7/15 1/15  1/15 ]
          [  0  1/2  1 ]          [ 1/3 1/3 1/3 ]     m [ 1/15 7/15 13/15 ]

Iterating from [y, a, m] = [1, 1, 1]:

  [1, 1, 1] → [1.00, 0.60, 1.40] → [0.84, 0.60, 1.56] → [0.776, 0.536, 1.688] → ... → [7/11, 5/11, 21/11]

Matrix Formulation with Teleports

- Matrix A: A_ij = β M_ij + (1-β)/N
  - M_ij = 1/|O(j)| when j → i, and M_ij = 0 otherwise
- Verify that A is a stochastic matrix
- The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar
- Equivalently, r is the stationary distribution of the random walk with teleports

Advantages and Limitations of PageRank

- Advantages:
  - Helps fight spam
  - PageRank is a global measure and is query independent
  - Can be computed offline
- Criticism: query independence
  - It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.

HITS: Capturing Authorities and Hubs (Kleinberg '98)

- Intuitions:
  - Pages that are widely cited are good authorities
  - Pages that cite many other pages are good hubs
- HITS (Hypertext-Induced Topic Selection): when the user issues a search query, HITS expands the list of relevant pages returned by a search engine and produces two rankings:
  1. Authorities: pages containing useful information, linked to by hubs
     - e.g., course home pages; home pages of auto manufacturers
  2. Hubs: pages that link to authorities
     - e.g., a course bulletin; a list of US auto manufacturers

Matrix Formulation

- Transition (adjacency) matrix A: A[i, j] = 1 if page i links to page j, 0 otherwise
- Hub score vector h: a page's hub score is proportional to the sum of the authority scores of the pages it links to:

    h = λAa, where the constant λ is a scale factor

- Authority score vector a: a page's authority score is proportional to the sum of the hub scores of the pages that link to it:

    a = μA^T h, where the constant μ is a scale factor

Transition Matrix Example

        y  a  m
    y [ 1  1  1 ]
A = a [ 1  0  1 ]
    m [ 0  1  0 ]

Iterative Algorithm

- Initialize h and a to all 1's
- h = Aa; scale h so that its max entry is 1.0
- a = A^T h; scale a so that its max entry is 1.0
- Continue until h and a converge

Iterative Algorithm Example

        [ 1 1 1 ]          [ 1 1 0 ]
    A = [ 1 0 1 ]    A^T = [ 1 0 1 ]
        [ 0 1 0 ]          [ 1 1 0 ]

  h(yahoo), h(amazon), h(m'soft): [1, 1, 1] → [1, 2/3, 1/3] → [1, 0.71, 0.29] → [1, 0.73, 0.27] → ... → [1.000, 0.732, 0.268]
  a(yahoo), a(amazon), a(m'soft): [1, 1, 1] → [1, 4/5, 1] → [1, 0.75, 1] → ... → [1, 0.732, 1]
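The hub/authority iteration fits in a few lines of plain Python (function name and fixed iteration count are illustrative choices):

```python
def hits(A, iters=100):
    """Alternate h = A a and a = A^T h, rescaling each so its max entry is 1."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        m = max(h)
        h = [x / m for x in h]
        a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]
        m = max(a)
        a = [x / m for x in a]
    return h, a

A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a = hits(A)  # h ≈ [1.000, 0.732, 0.268], a ≈ [1, 0.732, 1]
```

On the example graph the scores converge to the values shown on the slide; 0.732 is √3 − 1, the corresponding component of the principal eigenvector.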

Existence and Uniqueness of the Solution

  h = λAa,  a = μA^T h
  ⇒ h = λμ AA^T h,  a = λμ A^T A a

Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:

- h* is the principal eigenvector of the matrix AA^T
- a* is the principal eigenvector of the matrix A^T A

Strengths and Weaknesses of HITS

- Strength: it ranks pages according to the query topic, which may provide more relevant authority and hub pages.
- Weaknesses:
  - Easily spammed
  - Topic drift
  - Inefficiency at query time

PageRank and HITS

- Model:
  - PageRank: depends on the links into S
  - HITS: also depends on the values of the other links out of S
- Characteristics: spam resistance, query independence
- Destinies post-1998:
  - PageRank: trademark of Google
  - HITS: not commonly used by search engines (Ask.com?)

Web Mining

- Web structure mining: web graph structure, link analysis
- Web text mining
- Web usage mining: collaborative filtering

Text Mining

- Text mining refers to data mining using text documents as data.
- Tasks: text summarization, text classification, text clustering, ...
- Intersects with information retrieval and natural language processing.

Levels of Text Representations

- Character (character n-grams and sequences)
- Words (stop-words, stemming, lemmatization)
- Phrases (word n-grams, proximity features)
- Part-of-speech tags
- Taxonomies / thesauri
- Vector-space model
- Language models
- Full parsing
- Cross-modality
- Collaborative tagging / Web 2.0
- Templates / frames
- Ontologies / first-order theories

N-Grams

- An n-gram is a subsequence of n items from a given sequence.
  - The items can be characters, words, or base pairs, according to the application.
  - n = 1, 2, 3: unigram, bigram, trigram
- Example: 4-grams from the Google n-gram corpus, with counts:

    serve as the incoming    (92)
    serve as the incubator   (99)
    serve as the independent (794)
    serve as the index       (223)
    serve as the indication  (72)
    serve as the indicator   (120)
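Extracting word n-grams from a token sequence is nearly a one-liner; a minimal sketch in Python (function name is an illustrative choice):

```python
def ngrams(tokens, n):
    """Return all contiguous subsequences of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "serve as the index".split()
fourgrams = ngrams(tokens, 4)  # [('serve', 'as', 'the', 'index')]
bigrams = ngrams(tokens, 2)    # [('serve', 'as'), ('as', 'the'), ('the', 'index')]
```

Counting such tuples over a large corpus is how frequency tables like the Google example above are produced.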

Bag-of-Words Document Representation

Vector Space Model

- Each document is represented as a vector.
- Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinct words/terms in the collection; V is called the vocabulary.
- A weight wij > 0 is associated with each term ti of a document dj; for a term that does not appear in document dj, wij = 0.

    dj = (w1j, w2j, ..., w|V|j)

TF-IDF Weighting

- TF: term frequency
- IDF: inverse document frequency

    tfidf(w) = tf(w) · log( N / df(w) )

- tf(w): term frequency (number of occurrences of the word in a document)
- df(w): document frequency (number of documents containing the word)
- N: total number of documents
- tfidf(w): relative importance of the word in the document
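The weighting can be computed over a whole collection in a few lines; a minimal Python sketch (the documents-as-token-lists representation and function name are illustrative assumptions):

```python
import math

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: tf * log(N/df)} dict per doc."""
    N = len(docs)
    df = {}  # document frequency per term
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weights = []
    for doc in docs:
        tf = {}  # raw term frequency within this document
        for w in doc:
            tf[w] = tf.get(w, 0) + 1
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights

docs = [["web", "mining", "web"], ["web", "search"]]
w = tfidf_weights(docs)
# "web" appears in every document, so its idf is log(2/2) = 0 and its weight is 0;
# "mining" appears in one of two documents and gets weight 1 * log(2) ≈ 0.693
```

Note that real systems often smooth the idf term to avoid zero weights; the raw formula above matches the slide.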

Similarity Between Document Vectors

- Each document is represented as a vector of weights
- Cosine similarity (normalized dot product) is the most widely used similarity measure between two document vectors:
  - it calculates the cosine of the angle between the document vectors
  - it is efficient to calculate (a sum of products over the intersecting words)
  - the similarity value is between 0 (different) and 1 (the same)

    Sim(D1, D2) = Σ_i x_1i x_2i / ( sqrt(Σ_j x_1j²) · sqrt(Σ_k x_2k²) )
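With documents stored as sparse {term: weight} dicts, the dot product only needs to touch the intersecting words; a minimal sketch in Python (representation and function name are illustrative):

```python
import math

def cosine_sim(d1, d2):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

d1 = {"web": 1.0, "mining": 2.0}
d2 = {"web": 2.0, "mining": 4.0}  # same direction, different length
d3 = {"search": 1.0}              # no words in common
# cosine_sim(d1, d2) ≈ 1.0, cosine_sim(d1, d3) == 0.0
```

Because the measure normalizes by vector length, documents with the same term proportions are maximally similar regardless of their lengths.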

Web Mining

- Web structure mining: web graph structure, link analysis
- Web text mining
- Web usage mining: collaborative filtering

Web Usage Data

- Web logs (low level): track queries and the individual pages/items requested by a web browser
- Application logs (higher level): when customers check in and check out, when items are placed in or removed from the shopping cart, etc.

Web Usage Mining

- Association rule mining: discover associations between pages and products
- Sequential pattern discovery: help discover visit patterns and make predictions about them
- Clustering: group similar sessions into clusters, which may correspond to user profiles / modes of usage of the website
- Collaborative filtering: filter/recommend pages and products based on similar users

Collaborative Filtering: Motivation

- User perspective: lots of web pages, online products, books, movies, etc. "Reduce my choices... please..."
- Manager perspective: "If I have 3 million customers on the web, I should have 3 million stores on the web." (CEO of Amazon.com [SCH01])

Source: Data Mining: Principles and Algorithms

Basic Approaches

- Collaborative filtering (CF): based on the active user's history and on other users' collective behavior
- Content-based filtering: based on keywords and other features

Collaborative Filtering: A Framework

- Users U = {u1, ..., um} and items I = {i1, ..., in} form a (mostly empty) ratings matrix: some entries are known (e.g., u1 rated i1 with 3), while the entry rij for user ui and item ij is unknown.
- The task:
  - Q1: find the unknown ratings
  - Q2: decide which items to recommend to this user
- Formally: learn the unknown function f: U × I → R

Collaborative Filtering: Main Methods

- User-user methods:
  - Memory-based: k-NN
  - Model-based: clustering
- Item-item methods:
  - Correlation analysis
  - Linear regression
  - Belief networks
  - Association rule mining

User-User Method: Intuition

Given a target customer:
- Q1: How to measure similarity between users?
- Q2: How to select neighbors?
- Q3: How to combine the neighbors' ratings?

How to Measure Similarity?

- Pearson correlation coefficient, computed over the items j that users a and i have both rated:

    w_p(a, i) = Σ_j (r_aj - r̄_a)(r_ij - r̄_i) / ( sqrt(Σ_j (r_aj - r̄_a)²) · sqrt(Σ_j (r_ij - r̄_i)²) )

- Cosine measure: users are vectors in product-dimension space

    w_c(a, i) = (r_a · r_i) / ( ||r_a||_2 · ||r_i||_2 )

Nearest Neighbor Approaches [SAR00a]

- Offline phase: do nothing... just store the transactions
- Online phase:
  - Identify users highly similar to the active one:
    - the best K, or
    - all with a similarity measure greater than a threshold
  - Prediction: user a's estimated rating for item j is a's neutral (average) rating plus a's estimated deviation, a similarity-weighted average of the neighbors' deviations:

    r̂_aj = r̄_a + Σ_i w(a, i)(r_ij - r̄_i) / Σ_i w(a, i)

Clustering [BRE98]

- Offline phase: build the clusters (k-means, k-medoids, etc.)
- Online phase:
  - Identify the cluster nearest to the active user
  - Prediction: use the center of the cluster, or a weighted average over the cluster members (with weights depending on the active user)

Clustering vs. k-NN Approaches

- k-NN with the Pearson measure is slower but more accurate
- Clustering is more scalable
- (Figure: an active user receiving bad recommendations from an ill-fitting cluster.)

References: Link Analysis

- S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine (PageRank). Computer Networks and ISDN Systems, 1998.
- J. Kleinberg. Authoritative sources in a hyperlinked environment (HITS). ACM-SIAM Symp. on Discrete Algorithms, 1998.
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the link structure of the World Wide Web. IEEE Computer, 1999.
- D. Cai, X. He, J. Wen, and W. Ma. Block-level link analysis. SIGIR 2004.


References: Collaborative Filtering

- C. C. Aggarwal, J. L. Wolf, K.-L. Wu, and P. S. Yu. Horting hatches an egg: a new graph-theoretic approach to collaborative filtering. KDD 1999: 201-212.
- [BRE98] J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. Proc. 14th Conf. on Uncertainty in Artificial Intelligence, Madison, July 1998.
- Y. H. Cho and J. K. Kim. Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 2003.
- W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. Advances in Neural Processing Systems 10, Denver, CO, 1997.
- T. Kamishima. Nantonac collaborative filtering: recommendation based on order responses. KDD 2003: 583-588.
- C.-H. Lee, Y.-H. Kim, and P.-K. Rhee. Web personalization expert with combining collaborative filtering and association rule mining technique. Expert Systems with Applications, 21(3):131-137, October 2001.

References: Collaborative Filtering

- W. Lin. Online presentation, 2001: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/LinAlvarezRuiz_WebKDD2000.ppt
- W. Lin, S. A. Alvarez, and C. Ruiz. Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6:83-105, 2002.
- G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76-80, Jan. 2003.
- [SAR00a] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Analysis of recommendation algorithms for e-commerce. ACM Conf. on Electronic Commerce 2000: 158-167.
- B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender systems: a case study. ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.
- B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. WWW 2001.

References: Collaborative Filtering

- B. Sarwar. Online presentation, 2000: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/badrul.ppt
- [SCH01] J. B. Schafer, J. A. Konstan, and J. Riedl. E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1/2):115-153, 2001.
- L. H. Ungar and D. P. Foster. Clustering methods for collaborative filtering. AAAI Workshop on Recommendation Systems, 1998.
- Y.-F. Wang, Y.-L. Chuang, M.-H. Hsu, and H.-C. Keh. A personalized recommender system for the cosmetic business. Expert Systems with Applications, 26(3):427-434, April 2004.
- S. Vucetic and Z. Obradovic. A regression-based approach for scaling-up personalized recommender systems in e-commerce. ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.
- K. Yu, X. Xu, M. Ester, and H.-P. Kriegel. Selecting relevant instances for efficient and accurate collaborative filtering. Proc. 10th CIKM, pp. 239-246, ACM Press, 2001.
- C. Zhai. Online course notes, Spring 2003: http://sifaka.cs.uiuc.edu/course/2003-497CXZ/loc/cf.ppt