Barna Saha & Saket Navlakha. March 13, 2008

Author: Rosemary Norman
Survey on Information Diffusion
Barna Saha & Saket Navlakha, CMSC 828G, March 13, 2008
Slides adapted from David Kempe, Jure Leskovec, and Lada Adamic [see references]

What is Information Diffusion?
A generic term: anything that propagates over a network comes under the umbrella of "information diffusion."

Two Major Goals of Information Diffusion
• Maximizing spread of influence: opinion, idea, innovation, recommendation.
• Early detection of outbreak: disease, contamination.

Illustrative Example 1: Spread of Influence
A new product is available in the market. To whom should we give free samples to maximize the purchase of the product?

Illustrative Example 2: Early Detection of Outbreak
Contaminants are spreading in a city water distribution network. On which nodes should we place sensors to efficiently detect all possible contaminations?

Goal 1: Spread of Influence
• Models of influence with provable guarantee of goodness.
• Empirical study of models on viral marketing and recommendation datasets.

Social Network and Spread of Influence
What is a social network?
• The graph of relationships and interactions within a group of individuals.
Fundamental medium for INFLUENCE propagation among its members
• "Word of mouth" effect

Problem Setting
Given
• a limited budget B for initial advertising (e.g. give away free samples of product)
• estimates for influence between individuals
Goal
• trigger a large cascade of influence (e.g. further adoptions of a product)
Question
• Which set of individuals should B target?

What we need
• Form models of influence in social networks.
• Obtain data about a particular network (to estimate inter-personal influence).
• Devise an algorithm to maximize spread of influence.
Optimization problem first introduced by Domingos/Richardson [KDD '01/KDD '02]

Two Basic Models of Influence
• Threshold Model: we will discuss the Linear Threshold Model.
• Cascade Model: we will discuss the Independent Cascade Model.

General Operational View Some nodes start active (bought the product). Active nodes may cause others to activate, etc. Monotonicity: active nodes never deactivate.

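Linear Threshold Model [Granovetter '78]
A node $v$ is influenced by each neighbor $w$ according to a weight $b_{v,w}$ such that $\sum_{w \text{ neighbor of } v} b_{v,w} \le 1$ (e.g. Alice influences You with weight 0.7, Bob with weight 0.2).
Each node $v$ has a threshold $\theta_v$ which is chosen uniformly at random from the interval [0,1].
A node $v$ becomes active if $\sum_{w \text{ active neighbor of } v} b_{v,w} \ge \theta_v$.
[Figure: example run on a small weighted graph with nodes u, v, w, x, showing inactive nodes, active nodes, thresholds and accumulated influence; the process stops once no further node can be activated.]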
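The activation rule of the Linear Threshold Model can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `linear_threshold`, the dictionary layout `weights[v][w]` (weight of neighbor w on node v) and the optional `thresholds` argument are assumptions made for the example, not part of the slides.

```python
import random

def linear_threshold(weights, seeds, thresholds=None, rng=None):
    """Run the Linear Threshold process until no more nodes activate.

    weights[v] maps each in-neighbor w of v to its weight on v
    (hypothetical layout; the weights into v should sum to at most 1).
    thresholds may be supplied for reproducibility; otherwise each
    threshold is drawn uniformly at random from [0, 1).
    """
    rng = rng or random.Random()
    nodes = set(weights) | {w for nbrs in weights.values() for w in nbrs} | set(seeds)
    if thresholds is None:
        thresholds = {v: rng.random() for v in nodes}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in nodes - active:
            # total weight of the currently active neighbors of v
            influence = sum(b for w, b in weights.get(v, {}).items() if w in active)
            if influence >= thresholds[v]:
                active.add(v)
                changed = True
    return active

weights = {"you": {"alice": 0.7, "bob": 0.2}}
fixed = {"you": 0.8, "alice": 1.0, "bob": 1.0}
print(linear_threshold(weights, {"alice"}, thresholds=fixed))
print(linear_threshold(weights, {"alice", "bob"}, thresholds=fixed))
```

With the threshold of "you" fixed at 0.8, Alice alone (0.7) is not enough, while Alice and Bob together (0.9) activate the node.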

Independent Cascade Model
When node v becomes active, it has a single chance of activating each currently inactive neighbor w. The activation attempt succeeds with probability $p_{vw}$.
[Figure: example run on a small graph with nodes u, v, w, x and edge probabilities, showing inactive nodes, active nodes, newly active nodes, successful attempts and unsuccessful attempts; the process stops when no new node is activated.]
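A single run of the Independent Cascade process might look like the following sketch. The name `independent_cascade` and the layout `probs[v][w]` (probability that v activates w) are assumptions chosen for illustration.

```python
import random
from collections import deque

def independent_cascade(probs, seeds, rng=None):
    """Simulate one run of the Independent Cascade Model.

    probs[v] maps each out-neighbor w of v to p_vw (hypothetical layout).
    Every newly active node gets exactly one chance per inactive neighbor.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        v = frontier.popleft()
        for w, p in probs.get(v, {}).items():
            # single activation attempt on a currently inactive neighbor
            if w not in active and rng.random() < p:
                active.add(w)
                frontier.append(w)
    return active

probs = {"a": {"b": 1.0, "c": 0.0}, "b": {"d": 1.0}}
print(independent_cascade(probs, ["a"]))
```

The toy probabilities are 0 or 1 only so that the outcome is deterministic; real edge probabilities would give a random active set.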

Influence Maximization Problem
Influence of node set S: f(S)
• Expected number of active nodes at the end, if set S is the initial active set.
• Key: f(S) is submodular.
Problem:
• Given a parameter k (budget), find a k-node set S to maximize f(S).
• Constrained optimization problem with f(S) as the objective function.

Properties of f(S)
• Non-negative (obviously)
• Monotone: $f(S + v) \ge f(S)$
• Submodular: let $N$ be a finite set. A set function $f : 2^N \to \mathbb{R}$ is submodular iff for all $S \subset T \subset N$ and all $v \in N \setminus T$,
$$f(S + v) - f(S) \ge f(T + v) - f(T)$$
(diminishing returns).
[Figure: example illustrating diminishing returns for sets S ⊂ T.]
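The diminishing-returns condition can be checked by brute force on a tiny ground set. The sketch below uses a set-coverage function as a stand-in objective; the names `is_submodular`, `coverage` and the `covers` data are made up for the example.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground):
    """Check f(S+v) - f(S) >= f(T+v) - f(T) for all S within T and v outside T."""
    ground = frozenset(ground)
    for T in subsets(ground):
        for S in subsets(T):
            for v in ground - T:
                if f(S | {v}) - f(S) < f(T | {v}) - f(T):
                    return False
    return True

# Hypothetical toy data: each node "covers" a set of items.
covers = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}

def coverage(S):
    covered = set()
    for v in S:
        covered |= covers[v]
    return len(covered)

print(is_submodular(coverage, covers))                # coverage is submodular
print(is_submodular(lambda S: len(S) ** 2, covers))   # |S|^2 is not
```

The check is exponential in the size of the ground set, so it is only useful as a sanity test on very small examples.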

Bad News
For a submodular function f, if f only takes non-negative values and is monotone, finding a k-element set S for which f(S) is maximized is an NP-hard optimization problem.
It is NP-hard to determine the optimum k-element set for influence maximization for both the independent cascade model and the linear threshold model.

Good News: We can use the Greedy Algorithm!
Greedy algorithm:
• Add the node v to S that maximizes f(S + v) - f(S), for k iterations.
[Figure: marginal reward of candidate nodes a, b, c, d, e at each iteration.]
How good (bad) is it?
• Theorem: The greedy algorithm is a (1 - 1/e) approximation.
• The resulting set S activates at least (1 - 1/e) > 63% of the number of nodes that any size-k set could activate.
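The greedy rule above can be sketched generically for any set function f. The function name `greedy_max` and the coverage objective used to exercise it are assumptions for illustration; in the influence setting f would be an estimate of the expected spread.

```python
def greedy_max(f, candidates, k):
    """Pick k elements, each time adding the one with the largest marginal gain."""
    chosen = []
    remaining = set(candidates)
    for _ in range(min(k, len(remaining))):
        current = frozenset(chosen)
        base = f(current)
        # sorted() only makes tie-breaking deterministic
        best = max(sorted(remaining), key=lambda v: f(current | {v}) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical toy objective: number of items covered by the chosen nodes.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

def coverage(S):
    covered = set()
    for v in S:
        covered |= covers[v]
    return len(covered)

picked = greedy_max(coverage, covers, 2)
print(picked, coverage(frozenset(picked)))
```

Each iteration evaluates f once per remaining candidate, so the cost is dominated by how expensive f is to evaluate.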

Submodularity for Independent Cascade
• Coins for edges are flipped during activation attempts.
• Can pre-flip all coins and reveal results immediately.
• Active nodes in the end are reachable via green paths from initially targeted nodes.
• Study reachability in green graphs.
[Figure: example graph with edge probabilities; edges whose coin flip succeeded are drawn green.]
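The "pre-flip all coins" view can be sketched as two steps: sample a green graph, then compute reachability from the seed set. Function names and the `probs[v][w]` layout are assumptions made for this example.

```python
import random

def sample_green_graph(probs, rng):
    """Flip one coin per edge up front and keep only the successful (green) edges."""
    return {v: [w for w, p in nbrs.items() if rng.random() < p]
            for v, nbrs in probs.items()}

def reachable(green, seeds):
    """Nodes reachable from seeds along green edges (depth-first search)."""
    seen = set(seeds)
    stack = list(seeds)
    while stack:
        v = stack.pop()
        for w in green.get(v, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

probs = {"a": {"b": 1.0, "c": 0.0}, "b": {"d": 1.0}}
green = sample_green_graph(probs, random.Random(0))
print(green)
print(reachable(green, ["a"]))
```

For a fixed green graph the set of active nodes is just a reachability set, which is what makes the submodularity argument go through.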

Graph Reachability Is Submodular
From the picture: $g(T + v) - g(T) \subseteq g(S + v) - g(S)$ when $S \subseteq T$.

Submodularity of f(S)
Fact: a non-negative linear combination of submodular functions is submodular.

$$f(S) = \sum_{G} \mathrm{Prob}(G \text{ is green graph}) \cdot g_G(S)$$

• $g_G(S)$: nodes reachable from S in G.
• Each $g_G(S)$ is submodular (previous slide).
• Probabilities are non-negative.

Submodularity for Linear Threshold
• Use a similar "green graph" idea.
• Once a graph is fixed, the "reachability" argument is identical.
• How do we fix a green graph now? Each node picks at most one incoming edge, with probabilities proportional to edge weights.
• Equivalent to the linear threshold model (trickier proof).
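One way to sample such a green graph is sketched below. It assumes that each node picks the incoming edge from w with probability equal to the weight $b_{v,w}$ and picks no edge with the remaining probability; the slide only says "proportional to edge weights", so this reading, along with all names used, is an assumption.

```python
import random

def sample_lt_green_graph(weights, rng):
    """For each node v, pick at most one incoming edge.

    weights[v] maps in-neighbor w to its weight on v (assumed to sum to <= 1).
    Returns picked[v] = chosen in-neighbor, or None if no edge was picked.
    """
    picked = {}
    for v, in_nbrs in weights.items():
        r = rng.random()
        choice = None
        cumulative = 0.0
        for w, b in in_nbrs.items():
            cumulative += b
            if r < cumulative:
                choice = w
                break
        picked[v] = choice
    return picked

def reachable_from(picked, seeds):
    """Nodes reachable from seeds when each picked edge points from w to v."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, w in picked.items():
            if v not in active and w in active:
                active.add(v)
                changed = True
    return active

weights = {"you": {"alice": 0.7, "bob": 0.2}, "carol": {"you": 0.5}}
picked = sample_lt_green_graph(weights, random.Random(42))
print(picked, reachable_from(picked, {"alice"}))
```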

Evaluating f(S)
How to evaluate f(S)? How to compute it efficiently is still an open question.
But: very good estimates are possible by simulation:
• repeat the diffusion process often enough (polynomial in n and 1/ε);
• this achieves a (1 ± ε)-approximation to f(S).
A generalization of the Nemhauser/Wolsey proof shows: the greedy algorithm is now a (1 - 1/e - ε′)-approximation.
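Estimating f(S) by simulation amounts to averaging the final cascade size over many runs. The sketch below does this for the Independent Cascade Model; the function names, the `probs[v][w]` layout and the default of 1000 runs are arbitrary choices for the example, not values taken from the slides.

```python
import random

def cascade_size(probs, seeds, rng):
    """Size of the final active set in one Independent Cascade run."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        v = frontier.pop()
        for w, p in probs.get(v, {}).items():
            if w not in active and rng.random() < p:
                active.add(w)
                frontier.append(w)
    return len(active)

def estimate_spread(probs, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of f(S): the mean cascade size over `runs` simulations."""
    rng = random.Random(seed)
    return sum(cascade_size(probs, seeds, rng) for _ in range(runs)) / runs

probs = {"a": {"b": 0.5}}
print(estimate_spread(probs, ["a"], runs=20000))  # true value is 1.5
```

Such an estimator can be plugged in as the objective of a greedy selection loop, at the cost of many simulations per candidate.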

General Model
• Independent Cascade and Linear Threshold are two specific models.
• We would like algorithms for as large a class as possible.
• How to generalize these models?

General Threshold Model:
• Each node v has an activation function $h_v : 2^V \to [0,1]$.
• v becomes active when $h_v(A)$ exceeds v's threshold $\theta_v$, where A is the set of active neighbors of v.
• Linear Threshold: special case where $h_v(A) = \sum_{w \in A} b_{v,w}$.

General Cascade Model:
• Activation probabilities $p_{uv}$ change as a function of who has already tried and failed (now: $p_{uv}(F)$).
• Order-independence: the order of attempts does not matter.
• Independent Cascade: special case ($p_{uv}$ independent of F).
The General Threshold Model and the General Cascade Model are equivalent.

Generality vs Feasibility
• In general, any non-trivial approximation of f is NP-hard.
• How general a model can we handle?
• Decreasing Cascade Model: $p_{uv}(F)$ are non-increasing in F.
• For the Decreasing Cascade Model, the greedy algorithm is a (1 - 1/e)-approximation.
• Conjecture: if each activation function $h_v(A)$ is submodular, then f(S) is submodular.

Two (Jure) Papers
• The Dynamics of Viral Marketing
• Patterns of Influence in a Recommendation Network
Goal: understand how the propagation of recommendations empirically influences product purchasing.

Why even bother?
Evidence that recommendations and viral marketing work:
• Amazon: "People who bought X also bought Y"
• Ebay: ratings affect likelihood of item being bought
• Hotmail, Gmail
• 68% of consumers consult friends or family before purchasing electronics

The Long Tail

Chris Anderson, ‘The Long Tail’, Wired - October 2004

Dataset
• Large online retailer (June 2001 - May 2003)
• Recommendation network: nodes (people) and edges (recommendations)
• Senders and followers of recommendations receive discounts on products (10% credit, 10% off)
• Recommendations are made to any number of people at the time of purchase
• Only the recipient who buys first gets a discount

Dataset Statistics

          products   customers   recommendations   edges       purchases   responses
Book      103,161    2,863,977   5,741,611         2,097,809   2,859,096   83,113
DVD       19,829     805,285     8,180,393         962,341     837,300     75,421
Music     393,598    794,148     1,443,847         585,738     721,673     10,576
Video     26,131     239,583     280,270           160,683     165,109     1,376
Full      542,719    3,943,084   15,646,121        3,153,676   4,574,178   170,486

• 9% of DVD purchases are due to recommendations (3% of books)
• Networks are very sparsely connected (low average degree)

Product recommendation network
[Figure: example network; legend: purchase following a recommendation, customer recommending a product, customer not buying a recommended product.]
• Stars, disconnected components

Cascade formation process
• Time: t1 < t2 < … < tn
[Figure: cascade growing over time steps t1, t2, …; legend: received a recommendation and propagated it forward, received a recommendation but didn't propagate.]

Identifying cascades
• Create a separate graph for each product
• Delete late recommendations
  • Recommendations after first purchase
  • We get a time-increasing graph
• Delete no-purchase nodes
• Now connected components correspond to maximal cascades!
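For a single product, the filtering steps above could be sketched as follows. The record layout (`(sender, receiver, time)` tuples and a `first_purchase` dictionary) is hypothetical, and "late" is interpreted here as a recommendation that arrives after the receiver's first purchase.

```python
from collections import defaultdict

def maximal_cascades(recs, first_purchase):
    """Return the connected components left after filtering one product's graph.

    recs: iterable of (sender, receiver, time) recommendations (assumed layout).
    first_purchase: customer -> time of first purchase; absent means no purchase.
    """
    adj = defaultdict(set)
    for sender, receiver, t in recs:
        # delete no-purchase nodes (and every edge touching them)
        if sender not in first_purchase or receiver not in first_purchase:
            continue
        # delete late recommendations (arriving after the receiver already bought)
        if t > first_purchase[receiver]:
            continue
        adj[sender].add(receiver)
        adj[receiver].add(sender)

    seen, components = set(), []
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            component.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return components

recs = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("x", "y", 1), ("a", "c", 9)]
first_purchase = {"a": 0, "b": 1.5, "c": 2.5, "x": 0, "y": 5}
print(maximal_cascades(recs, first_purchase))
```

Purchasers with no surviving edge do not appear in the output of this sketch; they would be cascades of size one.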

Measuring cascade sizes for books
• Steep drop-off: very few large cascades
[Figure: log-log plot of count vs. size of cascade for books; fit: count = 1.8e6 x^{-4.98}, R^2 = 0.99.]

Measuring cascade sizes for DVDs
• DVD cascades can grow large
• Possibly a product of websites where people sign up to exchange recommendations
• Shallow drop-off, fat tail: a number of large cascades
[Figure: log-log plot of count vs. size of cascade for DVDs; fit: count = 3.4e3 x^{-1.56}, R^2 = 0.83.]
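Fits of the form count = c · x^γ are commonly obtained by least-squares regression in log-log space; whether the plots above were produced exactly this way is not stated in the slides, so the following is only a sketch of that standard approach, run on synthetic data. The function name `fit_power_law` is made up for the example.

```python
import math

def fit_power_law(sizes, counts):
    """Least-squares fit of log10(count) = log10(c) + gamma * log10(size).

    Returns (c, gamma, r_squared).
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    intercept = mean_y - gamma * mean_x
    ss_res = sum((y - (intercept + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 10 ** intercept, gamma, 1 - ss_res / ss_tot

# Synthetic data following count = 1000 * size^-2 exactly.
sizes = [1, 2, 4, 8, 16]
counts = [1000 * s ** -2 for s in sizes]
c, gamma, r2 = fit_power_law(sizes, counts)
print(round(c, 3), round(gamma, 3), round(r2, 3))
```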

Q1: How does the number of incoming recs. affect the probability of buying?
[Figure: probability of buying vs. number of incoming recommendations; panels: BOOKS and DVDs.]
• Books: recommendations are rarely followed; success decreases with more recs.
• DVDs: immune after 10 recommendations

Q2: How does the number of outgoing recs. affect the number of purchases?
[Figure: number of purchases vs. number of outgoing recommendations; panels: BOOKS and DVDs.]
• Books: saturation. Spamming: people stop listening after a while.
• DVDs: keeps increasing (colliding recommendations).

Q3: How does recommendation exchange affect purchase?
[Figure: probability of buying vs. number of exchanged recommendations; panels: BOOKS (vertical axis scaled by 10^-3) and DVDs.]
• For rec. r on product p from user u to user v: determine the number of recs. exchanged before r, then check whether v purchased p after r arrived.
→ Recommendations begin to lose their effect after a few exchanges.

Discussion (1)
Epidemic model assumption:
• Individuals convert once the fraction of their neighbors who are infected exceeds a threshold.
Viral Marketing insight:
• Probability of purchasing a product saturates.

Discussion (2)
Sexual-contact network assumption:
• High-degree nodes have equal probability of infecting each of their neighbors.
Viral Marketing insight:
• As you send more and more recommendations, the number of purchases can decline.

Discussion (3)
Epidemic model assumption:
• Every time individuals interact, they have equal probability of being infected.
Viral Marketing insight:
• Probability of infection (purchasing) decreases with repeated interaction.

Other Critiques
• What if the person purchased elsewhere?
• Lack of money
• Recommendations could have occurred offline
• Recommendations only occur after purchase, not after receiving the product

Topological structure of cascades
• What kinds of cascades arise frequently in real life?
• Are they like trees, stars, or something else?

Cascade enumeration
• Maximal cascades do not reveal what the cascade building blocks (local structures) are
• Given a maximal cascade, we want to enumerate all local cascades:
  • For every node we explore the cascade in the neighborhood up to 1, 2, 3, … steps away
  • This way we capture the local structure of the cascade around the node
[Figure: source node with nodes 1 step away and 2 steps away.]
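Extracting the local cascade around a node can be sketched as a bounded breadth-first search. The sketch assumes that exploration follows outgoing recommendation edges from the source; the function name and the `adj` layout are hypothetical.

```python
def local_cascade(adj, source, steps):
    """Nodes and edges of the cascade within `steps` hops of `source`.

    adj maps each node to the set of nodes it points to (assumed layout).
    """
    dist = {source: 0}
    frontier = [source]
    for d in range(1, steps + 1):
        next_frontier = []
        for v in frontier:
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = d
                    next_frontier.append(w)
        frontier = next_frontier
    nodes = set(dist)
    # keep every edge whose endpoints both lie inside the explored neighborhood
    edges = {(v, w) for v in nodes for w in adj.get(v, ()) if w in nodes}
    return nodes, edges

adj = {"a": {"b", "c"}, "b": {"d"}, "d": {"e"}}
for h in (1, 2, 3):
    print(h, local_cascade(adj, "a", h))
```

Running this for every node and every radius yields the collection of local cascades that then has to be counted up to isomorphism.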

Counting cascades (graph isomorphism)
• To count cascades we need to determine whether a new cascade is isomorphic to an already seen one.
[Figure: two cascade graphs compared for equality.]
• Graphs are isomorphic if there exists a node mapping so that nodes have the same neighbors.
• No polynomial graph isomorphism algorithm is known, so we resort to an approximate solution.

Approximate graph isomorphism
• Do not compare the graphs directly; instead, for each graph we create a signature.
• A good signature is one where isomorphic graphs have the same signature, but few non-isomorphic graphs share the same signature.
• Compare the graph signatures.

Creating a signature
• We propose a multilevel approach
• Different levels of the signature, from simple (fast/inaccurate) to complex (slow/accurate):
  • Number of nodes, number of edges
  • Sorted in- and out-degree sequence
  • Singular values of graph adjacency matrix
• For small graphs (n < 9) we perform an exact isomorphism test
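The two cheapest signature levels can be sketched with the standard library alone. The singular-value level and the exact isomorphism test are deliberately left out of this sketch, and the names `simple_signature` and `group_by_signature` are made up for illustration.

```python
from collections import defaultdict

def simple_signature(edges):
    """Signature from node/edge counts and sorted in-/out-degree sequences."""
    edges = set(edges)
    nodes = {v for edge in edges for v in edge}
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return (len(nodes), len(edges),
            tuple(sorted(indeg[v] for v in nodes)),
            tuple(sorted(outdeg[v] for v in nodes)))

def group_by_signature(graphs):
    """Bucket graphs (given as edge lists) that share the same simple signature."""
    buckets = defaultdict(list)
    for name, edges in graphs.items():
        buckets[simple_signature(edges)].append(name)
    return dict(buckets)

graphs = {
    "chain1": [("a", "b"), ("b", "c")],
    "chain2": [("x", "y"), ("y", "z")],
    "split": [("a", "b"), ("a", "c")],
}
print(group_by_signature(graphs))
```

Graphs that land in the same bucket are only candidates for being isomorphic; the more expensive levels would be applied within each bucket.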

Comparing signatures
• First compare simple signatures
• Compare the graphs with the same simple signature using more and more complicated (expensive/accurate) signatures
• At the end (for small graphs) we perform exact isomorphism resolution
• Since we are interested in building blocks of cascades, which are generally small, the precision for small graphs is more important

Comparing signatures: Example
• Compare simple signature (number of nodes/edges)
• Compare simple signature (degree sequence)
• Compare simple signature (singular values)

Frequent cascade subgraphs (1)
General observations:
• DVDs have the richest cascades (most recommendations, most densely linked)
• Books have a small variety of cascades
• Music is much larger than video but does not have much variety in cascades

          cascades   different
Book      122,657    959
DVD       289,055    87,614
Music     13,330     158
Video     1,928      109

Frequent cascade subgraphs (2)
• [subgraph icon] is the most common cascade subgraph
• It accounts for ~75% of cascades in books, CD and VHS, but only 12% of DVD cascades
• Chains ([subgraph icon]) are more frequent than [subgraph icon]
• [subgraph icon] is 6 (1.2 for DVD) times more frequent than [subgraph icon]
• For DVDs, [subgraph icon] is more frequent than [subgraph icon]
• [subgraph icon] is more frequent than a collision (but the collision has fewer edges)

Typical classes of cascades
• No propagation
• Common friends
• A complicated cascade

Observations
• Double-counting?
• Cascade frequencies do not simply decrease monotonically for denser subgraphs
  • Frequency not dependent on size
  • Thus some structure or property exists
• Splits are more common than collisions
• Cascades are a form of collective behavior

Goal 2: Early Detection of Outbreak

Objective:
1) Minimize time to detection
2) Maximize number of detected propagations
3) Minimize number of infected people
Observation: the objective functions are submodular.
[Figure: two snapshots of a sensor network. With placement A = {S1, S2}, adding a new sensor S' helps a lot; with placement A = {S1, S2, S3, S4}, adding S' helps very little.]

Alternative Models of Disease Outbreak
• Goal: estimate the population size affected by a disease; estimate the "epidemic threshold"
• Fully mixed models like SIR and SIS
• More generic models based on percolation theory
• Main difference from previous models: a node can infect its neighbors until it recovers.
• Can easily be incorporated in the independent cascade models.

References
• D. Kempe, J. Kleinberg, E. Tardos. Influential Nodes in a Diffusion Model for Social Networks. Proc. 32nd International Colloquium on Automata, Languages and Programming (ICALP), 2005.
• D. Kempe, J. Kleinberg, E. Tardos. Maximizing the Spread of Influence through a Social Network. Proc. 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2003.
• Jure Leskovec, Lada A. Adamic, Bernardo A. Huberman. The Dynamics of Viral Marketing. ACM Transactions on the Web, 2007.
• Jure Leskovec, Ajit Singh, Jon Kleinberg. Patterns of Influence in a Recommendation Network. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2006.
• J. Kleinberg. Cascading Behavior in Networks: Algorithmic and Economic Issues. In Algorithmic Game Theory (N. Nisan, T. Roughgarden, E. Tardos, V. Vazirani, eds.), Cambridge University Press, 2007.
• Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie S. Glance. Cost-effective outbreak detection in networks. KDD 2007: 420-429.
• Kempe slides: http://www.cs.washington.edu/affiliates/meetings/talks04/kempe.pdf
• Leskovec slides: http://www.cs.cmu.edu/~jure/pubs/cascades-pakdd06.ppt
• Adamic slides: http://www-personal.umich.edu/%7Eladamic/papers/viral/ViralMarketingW.ppt