Survey on Information Diffusion
Barna Saha & Saket Navlakha
CMSC 828G, March 13, 2008
Slides adapted from David Kempe, Jure Leskovec, and Lada Adamic [see references]
What is Information Diffusion?
A generic term: anything that propagates over a network comes under the umbrella of “information diffusion.”
Two Major Goals of Information Diffusion Maximizing spread of influence.
Opinion, Idea, Innovation, Recommendation
Early detection of outbreak.
Disease, Contamination
Illustrative Example 1: Spread of Influence
A new product is available in the market.
Whom should we give free samples to in order to maximize the purchase of the product?
Illustrative Example 2: Early Detection of Outbreak
Contaminants are spreading in a city water distribution network.
On which nodes should we place sensors to efficiently detect all possible contaminations?
Goal 1: Spread of Influence
Models of influence with provable guarantee of goodness. Empirical study of models on viral marketing and recommendation dataset.
Social Network and Spread of Influence
What is a social network?
The graph of relationships and interactions within a group of individuals.
Fundamental medium for INFLUENCE propagation among its members
“Word of mouth” effect
Problem Setting
Given
a limited budget B for initial advertising (e.g. give away free samples of the product)
estimates for influence between individuals
Goal
trigger a large cascade of influence (e.g. further adoptions of a product)
Question
Which set of individuals should the budget B be spent on?
What we need
Form models of influence in social networks.
Obtain data about a particular network (to estimate inter-personal influence).
Devise an algorithm to maximize the spread of influence.
Optimization problem first introduced by Domingos/Richardson [KDD '01/KDD '02]
Two Basic Models of Influence
Threshold Model: we will discuss the Linear Threshold Model.
Cascade Model: we will discuss the Independent Cascade Model.
General Operational View Some nodes start active (bought the product). Active nodes may cause others to activate, etc. Monotonicity: active nodes never deactivate.
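Linear Threshold Model [Granovetter '78]
A node v is influenced by each neighbor w according to a weight $b_{v,w}$ such that $\sum_{w \text{ neighbor of } v} b_{v,w} \le 1$.
(Figure: neighbors Alice, with weight 0.7, and Bob, with weight 0.2, influencing "You".)
Each node v has a threshold $\theta_v$ which is chosen uniformly at random from the interval [0,1].
A node v becomes active if $\sum_{w \text{ active neighbor of } v} b_{v,w} \ge \theta_v$.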
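Example (figure): a weighted graph with nodes u, v, w, and x. Legend: inactive node, active node, threshold, influence. The influence on each inactive node accumulates from its active neighbors, and the process stops when no further threshold is reached.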
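The Linear Threshold process can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `simulate_linear_threshold`, the dictionary layout `in_weights[v][w]`, and the example thresholds are all assumptions made for this sketch. The Alice/Bob weights are the ones shown on the slide.

```python
import random


def simulate_linear_threshold(in_weights, seeds, thresholds=None, rng=None):
    """Run the Linear Threshold process until no more nodes activate.

    in_weights[v] maps each neighbor w of v to the weight of w's influence
    on v; the weights into each node are assumed to sum to at most 1.
    """
    rng = rng or random.Random()
    if thresholds is None:
        # Each node draws its threshold uniformly at random.
        thresholds = {v: rng.random() for v in in_weights}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in in_weights:
            if v in active:
                continue
            # Total weight of the neighbors of v that are already active.
            influence = sum(b for w, b in in_weights[v].items() if w in active)
            if influence >= thresholds[v]:
                active.add(v)
                changed = True
    return active


# Hypothetical thresholds; the weights 0.7 and 0.2 come from the slide.
example_weights = {"you": {"alice": 0.7, "bob": 0.2}, "alice": {}, "bob": {}}
example_thresholds = {"you": 0.5, "alice": 1.0, "bob": 1.0}
print(simulate_linear_threshold(example_weights, {"alice"}, example_thresholds))
print(simulate_linear_threshold(example_weights, {"bob"}, example_thresholds))
```

Because active nodes never deactivate, the order in which nodes are checked inside a pass does not change the final active set.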
Independent Cascade Model When node v becomes active, it has a single chance of activating each currently inactive neighbor w. The activation attempt succeeds with probability pvw
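Example (figure): a graph with nodes u, v, w, and x whose edges are labeled with activation probabilities. Legend: inactive node, active node, newly active node, successful attempt, unsuccessful attempt. The process stops when no newly active node remains.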
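A single run of the Independent Cascade process might look as follows. This is a hedged sketch: the name `simulate_independent_cascade` and the layout `probs[v][w]` (the probability that v activates w) are assumptions for illustration, not part of the original slides.

```python
import random
from collections import deque


def simulate_independent_cascade(probs, seeds, rng=None):
    """Run one Independent Cascade process and return the final active set.

    probs[v] maps each neighbor w of v to the activation probability p_vw.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = deque(seeds)  # newly active nodes that have not yet tried
    while frontier:
        v = frontier.popleft()
        # v gets exactly one attempt on each currently inactive neighbor.
        for w, p in probs.get(v, {}).items():
            if w not in active and rng.random() < p:
                active.add(w)
                frontier.append(w)
    return active


# Hypothetical three-node chain a -> b -> c.
example_probs = {"a": {"b": 0.6}, "b": {"c": 0.3}}
print(simulate_independent_cascade(example_probs, {"a"}, random.Random(1)))
```

Each node enters the frontier at most once, which is what enforces the "single chance" rule.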
Influence Maximization Problem
Influence of node set S: f(S), the expected number of active nodes at the end, if S is the initial active set.
Key: f(S) is submodular.
Problem: Given a parameter k (budget), find a k-node set S to maximize f(S).
This is a constrained optimization problem with f(S) as the objective function.
Properties of f(S)
• Non-negative (obviously)
• Monotone: $f(S + v) \ge f(S)$
• Submodular: let N be a finite set. A set function $f : 2^N \to \mathbb{R}$ is submodular iff for all $S \subset T \subset N$ and all $v \in N \setminus T$,
$f(S + v) - f(S) \ge f(T + v) - f(T)$ (diminishing returns)
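Example (figure): a set S contained in a larger set T, illustrating that adding a node to S gains at least as much as adding it to T.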
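For a small ground set, the monotonicity and diminishing-returns conditions can be checked by brute force. The sketch below is only an illustration: the helper names and the coverage function used as the example are assumptions, and the check is exponential in the size of the ground set.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_monotone_submodular(f, ground):
    """Check f(S+v) - f(S) >= f(T+v) - f(T) >= 0 for all S within T, v outside T."""
    ground = frozenset(ground)
    for T in subsets(ground):
        for S in subsets(T):
            for v in ground - T:
                gain_S = f(S | {v}) - f(S)
                gain_T = f(T | {v}) - f(T)
                if gain_S < 0 or gain_S < gain_T:
                    return False
    return True


# Hypothetical coverage function: each node "covers" some items.
covers = {"a": {1, 2}, "b": {2, 3}, "c": {3}}


def coverage(S):
    return len(set().union(*(covers[v] for v in S)))


print(is_monotone_submodular(coverage, covers))                # coverage function
print(is_monotone_submodular(lambda S: len(S) ** 2, covers))   # increasing returns
```

Coverage functions are a standard example of monotone submodular functions, while `len(S) ** 2` has increasing marginal gains and fails the check.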
Bad News
For a submodular function f, if f only takes non-negative values and is monotone, finding a k-element set S for which f(S) is maximized is an NP-hard optimization problem.
It is NP-hard to determine the optimum k-element set for influence maximization for both the independent cascade model and the linear threshold model.
Good News: We can use the Greedy Algorithm!
Greedy algorithm: for k iterations, add the node v to S that maximizes the marginal gain f(S + v) − f(S).
(Figure: marginal reward of candidate nodes a, b, c, d, e.)
How good (or bad) is it?
Theorem: The greedy algorithm is a (1 – 1/e)-approximation. The resulting set S activates at least (1 – 1/e) > 63% of the number of nodes that any size-k set could activate.
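The greedy rule can be written generically for any set function f that is passed in as a callable. The following is a minimal sketch, assuming f can be evaluated exactly; the name `greedy_maximize` and the small coverage example are hypothetical and stand in for the influence function.

```python
def greedy_maximize(f, ground, k):
    """Pick up to k elements, each time adding the one with the largest marginal gain."""
    chosen = []
    current = f(frozenset())
    for _ in range(k):
        best, best_gain = None, None
        for v in ground:
            if v in chosen:
                continue
            gain = f(frozenset(chosen) | {v}) - current
            if best is None or gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # ground set exhausted
            break
        chosen.append(best)
        current += best_gain
    return chosen


# Hypothetical stand-in for f(S): how many items the chosen nodes cover.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(S):
    return len(set().union(*(covers[v] for v in S)))


print(greedy_maximize(coverage, sorted(covers), 2))
```

In this example the first pick is the node with the largest individual coverage, and the second pick is chosen by what it adds on top of the first, not by its standalone value.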
Submodularity for Independent Cascade
Coins for edges are flipped during activation attempts.
(Figure: example graph with edges labeled by activation probabilities.)
Submodularity for Independent Cascade
Can pre-flip all coins and reveal results immediately.
Active nodes in the end are reachable via green paths from initially targeted nodes.
Study reachability in green graphs.
(Figure: the example graph with edge probabilities, where edges whose coin flip succeeded are drawn in green.)
Graph Reachability Is Submodular
From the picture: g(T + v) − g(T) ⊆ g(S + v) − g(S) when S ⊆ T, where g(S) is the set of nodes reachable from S and − denotes set difference.
Submodularity of f(S)
Fact: A non-negative linear combination of submodular functions is submodular.
$f(S) = \sum_{G} \Pr[G \text{ is the green graph}] \cdot g_G(S)$
$g_G(S)$: number of nodes reachable from S in G.
Each $g_G(S)$ is submodular (previous slide).
Probabilities are non-negative.
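On a tiny graph the sum over green graphs can be evaluated exactly by enumerating every outcome of the edge coin flips. The sketch below is illustrative only (it is exponential in the number of edges), and the names `reachable` and `exact_influence` are assumptions made for this example.

```python
from itertools import product


def reachable(edges, seeds):
    """Return the set of nodes reachable from seeds along directed edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen = set(seeds)
    stack = list(seeds)
    while stack:
        u = stack.pop()
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def exact_influence(edge_probs, seeds):
    """Sum Prob(G is the green graph) * |reachable set in G| over all outcomes G."""
    edges = list(edge_probs)
    total = 0.0
    for outcome in product([False, True], repeat=len(edges)):
        prob = 1.0
        green = []
        for edge, is_green in zip(edges, outcome):
            p = edge_probs[edge]
            prob *= p if is_green else 1.0 - p
            if is_green:
                green.append(edge)
        total += prob * len(reachable(green, seeds))
    return total


# Hypothetical chain a -> b -> c with probability 0.5 on each edge.
chain = {("a", "b"): 0.5, ("b", "c"): 0.5}
print(exact_influence(chain, {"a"}))
```

For the chain, a is always active, b is active with probability 0.5, and c with probability 0.25, so the expected number of active nodes is 1.75.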
Submodularity for Linear Threshold
Use a similar “green graph” idea. Once a graph is fixed, the “reachability” argument is identical.
How do we fix a green graph now? Each node picks at most one incoming edge, with probabilities proportional to edge weights.
This is equivalent to the linear threshold model (trickier proof).
Evaluating f(S)
How to evaluate f(S)? It is still an open question how to compute it efficiently.
But: very good estimates can be obtained by simulation, repeating the diffusion process often enough (polynomial in n and 1/ε). This achieves a (1 ± ε)-approximation to f(S).
A generalization of the Nemhauser/Wolsey proof shows: the greedy algorithm is now a (1 – 1/e – ε′)-approximation.
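Estimating f(S) by simulation amounts to averaging the final cascade size over many independent runs. Below is a minimal sketch for the Independent Cascade model; the function names, the default of 10,000 runs, and the fixed random seed are assumptions for illustration and say nothing about how many runs a given accuracy requires.

```python
import random


def cascade_size(probs, seeds, rng):
    """Run one Independent Cascade process and return the number of active nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        v = frontier.pop()
        for w, p in probs.get(v, {}).items():
            # One activation attempt per (newly active node, inactive neighbor).
            if w not in active and rng.random() < p:
                active.add(w)
                frontier.append(w)
    return len(active)


def estimate_influence(probs, seeds, runs=10000, seed=0):
    """Monte Carlo estimate of the expected number of active nodes."""
    rng = random.Random(seed)
    total = sum(cascade_size(probs, seeds, rng) for _ in range(runs))
    return total / runs


# Hypothetical chain a -> b -> c with probability 0.5 on each edge.
chain = {"a": {"b": 0.5}, "b": {"c": 0.5}}
print(estimate_influence(chain, {"a"}))
```

For this chain the exact expectation is 1 + 0.5 + 0.25 = 1.75, so the printed estimate should land close to that value.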
General Model
Independent Cascade and Linear Threshold are two specific models.
We would like algorithms for as large a class as possible.
How to generalize these models?
General Threshold Model: Each node v has an activation function $h_v : 2^V \to [0,1]$. v becomes active when $h_v(A)$ exceeds v's threshold $\theta_v$, where A is the set of active neighbors of v.
Linear Threshold: special case where $h_v(A) = \sum_{u \in A} b_{v,u}$, the total weight of v's active neighbors.
General Cascade Model: Activation probabilities $p_{uv}$ change as a function of who has already tried and failed (now: $p_{uv}(F)$).
Order-independence: the order of attempts does not matter.
Independent Cascade: special case ($p_{uv}$ independent of F).
The General Threshold Model and the General Cascade Model are equivalent.
Generality vs Feasibility
In general, obtaining any non-trivial approximation for maximizing f is NP-hard.
How general a model can we handle?
Decreasing Cascade Model: $p_{uv}(F)$ are non-increasing in F.
For the Decreasing Cascade Model, the greedy algorithm is a (1 – 1/e)-approximation.
Conjecture: If each activation function $h_v(A)$ is submodular, then f(S) is submodular.
Two (Jure) Papers
The Dynamics of Viral Marketing
Patterns of Influence in a Recommendation Network
Goal: Understand empirically how the propagation of recommendations influences product purchasing
Why even bother?
Evidence that recommendations and viral marketing work:
• Amazon: “People who bought X also bought Y”
• Ebay: Ratings affect likelihood of item being bought
• Hotmail, Gmail
• 68% of consumers consult friends or family before purchasing electronics
The Long Tail
Chris Anderson, ‘The Long Tail’, Wired - October 2004
Dataset
Large online retailer (June 2001 – May 2003)
Recommendation network: nodes (people) and edges (recommendations)
Senders and followers of recommendations receive discounts on products (10% credit and 10% off, respectively)
Recommendations are made to any number of people at the time of purchase Only the recipient who buys first gets a discount
Dataset Statistics

         products   customers   recommendations       edges   purchases   responses
Book      103,161   2,863,977         5,741,611   2,097,809   2,859,096      83,113
DVD        19,829     805,285         8,180,393     962,341     837,300      75,421
Music     393,598     794,148         1,443,847     585,738     721,673      10,576
Video      26,131     239,583           280,270     160,683     165,109       1,376
Full      542,719   3,943,084        15,646,121   3,153,676   4,574,178     170,486

• 9% of DVD purchases are due to recommendations (3% of books)
• Networks are very sparsely connected (low average degree)
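The 9% and 3% figures can be reproduced from the dataset statistics, assuming the last two numbers of each row are the purchases and the responses (purchases that followed a recommendation). That column reading is an interpretation of the slide, and the variable names below are hypothetical.

```python
# (purchases, responses) per product group, copied from the dataset statistics.
stats = {
    "Book": (2_859_096, 83_113),
    "DVD": (837_300, 75_421),
    "Music": (721_673, 10_576),
    "Video": (165_109, 1_376),
}


def response_rate(group):
    """Fraction of purchases in a group that followed a recommendation."""
    purchases, responses = stats[group]
    return responses / purchases


for group in stats:
    print(f"{group}: {100 * response_rate(group):.1f}% of purchases follow a recommendation")
```

Under this reading, DVDs come out at about 9% and books at about 3%, matching the slide.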
Product recommendation network
(Figure legend: purchase following a recommendation; customer recommending a product; customer not buying a recommended product.)
• Stars, disconnected components
Cascade formation process
Time: t1 < t2 < … < tn
(Figure: a cascade whose edges are labeled with recommendation times. Legend: received a recommendation and propagated it forward; received a recommendation but didn't propagate.)
Identifying cascades
• Create a separate graph for each product
• Delete late recommendations (recommendations after the first purchase); we get a time-increasing graph
• Delete no-purchase nodes
Now connected components correspond to maximal cascades!
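The cascade-identification steps can be sketched as follows. The input layout is an assumption made for this example: recommendations are `(sender, receiver, product, time)` tuples, `first_purchase` maps `(person, product)` to the time of that person's first purchase, and senders are assumed to have purchased, since recommendations are made at purchase time.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of the undirected version of edges."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def maximal_cascades(recommendations, first_purchase):
    """Group the surviving recommendation edges of each product into cascades."""
    edges_by_product = defaultdict(list)  # one graph per product
    for sender, receiver, product, time in recommendations:
        bought_at = first_purchase.get((receiver, product))
        if bought_at is None:   # no-purchase node: drop
            continue
        if time > bought_at:    # late recommendation: drop
            continue
        edges_by_product[product].append((sender, receiver))
    return {p: connected_components(e) for p, e in edges_by_product.items()}


# Hypothetical data: d never buys, and e recommends to b after b has bought.
recs = [("a", "b", "book1", 1), ("b", "c", "book1", 3),
        ("a", "d", "book1", 1), ("e", "b", "book1", 5)]
bought = {("a", "book1"): 0, ("b", "book1"): 2,
          ("c", "book1"): 4, ("e", "book1"): 5}
print(maximal_cascades(recs, bought))
```

In the example only the edges a to b and b to c survive, so the single maximal cascade for `book1` contains a, b, and c.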
Measuring cascade sizes for books
(Figure: log-log plot of Count against Size of cascade for books, with power-law fit Count = 1.8e6 x^(-4.98), R² = 0.99.)
• Steep drop-off
• Very few large cascades
Measuring cascade sizes for DVDs
DVD cascades can grow large. Possibly a product of websites where people sign up to exchange recommendations.
(Figure: log-log plot of Count against Size of cascade for DVDs, with power-law fit Count = 3.4e3 x^(-1.56), R² = 0.83.)
• Shallow drop-off, fat tail
• A number of large cascades
Q1: How does the number of incoming recs. affect the probability of buying?
(Figure: two panels, BOOKS and DVDs, plotting Probability of Buying against Incoming Recommendations.)
• Book recommendations are rarely followed
• Success decreases with more recs.
• DVDs: immune after 10 recommendations
Q2: How does the number of outgoing recs. affect the number of purchases?
(Figure: two panels, BOOKS and DVDs, plotting Number of Purchases against Outgoing Recommendations.)
• Books: saturation. Spamming: people stop listening after a while
• DVDs: keeps increasing (colliding recommendations)
Q3: How does recommendation exchange affect purchase?
(Figure: two panels, BOOKS and DVDs, plotting Probability of buying against Exchanged recommendations.)
• For a rec. r on product p from user u to user v: determine the number of recs. exchanged before r, then check whether v purchased p after r arrived.
→ Recommendations begin to lose their effect after a few exchanges.
Discussion (1)
Epidemic model assumption:
Individuals convert once the fraction of their neighbors who are infected exceeds a threshold.
Viral Marketing insight:
Probability of purchasing a product saturates
Discussion (2)
Sexual-contact network assumption:
High-degree nodes have equal probability of infecting each of their neighbors.
Viral Marketing insight:
As you send more and more recommendations, the number of purchases can decline.
Discussion (3)
Epidemic model assumption:
Every time individuals interact, they have equal probability of being infected.
Viral Marketing insight:
Probability of infection (purchasing) decreases with repeated interaction.
Other Critiques
What if person purchased elsewhere?
Lack of money
Recommendations could have occurred offline
Recommendations only occur after purchase, not after receiving product
Topological structure of cascades
What kinds of cascades arise frequently in real life?
Are they like trees, stars, or something else?
Cascade enumeration
Maximal cascades do not reveal what the cascade building blocks (local structures) are.
Given a maximal cascade, we want to enumerate all local cascades: for every node, we explore the cascade in the neighborhood up to 1, 2, 3, … steps away.
This way we capture the local structure of the cascade around the node.
(Figure legend: source node, 1 step away, 2 steps away.)
Counting cascades (graph isomorphism)
To count cascades we need to determine whether a new cascade is isomorphic to an already seen one.
(Figure: two cascade graphs compared, “? ==”.)
Graphs are isomorphic if there exists a node mapping so that nodes have the same neighbors.
No polynomial-time graph isomorphism algorithm is known, so we resort to an approximate solution.
Approximate graph isomorphism
Do not compare the graphs directly; instead, create a signature for each graph.
A good signature is one where isomorphic graphs have the same signature, but few non-isomorphic graphs share the same signature.
Compare the graph signatures.
Creating a signature
We propose a multilevel approach. Different levels of the signature, from simple (fast/inaccurate) to complex (slow/accurate):
• Number of nodes, number of edges
• Sorted in- and out-degree sequence
• Singular values of the graph adjacency matrix
For small graphs (n < 9) we perform an exact isomorphism test.
Comparing signatures
First compare simple signatures.
Compare the graphs with the same simple signature using more and more complicated (expensive/accurate) signatures.
At the end (for small graphs) we perform exact isomorphism resolution.
Since we are interested in building blocks of cascades, which are generally small, the precision for small graphs is more important.
Comparing signatures – Example
Compare simple signature (number of nodes/edges)
Compare simple signature (degree sequence)
Compare simple signature (singular values)
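The multilevel comparison can be sketched with the two cheapest signature levels followed by a brute-force exact test for small graphs. This is a simplified illustration, not the authors' implementation: the singular-value level is omitted, graphs are assumed to be `(nodes, edges)` pairs with edges given as a set of directed pairs, and all function names are hypothetical.

```python
from itertools import permutations


def size_signature(nodes, edges):
    """Level 1: number of nodes and number of edges."""
    return (len(nodes), len(edges))


def degree_signature(nodes, edges):
    """Level 2: sorted in-degree and out-degree sequences."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return (tuple(sorted(indeg.values())), tuple(sorted(outdeg.values())))


def exactly_isomorphic(graph_a, graph_b):
    """Try every node mapping; assumes equal node and edge counts."""
    (nodes_a, edges_a), (nodes_b, edges_b) = graph_a, graph_b
    nodes_a, edge_set_b = list(nodes_a), set(edges_b)
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        if all((mapping[u], mapping[v]) in edge_set_b for u, v in edges_a):
            return True
    return False


def same_shape(graph_a, graph_b):
    """Compare cheap signatures first, then resolve small graphs exactly."""
    for signature in (size_signature, degree_signature):
        if signature(*graph_a) != signature(*graph_b):
            return False
    if len(graph_a[0]) < 9:
        return exactly_isomorphic(graph_a, graph_b)
    return True  # larger graphs: matching signatures are accepted (approximate)


chain = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
split = ({"a", "b", "c"}, {("a", "b"), ("a", "c")})
collision = ({"a", "b", "c"}, {("b", "a"), ("c", "a")})
print(same_shape(chain, split), same_shape(split, collision))
```

The cheap signatures reject most non-matching pairs early, so the factorial-time exact test runs only on small graphs that already agree on size and degree sequences.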
Frequent cascade subgraphs (1)

         cascades   different
Book      122,657         959
DVD       289,055      87,614
Music      13,330         158
Video       1,928         109

General observations:
• DVDs have the richest cascades (most recommendations, most densely linked)
• Books have a small variety of cascades
• Music is much larger than video but does not have much variety in cascades
Frequent cascade subgraphs (2)
[subgraph figure] is the most common cascade subgraph. It accounts for ~75% of cascades in books, CD and VHS, but only 12% of DVD cascades.
Chains ([subgraph figure]) are more frequent than [subgraph figure].
[subgraph figure] is 6 (1.2 for DVD) times more frequent than [subgraph figure].
For DVDs, [subgraph figure] is more frequent than [subgraph figure].
[subgraph figure] is more frequent than a collision (but a collision has fewer edges).
Typical classes of cascades
No propagation
Common friends
A complicated cascade
Observations
Double-counting? Cascade frequencies do not simply decrease monotonically for denser subgraphs
Frequency not dependent on size Thus some structure or property exists
Splits are more common than collisions Cascades are a form of collective behavior
Goal 2: Early Detection of Outbreak
Objective:
1) Minimize time to detection
2) Maximize number of detected propagations
3) Minimize number of infected people
Observation: The objective functions are submodular.
(Figure: two snapshots of sensor placements. With placement A = {S1, S2}, adding the new sensor S' helps a lot. With placement A = {S1, S2, S3, S4}, adding S' helps very little.)
Alternative Models of Disease Outbreak
Goal: Estimate the population size affected by a disease; estimate the “epidemic threshold”.
• Fully mixed models like SIR and SIS
• More generic models based on percolation theory
Main difference from previous models: a node can infect its neighbors until it recovers.
Can easily be incorporated in the independent cascade models.
References
D. Kempe, J. Kleinberg, E. Tardos. Influential Nodes in a Diffusion Model for Social Networks. Proc. 32nd International Colloquium on Automata, Languages and Programming (ICALP), 2005.
D. Kempe, J. Kleinberg, E. Tardos. Maximizing the Spread of Influence through a Social Network. Proc. 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2003.
Jure Leskovec, Lada A. Adamic, Bernardo A. Huberman. The Dynamics of Viral Marketing. ACM Transactions on the Web 2007.
Jure Leskovec, Ajit Singh, Jon Kleinberg. Patterns of Influence in a Recommendation Network. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2006.
J. Kleinberg. Cascading Behavior in Networks: Algorithmic and Economic Issues. In Algorithmic Game Theory (N. Nisan, T. Roughgarden, E. Tardos, V. Vazirani, eds.), Cambridge University Press, 2007.
Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie S. Glance. Cost-effective Outbreak Detection in Networks. KDD 2007: 420-429.
Kempe slides: http://www.cs.washington.edu/affiliates/meetings/talks04/kempe.pdf
Leskovec slides: http://www.cs.cmu.edu/~jure/pubs/cascades-pakdd06.ppt
Adamic slides: http://www-personal.umich.edu/%7Eladamic/papers/viral/ViralMarketingW.ppt