Community Detection in Content-Sharing Social Networks

Nagarajan Natarajan
Dept. of Comp. Sci., UT Austin
Email: [email protected]

Prithviraj Sen†
IBM Research - Almaden
Email: [email protected]

Vineet Chaoji†
Amazon, Bangalore
Email: [email protected]

Abstract—Network structure and content in microblogging sites like Twitter influence each other — user A on Twitter follows user B for the tweets that B posts on the network, and A may then re-tweet the content shared by B to his/her own followers. In this paper, we propose a probabilistic model that jointly models link communities and content topics by leveraging both the social graph and the content shared by users. We model a community as a distribution over users, use it as a source for topics of interest, and jointly infer both communities and topics using Gibbs sampling. While modeling communities using the social graph, or modeling topics using content, has received a great deal of attention, only a few recent approaches try to model topics in content-sharing platforms using both the content and the social graph. Our work differs from these existing generative models in that we explicitly model the social graph of users along with the user-generated content, mimicking how the two entities co-evolve in content-sharing platforms. Recent studies have found Twitter to be more of a content-sharing network and less of a social network, and it seems hard to detect tightly knit communities from the follower-followee links alone. Still, the question of whether we can extract Twitter communities using both links and content has remained open. In this paper, we answer this question in the affirmative. Our model discovers coherent communities and topics, as evinced by qualitative results on sub-graphs of Twitter users. Furthermore, we evaluate our model on the task of predicting follower-followee links. We show that joint modeling of links and content significantly improves link prediction performance on a sub-graph of Twitter (consisting of about 0.7 million users and over 27 million tweets), compared to generative models based on only structure or only content, and to path-based methods such as Katz.

I. INTRODUCTION

Social networks have become extremely popular in the last few years. Essentially consisting of users inter-connected via links, social networks arguably owe their popularity to two features: 1) they allow users to interact with their neighbors, and 2) they double up as content-sharing platforms where users post tweets, write on each other's walls, upload photos, like videos, and engage with other users and with content of interest simply by interacting with the network. In the context of social networks, the concept of a community is very important. Roughly speaking, a community is a group of users who share similar interests, consume similar content or interact with each other in many ways. Often, a user may connect to other users who share content of interest. A link between two users increases the chances of them sharing common interests but does not necessarily imply it. Conversely, two users who do not share a link may still share interests.

† This work was done while the authors were at Yahoo! Labs, Bangalore.

Thus, to determine communities, one needs to leverage both the link structure of the network and the content consumption behavior of various users, which makes community detection a particularly challenging problem. It is also an important problem: communities give us greater insight into the structure of the network and can potentially drive a number of practical applications, such as predicting unobserved or future links between users and deploying targeted marketing campaigns to specific subsets of users for greater effectiveness.

Even though a number of existing initiatives have concentrated on leveraging both network and content in social networks [5], [6], [7], [18], [20], [21], [23], [24], [12], only a handful concentrate on discovering communities [27], [30], [34]. Furthermore, most of these efforts exclusively concentrate on email networks. Most modern-day social networks such as Twitter, Facebook and Google+ differ from email networks in a number of important ways. Take the case of Twitter, for instance. The tweets of a user u are automatically made available for consumption by all of his/her followers (users in the neighborhood of u). The main difference between an email network and Twitter is that the set of followers of u does not change on a tweet-to-tweet basis and largely remains constant, whereas, more often than not, two emails sent by the same sender in an email network are likely to have different sets of recipients. The same characteristic is also found in Facebook, where the predominant behavior is for a user to post content on a wall, and the set of users who are then able to view the post does not change on a post-to-post basis. In Google+, one can post content to a circle; the content then becomes available for consumption by all members of the circle, and thus, once again unlike email networks, the users consuming the content do not change on a post-to-post basis.

In this paper, we concentrate on discovering communities from social networks such as Twitter, Facebook and Google+. Our models take into account user-generated content, social network link structure and the fact that users consuming content do not change on a post-to-post basis. In such networks, the evidence for engagement of a user within a community is contained at two levels of abstraction: post-level and user-level. Each post can potentially contain different content (post-level evidence), and the users consuming each post are usually related to the user making the post (user-level evidence). The implication is that we need to carefully consider how to model these different sources of evidence at different levels of abstraction. It is possible to adapt the models proposed in [27], [30], [34] to our setting by modeling each post as an email and fixing the set of recipients of all posts made by user u to the same set. However, this may lead to unwanted bias in the community discovery process, since for each post by u we would need to explain the same set of recipients.

We propose a generative topic model for discovering communities from social networks, following the rich tradition of topic models for social network analysis. Our model leverages both content and network structure to discover communities. We define a community as a distribution over users and associate a community mixture with each user in the network. For generating the network links, we assume that users connect to members drawn from a mixture of communities. Besides being a distribution over users, each community is also associated with a mixture of topics. In our model, the community variable is treated as a first-class citizen and is used to explain all the evidence, at its different levels of abstraction, provided by the social network. These assumptions result in a light-weight model that enables efficient inference algorithms. Our contributions are as follows:

• We propose the Link-Content model for discovering topic-based communities in social networks.
• We propose an efficient Gibbs sampling algorithm to perform inference for the above model.
• We empirically show that our model helps infer hidden communities with coherent topics in the Twitter network.
• As an application, we show how the discovered communities and topics can be used to predict links in the social network.

II. RELATED WORK

Given the amount of work done on generative modeling for social network analysis in the last decade or so, it is impossible to give an exhaustive list of related work. In this section, we describe a few of the most closely related efforts and attempt to highlight the key differences between them and the model we propose in this paper.

One of the earliest works in community detection is [1], where each node in the network is constrained to belong to at most one community. Later extensions such as [13] removed such restrictive assumptions to allow mixed membership, besides showing how to incorporate degree bias. Other works in this area include methods based on the betweenness measure [10], methods based on modularity [26], and EM-based approaches [2]. We refer the interested reader to excellent survey articles such as [25] and [8] that describe various approaches and list their pros and cons. Also related to this area is the work on graph generation, whose goal is to design algorithms that produce real-world network structures; some of these works have proposed algorithms that generate network structures containing communities (e.g., [15]).

More recent works have concentrated on jointly modeling both content and structure for detecting communities and/or topics. For example, CUT [34] focuses on email networks and presents two models. The first model posits that an email is generated by first picking a community, then generating one of the recipients from the community, generating a topic of interest to the recipient and then generating a word in the email given the topic. The second model replaces the notion of a community being a distribution over users with a distribution over topics instead. The CUT models assume a very tight connection between the content of an email and its recipients. In CART [27], the authors incorporate the sender into the generation process, building on the ART [20] model. In ART, the generation process picks the topic for a word in the email by conditioning on the sender and one of the recipients of the email. CART introduces a community into the ART model by positing that the sender and the recipients are generated from a community. The recently proposed TURCM-2 (amongst other models in [30]) picks a topic given the sender of the email and generates the whole email given this topic. In parallel, the sender and the topic are also used to pick a community from which the set of recipients is generated. Note how TURCM-2 differs from earlier work [34], [27] by not involving any of the recipients in the generation of the email content (words). TURCM-2 is also capable of modeling different kinds of interactions (e.g., a tweet, a re-tweet or a reply-tweet in Twitter).

Even though the above approaches utilize both content and network jointly to discover communities, certain aspects of them are motivated by their focus on email networks and do not necessarily apply to social networks. More specifically, all of the above models generate, for each email, the set of recipients along with the words in the email. As discussed in the previous section, in social networks, the users viewing posted content usually do not change on a post-to-post basis and are more closely connected to the user posting the content. If one were to treat each instance of content posted on a social network as an "email", with the posting user treated as the "sender" and the users consuming the content treated as "recipients", then extending any of the above models would lead to significant computational inefficiency at best (since the set of users consuming content is usually large, thus increasing the size of the recipient set) and may also lead to unwanted bias in the generation process at worst (since we would be generating the same set of consuming users for each instance of posted content). In short, the above models are more suitable for email-like networks where the recipient list is usually small and changes on a post-to-post basis.

Joint modeling of links and content is also popular in citation networks. One of the earliest works in this area is Missing-Link [5], where the goal is to leverage both the content present in the publications that form the citation network and the citations that link publications together. Link-LDA [7] improves on Missing-Link by including appropriate priors to smooth the parameter estimation process and avoid over-fitting. Link-PLSA-LDA [23] further improves upon Link-LDA by exploiting the topical relationship between publications on either side of a citation (a hyperlink in their case, since their experiments were conducted on blog networks). Besides these, there are models that capture the flow of influence through citations [6], and others attempt to explain citations between publications by taking into account their topical similarity and the similarity between their authors' preferences over communities [18]. More recently, two topic models that incorporate features from the social graph among users for modeling topics in online social media were proposed [12].
The models are motivated by the phenomenon of "homophily" (users connect to similar users) and by "social influence" theory (connected users become similar to each other). While our model has similar motivations, our focus is on user communities, in sharp contrast to theirs, which focuses only on content topics.

The area of social network analysis has greatly benefited from the topic modeling framework ever since the first topic model, Latent Dirichlet Allocation (LDA) [3], was proposed. LDA finds latent topics in a corpus of documents, where each topic consists of words that frequently co-occur within the same documents. Following LDA, the Author-Topic model [29] showed how one could extend LDA and use the topic modeling framework to model an author's preference over topics. The aforementioned models use only the content present in the corpus. Since then, topic models have been proposed to model other forms of media besides text. In what follows, we attempt to classify the existing models for link and/or content based on the entities being modeled and observed.

Classification of Existing Models: Consider the most general model setting involving two types of entities: authors and documents. The authors connect with each other through a social graph S, which can be either symmetric (e.g., Facebook) if the relationship is reciprocal or asymmetric (e.g., Twitter) in the case of a follower-followee relationship. Similarly, a content graph is formed by documents D and (citation) links L between documents. The classification is presented in Table I.

TABLE I. A classification of state-of-the-art network and content models. Each model considers a part of a heterogeneous network composed of social (S) and content (L) networks, and user-generated content (D), to model content or links or both.

Model | Modeled Entity | Observed Entity
Latent Dirichlet Allocation [3] | D | D
Author-Recipient-Topic Model [21] | D | D, S
CUT models [34] | D | D, S
Feature Topic Model, Social Topic Model [12] | D | D, S
Stochastic block models [1], [13] | S | S
Kronecker Graphs [15] | S | S
Mixed membership models [7] | D, L | D, L
PLSA-PHITS [5] | D, L | D, L
Joint Latent Topic Models (Pairwise Link-LDA, Link-PLSA-LDA) [24] | D, L | D, L
Topic-Link LDA [18] | D, L | D, L
TURCM models [30] | D, S | D, S

III. THE LINK-CONTENT MODEL

In this section, we present our topic model for links and content in social networks. We begin with some notation, then describe the model and its inference procedure. In the last subsection, we show how to apply our model to the task of link prediction.

A. Notation and Problem Definition

Let U denote a set of users numbered 1, 2, . . . , U. Let S ⊆ U × U denote a set of edges over U. An edge u → v ∈ S indicates that user u is related to user v. Note that edges in S are directed. For instance, u → v in the case of Twitter indicates that u follows v. Thus, we can define a set of followers for each user: followers(v) = {u | u → v ∈ S}. We denote the inverse concept of friends using the shorthand Lu = {v | u → v ∈ S}.

Fig. 1. Notation used for the Link-Content model.

Observed quantities:
  S    Social network graph
  D    Docs. in the social network
  U    Users in the social network
  Du   Content uploaded by user u
  Lu   Friends of u
  V    Vocabulary size
Input parameters:
  K    Number of communities
  T    Number of topics
Hyper-parameters:
  ν    Hyper-parameter for δu
  µ    Hyper-parameter for ψk
  α    Hyper-parameter for θk
  β    Hyper-parameter for φz
Counts:
  Muk  #docs. in Du community k generates
  Mkz  #docs. community k & topic z generate
  Fuk  #friends in Lu community k generates
  Fkv  #times community k generates v
  Nzw  #times topic z generates w
Latent variables:
  δu   User u's preference over communities
  ψk   Community k's distribution over users
  θk   Topic distribution of community k
  φz   Word distribution of topic z
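To fix ideas, the notation in Fig. 1 maps naturally onto a handful of arrays. The sketch below (our own illustration in Python/NumPy; the representation and the toy sizes are assumptions, not part of the paper) fixes the shapes used throughout this section:

```python
import numpy as np

# Toy sizes; U, V, K, T as in Fig. 1.
U, V = 100, 500    # users, vocabulary size
K, T = 5, 10       # communities, topics

# Latent variables: one multinomial distribution per row (uniform init).
delta = np.full((U, K), 1.0 / K)   # delta[u]: u's preference over communities
psi   = np.full((K, U), 1.0 / U)   # psi[k]:   community k's distribution over users
theta = np.full((K, T), 1.0 / T)   # theta[k]: community k's distribution over topics
phi   = np.full((T, V), 1.0 / V)   # phi[z]:   topic z's distribution over words

# Count matrices maintained during inference (Section III-C).
M_uk = np.zeros((U, K), dtype=int)  # docs of u assigned to community k
M_kz = np.zeros((K, T), dtype=int)  # docs assigned to community k and topic z
F_uk = np.zeros((U, K), dtype=int)  # friends of u assigned to community k
F_ku = np.zeros((K, U), dtype=int)  # times community k generated user v as a friend
N_zw = np.zeros((T, V), dtype=int)  # times topic z generated word w
```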

Let D denote a set of documents numbered 1, 2, . . . , D, and let Du ⊆ D denote the set of documents shared by u ∈ U. For instance, in the case of Twitter we model each tweet as a (short) document and Du consists of all tweets posted by user u. The triple ⟨U, S, D⟩ defines a social network.

We now describe how we model communities. Let X denote a random variable with domain dom. Let Pr(X) denote a multinomial distribution over X such that ∑_{x∈dom} Pr(X = x) = 1 and ∀x ∈ dom, Pr(X = x) ≥ 0. We use the shorthand Pr(x) instead of Pr(X = x) whenever the random variable is clear from the context. We model communities as multinomial distributions over U. We use K to denote the number of communities and ψk to denote the kth community's distribution. Thus, ∀u ∈ U, ψk(u) ≥ 0 and ∑_{u=1}^{U} ψk(u) = 1. In what follows, we abbreviate ψk(u) to ψku. In the model that we propose subsequently, each community is also associated with topics of interest. The zth topic, φz, is a distribution over words in the vocabulary, and φzw denotes the probability with which word w is produced by z. T denotes the number of topics. The topics associated with the kth community are given by the distribution θk, where θkz denotes the likelihood of k using topic z. Lastly, we also associate with each u ∈ U a preference δu over the set of all communities, where δuk denotes how likely u is to choose community k.

B. The Model: Generative Semantics & Plate Notation

The Link-Content model is a generative model that jointly models social networks, including users, links and documents. An important aspect of the model is that for user u, we generate u's friends Lu instead of, for instance, u's followers (in contrast to other approaches [30]). This is because members of social networks such as Twitter usually have more control over choosing whom they want to follow than over who follows them. Thus, one would expect Lu to more accurately reflect u's preferences. We also take the posting user's characteristics into account while generating documents. Each document is generated from a single topic (instead of a mixture of topics [3]): Twitter posts are restricted to only a few characters each and, based on our experience, a single topic can explain such short documents well. This restriction can easily be removed.

The simplest way to describe the link-content model is by describing its generative semantics, depicted pictorially in Figure 2 using plate model notation [4], where nodes denote random variables and plates denote repetitions.

[Fig. 2. Plate model for the link-content model. Nodes denote random variables; the friend f and the words w are shaded, i.e., observed. Plates denote repetitions over the U users, the D documents, the L friends, the K communities and the T topics; hyper-parameters ν, µ, α, β parameterize the priors on δu, ψ, θ and φ, respectively.]

Given a social network ⟨U, S, D⟩, the link-content model treats communities as first-class citizens and uses them to generate both the documents D and the network component S. For each d ∈ Du, the link-content model chooses a community k from u's preference distribution over the communities, δu; it then generates a topic z from k's distribution over topics, θk, and finally generates the words in d from φz. In a similar fashion, for each friend v ∈ Lu, the link-content model first chooses a community k from δu and then uses the community's distribution over users, ψk, to generate v. In Figure 2, we employ hyper-parameters ν, µ, α, β to parameterize prior distributions for the latent variables δu, ψk, θk, φz, respectively. For topic models, one usually employs Dirichlet priors since, along with multinomial distributions, they form a conjugate pair, thus leading to efficient inference. The shaded nodes in Figure 2 denote observed random variables; the values of the unshaded nodes will be determined by running an inference procedure, which we explain next.

C. Inference for Link-Content Model

Two quantities of interest are the probability assigned by the model to a document d ∈ Du and the probability assigned to a friend v ∈ Lu. The former, denoted Pr(d|δu, Θ, Φ), is equal to ∑_z ∑_k Pr(d, k, z|δu, Θ, Φ), where

Pr(d, k, z|δu, Θ, Φ) = Pr(k|δu) Pr(z|θk) ∏_{w∈d} Pr(w|φz) = δuk θkz ∏_{w∈d} φzw    (1)

The latter is simply Pr(v|δu, Ψ) = ∑_k Pr(v, k|δu, Ψ), where:

Pr(v, k|δu, Ψ) = Pr(k|δu) Pr(v|ψk) = δuk ψkv    (2)
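To make the generative semantics and Equations 1 and 2 concrete, here is a minimal sketch of the generative process and of the two likelihood computations, reusing the parameter arrays sketched after Fig. 1. It is an illustration under our assumed array representation, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_user_data(u, n_docs, n_friends, doc_len, delta, psi, theta, phi):
    """Generate documents and friends for user u per the Link-Content model."""
    K, T, V = delta.shape[1], theta.shape[1], phi.shape[1]
    docs, friends = [], []
    for _ in range(n_docs):
        k = rng.choice(K, p=delta[u])                        # community from delta_u
        z = rng.choice(T, p=theta[k])                        # single topic from theta_k
        docs.append(rng.choice(V, p=phi[z], size=doc_len))   # words from phi_z
    for _ in range(n_friends):
        k = rng.choice(K, p=delta[u])                        # community from delta_u
        friends.append(rng.choice(psi.shape[1], p=psi[k]))   # friend from psi_k
    return docs, friends

def doc_likelihood(u, doc, delta, theta, phi):
    """Pr(d | delta_u, Theta, Phi) = sum_k sum_z delta_uk theta_kz prod_w phi_zw (Eq. 1)."""
    word_probs = phi[:, doc].prod(axis=1)     # prod_{w in d} phi_zw, one value per topic z
    return float(delta[u] @ theta @ word_probs)

def friend_likelihood(u, v, delta, psi):
    """Pr(v | delta_u, Psi) = sum_k delta_uk psi_kv (Eq. 2)."""
    return float(delta[u] @ psi[:, v])
```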

We can now express the full joint probability of a social network ⟨U, S, D⟩ in which each d ∈ D has been assigned a community-topic pair and each friend v ∈ Lu, for all users u ∈ U, has been assigned a community under our model. Collectively, we refer to the community assignments by K and the topic assignments by Z. After applying Equations 1 and 2 and collecting exponents,

Pr(U, S, D, K, Z, Θ, Φ, ∆, Ψ; α, β, ν, µ) ∝
    ∏_{u=1}^{U} ∏_{k=1}^{K} δuk^(Muk+Fuk+νk−1) · ∏_{k=1}^{K} ∏_{u=1}^{U} ψku^(Fku+µu−1)
  · ∏_{k=1}^{K} ∏_{z=1}^{T} θkz^(Mkz+αz−1) · ∏_{z=1}^{T} ∏_{w=1}^{V} φzw^(Nzw+βw−1)    (3)

where ∆ collectively refers to the latent variables {δu}_{u=1}^{U}. The RHS above is expressed using a system of counts (also described in Figure 1): Muk and Fuk denote the number of documents in Du and friends in Lu, respectively, that user u chose to generate using community k; Mkz denotes the number of documents in D that were generated using the community-topic pair ⟨k, z⟩; Fku denotes the number of times u was generated as a friend of some other user from community k; and Nzw counts the number of times word w was generated using topic z across the whole corpus D. Note the overloading of the count symbols M and F, but the subscripts and the order of the indices in the subscript (uk or ku in the case of F) should make it clear which count we are referring to. Equation 3 can be further simplified by integrating over Θ, Ψ, ∆, Φ:

Pr(U, S, D, K, Z; α, β, ν, µ) =
    [∏_{u=1}^{U} ∏_k Γ(Muk + Fuk + νk) / Γ(Mu· + Fu· + ∑_k νk)]    (common term)
  × [∏_{k=1}^{K} ∏_u Γ(Fku + µu) / Γ(Fk· + ∑_u µu)]    (link-specific term)
  × [∏_{k=1}^{K} ∏_z Γ(Mkz + αz) / Γ(Mk· + ∑_z αz)] × [∏_{z=1}^{T} ∏_w Γ(Nzw + βw) / Γ(Nz· + ∑_w βw)]    (content-specific term)    (4)
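Since Equation 4 involves ratios of Gamma functions over potentially large counts, any implementation would evaluate it in log space. Below is a minimal sketch using scipy.special.gammaln; symmetric scalar hyper-parameters and the NumPy count arrays from the earlier sketch are our assumptions:

```python
from scipy.special import gammaln

def log_evidence(M_uk, M_kz, F_uk, F_ku, N_zw, nu, mu, alpha, beta):
    """log Pr(U, S, D, K, Z; alpha, beta, nu, mu) up to an additive constant (Eq. 4).

    Counts are NumPy arrays shaped as in Fig. 1; nu, mu, alpha, beta are scalars."""
    # Common term: per-user community usage over documents and friends.
    A = M_uk + F_uk + nu
    common = gammaln(A).sum() - gammaln(A.sum(axis=1)).sum()
    # Link-specific term: per-community friend generation.
    B = F_ku + mu
    link = gammaln(B).sum() - gammaln(B.sum(axis=1)).sum()
    # Content-specific terms: per-community topics and per-topic words.
    C = M_kz + alpha
    D = N_zw + beta
    content = (gammaln(C).sum() - gammaln(C.sum(axis=1)).sum()
               + gammaln(D).sum() - gammaln(D.sum(axis=1)).sum())
    return common + link + content
```

A sampler can use such a quantity to monitor convergence across Gibbs iterations or to compare runs with different settings of K and T.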

where Γ(x) denotes the Gamma function, which is the generalization of the factorial function to real numbers.

There are many choices of inference algorithms for topic models. We use Gibbs sampling [9] due to its accuracy, relative efficiency and simplicity. To set up Gibbs sampling, one needs to derive conditional probability expressions that explicitly show how one assignment depends on all the other assignments in the social network. For brevity, we skip the detailed derivation and direct the reader to other works that describe the standard machinery (e.g., [11]). Algorithm 1 describes Gibbs sampling inference for the link-content model. There are three distinct phases: initialization (with random assignments), burn-in and sample collection. Lines 16 and 19 denote the sampling steps during inference. In the link-content model, there are two distinct entities for which we need to produce assignments: assigning a community k to each friend v ∈ Lu for each user u ∈ U (line 16 in Algorithm 1) and assigning a community-topic pair ⟨k, z⟩ to each document d ∈ D (line 19 in Algorithm 1).

The conditional probability expression for sampling a community k for v ∈ Lu is the product of two terms: the first term measures how often k was used to generate either documents posted by u or other friends of u, and the second term measures how often v was generated from community k as a friend of other users in the network. Note that a superscript "−" in the terminology used in Algorithm 1 represents counts computed without including the entities being sampled for. For instance, in line 16 of Algorithm 1, F⁻uk denotes the number of times community k was used to generate a friend of u without counting the current friend v ∈ Lu for whom we are sampling. Also, a "·" in the subscript denotes an index that is summed over, e.g., F⁻u· = ∑_{k′=1}^{K} F⁻uk′. The conditional probability expression for assigning a community-topic pair ⟨k, z⟩ to document d is a product of three terms: the first term measures how often k was used to generate other documents posted by u or friends of u, the second term measures how often topic z was generated from community k, and the third term measures how often the words in d were generated by topic z.

Algorithm 1: Gibbs sampling for Link-Content

 1  /* Initialize */
 2  for u ∈ U do
 3      for v ∈ Lu do
 4          k ∼ uniform[1 . . . K]
 5          Assign community k to link u → v
 6      for d ∈ Du do
 7          k ∼ uniform[1 . . . K]
 8          z ∼ uniform[1 . . . T]
 9          Assign community-topic pair ⟨k, z⟩ to d
10  /* Burn-in */
11  I ← number of burn-in iterations
12  i ← 0
13  while i < I do
14      for u ∈ U do
15          for v ∈ Lu do
16              k ∼ (M⁻uk + F⁻uk + νk) / (M⁻u· + F⁻u· + ∑_{k′} νk′) · (F⁻kv + µv) / (F⁻k· + ∑_{u′} µu′)
17              Assign community k to link u → v
18          for d ∈ Du do
19              ⟨k, z⟩ ∼ (M⁻uk + F⁻uk + νk) / (M⁻u· + F⁻u· + ∑_{k′} νk′) · (M⁻kz + αz) / (M⁻k· + ∑_{z′} αz′) · ∏_{w∈d} ∏_{i=1}^{ndw} (N⁻zw + i − 1 + βw) / (N⁻z· + ∑_{w′} βw′ + i − 1)
20              Assign community-topic pair ⟨k, z⟩ to d
21      i ← i + 1
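For concreteness, the following is a compact Python sketch of one burn-in sweep of Algorithm 1, operating on the count arrays introduced earlier. It is a minimal illustration under stated assumptions (symmetric scalar hyper-parameters; factors that are constant in the sampled variable are dropped since they cancel on normalization), not the authors' implementation:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def gibbs_sweep(friends, docs, link_k, doc_kz,
                M_uk, M_kz, F_uk, F_ku, N_zw, nu, mu, alpha, beta):
    """One burn-in sweep of Algorithm 1 (lines 14-20) over all users.

    friends[u]: list of friend ids; docs[u]: list of word-id arrays;
    link_k[u][j] and doc_kz[u][j]: current assignments; counts as in Fig. 1."""
    U, K = M_uk.shape
    T, V = N_zw.shape
    for u in range(U):
        for j, v in enumerate(friends[u]):                  # line 16
            k0 = link_k[u][j]
            F_uk[u, k0] -= 1; F_ku[k0, v] -= 1              # "-" counts: exclude this link
            p = ((M_uk[u] + F_uk[u] + nu)                   # how often u uses community k
                 * (F_ku[:, v] + mu)                        # how often k generates v
                 / (F_ku.sum(axis=1) + U * mu))
            k = rng.choice(K, p=p / p.sum())
            link_k[u][j] = k
            F_uk[u, k] += 1; F_ku[k, v] += 1
        for j, d in enumerate(docs[u]):                     # line 19
            k0, z0 = doc_kz[u][j]
            M_uk[u, k0] -= 1; M_kz[k0, z0] -= 1             # exclude this document
            for w in d:
                N_zw[z0, w] -= 1
            # Word term in log space; the incremental counts handle repeated words.
            logpw, seen = np.zeros(T), Counter()
            Nz = N_zw.sum(axis=1)
            for i, w in enumerate(d):
                logpw += np.log(N_zw[:, w] + beta + seen[w])
                logpw -= np.log(Nz + V * beta + i)
                seen[w] += 1
            logp = (np.log(M_uk[u] + F_uk[u] + nu)[:, None]              # community term
                    + np.log((M_kz + alpha)
                             / (M_kz.sum(axis=1, keepdims=True) + T * alpha))  # topic term
                    + logpw[None, :])                                    # word term
            p = np.exp(logp - logp.max()).ravel()
            idx = rng.choice(K * T, p=p / p.sum())
            k, z = divmod(idx, T)
            doc_kz[u][j] = (k, z)
            M_uk[u, k] += 1; M_kz[k, z] += 1
            for w in d:
                N_zw[z, w] += 1
```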
