Minimizing Seed Set for Viral Marketing

2011 11th IEEE International Conference on Data Mining

Cheng Long, Raymond Chi-Wing Wong
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
{clong, raywong}@cse.ust.hk

Abstract—Viral marketing has attracted considerable attention in recent years due to its novel idea of leveraging the social network to propagate the awareness of products. Specifically, viral marketing first targets a limited number of users (seeds) in the social network by providing incentives, and these targeted users then initiate the process of awareness spread by propagating the information to their friends via their social relationships. Extensive studies have been conducted on maximizing the awareness spread given the number of seeds. However, all of them fail to consider the common scenario of viral marketing where companies hope to use as few seeds as possible yet influence at least a certain number of users. In this paper, we propose a new problem, called J-MIN-Seed, whose objective is to minimize the number of seeds while at least J users are influenced. J-MIN-Seed, unfortunately, is proved to be NP-hard in this work. Given this, we develop a greedy algorithm that can provide error guarantees for J-MIN-Seed. Furthermore, for the problem setting where J is equal to the number of all users in the social network, denoted by Full-Coverage, we design other efficient algorithms. Extensive experiments were conducted on real datasets to verify our algorithms.

Fig. 1. Social network (IC model): four family members Ada, Bob, Connie and David connected by weighted directed edges.

Fig. 2. Counter example (α(·)): four nodes n1, n2, n3 and n4, where every edge has weight 1.
I. INTRODUCTION

Viral marketing is an advertising strategy that takes advantage of the "word-of-mouth" effect among the relationships of individuals to promote a product. Instead of covering massive numbers of users directly as traditional advertising methods [1] do, viral marketing targets a limited number of initial users (by providing incentives) and utilizes their social relationships, such as friends, families and co-workers, to further spread the awareness of the product among individuals. Each individual who gets the awareness of the product is said to be influenced. The number of all influenced individuals corresponds to the influence incurred by the initial users. According to some recent research studies [2], people tend to trust information from their friends, relatives or families more than that from general advertising media such as TV. Hence, viral marketing is believed to be one of the most effective marketing strategies [3]. In fact, many commercial instances of viral marketing have succeeded in real life. For example, Nike Inc. used social networking websites such as orkut.com and facebook.com to market products successfully [4].

The propagation process of viral marketing within a social network can be described as follows. At the beginning, the advertiser selects a set of initial users and provides these users with incentives so that they are willing to initiate the awareness spread of the product in the social network. We call these initial users seeds. Once the propagation is initiated, the information of the product diffuses or spreads via the relationships among users in the social network. Many models of how this diffusion process works have been proposed [5-10]. Among them, the Independent Cascade Model (IC model) [7, 8] and the Linear Threshold Model (LT model) [5, 6] are the two that are most widely used in the literature. In a social network, the IC model simulates the situation where, for each influenced user u, each of u's neighbors has a probability of being influenced by u, while the LT model captures the phenomenon where each user's tendency to become influenced increases as more of its neighbors become influenced.

Consider the following scenario of viral marketing. A company wants to advertise a new product via viral marketing within a social network. Specifically, it hopes that at least a certain number of users, say J, in the social network are influenced, yet the number of seeds for viral marketing should be as small as possible. The above problem can be formalized as follows. Given a social network G(V, E), we want to find a set of seeds such that the size of the seed set is minimized and at least J users are influenced at the end of viral marketing. We call this problem J-MIN-Seed.

We use Figure 1 to illustrate the main idea of the J-MIN-Seed problem. The four nodes shown in Figure 1 represent four members of a family, namely Ada, Bob, Connie and David. In the following, we use the terms "nodes" and "users" interchangeably since they correspond to the same concept. The directed edge (u, v) with weight w_{u,v} indicates that node u has probability w_{u,v} of influencing node v with the awareness of the product. Now, we want to find the smallest seed set such that at least 3 nodes are influenced. It is easy to verify that the expected influence incurred by seed set {Ada} is about 3.57 under the IC model (the expected influence incurred by a seed is computed by considering all cascades from this seed; e.g., the expected influence on Bob incurred by Ada is 1 − (1 − 0.8) · (1 − 0.6 · 0.7) = 0.884), and no smaller seed set can incur at least 3 influenced nodes. Hence, seed set {Ada} is our solution.

J-MIN-Seed can be applied to most (if not all) applications of viral marketing. Intuitively, J-MIN-Seed asks for the minimum cost (seeds) while satisfying an explicit requirement on revenue (influenced nodes). In the mechanism of viral marketing, a seed and an influenced node correspond to a cost and potential revenue of a company, respectively, because the company has to pay the seeds for incentives, while an influenced node might bring revenue to the company. In many cases, companies face the situation where the revenue goal has been set explicitly and the cost should be minimized. Thus, J-MIN-Seed meets these companies' demands.

Another area where J-MIN-Seed can be widely used is the "majority-decision rule" (e.g., the three-fifths majority rule in the US Senate). By majority-decision rule, we mean the principle under which a decision is determined by the majority (or a certain portion) of participants. That is, in order to affect a group of people to make a decision, e.g., purchasing our products, we only need to convince a certain number of members of this group, say J, which is the threshold of the number of people needed to agree on the decision. Clearly, for these kinds of applications, J-MIN-Seed could be used to affect the decision of the whole group at minimum cost. In fact, J-MIN-Seed is particularly useful in election campaigns where the "majority-decision rule" is adopted.

No existing studies have been conducted on J-MIN-Seed even though it plays an essential role in the viral marketing field. In fact, most existing studies related to viral marketing focus on maximizing the influence incurred by a certain number of seeds, say k [11-16]. Specifically, they aim at maximizing the number of influenced nodes when only k seeds are available. We denote this problem by k-MAX-Influence. Clearly, J-MIN-Seed and k-MAX-Influence have different goals with different given resources.

Naïvely, we can solve the J-MIN-Seed problem by adapting an existing algorithm for k-MAX-Influence. Let k be the number of seeds. We set k = 1 at the beginning and increment k by 1 at the end of each iteration. In each iteration, we use an existing algorithm for k-MAX-Influence to calculate the maximum number of nodes, denoted by I, that can be influenced by a seed set of size k. If I ≥ J, we stop and return the current number k. Otherwise, we increment k by 1 and perform the next iteration (a sketch of this adaptation is given below). However, this naïve method is very time-consuming since it invokes the existing algorithm for k-MAX-Influence many times in order to solve J-MIN-Seed. Note that k-MAX-Influence is NP-hard [12]. Any existing algorithm for k-MAX-Influence is computationally expensive, which leaves this naïve method with a high computation cost. Hence, we should resort to other more efficient solutions.

In this paper, J-MIN-Seed is, unfortunately, proved to be NP-hard. Motivated by this, we design an approximate (greedy) algorithm for J-MIN-Seed. Specifically, our algorithm iteratively adds to a seed set the node that generates the greatest influence gain until the influence incurred by the seed set is at least J. Besides, we work out an additive error bound and a multiplicative error bound for this greedy algorithm.
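For illustration only, the following is a minimal Python sketch of the naïve adaptation just described. The oracle max_influence(G, k) is hypothetical: it stands for any existing k-MAX-Influence algorithm that returns the maximum (expected) influence achievable with k seeds.

def naive_jmin_seed(G, J, max_influence):
    # Naive adaptation of a k-MAX-Influence algorithm to J-MIN-Seed:
    # try k = 1, 2, ... until k seeds suffice to influence at least J nodes.
    # Every call to max_influence is expensive, which is why this baseline
    # is time-consuming (see the discussion above).
    k = 1
    while True:
        I = max_influence(G, k)   # best influence achievable with k seeds
        if I >= J:
            return k
        k += 1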

In some cases, companies would set the parameter J of J-MIN-Seed to the total number of users in the underlying social network, since they want to influence as many users as possible. Motivated by this, we further discuss our J-MIN-Seed problem under the special setting where J = |V| (the total number of users). We call this special instance of J-MIN-Seed Full-Coverage, and we design other efficient algorithms for it.

We summarize our contributions as follows. Firstly, to the best of our knowledge, we are the first to propose the J-MIN-Seed problem, which is a fundamental problem in viral marketing. Secondly, we prove that J-MIN-Seed is NP-hard. Given this, we develop a greedy algorithm framework for J-MIN-Seed which, fortunately, comes with guarantees on the approximation error. Thirdly, for the Full-Coverage problem (i.e., J-MIN-Seed where J = |V|), we observe some interesting properties and thus design other efficient algorithms. Finally, we conducted extensive experiments which verified our algorithms.

The rest of the paper is organized as follows. Section II covers the related work of our problem, while Section III provides the formal definition of the J-MIN-Seed problem and some relevant properties. We show how to calculate the influence incurred by a seed set in Section IV, which is followed by Section V discussing our greedy algorithm framework. In Section VI, we discuss the Full-Coverage problem. We present our empirical studies in Section VII and conclude the paper in Section VIII.

II. RELATED WORK

In Section II-A, we discuss two widely used diffusion models in a social network, and in Section II-B, we review related work on the influence maximization problem.

A. Diffusion Models

Given a social network represented as a directed graph G, we denote by V the set containing all the nodes of G, each of which corresponds to a user, and by E the set containing all the directed edges of G. Each edge e ∈ E of the form (u, v) is associated with a weight w_{u,v} ∈ [0, 1]. Different diffusion models attach different meanings to the weights. In the following, we discuss the meanings for two popular diffusion models, namely the Independent Cascade (IC) model and the Linear Threshold (LT) model.

1) Independent Cascade (IC) Model [7, 8]: The first model is the Independent Cascade (IC) model. In this model, the influence is based on how a single node influences each of its neighbors individually. The weight w_{u,v} of an edge (u, v) corresponds to the probability that node u influences node v. Let S_0 be the initial set of influenced nodes (the seeds in our problem). The diffusion process involves a number of steps, where each step corresponds to the influence spread from some influenced nodes to other non-influenced nodes. At step t, all nodes influenced at step t − 1 remain influenced, and each node that became influenced at step t − 1 for the first time has one chance to influence its non-influenced neighbors.

428

Specifically, when an influenced node u attempts to influence its non-influenced neighbor v, the probability that v becomes influenced is equal to w_{u,v}. The propagation process halts at step t if no nodes became influenced at step t − 1. The running example in Figure 1 is based on the IC model. For a graph under the IC model, we say that the graph is deterministic if all its edges have probabilities equal to 1. Otherwise, we say that it is probabilistic.

2) Linear Threshold (LT) Model [5, 6]: The second model is the Linear Threshold (LT) model. In this model, the influence is based on how a single node is influenced by its multiple neighbors together. The weight w_{u,v} of an edge (u, v) corresponds to the relative strength with which node v is influenced by its neighbor u (among all of v's neighbors). Moreover, for each v ∈ V, it holds that Σ_{(u,v)∈E} w_{u,v} ≤ 1. The dynamics of the process proceed as follows. Each node v selects a threshold value θ_v from the range [0, 1] uniformly at random. As in the IC model, let S_0 be the set of initially influenced nodes. At step t, each non-influenced node v for which the total weight of the edges from its influenced neighbors reaches its threshold (i.e., Σ_{(u,v)∈E and u is influenced} w_{u,v} ≥ θ_v) becomes influenced. The spread process terminates when no more influence spread is possible. For a graph under the LT model, we say that the graph is deterministic if the thresholds of all its nodes have been set before the process of influence spread. Otherwise, we say that it is probabilistic.
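For concreteness, the following is a minimal Python sketch of one run of each diffusion process. The representation is an assumption made only for this sketch: succ maps a node to its out-neighbors, pred maps a node to its in-neighbors, and weights maps an edge (u, v) to w_{u,v}.

import random

def simulate_ic(succ, weights, seeds, rng=random):
    # One random cascade under the IC model: every newly influenced node u
    # gets a single chance to influence each non-influenced out-neighbor v,
    # succeeding with probability weights[(u, v)].
    influenced = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly = []
        for u in frontier:
            for v in succ.get(u, []):
                if v not in influenced and rng.random() < weights[(u, v)]:
                    influenced.add(v)
                    newly.append(v)
        frontier = newly
    return influenced

def simulate_lt(pred, weights, seeds, rng=random):
    # One random diffusion under the LT model: each node draws a threshold
    # uniformly from [0, 1]; a node becomes influenced once the total weight
    # of edges from its influenced in-neighbors reaches its threshold.
    theta = {v: rng.random() for v in pred}
    influenced = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, in_nbrs in pred.items():
            if v in influenced:
                continue
            if sum(weights[(u, v)] for u in in_nbrs if u in influenced) >= theta[v]:
                influenced.add(v)
                changed = True
    return influenced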

B. Influence Maximization

Motivated by the fact that social networks play a fundamental role in spreading ideas, innovations and information, Domingos and Richardson proposed to use social networks for marketing purposes, which is called viral marketing [11, 17]. By viral marketing, they aimed at selecting a limited number of seeds such that the influence incurred by these seeds is maximized. We call this fundamental problem the influence maximization problem.

In [12], Kempe et al. formalized the above influence maximization problem for the first time as a discrete optimization problem called k-MAX-Influence: given a social network G(V, E) and an integer k, find k seeds such that the incurred influence is maximized. Kempe et al. proved that k-MAX-Influence is NP-hard for both the IC model and the LT model, and they provided a (1 − 1/e)-approximation (greedy) algorithm for k-MAX-Influence.

Recently, several studies have been conducted to solve k-MAX-Influence in a more efficient and/or scalable way than the aforementioned approximation algorithm in [12]. Specifically, in [13], Leskovec et al. employed a "lazy-forward" strategy to select seeds, which has been shown to be effective in reducing the cost of evaluating the influence of nodes. In [14], Kimura et al. proposed a new shortest-path cascade model, based on which they developed efficient algorithms for k-MAX-Influence. Motivated by the non-scalability of all the aforementioned solutions for k-MAX-Influence, Chen et al. proposed an arborescence-based heuristic algorithm, which was verified to be quite scalable to large-scale social networks [15]. The influence maximization problem has also been extended to the setting with multiple products instead of a single product. Bharathi et al. solved the influence maximization problem for multiple competitive products using game-theoretical methods [18, 19], while Datta et al. proposed the influence maximization problem for multiple non-competitive products [16]. Apart from these studies aiming at maximizing the influence, considerable effort has been devoted to diffusion models in social networks [9, 10].

Clearly, most existing studies related to viral marketing aim at maximizing the influence incurred by a limited number of seeds (i.e., k-MAX-Influence), while our problem, J-MIN-Seed, aims to minimize the number of seeds while satisfying the requirement of influencing at least a certain number of users in the social network. As discussed in Section I, a naïve adaptation of any existing algorithm for k-MAX-Influence is time-consuming.

III. PROBLEM

We first formalize J-MIN-Seed in Section III-A. In Section III-B, we provide several properties related to J-MIN-Seed.

A. Problem Definition

Given a set S of seeds, we define the influence incurred by the seed set S (or simply the influence of S), denoted by σ(S), to be the expected number of nodes influenced during the diffusion process initiated by S. How to calculate σ(S) under different diffusion models for a given S will be discussed in Section IV.

Problem 1 (J-MIN-Seed): Given a social network G(V, E) and an integer J, find a set S of seeds such that the size of the seed set is minimized and σ(S) ≥ J.

We say that a node u is covered by seed set S if u is influenced during the influence diffusion process initiated by S. It is easy to see that J-MIN-Seed aims at minimizing the number of seeds while satisfying the requirement of covering at least J nodes. Given a node x in V and a subset S of V, the marginal gain of inserting x into S, denoted by G_x(S), is defined to be σ(S ∪ {x}) − σ(S).

We show the hardness of J-MIN-Seed with the following theorem.

Theorem 1: The J-MIN-Seed problem is NP-hard for both the IC model and the LT model.

Proof: The proof can be found in [20].

B. Properties

Since the analysis of the error bounds of our approximate algorithms (to be discussed later) is based on the property that the function σ(·) is submodular, we first briefly introduce the concept of a submodular function, denoted by f(·). After that, we provide several properties related to the influence diffusion process in a social network.

Definition 1 (Submodularity): Let U be a universe set of elements and S be a subset of U.


A function f(·) that maps S to a non-negative value is said to be submodular if, given any S ⊆ U and any T ⊆ U where S ⊆ T, it holds for any element x ∈ U − T that f(S ∪ {x}) − f(S) ≥ f(T ∪ {x}) − f(T). In other words, we say f(·) is submodular if it satisfies the "diminishing marginal gain" property: the marginal gain of inserting a new element into a set T is at most the marginal gain of inserting the same element into a subset of T.

According to [12], the function σ(·) is submodular for both the IC model and the LT model. The main idea is as follows. When we add a new node x to a seed set S, the influence incurred by the node x (without considering the nodes in S) might overlap with that incurred by S. The larger S is, the more overlap can happen. Hence, the marginal gain on a (larger) set is smaller than that on any of its subsets. We formalize this statement with the following Property 1; the proof can be found in [12].

Property 1: The function σ(·) is submodular for both the IC model and the LT model.

To illustrate the concept of submodular functions, consider Figure 1. Assume that a seed set T is {Ada}, and let a subset S of T be ∅. We insert the same node Bob into both seed sets T and S. It is easy to calculate that σ(∅) = 0, σ({Ada}) = 3.57, σ({Bob}) = 2.64 and σ({Ada, Bob}) = 3.83. Consequently, the marginal gain of adding the new node Bob to set T, i.e., σ({Ada, Bob}) − σ({Ada}) = 0.26, is smaller than that of adding Bob to its subset S, i.e., σ({Bob}) − σ(∅) = 2.64.

In the k-MAX-Influence problem, we have a submodular function σ(·) which takes a set of seeds as input and returns the expected number of influenced nodes incurred by the seed set as output. Similarly, in the J-MIN-Seed problem, we can define a function α(·) which takes a set of nodes to be influenced as input and returns the smallest number of seeds needed to influence these nodes as output. One may ask: is the function α(·) also submodular? Unfortunately, the answer is "no", which is formalized with the following Property 2.

Property 2: The function α(·) is not submodular for either the IC model or the LT model.

Proof: We prove Property 2 by constructing a problem instance where α(·) does not satisfy the aforementioned conditions of a submodular function. We first discuss the case of the IC model. Consider the example shown in Figure 2. In this figure, there are four nodes, namely n1, n2, n3 and n4. We assume that each edge has weight equal to 1, which means that an influenced node u will definitely influence a non-influenced node v when there is an edge from u to v. Let set T be {n1, n3, n4} and a subset of T, say S, be {n3, n4}. Obviously, when node n1 is influenced, it will further influence node n3 and node n4, i.e., all the nodes in T will be influenced when n1 is selected as a seed. Thus, α(T) = 1. Similarly, we know that α(S) = 1. Now, we add node n2 to both T and S and obtain α(T ∪ {n2}) = 2 (by the seed set {n1, n2}) and α(S ∪ {n2}) = 1 (by the seed set {n2}). As a result, we know that α(T ∪ {n2}) − α(T) = 1 > α(S ∪ {n2}) − α(S) = 0, which violates the conditions of a submodular function.

Next, we discuss the case of the LT model. Consider the special case where each node's threshold is equal to a value slightly greater than 0. Consequently, a node will be influenced whenever one of its neighbors becomes influenced. The resulting diffusion process is identical to the special case of the IC model where the weights of all edges are 1. That is, the example in Figure 2 also applies to the LT model. Hence, Property 2 also holds for the LT model.

Property 2 suggests that we cannot directly adapt existing techniques for the k-MAX-Influence problem (which involves a submodular objective function) to our J-MIN-Seed problem (which involves a non-submodular objective function).

IV. INFLUENCE CALCULATION

We describe how we compute the influence of a given seed set (i.e., σ(·)) under the IC model (Section IV-A) and the LT model (Section IV-B).

A. IC Model

It has been proved in [15] that calculating the influence of a given seed set under the IC model is #P-hard. That is, computing the exact influence is hard, and we have to resort to approximate algorithms for efficiency. Intuitively, the hardness of calculating the influence is due to the fact that the edges in a social network under the IC model are probabilistic, in the sense that the propagation of influence via an edge happens with some probability. In contrast, when the social network is deterministic, i.e., the probability associated with each edge is exactly 1, we only need to traverse the graph from each seed in a breadth-first manner and return all visited nodes as the influenced nodes incurred by the seed set, resulting in a linear-time algorithm for influence calculation.

In view of the above discussion, we use sampling to calculate the (approximate) influence as follows. Let G(V, E) be the original probabilistic social network and S be the seed set. Instead of calculating the influence on G directly, we calculate the influence on each graph sampled from G using the same seed set S, and finally average the influences incurred on all sampled graphs to obtain the approximation for the original probabilistic graph. To obtain a sampled graph of G(V, E), we keep the node set V unchanged, remove each edge (u, v) ∈ E with probability 1 − w_{u,v}, and assign each remaining edge weight 1. In this way, the probability that an edge (u, v) remains in the resulting graph is w_{u,v}. Note that the resulting sampled graph is deterministic. We call this process social network sampling.

Conceptually, given a probabilistic graph G(V, E), each sampled graph of G is generated with a certain probability. As a result, the influence calculated on each sampled graph of G has a specific probability of being equal to the exact influence on the original probabilistic graph G for the same seed set. That is, the exact influence for G is the expected influence over the sampled graphs of G. Based on this,


we can use Hoeffding's Inequality to analyze the error incurred by our sampling method. We state the result in the following Lemma 1.

Lemma 1: Let c be a real number between 0 and 1. Given a seed set S and a social network G(V, E) under the IC model, the sampling method stated above achieves a (1 ± ε)-approximation of the influence incurred by S on G with confidence at least c by performing the sampling process at least (|V| − 1)^2 · ln(2/(1 − c)) / (2 · ε^2 · |S|^2) times.

Proof: The proof can be found in our technical report [20].
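As an illustrative sketch only (using the same assumed adjacency-dict representation as the sketch in Section II-A), the social network sampling procedure and the resulting influence estimate can be written as follows.

import random

def sample_live_edge_graph(succ, weights, rng=random):
    # Social network sampling for the IC model: keep edge (u, v) with
    # probability weights[(u, v)]; the result is a deterministic graph.
    return {u: [v for v in nbrs if rng.random() < weights[(u, v)]]
            for u, nbrs in succ.items()}

def reachable(live, seeds):
    # Influence on a deterministic graph = all nodes reachable from the seeds.
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in live.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def estimate_influence_ic(succ, weights, seeds, n_samples=10000, rng=random):
    # sigma(S) approximated by averaging the influence over sampled graphs.
    total = 0
    for _ in range(n_samples):
        total += len(reachable(sample_live_edge_graph(succ, weights, rng), seeds))
    return total / n_samples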

B. LT Model

Similar to the case of the IC model, the influence calculation for the LT model is much easier when the graph is deterministic (i.e., the threshold of each node has been specified before the process of influence spread). We illustrate the main idea as follows. For an influenced node u, all we need to do is add the corresponding influence to each of its non-influenced neighbors and check whether each such neighbor, say v, has received enough influence (θ_v) to become influenced. If so, we mark node v as influenced. Otherwise, we leave node v non-influenced. At the beginning, we initialize the set of influenced nodes to be the seed set S. Then, we perform the above process for each influenced node until no new influenced nodes are generated.

In view of the above discussion, we perform the influence calculation on a probabilistic graph under the LT model in the same way as for the IC model, except for the sampling method. Specifically, to sample a probabilistic graph G under the LT model, we pick a real number from the range [0, 1] uniformly at random as the threshold of each node in G, which forms a deterministic graph. We perform the sampling process multiple times, and for each resulting deterministic graph, we run the algorithm for deterministic graphs (just described above) to obtain the incurred influence. Finally, we average the influences over all sampled graphs to obtain the approximate influence. Clearly, we can derive a lemma similar to Lemma 1 for the LT model.
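The corresponding estimator for the LT model is sketched below; it reuses the simulate_lt helper from the sketch in Section II-A, which already draws a fresh threshold assignment on every call (again, the names are illustrative).

import random

def estimate_influence_lt(pred, weights, seeds, n_samples=10000, rng=random):
    # sigma(S) under the LT model, approximated by averaging over sampled
    # threshold assignments (each simulate_lt call samples new thresholds).
    total = sum(len(simulate_lt(pred, weights, seeds, rng)) for _ in range(n_samples))
    return total / n_samples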

V. GREEDY ALGORITHM

We present in Section V-A the framework of our greedy algorithm, which finds a seed set by adding one seed to the seed set per iteration. Section V-B provides the analysis of this algorithm framework, while Section V-C discusses two different implementations of it.

A. Algorithm Framework

As proved in Section III, J-MIN-Seed is NP-hard, so no efficient exact algorithm for J-MIN-Seed is expected to exist. As discussed in Section I, a naïve adaptation of any existing algorithm originally designed for k-MAX-Influence is time-consuming. The major reason is that it executes an existing algorithm many times, and the execution of this algorithm in one iteration is independent of its execution in the next iteration. Motivated by this observation, we propose a greedy algorithm which solves J-MIN-Seed efficiently by executing each iteration based on the results of the previous iteration. Specifically, we first initialize a seed set S to be an empty set. Then, we select the non-seed node u whose marginal gain of being inserted into S is the greatest, and we insert u into S. We repeat the above steps until at least J nodes are influenced. Algorithm 1 presents this greedy algorithm framework.

Algorithm 1 Greedy Algorithm Framework
Input: G(V, E): a social network; J: the required number of nodes to be influenced
Output: S: a seed set
1: S ← ∅
2: while σ(S) < J do
3:   u ← arg max_{x ∈ V − S} (σ(S ∪ {x}) − σ(S))
4:   S ← S ∪ {u}
5: return S

This greedy algorithm is similar to the algorithm from [12] for k-MAX-Influence except for the stopping criterion, but they have different theoretical results. The stopping criterion in this greedy algorithm is σ(S) ≥ J, while the stopping criterion in the algorithm from [12] is |S| ≥ k, where k is a user parameter of k-MAX-Influence. Note that our greedy algorithm for J-MIN-Seed has theoretical results which guarantee the number of seeds used, whereas the algorithm for k-MAX-Influence has theoretical results which guarantee the number of influenced nodes.
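A direct transcription of Algorithm 1 into Python is given below for illustration; sigma can be any influence estimator, e.g., one of the sampling-based estimators sketched in Section IV.

def greedy_jmin_seed(nodes, sigma, J):
    # Algorithm 1: repeatedly add the node with the largest marginal gain
    # until the (estimated) influence of the seed set reaches J.
    # nodes is the set V; sigma maps a seed set to its influence.
    S = set()
    current = sigma(S)
    while current < J:
        best, best_val = None, current
        for x in nodes - S:
            val = sigma(S | {x})
            if val > best_val:
                best, best_val = x, val
        if best is None:
            break          # safety guard: no node increases the influence
        S.add(best)
        current = best_val
    return S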

B. Theoretical Analysis

In this part, we show that the greedy algorithm framework in Algorithm 1 returns a seed set with both an additive error guarantee and a multiplicative error guarantee.

The greedy algorithm gives the following additive error bound.

Lemma 2 (Additive Error Guarantee): Let h be the size of the seed set returned by the greedy algorithm framework in Algorithm 1 and t be the size of the optimal seed set for J-MIN-Seed. The greedy algorithm framework in Algorithm 1 gives an additive error bound equal to 1/e · J + 1. That is, h − t ≤ 1/e · J + 1. Here, e is the base of the natural logarithm.

Before we give the multiplicative error bound of the greedy algorithm, we first introduce some notation. Suppose that the greedy algorithm terminates after h iterations. We denote by S_i the seed set maintained by the greedy algorithm at the end of iteration i, where i = 1, 2, ..., h. Let S_0 denote the seed set maintained before the greedy algorithm starts (i.e., an empty set). Note that σ(S_i) < J for i = 1, 2, ..., h − 1 and σ(S_h) ≥ J. In the following, we give the multiplicative error bound of the greedy algorithm framework in Algorithm 1.

Lemma 3 (Multiplicative Error Guarantee): Let σ′(S) = min{σ(S), J}. The greedy algorithm framework in Algorithm 1 is a (1 + min{k1, k2, k3})-approximation of J-MIN-Seed, where
k1 = ln( J / (J − σ′(S_{h-1})) ),
k2 = ln( σ′(S_1) / (σ′(S_h) − σ′(S_{h-1})) ), and
k3 = ln( max{ σ′({x}) / (σ′(S_i ∪ {x}) − σ′(S_i)) : x ∈ V, 0 ≤ i ≤ h, σ′(S_i ∪ {x}) − σ′(S_i) > 0 } ).

C. Implementations

As can be seen, the efficiency of Algorithm 1 relies on the calculation of the influence of a given seed set (the operator σ(·)). However, the influence calculation for the IC model is #P-hard [15]. Under this circumstance, we adopt the sampling method discussed in Section IV whenever the operator σ(·) is used. We denote this implementation by Greedy1.

In fact, we have an alternative implementation of Algorithm 1 as follows. Instead of sampling the social network every time the influence of a given seed set is calculated, we can sample the social network to generate a certain number of deterministic graphs only once, at the beginning. Then, we solve the J-MIN-Seed problem on each such deterministic graph using Algorithm 1, where the cost of the operator σ(·) simply becomes the time to traverse the graph. At the end, we return the average of the sizes of the seed sets returned by the algorithm on all samples (deterministic graphs). We call this alternative implementation Greedy2.
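The Greedy2 idea can be sketched as follows (illustration only), reusing sample_live_edge_graph and reachable from the sketch in Section IV-A and greedy_jmin_seed from the sketch above: the deterministic graphs are sampled once up front, the greedy framework is run on each of them, and the seed-set sizes are averaged.

import random

def greedy2(nodes, succ, weights, J, n_graphs=100, rng=random):
    # Greedy2: pre-sample deterministic graphs, solve J-MIN-Seed on each with
    # the greedy framework (influence = cheap graph traversal), average sizes.
    # n_graphs is an illustrative default, not a value taken from the paper.
    sizes = []
    for _ in range(n_graphs):
        live = sample_live_edge_graph(succ, weights, rng)
        sigma = lambda S: len(reachable(live, S))
        sizes.append(len(greedy_jmin_seed(nodes, sigma, J)))
    return sum(sizes) / len(sizes)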


VI. FULL-COVERAGE

In some applications, we are interested in influencing all nodes in the social network. For example, a government may want to promote campaigns such as an election or awareness of some infectious diseases. In these applications, J is set to |V|. We call this special instance of J-MIN-Seed Full-Coverage. In Section VI-A, we give some interesting observations and present an efficient algorithm on deterministic graphs under the IC model, while in Section VI-B, we develop a probabilistic algorithm which can provide an arbitrarily small error under the IC model.

A. Full-Coverage on Deterministic Graphs (IC Model)

According to Theorem 1, it is in general NP-hard to solve the J-MIN-Seed problem on a graph (either probabilistic or deterministic) under the IC model. However, on a deterministic graph under the IC model, Full-Coverage is not NP-hard but in fact easy to solve. In the following, we design an efficient algorithm to handle Full-Coverage on a deterministic graph G(V, E). Before illustrating our method, we first introduce the following two observations.

Observation 1: On a deterministic graph, if a node within a strongly connected component (SCC) is influenced, then it will influence all nodes in this SCC.

Observation 2: Any node with in-degree equal to 0 must be selected as a seed in order to be influenced, because it cannot be influenced by any other node.

Based on the above two observations, we design our method, called Decompose-and-Pick, as follows. In the first step, we decompose the deterministic graph into a number of strongly connected components (SCCs), namely scc_1, scc_2, ..., scc_m. This step can be achieved by adopting existing methods from the rich literature on finding all SCCs in a graph. In our implementation of this step, we adopt Kosaraju's SCC computation algorithm [21], which runs in O(|V| + |E|) time.

In the second step, we construct a new graph G′(V′, E′) based on G(V, E). Specifically, to construct V′, we create a new node v_i for each SCC scc_i obtained in Step 1. We construct E′ as follows. Initially, E′ is set to an empty set. For each (u, v) ∈ E, we find the SCC containing u (resp. v) in G(V, E), say scc_i (resp. scc_j), and the node v_i (resp. v_j) representing scc_i (resp. scc_j) in G′(V′, E′). We then check whether (v_i, v_j) ∈ E′, and if not, we insert (v_i, v_j) into E′. Clearly, the cost of constructing V′ is O(|V|), while the cost of generating E′ is O(|E| · C_check), where C_check denotes the cost of checking whether a specific edge (v_i, v_j) has already been inserted into E′. C_check depends on the structure used to store G′(V′, E′). Specifically, C_check is O(1) when G′(V′, E′) is stored as an adjacency matrix; with this data structure, the overall cost of Step 2 is O(|V| + |E|). In case G′(V′, E′) is maintained as an adjacency list, C_check becomes O(|E′|) (bounded by O(|E|)), resulting in a worst-case complexity of O(|V| + |E|^2) for Step 2. To reduce the complexity of Step 2 in this case, we do not check the existence of each newly formed edge every time we create one. Instead, we create all newly formed edges and then sort them to filter out redundant edges in E′, which yields a cost of O(|V| + |E| · log|E|) for Step 2. Note that the constructed graph G′(V′, E′) contains no (non-trivial) SCCs.

In the last step, we simply pick the nodes with in-degree equal to 0 in G′(V′, E′), and for each such node v_i, we insert into the seed set S a node picked at random from its corresponding scc_i in the original graph G(V, E). Since G′(V′, E′) contains no SCCs, it is possible to perform a topological sort on G′(V′, E′). Hence, the seed set consisting of all nodes with in-degree equal to 0 in G′(V′, E′) would influence all nodes in G′(V′, E′). Since each node in G′(V′, E′) corresponds to an SCC of G(V, E), according to Observations 1 and 2, we conclude that the seed set S constructed in the last step influences all |V| nodes of the deterministic graph G(V, E). Clearly, the cost of Step 3 is O(|V′| + |E′|) (a DFS/BFS), which is bounded by O(|V| + |E|).

In summary, the worst-case time complexity of Decompose-and-Pick is O(|V| + |E|) when the new graph is maintained as an adjacency matrix and O(|V| + |E| · log|E|) when it is maintained as an adjacency list.
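For illustration, a self-contained sketch of Decompose-and-Pick on a deterministic graph is given below (same assumed adjacency-dict representation as the earlier sketches). Kosaraju's two-pass SCC algorithm is written out inline, and one member of every SCC whose condensed node has in-degree 0 is picked as a seed.

def decompose_and_pick(succ, nodes):
    # Step 1 (Kosaraju, pass 1): record nodes in order of DFS finishing time.
    visited, order = set(), []
    for s in nodes:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(succ.get(s, [])))]
        while stack:
            u, it = stack[-1]
            pushed = False
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(succ.get(v, []))))
                    pushed = True
                    break
            if not pushed:
                order.append(u)
                stack.pop()
    # Step 1 (Kosaraju, pass 2): DFS on the reversed graph in reverse
    # finishing order; every tree found is one SCC.
    pred = {v: [] for v in nodes}
    for u, nbrs in succ.items():
        for v in nbrs:
            pred[v].append(u)
    comp, n_comp = {}, 0
    for s in reversed(order):
        if s in comp:
            continue
        comp[s] = n_comp
        stack = [s]
        while stack:
            u = stack.pop()
            for v in pred[u]:
                if v not in comp:
                    comp[v] = n_comp
                    stack.append(v)
        n_comp += 1
    # Step 2: an SCC has in-degree 0 in the condensation iff no edge enters
    # it from a different SCC.
    has_incoming = set()
    for u, nbrs in succ.items():
        for v in nbrs:
            if comp[u] != comp[v]:
                has_incoming.add(comp[v])
    # Step 3: pick one (arbitrary) member of every source SCC as a seed.
    seeds, taken = [], set()
    for v in nodes:
        if comp[v] not in has_incoming and comp[v] not in taken:
            taken.add(comp[v])
            seeds.append(v)
    return seeds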

B. Full-Coverage on Probabilistic Graphs (IC Model)

It is now straightforward to obtain a probabilistic algorithm for Full-Coverage based on social network sampling and Decompose-and-Pick, as follows. We first use social network sampling to generate a certain number of deterministic graphs. Then, on each such deterministic graph, we run Decompose-and-Pick to obtain a seed set which covers all the nodes in the social network, together with the size of that seed set. At the end, we average the sizes of the seed sets obtained over all samples (deterministic graphs) to approximate the solution of Full-Coverage on the general (probabilistic) social network. Again, using Hoeffding's Inequality, for a real number c between 0 and 1 we can provide a (1 ± ε)-approximation solution for any positive real number ε with confidence at least c by performing the sampling process at least (|V| − 1)^2 · ln(2/(1 − c)) / (2 · ε^2) times. The proof is similar to that of Lemma 1.
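Putting the pieces together, the probabilistic Full-Coverage estimate described above can be sketched as follows (reusing sample_live_edge_graph from Section IV-A and decompose_and_pick from the sketch above).

import random

def full_coverage_seed_size(succ, weights, nodes, n_samples=10000, rng=random):
    # Average, over sampled deterministic graphs, of the seed-set size that
    # Decompose-and-Pick needs in order to cover all nodes.
    total = 0
    for _ in range(n_samples):
        live = sample_live_edge_graph(succ, weights, rng)
        total += len(decompose_and_pick(live, nodes))
    return total / n_samples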

VII. EMPIRICAL STUDY

We set up our experiments in Section VII-A and give the corresponding experimental results in Section VII-B.

A. Experimental Setup

We conducted our experiments on a 2.26GHz machine with 4GB memory under a Linux platform. All algorithms were implemented in C/C++.

1) Datasets: We used four real datasets for our empirical study, namely HEP-T, Epinions, Amazon and DBLP. HEP-T is a collaboration network generated from the "High Energy Physics - Theory" section of the e-print arXiv (http://www.arXiv.org). In this collaboration network, each node represents one specific author and each edge indicates a co-author relationship between the two authors corresponding to the nodes incident to the edge. The second dataset, Epinions, is a who-trusts-whom network from Epinions.com, where each node represents a member of the site and a link from member u to member v means that u trusts v (i.e., v has a certain influence on u). The third real dataset, Amazon, is a product co-purchasing network extracted from Amazon.com, with nodes and edges representing products and co-purchasing relationships, respectively; we assume that product u has an influence on product v if v is often purchased together with u. Both Epinions and Amazon are maintained by Jure Leskovec. Our last real dataset, DBLP, is another collaboration network, from the computer science bibliography database maintained by Michael Ley. We summarize the statistics of these real datasets in Table I.

TABLE I
STATISTICS OF REAL DATASETS

Dataset        HEP-T    Epinions   Amazon     DBLP
No. of Nodes   15233    75888      262111     654628
No. of Edges   58891    508837     1234877    1990259

For efficiency, we ran our algorithms on samples of the aforementioned real datasets with a sampling ratio of one percent. The sampling process is done as follows. We randomly choose a node as the root and then perform a breadth-first traversal (BFT) from this root. If the BFT from one root cannot cover our targeted number of nodes, we continue to pick new roots at random and perform BFTs from them until we obtain the expected number of nodes. Next, we construct the edges by keeping the original edges between the traversed nodes.

2) Configurations: (1) Weight generation for the IC model: We use the QUADRIVALENCY model to generate the weights. Specifically, for each edge, we uniformly choose a value from the set {0.1, 0.25, 0.5, 0.75}, which represent minor, low, medium and high influence, respectively. (2) Weight generation for the LT model: For each node u, letting d_u denote its in-degree, we assign weight 1/d_u to each edge pointing to u. In this case, each node receives equal influence from each of its neighbors. (3) Number of samples: For each influence calculation under both the IC model and the LT model, we perform the graph sampling process 10000 times by default. (4) Parameter J: In the following, we denote the parameter J as a relative real number between 0 and 1 giving the fraction of influenced nodes among all nodes in the social network (instead of an absolute positive integer giving the total number of influenced nodes), because a relative measure is more meaningful than an absolute one in the experiments. We set J to 0.5 by default; the alternative configurations considered are {0.1, 0.25, 0.5, 0.75, 1}. A sketch of the two weight-generation schemes is given at the end of this subsection.

3) Algorithms: We compare our greedy algorithm with several other common heuristic algorithms. We list all the algorithms studied in our experiments as follows. (1) Greedy1: We denote the first implementation of Algorithm 1 by Greedy1. As stated before, it performs the graph sampling process only when calculating the influence. (2) Greedy2: Greedy2 corresponds to the alternative implementation of Algorithm 1. (3) Degree-heuristic: We implemented this baseline using the heuristic of nodes' out-degrees. Specifically, we repeatedly pick the uncovered node with the largest out-degree and add it to the seed set until the incurred influence exceeds the threshold. We denote this heuristic algorithm by Degree-heuristic. (4) Centrality-heuristic: Centrality-heuristic denotes another heuristic algorithm, based on the nodes' distance centrality. In sociology, distance centrality is a common measurement of a node's importance in a social network, based on the assumption that a node with short distances to other nodes probably has a higher chance of influencing them. In Centrality-heuristic, we select seeds in decreasing order of the nodes' distance centralities until the requirement of influencing at least J nodes is met. (5) Random: Finally, we consider the baseline of selecting seeds from the uncovered nodes at random, which we denote by Random. In the experiments, we do not compare our algorithms with the naïve adaptation of an existing algorithm for k-MAX-Influence described in Section I, because this naïve adaptation is time-consuming, as discussed in Section V.
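For concreteness, the two weight-generation schemes above can be sketched as follows (illustration only; the function names are ours, not the paper's).

import random

def ic_weights_quadrivalency(edges, rng=random):
    # IC model: each edge uniformly gets one of {0.1, 0.25, 0.5, 0.75}.
    return {(u, v): rng.choice([0.1, 0.25, 0.5, 0.75]) for (u, v) in edges}

def lt_weights_uniform(edges):
    # LT model: every edge into v gets weight 1/d_v, where d_v is v's in-degree.
    in_deg = {}
    for (_, v) in edges:
        in_deg[v] = in_deg.get(v, 0) + 1
    return {(u, v): 1.0 / in_deg[v] for (u, v) in edges}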

B. Experiment Results

For the sake of space, we show the results for the IC model only. The results for the LT model can be found in [20].

1) No. of Seeds: We measure the quality of an algorithm for J-MIN-Seed by the number of seeds it returns; clearly, the fewer seeds an algorithm returns, the better it is. We study the quality of the five aforementioned algorithms by comparing the numbers of seeds they return. Specifically, we vary the parameter J from 0.1 to 1. The experimental results are shown in Figure 3. Consider the results on HEP-T (Figure 3(a)) as an example. We find that algorithms Greedy1 and Greedy2 are comparable in terms of quality, and both of them outperform the other heuristic algorithms significantly. Similar results are observed on the other real datasets.


Fig. 3. Number of Seeds (IC Model): (a) HEP-T, (b) Epinions, (c) Amazon, (d) DBLP.

2) Running Time: We explore the efficiency of the different algorithms by comparing their running times. Again, we vary J, and for each setting of J, we record the corresponding running time of each algorithm. According to the results shown in Figure 4, Greedy1 is the slowest algorithm. The reason is that Greedy1 selects seeds by calculating the marginal gain of each non-seed in each iteration and then picking the one with the largest marginal gain, while the other heuristic algorithms simply choose the non-seed with the best heuristic value (e.g., out-degree or centrality). However, the alternative implementation of our greedy algorithm, Greedy2, shows its advantage in terms of efficiency. Greedy2 is faster than Greedy1 because the total cost of sampling in Greedy2 is much smaller than that in Greedy1. Besides, Random is slower than Greedy2, even though the cost of choosing a seed in Random is O(1). This is because Random usually has to select more seeds than Greedy2 in order to incur the same amount of influence, and in each iteration, Random also needs to calculate the influence incurred by the current seed set.

Fig. 4. Running Time (IC Model): (a) HEP-T, (b) Epinions, (c) Amazon, (d) DBLP.

3) Error Analysis: To verify the error bounds derived in this paper, we also conducted experiments which compare the number of seeds returned by our algorithms with the optimal one on small datasets (0.5% of the HEP-T dataset). We performed a brute-force search to obtain the optimal solution. According to the results in Figure 5(a), the additive errors incurred by our algorithms are generally much smaller than the theoretical error bounds on the real dataset. In Figure 5(b), we find that the multiplicative error of our greedy algorithm grows slowly as J increases. Besides, we observe that k2 is the smallest among k1, k2 and k3 in most cases of our experiments. That is, the multiplicative bound becomes (1 + k2), i.e., 1 + ln(σ′(S_1) / (σ′(S_h) − σ′(S_{h-1}))), in these cases. Based on this, we can explain the phenomenon in Figure 5(b) that the theoretical multiplicative error bound does not change much when we increase J from 0.75 to 1.

Fig. 5. Error Analysis (IC Model): (a) Additive Error, (b) Multiplicative Error.

4) Full Coverage Experiments: We conducted experiments for Full-Coverage; the corresponding results can be found in [20].

Conclusion: Greedy1 and Greedy2 both give the smallest seed sets compared with the other algorithms, Degree-heuristic, Centrality-heuristic and Random. In addition, the difference between the size of a seed set returned by Greedy1 or Greedy2 and the minimum (optimal) seed set size is significantly smaller than the theoretical bound. Besides, Greedy2 runs faster than Greedy1.

VIII. CONCLUSION

In this paper, we propose a new viral marketing problem called J-MIN-Seed, which has extensive applications in the real world. We prove that J-MIN-Seed is NP-hard under two popular diffusion models (the IC model and the LT model). To solve J-MIN-Seed effectively, we develop a greedy algorithm which provides approximation guarantees. Besides, for the special setting where J is equal to the number of all users in the social network (i.e., Full-Coverage), we design other efficient algorithms. Finally, we conducted extensive experiments on real datasets, which verified the effectiveness and efficiency of our greedy algorithm. For future work, we plan to study the properties of our new problem under diffusion models other than the IC model and the LT model. Finding other solutions for Full-Coverage under the LT model is another interesting direction.

Acknowledgements: The research is supported by HKRGC GRF 621309 and Direct Allocation Grant DAG11EG05G.

REFERENCES

[1] J. Bryant and D. Miron, "Theory and research in mass communication," Journal of Communication, vol. 54, no. 4, pp. 662-704, 2004.
[2] J. Nail, "The consumer advertising backlash," Forrester Research, 2004.
[3] I. R. Misner, The World's Best Known Marketing Secret: Building Your Business with Word-of-Mouth Marketing. Bard Press, 2nd edition, 1999.
[4] A. Johnson, "Nike tops list of most viral brands on Facebook, Twitter," 2010. [Online]. Available: http://www.kikabink.com/news/
[5] M. Granovetter, "Threshold models of collective behavior," The American Journal of Sociology, vol. 83, no. 6, pp. 1420-1443, 1978.
[6] T. C. Schelling, Micromotives and Macrobehavior. W. W. Norton and Company, 2006.
[7] J. Goldenberg, B. Libai, and E. Muller, "Talk of the network: A complex systems look at the underlying process of word-of-mouth," Marketing Letters, vol. 12, no. 3, pp. 211-223, 2001.
[8] J. Goldenberg, B. Libai, and E. Muller, "Using complex systems analysis to advance marketing theory development: Modeling heterogeneity effects on new product growth through stochastic cellular automata," Academy of Marketing Science Review, vol. 9, no. 3, pp. 1-18, 2001.
[9] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins, "Information diffusion through blogspace," in WWW, 2004.
[10] H. Ma, H. Yang, M. R. Lyu, and I. King, "Mining social networks using heat diffusion processes for marketing candidates selection," in CIKM, 2008.
[11] P. Domingos and M. Richardson, "Mining the network value of customers," in KDD, 2001.
[12] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in SIGKDD, 2003.
[13] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance, "Cost-effective outbreak detection in networks," in SIGKDD, 2007.
[14] M. Kimura and K. Saito, "Tractable models for information diffusion in social networks," in PKDD, 2006.
[15] W. Chen, C. Wang, and Y. Wang, "Scalable influence maximization for prevalent viral marketing in large-scale social networks," in SIGKDD, 2010.
[16] S. Datta, A. Majumder, and N. Shrivastava, "Viral marketing for multiple products," in ICDM, 2010.
[17] M. Richardson and P. Domingos, "Mining knowledge-sharing sites for viral marketing," in SIGKDD, 2002.
[18] S. Bharathi, D. Kempe, and M. Salek, "Competitive influence maximization in social networks," in Internet and Network Economics, pp. 306-311, 2007.
[19] T. Carnes, C. Nagarajan, S. M. Wild, and A. van Zuylen, "Maximizing influence in a competitive social network: a follower's perspective," in Proceedings of the Ninth International Conference on Electronic Commerce. ACM, 2007, pp. 351-360.
[20] C. Long and R. C.-W. Wong, "Minimizing seed set for viral marketing," 2011. [Online]. Available: http://www.cse.ust.hk/~raywong/paper/JMIN-Seed-technical.pdf
[21] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. The MIT Press, 2009.
[22] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions-I," Mathematical Programming, vol. 14, no. 1, pp. 265-294, 1978.
[23] L. A. Wolsey, "An analysis of the greedy algorithm for the submodular set covering problem," Combinatorica, vol. 2, no. 4, pp. 385-393, 1981.



๐‘— if the seed set ๐‘† contains ๐‘— seeds at the end of an iteration. The seed set ๐‘† at stage ๐‘— is denoted by ๐‘†๐‘— . Consequently, according to Lemma 4, at each stage ๐‘—, we conclude that

[20] C. Long and R. C.-W. Wong, โ€œMinimizing seed set for viral marketing,โ€ 2011. [Online]. Available: http://www.cse.ust.hk/โˆผraywong/paper/JMIN-Seed-technical.pdf [21] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms. The MIT press, 2009. [22] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, โ€œAn analysis of approximations for maximizing submodular set functions-I,โ€ Mathematical Programming, vol. 14, no. 1, pp. 265โ€“294, 1978. [23] L. A. Wolsey, โ€œAn analysis of the greedy algorithm for the submodular set covering problem,โ€ COMBINATORIA, vol. 2, no. 4, pp. 385โ€“393, 1981.

๐œŽ(๐‘†๐‘— ) โ‰ฅ (1 โˆ’ 1/๐‘’) โ‹… ๐œŽ(๐‘†๐‘—โˆ— )

(1)

where ๐‘†๐‘—โˆ— is the set that provides the maximum value of ๐œŽ(โ‹…) over all possible seed sets of size ๐‘—. Note that the total number of stages for the greedy process is equal to โ„Ž (i.e., the size of the seed set returned by the algorithm). That is, the greedy process stops at stage โ„Ž. Thus, we know that ๐œŽ(๐‘†โ„Ž ) โ‰ฅ ๐ฝ and the greedy solution for ๐ฝ-MINSeed is ๐‘†โ„Ž . Consider the last two stages, namely stage โ„Ž โˆ’ 1 and stage โ„Ž. We know that ๐œŽ(๐‘†โ„Žโˆ’1 ) < ๐ฝ and ๐œŽ(๐‘†โ„Ž ) โ‰ฅ ๐ฝ. Since ๐œŽ(๐‘†โ„Žโˆ— ) โ‰ฅ ๐œŽ(๐‘†โ„Ž ), we have ๐œŽ(๐‘†โ„Žโˆ— ) โ‰ฅ ๐ฝ. Now, we want to explore the relationship between โ„Ž and ๐‘ก. Note that the following inequality holds.

A PPENDIX : P ROOF OF L EMMAS /T HEOREMS Proof of Lemma 2. Firstly, we give the theoretical bound on the in๏ฌ‚uence for ๐‘˜-MAX-In๏ฌ‚uence. The problem of determining the ๐‘˜-element set ๐‘† โŠ‚ ๐‘‰ that maximizes the value of ๐œŽ(โ‹…) is NP-hard. Fortunately, according to [22], a simple greedy algorithm can solve this maximization problem with the approximation factor of (1 โˆ’ 1/๐‘’) by initializing an empty set ๐‘† and iteratively adding the node such that the marginal gain of inserting this node into the current set ๐‘† is the greatest one until ๐‘˜ nodes have been added. We present this interesting tractability property of maximizing a submodular function in Lemma 4 as follows. Lemma 4 ([22]): For a non-negative, monotone submodular function ๐‘“ , we obtain a set ๐‘† of size ๐‘˜ by initializing set ๐‘† to be an empty set and then iteratively adding the node ๐‘ข one at a time such that the marginal gain of inserting ๐‘ข into the current set ๐‘† is the greatest. Assume that ๐‘† โˆ— is the set with ๐‘˜ elements that maximizes function ๐‘“ , i.e., the optimal ๐‘˜-element set. Then, ๐‘“ (๐‘†) โ‰ฅ (1 โˆ’ 1/๐‘’) โ‹… ๐‘“ (๐‘† โˆ— ), where ๐‘’ is the natural logarithmic base. Secondly, we derive the additive error bound on the seed set size for ๐ฝ-MIN-Seed based on the aforementioned bound. As discussed in Section III, ๐œŽ(โ‹…) is submodular. Clearly, ๐œŽ(โ‹…) is also non-negative and monotone. The framework in Algorithm 1 involves a number of iterations (lines 2-4) where the size of the seed set ๐‘† is incremented by one for each iteration. We say that the framework in Algorithm 1 is at stage

๐‘กโ‰คโ„Ž

(2)

Consider two stages, stage i and stage i + 1, such that σ(S_i) < (1 − 1/e) · J while σ(S_{i+1}) ≥ (1 − 1/e) · J. According to Inequality (1), we know that σ(S*_i) < J. (This is because if σ(S*_i) ≥ J, then Inequality (1) would give σ(S_i) ≥ (1 − 1/e) · J, which contradicts σ(S_i) < (1 − 1/e) · J.) By the monotonicity property of σ(·), no seed set of size at most i can influence at least J users, and therefore

t > i.  (3)

According to Inequality (2) and Inequality (3), we obtain t ∈ [i + 1, h]. That is, the additive error of our greedy algorithm (i.e., h − t) is bounded by the number of stages between stage i + 1 and stage h. Since σ(S_{i+1}) ≥ (1 − 1/e) · J and σ(S_{h−1}) < J, the difference in influence between stage i + 1 and stage h − 1 is bounded by J − (1 − 1/e) · J = (1/e) · J. Since each stage increases the influence by at least 1 (the newly added seed is itself influenced), the number of stages between stage i + 1 and stage h − 1 is at most (1/e) · J. Consequently, the number of stages between stage i + 1 and stage h is at most (1/e) · J + 1. As a result, h − t ≤ (1/e) · J + 1.
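The additive bound above is stated for the greedy framework of Algorithm 1, which grows the seed set by one node per iteration until the influence requirement J is met. A minimal sketch of that framework is given below, assuming a callable sigma that estimates the expected influence of a seed set (e.g., by Monte Carlo simulation under the IC model); the function and parameter names are illustrative, not the paper's implementation.

```python
def greedy_j_min_seed(nodes, sigma, J):
    """Greedy framework analyzed in Lemma 2: repeatedly add the node with the
    largest marginal influence gain until the (estimated) influence reaches J."""
    S = frozenset()
    influence = sigma(S)
    while influence < J:
        best_node, best_value = None, influence
        for v in nodes:
            if v in S:
                continue
            value = sigma(S | {v})
            if value > best_value:
                best_node, best_value = v, value
        if best_node is None:   # no node increases the influence; J is unreachable
            break
        S, influence = S | {best_node}, best_value
    return S
```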

Proof of Lemma 3. This proof involves four parts. In the first part, we construct a new problem P′ based on the submodular function σ′(·) (instead of σ(·)). In the second part, we show the multiplicative error bound of the greedy algorithm in Algorithm 1 (using σ′(·) instead of σ(·)) for this new problem P′; we denote this adapted greedy algorithm by A′ and, for simplicity, denote the original greedy algorithm in Algorithm 1 using σ(·) by A. In the third part, we show that this new problem is equivalent to the J-MIN-Seed problem. In the fourth part, we show that the multiplicative error bound deduced in the second part can be used as the multiplicative error bound of algorithm A for J-MIN-Seed.

Firstly, we construct the new problem P′ as follows. Note that σ′(S) = min{σ(S), J}. Problem P′ is formalized as

arg min{ |S| : σ′(S) = σ′(V), S ⊆ V }.  (4)
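The truncated influence function σ′ is the only ingredient that distinguishes A′ from A; a short sketch of the truncation is shown below (identifiers are illustrative assumptions). Since σ(·) is non-negative, monotone and submodular, truncating it at the constant J preserves these properties, which is what allows Lemma 5 to be applied in the second part.

```python
def make_sigma_prime(sigma, J):
    """Truncated objective used in problem P': sigma'(S) = min(sigma(S), J)."""
    def sigma_prime(S):
        return min(sigma(S), J)
    return sigma_prime
```

Running the greedy framework sketched earlier with sigma_prime in place of sigma corresponds to the adapted algorithm A′ used in this proof.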

๐‘ฅโˆˆ๐‘†

In the following, we show that the set of all possible solutions for the problem in form of (6) (i.e., the ๐ฝ-MIN-Seed problem) is equivalent to the set of all possible solutions for the problem in form of (5) (i.e., problem ๐‘ƒ โ€ฒ ). Note that the objective functions in both problems are equal. The remaining issue is to show that the constraints for one problem are the same as those for the other problem. Suppose that ๐‘† is a solution for the problem in form of (6). We know that ๐œŽ(๐‘†) โ‰ฅ ๐ฝ and ๐‘† โŠ† ๐‘‰ . We derive that ๐œŽ โ€ฒ (๐‘†) = ๐ฝ. Since ๐œŽ โ€ฒ (๐‘‰ ) = ๐ฝ, we have ๐œŽ โ€ฒ (๐‘†) = ๐œŽ โ€ฒ (๐‘‰ ) and ๐‘† โŠ† ๐‘‰ (which are the constraints for the problem in form of (5)). Suppose that ๐‘† is a solution for the problem in form of (5). We know that ๐œŽ โ€ฒ (๐‘†) = ๐œŽ โ€ฒ (๐‘‰ ) and ๐‘† โŠ† ๐‘‰ . Since ๐œŽ โ€ฒ (๐‘‰ ) = ๐ฝ, we have ๐œŽ โ€ฒ (๐‘†) = ๐ฝ. Considering ๐œŽ โ€ฒ (๐‘†) = min{๐œŽ(๐‘†), ๐ฝ}, we derive that ๐œŽ(๐‘†) โ‰ฅ ๐ฝ. So, we have ๐œŽ(๐‘†) โ‰ฅ ๐ฝ and ๐‘† โŠ† ๐‘‰ (which are the constraints for the problem in form of (6)). Fourthly, we show that the size of the solution (i.e., โˆฃ๐‘†โˆฃ) returned by algorithm ๐ดโ€ฒ for the new problem ๐‘ƒ โ€ฒ is equal to that returned by algorithm ๐ด for ๐ฝ-MIN-Seed. Since ๐œŽ(๐‘†๐‘– ) < ๐ฝ for 1 โ‰ค ๐‘– โ‰ค โ„Ž โˆ’ 1, we know that ๐œŽ โ€ฒ (๐‘†๐‘– ) = ๐œŽ(๐‘†๐‘– ) for 1 โ‰ค ๐‘– โ‰ค โ„Ž โˆ’ 1. We also know that the element ๐‘ฅ in ๐‘‰ โˆ’ ๐‘†๐‘–โˆ’1 that maximizes ๐œŽ(๐‘†๐‘–โˆ’1 โˆช {๐‘ฅ}) โˆ’ ๐œŽ(๐‘†๐‘–โˆ’1 ) (which is chosen at iteration ๐‘– by algorithm ๐ด) would also be the element that maximizes ๐œŽ โ€ฒ (๐‘†๐‘–โˆ’1 โˆช {๐‘ฅ}) โˆ’ ๐œŽ โ€ฒ (๐‘†๐‘–โˆ’1 ) (which is chosen at iteration ๐‘– by algorithm ๐ดโ€ฒ ) for ๐‘– = 1, 2, ..., โ„Ž โˆ’ 1. That is, algorithm ๐ดโ€ฒ would proceed in the same way as algorithm ๐ด at iteration ๐‘– = 1, 2, ..., โ„Žโˆ’1. Consider iteration โ„Ž of algorithm ๐ด. We denote the element selected by algorithm ๐ด by ๐‘ฅโ„Ž . Then, we know ๐œŽ(๐‘†โ„Žโˆ’1 โˆช {๐‘ฅโ„Ž }) โ‰ฅ ๐ฝ since algorithm ๐ด stops at iteration โ„Ž. Consider iteration โ„Ž of algorithm ๐ดโ€ฒ . This iteration is also the last iteration of ๐ดโ€ฒ . This is because there exists an element ๐‘ฅ in ๐‘‰ โˆ’ ๐‘†โ„Žโˆ’1 such that ๐œŽ โ€ฒ (๐‘†โ„Žโˆ’1 โˆช {๐‘ฅ}) = ๐œŽ โ€ฒ (๐‘‰ )(= ๐ฝ) (since ๐‘ฅ can be equal to ๐‘ฅโ„Ž where ๐œŽ โ€ฒ (๐‘†โ„Žโˆ’1 โˆช {๐‘ฅโ„Ž }) = ๐ฝ). Note that this element ๐‘ฅ maximizes ๐œŽ โ€ฒ (๐‘†โ„Žโˆ’1 โˆช{๐‘ฅ})โˆ’๐œŽ โ€ฒ (๐‘†โ„Žโˆ’1 ) and thus is selected by ๐ดโ€ฒ . We conclude that both algorithms ๐ด and ๐ดโ€ฒ terminates at iteration โ„Ž. Since the number of iterations for an algorithm (๐ด or ๐ดโ€ฒ ) corresponds to the size of the solution returned by the algorithm, we deduce that the size of the solution returned by algorithm ๐ดโ€ฒ is equal to that returned by algorithm ๐ด. In view of the above discussion, we know that problem ๐‘ƒ โ€ฒ is equivalent to ๐ฝ-MIN-Seed and algorithm ๐ดโ€ฒ for problem ๐‘ƒ โ€ฒ would proceed in the same way as algorithm ๐ด for ๐ฝ-MINSeed. As a result, the multiplicative bound of algorithm ๐ดโ€ฒ for problem ๐‘ƒ โ€ฒ in the second part also applies to algorithm ๐ด (i.e., the greedy algorithm in Algorithm 1) for ๐ฝ-MIN-Seed.

(4)

Secondly, we show the multiplicative error bound of algorithm ๐ดโ€ฒ for problem ๐‘ƒ โ€ฒ by using the following Lemma 5 [23]. โˆ‘ Lemma 5 ([23]): Given problem arg min{ ๐‘ฅโˆˆ๐‘† ๐‘”(๐‘ฅ) : ๐‘“ (๐‘†) = ๐‘“ (๐‘ˆ ), ๐‘† โŠ† ๐‘ˆ } where ๐‘“ is a nondecreasing and submodular function de๏ฌned on subsets of a ๏ฌnite set ๐‘ˆ , and ๐‘” is a function de๏ฌned on ๐‘ˆ . Consider the greedy algorithm that selects ๐‘ฅ in ๐‘ˆ โˆ’ ๐‘† such that (๐‘“ (๐‘† โˆช {๐‘ฅ}) โˆ’ ๐‘“ (๐‘†))/๐‘”(๐‘ฅ) is the greatest and adds it into ๐‘† at each iteration. The process stops when ๐‘“ (๐‘†) = ๐‘“ (๐‘ˆ ). Assume that the greedy algorithm terminates after โ„Ž iterations and let ๐‘†๐‘– denote the seed set at iteration ๐‘– (๐‘†0 = โˆ…). The greedy algorithm provides a (1 + min{๐‘˜1 , ๐‘˜2 , ๐‘˜3 })-approximation of the above problem, ๐‘“ (๐‘ˆ )โˆ’๐‘“ (โˆ…) ๐‘“ (๐‘†1 )โˆ’๐‘“ (โˆ…) where ๐‘˜1 = ln ๐‘“ (๐‘ˆ )โˆ’๐‘“ (๐‘†โ„Žโˆ’1 ) , ๐‘˜2 = ln ๐‘“ (๐‘†โ„Ž )โˆ’๐‘“ (๐‘†โ„Žโˆ’1 ) , and

(โˆ…) ๐‘˜3 = ln(max{ ๐‘“ (๐‘†๐‘“๐‘–({๐‘ฅ})โˆ’๐‘“ โˆช{๐‘ฅ})โˆ’๐‘“ (๐‘†๐‘– ) โˆฃ๐‘ฅ โˆˆ ๐‘ˆ, 0 โ‰ค ๐‘– โ‰ค โ„Ž, ๐‘“ (๐‘†๐‘– โˆช {๐‘ฅ}) โˆ’ ๐‘“ (๐‘†๐‘– ) > 0}). We apply the above lemma for problem ๐‘ƒ โ€ฒ as follows. It is easy to verify that ๐œŽ โ€ฒ (โ‹…) is a non-decreasing and submodular function de๏ฌned on subsets of a ๏ฌnite set ๐‘‰ . We set ๐‘ˆ to be ๐‘‰ and set ๐‘“ (โ‹…) to be ๐œŽ โ€ฒ (โ‹…). We โˆ‘also de๏ฌne ๐‘”(๐‘ฅ) to be 1 for each ๐‘ฅ โˆˆ ๐‘‰ (or ๐‘ˆ ). Note that ๐‘ฅโˆˆ๐‘† ๐‘”(๐‘ฅ) = โˆฃ๐‘†โˆฃ. We re-write Problem ๐‘ƒ โ€ฒ (4) as follows. โˆ‘ arg min{ ๐‘”(๐‘ฅ) : ๐œŽ โ€ฒ (๐‘†) = ๐œŽ โ€ฒ (๐‘‰ ), ๐‘† โŠ† ๐‘‰ }. (5) ๐‘ฅโˆˆ๐‘†

The above form of problem ๐‘ƒ โ€ฒ is exactly the form of the problem described in Lemma 5. Suppose that we adopt the greedy algorithm in Algorithm 1 for problem ๐‘ƒ โ€ฒ by using ๐œŽ โ€ฒ (โ‹…) instead of ๐œŽ(โ‹…), i.e., algorithm ๐ดโ€ฒ . It is easy to verify that algorithm ๐ดโ€ฒ follows the steps of the greedy algorithm described in Lemma 5 (i.e., selecting the node ๐‘ฅ such that (๐œŽ โ€ฒ (๐‘† โˆช {๐‘ฅ}) โˆ’ ๐œŽ โ€ฒ (๐‘†))/๐‘”(๐‘ฅ) is the greatest where ๐‘”(๐‘ฅ) is exactly equal to 1). By Lemma 5, the greedy algorithm ๐ดโ€ฒ for problem ๐‘ƒ โ€ฒ gives (1 + min{๐‘˜1 , ๐‘˜2 , ๐‘˜3 })-approximation of โ€ฒ )โˆ’๐œŽ โ€ฒ (โˆ…) ๐ฝ problem ๐‘ƒ โ€ฒ , where ๐‘˜1 = ln ๐œŽโ€ฒ๐œŽ(๐‘‰(๐‘‰ )โˆ’๐œŽ โ€ฒ (๐‘†โ„Žโˆ’1 ) = ln ๐ฝโˆ’๐œŽ โ€ฒ (๐‘†โ„Žโˆ’1 ) , โ€ฒ
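To illustrate how this data-dependent factor could be evaluated after a run of A′, the sketch below computes 1 + min{k_1, k_2, k_3} from a recorded trace of σ′ values; the function name, its arguments, and the toy numbers in the usage comment are assumptions of the example and do not come from the paper.

```python
import math

def lemma5_bound(trace, J, max_gain_ratio):
    """Evaluate the (1 + min{k1, k2, k3}) factor of Lemma 5 for problem P',
    where f = sigma', f(empty set) = 0 and f(V) = J.

    trace:          [sigma'(S_0), sigma'(S_1), ..., sigma'(S_h)] recorded while
                    running the adapted greedy algorithm A'
    J:              the influence requirement, so sigma'(V) = J
    max_gain_ratio: max over x and i of sigma'({x}) / (sigma'(S_i | {x}) - sigma'(S_i)),
                    taken over pairs with a strictly positive denominator
    """
    h = len(trace) - 1
    k1 = math.log(J / (J - trace[h - 1]))
    k2 = math.log(trace[1] / (trace[h] - trace[h - 1]))
    k3 = math.log(max_gain_ratio)
    return 1.0 + min(k1, k2, k3)

# Hypothetical usage: a run whose sigma' trace is 0, 4, 7, 9 with J = 9.
# print(lemma5_bound([0, 4, 7, 9], 9, max_gain_ratio=2.0))   # about 1.69
```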

Thirdly, we show that problem P′ is equivalent to the J-MIN-Seed problem, which can be formalized as follows (since Σ_{x∈S} g(x) = |S|):

arg min{ Σ_{x∈S} g(x) : σ(S) ≥ J, S ⊆ V }.  (6)

In the following, we show that the set of all feasible solutions of the problem in form of (6) (i.e., the J-MIN-Seed problem) is equal to the set of all feasible solutions of the problem in form of (5) (i.e., problem P′). Note that the objective functions of the two problems are identical, so the remaining issue is to show that the constraints of one problem are the same as those of the other. Suppose that S is a feasible solution of the problem in form of (6), i.e., σ(S) ≥ J and S ⊆ V. Then σ′(S) = min{σ(S), J} = J. Since σ′(V) = J, we have σ′(S) = σ′(V) and S ⊆ V, which are exactly the constraints of the problem in form of (5). Conversely, suppose that S is a feasible solution of the problem in form of (5), i.e., σ′(S) = σ′(V) and S ⊆ V. Since σ′(V) = J, we have σ′(S) = J, and since σ′(S) = min{σ(S), J}, we derive σ(S) ≥ J. Hence σ(S) ≥ J and S ⊆ V, which are exactly the constraints of the problem in form of (6).

Fourthly, we show that the size of the solution (i.e., |S|) returned by algorithm A′ for the new problem P′ is equal to that returned by algorithm A for J-MIN-Seed. Since σ(S_i) < J for 1 ≤ i ≤ h − 1, we know that σ′(S_i) = σ(S_i) for 1 ≤ i ≤ h − 1. Consequently, the element x in V − S_{i−1} that maximizes σ(S_{i−1} ∪ {x}) − σ(S_{i−1}) (which is chosen at iteration i by algorithm A) also maximizes σ′(S_{i−1} ∪ {x}) − σ′(S_{i−1}) (and is therefore chosen at iteration i by algorithm A′), for i = 1, 2, ..., h − 1. That is, algorithm A′ proceeds in the same way as algorithm A at iterations i = 1, 2, ..., h − 1. Consider iteration h of algorithm A and denote the element it selects by x_h. We know that σ(S_{h−1} ∪ {x_h}) ≥ J since algorithm A stops at iteration h. Now consider iteration h of algorithm A′. This iteration is also the last iteration of A′, because there exists an element x in V − S_{h−1} such that σ′(S_{h−1} ∪ {x}) = σ′(V) (= J); in particular, x can be taken to be x_h, for which σ′(S_{h−1} ∪ {x_h}) = J. Such an element x maximizes σ′(S_{h−1} ∪ {x}) − σ′(S_{h−1}) and is thus selected by A′. We conclude that both algorithms A and A′ terminate at iteration h. Since the number of iterations of an algorithm (A or A′) equals the size of the solution it returns, the size of the solution returned by algorithm A′ is equal to that returned by algorithm A.

In view of the above discussion, problem P′ is equivalent to J-MIN-Seed, and algorithm A′ for problem P′ proceeds in the same way as algorithm A for J-MIN-Seed. As a result, the multiplicative bound of algorithm A′ for problem P′ derived in the second part also applies to algorithm A (i.e., the greedy algorithm in Algorithm 1) for J-MIN-Seed.

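As an illustrative sanity check of the fourth part (not code from the paper), the self-contained snippet below runs the same greedy rule with σ and with its truncation σ′ on a tiny deterministic stand-in for the influence function, and confirms that the two runs return seed sets of the same size; the toy influence function, the node set, and all identifiers are assumptions of the example.

```python
def toy_influence(S):
    """Tiny deterministic stand-in for the expected influence sigma(S):
    every seed influences itself plus the fixed followers listed below."""
    followers = {1: {2, 3}, 2: {3}, 3: set(), 4: {5}, 5: set()}
    reached = set(S)
    for v in S:
        reached |= followers[v]
    return len(reached)

def greedy(nodes, f, J):
    """Add the node maximizing f(S | {x}) until f(S) >= J (assumes J is reachable)."""
    S = set()
    while f(S) < J:
        S.add(max((x for x in nodes if x not in S), key=lambda x: f(S | {x})))
    return S

J, nodes = 4, [1, 2, 3, 4, 5]
run_A  = greedy(nodes, toy_influence, J)                        # algorithm A
run_A2 = greedy(nodes, lambda S: min(toy_influence(S), J), J)   # algorithm A'
assert len(run_A) == len(run_A2)    # same solution size, as the fourth part argues
```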