Finding Global Icebergs over Distributed Data Sets

Qi Zhao, Georgia Tech
Mitsunori Ogihara, Univ. Rochester
Jun (Jim) Xu, Georgia Tech

PODS 2006, Chicago

Haixun Wang IBM

The Talk in One Slide
• An iceberg: an item whose frequency count is greater than a certain threshold.
• A number of algorithms have been proposed to find icebergs at a single node (i.e., local icebergs).
• In many real-life applications, data sets are physically distributed over a large number of nodes. It is often useful to find the icebergs over the aggregate data across all the nodes (i.e., global icebergs).
• Global iceberg ≠ local iceberg
• We study the problem of finding global icebergs over distributed nodes and propose two novel solutions.


Outline
• Motivations and Problem Statement
• The Sampling-based Scheme
• The Counting-sketch-based Scheme
• Evaluation Results
• Conclusions and Future Work


Motivations: Some Example Applications
• Detection of distributed DoS attacks in a large-scale network
  – The IP address of the victim appears at many ingress points. It may not be a local iceberg at any single ingress point, since the attacking packets may come from a large number of hosts and Internet paths.
• Finding globally frequently accessed objects/URLs in CDNs (e.g., Akamai) to keep tabs on current “hot spots”
• Detection of system events that happen frequently across the network during a time interval
  – These events often indicate anomalies. For example, finding DLLs that have been modified on a large number of hosts may help detect the spread of unknown worms or spyware.


Problem Statement
• A system or network that consists of $N$ distributed nodes
• The data set $S_i$ at node $i$ contains a set of $\langle x, c_{x,i} \rangle$ pairs.
  – Assume each node has enough capacity to process its incoming data stream. Hence each node generates a list of the arriving items and their exact frequency counts.
• A flat communication infrastructure, in which each node only needs to communicate with a central server.
• Objective: find $\{x \mid \sum_{i=1}^{N} c_{x,i} \ge T\}$, where $c_{x,i}$ is the frequency count of the item $x$ in the set $S_i$, with the minimal communication cost.
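As a point of reference for the objective above, the following minimal sketch computes global icebergs exactly by summing every node's counts at one place. This is the ship-everything baseline whose communication cost the talk aims to avoid; it is shown only to make the definition concrete. The function names, the sample addresses, and the threshold value are illustrative assumptions, not taken from the talk.

```python
from collections import Counter

def global_icebergs(node_counts, threshold):
    """Items whose count summed over all nodes is >= threshold."""
    total = Counter()
    for counts in node_counts:
        total.update(counts)  # adds the per-node counts item by item
    return {x for x, c in total.items() if c >= threshold}

def local_icebergs(node_counts, threshold):
    """Items whose count at some single node is already >= threshold."""
    return {x for counts in node_counts
            for x, c in counts.items() if c >= threshold}

# Hypothetical data: three nodes, items keyed by destination IP address.
nodes = [
    {"10.0.0.1": 40, "10.0.0.2": 5},
    {"10.0.0.1": 35, "10.0.0.3": 120},
    {"10.0.0.1": 30},
]
T = 100

print(sorted(global_icebergs(nodes, T)))
print(sorted(local_icebergs(nodes, T)))
```

In this example `10.0.0.1` reaches a total of 105 without exceeding the threshold at any single node, which illustrates why a global iceberg need not be a local iceberg.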


The Existing Approaches
• Most previous work focuses on finding local icebergs at a single node.
• Ship all the raw data to the central server, which then runs a single-node algorithm: prohibitive communication cost.
• Linearly combine the data sketches (e.g., CM sketch) from the distributed nodes and then apply the corresponding single-node algorithm: not accurate.
• Arrange the nodes in a tree structure in which each non-root node sends its synopsis to its parent [Manjhi et al. ICDE’05]: uses a different iceberg definition (i.e., > s% of the total count) and assumes that a globally frequent item is also frequent locally somewhere.



The Sampling-based Scheme
• Each node $i$ (Node 1, Node 2, ..., Node N) samples its pair $\langle x, c_i \rangle$ with prob. $g(c_i)$ and sends the sampled pairs to the server.
• Server: if $x$ has sampled pairs $\langle x, s_1 \rangle, \langle x, s_2 \rangle, \ldots, \langle x, s_k \rangle$, the total count $F_x$ is estimated as $s_1/g(s_1) + s_2/g(s_2) + \cdots + s_k/g(s_k)$.
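The node-side sampling and the server-side estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sampling function uses the form g(c) = c/(d+c) that the talk states for its sampling-rate function, while the function names, the value of `d`, the seeded generator, and the sample data are assumptions made for the example.

```python
import random

def g(c, d):
    """Sampling probability for a pair with count c; d > 0 is a tunable constant."""
    return c / (c + d)

def sample_node(counts, d, rng):
    """Node side: keep each (item, count) pair independently with probability g(count)."""
    return [(x, c) for x, c in counts.items() if rng.random() < g(c, d)]

def estimate_totals(sampled_lists, d):
    """Server side: for each item, sum s / g(s) over its sampled pairs."""
    estimates = {}
    for pairs in sampled_lists:
        for x, s in pairs:
            estimates[x] = estimates.get(x, 0.0) + s / g(s, d)
    return estimates

# Hypothetical data and parameter, averaged over repeated runs to show
# that the estimate tracks the true total of 105 for item "a".
nodes = [{"a": 40, "b": 5}, {"a": 35, "c": 120}, {"a": 30}]
d = 20.0
rng = random.Random(0)
trials = 2000
mean_a = sum(
    estimate_totals([sample_node(n, d, rng) for n in nodes], d).get("a", 0.0)
    for _ in range(trials)
) / trials
print("average estimate for 'a':", mean_a)
```

Dividing each sampled count by its own sampling probability is what keeps the estimate unbiased: a pair that survives with probability g(s) contributes s/g(s), so its expected contribution is s.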

Configuring the Sampling Rate Function g(c) to Minimize the Worst-case MSE
• Should be size-dependent: a pair with a larger frequency count has greater potential to push the total count over $T$.
• A split pattern $\{c_{x,1}, c_{x,2}, \cdots, c_{x,N}\}$: $F_x = \sum_{i=1}^{N} c_{x,i}$
• The adversary could manipulate the split patterns to enlarge the MSE, i.e., $\sum_{i=1}^{N} \frac{c_{x,i}^2 \, (1 - g(c_{x,i}))}{g(c_{x,i})}$.
• Our idea is to make the MSE independent of the split patterns.
• The unique solution: $g(c) = \frac{c}{d+c}$ for the pair $\langle x, c \rangle$, where $d$ is a positive constant that can be tuned to satisfy the given communication cost requirement.



The Solution for 0-1 Case
• 0-1 case: $c_{x,i} = 1$, $i = 1, 2, \cdots, N$. Now the total frequency count of an item is equal to the number of nodes at which it appears.
• Each node samples its items with prob. $p$, encodes the sampled items into a Bloom filter (BF), and then ships the BF to the central server.
  – Given the communication cost per item, there is a tradeoff between the sampling rate and the storage space consumed by each sampled item. We derive the optimal tradeoff to achieve the minimal MSE.
• The server uses the ID of an item to query all the collected BFs; the corresponding total frequency count is then estimated from the number of BFs that claim to contain the item.
  – Need a sampling module running at each node to collect some item IDs and ship them to the central server.
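The 0-1 case above can be sketched as follows. This is a rough illustration under stated assumptions rather than the scheme from the paper: the talk does not give the estimator formula, so the one below assumes every filter has the same known false positive rate `fp_rate`, independent across filters, and inverts E[hits] = F·p·(1 − fp_rate) + N·fp_rate. The `BloomFilter` class, the function names, and all parameter values are hypothetical.

```python
import hashlib
import random

class BloomFilter:
    """A small Bloom filter; one byte per bit is used for readability."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def build_node_filter(items, p, rng, num_bits=1024, num_hashes=3):
    """Node side: sample each item with probability p and encode it in a BF."""
    bf = BloomFilter(num_bits, num_hashes)
    for x in items:
        if rng.random() < p:
            bf.add(x)
    return bf

def count_hits(filters, item):
    """Server side: number of collected BFs that claim to contain the item."""
    return sum(1 for bf in filters if item in bf)

def estimate_node_count(hits, num_nodes, p, fp_rate):
    """Assumed estimator: invert E[hits] = F*p*(1 - fp_rate) + num_nodes*fp_rate."""
    return (hits - num_nodes * fp_rate) / (p * (1.0 - fp_rate))

# Hypothetical run: item "x" is present at 60 of 100 nodes.
rng = random.Random(1)
filters = [build_node_filter(["x"] if i < 60 else ["y"], 0.5, rng)
           for i in range(100)]
print(estimate_node_count(count_hits(filters, "x"), 100, 0.5, 0.0))
```

With nearly empty filters the false positive rate is negligible, so the usage example passes `fp_rate=0.0`; in a real configuration it would be derived from the filter size and load.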


The General Case: A Straightforward Thinking
[Figure: each of Node #1, Node #2, ..., Node #N keeps one Bloom filter for every possible frequency count 1, 2, 3, 4, ..., k, ..., M.]

The Solution for the General Case
• The straightforward thinking may not be economical, since more BFs may result in more false positives and hence a larger MSE.
• Only set up BFs at a short series of quantization points selected from the whole range of the frequency count.
• How about $\langle x, c \rangle$ where $c$ is not equal to any quantization point?
[Figure: a number line with quantization points $b_0, b_1, b_2, b_3, b_4$. A pair whose count $c$ lies between $b_2$ and $b_3$ goes to Bloom filter $B_2$ with probability $\frac{b_3 - c}{b_3 - b_2}$ and to $B_3$ with probability $\frac{c - b_2}{b_3 - b_2}$.]
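The probabilistic assignment shown in the figure can be sketched as a randomized rounding step. The sketch below is illustrative only: the function names and the example quantization points are assumptions, and the talk's own points come from its greedy configuration algorithm, which is not reproduced here.

```python
import bisect
import random

def rounding_distribution(c, points):
    """Return [(quantization point, probability)] for a count c.

    `points` must be sorted and must satisfy points[0] <= c <= points[-1].
    """
    if not points[0] <= c <= points[-1]:
        raise ValueError("count lies outside the quantization range")
    j = bisect.bisect_left(points, c)
    if points[j] == c:
        return [(c, 1.0)]  # c is itself a quantization point
    lo, hi = points[j - 1], points[j]
    return [(lo, (hi - c) / (hi - lo)), (hi, (c - lo) / (hi - lo))]

def quantize(c, points, rng):
    """Pick one quantization point for c according to the distribution above."""
    dist = rounding_distribution(c, points)
    values = [b for b, _ in dist]
    weights = [w for _, w in dist]
    return rng.choices(values, weights=weights, k=1)[0]

points = [1, 4, 16, 64]  # hypothetical quantization points
dist = rounding_distribution(10, points)
expected = sum(b * w for b, w in dist)
print(dist, expected)
print(quantize(10, points, random.Random(0)))
```

The weights are chosen so that the expected rounded value equals the original count, since $b_2 \cdot \frac{b_3 - c}{b_3 - b_2} + b_3 \cdot \frac{c - b_2}{b_3 - b_2} = c$; the rounding adds variance but no bias.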

The Solution for the General Case (Cont’d)
• Assuming we set up $n$ quantization points, the server collects at most $N \times n$ Bloom filters. This can be viewed as the sum of $n$ 0-1 cases.
• The unbiased estimator of the total frequency count of item $x$ is given by $\widehat{F_x} = b_1 \times \widehat{f_1} + b_2 \times \widehat{f_2} + \cdots + b_n \times \widehat{f_n}$.
[Figure: across Node #1, Node #2, ..., Node #N, the Bloom filters at each quantization point $b_j$ ($b_1, b_2, \ldots, b_j, \ldots, b_n$) yield the corresponding 0-1 case count $f_j$ ($f_1, f_2, \ldots, f_j, \ldots, f_n$).]
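The weighted sum above can be sketched in a few lines, treating each quantization point as its own 0-1 case. This is a hedged illustration: the per-level estimate assumes a single sampling rate `p` and a single known false positive rate `fp_rate` shared by all levels, which is a simplification introduced here and not a detail given in the talk. All names and numbers are hypothetical.

```python
def estimate_total(hits_per_level, points, num_nodes, p, fp_rate):
    """Combine per-level Bloom filter hit counts into one total-count estimate.

    hits_per_level[j] is the number of nodes whose filter at quantization
    point points[j] claims to contain the item.
    """
    total = 0.0
    for b, hits in zip(points, hits_per_level):
        # Assumed 0-1 case estimate of how many nodes hold the item at level b.
        f_hat = (hits - num_nodes * fp_rate) / (p * (1.0 - fp_rate))
        total += b * f_hat
    return total

# Hypothetical numbers: 3 quantization points, 10 nodes, no false positives.
print(estimate_total([2, 1, 3], [1, 4, 16], num_nodes=10, p=0.5, fp_rate=0.0))
```

Because the estimate is a linear combination of the per-level estimates, it stays unbiased whenever each per-level estimate is unbiased.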

Configuring Quantization Points
• MSE = error caused by sampling and false positives of the BFs + rounding error; there is a tradeoff between these two types of errors.
• $\mathrm{MSE}[\widehat{F_x}] \le a F_x + d$, which is independent of the split patterns
• The slope $a$: rounding (quantization) error
• The intercept $d$: error caused by sampling and false positives of the BFs
• A greedy algorithm finds a short series of quantization points $\{b_j\}_{j=1}^{n}$, $b_1 < \cdots < b_n$, to minimize the bound $a F_x + d$.



Evaluation Results
We use NetFlow traces from the Abilene network to construct 216 distributed data sets; each item is identified by the destination IP address in each flow record. After carefully configuring the parameters in our scheme, the incurred communication cost (1,576.4 KB) is one order of magnitude smaller than the size of the original data (22,203.4 KB).
[Figure: relative error (y-axis, -0.15 to 0.15) versus original count (x-axis, 2000 to 20000).]



Conclusions and Future Work
• Two novel schemes for finding global icebergs over distributed data sets.
  – The sampling-based scheme: simple
  – The counting-sketch-based scheme: more involved but more accurate
• The unbiasedness of the resulting estimators, their variances, and accuracy tail bounds obtained using a newly developed tail bound theorem
• Future work
  – How about the data stream case, where each node does not have enough memory and computation power to process the incoming item arrivals and obtain their exact frequency counts?


Thank You!
