Processing of Very Large Data IV

Krzysztof Dembczyński
Institute of Computing Science, Laboratory of Intelligent Decision Support Systems
Politechnika Poznańska (Poznań University of Technology)

Intelligent Decision Support Systems, Master studies, first semester
Academic year 2011/12 (summer course)

Partitioning

Bloom Filter

Summary


Partitioning divides tables and indexes into smaller pieces, enabling these database objects to be managed and accessed at a finer level of granularity. Partitioning can improve manageability, performance, and availability. Partitioning is transparent to database queries.

Partitioning
- A table or index is subdivided into smaller pieces.
- Each piece of the database object is called a partition.
- Each partition has its own name and may have its own storage characteristics (e.g., table compression).
- From the perspective of a database administrator, a partitioned object has multiple pieces which can be managed either collectively or individually.
- However, from the perspective of the application, a partitioned table is identical to a non-partitioned table.

Partitioning of Tables
Tables are partitioned using a 'partitioning key', a set of columns that determines in which partition a given row will reside. Different techniques for partitioning tables:
- Hash partitioning: rows are divided into partitions using a hash function.
- Range partitioning: each partition holds a range of attribute values.

Partitioning of Tables
How partitions can be specified:
- Range of values of the partitioning key: for a table with a date column as the partitioning key, the 'January-2001' partition contains rows with partitioning-key values from '01-JAN-2001' to '31-JAN-2001'.
- List of values: for a table with a region column as the partitioning key, the 'North America' partition may contain the values 'Canada', 'USA', and 'Mexico'.
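To make the two schemes concrete, here is a minimal Python sketch of how a partitioning key could route rows to partitions. The md5-based hash, the monthly boundaries, and names such as hash_partition and RANGE_PARTITIONS are illustrative assumptions, not tied to any particular DBMS.

```python
import hashlib
from datetime import date

NUM_HASH_PARTITIONS = 4

def hash_partition(key: str) -> int:
    """Hash partitioning: a hash of the partitioning key, modulo the partition count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_HASH_PARTITIONS

# Range partitioning: each partition holds a contiguous range of key values
# (monthly partitions on a date column, as in the 'January-2001' example above).
RANGE_PARTITIONS = [
    (date(2001, 1, 1), date(2001, 1, 31), "JAN_2001"),
    (date(2001, 2, 1), date(2001, 2, 28), "FEB_2001"),
]

def range_partition(order_date: date) -> str:
    for lo, hi, name in RANGE_PARTITIONS:
        if lo <= order_date <= hi:
            return name
    raise ValueError("no partition covers this key")

print(hash_partition("customer-42"))       # one of 0..3
print(range_partition(date(2001, 1, 15)))  # JAN_2001
```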

Partitioning of Indexes
Different techniques for partitioning indexes:
- Local index: defined on a partitioned table and partitioned in exactly the same manner as the underlying table; each partition of a local index corresponds to one and only one partition of the underlying table.
- Global partitioned index: defined on a partitioned or non-partitioned table and partitioned using a different partitioning key from the table.
- Global non-partitioned index: essentially identical to an index on a non-partitioned table; the index structure is not partitioned.

Partitioning and Manageability
- Maintenance operations can be focused on particular portions of tables.
- Partial compression.
- Partial backups.
- Data recovery can be limited to individual partitions.
- A "divide and conquer" approach to data management.

Partitioning and Data Warehouses
- Partition the fact table:
  - Fact tables are big,
  - Queries can be processed in parallel for each partition,
  - The work is divided among the nodes in the cluster.
- Replicate dimension tables across cluster nodes:
  - Dimension tables are small,
  - Storing multiple copies of them is cheap,
  - No communication is needed for parallel joins.
- One big dimension:
  - Sometimes one dimension table is quite big (e.g., customer),
  - The big dimension table can be partitioned,
  - The fact table is then partitioned on the key of the big dimension.

Partitioning and Data Warehouses
- Reducing load time via partitioning:
  - Fact tables are often partitioned on date,
  - So are indexes, aggregate tables, etc.,
  - Newly loaded records go into the last partition,
  - Only indexes and aggregates for that partition need to be updated,
  - All other partitions remain unchanged.
- Expiring old data:
  - Older data is often less useful or relevant for data analysts,
  - To reduce data warehouse size, old data is often deleted,
  - If data is partitioned on date, simply delete or compress the oldest partition.

Partitioning and Data Warehouses
Example:
- New data are loaded into the database on a weekly basis,
- Database tables can be range-partitioned so that each partition contains one week of data,
- The load process is then simply the addition of a new partition,
- Adding a single partition is much more efficient than modifying the entire table.

Partitioning and Performance
- A query can be restricted to a particular partition: a query requesting orders for a single week accesses only a single partition.
- Multi-table joins can be improved by partitioning: when two tables are partitioned on the join key, a large join breaks into smaller joins between corresponding partitions, completing the overall join in less time.
- Queries can be executed in parallel across partitions.

Partitioning with Consistent Hashing
Traditional hashing with n nodes:
- Use hashing of the form h(k) mod n.
- Problem: a change in the number of nodes causes nearly all keys to be remapped!
Consistent hashing:
- Requires only a relatively small amount of data to be moved.
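As an illustration of the problem, here is a hypothetical Python experiment (not part of the slides): growing a cluster from 4 to 5 nodes under h(k) mod n remaps roughly 80% of the keys.

```python
import hashlib

def node_for(key: str, n: int) -> int:
    # Traditional scheme: h(k) mod n
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change nodes")  # roughly 80%
```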

Consistent Hashing
Hashing machines:
- Imagine wrapping the unit interval [0, 1) onto a circle.
- Suppose we number the machines 0, 1, ..., n − 1.
- If the hash function h(x) has range [0, R), we rescale it via h'(x) = h(x)/R, so that the hash function maps into the range [0, 1), i.e., effectively onto the circle.
- Then we can hash machine number j to a point h'(j) on the circle, for each machine j = 0, 1, ..., n − 1.

Consistent Hashing
Hashing data:
- Now suppose we have a key-value pair we want to store.
- We simply hash the key onto the circle, and then store the key-value pair on the first machine that appears clockwise of the key's hash point.
- If the hash function is uniform, roughly a fraction 1/n of the key-value pairs will be stored on any single machine.

Consistent Hashing
Adding a new node:
- The new machine n is hashed to the point h'(n).
- Most of the key-value pairs are completely unaffected by this change; only those whose key's hash point lies between h'(n) and the preceding node's point on the circle have to move.

Consistent Hashing
Two main problems:
- The distribution of keys can be quite irregular: if two of the machines get mapped to very nearby points on the circle, one of them may end up with very few of the keys.
- When a machine is added to the cluster, all the keys redistributed to that machine come from just one other machine.
Solution:
- Instead of mapping machine number j to a single point on the circle, map it to multiple points: for example, pick r points by hashing (j, 0), (j, 1), ..., (j, r − 1) onto the circle (see the sketch below).
- Otherwise, everything works exactly as before.
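A minimal Python sketch of such a ring, assuming md5 as the underlying hash and r = 100 points per machine; the class name ConsistentHashRing and these parameters are illustrative choices, not taken from the slides.

```python
import bisect
import hashlib

R = 2 ** 128  # range of the underlying hash

def unit_hash(x: str) -> float:
    """h'(x) = h(x)/R: rescale the hash onto the unit interval [0, 1), i.e. the circle."""
    return int(hashlib.md5(x.encode()).hexdigest(), 16) / R

class ConsistentHashRing:
    def __init__(self, machines, r: int = 100):
        # Map each machine j to r points by hashing (j, 0), (j, 1), ..., (j, r - 1).
        self.points = sorted(
            (unit_hash(f"{j}:{i}"), j) for j in machines for i in range(r)
        )

    def machine_for(self, key: str) -> int:
        p = unit_hash(key)
        # Store the key on the machine owning the first point clockwise of p,
        # wrapping around the circle at 1.
        idx = bisect.bisect(self.points, (p,))
        return self.points[idx % len(self.points)][1]

ring = ConsistentHashRing(machines=range(4))
print(ring.machine_for("user:1234"))  # one of 0..3
```

Because each machine owns r scattered arcs, a newly added machine takes over keys from many existing machines rather than from a single neighbour, which addresses both problems above.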

Partitioning

Bloom Filter

Summary

Bloom Filter
- A Bloom filter is a space-efficient probabilistic data structure for set membership.
- To maximize space efficiency, correctness is sacrificed: for a key that is not in the set, a Bloom filter may wrongly report that it is (this is called a false positive), but the probability of such a wrong answer can be made small.
- The more elements are added to the set, the larger the probability of false positives.

Bloom Filter
- An empty Bloom filter is a bit array of m bits, all set to 0:

  0 0 0 0 0 0 0 0 0 0

- There must also be k different hash functions defined, each of which maps (hashes) a set element to one of the m array positions with a uniform random distribution.

Bloom Filter
- To add an element, feed it to each of the k hash functions to get k array positions, and set the bits at all these positions to 1:

  0 1 0 1 0 1 0 0 0 0

- To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions:
  - If any of the bits at these positions is 0, the element is definitely not in the set; if it were, all the bits would have been set to 1 when it was inserted.
  - If all of them are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive.
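A minimal Python sketch of such a filter under the above definitions; deriving the k positions from two halves of one SHA-256 digest is a common implementation convenience, not something prescribed by the slides.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m              # empty filter: m bits, all 0

    def _positions(self, item: str):
        # Derive k array positions from two halves of one SHA-256 digest
        # (a common stand-in for k independent uniform hash functions).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1           # set all k positions to 1

    def might_contain(self, item: str) -> bool:
        # Any 0 bit: definitely not in the set; all 1: probably in the set.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1000, k=7)
bf.add("alice")
print(bf.might_contain("alice"))    # True
print(bf.might_contain("mallory"))  # False, barring a (rare) false positive
```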

Probability of False Positives
- We compute the probability of a false positive. The probability that one hash fails to set a given bit is 1 − 1/m.
- Hence, after all n elements have been inserted into the Bloom filter, the probability that a specific bit is still 0 is

  (1 − 1/m)^{kn} ≈ e^{−kn/m},

  assuming that the hash functions are independent and perfectly random.

Probability of False Positives
- The probability of a false positive is the probability that a specific set of k bits are all 1, which is

  (1 − (1 − 1/m)^{kn})^k ≈ (1 − e^{−kn/m})^k.

- Suppose we are given the ratio m/n and want to choose the number k of hash functions that minimizes the false positive rate. Note that more hash functions increase the precision of each probe but also the number of 1's in the filter, thus making false positives both less and more likely at the same time.

Probability of False Positives
- The solution is

  k = (ln 2) · m/n.

  This can be shown to be a global minimum.
- For the optimal value of k, the false positive rate is

  (1/2)^k = (0.6185)^{m/n}.

- Already m = 8n reduces the chance of error to roughly 2%, and m = 10n to less than 1%.
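These figures can be checked directly from the formula; a quick sketch (in practice k would be rounded to an integer):

```python
import math

for ratio in (8, 10):                  # ratio = m / n
    k = math.log(2) * ratio            # optimal number of hash functions
    fp = 0.5 ** k                      # equals 0.6185 ** ratio
    print(f"m = {ratio}n: k = {k:.1f}, false positive rate = {fp:.1%}")
# m = 8n : k = 5.5, false positive rate = 2.1%
# m = 10n: k = 6.9, false positive rate = 0.8%
```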

Partitioning

Bloom Filter

Summary

Summary
- Processing of very large data: partitioning, Bloom filters, ...

