Efficiently Indexing AND Querying Big Data in Hadoop MapReduce

Copyright of all slides: Jens Dittrich 2012

Jens Dittrich saarland university

computer science

This talk consists of three parts.

MapReduce Intro

Hadoop++

HAIL

First Part

MapReduce Intro

Big data is the new “very large“.

Big Data

Big data is everywhere: CERN... http://cdsweb.cern.ch/record/1295244

...moving objects indexing... http://www.istockphoto.com/stock-video-4518244-latrafic-a-time-lapse.php

...astronomy... http://www.flickr.com/photos/14924974@N02/2992963984/

[Physics]

basically whenever you point a satellite dish up in the air, you collect tons of data; but also in... http://it.wikipedia.org/wiki/File:KSC_radio_telescope.jpg

...genomics... http://www.istockphoto.com/stockillustration-16136234-dna-strands.php

...social networks...

...and search engines.

Google proposed a system to effectively analyze big data.

[Dean et al, OSDI’04]

That system was coined “MapReduce“. The system is Google-proprietary.

MapReduce

Hadoop is the open source variant. It has a large community of developers and start-up companies.

Big Data Tutorial

We presented a tutorial on Big Data Processing in Hadoop MapReduce at VLDB 2012. It contains many details on data layouts, indexing, query processing and so forth. The tutorial slides are available online: http://infosys.uni-saarland.de/publications/BigDataTutorialSlides.pdf

[VLDB 2012b]

Let‘s briefly revisit the MapReduce interface:

Semantics:

just two functions: map() and reduce()

map(key, value) -> set of (ikey, ivalue)
reduce(ikey, set of ivalue) -> (fkey, fvalue)

Let‘s look at a concrete use-case:

Google-Use Case:

This is vital for Google‘s search service you use everyday.

Web-Index Creation

In this use-case the map function...

map(key, value) -> set of (ikey, ivalue)

...takes a docID and a document (the contents of the document) and returns a set of (term,docID)-pairs.

map(docID, document) -> set of (term, docID)

For instance...

map(44, "This is text on a website!") -> {
("This", 44),
("is", 44),
("text", 44),
("on", 44),
("a", 44),
("website", 44) }

...map() will be called for document 44 with its contents “This is text on a website!“. The map()-function breaks this into pairs, one pair for each term occurring on website 44.

map(42, "This is just another website!") -> {
("This", 42),
("is", 42),
("just", 42),
("another", 42),
("website", 42) }

the same happens for document 42

map(43, "One more boring website!") -> {
("One", 43),
("more", 43),
("boring", 43),
("website", 43) }

and so forth

What about reduce()?

reduce(ikey, set of ivalue) -> (fkey, fvalue)

reduce(term, set of docID) -> (term, (posting list of docID, count))

For Web-index creation reduce() receives a term and the set of docIDs containing that term. reduce() then returns a pair of the input term and an ordered posting list of docIDs plus a count, i.e. the number of web pages having that term. Note: there are many variants of how to do web indexing with MapReduce. The actual semantics used by Google may differ; the core idea however is the same.

reduce("This", {42, 43}) -> ("This", ([42, 43], 2))

For instance: documents 42 and 43 contain “This“. reduce() simply returns an ordered posting list plus the count.

reduce("is", {42, 43}) -> ("is", ([42, 43], 2))

Documents 42 and 43 contain “is“. reduce() simply returns an ordered posting list plus count for this as well.

and so forth

reduce("boring", {43}) -> ("boring", ([43], 1)) etc.
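To make these semantics concrete, here is a minimal sketch of the web-index example in plain Java (no Hadoop involved; the tokenization and all names are illustrative assumptions, not code from the talk):

import java.util.*;

// map() emits one (term, docID) pair per term; reduce() builds the ordered
// posting list plus the count of web pages containing the term.
class WebIndexSketch {

    // map(docID, document) -> set of (term, docID)
    static List<Map.Entry<String, Integer>> map(int docID, String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String term : document.replaceAll("[^A-Za-z0-9 ]", "").split("\\s+")) {
            if (!term.isEmpty()) out.add(Map.entry(term, docID));
        }
        return out;
    }

    // reduce(term, set of docID) -> (term, (posting list of docID, count))
    static String reduce(String term, Collection<Integer> docIDs) {
        TreeSet<Integer> postings = new TreeSet<>(docIDs); // ordered, duplicates removed
        return term + " -> (" + postings + ", " + postings.size() + ")";
    }

    public static void main(String[] args) {
        System.out.println(map(44, "This is text on a website!")); // [This=44, is=44, ...]
        System.out.println(reduce("This", Arrays.asList(42, 43))); // This -> ([42, 43], 2)
    }
}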

Other Applications:

Search: rec.a==42, rec.contains("bla"), rec.contains(0011001)

Machine Learning: k-means, Mahout library

Web Analysis: sum of all accesses to page Y from user X, etc.

map() and reduce() with Big Data?

Many things can be mapped to the map()/reduce() interface, but not all. Think twice before blindly using MapReduce. It is useful for many things, but not all. Many important extensions have been made in the past to support more application classes, e.g. iterative problems.

Let‘s assume a user Bob who wants to analyze a large file.

Bob

Notice that this is a simplified explanation. For details on how Hadoop works see our paper: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing), VLDB 2010 http://infosys.cs.uni-saarland.de/publications/DQJ+10CRv2.pdf

[Slide figure: HDFS running on a cluster of datanodes DN1 ... DNn]

Bob first needs to upload his file to Hadoop's Distributed File System (HDFS). HDFS partitions his data into large horizontal partitions.

[Slide figure: the input file is cut into horizontal partitions (HDFS blocks A, B, ...) that are spread over datanodes DN1 ... DNn]

Those horizontal partitions are termed HDFS blocks. They are relatively large: at least 64MB up to 1GB. Do not confuse these large HDFS blocks with the typically small database pages (which are only a few KB in size).

[Slide figure: the HDFS blocks, 64 MB each by default, spread over datanodes DN1 ... DNn]

Each HDFS block receives a unique ID.

[Slide figure: the HDFS blocks of the input file numbered 1 to 9]

The HDFS blocks get distributed and replicated over the cluster. Each HDFS block gets replicated to at least three different data nodes (DN1, ..., DNn in this example).

[Slide figure: three copies of each HDFS block placed on different datanodes]
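The block size and the replication factor discussed on the last slides are plain HDFS configuration knobs (usually set cluster-wide in hdfs-site.xml). A minimal sketch; note that the exact property names differ slightly between Hadoop versions, shown here with Hadoop 1.x-style names:

import org.apache.hadoop.conf.Configuration;

// Sketch: increase the HDFS block size and keep the default replication factor of 3.
public class HdfsKnobs {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 256L * 1024 * 1024); // 256 MB blocks instead of the 64 MB default
        conf.setInt("dfs.replication", 3);                  // three copies of every HDFS block
        return conf;
    }
}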

HDFS does this for every HDFS block of the input file.

[Slide figure: the remaining HDFS blocks being replicated to the datanodes as well]

Eventually all HDFS blocks have been sent to the datanodes.

[Slide figure: all HDFS blocks and their replicas distributed over the datanodes]

We gain nice failover properties: even if two datanodes go offline, we still have one copy of the block.

Failover

[Slide figure: even with two datanodes offline, a copy of each block is still available on another datanode]

Assume that we want to retrieve HDFS block 3. We lost the copies on DN2 and DN5. However, we can still retrieve a copy of block 3 from DN6. Notice that once datanodes go offline, HDFS (should) copy blocks to other nodes to get back to having three copies of each block again.

Load Balancing

Another advantage of having three copies for each block is load balancing. Whenever a user or an application asks for a particular block, we have three options for retrieving that block. The decision which datanode to use may be made based on network locality, network congestion, and the current load of the datanodes.

I would like to have block 4!

[Slide figure: block 4 is available on three different datanodes; any of them can serve the request]
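The replica choice can be thought of as a small scoring problem. The following toy sketch is purely illustrative (plain Java; the names and the scoring formula are hypothetical; HDFS itself ranks replicas mainly by network-topology distance to the reader):

import java.util.List;

// Toy sketch: pick the "best" datanode holding a requested block by trading off
// network distance against the current load of the node.
class ReplicaChooser {
    record Replica(String datanode, int networkDistance, double currentLoad) {}

    static Replica choose(List<Replica> holders) {
        Replica best = holders.get(0);
        for (Replica r : holders) {
            // lower score = closer and less loaded
            if (r.networkDistance() + r.currentLoad() < best.networkDistance() + best.currentLoad()) {
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Replica> block4 = List.of(
                new Replica("DN3", 2, 0.7),
                new Replica("DN5", 1, 0.9),
                new Replica("DN6", 4, 0.1));
        System.out.println("read block 4 from " + choose(block4).datanode());
    }
}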

So now we have our data stored in HDFS. What about MapReduce? And by "MapReduce" I mean "Hadoop MapReduce" in the following.


MapReduce is another software layer on top of HDFS. MapReduce consists of three phases.

map(docID, document) -> set of (term, docID)

[Slide figure: MapReduce running as a software layer on top of HDFS]

In the first phase (the Map Phase) only the map function is considered.

Map Phase MapReduce

map(docID, document) -> set of (term, docID)


MapReduce assigns a thread (also known as a Mapper) to every datanode having data to be processed for this job.

Map Phase MapReduce

map(docID, document) -> set of (term, docID)

[Slide figure: mappers M1 ... M7 assigned to the datanodes that hold blocks of the input file]

Each mapper reads one of the HDFS blocks...

Map Phase MapReduce

map(docID, document) -> set of (term, docID)

[Slide figure: each mapper reads one HDFS block from its local datanode; from Bob's perspective only map() and the input file are visible]

...and breaks that HDFS block into records. This can be customized with the RecordReader. For each record map() is called. The output (also called intermediate results) is collected on the local disks of the datanodes. For instance, for block 6 the output is collected in file 6'.

[Slide figure: the intermediate output of each mapper is written to a local file (1', 2', ..., 9') on its datanode]
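For illustration, this is roughly what such a map task looks like in Hadoop's (newer) mapreduce API - a sketch, not code from the talk, assuming TextInputFormat-style (offset, line) records where each line is "docID<TAB>document text":

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The RecordReader hands one record at a time to map(); with the default
// TextInputFormat a record is (byte offset, line of text).
public class TermMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Text term = new Text();
    private final LongWritable docId = new LongWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2); // assumed layout: docID<TAB>document text
        if (parts.length < 2) return;
        docId.set(Long.parseLong(parts[0]));
        for (String t : parts[1].split("\\W+")) {
            if (t.isEmpty()) continue;
            term.set(t);
            context.write(term, docId); // emit (term, docID) as intermediate result
        }
    }
}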

This is done for every block of the input file. Obviously we do not have to do this with every copy of an HDFS block; processing one copy is enough. With this the Map Phase is finished.


Now, the Shuffle Phase starts.

Shuffle Phase MapReduce

group by term


In the Shuffle Phase all intermediate results are redistributed over the different datanodes. In this example we want to redistribute the intermediate results based on the term, i.e. we want to group all intermediate results by term.

[Slide figure: the intermediate files are sent over the network so that all results for the same term end up on the same datanode]

This means, after shuffling, we obtain a range partitioning on terms....


For instance, DN1 contains all intermediate results having terms starting with A or B. In turn, DN2 has only terms starting with C or D and so forth. Once all data has been redistributed, the Shuffle Phase is finished.

[Slide figure: after shuffling, DN1 holds the terms A-B, DN2 the terms C-D, DN3 E-F, and so on up to W-Z]
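Hadoop's default partitioner hashes keys; the range partitioning by term shown on these slides could be approximated with a custom Partitioner along the following lines (a sketch under that assumption, class name hypothetical). It would be registered on the job via job.setPartitionerClass(TermRangePartitioner.class).

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route terms to reducers by their first letter, so that each reducer (and thus
// each datanode) ends up with a contiguous alphabetic range such as A-B or C-D.
public class TermRangePartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text term, LongWritable docId, int numPartitions) {
        String s = term.toString().toLowerCase();
        int letter = s.isEmpty() ? 0 : Math.max(0, Math.min(25, s.charAt(0) - 'a'));
        return letter * numPartitions / 26; // spread the 26 letters evenly over the reducers
    }
}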

Reduce Phase

In the final Reduce Phase, MapReduce assigns threads to the different datanodes having intermediate results. These threads are termed reducers. The reducers read the intermediate results and for each distinct key they call reduce(). Notice that reduce() may only be called once for each distinct key on the entire cluster. Otherwise the semantics of the map()/reduce() paradigm would be broken.

reduce(term, set of docID) -> set of (term, (posting list of docID, count))

[Slide figure: reducers R1 ... Rn running on the datanodes that hold the term ranges A-B, C-D, ..., W-Z]
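A sketch of the corresponding reducer in the same API (again an assumption, not code from the talk), turning (term, {docID, ...}) into (term, (ordered posting list, count)):

import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// reduce() is called once per distinct term with all docIDs containing that term.
public class PostingListReducer extends Reducer<Text, LongWritable, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<LongWritable> docIds, Context context)
            throws IOException, InterruptedException {
        TreeSet<Long> postings = new TreeSet<>();     // ordered, duplicates removed
        for (LongWritable id : docIds) {
            postings.add(id.get());
        }
        context.write(term, new Text("(" + postings + ", " + postings.size() + ")"));
    }
}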

The output of the reduce()-calls is stored on disk again. In this example the output is shown as being stored on local disk. However, Hadoop stores the output on HDFS by default, i.e. the output of the Reduce Phase gets replicated by HDFS again.

Reduce Phase

reduce(term, set of docID) -> set of (term, (posting list of docID, count))

[Slide figure: the reducers write their output files A-B', C-D', ..., W-Z' back to HDFS]

When to replicate which data for fault tolerance in MapReduce is an interesting discussion. See our paper RAFT for more details: http://infosys.uni-saarland.de/publications/QPSD11.pdf


Hadoop MapReduce Advantages


failover, scalability, schema-later, ease of use


Hadoop MapReduce Disadvantages

This, and how to fix it, is what the following material is about.

Performance

MapReduce Intro

Hadoop++

HAIL

Second Part

Hadoop++

Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jörg Schad Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing) VLDB 2010/PVLDB, Singapore http://infosys.cs.uni-saarland.de/publications/DQJ+10CRv2.pdf slides: http://infosys.cs.uni-saarland.de/publications/DQJ+10talk.pdf

[VLDB 2010a]

The “Map Reduce Plan“

[Slide figure: "The Hadoop Plan" - Hadoop's processing pipeline expressed as a physical query execution plan (Figure 1 of the Hadoop++ paper), spanning the Data Load Phase, Map Phase, Shuffle Phase, and Reduce Phase, with blocks T1 ... T6, datanode subplans H1 ... H4, mapper subplans M1 ... M4, and reducer subplans R1 and R2]

The "MapReduce Plan"

The figure shows an example with 4 mappers and 2 reducers. Hadoop MapReduce uses a hard-coded pipeline. This pipeline cannot be changed. This is in sharp contrast to database systems, which may use different pipelines for different queries.

The 10 UDFs are: block, split, itemize, mem, map, sh, cmp, grp, combine, reduce.

M determines the number of mapper subplans, whereas R determines the number of reducer subplans. Let's analyze The Hadoop Plan in more detail: Data Load Phase. To be able to run a MapReduce job, we first load the data into the distributed file system. This is done by partitioning the input T horizontally into disjoint subsets T1, ..., Tb. See the physical partitioning operator PPart in subplan L. In the example b = 6, i.e. we obtain subsets T1, ..., T6. These subsets are called blocks. The partitioning function block partitions the input T based on the block size. Each block is then replicated (Replicate). The default number of replicas used by Hadoop is 3, but this may be configured. For presentation reasons, in the example we replicate each block only once. The figure shows 4 different data nodes with subplans H1-H4. Replicas are stored on different nodes in the network (Fetch and Store). Hadoop tries to store replicas of the same block on different nodes. Map Phase. In the map phase each map subplan M1-M4 reads a subset of the data called a split from HDFS. A split is a logical concept typically comprising one or more blocks. This assignment is defined by UDF split. In the example, the split assigned to M1 consists of two blocks which may both be retrieved from subplan H1. Subplan M1 unions the input blocks T1 and T5 and breaks them into records (RecRead). The latter operator uses a UDF itemize that defines how a split is divided into items. Then subplan M1 calls map on each item and passes the output to a PPart operator. This operator divides the output into so-called spills based on ...

However, Hadoop MapReduce uses 10 user-defined functions (UDFs). These UDFs can be used to inject arbitrary code into Hadoop...

...including code that was not intended to be injected into Hadoop. Our idea is somewhat similar to a trojan horse or a trojan, i.e. a virus that is injected into a computer system to harm or destroy the system. However, we inject trojans to improve or heal the system. Therefore we are calling them...

(Footnote: Not to be confused with spills; see below. We use the terminology introduced in [10].)

http://www.istockphoto.com/stock-photo-1824642trojan-horse.php?st=0bab152

Good Trojans!

results taken from our VLDB 2010 paper

[Slide chart: Selection Task - runtime in seconds on 10, 50, and 100 nodes for Hadoop, HadoopDB, HadoopDB Chunks, Hadoop++(256MB), and Hadoop++(1GB)]

Hadoop++ is in the same ballpark or even faster than HadoopDB (now spun off as Hadapt).

Hadoop++ is up to a factor of 18 faster than Hadoop.

[Slide chart: Join Task - runtime in seconds on 10, 50, and 100 nodes for the same systems]

...even though we do not modify the underlying HDFS and Hadoop MapReduce source code at all! Our improvements are all done through UDFs only.

But wait: UDFs are also available in traditional database systems. What happens if we exploit those UDFs to inject "better technology" into an existing database system? Say we inject column store technology into a commercial, closed-source row store. What would that look like?

Good Trojans in a DBMS?

[Slide table/chart: "Good Trojans Versus Closed Source Column Store DBMS" - query times in seconds for Q1, Q6, Q12, Q14 on a standard row store (DBMS-Y, DBMS-Z), Trojan Columns in DBMS-Z, and the column stores Vectorwise and Vertica]

[CIDR 2013a]

It looks like this. For details see our paper: Alekh Jindal, Felix Martin Schuhknecht, Jens Dittrich, Karen Khachatryan, Alexander Bunte. How Achaeans Would Construct Columns in Troy. CIDR 2013, Asilomar, USA. http://infosys.uni-saarland.de/publications/How%20Achaeans%20Would%20Construct%20Columns%20in%20Troy.pdf You can get much faster than a row store. We are not as fast as a from-scratch implementation of a column store. This has to do with the different QP technology used. However, for some customers the performance of a native column store might not even be required - especially for medium-sized datasets. ...

[Slide chart: "Good Trojans in a Closed Source Row Store DBMS" - query times in seconds for Q1, Q6, Q12, Q14 comparing the standard row layout, a materialized view, and Trojan Columns]

Why buy a car with 1000hp if 200hp are just enough?

An interesting result from our work is also that we can beat materialized views for some queries.

[CIDR 2013a]

With this let‘s close this footnote and go back to Hadoop MapReduce performance. UDFs in Hadoop allow us to boost query performance without changing the underlying system. However, what are the...

...Good Trojans in a DBMS!

Problems:

Upload Times

[Slide chart: upload runtime in seconds on 10, 50, and 100 nodes for Hadoop, HadoopDB, Hadoop++(256MB), and Hadoop++(1GB), broken down into (I)ndex Creation, (C)o-Partitioning, and Data (L)oading]

from the Hadoop++ paper: The problem is that in order to have fast queries we first have to "massage" the data beforehand, i.e. create indexes, co-partition and so forth. This takes time. Hadoop does not have to spend this time and therefore uploading the data to HDFS is fast. In contrast, for Hadoop++ (but also HadoopDB) we also have to do a lot of extra work. This extra work is very costly. So costly that only after running many queries these investments are amortized. In other words, if we only want to run a few queries exploiting our indexes and co-partitioning, we shouldn't use Hadoop++ in the first place but rather run the queries directly on Hadoop! How could we fix this problem?

=> back to scanning?
=> index selection algos?
=> coarse-granular indexes

We could drop the idea of using indexes: just scan everything. Well, we are not gonna follow this approach. We could invest into better index selection algorithms. If we pick the wrong index, index creation is unlikely to be amortized. Therefore making the right choice is important. Therefore... Well, we are not gonna follow this approach. Or: as it is expensive to create all these indexes, we better investigate coarse-granular indexes, i.e. indexes that are cheaper to construct and yet give some benefit at query time. Well, we are not gonna follow this approach.

We do something different. Which brings me to...

MapReduce Intro

Hadoop++

HAIL

... the third part of my talk. The approach I would like to present is coined HAIL.

HAIL

[SOCC 2011] [VLDB 2012a] [int'l patent]

Hadoop Aggressive Indexing Library

HAIL means Hadoop Aggressive Indexing Library. For details see our paper: Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad. Only Aggressive Elephants are Fast Elephants. VLDB 2012/PVLDB, Istanbul, Turkey. http://infosys.uni-saarland.de/publications/HAIL.pdf A predecessor of this work focussing on data layouts in HDFS is our paper: Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, Jens Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. ACM SOCC 2011, Cascais, Portugal. http://infosys.uni-saarland.de/publications/JQD11.pdf

So back to Bob again. Recall that Bob wants to analyze a large file with Hadoop MapReduce. So he first has to upload his file to HDFS. In our approach we replace HDFS with HAIL. HAIL is an extension of HDFS.

[Slide figure: Bob uploads his file to HAIL, which runs on datanodes DN1 ... DNn]

As before, Bob's file gets partitioned into HDFS blocks...


those blocks are relatively large, at least 64MB


then those blocks get distributed to the different datanodes (just as above)


the HDFS blocks also get replicated (just as above) but then, before writing the data to the local disks on the different datanodes, we do something in addition:


we sort the data on each HDFS block in main memory. Each replica is sorted using a different sort criterion. This means after sorting, each HDFS block is available in three different sort orders - roughly corresponding to three different clustered indexes. Notice that we do not redistribute data across HDFS blocks! Data that was on one particular block in standard HDFS will sit on the same HDFS block in HAIL. In other words: the different copies of the block contain the same data - yet in different sort orders.

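Conceptually, the per-replica sorting boils down to the following toy sketch (plain Java; the record type and the three sort attributes are hypothetical): the same records are copied once per replica, and each copy is sorted by a different attribute before it goes to disk.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Same data in every replica of the block - but a different sort order per replica,
// which is what gives HAIL (up to) one clustered index per replica.
class HailBlockSorter {
    record Rec(long id, String url, int visitDate) {}

    static List<List<Rec>> buildReplicas(List<Rec> block) {
        List<List<Rec>> replicas = new ArrayList<>();
        replicas.add(sortedCopy(block, Comparator.comparingLong(Rec::id)));       // replica 1: by id
        replicas.add(sortedCopy(block, Comparator.comparing(Rec::url)));          // replica 2: by url
        replicas.add(sortedCopy(block, Comparator.comparingInt(Rec::visitDate))); // replica 3: by visitDate
        return replicas;
    }

    static List<Rec> sortedCopy(List<Rec> block, Comparator<Rec> sortKey) {
        List<Rec> copy = new ArrayList<>(block); // no data is moved across blocks ...
        copy.sort(sortKey);                      // ... only reordered within the block
        return copy;
    }
}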

Again, we do this for each and every copy of a block.


Notice that this is done without introducing additional I/O. We fully piggyback on the existing HDFS upload pipeline.


Eventually uploading (and indexing) is finished. What does this mean for HDFS failover?


Failover


Well actually, nothing changes. All data sits on the same HDFS blocks as before. For instance, if we lose two of the three datanodes holding block 3, we can still retrieve block 3 from the remaining one. That block might not be sorted along the desired sort criteria, but it contains all the data. And we can use the remaining block to recreate additional copies in other sort orders.

Well it is all somewhat more complex than explained in the previous example.

little Details

HAIL Upload Pipeline

[Slide figure: the HAIL upload pipeline - the HAIL client preprocesses the uploaded data, converts it into PAX blocks with block metadata, and forwards packets (PCK) to the datanodes; each datanode builds its index and index metadata, reassembles and appends the block, acknowledges the packets (ACK), and registers the block and its layout with the HDFS namenode (block directory and HAIL replica directory)]

HAIL Query Pipeline

[Slide figure: the HAIL query pipeline from Bob's and the system's perspective - Bob writes and runs a MapReduce job, the Job Client hands it to the Job Tracker, which builds the splits and allocates map tasks to the Task Trackers on the datanodes]

for each block_i in input {
  locations = block_i.getHostWithIndex(@3);
  splitBuilder.add(locations, block_i);
}
splits[] = splitBuilder.result;

send splits[]

for each split_i in splits {
  allocate split_i to closest DataNode storing block_i
}

Record Reader:
- Perform Index Scan
- Perform Post-Filtering
- for each Record invoke map(HailRecord)

allocate Map Task

Main Class: map(...), reduce(...)

...
@HailQuery(
  filter="@3 between(1999-01-01, 2000-01-01)",
  projection={@1})
void map(Text k, HailRecord v) {
  output(v.getInt(1), null);
}
...


Experiments

We play some other tricks. For instance, while reading the input file, we immediately parse the data into a binary PAX (column-like) layout. The PAX data is then sent to the different datanodes. We also had to make sure to not break the involved data consistency-checks used by HDFS (packet acknowledge). In addition, we extended the namenode to record additional information on the sort orders and layouts used for the different copies of an HDFS block. The latter is needed at query time. None of these changes affects the principal properties of HDFS. We just extend and piggyback.
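A PAX-style block keeps all the data of one HDFS block but groups it by attribute inside the block. A toy sketch of the idea (plain Java; the record type and attributes are hypothetical, this is not the HAIL source):

import java.util.List;

// Row-to-column conversion within a single block: a query touching only one
// attribute then reads one contiguous array instead of skipping through full rows.
class PaxBlock {
    record Row(long id, String url, int visitDate) {}

    final long[] ids;
    final String[] urls;
    final int[] visitDates;

    PaxBlock(List<Row> rows) {
        ids = new long[rows.size()];
        urls = new String[rows.size()];
        visitDates = new int[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            Row r = rows.get(i);
            ids[i] = r.id();
            urls[i] = r.url();
            visitDates[i] = r.visitDate();
        }
    }
}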

At query time HAIL needs to know how to filter input records and which attributes to project in intermediate result tuples. This can be solved in many ways. In our current implementation we allow users to annotate their map-function with the filter and projection conditions. However this could also be done fully transparently by using static code analysis as shown in: Michael J. Cafarella, Christopher Ré: Manimal: Relational Optimization for Data-Intensive Programs. WebDB 2010. That code analysis could be used directly with HAIL. Another option, if the map()/reduce()-functions are not produced by a user, is to adjust the application. ...

...Possible "applications" might be Pig, Hive, Impala or any other system producing map()/reduce() programs as its output. In addition, any other application not relying on MapReduce but just relying on HDFS might use our system, i.e. applications directly working with HDFS files.

What happens if we upload a file to HDFS?

Upload Times

[Slide chart: Upload Time in seconds vs. number of created indexes (0 to 3), all with 3 replicas - Hadoop: 1132 seconds (no index creation); Hadoop++: 3472 with 0 indexes, 5766 with 1 index; HAIL: 671, 704, 712, and 717 for 0, 1, 2, and 3 indexes]

Let's start with the case that no index is created by any of the systems. Hadoop takes about 1132 seconds.


Hadoop++ is considerably slower. Even though we switch off index creation here, Hadoop++ runs an extra job to convert the input data to binary. This takes a while. What about HAIL?


HAIL is faster than Hadoop HDFS. How can this be? We are doing more work than Hadoop HDFS, right? For instance, we convert the input file to binary PAX during upload directly. This can only be slower than Hadoop HDFS but NOT faster. Well, when converting to binary layout it turns out that the binary representation of this dataset is smaller than the textual representation. Therefore we have to write less data and SAVE I/O. Therefore HAIL is faster for this dataset. This does not necessarily hold for all datasets. Notice that for this and the following experiments we are not using compression yet. We expect compression to be even more beneficial for HAIL.


So what happens if we start creating indexes? We should then feel the additional index creation effort in HAIL.

For Hadoop++ we observe long runtimes. For HAIL, in contrast to what we expected, we observe only a small increase in the upload time.


The same observation holds when creating two clustered indexes with HAIL...


...or three. This is because standard file upload in HDFS is I/O-bound. The CPUs are mostly idle. HAIL simply exploits the unused CPU ticks that would be idling otherwise. Therefore the additional effort for indexing is hardly noticeable.


Disk space is cheap. For some situations it could be affordable to store more than three copies of an HDFS block. What would be the impact on the upload times? The next experiment shows the results...


Here we create up to 10 replicas - corresponding to 10 different clustered indexes.

Replica Scalability

[Slide chart: Replica Scalability - upload time in seconds for Hadoop and HAIL when creating 3 (the default), 5, 6, 7, and 10 replicas; the Hadoop upload time with 3 replicas is marked as a baseline]

We observe that in the same time HDFS uploads the data without creating any index, HAIL uploads the data, converts to binary PAX, and creates six different clustered indexes.

Scale-Out

[Slide chart: upload time in seconds for Hadoop and HAIL on 10, 50, and 100 EC2 nodes, for the Syn and UV datasets]

modulo variance, see [VLDB 2010b]

We also evaluated upload times on the cloud using EC2 nodes. Notice that experiments on the cloud are somewhat problematic due to the high runtime variance in those environments. For details see our paper: Jörg Schad, Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance VLDB 2010/PVLDB, Singapore. http://infosys.cs.uni-saarland.de/publications/SDQ10.pdf slides: http://infosys.cs.uni-saarland.de/publications/SDQ10talk.pdf

What about query times?


Query Times

[Slide chart: Individual Jobs (Weblog) - RecordReader runtime in milliseconds for Hadoop, Hadoop++, and HAIL on jobs Bob-Q1 to Bob-Q5]

Here we display the RecordReader times. They correspond roughly to the data access time; the time for further processing, which is equal in all systems, is factored out. We observe that HAIL improves query runtimes dramatically. Hadoop resorts to full scan in all cases. Hadoop++ can benefit from its index if the query happens to hit the right filter condition. In contrast, HAIL supports many more filter conditions. What does this mean for the overall job runtimes?

[Slide chart: Individual Jobs (Weblog) - end-to-end job runtime in seconds for Hadoop, Hadoop++, and HAIL on Bob-Q1 to Bob-Q5]

Well, those results are not so great. The benefits of HAIL over the other approaches seem marginal. How come? It has to do with...

Scheduling Overhead

[Slide chart: the same job runtimes for Bob-Q1 to Bob-Q5, with the Hadoop scheduling overhead highlighted]

...the Hadoop scheduling overhead. Hadoop was designed having long running tasks in mind. Accessing indexes in HAIL, however, is in the order of milliseconds. These milliseconds of index access are overshadowed by scheduling latencies. You can try this out with a simple experiment. Write a MapReduce job that does not read any input, does not do anything, and does not produce any output. This takes about 7 seconds - for doing nothing. How could we fix this problem? By...
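The "job that does nothing" experiment is easy to reproduce. A hedged sketch using the newer mapreduce API (method names differ slightly across Hadoop versions; the input and output paths are placeholders): the job reads a tiny input, emits nothing, runs no reducers, and its wall-clock time is therefore almost pure scheduling and startup overhead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NoOpJob {
    // A mapper that reads its records but emits nothing.
    public static class NoOpMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // intentionally empty
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "no-op");
        job.setJarByClass(NoOpJob.class);
        job.setMapperClass(NoOpMapper.class);
        job.setNumReduceTasks(0);                                          // no reduce phase
        FileInputFormat.addInputPath(job, new Path("/tmp/tiny-input"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/tmp/noop-output")); // placeholder path
        long start = System.currentTimeMillis();
        job.waitForCompletion(true);
        System.out.println("scheduling overhead: " + (System.currentTimeMillis() - start) + " ms");
    }
}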

...introducing “HAIL scheduling“. But let‘s first look back at standard Hadoop scheduling:

HAIL Scheduling

Hadoop Scheduling MapReduce

map(row) -> set of (ikey, value)


In Hadoop, each HDFS block will be processed by a different map task. This leads to waves of map tasks each having a certain overhead. However the assignment of HDFS blocks to map tasks is not fixed. Therefore, ...

7 map tasks (aka waves) -> 7 times scheduling overhead

HAIL Scheduling MapReduce

map(row) -> set of (ikey, value)


1 map task (aka wave) -> 1 time scheduling overhead (one HAIL Split per datanode)

...in HAIL Scheduling we assign all HDFS blocks that need to be processed on a datanode to a single map task. This is achieved by defining appropriate "splits"; see our paper for details. The overall effect is that we only have to pay the scheduling overhead once rather than 7 times (in this example). Notice that in case of failover we can simply reschedule index access tasks - they are fast anyways. Additionally, we could combine this with the recovery techniques from RAFT, ICDE 2011.
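Conceptually, HAIL Scheduling boils down to grouping the blocks by datanode before creating the splits, so that each node runs a single map task. A toy sketch (plain Java; all types are hypothetical, this is not the HAIL source):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One split (and thus one map task and one scheduling overhead) per datanode,
// bundling every block that node has to process.
class HailSplitBuilder {
    record BlockInfo(int blockId, String chosenDatanode) {}
    record HailSplit(String datanode, List<Integer> blockIds) {}

    static List<HailSplit> buildSplits(List<BlockInfo> blocks) {
        Map<String, List<Integer>> byNode = new HashMap<>();
        for (BlockInfo b : blocks) {
            byNode.computeIfAbsent(b.chosenDatanode(), n -> new ArrayList<>()).add(b.blockId());
        }
        List<HailSplit> splits = new ArrayList<>();
        byNode.forEach((node, ids) -> splits.add(new HailSplit(node, ids)));
        return splits;
    }
}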

What are the end-to-end query runtimes with HAIL Scheduling?

Query Times with HAIL Scheduling

Now the good RecordReader times seen above translate to (very) good query times.

[Slide chart: Individual Jobs (Weblog) - end-to-end job runtime in seconds; with HAIL Scheduling, HAIL finishes Bob-Q1 to Bob-Q5 in roughly 15 to 65 seconds, compared to roughly 600 to 1160 seconds for Hadoop and Hadoop++]

What about failover?

Failover

[Slide chart: job runtime in seconds on 10 nodes with one node killed, for Hadoop, HAIL, and HAIL-1Idx; the slowdown relative to the run without a failure is between 5.5 % and 10.5 %]

We use two configurations for HAIL. First, we configure HAIL to create indexes on three different attributes, one for each replica. Second, we use a variant of HAIL, coined HAIL-1Idx, where we create an index on the same attribute for all three replicas. We do so to measure the performance impact of HAIL falling back to full scan for some blocks after the node failure. This happens for any map task reading its input from the killed node. Notice that, in the case of HAIL-1Idx, all map tasks will still perform an index scan as all blocks have the same index. Overall result: HAIL inherits Hadoop MapReduce‘s failover properties.

Summary(Summary(...)) of this talk

What I tried to explain to you in my talk is:

BigData => MapReduce Intro

Hadoop++

HAIL

Hadoop MapReduce is THE engine for big data analytics.

BigData => Hadoop++, HAIL

BigData => Hadoop++, HAIL: fast indexing AND fast querying

By using good trojans you can improve system performance dramatically AND after the fact --- even for closed-source systems.


HAIL allows you to have fast index creation AND fast query processing at the same time.


project page: http://infosys.uni-saarland.de/hadoop++.php Copyright of all slides: Jens Dittrich 2012

fast indexing AND fast querying

annotated slides are available on that page

infosys.uni-saarland.de
