Efficiently Indexing AND Querying Big Data in Hadoop MapReduce
Copyright of all slides: Jens Dittrich 2012
Jens Dittrich saarland university
computer science
This talk consists of three parts.
MapReduce Intro
Hadoop++
HAIL
First Part
MapReduce Intro
Big data is the new “very large“.
Big
Data
Big data is everywhere: CERN... http://cdsweb.cern.ch/record/1295244
...moving objects indexing... http://www.istockphoto.com/stock-video-4518244-latrafic-a-time-lapse.php
...astronomy... http://www.flickr.com/photos/14924974@N02/2992963984/
[Physics]
basically whenever you point a satellite dish up in the air, you collect tons of data, but also in... http://it.wikipedia.org/wiki/File:KSC_radio_telescope.jpg
...genomics... http://www.istockphoto.com/stockillustration-16136234-dna-strands.php
...social networks...
...and search engines.
Google engineers proposed a system to effectively analyze big data.
[Dean et al, OSDI’04]
That system was coined “MapReduce“. The system is Google-proprietary.
MapReduce
Hadoop is the open source variant. It has a large community of developers and start-up companies.
Big Data Tutorial
We presented a tutorial on Big Data Processing in Hadoop MapReduce at VLDB 2012. It contains many details on data layouts, indexing, query processing and so forth. The tutorial slides are available online: http://infosys.uni-saarland.de/publications/BigDataTutorialSlides.pdf
[VLDB 2012b]
Let‘s briefly revisit the MapReduce interface:
Semantics:
just two functions: map() and reduce()
map(key, value) -> set of (ikey, ivalue) reduce(ikey, set of ivalue) -> (fkey, fvalue)
Let‘s look at a concrete use-case:
Google-Use Case:
This is vital for Google's search service you use every day.
Web-Index Creation
In this use-case the map function...
map(key, value) -> set of (ikey, ivalue)
...takes a docID and a document (the contents of the document) and returns a set of (term,docID)-pairs.
map(docID, document) -> set of (term, docID)
For instance...
map(44, "This is text on a website!") -> {
("This", 44),
("is", 44),
("text", 44),
("on", 44),
("a", 44),
("website", 44) }
...map() will be called for document 44 with its contents “This is text on a website!“. The map()-function breaks this into pairs, one pair for each term occurring on website 44.
map(42, "This is just another website!") -> {
("This", 42),
("is", 42),
("just", 42),
("another", 42),
("website", 42) }
the same happens for document 42
map(43, "One more boring website!") -> {
("One", 43),
("more", 43),
("boring", 43),
("website", 43) }
and so forth
What about reduce()?
reduce(ikey, set of ivalue) -> (fkey, fvalue)
reduce(term, set of docID) -> (term, (posting list of docID, count))
For Web-index creation reduce() receives a term and the set of docIDs containing that term. reduce() then returns a pair of the input term and an ordered posting list of docIDs plus a count, i.e. the number of web pages having that term. Note: there are many variants of how to do web indexing with MapReduce. The actual semantics used by Google may differ; the core idea however is the same.
reduce("This", {42, 44}) -> ("This", ([42, 44], 2))
For instance: documents 42 and 44 contain "This". reduce() simply returns an ordered posting list plus the count.
reduce("is", {42, 44}) -> ("is", ([42, 44], 2))
Documents 42 and 44 contain "is". reduce() simply returns an ordered posting list plus count for this as well.
and so forth
reduce("boring", {43}) -> ("boring", ([43], 1)) etc.
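The web-index example above can be simulated end to end in a few lines. This is only an illustrative sketch, not Hadoop code; the function names and the punctuation handling are my own assumptions:

```python
from collections import defaultdict

def web_index_map(doc_id, document):
    # map(docID, document) -> set of (term, docID)
    return [(term.strip("!?.,"), doc_id) for term in document.split()]

def web_index_reduce(term, doc_ids):
    # reduce(term, set of docID) -> (term, (posting list of docID, count))
    posting_list = sorted(set(doc_ids))
    return (term, (posting_list, len(posting_list)))

def run_job(documents):
    # "Shuffle": group all intermediate (term, docID) pairs by term.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for term, d in web_index_map(doc_id, text):
            groups[term].append(d)
    # reduce() is called exactly once per distinct term.
    return dict(web_index_reduce(t, ds) for t, ds in groups.items())

index = run_job({
    42: "This is just another website!",
    43: "One more boring website!",
    44: "This is text on a website!",
})
```

Running this yields the posting lists from the slides, e.g. one entry per distinct term with an ordered list of documents and a count.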
Other Applications:
Search rec.a==42, rec.contains("bla"), rec.contains(0011001)
Machine Learning k-means, mahout library
Web-Analysis sum of all accesses to page Y from user X etc.
map() and reduce() with
Big
Data ?
Many things can be mapped to the map()/reduce() interface, but not all. Think twice before blindly using MapReduce. It is useful for many things, but not all. Many important extensions have been made in the past to support more application classes, e.g. iterative problems.
Let‘s assume a user Bob who wants to analyze a large file.
Bob
Notice that this is a simplified explanation. For details on how Hadoop works see our paper: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing), VLDB 2010 http://infosys.cs.uni-saarland.de/publications/DQJ+10CRv2.pdf
Bob first needs to upload his file to Hadoop's Distributed File System (HDFS). HDFS partitions his data into large horizontal partitions.
[figure: the file is partitioned horizontally into parts A, B, ... which become HDFS blocks stored across datanodes DN1 ... DNn]
Those horizontal partitions are termed HDFS blocks. They are relatively large: at least 64MB up to 1GB. Do not confuse these large HDFS blocks with the typically small database pages (which are only a few KB in size).
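The horizontal partitioning into fixed-size blocks can be sketched as follows. This is illustrative only; real HDFS performs the split inside its client library, and the 64MB constant is just the default mentioned above:

```python
# 64 MB is HDFS's default block size, as mentioned above.
BLOCK_SIZE = 64 * 1024 * 1024

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, offset, length) for each horizontal partition."""
    blocks = []
    offset, block_id = 0, 1
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((block_id, offset, length))
        offset += length
        block_id += 1
    return blocks

# A 200 MB file yields three full 64 MB blocks plus one 8 MB remainder.
blocks = split_into_blocks(200 * 1024 * 1024)
```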
Each HDFS block receives a unique ID.
The HDFS blocks get distributed and replicated over the cluster. Each HDFS block gets replicated to at least three different datanodes (DN1, ..., DNn in this example).
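A toy version of replica placement looks like this. Real HDFS is rack-aware and considers node health; this sketch only guarantees that the three copies land on distinct datanodes:

```python
import itertools

def place_replicas(block_ids, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes (round-robin)."""
    placement = {}
    ring = itertools.cycle(range(len(datanodes)))
    for block in block_ids:
        start = next(ring)
        placement[block] = [datanodes[(start + i) % len(datanodes)]
                            for i in range(replication)]
    return placement

placement = place_replicas([1, 2, 3], ["DN1", "DN2", "DN3", "DN4", "DN5"])
```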
HDFS does this for every HDFS block of the input file.
Eventually all HDFS blocks have been sent to the datanodes.
We gain nice failover properties: even if two datanodes go offline, we still have one copy of the block.
Failover
[figure: the cluster with two datanodes offline; one copy of each block is still available on the remaining datanodes]
Assume that we want to retrieve HDFS block 3. We lost the copies on DN2 and DN5. However, we can still retrieve a copy of block 3 from DN6. Notice that once datanodes go offline, HDFS (should) copy blocks to other nodes to get back to having three copies of each block again.
Load Balancing
Another advantage of having three copies for each block is load balancing. Whenever a user or an application asks for a particular block, we have three options for retrieving that block. The decision which datanode to use may be made based on network locality, network congestion, and the current load of the datanodes.
I would like to have block 4!
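The load-balancing decision just described can be sketched like this. The load numbers and the replica holders are made-up examples; real HDFS also weighs network locality and congestion:

```python
def pick_datanode(replica_holders, current_load):
    # Choose the replica holder with the lowest current load.
    return min(replica_holders, key=lambda dn: current_load[dn])

# Hypothetical situation: block 4 is stored on DN3, DN5 and DN6.
holders_of_block4 = ["DN3", "DN5", "DN6"]
load = {"DN3": 0.9, "DN5": 0.2, "DN6": 0.6}
chosen = pick_datanode(holders_of_block4, load)
```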
So now we have our data stored in HDFS. What about MapReduce? And by "MapReduce" I mean "Hadoop MapReduce" in the following.
MapReduce is another software layer on top of HDFS. MapReduce consists of three phases.
map(docID, document) -> set of (term, docID)
In the first phase (the Map Phase) only the map function is considered.
Map Phase MapReduce
map(docID, document) -> set of (term, docID)
MapReduce assigns a thread (AKA mapper) to every datanode having data to be processed for this job.
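A locality-aware assignment of mappers to datanodes can be sketched as follows. The placement data and the per-node capacities are hypothetical; Hadoop's real scheduler is considerably more involved:

```python
def assign_mappers(block_locations, node_capacity):
    """block_locations: block id -> datanodes holding a replica of it."""
    assignment = {}
    running = {dn: 0 for dn in node_capacity}
    for block, holders in block_locations.items():
        # Prefer a replica holder, least loaded first (data locality).
        candidates = [dn for dn in holders if running[dn] < node_capacity[dn]]
        target = min(candidates or list(node_capacity),
                     key=lambda dn: running[dn])
        assignment[block] = target
        running[target] += 1
    return assignment

assignment = assign_mappers(
    {1: ["DN1", "DN4"], 2: ["DN2", "DN3"], 3: ["DN2", "DN6"]},
    {"DN1": 2, "DN2": 2, "DN3": 2, "DN4": 2, "DN6": 2},
)
```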
Map Phase MapReduce
map(docID, document) -> set of (term, docID)
Each mapper reads one of the HDFS blocks...
Map Phase MapReduce
map(docID, document) -> set of (term, docID)
Bob's Perspective
...and breaks that HDFS block into records. This can be customized with the RecordReader. For each record map() is called. The output (also called intermediate results) is collected on the local disks of the datanodes. For instance, for block 6 the output is collected in file 6'.
This is done for every block of the input file. Obviously we do not have to do this with every copy of an HDFS block. Processing one copy is enough. With this the Map Phase is finished.
Now, the Shuffle Phase starts.
Shuffle Phase MapReduce
group by term
In the Shuffle Phase all intermediate results are redistributed over the different datanodes. In this example we want to redistribute the intermediate results based on the term, i.e. we want to group all intermediate results by term.
Shuffle Phase MapReduce
group by term
This means, after shuffling, we obtain a range partitioning on terms....
Shuffle Phase MapReduce
group by term
For instance, DN1 contains all intermediate results having terms starting with A or B. In turn, DN2 has only terms starting with C or D and so forth. Once all data has been redistributed, the Shuffle Phase is finished.
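The range partitioning from this example can be sketched in a few lines. The letter ranges (A-B on DN1, C-D on DN2, ...) mirror the slide, but the exact boundaries and the handling of terms after "L" are assumptions:

```python
import bisect

# Upper bound (inclusive) of each range and the node responsible for it.
bounds = ["B", "D", "F", "H", "J", "L"]
nodes = ["DN1", "DN2", "DN3", "DN4", "DN5", "DN6"]

def route(term):
    # Find the first range whose upper bound covers the term's first letter.
    i = bisect.bisect_left(bounds, term[0].upper())
    return nodes[min(i, len(nodes) - 1)]  # terms after "L" land on the last node

targets = [route(t) for t in ("boring", "cheetah", "website")]
```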
Shuffle Phase MapReduce
group by term
Reduce Phase
In the final Reduce Phase, MapReduce assigns threads to the different datanodes having intermediate results. These threads are termed reducers. The reducers read the intermediate results and for each distinct key they call reduce(). Notice that reduce() may only be called once for each distinct key on the entire cluster. Otherwise the semantics of the map()/reduce() paradigm would be broken.
reduce(term, set of docID) -> set of (term, (posting list of docID, count))
The output of the reduce()-calls is stored on disk again. In this example the output is visualized as being stored on local disk. However, Hadoop stores the output on HDFS by default, i.e. the output of the Reduce Phase gets replicated by HDFS again.
Reduce Phase reduce(term, set of docID) -> set of (term, (posting list of docID, count))
When to replicate which data for fault tolerance in MapReduce is an interesting discussion. See our paper RAFT for more details: http://infosys.uni-saarland.de/publications/QPSD11.pdf
Hadoop MapReduce Advantages
failover, scalability, schema-later, ease of use
Hadoop MapReduce Disadvantages
And this and how to fix it is what the following material is about.
Performance
MapReduce Intro
Hadoop++
HAIL
Second Part
Hadoop++
Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jörg Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). VLDB 2010/PVLDB, Singapore. http://infosys.cs.uni-saarland.de/publications/DQJ+10CRv2.pdf slides: http://infosys.cs.uni-saarland.de/publications/DQJ+10talk.pdf
[VLDB 2010a]
The “Map Reduce Plan“
[figure: The Hadoop Plan: Hadoop's processing pipeline expressed as a physical query execution plan, with subplans for the Data Load Phase (partition, replicate, fetch, store), the Map Phase (scan, RecRead/itemize, MMap/map, partition, sort, group, combine), the Shuffle Phase (fetch, buffer, merge), and the Reduce Phase (merge, group, MMap/reduce, store)]
The "MapReduce Plan"
The figure shows an example with 4 mappers and 2 reducers. Hadoop MapReduce uses a hard-coded pipeline. This pipeline cannot be changed. This is in sharp contrast to database systems, which may use different pipelines for different queries.
The pipeline contains these user-defined functions: block, split, itemize, mem, map, sh, cmp, grp, combine, reduce.
M determines the number of mapper subplans, whereas R determines the number of reducer subplans. Let's analyze The Hadoop Plan in more detail. Data Load Phase: To be able to run a MapReduce job, we first load the data into the distributed file system. This is done by partitioning the input T horizontally into disjoint subsets T1, ..., Tb. See the physical partitioning operator PPart in subplan L. In the example b = 6, i.e. we obtain subsets T1, ..., T6. These subsets are called blocks. The partitioning function block partitions the input T based on the block size. Each block is then replicated (Replicate). The default number of replicas used by Hadoop is 3, but this may be configured. For presentation reasons, in the example we replicate each block only once. The figure shows 4 different data nodes with subplans H1-H4. Replicas are stored on different nodes in the network (Fetch and Store). Hadoop tries to store replicas of the same block on different nodes. Map Phase: In the map phase each map subplan M1-M4 reads a subset of the data called a split from HDFS. A split is a logical concept typically comprising one or more blocks. This assignment is defined by UDF split. In the example, the split assigned to M1 consists of two blocks which may both be retrieved from subplan H1. Subplan M1 unions the input blocks T1 and T5 and breaks them into records (RecRead). The latter operator uses a UDF itemize that defines how a split is divided into items. Then subplan M1 calls map on each item and passes the output to a PPart operator. This operator divides the output into so-called spills based
However, Hadoop MapReduce uses 10 user-defined functions (UDFs). These UDFs can be used to inject arbitrary code into Hadoop...
...including code that was not intended to be injected into Hadoop. Our idea is somewhat similar to a trojan horse, i.e. a virus that is injected into a computer system to harm or destroy the system. However, we inject trojans to improve or heal the system. Therefore we are calling them...
Not to be confused with spills. See below. We use the terminology introduced in [10].
http://www.istockphoto.com/stock-photo-1824642trojan-horse.php?st=0bab152
Good Trojans!
results taken from our VLDB 2010 paper
[chart: Selection Task, runtime in seconds on 10, 50, and 100 nodes for Hadoop, HadoopDB, HadoopDB Chunks, Hadoop++(256MB), and Hadoop++(1GB)]
Hadoop++ is in the same ballpark or even faster than HadoopDB (now spun off as Hadapt).
Hadoop++ is up to a factor of 18 faster than Hadoop.
[chart: Join Task, runtime in seconds on 10, 50, and 100 nodes for Hadoop, HadoopDB, Hadoop++(256MB), and Hadoop++(1GB)]
...even though we do not modify the underlying HDFS and Hadoop MapReduce source code at all! Our improvements are all done through UDFs only.
But wait: UDFs are also available in traditional database systems. What happens if we exploit those UDFs to inject "better technology" into an existing database system? Say we inject column store technology into a commercial, closed-source row store. What would that look like?
Good Trojans in a DBMS?
[chart: query times in seconds for queries Q1, Q6, Q12, Q14: Standard Row (DBMS-Y, DBMS-Z (b)) versus Trojan Columns (DBMS-Z (a)), with Vectorwise and Vertica numbers for comparison]
[CIDR 2013a]
It looks like this. For details see our paper: Alekh Jindal, Felix Martin Schuhknecht, Jens Dittrich, Karen Khachatryan, Alexander Bunte. How Achaeans Would Construct Columns in Troy. CIDR 2013, Asilomar, USA. http://infosys.uni-saarland.de/publications/How%20Achaeans%20Would%20Construct%20Columns%20in%20Troy.pdf
You can get much faster than a row store. We are not as fast as a from-scratch implementation of a column store. This has to do with the different query processing technology used. However, for some customers the performance of a native column store might not even be required, especially for medium-sized datasets.
Good Trojans in a Closed Source Row Store DBMS
[chart: query times in seconds for further queries (Q19, Q2, Q4, Q8, Q15) over part, lineitem, and order: Standard Row versus Materialized View versus Trojan Columns]
Why buy a car with 1000hp if 200hp are just enough?
An interesting result from our work is also that we can beat materialized views for some queries.
[CIDR 2013a]
With this let‘s close this footnote and go back to Hadoop MapReduce performance. UDFs in Hadoop allow us to boost query performance without changing the underlying system. However, what are the...
...Good Trojans in a DBMS!
Problems: Upload Times
[chart: upload runtime in seconds on 10, 50, and 100 nodes for Hadoop, HadoopDB, Hadoop++(256MB), and Hadoop++(1GB), broken down into (I)ndex Creation, (C)o-Partitioning, and Data (L)oading]
from the Hadoop++ paper: The problem is that in order to have fast queries we first have to "massage" the data beforehand, i.e. create indexes, co-partition, and so forth. This takes time. Hadoop does not have to spend this time and therefore uploading the data to HDFS is fast. In contrast, for Hadoop++ (but also HadoopDB) we have to do a lot of extra work. This extra work is very costly. So costly that only after running many queries are these investments amortized. In other words, if we only want to run a few queries exploiting our indexes and co-partitioning, we shouldn't use Hadoop++ in the first place but rather run the queries directly on Hadoop! How could we fix this problem?
=> back to scanning? => index selection algos? => coarse-granular indexes
We could drop the idea of using indexes: just scan everything. Well, we are not gonna follow this approach. We could invest into better index selection algorithms. If we pick the wrong index, index creation is unlikely to be amortized; therefore making the right choice is important. Well, we are not gonna follow this approach either. Or: as it is expensive to create all these indexes, we could investigate coarse-granular indexes, i.e. indexes that are cheaper to construct and yet give some benefit at query time. Well, we are not gonna follow this approach either.
We do something different. Which brings me to...
MapReduce Intro
Hadoop++
HAIL
... the third part of my talk. The approach I would like to present is coined HAIL.
HAIL
[SOCC 2011] [VLDB 2012a] [int‘ patent]
Hadoop Aggressive Indexing Library
HAIL means Hadoop Aggressive Indexing Library. For details see our paper: Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad. Only Aggressive Elephants are Fast Elephants. VLDB 2012/ PVLDB, Istanbul, Turkey. http://infosys.uni-saarland.de/ publications/HAIL.pdf A predecessor of this work focussing on data layouts in HDFS is our paper: Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, Jens Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. ACM SOCC 2011, Cascais, Portugal. http:// infosys.uni-saarland.de/publications/JQD11.pdf
So back to Bob again. Recall that Bob wants to analyze a large file with Hadoop MapReduce. So he first has to upload his file to HDFS. In our approach we replace HDFS with HAIL. HAIL is an extension of HDFS.
As before, Bob's file gets partitioned into HDFS blocks...
those blocks are relatively large, at least 64MB
then those blocks get distributed to the different datanodes (just as above)
the HDFS blocks also get replicated (just as above), but then, before writing the data to the local disks on the different datanodes, we do something in addition:
we sort the data on each HDFS block in main memory. Each replica is sorted using a different sort criterion. This means that after sorting, each HDFS block is available in three different sort orders, roughly corresponding to three different clustered indexes. Notice that we do not redistribute data across HDFS blocks! Data that was on one particular block in standard HDFS will sit on the same HDFS block in HAIL. In other words: the different copies of the block contain the same data, yet in different sort orders.
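HAIL's core idea can be sketched in a few lines: keep the same block in three copies, each sorted by a different attribute. The record layout and attribute names below are illustrative, not HAIL's actual format:

```python
records = [  # (id, age, salary) rows of one HDFS block
    (3, 45, 900), (1, 30, 1200), (2, 52, 700),
]

sort_keys = {            # one sort criterion per replica
    "replica1": lambda r: r[0],  # sorted by id
    "replica2": lambda r: r[1],  # sorted by age
    "replica3": lambda r: r[2],  # sorted by salary
}

replicas = {name: sorted(records, key=key) for name, key in sort_keys.items()}
# Every replica contains exactly the same records, only the order differs.
```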
Again, we do this for each and every copy of a block.
Notice that this is done without introducing additional I/O. We fully piggyback on the existing HDFS upload pipeline.
Eventually uploading (and indexing) is finished. What does this mean for HDFS failover?
Failover
Well actually, nothing changes. All data sits on the same HDFS blocks as before. For instance, if we lose the copies on DN2 and DN5, we can still retrieve block 3 from DN6. That block might not be sorted along the desired sort criterion, but it contains all the data. And we can use the remaining block to recreate additional copies in other sort orders.
Well, it is all somewhat more complex than explained in the previous example.
A Little More Detail
HAIL Upload Pipeline
[Diagram: Bob uploads a file through the HAIL client. The client preprocesses the input, converts each block into a binary PAX block with block metadata, and forwards it in packets (PCK) through the datanode pipeline. Each datanode reassembles the packets, builds its own clustered index and index metadata, appends the result as a HAIL block, acknowledges the packets (ACK), and registers its replica. The HDFS namenode's block directory is complemented by a HAIL replica directory.]
HAIL Query Pipeline
Bob's perspective: Bob writes a MapReduce job and annotates his map function with the filter and projection conditions:

  @HailQuery(
      filter="@3 between(1999-01-01, 2000-01-01)",
      projection={@1})
  void map(Text k, HailRecord v) {
    output(v.getInt(1), null);
  }

System's perspective: in the split phase, the job client selects for each input block a replica whose clustered index matches the filter attribute:

  for each block i in input {
    locations = block i.getHostWithIndex(@3);
    splitBuilder.add(locations, block i);
  }
  splits[] = splitBuilder.result;

The scheduler then allocates each split to the closest datanode storing the corresponding block:

  for each split i in splits {
    allocate split i to closest datanode storing block i
  }

On the chosen node, the record reader performs an index scan, post-filters the records, and invokes map(HailRecord) for each qualifying record. The map output is stored back to HDFS.
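The record reader's two steps - index scan, then post-filtering - can be sketched in a few lines of Java. This is a hedged toy version (class and method names are made up; a real HAIL block is binary PAX, not an int matrix): it assumes a block whose rows are sorted on one attribute, finds the qualifying range by binary search, and post-filters on a second attribute.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexScanSketch {

    // rows are sorted on column sortCol. Return rows with
    // lo <= row[sortCol] <= hi (index scan via binary search),
    // keeping only rows whose filterCol equals filterVal (post-filter).
    public static List<int[]> scan(int[][] rows, int sortCol,
                                   int lo, int hi,
                                   int filterCol, int filterVal) {
        int start = lowerBound(rows, sortCol, lo);
        List<int[]> out = new ArrayList<>();
        for (int i = start; i < rows.length && rows[i][sortCol] <= hi; i++) {
            if (rows[i][filterCol] == filterVal) {
                out.add(rows[i]); // survives the post-filter
            }
        }
        return out;
    }

    // Smallest index whose value in column col is >= key.
    private static int lowerBound(int[][] rows, int col, int key) {
        int l = 0, r = rows.length;
        while (l < r) {
            int m = (l + r) >>> 1;
            if (rows[m][col] < key) l = m + 1; else r = m;
        }
        return l;
    }
}
```

Because the scan touches only the qualifying range instead of the whole block, its cost is in the order of milliseconds - which is exactly why scheduling overhead becomes the dominant factor later on.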
Experiments
We play some other tricks. For instance, while reading the input file, we immediately parse the data into a binary PAX (column-like) layout. The PAX data is then sent to the different datanodes. We also had to make sure not to break the data consistency checks used by HDFS (packet acknowledgements). In addition, we extended the namenode to record additional information on the sort orders and layouts used for the different copies of an HDFS block. The latter is needed at query time. None of these changes affects the principal properties of HDFS. We just extend and piggyback.
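A minimal sketch of the row-to-PAX conversion idea (purely illustrative; real HAIL blocks are binary-encoded and carry block and index metadata): turning rows into per-column arrays is essentially a transposition.

```java
public class PaxConvert {

    // Transpose an array of rows into a column-oriented (PAX-like) layout:
    // pax[c] holds all values of attribute c, contiguous in memory.
    public static int[][] toPax(int[][] rows) {
        if (rows.length == 0) return new int[0][];
        int cols = rows[0].length;
        int[][] pax = new int[cols][rows.length];
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < cols; c++) {
                pax[c][r] = rows[r][c];
            }
        }
        return pax;
    }
}
```

Storing each attribute contiguously is what later lets the record reader project only the requested attributes without touching the rest of the block.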
At query time HAIL needs to know how to filter input records and which attributes to project into intermediate result tuples. This can be solved in many ways. In our current implementation we allow users to annotate their map function with the filter and projection conditions. However, this could also be done fully transparently by using static code analysis as shown in: Michael J. Cafarella, Christopher Ré: Manimal: Relational Optimization for Data-Intensive Programs. WebDB 2010. That code analysis could be used directly with HAIL. Another option, if the map()/reduce() functions are not produced by a user, is to adjust the application. Possible “applications“ might be Pig, Hive, Impala, or any other system producing map()/reduce() programs as its output. In addition, any other application not relying on MapReduce but just on HDFS might use our system, i.e. applications working directly with HDFS files.
What happens if we upload a file to HDFS?
Upload Times
[Chart: upload time [sec] vs. number of created indexes (0 to 3), all with 3 replicas - Hadoop: 1132; Hadoop++: 3472 (no index) up to 5766; HAIL: 671, 704, 712, 717.]
Let‘s start with the case that no index is created by any of the systems. Hadoop takes about 1132 seconds.
Hadoop++ is considerably slower. Even though we switch off index creation here, Hadoop++ runs an extra job to convert the input data to binary. This takes a while. What about HAIL?
HAIL is faster than Hadoop HDFS. How can this be? We are doing more work than Hadoop HDFS, right? For instance, we convert the input file to binary PAX directly during upload. This can only be slower than Hadoop HDFS, NOT faster. Well, when converting to a binary layout, it turns out that the binary representation of this dataset is smaller than the textual representation. Therefore we have to write less data and SAVE I/O. Therefore HAIL is faster for this dataset. This does not necessarily hold for all datasets. Notice that for this and the following experiments we are not using compression yet. We expect compression to be even more beneficial for HAIL.
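A tiny worked example of why binary can beat text, assuming 32-bit integer attributes (the concrete values are hypothetical, not from the benchmark dataset): a decimal text field costs one byte per digit plus a delimiter, while a binary int is a fixed 4 bytes.

```java
import java.nio.charset.StandardCharsets;

public class SizeDemo {

    // Size of an int value in a textual, delimiter-separated file.
    public static int textBytes(int v) {
        return Integer.toString(v).getBytes(StandardCharsets.UTF_8).length + 1; // +1 delimiter
    }

    // Size of the same value in a binary layout: always 4 bytes,
    // regardless of its magnitude.
    public static int binaryBytes(int v) {
        return Integer.BYTES;
    }
}
```

For example, 1234567 costs 8 bytes as text plus delimiter but only 4 bytes in binary - writing half the data means saving half the I/O for that field.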
So what happens if we start creating indexes? We should then feel the additional index creation effort in HAIL.
For Hadoop++ we observe long runtimes. For HAIL, in contrast to what we expected, we observe only a small increase in the upload time.
The same observation holds when creating two clustered indexes with HAIL...
...or three. This is because standard file upload in HDFS is I/O-bound. The CPUs are mostly idle. HAIL simply exploits the CPU ticks that would otherwise be idling. Therefore the additional effort for indexing is hardly noticeable.
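The effect can be captured with a toy cost model (the numbers in the example are made up for illustration): when index creation runs on otherwise-idle CPUs while the I/O-bound transfer is in flight, the total time is roughly the maximum of the two costs, not their sum.

```java
public class OverlapModel {

    // Naive schedule: first transfer the data, then build the index.
    public static double sequential(double ioSec, double cpuSec) {
        return ioSec + cpuSec;
    }

    // Overlapped schedule: sorting/indexing happens on idle CPU cores
    // while the disk/network transfer runs, so the slower of the two
    // resources dictates the total time.
    public static double overlapped(double ioSec, double cpuSec) {
        return Math.max(ioSec, cpuSec);
    }
}
```

As long as the CPU cost of index creation stays below the I/O cost of the upload, the overlapped schedule makes indexing effectively free.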
Disk space is cheap. In some situations it could be affordable to store more than three copies of an HDFS block. What would be the impact on the upload times? The next experiment shows the results...
Here we create up to 10 replicas - corresponding to 10 different clustered indexes.
Replica Scalability
[Chart: upload time [sec] vs. number of created replicas (3 (default), 5, 6, 7, 10), with one clustered index per replica; a horizontal line marks the Hadoop upload time with 3 replicas.]
We observe that in the time HDFS needs to upload the data without creating any index, HAIL uploads the data, converts it to binary PAX, and creates six different clustered indexes.
Scale-Out
[Chart: upload time [sec] for Hadoop vs. HAIL on EC2 clusters of 10, 50, and 100 nodes, for two datasets (Syn and UV).]
modulo variance, see [VLDB 2010b]
We also evaluated upload times on the cloud using EC2 nodes. Notice that experiments on the cloud are somewhat problematic due to the high runtime variance in those environments. For details see our paper: Jörg Schad, Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz: Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. VLDB 2010/PVLDB, Singapore. http://infosys.cs.uni-saarland.de/publications/SDQ10.pdf slides: http://infosys.cs.uni-saarland.de/publications/SDQ10talk.pdf
What about query times?
Query Times
Individual Jobs: Weblog, RecordReader
[Chart: RecordReader runtime [ms] for five MapReduce jobs (Bob-Q1 to Bob-Q5) - Hadoop: roughly 2100 to 3400 ms in every job (full scan); Hadoop++ and HAIL: as low as 52 to 83 ms where an index applies.]
Here we display the RecordReader times. They correspond roughly to the data access time; the time for further processing, which is equal in all systems, is factored out. We observe that HAIL improves query runtimes dramatically. Hadoop resorts to a full scan in all cases. Hadoop++ can benefit from its index only if the query happens to hit the right filter condition. In contrast, HAIL supports many more filter conditions. What does this mean for the overall job runtimes?
Individual Jobs: Weblog, Job
[Chart: job runtime [sec] for Bob-Q1 to Bob-Q5 - Hadoop: roughly 942 to 1160; Hadoop++: 598 to 1143; HAIL: 598 to 651.]
Well, those results are not so great. The benefits of HAIL over the other approaches seem marginal. How come? It has to do with...
Scheduling Overhead
[Chart: the same job runtimes for Bob-Q1 to Bob-Q5, now with the fraction caused by Hadoop's scheduling overhead highlighted for Hadoop, Hadoop++, and HAIL.]
...the Hadoop scheduling overhead. Hadoop was designed with long-running tasks in mind. Accessing indexes in HAIL, however, takes on the order of milliseconds. These milliseconds of index access are overshadowed by scheduling latencies. You can try this out with a simple experiment: write a MapReduce job that does not read any input, does not do anything, and does not produce any output. It takes about 7 seconds - for doing nothing. How could we fix this problem? By...
...introducing “HAIL scheduling“. But let‘s first look back at standard Hadoop scheduling:
HAIL Scheduling
Hadoop Scheduling
[Diagram: on datanode DN1, seven HDFS blocks in the same sort order are each processed by their own map task - 7 map tasks (aka waves), hence 7 times the scheduling overhead.]
In Hadoop, each HDFS block will be processed by a different map task. This leads to waves of map tasks, each having a certain overhead. However, the assignment of HDFS blocks to map tasks is not fixed. Therefore, ...
HAIL Scheduling
[Diagram: the same blocks on DN1 are combined into a single HAIL split and processed by one map task (one wave) - the scheduling overhead is paid only once.]
...in HAIL Scheduling we assign all HDFS blocks that need to be processed on a datanode to a single map task. This is achieved by defining appropriate “splits“; see our paper for details. The overall effect is that we only have to pay the scheduling overhead once rather than 7 times (in this example). Notice that in case of failover we can simply reschedule index access tasks - they are fast anyway. Additionally, we could combine this with the recovery techniques from RAFT, ICDE 2011.
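The grouping itself can be sketched in a few lines of Java (class and method names are assumptions; the real split builder additionally considers which replica of each block carries the matching index and where it is located): all blocks assigned to the same datanode end up in one split, and thus in one map task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HailSplitSketch {

    // blockLocations maps block id -> datanode chosen to process it.
    // The result maps each datanode to the single split (list of block
    // ids) its one map task will work through, so the per-task
    // scheduling overhead is paid once per node, not once per block.
    public static Map<String, List<Integer>> buildSplits(
            Map<Integer, String> blockLocations) {
        Map<String, List<Integer>> splits = new TreeMap<>();
        for (Map.Entry<Integer, String> e : blockLocations.entrySet()) {
            splits.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        return splits;
    }
}
```

With 7 blocks on one datanode, this yields one split of 7 blocks instead of 7 single-block splits - one wave instead of seven.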
What are the end-to-end query runtimes with HAIL Scheduling?
Query Times with HAIL Scheduling
Now the good RecordReader times seen above translate to (very) good query times.
Individual Jobs: Weblog
[Chart: job runtime [sec] for Bob-Q1 to Bob-Q5 with HAIL Scheduling - Hadoop: roughly 942 to 1160; Hadoop++: 598 to 1143; HAIL: between 15 and 65 seconds.]
What about failover?
Failover
[Chart: job runtime [sec] on 10 nodes with one node killed - Hadoop: 1099 -> 1212 (10.3% slowdown); HAIL: 598 -> 661 (10.5% slowdown); HAIL-1Idx: 598 -> 631 (5.5% slowdown).]
We use two configurations for HAIL. First, we configure HAIL to create indexes on three different attributes, one for each replica. Second, we use a variant of HAIL, coined HAIL-1Idx, where we create an index on the same attribute for all three replicas. We do so to measure the performance impact of HAIL falling back to full scan for some blocks after the node failure. This happens for any map task reading its input from the killed node. Notice that, in the case of HAIL-1Idx, all map tasks will still perform an index scan as all blocks have the same index. Overall result: HAIL inherits Hadoop MapReduce‘s failover properties.
Summary(Summary(...)) of this talk
What I tried to explain to you in my talk is:
BigData => MapReduce Intro, Hadoop++, HAIL
Hadoop MapReduce is THE engine for big data analytics.
By using good trojans you can improve system performance dramatically AND after the fact --- even for closed-source systems.
HAIL allows you to have fast index creation AND fast query processing at the same time.
HAIL: fast indexing AND fast querying.
Hadoop++ project page: http://infosys.uni-saarland.de/hadoop++.php
Annotated slides are available on that page.
Copyright of all slides: Jens Dittrich 2012
infosys.uni-saarland.de