Hadoop: The Definitive Guide

Author: Tyler Garrison
THIRD EDITION


Hadoop: The Definitive Guide

Tom White

O'REILLY® Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Table of Contents

Foreword
Preface

1. Meet Hadoop
    Data!
    Data Storage and Analysis
    Comparison with Other Systems
    Relational Database Management System
    Grid Computing
    Volunteer Computing
    A Brief History of Hadoop
    Apache Hadoop and the Hadoop Ecosystem
    Hadoop Releases
    What's Covered in This Book
    Compatibility

2. MapReduce
    A Weather Dataset
    Data Format
    Analyzing the Data with Unix Tools
    Analyzing the Data with Hadoop
    Map and Reduce
    Java MapReduce
    Scaling Out
    Data Flow
    Combiner Functions
    Running a Distributed MapReduce Job
    Hadoop Streaming
    Ruby
    Python
    Hadoop Pipes
    Compiling and Running

3. The Hadoop Distributed Filesystem
    The Design of HDFS
    HDFS Concepts
    Blocks
    Namenodes and Datanodes
    HDFS Federation
    HDFS High-Availability
    The Command-Line Interface
    Basic Filesystem Operations
    Hadoop Filesystems
    Interfaces
    The Java Interface
    Reading Data from a Hadoop URL
    Reading Data Using the FileSystem API
    Writing Data
    Directories
    Querying the Filesystem
    Deleting Data
    Data Flow
    Anatomy of a File Read
    Anatomy of a File Write
    Coherency Model
    Data Ingest with Flume and Sqoop
    Parallel Copying with distcp
    Keeping an HDFS Cluster Balanced
    Hadoop Archives
    Using Hadoop Archives
    Limitations

4. Hadoop I/O
    Data Integrity
    Data Integrity in HDFS
    LocalFileSystem
    ChecksumFileSystem
    Compression
    Codecs
    Compression and Input Splits
    Using Compression in MapReduce
    Serialization
    The Writable Interface
    Writable Classes
    Implementing a Custom Writable
    Serialization Frameworks
    Avro
    Avro Data Types and Schemas
    In-Memory Serialization and Deserialization
    Avro Datafiles
    Interoperability
    Schema Resolution
    Sort Order
    Avro MapReduce
    Sorting Using Avro MapReduce
    Avro MapReduce in Other Languages
    File-Based Data Structures
    SequenceFile
    MapFile

5. Developing a MapReduce Application
    The Configuration API
    Combining Resources
    Variable Expansion
    Setting Up the Development Environment
    Managing Configuration
    GenericOptionsParser, Tool, and ToolRunner
    Writing a Unit Test with MRUnit
    Mapper
    Reducer
    Running Locally on Test Data
    Running a Job in a Local Job Runner
    Testing the Driver
    Running on a Cluster
    Packaging a Job
    Launching a Job
    The MapReduce Web UI
    Retrieving the Results
    Debugging a Job
    Hadoop Logs
    Remote Debugging
    Tuning a Job
    Profiling Tasks
    MapReduce Workflows
    Decomposing a Problem into MapReduce Jobs
    JobControl
    Apache Oozie

6. How MapReduce Works
    Anatomy of a MapReduce Job Run
    Classic MapReduce (MapReduce 1)
    YARN (MapReduce 2)
    Failures
    Failures in Classic MapReduce
    Failures in YARN
    Job Scheduling
    The Fair Scheduler
    The Capacity Scheduler
    Shuffle and Sort
    The Map Side
    The Reduce Side
    Configuration Tuning
    Task Execution
    The Task Execution Environment
    Speculative Execution
    Output Committers
    Task JVM Reuse
    Skipping Bad Records

7. MapReduce Types and Formats
    MapReduce Types
    The Default MapReduce Job
    Input Formats
    Input Splits and Records
    Text Input
    Binary Input
    Multiple Inputs
    Database Input (and Output)
    Output Formats
    Text Output
    Binary Output
    Multiple Outputs
    Lazy Output
    Database Output

8. MapReduce Features
    Counters
    Built-in Counters
    User-Defined Java Counters
    User-Defined Streaming Counters
    Sorting
    Preparation
    Partial Sort
    Total Sort
    Secondary Sort
    Joins
    Map-Side Joins
    Reduce-Side Joins
    Side Data Distribution
    Using the Job Configuration
    Distributed Cache
    MapReduce Library Classes

9. Setting Up a Hadoop Cluster
    Cluster Specification
    Network Topology
    Cluster Setup and Installation
    Installing Java
    Creating a Hadoop User
    Installing Hadoop
    Testing the Installation
    SSH Configuration
    Hadoop Configuration
    Configuration Management
    Environment Settings
    Important Hadoop Daemon Properties
    Hadoop Daemon Addresses and Ports
    Other Hadoop Properties
    User Account Creation
    YARN Configuration
    Important YARN Daemon Properties
    YARN Daemon Addresses and Ports
    Security
    Kerberos and Hadoop
    Delegation Tokens
    Other Security Enhancements
    Benchmarking a Hadoop Cluster
    Hadoop Benchmarks
    User Jobs
    Hadoop in the Cloud
    Apache Whirr

10. Administering Hadoop
    HDFS
    Persistent Data Structures
    Safe Mode
    Audit Logging
    Tools
    Monitoring
    Logging
    Metrics
    Java Management Extensions
    Maintenance
    Routine Administration Procedures
    Commissioning and Decommissioning Nodes
    Upgrades

11. Pig
    Installing and Running Pig
    Execution Types
    Running Pig Programs
    Grunt
    Pig Latin Editors
    An Example
    Generating Examples
    Comparison with Databases
    Pig Latin
    Structure
    Statements
    Expressions
    Types
    Schemas
    Functions
    Macros
    User-Defined Functions
    A Filter UDF
    An Eval UDF
    A Load UDF
    Data Processing Operators
    Loading and Storing Data
    Filtering Data
    Grouping and Joining Data
    Sorting Data
    Combining and Splitting Data
    Pig in Practice
    Parallelism
    Parameter Substitution

12. Hive
    Installing Hive
    The Hive Shell
    An Example
    Running Hive
    Configuring Hive
    Hive Services
    The Metastore
    Comparison with Traditional Databases
    Schema on Read Versus Schema on Write
    Updates, Transactions, and Indexes
    HiveQL
    Data Types
    Operators and Functions
    Tables
    Managed Tables and External Tables
    Partitions and Buckets
    Storage Formats
    Importing Data
    Altering Tables
    Dropping Tables
    Querying Data
    Sorting and Aggregating
    MapReduce Scripts
    Joins
    Subqueries
    Views
    User-Defined Functions
    Writing a UDF
    Writing a UDAF

13. HBase
    HBasics
    Backdrop
    Concepts
    Whirlwind Tour of the Data Model
    Implementation
    Installation
    Test Drive
    Clients
    Java
    Avro, REST, and Thrift
    Example
    Schemas
    Loading Data
    Web Queries
    HBase Versus RDBMS
    Successful Service
    HBase
    Use Case: HBase at Streamy.com
    Praxis
    Versions
    HDFS
    UI
    Metrics
    Schema Design
    Counters
    Bulk Load

14. ZooKeeper
    Installing and Running ZooKeeper
    An Example
    Group Membership in ZooKeeper
    Creating the Group
    Joining a Group
    Listing Members in a Group
    Deleting a Group
    The ZooKeeper Service
    Data Model
    Operations
    Implementation
    Consistency
    Sessions
    States
    Building Applications with ZooKeeper
    A Configuration Service
    The Resilient ZooKeeper Application
    A Lock Service
    More Distributed Data Structures and Protocols
    ZooKeeper in Production
    Resilience and Performance
    Configuration

15. Sqoop
    Getting Sqoop
    Sqoop Connectors
    A Sample Import
    Text and Binary File Formats
    Generated Code
    Additional Serialization Systems
    Imports: A Deeper Look
    Controlling the Import
    Imports and Consistency
    Direct-mode Imports
    Working with Imported Data
    Imported Data and Hive
    Importing Large Objects
    Performing an Export
    Exports: A Deeper Look
    Exports and Transactionality
    Exports and SequenceFiles

16. Case Studies
    Hadoop Usage at Last.fm
    Last.fm: The Social Music Revolution
    Hadoop at Last.fm
    Generating Charts with Hadoop
    The Track Statistics Program
    Summary
    Hadoop and Hive at Facebook
    Hadoop at Facebook
    Hypothetical Use Case Studies
    Hive
    Problems and Future Work
    Nutch Search Engine
    Data Structures
    Selected Examples of Hadoop Data Processing in Nutch
    Summary
    Log Processing at Rackspace
    Requirements/The Problem
    Brief History
    Choosing Hadoop
    Collection and Storage
    MapReduce for Logs
    Cascading
    Fields, Tuples, and Pipes
    Operations
    Taps, Schemes, and Flows
    Cascading in Practice
    Flexibility
    Hadoop and Cascading at ShareThis
    Summary
    TeraByte Sort on Apache Hadoop
    Using Pig and Wukong to Explore Billion-edge Network Graphs
    Measuring Community
    Everybody's Talkin' at Me: The Twitter Reply Graph
    Symmetric Links
    Community Extraction

A. Installing Apache Hadoop

B. Cloudera's Distribution Including Apache Hadoop

C. Preparing the NCDC Weather Data

Index

Hadoop got its s web search engi handful of comr route became cit having with Nut< as a part of Nutc

We managed to ~ to handle the Wt moreover, that t1 Around that timt We split off the d of Yahoo!, Hado In 2006, Tom Wi excellent article l in clear prose. I ~ to read as his pre From the beginn: for the project. U in tweaking the s anyone to use. Initially, Tom sg ices. Then he mfj MapReduce API work. In all case role of Hadoop c· Management Co Tom is now are~ he's an expert in easier to use and
