Foreword .... xv
Preface .... xvii
1. Meet Hadoop .... 1
    Data! .... 1
    Data Storage and Analysis .... 3
    Comparison with Other Systems .... 4
    Relational Database Management System .... 4
    Grid Computing .... 6
    Volunteer Computing .... 8
    A Brief History of Hadoop .... 9
    Apache Hadoop and the Hadoop Ecosystem .... 12
    Hadoop Releases .... 13
    What's Covered in This Book .... 15
    Compatibility .... 15
2. MapReduce .... 17
    A Weather Dataset .... 17
    Data Format .... 17
    Analyzing the Data with Unix Tools .... 19
    Analyzing the Data with Hadoop .... 20
    Map and Reduce .... 20
    Java MapReduce .... 22
    Scaling Out .... 30
    Data Flow .... 30
    Combiner Functions .... 33
    Running a Distributed MapReduce Job .... 36
    Hadoop Streaming .... 36
    Ruby .... 36
    Python .... 39
    Hadoop Pipes .... 40
    Compiling and Running .... 41
3. The Hadoop Distributed Filesystem .... 43
    The Design of HDFS .... 43
    HDFS Concepts .... 45
    Blocks .... 45
    Namenodes and Datanodes .... 46
    HDFS Federation .... 47
    HDFS High-Availability .... 48
    The Command-Line Interface .... 49
    Basic Filesystem Operations .... 50
    Hadoop Filesystems .... 52
    Interfaces .... 53
    The Java Interface .... 55
    Reading Data from a Hadoop URL .... 55
    Reading Data Using the FileSystem API .... 57
    Writing Data .... 60
    Directories .... 62
    Querying the Filesystem .... 62
    Deleting Data .... 67
    Data Flow .... 67
    Anatomy of a File Read .... 67
    Anatomy of a File Write .... 70
    Coherency Model
    Data Ingest with Flume and Sqoop
    Parallel Copying with distcp
    Keeping an HDFS Cluster Balanced
    Hadoop Archives
    Using Hadoop Archives
    Limitations
    Writable Classes
    Implementing a Custom Writable
    Serialization Frameworks
    Avro
    Avro Data Types and Schemas
    In-Memory Serialization and Deserialization
    Avro Datafiles
    Interoperability
    Schema Resolution
    Sort Order
    Avro MapReduce
    Sorting Using Avro MapReduce
    Avro MapReduce in Other Languages
    File-Based Data Structures
    SequenceFile
    MapFile
5. Developing a MapReduce Application .... 143
    The Configuration API
    Combining Resources
    Variable Expansion
    Setting Up the Development Environment
    Managing Configuration
    GenericOptionsParser, Tool, and ToolRunner
    Writing a Unit Test with MRUnit
    Mapper
    Reducer
    Running Locally on Test Data
    Running a Job in a Local Job Runner
    Testing the Driver
    Running on a Cluster
    Packaging a Job
    Launching a Job
    The MapReduce Web UI
    Retrieving the Results
    Debugging a Job
    Hadoop Logs
    Remote Debugging
    Tuning a Job
    Profiling Tasks
    MapReduce Workflows
    Decomposing a Problem into MapReduce Jobs
    JobControl
6. How MapReduce Works .... 189
    Anatomy of a MapReduce Job Run
    Classic MapReduce (MapReduce 1)
    YARN (MapReduce 2)
    Failures
    Failures in Classic MapReduce
    Failures in YARN
    Job Scheduling
    The Fair Scheduler
    The Capacity Scheduler
    Shuffle and Sort
    The Map Side
    The Reduce Side
    Configuration Tuning
    Task Execution
    The Task Execution Environment
    Speculative Execution
    Output Committers
    Task JVM Reuse
    Skipping Bad Records
    Joins
    Map-Side Joins
    Reduce-Side Joins
    Side Data Distribution
    Using the Job Configuration
    Distributed Cache
    MapReduce Library Classes
9. Setting Up a Hadoop Cluster .... 297
    Cluster Specification .... 297
    Network Topology .... 299
    Cluster Setup and Installation .... 301
    Installing Java .... 302
    Creating a Hadoop User .... 302
    Installing Hadoop .... 302
    Testing the Installation .... 303
    SSH Configuration .... 303
    Hadoop Configuration .... 304
    Configuration Management .... 305
    Environment Settings .... 307
    Important Hadoop Daemon Properties .... 311
    Hadoop Daemon Addresses and Ports
    Other Hadoop Properties
    User Account Creation
    YARN Configuration
    Important YARN Daemon Properties
    YARN Daemon Addresses and Ports
    Security
    Kerberos and Hadoop
    Delegation Tokens
    Other Security Enhancements
    Benchmarking a Hadoop Cluster
    Hadoop Benchmarks
    User Jobs
    Hadoop in the Cloud
    Apache Whirr
11. Pig .... 367
    Installing and Running Pig
    Execution Types
    Running Pig Programs
    Grunt
    Pig Latin Editors
    An Example
    Generating Examples
    Comparison with Databases
    Pig Latin
    Structure
    Statements
    Expressions
    Types
    Schemas
    Functions
    Macros
    User-Defined Functions
    A Filter UDF
    An Eval UDF
    A Load UDF
    Data Processing Operators
    Loading and Storing Data
    Filtering Data
    Grouping and Joining Data
    Sorting Data
    Combining and Splitting Data
    Pig in Practice
    Installing Hive
    The Hive Shell
    An Example
    Running Hive
    Configuring Hive
    Hive Services
    The Metastore
    Comparison with Traditional Databases
    Schema on Read Versus Schema on Write
    Updates, Transactions, and Indexes
    HiveQL
    Data Types
    Operators and Functions
    Tables
    Managed Tables and External Tables
    Partitions and Buckets
    Storage Formats
    Importing Data
    Altering Tables
    Dropping Tables
    Querying Data
    Sorting and Aggregating
    MapReduce Scripts
    Joins
    Subqueries
    Views
    User-Defined Functions
    Writing a UDF
    Writing a UDAF
13. HBase .... 459
    HBasics .... 459
    Backdrop .... 460
    Concepts .... 460
    Whirlwind Tour of the Data Model .... 460
    Implementation .... 461
    Installation .... 464
    Test Drive .... 465
    Clients .... 467
Table of Contents | xi
    Java
    Avro, REST, and Thrift
    Example
    Schemas
    Loading Data
    Web Queries
    HBase Versus RDBMS
    Successful Service
    HBase
    Use Case: HBase at Streamy.com
    Praxis
    Versions
    HDFS
    UI
    Metrics
    Schema Design
    Counters
    Bulk Load
14. ZooKeeper .... 489
    Installing and Running ZooKeeper .... 490
    An Example
    Group Membership in ZooKeeper
    Creating the Group
    Joining a Group
    Listing Members in a Group
    Deleting a Group
    The ZooKeeper Service
    Data Model
    Operations
    Implementation
    Consistency
    Sessions
    States
    Building Applications with ZooKeeper
    A Configuration Service
    The Resilient ZooKeeper Application
    A Lock Service
    More Distributed Data Structures and Protocols
    ZooKeeper in Production
    Resilience and Performance
    Configuration
    Getting Sqoop
    Sqoop Connectors
    A Sample Import
    Text and Binary File Formats
    Generated Code
    Additional Serialization Systems
    Imports: A Deeper Look
    Controlling the Import
    Imports and Consistency
    Direct-mode Imports
    Working with Imported Data
    Imported Data and Hive
    Importing Large Objects
    Performing an Export
    Exports: A Deeper Look
    Exports and Transactionality
    Exports and SequenceFiles
16. Case Studies .... 547
    Hadoop Usage at Last.fm
    Last.fm: The Social Music Revolution
    Hadoop at Last.fm
    Generating Charts with Hadoop
    The Track Statistics Program
    Summary
    Hadoop and Hive at Facebook
    Hadoop at Facebook
    Hypothetical Use Case Studies
    Hive
    Problems and Future Work
    Nutch Search Engine
    Data Structures
    Selected Examples of Hadoop Data Processing in Nutch
    Summary
    Log Processing at Rackspace
    Requirements/The Problem
    Brief History
    Choosing Hadoop
    Collection and Storage
    MapReduce for Logs
    Cascading
    Fields, Tuples, and Pipes
    Operations
    Taps, Schemes, and Flows
    Cascading in Practice
    Flexibility
    Hadoop and Cascading at ShareThis
    Summary
    TeraByte Sort on Apache Hadoop
    Using Pig and Wukong to Explore Billion-edge Network Graphs
    Measuring Community
    Everybody's Talkin' at Me: The Twitter Reply Graph
    Symmetric Links
    Community Extraction
B. Cloudera's Distribution Including Apache Hadoop .... 623
C. Preparing the NCDC Weather Data .... 625
Index .... 629
Hadoop got its s
web search engi
handful of comr
route became cit
having with Nut<
as a part of Nutc

We managed to ~
to handle the Wt
moreover, that t1
Around that timt
We split off the d
of Yahoo!, Hado
In 2006, Tom Wi
excellent article l
in clear prose. I ~
to read as his pre
From the beginn:
for the project. U
in tweaking the s
anyone to use.
Initially, Tom sg
ices. Then he mfj
MapReduce API
work. In all case
role of Hadoop c·
Management Co
Tom is now are~
he's an expert in
easier to use and