Data Engineering with Apache Hadoop Highlights from the Cloudera Engineering Blog TECHNICAL BRIEFING BOOK

Author: Eric Cooper

Table of Contents

What is Data Engineering?

Section I: Real-Time Ingest and Stream Processing
Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop
Designing Fraud-Detection Architecture That Works Like Your Brain Does
How-to: Do Near-Real Time Sessionization with Spark Streaming and Apache Hadoop
Apache Kafka for Beginners

Section II: Data Processing with Apache Spark
How-to: Translate from MapReduce to Apache Spark
How-to: Translate from MapReduce to Apache Spark (Part 2)
How-to: Tune Your Apache Spark Jobs (Part 1)
How-to: Tune Your Apache Spark Jobs (Part 2)


What is Data Engineering?

Data engineering is the process of building analytic data infrastructure, or internal data products, that supports the collection, cleansing, storage, and processing (in batch or real time) of data for answering business questions (usually by a data scientist, a statistician, or someone in a similar role, although in some cases these functions overlap). Examples include:

• The construction of data pipelines that aggregate data from multiple sources
• The productionization, at scale, of machine-learning models designed by data scientists
• The creation of pre-built tools that assist data scientists in the query process (e.g., UDFs or entire applications)

Data engineers rely on Apache Hadoop ecosystem components such as Apache Spark, Apache Kafka, and Apache Flume as a foundation for this infrastructure. Regardless of use case or components involved, this infrastructure should be compliance-ready with respect to security, data lineage, and metadata management.

This Technical Briefing Book contains selected posts from the Cloudera Engineering Blog about some key concepts pertaining to building and maintaining analytic data infrastructure on a Hadoop-powered enterprise data hub. For more technical information, see:

• Cloudera's Spark portal at
• The Spark Guide and related chapters in the Cloudera documentation
• Cloudera Developer Training for Apache Spark and Apache Hadoop from Cloudera University
• Hadoop Application Architectures, an O’Reilly Media book
• Advanced Analytics with Spark, an O’Reilly Media book

Copyright 2016 Cloudera Inc. All Rights Reserved.

About Cloudera

Cloudera provides the world’s fastest, easiest, and most secure data platform built on Apache Hadoop. We help solve your most demanding business challenges with data.


Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop
By Ted Malaska (June 2015)

Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.

The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems. In this post, I will outline the four major streaming patterns that we have encountered with customers running enterprise data hubs in production, and explain how to implement those patterns architecturally on Hadoop.

Streaming Patterns

The four basic streaming patterns (often used in tandem) are:

• Stream ingestion: Involves low-latency persisting of events to HDFS, Apache HBase, and Apache Solr.
• Near Real-Time (NRT) Event Processing with External Context: Takes actions like alerting, flagging, transforming, and filtering of events as they arrive. Actions might be taken based on sophisticated criteria, such as anomaly detection models. Common use cases, such as NRT fraud detection and recommendation, often demand low latencies under 100 milliseconds.
• NRT Event Partitioned Processing: Similar to NRT event processing, but deriving benefits from partitioning the data, such as storing more relevant external information in memory. This pattern also requires processing latencies under 100 milliseconds.
• Complex Topology for Aggregations or ML: The holy grail of stream processing: getting real-time answers from data with a complex and flexible set of operations. Here, because results often depend on windowed computations and require more active data, the focus shifts from ultralow latency to functionality and accuracy.

In the following sections, we’ll get into recommended ways for implementing such patterns in a tested, proven, and maintainable way.


Streaming Ingestion

Traditionally, Flume has been the recommended system for streaming ingestion. Its large library of sources and sinks covers all the bases of what to consume and where to write. (For details about how to configure and manage Flume, Using Flume, the O’Reilly Media book by Cloudera Software Engineer/Flume PMC member Hari Shreedharan, is a great resource.)

Within the last year, Kafka has also become popular because of powerful features such as playback and replication. Because of the overlap between Flume’s and Kafka’s goals, their relationship is often confusing. How do they fit together? The answer is simple: Kafka is a pipe similar to Flume’s Channel abstraction, albeit a better pipe because of its support for the features mentioned above. One common approach is to use Flume for the source and sink, and Kafka for the pipe between them.

The diagram below illustrates how Kafka can serve as the upstream source of data to Flume, the downstream destination of Flume, or the Flume Channel.


The design illustrated below is massively scalable, battle hardened, centrally monitored through Cloudera Manager, fault tolerant, and supports replay.

One thing to note before we go to the next streaming architecture is how this design gracefully handles failure. The Flume Sinks pull from a Kafka Consumer Group. The Consumer Group tracks the Topic’s offset with help from Apache ZooKeeper. If a Flume Sink is lost, the Consumer Group will redistribute the load to the remaining sinks. When the Flume Sink comes back up, the Consumer Group will redistribute again.
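The redistribution described above can be pictured with a small plain-Scala sketch. It does not use the Kafka or Flume APIs; the round-robin strategy, the `assign` function, and the sink names are all illustrative assumptions rather than Kafka's actual assignment algorithm.

```scala
object ConsumerGroupSketch {
  // Assign partitions to the live sinks round-robin. The strategy is an
  // illustrative assumption, not Kafka's actual assignment algorithm.
  def assign(partitions: Seq[Int], sinks: Seq[String]): Map[String, Seq[Int]] = {
    require(sinks.nonEmpty, "at least one live sink is required")
    val live = sinks.sorted
    val ordered = partitions.sorted
    live.zipWithIndex.map { case (sink, i) =>
      sink -> ordered.zipWithIndex.collect { case (p, j) if j % live.size == i => p }
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val partitions = 0 until 6
    // All three sinks are alive.
    println(assign(partitions, Seq("sink-a", "sink-b", "sink-c")))
    // sink-b is lost: its partitions are spread over the remaining sinks.
    println(assign(partitions, Seq("sink-a", "sink-c")))
  }
}
```

Recomputing the assignment from the set of live sinks is what lets the group absorb both the loss and the return of a sink without any partition being left unread.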

NRT Event Processing with External Context

To reiterate, a common use case for this pattern is to look at events streaming in and make immediate decisions, either to transform the data or to take some sort of external action. The decision logic often depends on external profiles or metadata. An easy and scalable way to implement this approach is to add a Source or Sink Flume interceptor to your Kafka/Flume architecture. With modest tuning, it’s not difficult to achieve latencies in the low milliseconds.

Flume Interceptors take events or batches of events and allow user code to modify or take actions based on them. The user code can interact with local memory or an external storage system like HBase to get profile information needed for decisions. HBase usually can give us our information in around 4-25 milliseconds depending on network, schema design, and configuration. You can also set up HBase in a way that it is never down or interrupted, even in the case of failure.


Implementation requires nearly no coding beyond the application-specific logic in the interceptor. Cloudera Manager offers an intuitive UI for deploying this logic through parcels as well as hooking up, configuring, and monitoring the services.
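As a rough illustration of the application-specific logic such an interceptor might carry, here is a minimal plain-Scala sketch. It does not use the Flume interceptor API or the HBase client: `Event`, `Profile`, `ProfileCache`, and the flagging rule are hypothetical, and the `lookup` function merely stands in for a call to an external store such as HBase.

```scala
import scala.collection.mutable

object InterceptorSketch {
  final case class Event(userId: String, amount: Double)
  final case class Profile(userId: String, maxAmount: Double)

  // `lookup` stands in for an external store such as HBase (hypothetical signature).
  class ProfileCache(lookup: String => Option[Profile]) {
    private val local = mutable.Map.empty[String, Profile]
    var externalCalls = 0

    // Serve from local memory when possible; otherwise call out once and remember the result.
    def get(userId: String): Option[Profile] =
      local.get(userId).orElse {
        externalCalls += 1
        val fetched = lookup(userId)
        fetched.foreach(p => local.update(userId, p))
        fetched
      }
  }

  // Flag an event that exceeds the user's limit; users without a profile are flagged too.
  def shouldFlag(event: Event, cache: ProfileCache): Boolean =
    cache.get(event.userId).forall(p => event.amount > p.maxAmount)

  def main(args: Array[String]): Unit = {
    val store = Map("u1" -> Profile("u1", 100.0))
    val cache = new ProfileCache(store.get)
    println(shouldFlag(Event("u1", 50.0), cache))
    println(shouldFlag(Event("u1", 500.0), cache))
    println(cache.externalCalls)
  }
}
```

Only the first event for a given user pays the external round trip; later events for that user are decided from local memory, which is where the low-millisecond latencies come from.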

NRT Partitioned Event Processing with External Context

In the architecture illustrated below (unpartitioned solution), you would need to call out frequently to HBase because external context relevant to particular events does not fit in local memory on the Flume interceptors.

However, if you define a key to partition your data, you can match incoming data to the subset of the context data that is relevant to it. If you partition the data 10 times, then you only need to hold 1/10th of the profiles in memory. HBase is fast, but local memory is faster. Kafka enables you to define a custom partitioner that it uses to split up your data.

Note that Flume is not strictly necessary here; the root solution here is just a Kafka consumer. So, you could use just a consumer in YARN or a Map-only MapReduce application.
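The arithmetic behind the "1/10th of the profiles" claim can be sketched in plain Scala. This is not Kafka's partitioner interface; `partitionFor` and `profilesFor` are hypothetical helpers, and hashing with `String.hashCode` is an assumption made for brevity.

```scala
object PartitionSketch {
  // Map a key to one of `numPartitions` buckets. Math.floorMod keeps the
  // result non-negative even when hashCode is negative.
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // The profiles one consumer must hold in memory: those whose key maps to its partition.
  def profilesFor(partition: Int, numPartitions: Int,
                  profiles: Map[String, String]): Map[String, String] =
    profiles.filter { case (key, _) => partitionFor(key, numPartitions) == partition }

  def main(args: Array[String]): Unit = {
    val profiles = (1 to 1000).map(i => ("user-" + i) -> ("profile-" + i)).toMap
    // Each of the 10 consumers holds only a fraction of the 1,000 profiles.
    (0 until 10).foreach { p =>
      println("partition " + p + " holds " + profilesFor(p, 10, profiles).size + " profiles")
    }
  }
}
```

Because the same function routes both the events and the profiles, an event always arrives at the consumer that already holds its profile.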

Complex Topology for Aggregations or ML

Up to this point, we have been exploring event-level operations. However, sometimes you need more complex operations like counts, averages, sessionization, or machine-learning model building that operate on batches of data. In this case, Spark Streaming is the ideal tool for several reasons:

• It’s easy to develop compared to other tools. Spark’s rich and concise APIs make building out complex topologies easy.
• Similar code for streaming and batch processing. With a few changes, the code for small batches in real time can be used for enormous batches offline. In addition to reducing code size, this approach reduces the time needed for testing and integration.
• There’s one engine to know. There is a cost that goes into training staff on the quirks and internals of distributed processing engines. Standardizing on Spark consolidates this cost for both streaming and batch.
• Micro-batching helps you scale reliably. Acknowledging at a batch level allows for more throughput and allows for solutions without the fear of a double-send. Micro-batching also helps with sending changes to HDFS or HBase in terms of performance at scale.
• Hadoop ecosystem integration is baked in. Spark has deep integration with HDFS, HBase, and Kafka.
• No risk of data loss. Thanks to the WAL and Kafka, Spark Streaming avoids data loss in case of failure.
• It’s easy to debug and run. You can debug and step through your Spark Streaming code in a local IDE without a cluster. Plus, the code looks like normal functional programming code, so it doesn’t take much time for a Java or Scala developer to make the jump. (Python is also supported.)
• Streaming is natively stateful. In Spark Streaming, state is a first-class citizen, meaning that it’s easy to write stateful streaming applications that are resilient to node failures.
• As the de facto standard, Spark is getting long-term investment from across the ecosystem. At the time of this writing, there were approximately 700 commits to Spark as a whole in the last 30 days, compared to other streaming frameworks such as Storm, with 15 commits during the same time.
• You have access to ML libraries. Spark’s MLlib is becoming hugely popular and its functionality will only increase.
• You can use SQL where needed. With Spark SQL, you can add SQL logic to your streaming application to reduce code complexity.
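To make the "similar code for streaming and batch" point concrete without pulling in Spark, here is a minimal plain-Scala sketch: one aggregation function is applied both to a whole data set and to small micro-batches whose partial results are merged. The names `countByKey`, `merge`, and `streaming` are hypothetical, and ordinary collections stand in for RDDs and DStreams.

```scala
object BatchAndStreamSketch {
  // Count events per key; the same function serves one large batch or one micro-batch.
  def countByKey(events: Seq[String]): Map[String, Long] =
    events.groupBy(identity).map { case (key, group) => key -> group.size.toLong }

  // Merge two partial results, as a streaming job would do across micro-batches.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap

  // Process the events in micro-batches of `batchSize` and fold the partial counts together.
  def streaming(events: Seq[String], batchSize: Int): Map[String, Long] =
    events.grouped(batchSize).map(countByKey).foldLeft(Map.empty[String, Long])(merge)

  def main(args: Array[String]): Unit = {
    val events = Seq("login", "click", "click", "logout", "login", "click")
    println(countByKey(events))   // offline: one enormous batch
    println(streaming(events, 2)) // online: micro-batches of two events
  }
}
```

Both paths produce the same counts, which is why logic tested in batch mode carries over to the streaming job with few changes.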

Conclusion

There is a lot of power in streaming and several possible patterns, but as you have learned in this post, you can do really powerful things with minimal coding if you know which pattern matches up with your use case best.

Ted Malaska is a Solutions Architect at Cloudera, a contributor to Spark, Flume, and HBase, and a co-author of the O’Reilly book, Hadoop Application Architectures.


Designing Fraud-Detection Architecture That Works Like Your Brain Does
By Gwen Shapira and Ted Malaska (July 2015)

To design effective fraud-detection architecture, look no further than the human brain (with some help from Spark Streaming and Apache Kafka).

At its core, fraud detection is about detecting whether people are behaving “as they should,” otherwise known as catching anomalies in a stream of events. This goal is reflected in diverse applications such as detecting credit-card fraud, flagging patients who are doctor shopping to obtain a supply of prescription drugs, or identifying bullies in online gaming communities.

To understand how to design an effective fraud-detection architecture, one needs to examine how the human brain learns to detect anomalies and react to them. As it turns out, our brains have multiple systems for analyzing information. (If you are interested in how humans process information, we recommend the book Thinking, Fast and Slow, by Daniel Kahneman.)

Consider a game of tennis, where the players have to detect the arriving ball and react to it appropriately. During the game, players have to perform this detection-reaction loop incredibly fast; there is no time to think, and reactions are based on instinct and very fast pattern detection. Between sets, a player may have a moment or two to reflect upon the game’s progress, identify tactics or strategies the other player is using, and make adjustments accordingly. Between games, the players have much more time for reflection: they may notice that a particular aspect of their game is consistently weak and work on improving it, so during the next game they can instinctively perform better. (Note that this reflection can be conscious or unconscious. We have all occasionally spent hours trying to solve a challenging problem, only for the solution to materialize during the morning shower while we are thinking of nothing in particular.)

Combination of Systems

In a similar fashion, effective fraud-detection architecture emulates the human brain by having three subsystems work together to detect anomalies in streams of events:

• Near real-time system: The job of this system is to receive events and reply as fast as possible, usually in less than 100ms. This system typically does very little processing and depends mostly on pattern matching and applying predefined rules. Architectural design should focus on achieving very high throughput with very low latency, and to this end, use patterns such as caching of user profiles in local memory.
• Stream-processing system: This system can take a little longer to process the incoming data, but should still process each event within a few seconds to a few minutes of its arrival. The goal of this system is to adjust parameters of the fraud-detection models in near real-time, using data aggregated across all user activity (for example, flagging vendors or regions that are currently more suspicious).
• Offline-processing system: This system can run with latencies of anything from hours to months and focuses on improving the models themselves. This process includes training the models on new data, exploring new features in the data, and developing new models. It also requires that human data analysts explore the data using BI tools.

In a previous blog post (see above), we explored different patterns of how the first type of system, near real-time, can be implemented and covered some reasons that Cloudera recommends Spark Streaming for the stream-processing system. To recap, the recommended architecture for the real-time reaction system is a service that subscribes as a consumer to an Apache Kafka topic with the events to which reactions are required. The service uses cached state and rules to react quickly to these events and uses Apache HBase as an external context. Kafka partitions can be used to distribute the service and ensure that each instance only needs to cache information for a subset of the users.


The beauty of this plan is that this distributed application is self-contained and can be managed in many ways – as an Apache Flume interceptor, as a YARN container, or with Mesos, Docker, Kubernetes, and many other distributed system container frameworks. You can pick whatever you prefer, since Kafka will do the data persistence and partitioning work. Now, let’s see how to integrate the real-time part of the system with the stream- and offline-processing portions.

Integrating Real-Time Detection and Processing

The key to the integration is the use of Kafka as a scalable, ordered event store. When registering consumers with Kafka, you can subscribe consumers either as part of the same consumer group or in separate consumer groups. If two consumers are subscribed to read events from a topic as part of the same group, they will each “own” a subset of the partitions in the topic, and each one will only get events from the partitions it owns. If a consumer in the group crashes, its partitions will be distributed across other consumers in the group. This approach provides a mechanism for both load balancing and high availability of consumers, and it lets each data processing application scale (by adding more consumers to the same group as load increases).

But we also want multiple applications reading the same data; both the real-time app and the streaming app need to read the same data from Kafka. In this case, each application will be its own consumer group and will be able to consume messages from Kafka independently at its own pace.
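The independence of consumer groups can be sketched with a toy in-memory log in plain Scala. This is not the Kafka client API: `PartitionLog`, `poll`, and the group names are hypothetical, and a single partition is assumed for simplicity.

```scala
import scala.collection.mutable

object ConsumerGroupOffsets {
  // An append-only log standing in for one topic partition.
  class PartitionLog {
    private val messages = mutable.ArrayBuffer.empty[String]
    // One read position per consumer group; a group that has never read starts at 0.
    private val offsets = mutable.Map.empty[String, Int].withDefaultValue(0)

    def append(msg: String): Unit = { messages += msg; () }

    // Return up to `max` unread messages for `group` and advance only that group's offset.
    def poll(group: String, max: Int): List[String] = {
      val from = offsets(group)
      val batch = messages.slice(from, from + max).toList
      offsets(group) = from + batch.size
      batch
    }
  }

  def main(args: Array[String]): Unit = {
    val log = new PartitionLog
    Seq("e1", "e2", "e3").foreach(log.append)
    println(log.poll("realtime", 10)) // the fast application reads everything
    println(log.poll("streaming", 1)) // the slower application reads at its own pace
    println(log.poll("streaming", 10))
  }
}
```

Each group sees every message exactly once, and a slow group never holds back a fast one because the log itself is never modified by a read.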

To enable offline processing by batch jobs and analysts, the data needs to be stored in HDFS. You can easily do this using Flume; just define Kafka as the channel and HDFS as the sink. In this setup, Flume will read events from Kafka and write them to HDFS, HBase, or Apache Solr where they can be accessed by Apache Spark, Impala, Apache Hive, and other BI tools.


Note that since each system is subscribed with its own consumer group to Kafka, they can each read events independently at their own rate. Thus, if the stream-processing system is taking longer to process events, it has no impact on the real-time system. Part of the beauty of Kafka is that it stores events for a set amount of time, regardless of how many consumers there are and what they do, so Kafka will keep performing the same even as you add more processing systems.

The last part of the integration is the ability to send rule and model updates from the stream- and offline-processing systems back to the real-time system. This process is the equivalent of improving human instincts based on practice (changing the threshold of approved transaction sizes for a particular vendor, for example).

One approach is to have these systems update the models in HBase, and have the real-time system occasionally (and asynchronously) check HBase for updates. A better option is to send the model updates to another Kafka topic: our real-time app will subscribe to that topic and, when updates show up, it will apply them to its own rule cache and modify its behavior accordingly.

An interesting design option here can be to store the models completely in Kafka, because its compaction feature can ensure that regardless of how long you choose to store “raw” data in Kafka, the latest numbers for each model will be stored forever and can always be retrieved and cached by the real-time application.
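The effect of compaction on a model-update topic can be illustrated in a few lines of plain Scala. This sketch models only the outcome (the latest value per key survives), not Kafka's actual compaction mechanics; `compact`, `approve`, and the vendor thresholds are hypothetical.

```scala
object CompactionSketch {
  // One update record: (model parameter key, new value).
  type Update = (String, Double)

  // Replay the update log in order, keeping only the most recent value for each key.
  def compact(log: Seq[Update]): Map[String, Double] =
    log.foldLeft(Map.empty[String, Double]) { case (acc, (key, value)) =>
      acc.updated(key, value)
    }

  // Hypothetical rule: approve when the amount is within the vendor's current threshold.
  // Vendors without a threshold are not approved.
  def approve(rules: Map[String, Double], vendor: String, amount: Double): Boolean =
    rules.get(vendor).exists(threshold => amount <= threshold)

  def main(args: Array[String]): Unit = {
    val updates = Seq("vendorA" -> 100.0, "vendorB" -> 50.0, "vendorA" -> 75.0)
    val rules = compact(updates)
    println(rules)
    println(approve(rules, "vendorA", 80.0))
  }
}
```

A real-time application that restarts can rebuild its rule cache by replaying the compacted topic from the beginning, since the latest value for every key is still there.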

Conclusion

We hope that this post shed some light on a very challenging topic.

Gwen Shapira is a Software Engineer at Cloudera, and a committer on Apache Sqoop and Apache Kafka. She has 15 years of experience working with customers to design scalable data architectures, and is a co-author of the O’Reilly book, Hadoop Application Architectures.

Ted Malaska is a Solutions Architect at Cloudera, a contributor to Spark, Flume, and HBase, and also a co-author of Hadoop Application Architectures.


How-to: Do Near-Real Time Sessionization with Spark Streaming and Apache Hadoop
By Ted Malaska (Nov. 2014)

This Spark Streaming use case is a great example of how near-real-time processing can be brought to Hadoop.

Spark Streaming is one of the most interesting components within the Apache Spark stack. With Spark Streaming, you can create data pipelines that process streamed data using the same API that you use for processing batch-loaded data. Furthermore, Spark Streaming’s “micro-batching” approach provides decent resiliency should a job fail for some reason.

In this post, I will demonstrate and walk you through some common and advanced Spark Streaming functionality via the use case of doing near-real time sessionization of Website events, then loading stats about that activity into Apache HBase, and then populating graphs in your preferred BI tool for analysis. (Sessionization refers to the capture of all clickstream activity within the timeframe of a single visitor’s Website session.) You can find the code for this demo here.

A system like this one can be super-useful for understanding visitor behavior (whether human or machine). With some additional work, it can also be designed to contain windowing patterns for detecting possible fraud in an asynchronous manner.

Spark Streaming Code

The main class to look at in our example is:

Let’s look at this code in sections (ignoring lines 1-59, which contain imports and other uninteresting stuff).

Lines 60 to 112: Setting up Spark Streaming

These lines are our pretty basic start for Spark Streaming with an option to receive data from HDFS or a socket. I’ve added some verbose comments to help you understand the code if you are new to Spark Streaming. (I’m not going to go into great detail here because we’re still in the boilerplate-code zone.)

    //This is just creating a Spark Config object. I don't do much here but
    //add the app name. There are tons of options to put into the Spark config,
    //but none are needed for this simple example.
    val sparkConf = new SparkConf().
      setAppName("SessionizeData " + args(0)).
      set("spark.cleaner.ttl", "120000")

    //These two lines will get us our SparkContext and our StreamingContext.
    //These objects have all the root functionality we need to get started.
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(10))

    //Here we are loading our HBase Configuration object. This will have
    //all the information needed to connect to our HBase cluster.
    //There is nothing different here from when you normally interact with HBase.
    val conf = HBaseConfiguration.create();
    conf.addResource(new Path("/etc/hbase/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));

    //This is a HBaseContext object. This is a nice abstraction that will hide
    //any complex HBase stuff from us so we can focus on our business case
    //HBaseContext is from the SparkOnHBase project which can be found at
    //
    val hbaseContext = new HBaseContext(sc, conf);

    //This creates a reference to our root DStream. DStreams are like RDDs but
    //with the context of being in micro batch world. I set this to null now
    //because I later give the option of populating this data from HDFS or from
    //a socket. There is no reason this could not also be populated by Kafka,
    //Flume, MQ system, or anything else. I just focused on these because
    //they are the easiest to set up.
    var lines: DStream[String] = null

    //Options for data load. Will be adding Kafka and Flume at some point
    if (args(0).equals("socket")) {
      val host = args(FIXED_ARGS);
      val port = args(FIXED_ARGS + 1);

      println("host:" + host)
      println("port:" + Integer.parseInt(port))

      //Simple example of how you set up a receiver from a Socket Stream
      lines = ssc.socketTextStream(host, port.toInt)
    } else if (args(0).equals("newFile")) {
      val directory = args(FIXED_ARGS)
      println("directory:" + directory)

      //Simple example of how you set up a receiver from a HDFS folder
      lines = ssc.fileStream[LongWritable, Text, TextInputFormat](directory,
        (t: Path) => true, true).map(_._2.toString)
    } else {
      throw new RuntimeException("bad input type")
    }


Lines 114 to 124: String Parsing

Here’s where the Streaming with Spark begins. Look at the following lines:

    val ipKeyLines = lines.map[(String, (Long, Long, String))](eventRecord => {
      //Get the time and ip address out of the original event
      val time = dateFormat.parse(
        eventRecord.substring(eventRecord.indexOf('[') + 1, eventRecord.indexOf(']'))).
        getTime()
      val ipAddress = eventRecord.substring(0, eventRecord.indexOf(' '))

      //We return the time twice because we will use the first as the start time
      //and the second as the end time
      (ipAddress, (time, time, eventRecord))
    })

The first command above is doing a map function on the “lines” DStream object and parsing the original events to separate out the IP address, timestamp, and event body. For those new to Spark Streaming, a DStream holds a batch of records to be processed. These records are populated by the receiver object, which was defined previously, and this map function produces another DStream within this micro-batch to store the transformed records for additional processing.

There are a couple of things to note when looking at a Spark Streaming diagram like the one above:

• Each micro-batch is fired at the interval (in seconds) defined when constructing your StreamingContext
• The Receiver is always populating the future RDDs for the next micro-batch
• Older RDDs of past micro-batches will be cleaned up and discarded

Lines 126 to 135: Making Sessions

Now that we have IP addresses and times broken out from the web log, it’s time to build sessions. The following code does the session building by first clumping events within the micro-batch, and then reducing those clumps with sessions in the stateful DStream.

    val latestSessionInfo = ipKeyLines.
      map[(String, (Long, Long, Long))](a => {
        //transform to (ipAddress, (time, time, counter))
        (a._1, (a._2._1, a._2._2, 1))
      }).
      reduceByKey((a, b) => {
        //transform to (ipAddress, (lowestStartTime, MaxFinishTime, sumOfCounter))
        (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3)
      }).
      updateStateByKey(updateStatbyOfSessions)

Here’s an example of how the records will be reduced within the micro-batch:
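As a plain-Scala stand-in for that reduction (no Spark required), the sketch below applies the same min/max/sum logic to an ordinary collection. The `reduceByIp` helper, the sample IP addresses, and the timestamps are hypothetical; `groupBy` followed by `reduce` plays the role that `reduceByKey` plays on a DStream.

```scala
object MicroBatchReduce {
  // (startTime, endTime, eventCount)
  type Session = (Long, Long, Long)

  // Collapse all events of one micro-batch into a single session range per IP address.
  def reduceByIp(events: Seq[(String, Long)]): Map[String, Session] =
    events
      .map { case (ip, time) => ip -> ((time, time, 1L)) }
      .groupBy(_._1)
      .map { case (ip, records) =>
        ip -> records.map(_._2).reduce((a, b) =>
          (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3))
      }

  def main(args: Array[String]): Unit = {
    val batch = Seq(("10.0.0.1", 300L), ("10.0.0.2", 150L), ("10.0.0.1", 100L), ("10.0.0.1", 200L))
    reduceByIp(batch).foreach(println)
  }
}
```

Three events from one address become a single record holding the earliest time, the latest time, and a count of three, which is the shape the stateful step expects.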

With the session ranges joined within the micro-batch, we can use the supercool updateStateByKey functionality, which will do a join/reduce-like operation with a DStream from the micro-batch before the active one. The diagram below illustrates how this process looks in terms of DStreams over time.
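The join/reduce-like behavior can be approximated in plain Scala as follows. This is a simplified model, not Spark's implementation: `updateState` is a hypothetical helper that mirrors the shape of the update function Spark expects, `(Seq[V], Option[S]) => Option[S]`, and applies it to every key found in either the previous state or the new batch.

```scala
object StateUpdateSketch {
  // Apply `update` to every key present in either the previous state or the new batch.
  // Returning None from `update` drops the key from the next state.
  def updateState[K, V, S](
      previous: Map[K, S],
      batch: Map[K, Seq[V]],
      update: (Seq[V], Option[S]) => Option[S]): Map[K, S] =
    (previous.keySet ++ batch.keySet).flatMap { key =>
      update(batch.getOrElse(key, Seq.empty), previous.get(key)).map(state => key -> state)
    }.toMap

  def main(args: Array[String]): Unit = {
    // Running totals per key; keys that receive nothing in this batch are dropped.
    val next = updateState[String, Int, Int](
      Map("a" -> 1, "b" -> 2),
      Map("a" -> Seq(2, 3), "c" -> Seq(5)),
      (values, state) => if (values.isEmpty) None else Some(state.getOrElse(0) + values.sum))
    println(next)
  }
}
```

The three cases that matter for sessionization are all visible here: a key with old state and new values, a key with old state only, and a key seen for the first time.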


Now let’s dig into the updateStatbyOfSessions function, which is defined at the bottom of the file. This code (note the verbose comments) contains a lot of the magic that makes sessionization happen in a micro-batch continuous mode.

    /**
     * This function will be called for the union of keys in the Reduce DStream
     * with the active sessions from the last micro batch with the ipAddress
     * being the key
     *
     * The goal is that this produces a stateful RDD that has all the active
     * sessions. So we add new sessions and remove sessions that have timed
     * out and extend sessions that are still going
     */
    def updateStatbyOfSessions(
        //(sessionStartTime, sessionFinishTime, countOfEvents)
        a: Seq[(Long, Long, Long)],
        //(sessionStartTime, sessionFinishTime, countOfEvents, isNewSession)
        b: Option[(Long, Long, Long, Boolean)]
      ): Option[(Long, Long, Long, Boolean)] = {

      //This function will return an Optional value.
      //If we want to delete the value we can return an optional "None".
      //This value contains four parts
      //(startTime, endTime, countOfEvents, isNewSession)
      var result: Option[(Long, Long, Long, Boolean)] = null

      //These if statements are saying if we didn't get a new event for
      //this session's ip address for longer than the session
      //timeout + the batch time then it is safe to remove this key value
      //from the future Stateful DStream
      if (a.size == 0) {
        if (System.currentTimeMillis() - b.get._2 > SESSION_TIMEOUT + 11000) {
          result = None
        } else {
          if (b.get._4 == false) {
            result = b
          } else {
            result = Some((b.get._1, b.get._2, b.get._3, false))
          }
        }
      }

      //Now because we used the reduce function before this function we are
      //only ever going to get at most one event in the Sequence.
      a.foreach(c => {
        if (b.isEmpty) {
          //If there was no value in the Stateful DStream then just add it
          //new, with a true for being a new session
          result = Some((c._1, c._2, c._3, true))
        } else {
          if (c._1 - b.get._2 < SESSION_TIMEOUT) {
            //If the session from the stateful DStream has not timed out
            //then extend the session
            result = Some((
              Math.min(c._1, b.get._1), //newStartTime
              Math.max(c._2, b.get._2), //newFinishTime
              b.get._3 + c._3, //newSumOfEvents
              false //This is not a new session
            ))
          } else {
            //Otherwise replace the old session with a new one
            result = Some((
              c._1, //newStartTime
              c._2, //newFinishTime
              c._3, //newSumOfEvents: only the events of the new session
              true //new session
            ))
          }
        }
      })
      result
    }

There’s a lot going on in this code, and in many ways, it’s the most complex part of the whole job. To summarize, it tracks active sessions so you know if you are continuing an existing session or starting a new one.

Lines 126 to 207: Counting and HBase

This section is where most of the counting happens. There is a lot of repetition here, so let’s walk through just one count example and then the steps that will allow us to put the generated counts in the same record for storage in HBase.


    val onlyActiveSessions = latestSessionInfo.filter(t =>
      System.currentTimeMillis() - t._2._2 < SESSION_TIMEOUT)
    …
    val newSessionCount = onlyActiveSessions.filter(t => {
        //is the session newer than the last micro batch
        //and is the boolean saying this is a new session true
        (System.currentTimeMillis() - t._2._2 > 11000 && t._2._4)
      }).
      count.
      map[HashMap[String, Long]](t => HashMap((NEW_SESSION_COUNTS, t)))

In short, the code above is filtering all but the active sessions, counting them, and putting that final count record into a single-entry HashMap. It uses the HashMap as a container, so we can call the following reduce function after all the counts are done to put them all into a single record. (I’m sure there are better ways to do that, but this approach works just fine.)

Next, the following code takes all those HashMaps and puts all their values in one HashMap.

    val allCounts = newSessionCount.
      union(totalSessionCount).
      union(totals).
      union(totalEventsCount).
      union(deadSessionsCount).
      union(totalSessionEventCount).
      reduce((a, b) => b ++ a)
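The union-then-reduce step boils down to merging several single-entry maps into one record. A minimal plain-Scala sketch of that merge, using ordinary immutable `Map`s instead of DStreams, looks like this; the `mergeAll` helper and the count names are hypothetical.

```scala
object MergeCountsSketch {
  // Fold a list of single-entry maps into one record. With `b ++ a`, entries
  // from `a` win if the same key appears on both sides.
  def mergeAll(counts: Seq[Map[String, Long]]): Map[String, Long] = {
    require(counts.nonEmpty, "reduce needs at least one map")
    counts.reduce((a, b) => b ++ a)
  }

  def main(args: Array[String]): Unit = {
    val merged = mergeAll(Seq(
      Map("NEW_SESSION_COUNTS" -> 2L),
      Map("TOTAL_SESSION_COUNTS" -> 10L),
      Map("DEAD_SESSION_COUNTS" -> 1L)))
    println(merged)
  }
}
```

Because every input map carries a different key, the order of the merge does not matter and the result is one map with all counts side by side.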

Interacting with HBase through Spark Streaming is super simple with HBaseContext. All you have to do is supply the DStream with the HashMap and a function to convert it to a put object.

    hbaseContext.streamBulkPut[HashMap[String, Long]](
      allCounts, //The input RDD
      hTableName, //The name of the table we want to put to
      (t) => {
        //Here we are converting our input record into a put
        //The rowKey is C for Count and a backward counting time so the newest
        //count shows up first in HBase's sorted order
        val put = new Put(Bytes.toBytes("C." + (Long.MaxValue - System.currentTimeMillis())))
        //We are iterating through the HashMap to make all the columns with their counts
        t.foreach(kv => put.add(Bytes.toBytes(hFamily), Bytes.toBytes(kv._1), Bytes.toBytes(kv._2.toString)))
        put
      },
      false)
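The "backward counting time" row key is worth a closer look. The plain-Scala sketch below builds keys the same way and sorts them as strings, which matches HBase's byte-wise ordering for these ASCII keys. The `rowKey` helper and the sample timestamps are hypothetical, and the ordering argument assumes every key has the same number of digits, which holds for any realistic epoch-millisecond timestamp.

```scala
object RowKeySketch {
  // Subtracting the timestamp from Long.MaxValue makes newer timestamps
  // produce smaller numbers, so they sort first.
  def rowKey(timestampMillis: Long): String =
    "C." + (Long.MaxValue - timestampMillis)

  def main(args: Array[String]): Unit = {
    val older = 1416000000000L
    val newer = 1416000010000L
    // Sorted the way a scan would return them: the newest count comes first.
    Seq(older, newer).map(rowKey).sorted.foreach(println)
  }
}
```

A scan that starts at the prefix "C." and reads a single row therefore returns the most recent counts without needing a reverse scan.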


Now with this information in HBase, we can wrap it up with an Apache Hive table, and then execute a query through your favorite BI tool to get graphs like the following that will refresh on every micro-batch.

Lines 209 to 215: Writing to HDFS

The final task is to join the active session information with the event data and then persist the events to HDFS with the starting time of the session.

    //Persist to HDFS
    ipKeyLines.join(onlyActiveSessions).
      map(t => {
        //Session root start time | Event message
        dateFormat.format(new Date(t._2._2._1)) + "\t" + t._2._1._3
      }).
      saveAsTextFiles(outputDir + "/session", "txt")

Conclusion

I hope you come away from this example feeling like a lot of work was done with just a little bit of code, because it was. Imagine what else you can do with this pattern and the ability to interact with HBase and HDFS so easily within Spark Streaming.

Ted Malaska is a Solutions Architect at Cloudera, a contributor to Spark, Flume, and HBase, and a co-author of the O’Reilly book, Hadoop Application Architectures.


Apache Kafka for Beginners
By Gwen Shapira & Jeff Holoman (Sept. 2014)

When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.

Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was created, is the best-known user, there are many companies successfully using this technology. So now that the word is out, it seems the world wants to know: What does it do? Why does everyone want to use it? How is it better than existing solutions? Do the benefits justify replacing existing systems and infrastructure?

In this post, we’ll try to answer those questions. We’ll begin by briefly introducing Kafka, and then demonstrate some of Kafka’s unique features by walking through an example scenario. We’ll also cover some additional use cases and compare Kafka to existing solutions.

What is Kafka?

Kafka is one of those systems that is very simple to describe at a high level, but has an incredible depth of technical detail when you dig deeper. The Kafka documentation does an excellent job of explaining the many design and implementation subtleties in the system, so we will not attempt to explain them all here. In summary, Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable.

Like many publish-subscribe messaging systems, Kafka maintains feeds of messages in topics. Producers write data to topics and consumers read from topics. Since Kafka is a distributed system, topics are partitioned and replicated across multiple nodes.

Messages are simply byte arrays, and developers can use them to store any object in any format, with String, JSON, and Avro the most common. It is possible to attach a key to each message, in which case the producer guarantees that all messages with the same key will arrive at the same partition. When consuming from a topic, it is possible to configure a consumer group with multiple consumers. Each consumer in a consumer group will read messages from a unique subset of partitions in each topic it subscribes to, so each message is delivered to one consumer in the group, and all messages with the same key arrive at the same consumer.

What makes Kafka unique is that it treats each topic partition as a log (an ordered set of messages). Each message in a partition is assigned a unique offset. Kafka does not attempt to track which messages were read by each consumer and retain only unread messages; rather, Kafka retains all messages for a set amount of time, and consumers are responsible for tracking their location in each log.


Consequently, Kafka can support a large number of consumers and retain large amounts of data with very little overhead. Next, let’s look at how Kafka’s unique properties are applied in a specific use case.
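The log-and-offset model described above can be illustrated with a toy, in-memory sketch. This is not the Kafka client API; the class and method names are hypothetical, and real partitions are replicated, persisted to disk and expired by retention time.

```scala
object ToyPartitionLog {
  // One topic partition modeled as an append-only sequence: a message's
  // offset is simply its index in the log.
  final class PartitionLog {
    private val messages = scala.collection.mutable.ArrayBuffer.empty[String]

    def append(message: String): Long = {
      messages += message
      (messages.size - 1).toLong
    }

    // Reading never removes anything; the caller says where to start.
    def readFrom(offset: Long): Seq[String] = messages.drop(offset.toInt).toSeq
  }

  // Each consumer tracks its own position; the log keeps no per-consumer state.
  final class Consumer(log: PartitionLog) {
    private var position = 0L

    def poll(): Seq[String] = {
      val batch = log.readFrom(position)
      position += batch.size
      batch
    }
  }
}
```

A consumer that polls rarely still sees every retained message, and adding another consumer costs the log nothing, which is the property the text attributes to Kafka's design.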

Kafka at Work

Suppose we are developing a massively multiplayer online game. In these games, players cooperate and compete with each other in a virtual world. Often players trade with each other, exchanging game items and money, so as game developers it is important to make sure players don’t cheat: Trades will be flagged if the trade amount is significantly larger than normal for the player and if the IP the player is logged in with is different than the IP used for the last 20 games. In addition to flagging trades in real time, we also want to load the data to Apache Hadoop, where our data scientists can use it to train and test new algorithms.

For the real-time event flagging, it will be best if we can reach the decision quickly based on data that is cached in the game server’s memory, at least for our most active players. Our system has multiple game servers, and the data set that includes the last 20 logins and last 20 trades for each player can fit in the memory we have, if we partition it between our game servers.

Our game servers have to perform two distinct roles: The first is to accept and propagate user actions, and the second is to process trade information in real time and flag suspicious events. To perform the second role effectively, we want the whole history of trade events for each user to reside in the memory of a single server. This means we have to pass messages between the servers, since the server that accepts the user action may not have that user’s trade history. To keep the roles loosely coupled, we use Kafka to pass messages between the servers, as you’ll see below.

Kafka has several features that make it a good fit for our requirements: scalability, data partitioning, low latency, and the ability to handle a large number of diverse consumers. We have configured Kafka with a single topic for logins and trades. The reason we need a single topic is to make sure that trades arrive in our system after we already have information about the login (so we can make sure the gamer logged in from his usual IP). Kafka maintains order within a topic, but not between topics.

When a user logs in or makes a trade, the accepting server immediately sends the event into Kafka. We send messages with the user id as the key, and the event as the value. This guarantees that all trades and logins from the same user arrive at the same Kafka partition. Each event-processing server runs a Kafka consumer, each of which is configured to be part of the same group. This way, each server reads data from a few Kafka partitions, and all the data about a particular user arrives at the same event-processing server (which can be different from the accepting server). When the event-processing server reads a user trade from Kafka, it adds the event to the user’s event history it caches in local memory. Then it can access the user’s event history from the local cache and flag suspicious events without additional network or disk overhead.
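As a rough illustration of the event-processing role, the following sketch keeps a per-user history in local memory and applies the two flagging conditions described above. It is a simplification under stated assumptions: all type and field names are hypothetical, "significantly larger than normal" is assumed to mean more than five times the user's average recent trade, and events are supplied as an in-memory sequence rather than read from Kafka.

```scala
object TradeFlagging {
  sealed trait Event { def userId: String }
  final case class Login(userId: String, ip: String) extends Event
  final case class Trade(userId: String, amount: Double) extends Event

  // Per-user history held in local memory, capped at the last 20 entries of each kind.
  final case class History(recentIps: Vector[String] = Vector.empty,
                           recentAmounts: Vector[Double] = Vector.empty)

  val HistorySize = 20
  val AmountFactor = 5.0 // assumption: the threshold for an unusually large trade

  def isSuspicious(history: History): Double => Boolean = { amount =>
    val amounts = history.recentAmounts
    val unusualAmount =
      amounts.nonEmpty && amount > AmountFactor * (amounts.sum / amounts.size)
    // The current login IP is the most recent one; compare it with the earlier ones.
    val ips = history.recentIps
    val unusualIp = ips.size > 1 && !ips.init.contains(ips.last)
    unusualAmount && unusualIp
  }

  // Processes events in arrival order (as one partition would deliver them)
  // and returns the trades that were flagged.
  def process(events: Seq[Event]): Seq[Trade] = {
    val cache = scala.collection.mutable.Map.empty[String, History]
    val flagged = Seq.newBuilder[Trade]
    events.foreach {
      case Login(user, ip) =>
        val h = cache.getOrElse(user, History())
        cache(user) = h.copy(recentIps = (h.recentIps :+ ip).takeRight(HistorySize))
      case t @ Trade(user, amount) =>
        val h = cache.getOrElse(user, History())
        if (isSuspicious(h)(amount)) flagged += t
        cache(user) = h.copy(recentAmounts = (h.recentAmounts :+ amount).takeRight(HistorySize))
    }
    flagged.result()
  }
}
```

The sketch depends on logins being processed before the trades that follow them, which is why the text routes both event types through a single topic keyed by user id.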


It’s important to note that we create a partition per event-processing server, or per core on the event-processing servers for a multi-threaded approach. (Keep in mind that Kafka was mostly tested with fewer than 10,000 partitions for all the topics in the cluster in total, and therefore we do not attempt to create a partition per user.)

This may sound like a circuitous way to handle an event: Send it from the game server to Kafka, read it from another game server, and only then process it. However, this design decouples the two roles and allows us to manage capacity for each role as required. In addition, the approach does not add significantly to the timeline, as Kafka is designed for high throughput and low latency; even a small three-node cluster can process close to a million events per second with an average latency of 3ms.

When the server flags an event as suspicious, it sends the flagged event into a new Kafka topic (for example, Alerts) where alert servers and dashboards pick it up. Meanwhile, a separate process reads data from the Events and Alerts topics and writes them to Hadoop for further analysis.

Because Kafka does not track acknowledgements and messages per consumer, it can handle many thousands of consumers with very little performance impact. Kafka even handles batch consumers (processes that wake up once an hour to consume all new messages from a queue) without affecting system throughput or latency.


Additional Use Cases

As this simple example demonstrates, Kafka works well as a traditional message broker as well as a method of ingesting events into Hadoop. Here are some other common uses for Kafka:

• Website activity tracking: The web application sends events such as page views and searches to Kafka, where they become available for real-time processing, dashboards, and offline analytics in Hadoop.

• Operational metrics: Alerting and reporting on operational metrics. One particularly fun example is having Kafka producers and consumers occasionally publish their message counts to a special Kafka topic; a service can be used to compare counts and alert if data loss occurs.

• Log aggregation: Kafka can be used across an organization to collect logs from multiple services and make them available in standard format to multiple consumers, including Hadoop and Apache Solr.

• Stream processing: A framework such as Spark Streaming reads data from a topic, processes it, and writes processed data to a new topic where it becomes available for users and applications. Kafka’s strong durability is also very useful in the context of stream processing.

Other systems serve many of those use cases, but none of them serves them all. ActiveMQ and RabbitMQ are very popular message broker systems, and Apache Flume is traditionally used to ingest events, logs, and metrics into Hadoop.

Kafka and Its Alternatives

We can’t speak much about message brokers, but data ingest for Hadoop is a problem we understand very well. First, it is interesting to note that Kafka started out as a way to make data ingest to Hadoop easier. When there are multiple data sources and destinations involved, writing a separate data pipeline for each source and destination pairing quickly evolves into an unmaintainable mess. Kafka helped LinkedIn standardize the data pipelines and allowed getting data out of each system once and into each system once, significantly reducing the pipeline complexity and cost of operation. Jay Kreps, Kafka’s architect at LinkedIn, describes this familiar problem well in a blog post:

My own involvement in this started around 2008 after we had shipped our key-value store. My next project was to try to get a working Hadoop setup going, and move some of our recommendation processes there. Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy prediction algorithms. So began a long slog.


Diffs versus Flume

There is significant overlap in the functions of Flume and Kafka. Here are some considerations when evaluating the two systems.

• Kafka is very much a general-purpose system. You can have many producers and many consumers sharing multiple topics. In contrast, Flume is a special-purpose tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and it integrates with Hadoop’s security. As a result, Cloudera recommends using Kafka if the data will be consumed by multiple applications, and Flume if the data is designated for Hadoop.

• Those of you familiar with Flume know that Flume has many built-in sources and sinks. Kafka, however, has a significantly smaller producer and consumer ecosystem, and it is not well supported by the Kafka community. Hopefully this situation will improve in the future, but for now: Use Kafka if you are prepared to code your own producers and consumers. Use Flume if the existing Flume sources and sinks match your requirements and you prefer a system that can be set up without any development.

• Flume can process data in-flight using interceptors. These can be very useful for data masking or filtering. Kafka requires an external stream processing system for that.

• Both Kafka and Flume are reliable systems that with proper configuration can guarantee zero data loss. However, Flume does not replicate events. As a result, even when using the reliable file channel, if a node with a Flume agent crashes, you will lose access to the events in the channel until you recover the disks. Use Kafka if you need an ingest pipeline with very high availability.

• Flume and Kafka can work quite well together. If your design requires streaming data from Kafka to Hadoop, using a Flume agent with a Kafka source to read the data makes sense: You don’t have to implement your own consumer, you get all the benefits of Flume’s integration with HDFS and HBase, you have Cloudera Manager monitoring the consumer, and you can even add an interceptor and do some stream processing on the way.

Conclusion

As you can see, Kafka has a unique design that makes it very useful for solving a wide range of architectural challenges. It is important to make sure you use the right approach for your use case and use it correctly to ensure high throughput, low latency, high availability, and no loss of data.

Gwen Shapira is a Software Engineer at Cloudera, and a Kafka contributor. Jeff Holoman is a Systems Engineer at Cloudera.


How-to: Translate from MapReduce to Apache Spark
By Sean Owen (Sept. 2014)

The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.

Venerable MapReduce has been Apache Hadoop’s workhorse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing and batch-oriented ETL (extract-transform-load) operations.

As Hadoop’s usage has broadened, it has become clear that MapReduce is not the best framework for all computations. Hadoop has made room for alternative architectures by extracting resource management into its own first-class component, YARN. As a result, projects like Impala have been able to use new, specialized non-MapReduce architectures to add interactive SQL capability to the platform, for example.

Today, Apache Spark is another such alternative, and is said by many to succeed MapReduce as Hadoop’s general-purpose computation paradigm. But if MapReduce has been so useful, how can it suddenly be replaced? After all, there is still plenty of ETL-like work to be done on Hadoop, even if the platform now has other real-time capabilities as well.

Thankfully, it’s entirely possible to re-implement MapReduce-like computations in Spark. They can be simpler to maintain, and in some cases faster, thanks to Spark’s ability to optimize away spilling to disk. For MapReduce, re-implementation on Spark is a homecoming. Spark, after all, mimics Scala’s functional programming style and APIs. And the very idea of MapReduce comes from the functional programming language LISP.

Although Spark’s primary abstraction, the RDD (Resilient Distributed Dataset), plainly exposes map() and reduce() operations, these are not the direct analog of Hadoop’s Mapper or Reducer APIs. This is often a stumbling block for developers looking to move Mapper and Reducer classes to Spark equivalents.

Viewed in comparison with classic functional language implementations of map() and reduce() in Scala or Spark, the Mapper and Reducer APIs in Hadoop are actually both more flexible and, as a result, more complex.
These differences may not even be apparent to developers accustomed to MapReduce, but the following behaviors are specific to Hadoop’s implementation rather than the idea of MapReduce in the abstract:

• Mappers and Reducers always use key-value pairs as input and output.

• A Reducer reduces values per key only.

• A Mapper or Reducer may emit 0, 1 or more key-value pairs for every input.

• Mappers and Reducers may emit any arbitrary keys or values, not just subsets or transformations of those in the input.

• Mapper and Reducer objects have a lifecycle that spans many map() and reduce() calls. They support a setup() and cleanup() method, which can be used to take actions before or after a batch of records is processed.

This post will briefly demonstrate how to recreate each of these within Spark — and also show that it’s not necessarily desirable to literally translate a Mapper and Reducer!

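Key-Value Pairs as Tuples

Let’s say we need to compute the length of each line in a large text input, and report the count of lines by line length. In Hadoop MapReduce, this begins with a Mapper that produces key-value pairs in which the line length is the key, and a count of 1 is the value:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LineLengthMapper
        extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
      @Override
      protected void map(LongWritable lineNumber, Text line, Context context)
          throws IOException, InterruptedException {
        // Emit (line length, 1) for every input line.
        context.write(new IntWritable(line.getLength()), new IntWritable(1));
      }
    }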

It’s worth noting that Mappers and Reducers only operate on key-value pairs. So the input to LineLengthMapper, provided by a TextInputFormat, is actually a pair containing the line as value, with position within the file thrown in as a key, for fun. (It’s rarely used, but something has to be the key.)

The Spark equivalent is:

    lines.map(line => (line.length, 1))

In Spark, the input is an RDD of Strings only, not of key-value pairs. Spark’s representation of a key-value pair is a Scala tuple, created with the (a,b) syntax shown above. The result of the map() operation above is an RDD of (Int,Int) tuples. When an RDD contains tuples, it gains more methods, such as reduceByKey(), which will be essential to reproducing MapReduce behavior.


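Reducer and reduce() versus reduceByKey()

To produce a count of line lengths, it’s necessary to sum the counts per length in a Reducer:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LineLengthReducer
        extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
      @Override
      protected void reduce(IntWritable length, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        // Sum all the counts seen for this line length.
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        context.write(length, new IntWritable(sum));
      }
    }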

The equivalent of the Mapper and Reducer above together is a one-liner in Spark:

    val lengthCounts = lines.map(line => (line.length, 1)).reduceByKey(_ + _)

Spark’s RDD API has a reduce() method, but it will reduce the entire set of key-value pairs to one single value. This is not what Hadoop MapReduce does. Instead, Reducers reduce all values for a key and emit a key along with the reduced value. reduceByKey() is the closer analog. But that is not even the most direct equivalent in Spark; see groupByKey() below.

It is worth pointing out here that a Reducer’s reduce() method receives a stream of many values, and produces 0, 1 or more results. reduceByKey(), in contrast, accepts a function that turns exactly two values into exactly one: here, a simple addition function that maps two numbers to their sum. This associative function can be used to reduce many values to one for the caller. It is a simpler, narrower API for reducing values by key than what a Reducer exposes.
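The difference between reducing everything and reducing per key can be seen without a cluster. The sketch below uses plain Scala collections; the reduceByKey helper here is a hypothetical local stand-in written for this illustration, not Spark's implementation.

```scala
object ReduceVsReduceByKey {
  // Local stand-in for reduceByKey: merges values that share a key using a
  // function of exactly two values.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  val lines: Seq[String] = Seq("ab", "cd", "", "xyz", "")
  val pairs: Seq[(Int, Int)] = lines.map(line => (line.length, 1))

  // reduce() collapses the whole collection into one single value...
  val oneValue: (Int, Int) = pairs.reduce((a, b) => (a._1 + b._1, a._2 + b._2))

  // ...whereas reducing by key yields one value per distinct key.
  val perKey: Map[Int, Int] = reduceByKey(pairs)(_ + _)
}
```

Only perKey answers the original question (how many lines have each length); oneValue mixes all keys together.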

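Mapper and map() versus flatMap()

Now, instead consider counting the occurrences of only words beginning with an uppercase character. For each line of text in the input, a Mapper might emit 0, 1 or many key-value pairs:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountUppercaseMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      @Override
      protected void map(LongWritable lineNumber, Text line, Context context)
          throws IOException, InterruptedException {
        for (String word : line.toString().split(" ")) {
          // Skip empty tokens (blank lines, repeated spaces) before reading charAt(0).
          if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
            context.write(new Text(word), new IntWritable(1));
          }
        }
      }
    }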

The equivalent in Spark is:

    lines.flatMap(
      _.split(" ")
        // Guard against empty tokens before looking at the first character.
        .filter(word => word.nonEmpty && Character.isUpperCase(word(0)))
        .map(word => (word, 1))
    )

map() will not suffice here, because map() must produce exactly one output per input, but unlike before, one line needs to yield potentially many outputs. Again, the map() function in Spark is simpler and narrower compared to what the Mapper API supports.

The solution in Spark is to first map each line to an array of output values. The array may be empty, or have many values. Merely map()-ing lines to arrays would produce an RDD of arrays as the result, when the result should be the contents of those arrays. The result needs to be “flattened” afterward, and flatMap() does exactly this. Here, the array of words in the line is filtered and converted into tuples inside the function. In a case like this, it’s flatMap() that’s required to emulate such a Mapper, not map().
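The contrast between map() and flatMap() carries over directly to ordinary Scala collections, so it can be tried locally. This is a small sketch with made-up input lines; the helper name capitalized is hypothetical.

```scala
object MapVsFlatMap {
  val lines: Seq[String] = Seq("Hello big World", "no capitals here", "")

  // Zero, one or many (word, 1) pairs per line, as in the uppercase-word Mapper.
  def capitalized(line: String): Seq[(String, Int)] =
    line.split(" ").toSeq
      .filter(word => word.nonEmpty && Character.isUpperCase(word(0)))
      .map(word => (word, 1))

  // map() keeps one output per input: a collection of collections.
  val nested: Seq[Seq[(String, Int)]] = lines.map(line => capitalized(line))

  // flatMap() flattens those per-line results into a single collection of pairs.
  val flat: Seq[(String, Int)] = lines.flatMap(line => capitalized(line))
}
```

With map() the three input lines always produce three results, two of them empty; with flatMap() only the two matching words remain.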

groupByKey()

It’s simple to write a Reducer that then adds up the counts for each word, as before. And in Spark, again, reduceByKey() could be used to sum counts per word. But what if for some reason the output has to contain the word in all uppercase, along with a count? In MapReduce, that’s:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CountUppercaseReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        // Emit a transformed key (uppercased word) along with the total.
        context.write(new Text(word.toString().toUpperCase()), new IntWritable(sum));
      }
    }


But reduceByKey() by itself doesn’t quite work in Spark, since it preserves the original key. To emulate this in Spark, something even more like the Reducer API is needed. Recall that Reducer’s reduce() method receives a key and Iterable of values, and then emits some transformation of those. groupByKey() and a subsequent map() can achieve this:

    ... .groupByKey().map { case (word, ones) => (word.toUpperCase, ones.sum) }

groupByKey() merely collects all values for a key together, and does not apply a reduce function. From there, any transformation can be applied to the key and Iterable of values. Here, the key is transformed to uppercase, and the values are directly summed.

Be careful! groupByKey() works, but also collects all values for a key into memory. If a key is associated with many values, a worker could run out of memory. Although this is the most direct analog of a Reducer, it’s not necessarily the best choice in all cases. For example, we could have simply transformed the keys after a call to reduceByKey():

    ... .reduceByKey(_ + _).map { case (word, total) => (word.toUpperCase, total) }

It’s better to let Spark manage the reduction rather than ask it to collect all values just for us to manually sum them.
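The trade-off can be sketched on a local collection. Both routes below produce the same totals, but the first materializes every value for a key before summing, while the second only ever holds a running total per key. This is an illustration with plain Scala collections and made-up data, not Spark's execution model.

```scala
object GroupVsReduce {
  val pairs: Seq[(String, Int)] =
    Seq(("Spark", 1), ("Hadoop", 1), ("Spark", 1), ("Spark", 1))

  // groupByKey-style: all values for a key are collected first...
  val grouped: Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => word -> kvs.map(_._2) }

  // ...and only then summed, with the key transformed to uppercase.
  val viaGroup: Map[String, Int] =
    grouped.map { case (word, ones) => word.toUpperCase -> ones.sum }

  // reduceByKey-style: keep one running total per key, then transform the key.
  val viaReduce: Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }.map { case (word, total) => word.toUpperCase -> total }
}
```

For a key with millions of values, the grouped collection is what risks exhausting memory; the running total does not grow with the number of values.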

setup() and cleanup()

In MapReduce, a Mapper and Reducer can declare a setup() method, called before any input is processed, to perhaps allocate an expensive resource like a database connection, and a cleanup() method to release the resource:

    public class SetupCleanupMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private Connection dbConnection;

      @Override
      protected void setup(Context context) {
        dbConnection = ...;
      }

      ...

      @Override
      protected void cleanup(Context context) {
        dbConnection.close();
      }
    }


The Spark map() and flatMap() methods only operate on one input at a time though, and provide no means to execute code before or after transforming a batch of values. It looks possible to simply put the setup and cleanup code before and after a call to map() in Spark:

    val dbConnection = ...
    lines.map(... dbConnection.createStatement(...) ...)
    dbConnection.close() // Wrong!

However, this fails for several reasons:

• It puts the object dbConnection into the map function’s closure, which requires that it be serializable (for example, by implementing java.io.Serializable). An object like a database connection is generally not serializable.

• map() is a transformation, rather than an action, and is lazily evaluated. The connection can’t be closed immediately here.

• Even so, it would only close the connection on the driver, not necessarily freeing resources allocated by serialized copies.

In fact, neither map() nor flatMap() is the closest counterpart to a Mapper in Spark; it’s the important mapPartitions() method. This method does not map just one value to one other value, but rather maps an Iterator of values to an Iterator of other values. It’s like a “bulk map” method. This means that the mapPartitions() function can allocate resources locally at its start, and release them when done mapping many values.

Adding setup code is simple; adding cleanup code is harder because it remains difficult to detect when the transformed iterator has been fully evaluated. For example, this does not work:

    lines.mapPartitions { valueIterator =>
      val dbConnection = ... // OK
      val transformedIterator = valueIterator.map(... dbConnection ...)
      dbConnection.close() // Still wrong! May not have evaluated iterator
      transformedIterator
    }

A more complete formulation (HT Tobias Pfeiffer) is roughly:

    lines.mapPartitions { valueIterator =>
      if (valueIterator.isEmpty) {
        Iterator[...]()
      } else {
        val dbConnection = ...
        valueIterator.map { item =>
          val transformedItem = ...
          // Close the connection only after the last item has been transformed.
          if (!valueIterator.hasNext) {
            dbConnection.close()
          }
          transformedItem
        }
      }
    }

Although decidedly less elegant than previous translations, it can be done. There is no flatMapPartitions() method. However, the same effect can be achieved by calling mapPartitions(), followed by a call to flatMap(a => a) to flatten.

The equivalent of a Reducer with setup() and cleanup() is just a groupByKey() followed by a mapPartitions() call like the one above. Take note of the caveat about using groupByKey() above, though.
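The reason the early close() fails comes down to iterator laziness, which can be reproduced with a plain Scala Iterator. In this sketch, FakeConnection is a hypothetical stand-in for a real database connection, and the two methods mirror the "still wrong" and "more complete" formulations above.

```scala
object LazyCleanup {
  // Hypothetical stand-in for an expensive resource such as a database connection.
  final class FakeConnection {
    var closed = false
    def lookup(s: String): String = {
      require(!closed, "connection already closed")
      s.toUpperCase
    }
    def close(): Unit = { closed = true }
  }

  // Mirrors the failing version: close() runs before the lazy iterator is consumed.
  def closeTooEarly(values: Iterator[String], conn: FakeConnection): Iterator[String] = {
    val transformed = values.map(item => conn.lookup(item))
    conn.close()
    transformed
  }

  // Mirrors the fuller formulation: close only once the last item has been transformed.
  def closeWhenExhausted(values: Iterator[String], conn: FakeConnection): Iterator[String] =
    if (values.isEmpty) Iterator.empty
    else values.map { item =>
      val transformedItem = conn.lookup(item)
      if (!values.hasNext) conn.close()
      transformedItem
    }
}
```

Consuming the result of closeTooEarly fails on the first element because the connection is already closed, whereas closeWhenExhausted leaves the connection open until the final element has been produced.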

But Wait, There’s More

MapReduce developers will point out that there is yet more to the API that hasn’t been mentioned yet:

• MapReduce supports a special type of Reducer, called a Combiner, that can reduce shuffled data size from a Mapper.

• It also supports custom partitioning via a Partitioner, and custom grouping for purposes of the Reducer via a grouping Comparator.

• The Context objects give access to a Counter API for accumulating statistics.

• A Reducer always sees keys in sorted order within its lifecycle.

• MapReduce has its own Writable serialization scheme.

• Mappers and Reducers can emit multiple outputs at once.

• MapReduce alone has tens of tuning parameters.

There are ways to implement or port these concepts into Spark, using APIs like the Accumulator, methods like groupBy() and the partitioner argument in various of these methods, Java or Kryo serialization, caching, and more. To keep this post brief, the remainder will be left to a follow-up post.


The concepts in MapReduce haven’t stopped being useful. They just now have a different and potentially more powerful implementation on Hadoop, and in a functional language that better matches their functional roots. Understanding the differences between Spark’s RDD API and the original Mapper and Reducer APIs helps developers better understand how all of them truly work and how to use Spark’s counterparts to best advantage.

In Part 2 of this post (below), Juliet Hougland covers aggregation functionality, counters, partitioning, and serialization.

Sean Owen is Director of Data Science at Cloudera, an Apache Mahout committer/PMC member, and a Spark committer.


How-to: Translate from MapReduce to Apache Spark (Part 2)
By Juliet Hougland

The conclusion to this series covers Combiner-like aggregation functionality, counters, partitioning, and serialization.

Apache Spark is rising in popularity as an alternative to MapReduce, in large part due to its expressive API for complex data processing. To briefly reiterate, MapReduce was originally designed for batch Extract Transform Load (ETL) operations and massive log processing. MapReduce relies on processing key-value pairs in map and reduce phases. Each phase has the following actions:

• Map: Emits 0, 1, or more key-value pairs as output for every input.

• Shuffle: Groups key-value pairs with the same keys by shuffling data across the cluster’s network.

• Reduce: Operates on an iterable of values associated with each key, often performing some kind of aggregation.

To perform complex operations, many Map and Reduce phases must be strung together. As MapReduce became more popular, its limitations with respect to complex and iterative operations became clear.

Spark provides a processing API based around Resilient Distributed Datasets (RDDs). You can create an RDD by reading in a file and then specifying the sequence of operations you want to perform on it, like parsing records, grouping by a key, and averaging an associated value. Spark allows you to specify two different types of operations on RDDs: transformations and actions. Transformations describe how to transform one data collection into another. Examples of transformations include map, flatMap, and groupByKey. Actions require that the computation be performed, like writing output to a file or printing a variable to screen.

Spark uses a lazy computation model, which means that computation does not get triggered until an action is called. Calling an action on an RDD triggers all necessary transformations to be performed. This lazy evaluation allows Spark to smartly combine operations and optimize performance.
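The split between transformations and actions can be imitated locally with a lazy Scala view. The sketch below is only an analogy for the RDD model: the counter, the parse function and the input records are all hypothetical.

```scala
object LazyPipeline {
  // Counts how many records the "transformation" has actually touched.
  var parsed = 0

  def parse(record: String): Int = {
    parsed += 1
    record.trim.toInt
  }

  // Declaring the pipeline does no work: a view is lazy, much like a chain
  // of RDD transformations.
  val pipeline = Seq(" 1", "2 ", " 3 ").view.map(record => parse(record)).filter(_ % 2 == 1)

  // Forcing a result plays the role of an action.
  def runAction(): Int = pipeline.sum
}
```

Nothing is parsed until runAction() is called. Calling it a second time parses the records again, which loosely resembles an RDD being recomputed when it has not been cached.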


As an aid to the successful production deployment of a Spark cluster, in the rest of the blog post, we’ll explore how to reproduce functionality with which you may already be familiar from MapReduce in Spark. Specifically, we will cover combiner-like aggregation functionality, partitioning data, counter-like functionality, and the pluggable serialization frameworks involved.

reduceByKey vs Combiner

This simple Mapper featured in Part 1:

    public class LineLengthMapper
        extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
      @Override
      protected void map(LongWritable lineNumber, Text line, Context context)
          throws IOException, InterruptedException {
        context.write(new IntWritable(line.getLength()), new IntWritable(1));
      }
    }

…is part of a job that counts lines of text by their length. It’s simple, but inefficient: The Mapper writes a length and count of 1 for every line, which is then written to disk and shuffled across the network, just to be added up on the Reducer. If there are a million empty lines, then a million records representing “length 0, count 1” will be copied across the network, just to be collapsed into “length 0, count 1000000” by a Reducer like the one also presented last time:

    public class LineLengthReducer
        extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
      @Override
      protected void reduce(IntWritable length, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        context.write(length, new IntWritable(sum));
      }
    }

For this reason, MapReduce has the notion of a Combiner. A Combiner is an optimization that works like a Reducer—in fact, it must be an implementation of Reducer—that can combine multiple records on the Mapper side, before those records are written anywhere. It functions like a miniature Reducer preceding the shuffle. A Combiner must be commutative and associative, which means that its result must be the same no matter the order in which it combines records. In fact, LineLengthReducer itself could be applied as the Combiner in this MapReduce job, as well as being the Reducer.
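The saving a Combiner brings can be estimated with a small local model. In the sketch below, each inner sequence stands for the lines seen by one Mapper, and the "shuffle" is simply the collection of records that would leave the Mappers; the data and names are hypothetical.

```scala
object CombinerSketch {
  // Each inner Seq models the lines read by one Mapper (one input split).
  val splits: Seq[Seq[String]] = Seq(Seq("", "", "ab"), Seq("", "cd", ""))

  // Without a Combiner: one (length, 1) record per line crosses the shuffle.
  val shuffledRaw: Seq[(Int, Int)] =
    splits.flatMap(split => split.map(line => (line.length, 1)))

  // With a Combiner: each split pre-sums its own counts before the shuffle.
  def combine(records: Seq[(Int, Int)]): Seq[(Int, Int)] =
    records.groupBy(_._1).map { case (len, rs) => (len, rs.map(_._2).sum) }.toSeq

  val shuffledCombined: Seq[(Int, Int)] =
    splits.flatMap(split => combine(split.map(line => (line.length, 1))))

  // The Reducer applies the same sum in both cases.
  def reduceAll(records: Seq[(Int, Int)]): Map[Int, Int] =
    records.groupBy(_._1).map { case (len, rs) => len -> rs.map(_._2).sum }
}
```

Here six raw records shrink to four after combining, and the final counts are identical. With a million empty lines per split, the same logic would shuffle one record per split for length 0 instead of a million.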


Back to Spark. A terse and literal (but certainly not optimal) translation of LineLengthMapper and LineLengthReducer is:

    linesRDD.mapPartitions { lines =>
      lines.map(line => (line.length, 1))
    }.groupByKey().mapValues(_.sum)

The Mapper corresponds to mapPartitions, the shuffle to groupByKey, and the Reducer to the mapValues call. A likewise literal translation of the Combiner would inject its logic at the end of the Mapper’s analog, mapPartitions:

    linesRDD.mapPartitions { lines =>
      val mapResult = lines.map(line => (line.length, 1))
      // Combiner step: sum the counts per length within this partition.
      mapResult.toSeq.groupBy(_._1).mapValues(_.map(_._2).sum).iterator
    }.groupByKey().mapValues(_.sum)

The new code uses Scala’s Collections API; these are not Spark operations. As mentioned previously, the new code actually implements exactly the same logic. It’s easy to see the resemblance in the expression of the two, since Spark mimics many of Scala’s APIs.

Still, it’s clunky. The essence of the operation is summing counts, and to know how to sum many counts it’s only necessary to know how to sum two counts, and apply that over and over until just one value is left. This is what a true reduce operation does: from a function that makes two values into one, it makes many values into one.

In fact, if just given the reduce function, Spark can intelligently apply it so as to get the effect of the Combiner and Reducer above all at once:

    linesRDD.mapPartitions { lines =>
      // Return the mapped iterator directly so mapPartitions has a result.
      lines.map(line => (line.length, 1))
    }.reduceByKey(_ + _)

_ + _ is shorthand for a function of two arguments that returns their sum. This is a far more common way of expressing this operation in Spark, and under the hood, Spark will be able to apply the reduce function before and after a shuffle automatically. In fact, without the need to express the Combiner’s counterpart directly in the code, it’s also no longer necessary to express how to map an entire partition with mapPartitions, since it’s implied by expressing how to map an element at a time:


    linesRDD.map(line => (line.length, 1)).reduceByKey(_ + _)

The upshot is that, when using Spark, you’re often automatically using the equivalent of a Combiner. For the interested, a few further notes:

• reduceByKey is built on a more general operation in Spark, called combineByKey, which allows values to be transformed at the same time.

• For those who really are counting values, there is an even more direct Spark method for this, countByValue:

      linesRDD.map(_.length).countByValue()

• And if speed is more important than accuracy, there is a much faster approximate version that relies on the HyperLogLog algorithm:

Partitioning and Grouping Data

Both Spark and MapReduce support partitioning of key-value data by key. How the processing framework splits data into chunks, and in turn into tasks, has a large effect on the performance of common data operations like joining disparate data sets or doing per-key aggregations.

In MapReduce, you can specify a partitioner that determines how key-value pairs are split up and organized amongst the reducers. A well-designed partitioner will distribute the records approximately evenly between the reducers. Both MapReduce and Spark use hash partitioning as their default partitioning strategy, though there are separate implementations for MapReduce and Spark. Hash partitioning works by assigning pairs to partitions based on the hash value of the key. In both MapReduce and Spark, the partition a key-value pair is assigned to is the result of the key’s hashCode() method modulo the number of partitions you are creating. The hope is that the hashing function will evenly distribute your keys in the hash-space, so that you end up with approximately evenly distributed data between partitions.

A common issue in distributed programs with per-key aggregation is seeing a long tail in the distribution of the number of records assigned to reducers, and having “straggler” reducers that take much more time to complete than the rest. You can often resolve this problem by specifying a different, potentially custom partitioner. To do this in MapReduce, you can define your own custom partitioner by extending Partitioner and specifying your custom partitioner class in the job configuration. This can be done in the configuration file, or programmatically with conf.setPartitionerClass(MyPartitioner.class).

In Spark, there are operations that benefit from partitioning as well as operations that can modify partitioning. The following table explains what types of transformations can affect partitioning and how.
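The hash-partitioning rule described in this section (hashCode() modulo the number of partitions) can be sketched in plain Scala, with no Spark or Hadoop dependency. The object and method names here are illustrative, not part of either framework, and the remainder is adjusted to be non-negative because hashCode() can return a negative number.

```scala
object HashPartitionSketch {
  /** Remainder that is never negative, since hashCode() may be negative. */
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  /** Partition index a key is assigned to under hash partitioning. */
  def partitionFor(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  /** Number of records that land in each partition (empty partitions are omitted). */
  def partitionSizes[K](keys: Seq[K], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(k => partitionFor(k, numPartitions)).map { case (p, ks) => (p, ks.size) }
}
```

Calling partitionSizes on a key set with one very frequent key shows the long-tail effect described above: the partition that owns the hot key receives most of the records, no matter how well the hash function spreads the remaining keys.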


Counters

MapReduce allows you to count things that happen in your job, and then query that count later. To define custom counters in MapReduce, you first need to define an Enum that describes the counters you will track. Imagine you are using Jackson (org.codehaus.jackson) to parse JSON into a Plain Old Java Object (POJO) using a Jackson ObjectMapper. In doing so, you may encounter a JsonParseException or JsonMappingException, and you would like to track how many of each you see. So, you will create an enum that contains an element for both of these possible exceptions:


public static enum JsonErr { PARSE_ERROR, MAPPING_ERROR }

Then, in the map method of your map class you would have:

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      try {
        /* Parse a string into a POJO */
      } catch (JsonParseException parseException) {
        context.getCounter(JsonErr.PARSE_ERROR).increment(1L);
      } catch (JsonMappingException mappingException) {
        context.getCounter(JsonErr.MAPPING_ERROR).increment(1L);
      }
    }

All counters that get incremented during the course of a job will be reported to the JobTracker and displayed in the JobTracker Web UI, along with the default I/O counters. You can also access the counters from the MapReduce driver program, from the Job you create using your Configuration.

Spark exposes Accumulators, which can be used as counters, but more generally support any commutative and associative operation. Thus, you can go beyond incrementing and decrementing by integers toward summing arbitrary floating-point numbers or, even better, actually collecting samples of parsing errors you encounter. If you were to do a literal translation of this parsing-error count to Spark it would look like:

    /** Define accumulators. */
    val parseErrorAccum = sc.accumulator(0, "JsonParseException")
    val mappingErrorAccum = sc.accumulator(0, "JsonMappingException")

    /** Define a function for parsing records and incrementing accumulators when exceptions are thrown. */
    def parse(line: String) = {
      try {
        /* Parse a String into a POJO */
      } catch {
        case e: JsonParseException => parseErrorAccum += 1
        case e: JsonMappingException => mappingErrorAccum += 1
      }
    }

    /** Parse records in a transformation. */
    val parsedRdd = unparsedRdd.map(parse)  // unparsedRdd: assumed name for the RDD of raw input lines

While there does not currently exist a good way to count while performing a transformation, Spark’s Accumulators do provide useful functionality for creating samples of parsing errors. This is very notably something that is more difficult to do in MapReduce. An alternative and useful strategy to take is to instead use reservoir sampling to create a sample of error messages associated with parsing errors.
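As a rough illustration of the reservoir-sampling strategy mentioned above, here is a minimal plain-Scala sketch of the classic "Algorithm R". It is not taken from the original post: the object name, the method signature, and the idea of feeding it error messages are all assumptions made for the example.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ReservoirSketch {
  /** Keep a uniform random sample of at most k items from a stream of unknown length. */
  def sample[T](items: Iterator[T], k: Int, rng: Random): Vector[T] = {
    val reservoir = ArrayBuffer.empty[T]
    var seen = 0
    items.foreach { item =>
      seen += 1
      if (reservoir.size < k) {
        // Fill the reservoir with the first k items.
        reservoir += item
      } else {
        // Afterwards, the new item replaces a random slot with probability k / seen.
        val j = rng.nextInt(seen)
        if (j < k) reservoir(j) = item
      }
    }
    reservoir.toVector
  }
}
```

Because the reservoir never grows beyond k entries, the memory needed for the sample stays fixed however many parsing errors occur.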

Important Caveat About Accumulators

Now, you should be careful about how and when you use Accumulators. In MapReduce, increment actions on a counter executed during a task that later fails will not be counted toward the final value. MapReduce is careful to count correctly even when tasks fail or speculative execution occurs. In Spark, the behavior of accumulators requires careful attention. It is strongly recommended that accumulators only be used in an action. Accumulators incremented in an action are guaranteed to only be incremented once. Accumulators incremented in a transformation can have their values incremented multiple times if a task or job stage is ever rerun, which is unexpected behavior for most users.

In the example below, an RDD is created and then mapped over while an accumulator is incremented. Since Spark uses a lazy evaluation model, these RDDs are only computed once an action is invoked and a value is required to be returned. Calling another action on myRdd2 requires that the preceding steps in the workflow are recomputed, incrementing the accumulator again.

    scala> val myrdd = sc.parallelize(List(1,2,3))
    myrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

    scala> val myaccum = sc.accumulator(0, "accum")
    myaccum: org.apache.spark.Accumulator[Int] = 0

    scala> val myRdd2 = myrdd.map(x => myaccum += 1)
    myRdd2: org.apache.spark.rdd.RDD[Unit] = MappedRDD[1] at map at <console>:16

    scala> myRdd2.collect
    res0: Array[Unit] = Array((), (), ())

    scala> myaccum
    res1: org.apache.spark.Accumulator[Int] = 3

    scala> myRdd2.collect
    res2: Array[Unit] = Array((), (), ())

    scala> myaccum
    res3: org.apache.spark.Accumulator[Int] = 6

Beyond situations where Spark’s lazy evaluation model causes transformations to be reapplied and accumulators incremented, it is possible that tasks getting rerun because of a partial failure will cause accumulators to be incremented again. The semantics of accumulators are distinctly not a once-and-only-once (aka counting) model.
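The double-counting effect does not depend on Spark itself. A lazy Scala collection view behaves analogously, which makes for a small runnable illustration. This is only an analogy under that assumption: a local variable stands in for the accumulator and a view stands in for the lazily evaluated RDD.

```scala
object LazyRecomputeSketch {
  /** Returns the counter value after the first and after the second traversal. */
  def run(): (Int, Int) = {
    var counter = 0
    // A view is lazy: the function passed to map runs on every traversal, not just once.
    val mapped = List(1, 2, 3).view.map { x => counter += 1; x }
    mapped.toList            // first "action": forces the computation
    val afterFirst = counter
    mapped.toList            // second "action": recomputes and increments again
    val afterSecond = counter
    (afterFirst, afterSecond)
  }
}
```

The side effect lives inside the lazily evaluated function, so each traversal repeats it, just as each action repeats the increments in the transcript above.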


The problem with the example of counting parsing errors above is that it is possible for the job to complete successfully with no explicit errors, but the numeric results may not be valid—and it would be difficult to know either way. One of the most common uses of counters in MapReduce is parsing records and simultaneously counting errors, and unfortunately there is no way to reliably count using accumulators in Spark.

Serialization Frameworks

MapReduce and Spark both need to be able to take objects in the JVM and serialize them into a binary representation to be sent across the network when shuffling data. MapReduce uses a pluggable serialization framework, which allows users to specify their own implementation(s) of org.apache.hadoop.io.serializer.Serialization by setting io.serializations in the Hadoop configuration if they wish to use a custom serializer. Hadoop Writable and Avro specific and reflection-based serializers are configured as the default supported serializations.

Similarly, Spark has a pluggable serialization system that can be configured by setting the spark.serializer variable in the Spark configuration to a class that extends org.apache.spark.serializer.Serializer. By default, Spark uses Java Serialization, which works out of the box but is not as fast as other serialization methods. Spark can be configured to use the much faster Kryo Serialization protocol by setting spark.serializer to org.apache.spark.serializer.KryoSerializer and setting spark.kryo.registrator to the class of your own custom registrator, if you have one. In order to get the best performance out of Kryo, you should register the classes with a KryoRegistrator ahead of time, and configure Spark to use your particular Kryo registrator.

If you wanted to use Kryo for serialization and register a User class for speed, you would define your registrator like this:

    import com.esotericsoftware.kryo.Kryo
    import com.mycompany.model.User
    import org.apache.spark.serializer.KryoRegistrator

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[User])
      }
    }

You would then set spark.serializer to org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator to com.mycompany.myproject.MyKryoRegistrator. It is worth noting that if you are working with Avro objects, you will also need to specify the AvroSerializer class to serialize and deserialize. You would modify our Registrator code like so:

    import com.esotericsoftware.kryo.Kryo
    import com.mycompany.model.UserAvroRecord
    import org.apache.spark.serializer.KryoRegistrator
    // AvroSerializer must also be imported; the source does not show which package provides it.

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[UserAvroRecord], new AvroSerializer[UserAvroRecord]())
      }
    }

Note: while the data sent across the network using Spark will be serialized with the serializer you specify in the configuration, the closures of tasks will be serialized with Java serialization. This means anything in the closures of your tasks must be serializable, or you will get a “Task not serializable” exception. For Spark to operate on the data in your RDD it must be able to serialize the function you specify in map, flatMap, or combineByKey on the driver node, ship that serialized function to the worker nodes, deserialize it on the worker nodes, and execute it on the data. This is always done with Java Serialization, which means you can’t easily have Avro objects in the closure of a function in Spark because Avro objects have not been Java serializable up until version 1.8.
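The Java-serialization requirement can be checked outside Spark with nothing but the JDK. The sketch below uses hypothetical class names and simply tries to write an object with ObjectOutputStream, the same mechanism Java serialization relies on, then reports whether that succeeds.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureSerializationSketch {
  class PlainParser                                 // does not extend Serializable
  class SerializableParser extends Serializable     // safe to capture in a task closure

  /** True if Java serialization accepts the object, false if it throws NotSerializableException. */
  def javaSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

A check like this on any helper object you plan to reference inside map or flatMap can surface the problem before a job is submitted. It only tests the object passed in, so it does not replace running the job.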

Conclusion As you hopefully observed, there are similarities but also important differences between MapReduce and Spark with respect to combiner-like aggregation functionality, partitioning, counters, and pluggable serialization frameworks. Understanding these nuances can help ensure that your Spark deployment is a long-term success. Juliet Hougland is a Data Scientist at Cloudera.


How-to: Tune Your Apache Spark Jobs (Part 1) By  Sandy  Ryza  (March  2015)  

Learn techniques for tuning your Apache Spark jobs for optimal efficiency.

When you write Apache Spark code and page through the public APIs, you come across words like transformation, action, and RDD. Understanding Spark at this level is vital for writing Spark programs. Similarly, when things start to fail, or when you venture into the web UI to try to understand why your application is taking so long, you’re confronted with a new vocabulary of words like job, stage, and task. Understanding Spark at this level is vital for writing good Spark programs, and of course by good, I mean fast. To write a Spark program that will execute efficiently, it is very, very helpful to understand Spark’s underlying execution model.

In this post, you’ll learn the basics of how Spark programs are actually executed on a cluster. Then, you’ll get some practical recommendations about what Spark’s execution model means for writing efficient programs.

How Spark Executes Your Program A Spark application consists of a single driver process and a set of executor processes scattered across nodes on the cluster. The driver is the process that is in charge of the high-level control flow of work that needs to be done. The executor processes are responsible for executing this work, in the form of tasks, as well as for storing any data that the user chooses to cache. Both the driver and the executors typically stick around for the entire time the application is running, although dynamic resource allocation changes that for the latter. A single executor has a number of slots for running tasks, and will run many concurrently throughout its lifetime. Deploying these processes on the cluster is up to the cluster manager in use (YARN, Mesos, or Spark Standalone), but the driver and executor themselves exist in every Spark application.


At the top of the execution hierarchy are jobs. Invoking an action inside a Spark application triggers the launch of a Spark job to fulfill it. To decide what this job looks like, Spark examines the graph of RDDs on which that action depends and formulates an execution plan. This plan starts with the farthest-back RDDs (that is, those that depend on no other RDDs or reference already-cached data) and culminates in the final RDD required to produce the action’s results.

The execution plan consists of assembling the job’s transformations into stages. A stage corresponds to a collection of tasks that all execute the same code, each on a different subset of the data. Each stage contains a sequence of transformations that can be completed without shuffling the full data.

What determines whether data needs to be shuffled? Recall that an RDD comprises a fixed number of partitions, each of which comprises a number of records. For the RDDs returned by so-called narrow transformations like map and filter, the records required to compute the records in a single partition reside in a single partition in the parent RDD. Each object is only dependent on a single object in the parent. Operations like coalesce can result in a task processing multiple input partitions, but the transformation is still considered narrow because the input records used to compute any single output record can still only reside in a limited subset of the partitions.

However, Spark also supports transformations with wide dependencies such as groupByKey and reduceByKey. In these dependencies, the data required to compute the records in a single partition may reside in many partitions of the parent RDD. All of the tuples with the same key must end up in the same partition, processed by the same task. To satisfy these operations, Spark must execute a shuffle, which transfers data around the cluster and results in a new stage with a new set of partitions.
For example, consider the following code:

    sc.textFile("someFile.txt").
      map(mapFunc).
      flatMap(flatMapFunc).
      filter(filterFunc).
      count()

It executes a single action, which depends on a sequence of transformations on an RDD derived from a text file. This code would execute in a single stage, because none of the outputs of these three operations depend on data that can come from different partitions than their inputs.

In contrast, this code finds how many times each character appears in all the words that appear at least 1,000 times in a text file.

    val tokenized = sc.textFile(args(0)).flatMap(_.split(' '))
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
    val filtered = wordCounts.filter(_._2 >= 1000)
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).
      reduceByKey(_ + _)
    charCounts.collect()

This process would break down into three stages. The reduceByKey operations result in stage boundaries, because computing their outputs requires repartitioning the data by keys. Here is a more complicated transformation graph including a join transformation with multiple dependencies.

The pink boxes show the resulting stage graph used to execute it.
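To make the three-stage breakdown of the character-count job easier to follow, here is a local, plain-Scala imitation of it. It is only a sketch. Ordinary collections stand in for RDDs, the reduceByKey helper is a local stand-in rather than Spark’s method, and the comments mark where Spark would place the stage boundaries.

```scala
object CharCountSketch {
  /** Local stand-in for reduceByKey(_ + _): sum the values for each key. */
  def reduceByKey[K](pairs: Seq[(K, Int)]): Map[K, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  def charCounts(lines: Seq[String], minCount: Int): Map[Char, Int] = {
    // Stage 1: narrow transformations (flatMap, map) up to the first shuffle.
    val tokenized = lines.flatMap(_.split(' '))
    val wordCounts = reduceByKey(tokenized.map((_, 1)))   // shuffle boundary 1

    // Stage 2: filter, flatMap and map are narrow; the second reduceByKey shuffles again.
    val filtered = wordCounts.filter(_._2 >= minCount)
    val chars = filtered.toSeq.flatMap(_._1.toList).map((_, 1))

    // Stage 3 begins after shuffle boundary 2 and produces the final counts.
    reduceByKey(chars)
  }
}
```

Note that, as in the Spark version, characters are counted over the distinct words that pass the filter, not over every occurrence of those words.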


At each stage boundary, data is written to disk by tasks in the parent stages and then fetched over the network by tasks in the child stage. Because they incur heavy disk and network I/O, stage boundaries can be expensive and should be avoided when possible. The number of data partitions in the parent stage may be different than the number of partitions in the child stage. Transformations that may trigger a stage boundary typically accept a numPartitions argument that determines how many partitions to split the data into in the child stage. Just as the number of reducers is an important parameter in tuning MapReduce jobs, tuning the number of partitions at stage boundaries can often make or break an application’s performance. We’ll delve deeper into how to tune this number in a later section.

Picking the Right Operators

When trying to accomplish something with Spark, a developer can usually choose from many arrangements of actions and transformations that will produce the same results. However, not all these arrangements will result in the same performance: avoiding common pitfalls and picking the right arrangement can make a world of difference in an application’s performance. A few rules and insights will help you orient yourself when these choices come up.

Recent work in SPARK-5097 began stabilizing SchemaRDD, which will open up Spark’s Catalyst optimizer to programmers using Spark’s core APIs, allowing Spark to make some higher-level choices about which operators to use. When SchemaRDD becomes a stable component, users will be shielded from needing to make some of these decisions.

The primary goal when choosing an arrangement of operators is to reduce the number of shuffles and the amount of data shuffled. This is because shuffles are fairly expensive operations; all shuffle data must be written to disk and then transferred over the network. repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles. Not all these operations are equal, however, and a few of the most common performance pitfalls for novice Spark developers arise from picking the wrong one:

•   Avoid groupByKey when performing an associative reductive operation. For example, rdd.groupByKey().mapValues(_.sum) will produce the same results as rdd.reduceByKey(_ + _). However, the former will transfer the entire dataset across the network, while the latter will compute local sums for each key in each partition and combine those local sums into larger sums after shuffling.
•   Avoid reduceByKey when the input and output value types are different. For example, consider writing a transformation that finds all the unique strings corresponding to each key. One way would be to use map to transform each element into a Set and then combine the Sets with reduceByKey:

    rdd.map(kv => (kv._1, Set(kv._2)))
      .reduceByKey(_ ++ _)

This code results in tons of unnecessary object creation because a new set must be allocated for each record. It’s better to use aggregateByKey, which performs the map-side aggregation more efficiently:

    val zero = collection.mutable.Set[String]()
    rdd.aggregateByKey(zero)(
      (set, v) => set += v,
      (set1, set2) => set1 ++= set2)

•   Avoid the flatMap-join-groupBy pattern. When two datasets are already grouped by key and you want to join them and keep them grouped, you can just use cogroup. That avoids all the overhead associated with unpacking and repacking the groups.
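The difference between the groupByKey and reduceByKey approaches in the first rule above can be quantified with a small simulation. This is a sketch in plain Scala under simplifying assumptions: each inner Seq plays the role of one partition, and "records shuffled" is just the number of key-value pairs that would leave their partition. The names are illustrative only.

```scala
object ShuffleVolumeSketch {
  type Partition = Seq[(String, Int)]

  /** groupByKey-style: every record crosses the shuffle unchanged. */
  def recordsShuffledWithoutCombine(parts: Seq[Partition]): Int =
    parts.map(_.size).sum

  /** reduceByKey-style: sum per key inside each partition first, leaving one record per key per partition. */
  def mapSideCombine(parts: Seq[Partition]): Seq[Partition] =
    parts.map(p => p.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq)

  /** Final per-key sums after the (simulated) shuffle. */
  def reduceSide(parts: Seq[Partition]): Map[String, Int] =
    parts.flatten.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
}
```

Both paths yield the same sums, but the combined path moves only as many records as there are distinct keys in each partition, which is where the savings come from on data with many repeated keys.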

When Shuffles Don’t Happen

It’s also useful to be aware of the cases in which the above transformations will not result in shuffles. Spark knows to avoid a shuffle when a previous transformation has already partitioned the data according to the same partitioner. Consider the following flow:

    rdd1 = someRdd.reduceByKey(...)
    rdd2 = someOtherRdd.reduceByKey(...)
    rdd3 = rdd1.join(rdd2)

Because no partitioner is passed to reduceByKey, the default partitioner will be used, resulting in rdd1 and rdd2 both being hash-partitioned. These two reduceByKey calls will result in two shuffles. If the RDDs have the same number of partitions, the join will require no additional shuffling. Because the RDDs are partitioned identically, the set of keys in any single partition of rdd1 can only show up in a single partition of rdd2. Therefore, the contents of any single output partition of rdd3 will depend only on the contents of a single partition in rdd1 and a single partition in rdd2, and a third shuffle is not required. For example, if someRdd has four partitions, someOtherRdd has two partitions, and both the reduceByKey calls use three partitions, the set of tasks that execute would look like:


What if rdd1 and rdd2 use different partitioners or use the default (hash) partitioner with different numbers of partitions? In that case, only one of the RDDs (the one with fewer partitions) will need to be reshuffled for the join. Same transformations, same inputs, different number of partitions:

One way to avoid shuffles when joining two datasets is to take advantage of broadcast variables. When one of the datasets is small enough to fit in memory in a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor. A map transformation can then reference the hash table to do lookups.
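The lookup-based join described above can be sketched without Spark. In this plain-Scala illustration an ordinary Map plays the role of the broadcast hash table and a Seq plays the role of the large dataset. The function name and the inner-join behavior (unmatched keys are dropped) are choices made for the example.

```scala
object BroadcastJoinSketch {
  /** Join each large-side record against an in-memory lookup table; records with no match are dropped. */
  def mapSideJoin[K, A, B](large: Seq[(K, A)], small: Map[K, B]): Seq[(K, (A, B))] =
    large.flatMap { case (k, a) => small.get(k).map(b => (k, (a, b))) }
}
```

Each record is resolved with a local hash lookup, so no data from the large side has to be repartitioned by key. The cost is that every executor must hold a full copy of the small table.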

When More Shuffles are Better

There is an occasional exception to the rule of minimizing the number of shuffles. An extra shuffle can be advantageous to performance when it increases parallelism. For example, if your data arrives in a few large unsplittable files, the partitioning dictated by the InputFormat might place large numbers of records in each partition, while not generating enough partitions to take advantage of all the available cores. In this case, invoking repartition with a high number of partitions (which will trigger a shuffle) after loading the data will allow the operations that come after it to leverage more of the cluster’s CPU.

Another instance of this exception can arise when using the reduce or aggregate action to aggregate data into the driver. When aggregating over a high number of partitions, the computation can quickly become bottlenecked on a single thread in the driver merging all the results together. To loosen the load on the driver, one can first use reduceByKey or aggregateByKey to carry out a round of distributed aggregation that divides the dataset into a smaller number of partitions. The values within each partition are merged with each other in parallel, before sending their results to the driver for a final round of aggregation. Take a look at treeReduce and treeAggregate for examples of how to do that. (Note that in 1.2, the most recent version at the time of this writing, these are marked as developer APIs, but SPARK-5430 seeks to add stable versions of them in core.)

This trick is especially useful when the aggregation is already grouped by a key. For example, consider an app that wants to count the occurrences of each word in a corpus and pull the results into the driver as a map. One approach, which can be accomplished with the aggregate action, is to compute a local map at each partition and then merge the maps at the driver. The alternative approach, which can be accomplished with aggregateByKey, is to perform the count in a fully distributed way, and then simply collectAsMap the results to the driver.
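The idea of adding an intermediate round of aggregation before the driver can be modeled in a few lines of plain Scala. This is only a sketch of the principle, not of how treeAggregate is implemented. The per-partition word counts are ordinary Maps, and groupSize is an assumed tuning knob.

```scala
object TwoRoundAggregationSketch {
  type Counts = Map[String, Int]

  /** Merge two word-count maps by summing counts for shared keys. */
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }

  /** One round: the "driver" merges every per-partition result itself. */
  def driverOnly(partials: Seq[Counts]): Counts =
    partials.foldLeft(Map.empty[String, Int])(merge)

  /** Two rounds: merge groups of partials first (work that could run in parallel), then merge the group results. */
  def twoRounds(partials: Seq[Counts], groupSize: Int): Counts = {
    val intermediate = partials.grouped(groupSize).map(_.reduce(merge)).toSeq
    intermediate.foldLeft(Map.empty[String, Int])(merge)
  }
}
```

With N partial results and groups of size g, the final single-threaded step merges roughly N / g maps instead of N, which is the load reduction on the driver that the text describes.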

Secondary Sort Another important capability to be aware of is the repartitionAndSortWithinPartitions transformation. It’s a transformation that sounds arcane, but seems to come up in all sorts of strange situations. This transformation pushes sorting down into the shuffle machinery, where large amounts of data can be spilled efficiently and sorting can be combined with other operations. For example, Apache Hive on Spark uses this transformation inside its join implementation. It also acts as a vital building block in the secondary sort pattern, in which you want to both group records by key and then, when iterating over the values that correspond to a key, have them show up in a particular order. This issue comes up in algorithms that need to group events by user and then analyze the events for each user based on the order they occurred in time. Taking advantage of repartitionAndSortWithinPartitions to do secondary sort currently requires a bit of legwork on the part of the user, but SPARK-3655 will simplify things vastly.
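The secondary sort pattern itself can be illustrated on local collections. The sketch below is plain Scala, not the Spark API: the Event type, the hash-based partition choice, and both function names are assumptions for the example. It partitions by user only, sorts each partition by (user, time), and then walks a sorted partition to recover each user’s events in time order.

```scala
object SecondarySortSketch {
  final case class Event(user: String, time: Long, action: String)

  /** Partition by user only, then sort each partition by the composite (user, time) key. */
  def repartitionAndSort(events: Seq[Event], numPartitions: Int): Seq[Seq[Event]] = {
    def partitionOf(user: String): Int = {
      val raw = user.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }
    val byPartition = events.groupBy(e => partitionOf(e.user))
    (0 until numPartitions).map { p =>
      byPartition.getOrElse(p, Seq.empty).sortBy(e => (e.user, e.time))
    }
  }

  /** Walk one sorted partition; consecutive events with the same user form that user's time-ordered group. */
  def actionsPerUser(sortedPartition: Seq[Event]): Seq[(String, Seq[String])] =
    sortedPartition.foldLeft(Vector.empty[(String, Vector[String])]) { (acc, e) =>
      acc.lastOption match {
        case Some((user, actions)) if user == e.user => acc.init :+ ((user, actions :+ e.action))
        case _ => acc :+ ((e.user, Vector(e.action)))
      }
    }
}
```

The key design point is that the partitioner looks only at the user while the sort uses the full (user, time) key, so all of a user’s events land in one partition and arrive already ordered.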

Conclusion

You should now have a good understanding of the basic factors involved in creating a performance-efficient Spark program! In Part 2, we’ll cover tuning resource requests, parallelism, and data structures.


Sandy Ryza is a Data Scientist at Cloudera, an Apache Spark committer, and an Apache Hadoop PMC member. He is a co-author of the O’Reilly Media book, Advanced Analytics with Spark.


How-to: Tune Your Apache Spark Jobs (Part 2) By  Sandy  Ryza  (March  2015)    

In the conclusion to this series, learn how resource tuning, parallelism, and data representation affect Spark job performance.

In this post, we’ll finish what we started in Part 1, above. I’ll try to cover pretty much everything you could care to know about making a Spark program run fast. In particular, you’ll learn about resource tuning, or configuring Spark to take advantage of everything the cluster has to offer. Then we’ll move to tuning parallelism, the most difficult as well as most important parameter in job performance. Finally, you’ll learn about representing the data itself, in the on-disk form which Spark will read (spoiler alert: use Apache Avro or Apache Parquet) as well as the in-memory format it takes as it’s cached or moves through the system.

Tuning Resource Allocation

The Spark user list is a litany of questions to the effect of “I have a 500-node cluster, but when I run my application, I see only two tasks executing at a time. HALP.” Given the number of parameters that control Spark’s resource utilization, these questions aren’t unfair, but in this section you’ll learn how to squeeze every last bit of juice out of your cluster. The recommendations and configurations here differ a little bit between Spark’s cluster managers (YARN, Mesos, and Spark Standalone), but we’re going to focus only on YARN, which Cloudera recommends to all users. For some background on what it looks like to run Spark on YARN, check out my post on this topic.

The two main resources that Spark (and YARN) think about are CPU and memory. Disk and network I/O, of course, play a part in Spark performance as well, but neither Spark nor YARN currently does anything to actively manage them.

Every Spark executor in an application has the same fixed number of cores and same fixed heap size. The number of cores can be specified with the --executor-cores flag when invoking spark-submit, spark-shell, and pyspark from the command line, or by setting the spark.executor.cores property in the spark-defaults.conf file or on a SparkConf object. Similarly, the heap size can be controlled with the --executor-memory flag or the spark.executor.memory property. The cores property controls the number of concurrent tasks an executor can run. --executor-cores 5 means that each executor can run a maximum of five tasks at the same time. The memory property impacts the amount of data Spark can cache, as well as the maximum sizes of the shuffle data structures used for grouping, aggregations, and joins.


The --num-executors command-line flag or spark.executor.instances configuration property controls the number of executors requested. Starting in CDH 5.4/Spark 1.3, you will be able to avoid setting this property by turning on dynamic allocation with the spark.dynamicAllocation.enabled property. Dynamic allocation enables a Spark application to request executors when there is a backlog of pending tasks and free up executors when idle.

It’s also important to think about how the resources requested by Spark will fit into what YARN has available. The relevant YARN properties are:

•   yarn.nodemanager.resource.memory-mb controls the maximum sum of memory used by the containers on each node.
•   yarn.nodemanager.resource.cpu-vcores controls the maximum sum of cores used by the containers on each node.

Asking for five executor cores will result in a request to YARN for five virtual cores. The memory requested from YARN is a little more complex for a couple of reasons:

•   --executor-memory/spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers. The value of the spark.yarn.executor.memoryOverhead property is added to the executor memory to determine the full memory request to YARN for each executor. It defaults to max(384, .07 * spark.executor.memory).
•   YARN may round the requested memory up a little. YARN’s yarn.scheduler.minimum-allocation-mb and yarn.scheduler.increment-allocation-mb properties control the minimum and increment request values respectively.

The following (not to scale with defaults) shows the hierarchy of memory properties in Spark and YARN:
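The memory arithmetic from the bullets above can be written out as a small calculation. This is a sketch under stated assumptions: the overhead rule is the default quoted above, while the round-up-to-increment step is a simplified model of how the scheduler might normalize a request, and the minimum and increment values are supplied by the caller.

```scala
object YarnMemorySketch {
  /** Overhead added to the executor heap, per the default quoted above: max(384, 0.07 * heap), in MB. */
  def overheadMb(executorMemoryMb: Int): Int =
    math.max(384, (0.07 * executorMemoryMb).toInt)

  /** Heap plus overhead, raised to the minimum allocation and rounded up to a multiple of the increment. */
  def containerRequestMb(executorMemoryMb: Int, minAllocMb: Int, incrementMb: Int): Int = {
    val requested = math.max(executorMemoryMb + overheadMb(executorMemoryMb), minAllocMb)
    ((requested + incrementMb - 1) / incrementMb) * incrementMb
  }
}
```

For example, a 19GB heap (19456MB) picks up 1361MB of overhead, and with an assumed 512MB increment the container request becomes 20992MB, noticeably more than the heap size passed to --executor-memory.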


And if that weren’t enough to think about, a few final concerns when sizing Spark executors:

•   The application master, which is a non-executor container with the special capability of requesting containers from YARN, takes up resources of its own that must be budgeted in. In yarn-client mode, it defaults to 1024MB and one vcore. In yarn-cluster mode, the application master runs the driver, so it’s often useful to bolster its resources with the --driver-memory and --driver-cores properties.
•   Running executors with too much memory often results in excessive garbage collection delays. 64GB is a rough guess at a good upper limit for a single executor.
•   I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor at or below that number.
•   Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM. For example, broadcast variables need to be replicated once on each executor, so many small executors will result in many more copies of the data.

To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.

The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:

•   63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers.
•   The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node.
•   15 cores per executor can lead to bad HDFS I/O throughput.


A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?

•   This config results in three executors on all nodes except for the one with the AM, which will have two executors.
•   --executor-memory was derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 - 1.47 ~ 19.
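The derivation in the worked example can be reproduced with a short calculation. The sketch below is plain Scala with made-up names. It assumes, as the example does, that one executor slot is given up for the application master and that the heap is the per-executor share of node memory minus a 7% overhead, rounded down to whole gigabytes.

```scala
object ExecutorSizingSketch {
  final case class Sizing(numExecutors: Int, executorCores: Int, executorMemoryGb: Int)

  def size(nodes: Int, usableCoresPerNode: Int, usableMemGbPerNode: Int,
           coresPerExecutor: Int, overheadFraction: Double): Sizing = {
    val executorsPerNode = usableCoresPerNode / coresPerExecutor
    // One executor slot is left free for the application master.
    val totalExecutors = nodes * executorsPerNode - 1
    // Split node memory evenly, then leave room for the off-heap overhead.
    val containerGb = usableMemGbPerNode.toDouble / executorsPerNode
    val heapGb = math.floor(containerGb * (1 - overheadFraction)).toInt
    Sizing(totalExecutors, coresPerExecutor, heapGb)
  }
}
```

Calling size(6, 15, 63, 5, 0.07) yields 17 executors with 5 cores and 19GB each, matching the flags above. Changing the inputs makes it easy to explore other cluster shapes, though real clusters may need extra headroom that this model ignores.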

Tuning Parallelism Spark, as you have likely figured out by this point, is a parallel processing engine. What is maybe less obvious is that Spark is not a “magic” parallel processing engine, and is limited in its ability to figure out the optimal amount of parallelism. Every Spark stage has a number of tasks, each of which processes data sequentially. In tuning Spark jobs, this number is probably the single most important parameter in determining performance. How is this number determined? The way Spark groups RDDs into stages is described in the previous post. (As a quick reminder, transformations like repartition and reduceByKey induce stage boundaries.) The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage. The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple exceptions: the coalesce transformation allows creating an RDD with fewer partitions than its parent RDD, the union transformation creates an RDD with the sum of its parents’ number of partitions, and cartesian creates an RDD with their product. What about RDDs with no parents? RDDs produced by textFile or hadoopFile have their partitions determined by the underlying MapReduce InputFormat that’s used. Typically there will be a partition for each HDFS block being read. Partitions for RDDs produced by parallelize come from the parameter given by the user, or spark.default.parallelism if none is given. To determine the number of partitions in an RDD, you can always call rdd.partitions().size() . The primary concern is that the number of tasks will be too small. If there are fewer tasks than slots available to run them in, the stage won’t be taking advantage of all the CPU available. A small number of tasks also mean that more memory pressure is placed on any aggregation operations that occur in each task. 
Any join, cogroup, or *ByKey operation involves holding objects in hashmaps or in-memory buffers to group or sort. join, cogroup, and groupByKey use these data structures in the tasks for the stages that are on the fetching side of the shuffles they trigger. reduceByKey and aggregateByKey use data structures in the tasks for the stages on both sides of the shuffles they trigger.

When the records destined for these aggregation operations do not easily fit in memory, some mayhem can ensue. First, holding many records in these data structures puts pressure on garbage collection, which can lead to pauses down the line. Second, when the records do not fit in memory, Spark will spill them to disk, which causes disk I/O and sorting. This overhead during large shuffles is probably the number one cause of job stalls I have seen at Cloudera customers.

So how do you increase the number of partitions? If the stage in question is reading from Hadoop, your options are:
•   Use the repartition transformation, which will trigger a shuffle.
•   Configure your InputFormat to create more splits.
•   Write the input data out to HDFS with a smaller block size.

If the stage is getting its input from another stage, the transformation that triggered the stage boundary will accept a numPartitions argument, such as:

    val rdd2 = rdd1.reduceByKey(_ + _, numPartitions = X)

What should “X” be? The most straightforward way to tune the number of partitions is experimentation: Look at the number of partitions in the parent RDD and then keep multiplying that by 1.5 until performance stops improving.

There is also a more principled way of calculating X, but it’s difficult to apply a priori because some of the quantities are difficult to calculate. I’m including it here not because it’s recommended for daily use, but because it helps with understanding what’s going on. The main goal is to run enough tasks so that the data destined for each task fits in the memory available to that task.

The memory available to each task is (spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction) / spark.executor.cores. Memory fraction and safety fraction default to 0.2 and 0.8, respectively.

The in-memory size of the total shuffle data is harder to determine. The closest heuristic is to find the ratio between the Shuffle Spill (Memory) metric and the Shuffle Spill (Disk) metric for a stage that ran. Then multiply the total shuffle write by this number, although this can be somewhat compounded if the stage is doing a reduction. Dividing the resulting in-memory size by the memory available to each task gives an estimate for X.
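The estimate described above, the in-memory shuffle size divided by the memory available to each task, can be written out as follows. This is a reconstruction from the definitions in the text rather than a quoted formula, and the symbols are shorthand introduced here: W is the observed total shuffle write, S_mem and S_disk are the Shuffle Spill (Memory) and Shuffle Spill (Disk) metrics, and M_task is the memory available to each task.

```latex
\[
M_{\text{task}} =
  \frac{\texttt{spark.executor.memory} \times \texttt{spark.shuffle.memoryFraction}
        \times \texttt{spark.shuffle.safetyFraction}}
       {\texttt{spark.executor.cores}}
\]
\[
X \approx \frac{W \times \left( S_{\text{mem}} / S_{\text{disk}} \right)}{M_{\text{task}}}
\]
```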


Then round up a bit because too many partitions is usually better than too few partitions. In fact, when in doubt, it’s almost always better to err on the side of a larger number of tasks (and thus partitions). This advice is in contrast to recommendations for MapReduce, which requires you to be more conservative with the number of tasks. The difference stems from the fact that MapReduce has a high startup overhead for tasks, while Spark does not.
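As a rough sketch, the principled estimate can be turned into a few lines of Scala. Nothing here is a Spark API: the object, method, and parameter names are illustrative, the inputs would have to be read off the Spark UI by hand, and the figures in main are made-up numbers chosen only to exercise the arithmetic.

```scala
// Sketch of the partition-count estimate described in the text. Names are illustrative.
object PartitionEstimate {
  /** Shuffle memory available to a single task, in bytes, using the defaults from the text. */
  def memoryPerTask(executorMemory: Long, executorCores: Int,
                    memoryFraction: Double = 0.2, safetyFraction: Double = 0.8): Double =
    executorMemory * memoryFraction * safetyFraction / executorCores

  /** Smallest partition count for which each task's share of the shuffle data fits in memory. */
  def estimate(shuffleWrite: Long, spillMemory: Long, spillDisk: Long,
               executorMemory: Long, executorCores: Int): Int = {
    // Scale the on-disk shuffle size by the observed memory-to-disk spill ratio.
    val inMemorySize = shuffleWrite.toDouble * spillMemory / spillDisk
    math.ceil(inMemorySize / memoryPerTask(executorMemory, executorCores)).toInt
  }

  def main(args: Array[String]): Unit = {
    val gib = 1024L * 1024L * 1024L
    // Hypothetical stage: 100 GiB shuffle write, spill metrics of 40 GiB (memory) and 10 GiB (disk),
    // running on executors with 19 GiB of memory and 5 cores each.
    println(estimate(100 * gib, 40 * gib, 10 * gib, 19 * gib, 5))
  }
}
```

The ceiling here is only the bare minimum of rounding; in line with the advice above, it is reasonable to pad the result further.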

Slimming Down Your Data Structures

Data flows through Spark in the form of records. A record has two representations: a deserialized Java object representation and a serialized binary representation. In general, Spark uses the deserialized representation for records in memory and the serialized representation for records stored on disk or being transferred over the network. There is work planned to store some in-memory shuffle data in serialized form.

The spark.serializer property controls the serializer that’s used to convert between these two representations. The Kryo serializer, org.apache.spark.serializer.KryoSerializer, is the preferred option. It is unfortunately not the default, because of some instabilities in Kryo during earlier versions of Spark and a desire not to break compatibility, but the Kryo serializer should always be used.

The footprint of your records in these two representations has a massive impact on Spark performance. It’s worthwhile to review the data types that get passed around and look for places to trim some fat.

Bloated deserialized objects will result in Spark spilling data to disk more often and reduce the number of deserialized records Spark can cache (e.g. at the MEMORY_ONLY storage level). The Spark tuning guide has a great section on slimming these down.

Bloated serialized objects will result in greater disk and network I/O, as well as reduce the number of serialized records Spark can cache (e.g. at the MEMORY_ONLY_SER storage level). The main action item here is to make sure to register any custom classes you define and pass around using the SparkConf#registerKryoClasses API.
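A minimal sketch of what that setup might look like in Scala is shown below. It assumes Spark 1.2 or later on the classpath, and the Click and Session classes are hypothetical stand-ins for whatever record types an application actually passes around.

```scala
import org.apache.spark.SparkConf

// Hypothetical record types standing in for the application's own classes.
case class Click(userId: Long, url: String)
case class Session(userId: Long, clicks: Seq[Click])

object KryoSetup {
  /** Builds a SparkConf that uses Kryo and registers the custom record classes. */
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-example") // illustrative application name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Click], classOf[Session]))
}
```

Registering a class lets Kryo write a short numeric identifier in place of the full class name with each serialized object, which is where much of the size saving comes from.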

Data Formats

Whenever you have the power to make the decision about how data is stored on disk, use an extensible binary format like Avro, Parquet, Thrift, or Protobuf. Pick one of these formats and stick to it. To be clear, when one talks about using Avro, Thrift, or Protobuf on Hadoop, they mean that each record is an Avro/Thrift/Protobuf struct stored in a sequence file. JSON is just not worth it.

Every time you consider storing lots of data in JSON, think about the conflicts that will be started in the Middle East, the beautiful rivers that will be dammed in Canada, or the radioactive fallout from the nuclear plants that will be built in the American heartland to power the CPU cycles spent parsing your files over and over and over again. Also, try to learn people skills so that you can convince your peers and superiors to do this, too.

Sandy Ryza is a Data Scientist at Cloudera, an Apache Spark committer, and an Apache Hadoop PMC member. He is a co-author of the O’Reilly Media book, Advanced Analytics with Spark.
