
Clojure in Action
By Amit Rathore

Google popularized the map/reduce approach to distributed computing, in which large volumes of data can be processed using a large number of computers. This article from the author of Clojure in Action explores the combination of map and reduce to see how it can be useful in processing data.

Understanding Data Processing with Map/Reduce

Google popularized the map/reduce approach to distributed computing, in which large volumes of data can be processed using a large number of computers. The data processing problem is broken into pieces, and each piece runs on an individual machine. The software then combines the output from each computer to produce a final answer. The breaking up of the problem into smaller problems and assigning them to computers happens in the map stage, whereas the output from the individual computers is taken and combined into a single entity in the reduce stage. Google's map/reduce is based on the functional concepts of map and reduce, functions that you've seen repeatedly in this book so far. In this article, we'll explore this combination of map and reduce to see how it can be useful in processing data. We'll use the basic ideas of mapping and reducing and, over the course of this article, we'll process data that we read from files. We'll build abstractions on top of simple file input so that we eventually end up processing Ruby on Rails server log files.

Getting started with map/reduce—counting words

We're going to use a traditional example in order to understand the idea of map/reduce. The problem is to count the number of times each word appears in a corpus of text. The total volume of text is usually large, but we'll use a small amount in order to illustrate the idea. The following is the first stanza of a poem by Lewis Carroll, called "Jabberwocky":

Twas brillig and the slithy toves
Did gyre and gimble in the wabe
All mimsy were the borogoves
And the mome raths outgrabe

It's easy to find this poem on the internet, because it's from the famous book Through the Looking-Glass. Note that for convenience, we've removed all punctuation from the text. We put it in a file called jabberwocky.txt in a convenient folder. Let's write some code to count the number of times each word appears in the poem. Consider the following function, which operates on only a single line of the poem:

(defn parse-line [line]
  (let [tokens (.split (.toLowerCase line) " ")]
    (map #(vector % 1) tokens)))

For source code, sample chapters, the Online Author Forum, and other resources, go to http://www.manning.com/rathore/


This will convert a given line of text into a sequence of vectors, where each entry contains a single word and the number 1 (which can be thought of as a tally mark recording that the word appeared once). For instance:

user> (parse-line "Twas brillig and the slithy toves")
(["twas" 1] ["brillig" 1] ["and" 1] ["the" 1] ["slithy" 1] ["toves" 1])

Next, we'll combine the tally marks, so to speak, so you'll have an idea of how many times each word appears. Consider this:

(defn combine [mapped]
  (->> (apply concat mapped)
       (group-by first)
       (map (fn [[k v]] {k (map second v)}))
       (apply merge-with conj)))

This works by creating a map where the keys are the words found by parse-line and the values are sequences of tally marks. The only thing of curiosity here should be the group-by function. As you can see, it takes two arguments: a function and a sequence. The return value is a map where the keys are the results of applying the function to each element of the sequence, and the values are vectors of the corresponding elements. The elements in each vector are also, conveniently, in the same order in which they appear in the original sequence. Here's the combine operation in action:

user> (use 'clojure.contrib.io)
nil
user> (combine (map parse-line (read-lines "/Users/amit/tmp/jabberwocky.txt")))
{"were" (1), "all" (1), "in" (1), "gyre" (1), "toves" (1), "outgrabe" (1), "wabe" (1), "gimble" (1), "raths" (1), "the" (1 1 1 1), "borogoves" (1), "slithy" (1), "twas" (1), "brillig" (1), "mimsy" (1), "and" (1 1 1), "mome" (1), "did" (1)}

The read-lines function reads the contents of a file into a sequence of lines. Consider the output: the word the appears multiple times, and its associated value is a list of 1s, each representing a single occurrence. The final step is to sum the tally marks. This is the reduce step, and it's quite straightforward.
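Before moving on to the reduce step, the grouping behavior described above can be sketched in isolation. This is a minimal REPL-style example using clojure.core's group-by (the contrib version used in this article behaves the same way for our purposes); the input vectors are made up for illustration:

```clojure
;; group-by builds a map from the result of the key function
;; to a vector of the elements that produced that result,
;; preserving each element's original order.
(group-by first [["the" 1] ["slithy" 1] ["the" 1]])
;; → {"the" [["the" 1] ["the" 1]], "slithy" [["slithy" 1]]}
```

This is exactly the shape combine relies on: every word's tally marks end up side by side, ready to be summed.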
Consider the following code:

(defn sum [[k v]]
  {k (apply + v)})

(defn reduce-parsed-lines [collected-values]
  (apply merge (map sum collected-values)))

And that's all there is to it. Let's create a nice wrapper function that you can call with a filename:

(defn word-frequency [filename]
  (->> (read-lines filename)
       (map parse-line)
       (combine)
       (reduce-parsed-lines)))

Let's try it at the REPL:

user> (word-frequency "/Users/amit/tmp/jabberwocky.txt")
{"were" 1, "all" 1, "in" 1, "gyre" 1, "toves" 1, "outgrabe" 1, "wabe" 1, "gimble" 1, "raths" 1, "the" 4, "borogoves" 1, "slithy" 1, "twas" 1, "brillig" 1, "mimsy" 1, "and" 3, "mome" 1, "did" 1}

So there you have it. It might seem a somewhat convoluted way to count the number of times words appear in text, but you'll see why this is a good approach for generalized computations of this sort.

Generalizing the map/reduce

In the previous section, we wrote a fair bit of code to compute the frequency of words in a given piece of text. The following listing shows the complete code.

Listing 1 Computing the frequency of words in given text

(ns chapter-data.word-count-1
  (:use clojure.contrib.io
        clojure.contrib.seq-utils))

(defn parse-line [line]
  (let [tokens (.split (.toLowerCase line) " ")]
    (map #(vector % 1) tokens)))

(defn combine [mapped]
  (->> (apply concat mapped)
       (group-by first)
       (map (fn [[k v]] {k (map second v)}))
       (apply merge-with conj)))

(defn sum [[k v]]
  {k (apply + v)})

(defn reduce-parsed-lines [collected-values]
  (apply merge (map sum collected-values)))

(defn word-frequency [filename]
  (->> (read-lines filename)
       (map parse-line)
       (combine)
       (reduce-parsed-lines)))

As pointed out earlier, there are probably more direct ways to do the job. We said that we did this so we could generalize the code to compute other kinds of things. We'll do that in this section. Consider the word-frequency function in listing 1. Clearly, the first thing to pull out is how the input lines of text are provided. By decoupling the rest of the code from the call to read-lines, you can pass in any other lines of text you might have to process. So your new top-level function will accept the input as a parameter. Next, you'll decouple the code from the parse-line function. That way, the user of your map/reduce code can decide how to map each piece of input into the intermediate form. Your new top-level function will accept the mapper function. Figure 1 shows a conceptual view of the mapping phase of the map/reduce approach. Finally, you'll also decouple your map/reduce code from the way in which the reduce happens, so that the user of your code can decide how to do this part of the computation. In other words, you'll also accept the reducer function as a parameter.


Figure 1 The mapping phase of the map/reduce approach applies a function to each input value, producing a list of key/value pairs for each input. All these lists (each containing several key/value pairs) are gathered into another list to constitute the final output of the mapping phase.

Given these considerations, your top-level map-reduce function may look like this:

(defn map-reduce [mapper reducer args-seq]
  (->> (map mapper args-seq)
       (combine)
       (reducer)))

The first line of this function is simple, and the combine function from our previous word count example is sufficient. Finally, reducer will accept the combined set of processed input to produce the result. So with this map-reduce function and the combine function from the previous example, you have enough to try the word count example again. Recall that the idea of the combine phase is to group together common keys in order to prepare for the final reduce phase. Figure 2 shows the conceptual view, and listing 2 shows the extracted bits, followed by the word count example.


Figure 2 The combine phase takes the output of the mapping phase and collects each key and associated values from the collection of lists of key/value pairs. The combined output is then essentially a map with unique keys created during the mapping process, with each associated value being a list of values from the mapping phase.

Listing 2 General map/reduce extracted out of the word-count example

(ns chapter-data.map-reduce
  (:use clojure.contrib.seq-utils))

(defn combine [mapped]
  (->> (apply concat mapped)
       (group-by first)
       (map (fn [[k v]] {k (map second v)}))
       (apply merge-with conj)))

(defn map-reduce [mapper reducer args-seq]
  (->> (map mapper args-seq)
       (combine)
       (reducer)))

It's time to see it in action. Consider the rewritten word-frequency function:

(defn word-frequency [filename]
  (map-reduce parse-line reduce-parsed-lines (read-lines filename)))

And here it is on the REPL:

user> (word-frequency "/Users/amit/tmp/jabberwocky.txt")
{"were" 1, "all" 1, "in" 1, "gyre" 1, "toves" 1, "outgrabe" 1, "wabe" 1, "gimble" 1, "raths" 1, "the" 4, "borogoves" 1, "slithy" 1, "twas" 1, "brillig" 1, "mimsy" 1, "and" 3, "mome" 1, "did" 1}

Note that, in this case, the final output is a map of words to total counts. The map/reduce algorithm is general in the sense that the reduce phase can result in any arbitrary value: it can be a constant, a list, a map (as in the previous example), or any other value. The generic process is conceptualized in figure 3.

Figure 3 The input to the reduce phase is the output of the combiner, which is a map, with keys being all the unique keys found in the mapping operation and the values being the collected values for each key from the mapping process. The output of the reduce phase can be any arbitrary value.
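To make concrete the point that the reduce phase can produce any kind of value, here is a hypothetical reducer sketch (most-frequent is an illustrative name, not part of the original code) that collapses the combined map down to just the single most common word:

```clojure
;; Hypothetical reducer: sum each word's tally marks, then keep only
;; the [word total] pair with the highest total.
(defn most-frequent [combined]
  (->> combined
       (map (fn [[word tallies]] [word (apply + tallies)]))
       (apply max-key second)))

;; Applied to a fragment of the combined output seen earlier:
(most-frequent {"the" [1 1 1 1], "and" [1 1 1], "twas" [1]})
;; → ["the" 4]
```

Plugged in as (map-reduce parse-line most-frequent (read-lines filename)), the pipeline would then return a single pair rather than a map, without any change to the mapping or combining machinery.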


The obvious question is, just how general is this map/reduce code? Let’s find the average number of words per line in the text. The code to do that is shown in the following listing.

Listing 3 Using map/reduce to calculate the average number of words in each line

(ns chapter-data.average-line-length
  (:use chapter-data.map-reduce
        clojure.contrib.io))

(def IGNORE "_")

(defn parse-line [line]
  (let [tokens (.split (.toLowerCase line) " ")]
    [[IGNORE (count tokens)]]))

(defn average [numbers]
  (/ (apply + numbers)
     (count numbers)))

(defn reducer [combined]
  (average (val (first combined))))

(defn average-line-length [filename]
  (map-reduce parse-line reducer (read-lines filename)))

Let's look at it in action:

user> (average-line-length "/Users/amit/tmp/jabberwocky.txt")
23/4
user> (float (average-line-length "/Users/amit/tmp/jabberwocky.txt"))
5.75

In this version of parse-line, you don't actually care which line has what length, so you use a placeholder string "_" for every key (bound to IGNORE because its value is never used later).
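The same IGNORE-key trick supports other reductions. As a sketch (longest-line is a hypothetical name, not from the original code), a reducer that takes the maximum instead of the average would report the length of the longest line, in words:

```clojure
;; Hypothetical reducer: pick the largest word count instead of averaging.
(defn longest-line [combined]
  (apply max (val (first combined))))

;; The combine phase for the four-line stanza yields {"_" (6 7 5 5)}:
(longest-line {"_" '(6 7 5 5)})
;; → 7
```

With listing 3's parse-line unchanged, (map-reduce parse-line longest-line (read-lines filename)) would return 7 for the stanza, since "Did gyre and gimble in the wabe" has seven words.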

Summary

Each application that you'll end up writing will need a different model based on the specifics of the domain. Clojure is flexible enough to solve the most demanding problems, and the functional style helps by reducing the amount of code needed while also increasing the readability of the code.



Last updated: August 26, 2011
