Author: Imogen Holt
NoSQL

Outline
● What is NoSQL?
● What is MapReduce/Hadoop?
● Example MapReduce Problem
● Exercise: Write your own queries in Hadoop!

What is NoSQL?
● “No SQL”
  ○ No relational database
● Umbrella term for many different types of datastores
  ○ Key-Value Stores, Document Stores, Graph Database systems, etc.
● (Really, it’s more like “Not Only SQL” – we don’t want to abandon the relational DBMS entirely)

Why NoSQL?
● In general, we want our databases to be:
  ○ Convenient
  ○ Reliable
  ○ Safe
  ○ Scalable
  ○ Efficient
● Nowadays, we care a lot more about scalability and efficiency

Gaining in popularity...

...but still has a long way to go

Terrastore, RethinkDB, RavenDB, ThruDB, LevelDB, Cloudata, RaptorDB, Amazon DynamoDB, BerkeleyDB, Voldemort, FoundationDB, Amazon SimpleDB


What is MapReduce?
● Created in 2004 at Google
● Problem: 100s of data files distributed across 1,000s of machines
  ○ How do we get that information quickly?
● Solution: Extract the data from the files in parallel
  ○ Take advantage of the fact that the data is distributed over 1,000s of machines

What is MapReduce?
● No data model, data stored in files
● Primarily used on distributed filesystems
● Users provide two functions
  ○ map function (data transformation)
  ○ reduce function (data aggregation)
● System takes care of parallelizing the process

Mapping and Reducing
● map: divide problem into subproblems
  ○ input: single line from data file
  ○ output: 0 or more (key, value) pairs
● reduce: work on each subproblem, combine results
  ○ input: (key, list of values)
  ○ output: 0 or more output records
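To make the two signatures concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: `run_mapreduce`, `parse_pair`, and `max_per_key` are hypothetical names chosen for illustration, and the grouping step only imitates what the real system does across many machines.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple

# Hypothetical type aliases for this sketch (not part of Hadoop).
MapFn = Callable[[str], Iterable[Tuple[str, int]]]
ReduceFn = Callable[[str, List[int]], Iterable[Tuple[str, int]]]


def run_mapreduce(lines: Iterable[str], map_fn: MapFn,
                  reduce_fn: ReduceFn) -> List[Tuple[str, int]]:
    """Run map, group by key, then reduce, all in one process."""
    groups = defaultdict(list)
    for line in lines:
        # Map: one input line yields 0 or more (key, value) pairs.
        for key, value in map_fn(line):
            # Grouping: collect every value emitted for the same key.
            groups[key].append(value)

    output = []
    for key in sorted(groups):
        # Reduce: (key, list of values) yields 0 or more output records.
        output.extend(reduce_fn(key, groups[key]))
    return output


def parse_pair(line: str):
    # Lines look like "key number"; malformed lines produce no output.
    parts = line.split()
    if len(parts) == 2 and parts[1].isdigit():
        yield parts[0], int(parts[1])


def max_per_key(key: str, values: List[int]):
    # Aggregate all values for one key into a single record.
    yield key, max(values)


if __name__ == "__main__":
    data = ["a 3", "b 5", "a 7", "not a pair here"]
    print(run_mapreduce(data, parse_pair, max_per_key))
```

The user writes only `parse_pair` and `max_per_key`; everything inside `run_mapreduce` stands in for the work the system would parallelize.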


What is Hadoop?
● An open source implementation of MapReduce
  ○ Google didn’t want to share :-(
● Also used over distributed filesystems
● Same mechanics as Google’s version of MapReduce

Example Problem: Word Counts
● Input
  ○ How now brown cow
  ○ Brown cow is blue
● Map
  ○ How now brown cow → (how, 1) (now, 1) (brown, 1) (cow, 1)
  ○ Brown cow is blue → (brown, 1) (cow, 1) (is, 1) (blue, 1)
● Reduce
  ○ how, 1
  ○ now, 1
  ○ brown, 2
  ○ cow, 2
  ○ is, 1
  ○ blue, 1
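The word-count example can be sketched in plain Python as follows. This is an illustration only, not the Hadoop starter code: the function names are hypothetical, the grouping loop stands in for the work Hadoop does between map and reduce, and lowercasing is assumed because the slide folds "Brown" and "brown" into one key.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit (word, 1) for every word in one input line."""
    # Assumption: words are whitespace-separated and case is ignored.
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: sum the 1s collected for one word."""
    yield word, sum(counts)


def word_count(lines):
    # Group mapper output by key, imitating the step between map and reduce.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)

    # Call the reducer once per key and collect its output records.
    result = {}
    for word, counts in grouped.items():
        for key, total in reduce_counts(word, counts):
            result[key] = total
    return result


if __name__ == "__main__":
    print(word_count(["How now brown cow", "Brown cow is blue"]))
```

Run on the two input lines, the counts match the Reduce output in the example: brown and cow appear twice, every other word once.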

Now it’s your turn!
● ssh into corn, and copy the Hadoop starter code
  ○ cp -r /usr/class/cs145/NoSQL-activity .
  ○ cd NoSQL-activity/
● Run the initialization script
  ○ local-hadoop/start-local-hadoop.py
  ○ Don’t forget to run local-hadoop/stop-local-hadoop.py before you log out!

Query #1: Word Counts (again!)
● (We’ll do this one together.)
● Starter code can be found in src/query1
● Dataset can be found in /usr/class/cs145/NoSQL-data
● Compile and run your code using query1-wordcount.sh
● Results will show up in the output1/ directory
  ○ check results by running diff output1/part-00000 /usr/class/cs145/NoSQLanswers/output1/part-00000

Query #2: Hashtag Counts
● Count the number of times each hashtag appears in the Twitter dataset
  ○ a hashtag is a term that starts with ‘#’

● Answer should be of the form: ○

● How many times does #goStanford appear in the dataset?
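One possible shape for this query, sketched in plain Python rather than in the starter code. Several things here are assumptions: the function names are hypothetical, the tweet text is assumed to be already available as a string, terms are assumed to be whitespace-separated, hashtags are compared case-sensitively, and trailing punctuation is not stripped. The sample tweets are made up and do not come from the dataset.

```python
def map_hashtags(tweet_text):
    """Map: emit (hashtag, 1) for every term that starts with '#'."""
    for token in tweet_text.split():
        # Assumption: a lone '#' with nothing after it is not a hashtag.
        if token.startswith("#") and len(token) > 1:
            yield token, 1


def reduce_hashtag(tag, counts):
    """Reduce: sum the 1s collected for one hashtag."""
    yield tag, sum(counts)


def count_hashtags(tweets):
    # Group mapper output by hashtag, imitating the step before reduce.
    grouped = {}
    for text in tweets:
        for tag, one in map_hashtags(text):
            grouped.setdefault(tag, []).append(one)

    # Call the reducer once per hashtag.
    result = {}
    for tag, counts in grouped.items():
        for key, total in reduce_hashtag(tag, counts):
            result[key] = total
    return result


if __name__ == "__main__":
    # Made-up sample tweets, not taken from the class dataset.
    sample = ["go team #win", "#win #fun", "no tags # here"]
    print(count_hashtags(sample))
```

The structure is the same as word count; only the filter inside the map function changes.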

Query #3: Inverted Index on Mentions
● Create a mapping from a Twitter username to a list of Tweets that the username appears in
  ○ A username always starts with ‘@’

● Answer should be of the form: ○

● What Tweet IDs include mentions of @BillCosby? @AndrewYNg?
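A rough sketch of the inverted-index idea in plain Python, under stated assumptions. The dataset's line format is not given here, so the sketch assumes each record has already been parsed into a (tweet ID, text) pair; the function names and the sample records are hypothetical. Unlike word count, the map output value is the Tweet ID and the reducer builds a list instead of a sum.

```python
def map_mentions(tweet_id, text):
    """Map: emit (username, tweet_id) for each distinct '@' term in a tweet."""
    seen = set()
    for token in text.split():
        # Assumption: a username is any whitespace-separated term starting with '@'.
        if token.startswith("@") and len(token) > 1 and token not in seen:
            seen.add(token)
            yield token, tweet_id


def reduce_mentions(username, tweet_ids):
    """Reduce: turn the collected Tweet IDs into one sorted list."""
    yield username, sorted(tweet_ids)


def build_index(tweets):
    # Group mapper output by username, imitating the step before reduce.
    grouped = {}
    for tweet_id, text in tweets:
        for user, tid in map_mentions(tweet_id, text):
            grouped.setdefault(user, []).append(tid)

    # Call the reducer once per username.
    index = {}
    for user, ids in grouped.items():
        for key, id_list in reduce_mentions(user, ids):
            index[key] = id_list
    return index


if __name__ == "__main__":
    # Made-up records, not taken from the class dataset.
    sample = [(101, "hi @alice and @bob"), (102, "@alice @alice again"), (103, "nothing")]
    print(build_index(sample))
```

The `seen` set keeps a Tweet ID from appearing twice for a username that is mentioned more than once in the same tweet.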