Author: Imogen Holt

NoSQL

Outline
● What is NoSQL?
● What is MapReduce/Hadoop?
● Example MapReduce Problem
● Exercise: Write your own queries in Hadoop!

What is NoSQL?
● “No SQL”
  ○ No relational database
● Umbrella term for many different types of datastores
  ○ Key-Value Stores, Document Stores, Graph Database systems, etc.
● (Really, it’s more like “Not Only SQL”; we don’t want to abandon the relational DBMS entirely)

Why NoSQL?
● In general, we want our databases to be:
  ○ Convenient
  ○ Reliable
  ○ Safe
  ○ Scalable
  ○ Efficient
● Nowadays, we care a lot more about scalability and efficiency

Gaining in popularity...

...but still has a long way to go

Terrastore, RethinkDB, RavenDB, ThruDB, LevelDB, Cloudata, RaptorDB, Amazon DynamoDB, BerkeleyDB, Voldemort, FoundationDB, Amazon SimpleDB

What is MapReduce?
● Created in 2004 at Google
● Problem: hundreds of data files distributed across thousands of machines
  ○ how do we get that information quickly?
● Solution: extract the data from the files in parallel
  ○ take advantage of the fact that the data is distributed over thousands of machines

What is MapReduce?
● No data model; data is stored in files
● Primarily used on distributed filesystems
● Users provide two functions
  ○ map function (data transformation)
  ○ reduce function (data aggregation)
● System takes care of parallelizing the process

Mapping and Reducing
● map: divide problem into subproblems
  ○ input: single line from data file
  ○ output: 0 or more (key, value) pairs
● reduce: work on each subproblem, combine results
  ○ input: (key, list of values)
  ○ output: 0 or more output records
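
The map and reduce contracts above can be sketched in a few lines of Python. This is a single-process simulation for illustration only, not how Hadoop is implemented: the names `run_mapreduce`, `parity_mapper`, and `sum_reducer` are hypothetical, and the toy problem (summing odd and even numbers) is made up to show that a mapper may emit zero pairs for a line.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical type aliases, chosen for this sketch only.
Pair = Tuple[str, int]
Mapper = Callable[[str], Iterable[Pair]]
Reducer = Callable[[str, List[int]], Iterable[Pair]]


def run_mapreduce(lines: Iterable[str], mapper: Mapper, reducer: Reducer) -> List[Pair]:
    """Run map, group-by-key, and reduce sequentially in one process."""
    groups = defaultdict(list)
    # Map phase: each input line yields zero or more (key, value) pairs.
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase: each key is handled once, with the list of all its values.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def parity_mapper(line: str) -> Iterator[Pair]:
    """Emit (parity, number) for a numeric line; emit nothing for a blank line."""
    text = line.strip()
    if text:
        number = int(text)
        yield ("even" if number % 2 == 0 else "odd", number)


def sum_reducer(key: str, values: List[int]) -> Iterator[Pair]:
    """Aggregate all values that share a key into one output record."""
    yield (key, sum(values))


result = run_mapreduce(["1", "2", "", "3", "4"], parity_mapper, sum_reducer)
print(result)  # [('even', 6), ('odd', 4)]
```

In a real cluster the grouping step happens across machines, which is the part the system handles for you.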

What is Hadoop?
● An open source implementation of MapReduce
  ○ Google didn’t want to share :-(
● Also used over distributed filesystems
● Same mechanics as Google’s version of MapReduce

Example Problem: Word Counts

Input
  How now brown cow
  Brown cow is blue

Map
  How now brown cow → (how, 1) (now, 1) (brown, 1) (cow, 1)
  Brown cow is blue → (brown, 1) (cow, 1) (is, 1) (blue, 1)

Reduce
  how, 1
  now, 1
  brown, 2
  cow, 2
  is, 1
  blue, 1
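
The word count walkthrough can be reproduced with a short Python sketch. It runs in one process and is not the course's Hadoop starter code; the function names `map_words`, `reduce_counts`, and `word_count` are hypothetical. Lowercasing in the mapper is an assumption inferred from the example, where "Brown" and "brown" are counted together.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit (word, 1) for every whitespace-separated word, lowercased."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce: sum the 1s collected for a single word."""
    return (word, sum(counts))


def word_count(lines):
    """Group mapper output by word, then reduce each group to a total."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)
    return dict(reduce_counts(word, ones) for word, ones in grouped.items())


counts = word_count(["How now brown cow", "Brown cow is blue"])
print(counts)
```

The resulting dictionary holds the same totals as the Reduce output above: 2 for "brown" and "cow", 1 for every other word.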

Now it’s your turn!
● ssh into corn, and copy the Hadoop starter code
  ○ cp -r /usr/class/cs145/NoSQL-activity .
  ○ cd NoSQL-activity/
● Run the initialization script
  ○ local-hadoop/start-local-hadoop.py
  ○ Don’t forget to run local-hadoop/stop-local-hadoop.py before you log out!

Query #1: Word Counts (again!)
● (We’ll do this one together.)
● Starter code can be found in src/query1
● Dataset can be found in /usr/class/cs145/NoSQL-data
● Compile and run your code using query1-wordcount.sh
● Results will show up in the output1/ directory
  ○ check results by running diff output1/part-00000 /usr/class/cs145/NoSQL-answers/output1/part-00000

Query #2: Hashtag Counts
● Count the number of times each hashtag appears in the Twitter dataset
  ○ a hashtag is a term that starts with ‘#’
● Answer should be of the form:
  ○
● How many times does #goStanford appear in the dataset?
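
One way to think about this query before writing the Hadoop version is the Python sketch below. It rests on several assumptions: each input line contains the tweet text, terms are separated by whitespace, and hashtags are compared case-sensitively. The function names and the sample lines are made up and do not come from the course dataset or starter code.

```python
from collections import Counter


def map_hashtags(line):
    """Emit (hashtag, 1) for each whitespace-separated term starting with '#'."""
    return [(term, 1) for term in line.split()
            if term.startswith("#") and len(term) > 1]


def count_hashtags(lines):
    """Sum the emitted 1s per hashtag, as a reducer would for each key."""
    totals = Counter()
    for line in lines:
        for tag, one in map_hashtags(line):
            totals[tag] += one
    return totals


# Made-up sample lines, not taken from the course dataset.
sample = [
    "Big game today #goStanford #football",
    "so proud #goStanford",
    "no tags here",
]
hashtag_counts = count_hashtags(sample)
print(hashtag_counts["#goStanford"])  # 2
```

A plain whitespace split keeps trailing punctuation attached to a term (for example "#tag,"), so a real solution may need stricter tokenization.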

Query #3: Inverted Index on Mentions
● Create a mapping from a Twitter username to a list of Tweets that the username appears in
  ○ A username always starts with ‘@’
● Answer should be of the form:
  ○
● What Tweet IDs include mentions of @BillCosby? @AndrewYNg?
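
The inverted index follows the same pattern as word count, except that the mapper emits the tweet ID as the value and the reducer collects a list instead of a sum. The Python sketch below assumes a hypothetical input format of one tweet per line, written as a tweet ID, a tab, and the tweet text; the actual dataset layout may differ. The function names, usernames, and IDs are invented for illustration.

```python
from collections import defaultdict


def map_mentions(line):
    """Emit (username, tweet_id) for each '@' term in the tweet text."""
    # Assumed layout: "<tweet_id>\t<text>"; this is not confirmed by the dataset.
    tweet_id, _, text = line.partition("\t")
    return [(term, tweet_id) for term in text.split()
            if term.startswith("@") and len(term) > 1]


def build_index(lines):
    """Collect, for each username, the tweet IDs that mention it (no duplicates)."""
    index = defaultdict(list)
    for line in lines:
        for user, tweet_id in map_mentions(line):
            if tweet_id not in index[user]:
                index[user].append(tweet_id)
    return dict(index)


# Made-up sample lines, not taken from the course dataset.
sample = [
    "101\thello @alice and @bob",
    "102\t@alice again",
    "103\tno mentions",
]
mention_index = build_index(sample)
print(mention_index)
```

With this sample, "@alice" maps to tweets 101 and 102, and "@bob" maps to tweet 101 only.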