What is Cloud Computing?

Author: Derrick Adams
What is cloud computing? Why is this different?

Jimmy Lin
The iSchool, University of Maryland
Monday, March 30, 2009

Some material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under Creative Commons Attribution 3.0 License).

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

What is Cloud Computing?

1. Web-scale problems
2. Large data centers
3. Different models of computing
4. Highly-interactive Web applications

Source: http://www.free-pictures-photos.com/

1. “Web-Scale” Problems

- Characteristics:
  - Definitely data-intensive
  - May also be processing intensive
- Examples:
  - Crawling, indexing, searching, mining the Web
  - Data warehouses
  - Sensor networks
  - “Post-genomics” life sciences research
  - Other scientific data (physics, astronomy, etc.)
  - Web 2.0 applications
  - …

How much data?

- Internet Archive has 2 PB of data + 20 TB/month
- Google processes 20 PB a day (2008)
- “all words ever spoken by human beings” ~ 5 EB
- CERN’s LHC will generate 10-15 PB a year
- Sanger anticipates 6 PB of data in 2009

640K ought to be enough for anybody.
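To put two of these figures in perspective, here is a back-of-the-envelope calculation. It assumes decimal prefixes (1 PB = 10^15 bytes) and a constant rate over a 24-hour day; both are assumptions for illustration, not statements from the slides.

```python
# Back-of-the-envelope conversions for two of the figures quoted above.
# Assumption: decimal prefixes (1 GB = 10**9, 1 TB = 10**12, 1 PB = 10**15 bytes).
GB = 10**9
TB = 10**12
PB = 10**15

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_gb_per_s(bytes_per_day: int) -> float:
    """Average throughput (GB/s) needed to process bytes_per_day within one day."""
    return bytes_per_day / SECONDS_PER_DAY / GB


def months_to_add(current_bytes: int, growth_bytes_per_month: int) -> float:
    """Months of growth needed to add as much data as is already stored."""
    return current_bytes / growth_bytes_per_month


google_rate = average_rate_gb_per_s(20 * PB)      # "20 PB a day"
archive_months = months_to_add(2 * PB, 20 * TB)   # "2 PB of data + 20 TB/month"

print(f"20 PB/day is about {google_rate:.0f} GB/s sustained")
print(f"At 20 TB/month, adding another 2 PB takes {archive_months:.0f} months")
```

No single machine sustains hundreds of gigabytes per second, which is one way to see why these are "Web-scale" problems.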

Maximilien Brice, © CERN


There’s nothing like more data!
s/inspiration/data/g;

(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)

What to do with more data?

- Answering factoid questions
  - Pattern matching on the Web
  - Works amazingly well

  Who shot Abraham Lincoln? → X shot Abraham Lincoln

- Learning relations
  - Start with seed instances
  - Search for patterns on the Web
  - Use patterns to find more instances

  Seed instances: Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
  Text on the Web: Wolfgang Amadeus Mozart (1756 - 1791); Einstein was born in 1879
  Patterns: PERSON (DATE – ; PERSON was born in DATE

(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; … )

How do I make money?

- Petabytes of valuable customer data…
  - Sitting idle in existing data warehouses
  - Overflowing out of existing data warehouses
  - Simply being thrown away
- Source of data:
  - OLTP
  - User behavior logs
  - Call-center logs
  - Web crawls, public datasets
  - …
- Structured data (today) vs. unstructured data (tomorrow)
- How can an organization derive value from all this data?
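Returning to the factoid-question example ("Who shot Abraham Lincoln? → X shot Abraham Lincoln"), the idea can be sketched in a few lines. This is a toy illustration, not the method of the cited papers: the `SNIPPETS` list is made-up data standing in for Web search results, the helper names are hypothetical, and "X" is approximated as a run of capitalized words.

```python
import re
from collections import Counter
from typing import List, Optional


def question_to_pattern(question: str):
    """Rewrite 'Who <rest>?' into a regex for 'X <rest>'."""
    match = re.fullmatch(r"Who (.+)\?", question.strip())
    if match is None:
        raise ValueError("this sketch only handles 'Who ...?' questions")
    rest = re.escape(match.group(1))
    # X is approximated as one or more capitalized words (a crude name detector).
    return re.compile(r"((?:[A-Z][a-z]+ )*[A-Z][a-z]+) " + rest)


def answer(question: str, snippets: List[str]) -> Optional[str]:
    """Return the most frequent filler for X across all snippets, if any."""
    pattern = question_to_pattern(question)
    votes = Counter()
    for text in snippets:
        for candidate in pattern.findall(text):
            votes[candidate] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy stand-in for text retrieved from the Web (assumption, for illustration only).
SNIPPETS = [
    "John Wilkes Booth shot Abraham Lincoln at Ford's Theatre.",
    "In 1865, John Wilkes Booth shot Abraham Lincoln.",
    "Historians agree that Booth shot Abraham Lincoln.",
]

best = answer("Who shot Abraham Lincoln?", SNIPPETS)
print(best)
```

With only three snippets the vote is fragile; the point of the slide is that redundancy across a Web-sized collection makes such a crude pattern work.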

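The relation-learning loop (start with seed instances, search for patterns, use patterns to find more instances) can be sketched as one bootstrapping round. This is a simplified illustration under stated assumptions: `CORPUS` is a made-up four-sentence stand-in for the Web, a pattern is just the text between the two arguments, and all function names are hypothetical.

```python
import re
from typing import List, Set, Tuple

Instance = Tuple[str, str]  # (person, year)


def induce_patterns(seeds: Set[Instance], corpus: List[str]) -> Set[str]:
    """Find sentences containing a seed pair and generalize them into patterns."""
    patterns = set()
    for person, year in seeds:
        for sentence in corpus:
            start = sentence.find(person)
            if start == -1:
                continue
            after_person = start + len(person)
            year_pos = sentence.find(year, after_person)
            if year_pos == -1:
                continue
            # Keep the text between the two arguments; replace them with slots.
            patterns.add("PERSON" + sentence[after_person:year_pos] + "DATE")
    return patterns


def apply_patterns(patterns: Set[str], corpus: List[str]) -> Set[Instance]:
    """Match each pattern against the corpus to extract (person, year) pairs."""
    found = set()
    for pattern in patterns:
        regex = re.escape(pattern)
        regex = regex.replace("PERSON", r"(?P<person>[A-Z][a-z]+)")
        regex = regex.replace("DATE", r"(?P<date>\d{4})")
        for sentence in corpus:
            for m in re.finditer(regex, sentence):
                found.add((m.group("person"), m.group("date")))
    return found


# Toy corpus (assumption): the first two sentences echo the slide's examples.
CORPUS = [
    "Wolfgang Amadeus Mozart (1756 - 1791) was a composer.",
    "Einstein was born in 1879 in Ulm.",
    "Ludwig van Beethoven (1770 - 1827) wrote nine symphonies.",
    "Darwin was born in 1809.",
]
SEEDS = {("Mozart", "1756"), ("Einstein", "1879")}

patterns = induce_patterns(SEEDS, CORPUS)
new_instances = apply_patterns(patterns, CORPUS) - SEEDS
print(sorted(patterns))
print(sorted(new_instances))
```

A real system would score patterns and instances before trusting them; this sketch accepts everything, which only works because the corpus is tiny and clean.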

2. Large Data Centers

- Web-scale problems? Throw more machines at it!
- Centralization of resources in large data centers
  - Necessary ingredients: fiber, juice, and land
  - What do Oregon, Iceland, and abandoned mines have in common?
- Important issues:
  - Efficiency
  - Redundancy
  - Utilization
  - Security
  - Management overhead

Source: Harper’s (Feb, 2008)

Key Technology: Virtualization

[Figure: Traditional Stack (several Apps on one Operating System on Hardware) vs. Virtualized Stack (Apps on separate OSes, running on a Hypervisor on Hardware)]

3. Different Computing Models

“Why do it yourself if you can pay someone to do it for you?”

- Utility computing
  - Why buy machines when you can rent cycles?
  - Examples: Amazon’s EC2, GoGrid, AppNexus
- Platform as a Service (PaaS)
  - Give me a nice API and take care of the implementation
  - Example: Google App Engine
- Software as a Service (SaaS)
  - Just run it for me!
  - Example: Gmail

4. Web Applications

- What is the nature of future software applications?
  - From the desktop to the browser
  - SaaS == Web-based applications
  - Examples: Google Maps, Facebook
- How do we deliver highly-interactive Web-based applications?
  - AJAX (asynchronous JavaScript and XML)
  - A hack on top of a mistake built on sand, all held together by duct tape and chewing gum?
  - For better, or for worse…

What is the course about?

1. Web-scale problems
2. Large data centers
3. Different models of computing
4. Highly-interactive Web applications

Web-Scale Problems?

- Don’t hold your breath:
  - Biocomputing
  - Nanocomputing
  - Quantum computing
  - …
- It all boils down to…
  - Divide-and-conquer
  - Throwing more hardware at the problem

Simple to understand… a lifetime to master…

Divide and Conquer

[Figure: “Work” is partitioned into w1, w2, w3; each part goes to a “worker”; the workers produce r1, r2, r3, which are combined into the “Result”]

Different Workers

- Different threads in the same core
- Different cores in the same CPU
- Different CPUs in a multi-processor system
- Different machines in a distributed system
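The partition / worker / combine flow can be sketched with Python's standard `concurrent.futures` module. The sum-of-squares task and the function names are illustrative assumptions; here the "workers" are threads in one process, the first kind of worker listed above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def partition(work: List[int], n_parts: int) -> List[List[int]]:
    """Split the work into n_parts contiguous chunks (w1, w2, ...) of near-equal size."""
    size, extra = divmod(len(work), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(work[start:end])
        start = end
    return chunks


def worker(chunk: List[int]) -> int:
    """Compute a partial result (r1, r2, ...): here, the sum of squares of one chunk."""
    return sum(x * x for x in chunk)


def combine(partials: List[int]) -> int:
    """Merge the partial results into the final result."""
    return sum(partials)


def divide_and_conquer(work: List[int], n_workers: int = 3) -> int:
    chunks = partition(work, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))  # one chunk per worker
    return combine(partials)


result = divide_and_conquer(list(range(1, 11)))
print(result)  # sum of the squares of 1..10
```

In CPython, threads mostly do not run CPU-bound code in parallel; `ProcessPoolExecutor` offers the same `map` interface with separate processes, which is closer to "different cores" or "different CPUs".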

Haven’t we been here before?
(Quick tour through parallel and distributed computing)

Flynn’s Taxonomy

                        Instructions
                        Single (SI)                      Multiple (MI)
Data   Single (SD)      SISD: single-threaded process    MISD: pipeline architecture
       Multiple (MD)    SIMD: vector processing          MIMD: multi-threaded processes

SISD

[Figure: one processor applies a single instruction stream to a single data stream (D, D, D, …)]

SIMD

[Figure: one processor applies a single instruction stream to multiple data elements at once (D0, D1, D2, D3, D4, …, Dn)]

MIMD

[Figure: multiple processors, each applying its own instruction stream to its own data stream]

Source: MIT Open Courseware

[Figures: processor and memory organizations. Each diagram shows one or more processors, each with an interface to the external world, exchanging instructions and data with memory (instructions and data); in some diagrams the processors are connected by a network.]

Choices, Choices, Choices

- Commodity vs. “exotic” hardware
- Scale “up” or scale “out”
- Number of machines vs. processor vs. cores
- Bandwidth of memory vs. disk vs. network
- Different programming models

Parallelization Problems

- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?

What is the common theme of all of these problems?
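Several of these questions can be made concrete with a small sketch: a fixed pool handles more work units than workers, failed units are reassigned, and partial results are aggregated once every unit has finished. The `run_all` helper and the `flaky_square` task (which simulates a worker "dying" once) are hypothetical names invented for this illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(units, task, n_workers=2, max_attempts=3):
    """Run task(unit) for every (hashable) unit on a fixed pool, retrying failures."""
    results = {}
    attempts = {unit: 0 for unit in units}
    pending = list(units)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while pending:
            # The pool queues the units; each idle worker picks up the next one.
            futures = {pool.submit(task, unit): unit for unit in pending}
            pending = []
            for future in as_completed(futures):  # yields each future as it finishes
                unit = futures[future]
                attempts[unit] += 1
                try:
                    results[unit] = future.result()
                except Exception:
                    if attempts[unit] >= max_attempts:
                        raise
                    pending.append(unit)  # reassign the failed unit in the next round
    return results


_failed_once = set()
_failed_lock = threading.Lock()


def flaky_square(n):
    """Hypothetical task: unit 3 fails on its first attempt, then succeeds."""
    with _failed_lock:
        first_try = n == 3 and n not in _failed_once
        _failed_once.add(n)
    if first_try:
        raise RuntimeError("worker died")
    return n * n


partials = run_all(range(6), flaky_square)  # 6 work units, 2 workers
total = sum(partials.values())              # aggregate the partial results
print(total)
```

Even this small version needs bookkeeping for attempts, pending units, and shared state, which hints at the common theme the slide asks about.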

General Theme?

- Parallelization problems arise from:
  - Communication between workers
  - Access to shared resources
- Thus, we need a synchronization system!
- This is tricky:
  - Finding bugs is hard
  - Solving bugs is even harder

Managing Multiple Workers

- Difficult because
  - (Often) don’t know the order in which workers run
  - (Often) don’t know where the workers are running
  - (Often) don’t know when workers interrupt each other
- Thus, we need:
  - Semaphores (lock, unlock)
  - Condition variables (wait, notify, broadcast)
  - Barriers
- Still, lots of problems:
  - Deadlock, livelock, race conditions, ...
- Moral of the story: be careful!
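A minimal sketch of two of these primitives using Python's `threading` module: a lock guarding a shared counter, and a barrier that holds every worker until all have finished the first phase. The counter workload is an illustrative assumption. Condition variables are not shown; `threading.Condition` provides `wait`, `notify`, and `notify_all`.

```python
import threading

N_WORKERS = 4
INCREMENTS = 10_000

counter = 0
counter_lock = threading.Lock()          # mutual exclusion (lock / unlock)
barrier = threading.Barrier(N_WORKERS)   # rendezvous point for all workers
phase2_seen = []                         # counter value each worker sees after the barrier


def worker():
    global counter
    # Phase 1: every worker updates the shared counter.
    for _ in range(INCREMENTS):
        with counter_lock:   # without the lock, this read-modify-write can race
            counter += 1
    # Block until all N_WORKERS have finished phase 1.
    barrier.wait()
    # Phase 2: every worker now observes the complete total.
    with counter_lock:
        phase2_seen.append(counter)


threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter, phase2_seen)
```

Removing the barrier would let a fast worker record a partial total, and removing the lock could lose increments; neither bug shows up on every run, which is what makes such errors hard to find.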

“Design Patterns”

Source: Ricardo Guimarães Herrmann

[Figure: three diagrams: a master with its slaves; P nodes connected to C nodes; and P nodes feeding a shared queue of W items that is drained by C nodes]

Rubber, meet road…
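The shared-queue pattern can be sketched with Python's thread-safe `queue.Queue`. Reading P and C in the figure as producers and consumers is an assumption, as are the item format and the sentinel-based shutdown used here.

```python
import queue
import threading

N_PRODUCERS = 3
N_CONSUMERS = 2
ITEMS_PER_PRODUCER = 5
DONE = object()  # sentinel telling a consumer to stop

shared_queue = queue.Queue(maxsize=4)  # bounded: producers block while it is full
results = []
results_lock = threading.Lock()


def producer(pid):
    """Put ITEMS_PER_PRODUCER work items on the shared queue."""
    for i in range(ITEMS_PER_PRODUCER):
        shared_queue.put((pid, i))


def consumer():
    """Take items off the shared queue until a sentinel arrives."""
    while True:
        item = shared_queue.get()
        if item is DONE:
            break
        pid, i = item
        with results_lock:
            results.append(pid * 100 + i)


producers = [threading.Thread(target=producer, args=(p,)) for p in range(N_PRODUCERS)]
consumers = [threading.Thread(target=consumer) for _ in range(N_CONSUMERS)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()
for _ in consumers:
    shared_queue.put(DONE)  # one sentinel per consumer, queued after all real work
for t in consumers:
    t.join()

print(sorted(results))
```

The queue decouples the two sides: producers never need to know which consumer handles an item, and the bound on the queue keeps fast producers from running arbitrarily far ahead.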

Rubber, Meet Road

- Existing tools:
  - pthreads, OpenMP for multi-threaded programming
  - MPI for cluster computing
  - Condor, PBS, SGE, etc. for higher-level job management
- The reality:
  - Lots of one-off solutions, custom code
  - Write your own dedicated library, then program with it
  - Burden on the programmer to explicitly manage everything

Source: Wikipedia

What’s different now?

Source: MIT Open Courseware

Questions?
