Big Data Hype or Reality?

Author: Scott Little
Big Data – Hype or Reality? by

Rick F. van der Lans R20/Consultancy BV Copyright © 1991 - 2012 R20/Consultancy B.V., The Hague, The Netherlands. All rights reserved. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photographic, or otherwise, without the explicit written permission of the copyright owners.

Twitter: rick_vanderlans www.r20.nl

Rick F. van der Lans

Rick F. van der Lans is an independent consultant, lecturer, and author. He specializes in data warehousing, business intelligence, service oriented architectures, and database technology. He is managing director of R20/Consultancy B.V. Rick has been involved in various projects in which SOA, data warehousing, and integration technology were applied.

Rick van der Lans is an internationally acclaimed lecturer. He has lectured professionally for the last twenty years in many European and Middle Eastern countries, the USA, South America, and Australia. He has been invited by several major software vendors to present keynote speeches.

He is the author of several books on computing, including Myths on Computing. Some of these books are available in different languages. Books such as the popular Introduction to SQL and SQL for MySQL Developers are available in English, Dutch, Italian, Chinese, and German and are sold worldwide. This year he released The SQL Guide to Ingres.

As an author for BeyeNetwork.com, writer of whitepapers, chairman of the annual European Data Warehouse and Business Intelligence Conference, and columnist for a few IT magazines, he has close contacts with many vendors.

R20/Consultancy B.V. is located in The Hague, The Netherlands, www.r20.nl. You can get in touch with Rick via:
Email: [email protected]
Twitter: http://twitter.com/Rick_vanderlans
LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223

Big Data

Size Does Matter

Social Media and Big Data?

Rotterdam Harbor

Many, Many Examples
Utility companies
• Sensors
Retail companies
• Customer loyalty
• RFID
Car manufacturers
• Sensor linked to satellite
Factories
Transport
• Tracking
Many more …

My Record Collection

But What If an Organization has Millions of Customers?

Las Vegas Gambling

BigBank

The Four V’s of Big Data
Volume – the amount of data
Variety – structured, semistructured, poly-structured, and unstructured data
Velocity – how fast the data is coming in
Variability – variance in meaning

Classic SQL Database Servers to the Rescue?

SQL is Intergalactic DataSpeak

Different Database Workloads
[Figure: database workloads plotted against time, each labeled with the database types that serve it]
OLTP: pre-relational database, then sql database
OLCP: OO database, then sql database
OLAP: OLAP database, then sql database
OLXP: xml database, then sql database

The Prehistory of Database Servers
[Figure: several applications, each accessing its own file]

Pre-Relational Database Servers
[Figure: three applications accessing one database server]

Relational/SQL Database Servers
[Figure: three applications accessing one relational/SQL database server]

Declarativeness and Storage Independency
Declarativeness: The developer only has to program what has to be done, not how it should be done.
Storage independency: The language should hide how data is physically stored and how it is accessed.
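
The two properties can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table, rows, and index name are invented for illustration and do not come from the slides. The query states only what is wanted, and its text stays unchanged when the physical design changes.

```python
import sqlite3

# Hypothetical in-memory table; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", "New York"), ("Bob", "London"), ("Cy", "New York")],
)

# Declarative: the statement says what is wanted, not how to find it.
QUERY = "SELECT name FROM customers WHERE location = ? ORDER BY name"


def customers_in(connection, location):
    """Run the same query text regardless of the physical design."""
    return [row[0] for row in connection.execute(QUERY, (location,))]


before = customers_in(conn, "New York")

# Storage independency: adding an index changes how data is accessed,
# but the query text and its result stay the same.
conn.execute("CREATE INDEX idx_location ON customers (location)")
after = customers_in(conn, "New York")
```

Whether the database server actually uses the index is its own decision; the application code does not need to know.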

The 8 Stages of Query Processing
1. Query development
2. Query acquisition
3. Query optimization by DBS
4. Data access
5. Data retrieval
6. Database processing
7. Result transmission
8. Application processing
[Figure: the numbered stages placed between users and reports, the application, the database server, and the database]

Where Does Query Processing Take Place?
[Figure: in-application analytics versus in-database analytics; in both cases the application sends a query to the database server]
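
As a rough sketch of the difference between in-application and in-database analytics, the two functions below compute the same totals. The first pulls every row into the application and aggregates there; the second lets the database server aggregate and returns only the result. The sales table and its contents are assumptions made for this example.

```python
import sqlite3

# Hypothetical data set, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("south", 5), ("north", 7)],
)


def in_application_totals(connection):
    """All rows travel to the application, which does the analytics."""
    totals = {}
    for region, amount in connection.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0) + amount
    return totals


def in_database_totals(connection):
    """The database server aggregates; only the small result travels."""
    rows = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    return dict(rows)
```

With three rows the difference is invisible; with very large tables the amount of data shipped to the application is what separates the two approaches.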

Parallelization through Partitioning of Tables
SELECT * FROM CUSTOMERS WHERE LOCATION = 'New York'
[Figure: a database server in which a master distributes the query over multiple processors]
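
A minimal sketch of the idea, assuming the CUSTOMERS table is split into three partitions held as plain Python lists (the rows are invented). A master hands the same filter to one worker per partition and concatenates the partial results. Threads are used only to show the structure; this is not how a real database server is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of a CUSTOMERS table, invented for illustration.
PARTITIONS = [
    [{"name": "Ann", "location": "New York"}, {"name": "Bob", "location": "London"}],
    [{"name": "Cy", "location": "New York"}],
    [{"name": "Di", "location": "Paris"}],
]


def scan_partition(partition, location):
    """Worker: apply the WHERE clause to one partition only."""
    return [row for row in partition if row["location"] == location]


def master_query(partitions, location):
    """Master: send the query to every partition, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = pool.map(
            lambda part: scan_partition(part, location), partitions
        )
        return [row for part in partial_results for row in part]
```

Note that every partial result passes back through the single master, which is the kind of component that can become a bottleneck as the number of partitions grows.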

Where Does Query Processing Take Place?
[Figure: classic DBMS; the application sends a query to a master, which distributes the work over workers]

Effect of Partitions on Query Response
[Figure: total throughput (y-axis) against number of partitions/processors (x-axis), with a bottleneck marked]

Impact of Query Complexity on Performance
[Figure: response time (y-axis) against complexity of query (x-axis), with a bottleneck marked]

MapReduce to the Rescue?

Google’s MapReduce
MapReduce is a programming model introduced by Google
• Aimed at processing requests on large data sets where the processing can be distributed over a large number of nodes using parallel capabilities
MapReduce is not a programming language
MapReduce ≠ MapReduce
• Each implementation can be different
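
To make the programming model concrete, here is a small single-process sketch of the three phases: map, shuffle (grouping by key), and reduce. It is not Google's or Hadoop's implementation, and all function names and data are assumptions made for illustration. In a real system the map and reduce calls would be spread over many nodes.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call emits (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def customers_per_location(customers):
    """Example job: count customers per location."""
    def mapper(customer):
        return [(customer["location"], 1)]

    def reducer(_key, values):
        return sum(values)

    return reduce_phase(shuffle(map_phase(customers, mapper)), reducer)
```

Because each mapper call looks at one record and each reducer call looks at one key, both phases can be distributed without the calls needing to coordinate with each other.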

Where Does Query Processing Take Place?
[Figure: classic DBMS versus MapReduce; components shown are the application, the query, the master, and the workers]

Effect of Partitions on Query Response
[Figure: total throughput (y-axis) against number of partitions/processors (x-axis), with one curve for MapReduce and one for the classic DBMS; a bottleneck is marked on the classic DBMS curve]

Impact of Query Complexity on Performance
[Figure: response time (y-axis) against complexity of query (x-axis), with one curve for the classic DBMS and one for MapReduce; a bottleneck is marked]

When to Unravel Unstructured Data?
[Figure: two alternatives]
Option 1: incoming unstructured data, loading application (unravel unstructured data), structured data, reporting & analytics
Option 2: incoming unstructured data, loading application, unstructured data, reporting & analytics (unravel unstructured data)
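
The choice of when to unravel can be sketched as follows. The same parsing step runs either in the loading application, so that structured data is stored, or at reporting time, so that the raw data is stored as it arrived. The log format, the pattern, and all names below are hypothetical.

```python
import re

# Hypothetical format of the incoming unstructured lines.
LINE_PATTERN = re.compile(r"(?P<user>\w+) bought (?P<item>\w+)")


def unravel(line):
    """Extract structure from one raw line; None if it does not match."""
    match = LINE_PATTERN.search(line)
    return match.groupdict() if match else None


def load_structured(lines):
    """Unravel while loading: the store holds structured records."""
    return [record for record in map(unravel, lines) if record is not None]


def load_raw(lines):
    """Load as-is: the store holds the original unstructured lines."""
    return list(lines)


def report_items(store, already_structured):
    """Reporting: unravel here only if loading did not do it."""
    records = store if already_structured else load_structured(store)
    return sorted(record["item"] for record in records)
```

Both paths give the same report; they differ in where the parsing cost is paid and in whether the original data is still available for a different interpretation later.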

NoSQL Database Servers to the Rescue?

Is NoSQL the Answer?

What was the question again?

Categories of NoSQL Database Servers

Category | Examples
Document stores | Apache’s CouchDB and Jackrabbit; MongoDB; Terrastore; ThruDB; and RavenDB
Wide column stores | Apache’s Hadoop, Apache’s Cassandra, Cloudera’s Hadoop, Hypertable (Google’s Bigtable), and Amazon’s SimpleDB
Key/value stores | Microsoft’s Azure Table Storage, Oracle’s Berkeley DB, hamsterdb, and illuminate’s iStore
Multi value stores | OpenQM, Rocket U2, and IBM UniVerse and IBM UniData
Graph data stores | InfiniteGraph, Open Query Graph engine, AllegroGraph, and Neo4J

Data as Documents: A Blog Post

{
  _id: "A4304",
  author: "nosh",
  date: 22/6/2010,
  title: "Intro to MongoDB",
  text: "MongoDB is an open source..",
  tags: ["webinar", "opensource"],
  comments: [{ author: "mike", date: 11/18/2010, txt: "Did you see the…", votes: 7 },….]
}

Annotations on the slide: primary key (_id), simple values, arrays (tags), embedded documents (comments)
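
As a rough illustration of working with such a document from application code, the blog post can be held as a nested Python dictionary. Only the fields and the single comment visible on the slide are included; the dates and the truncated text field are left out because their exact values are not given. The helper functions are hypothetical.

```python
# The blog post as a nested dictionary (subset of the fields on the slide).
blog_post = {
    "_id": "A4304",                      # primary key
    "author": "nosh",                    # simple value
    "title": "Intro to MongoDB",         # simple value
    "tags": ["webinar", "opensource"],   # array
    "comments": [                        # embedded documents
        {"author": "mike", "txt": "Did you see the…", "votes": 7},
    ],
}


def has_tag(post, tag):
    """Check membership in the array field."""
    return tag in post["tags"]


def total_votes(post):
    """Aggregate over the embedded comment documents."""
    return sum(comment["votes"] for comment in post["comments"])


def comment_authors(post):
    """Navigate into the embedded documents without any join."""
    return [comment["author"] for comment in post["comments"]]
```

Everything that belongs to the post is reached by navigating one document, where a relational design would typically need separate tables for tags and comments.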

SQL versus NoSQL
[Figure: an application on top of a SQL database server next to an application on top of a NoSQL solution]

The Price to be Paid for NoSQL
No real SQL interface
• HiveQL = Hadoop MapReduce job generator
• Not supported by most BI tools
Limited use of classic reporting and analytical tools
• Many support SQL
Low-level programming environment
• Bad for productivity and maintenance
The developer is the optimizer
No support for built-in analytics
• Program-it-yourself analytics
Non-ACID transaction management

Marriage of SQL and MapReduce to the Rescue?

Marriage of SQL and MapReduce (1)
SQL-MR is a set of built-in and user-defined external table functions
Example:
SELECT *
FROM GET_NEXT_FLIGHT_1HR (ON DEPARTURES PARTITION BY DESTINATION)
WHERE DESTINATION = 'London'
ORDER BY DEPARTURE_TIME
All the SQL-MR function processing is parallelized
• Including complex group-by operations, time-series analytics, and so on
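
The shape of such a table function can be sketched in plain Python: the input rows are partitioned by a key, a function runs once per partition, and the outputs form a new table that the surrounding query filters and sorts. The slides do not say what GET_NEXT_FLIGHT_1HR computes, so the per-partition function below (earliest departure per destination) is a stand-in, and the flight data is invented.

```python
from collections import defaultdict

# Hypothetical DEPARTURES table, invented for illustration.
DEPARTURES = [
    {"flight": "XX100", "destination": "London", "departure_time": "10:45"},
    {"flight": "XX200", "destination": "London", "departure_time": "09:15"},
    {"flight": "XX300", "destination": "Paris", "departure_time": "08:00"},
]


def partition_by(rows, key):
    """PARTITION BY: group the input rows on one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions


def next_flight(partition_rows):
    """Stand-in per-partition function: keep the earliest departure."""
    return [min(partition_rows, key=lambda row: row["departure_time"])]


def table_function(rows, key, func):
    """Run func once per partition; each call could go to another worker."""
    output = []
    for partition_rows in partition_by(rows, key).values():
        output.extend(func(partition_rows))
    return output


# The declarative part of the query: WHERE and ORDER BY on the function output.
london = sorted(
    (
        row
        for row in table_function(DEPARTURES, "destination", next_flight)
        if row["destination"] == "London"
    ),
    key=lambda row: row["departure_time"],
)
```

The per-partition calls are independent of each other, which is what lets a database server run them in parallel.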

Marriage of SQL and MapReduce (2)
An SQL-MR function can contain the most complex analytical logic
Programmers of SQL don’t need to learn a new language; Java, C++, Python, and many more can be used
The SQL statements invoking SQL-MR functions are still declarative and storage-independent
• The functions themselves are not
Usable by any BI tool supporting SQL

Different Database Workloads
[Figure: database workloads plotted against time, each labeled with the database types that serve it]
OLTP: pre-relational database, then sql database
OLCP: OO database, then sql database
OLAP: OLAP database, then sql database
OLXP: xml database, then sql database
OLBDP: nosql database, then sql ?

Why Do We Need NoSQL Database Servers?
[Figure: performance (y-axis) against time (x-axis), with curves for maximum needs, average needs, minimum needs, and the performance offered by SQL database servers; a performance gap is marked]

Big Data: Hype or Reality?
MapReduce is not hype
• MapReduce ≠ MapReduce
• MapReduce doesn’t exclude SQL
Big data is not hype
• Applications already operational
The buzz around Hadoop and other NoSQL database servers may turn out to be hype for most companies
• Can you handle the workload with a SQL-based product? If not …