Big Data – Hype or Reality?
by Rick F. van der Lans
R20/Consultancy BV
Copyright © 1991 - 2012 R20/Consultancy B.V., The Hague, The Netherlands. All rights reserved. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photographic, or otherwise, without the explicit written permission of the copyright owners.
Twitter: rick_vanderlans
www.r20.nl
Rick F. van der Lans
Rick F. van der Lans is an independent consultant, lecturer, and author. He specializes in data warehousing, business intelligence, service oriented architectures, and database technology. He is managing director of R20/Consultancy B.V. Rick has been involved in various projects in which SOA, data warehousing, and integration technology were applied.
Rick van der Lans is an internationally acclaimed lecturer. He has lectured professionally for the last twenty years in many European and Middle Eastern countries, the USA, South America, and Australia. He has been invited by several major software vendors to present keynote speeches.
He is the author of several books on computing, including Myths on Computing. Some of these books are available in different languages. Books such as the popular Introduction to SQL and SQL for MySQL Developers are available in English, Dutch, Italian, Chinese, and German and are sold worldwide. This year he released The SQL Guide to Ingres.
As an author for BeyeNetwork.com, writer of whitepapers, chairman of the annual European Data Warehouse and Business Intelligence Conference, and columnist for a few IT magazines, he has close contacts with many vendors.
R20/Consultancy B.V. is located in The Hague, The Netherlands, www.r20.nl.
You can get in touch with Rick via:
Email: [email protected]
Twitter: http://twitter.com/Rick_vanderlans
LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223
Big Data
Size Does Matter
Social Media and Big Data?
Rotterdam Harbor
Many, Many Examples
Utility companies
• Sensors
Retail companies
• Customer loyalty
• RFID
Car manufacturers
• Sensor linked to satellite
Factories
Transport
• Tracking
Many more …
My Record Collection
But What If an Organization has Millions of Customers?
Las Vegas Gambling
BigBank
The Four V’s of Big Data
Volume – the amount of data
Variety – structured, semi-structured, poly-structured, and unstructured data
Velocity – how fast the data is coming in
Variability – variance in meaning
Classic SQL Database Servers to the Rescue?
SQL is Intergalactic DataSpeak
Different Database Workloads
[Figure: workloads along a time axis, each shown with a specialized database and a sql database]
• OLTP: pre-relational database, sql database
• OLCP: OO database, sql database
• OLAP: OLAP database, sql database
• OLXP: xml database, sql database
The Prehistory of Database Servers
[Figure: applications working directly on files, without a database server]
Pre-Relational Database Servers
[Figure: three applications on top of one database server]
Relational/SQL Database Servers
[Figure: three applications on top of one relational/SQL database server]
Declarativeness and Storage Independency
Declarativeness: The developer only has to specify what has to be done, not how it should be done.
Storage independency: The language should hide how data is physically stored and how it is accessed.
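Both properties can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module. The CUSTOMERS rows and the index name are invented for this illustration and do not come from the slides.

```python
import sqlite3

# Hypothetical sample data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERS (NAME TEXT, LOCATION TEXT)")
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?)",
    [("Ann", "New York"), ("Bob", "London"), ("Cid", "New York")],
)

QUERY = "SELECT NAME FROM CUSTOMERS WHERE LOCATION = 'New York' ORDER BY NAME"

# Declarative: the statement says WHAT is wanted, not how to find it.
before_index = conn.execute(QUERY).fetchall()

# Storage independence: adding an index changes how the data can be accessed,
# but the statement itself does not change and returns the same rows.
conn.execute("CREATE INDEX CUSTOMERS_LOCATION ON CUSTOMERS (LOCATION)")
after_index = conn.execute(QUERY).fetchall()

# Procedural contrast: the developer spells out HOW (scan, test, collect, sort).
procedural = []
for name, location in conn.execute("SELECT NAME, LOCATION FROM CUSTOMERS"):
    if location == "New York":
        procedural.append((name,))
procedural.sort()
```

All three result lists contain the same rows; only the procedural variant ties the program to one particular way of reading the data.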
The 8 Stages of Query Processing
1. Query development
2. Query acquisition
3. Query optimization by DBS
4. Data access
5. Data retrieval
6. Database processing
7. Result transmission
8. Application processing
[Figure: users and reports, application, database server, and database, with the eight stages numbered along the path]
Where Does Query Processing Take Place?
In-Application Analytics
In-Database Analytics
[Figure: application, query, and database server for both alternatives]
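The difference between the two alternatives can be sketched as follows, again with Python's built-in sqlite3 module. The SALES table and both function names are hypothetical. SQLite runs inside the application process, so the sketch only shows where the analytical logic lives, not the network effect of moving fewer rows.

```python
import sqlite3

# Hypothetical SALES table, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALES (REGION TEXT, AMOUNT REAL)")
conn.executemany(
    "INSERT INTO SALES VALUES (?, ?)",
    [("North", 10.0), ("North", 30.0), ("South", 5.0)],
)

def in_application_totals(connection):
    """In-application analytics: fetch every row, then aggregate in the application."""
    totals = {}
    for region, amount in connection.execute("SELECT REGION, AMOUNT FROM SALES"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def in_database_totals(connection):
    """In-database analytics: the database aggregates and returns only the result rows."""
    rows = connection.execute(
        "SELECT REGION, SUM(AMOUNT) FROM SALES GROUP BY REGION"
    )
    return dict(rows)
```

Both functions return the same totals. In the first, every row leaves the database; in the second, only one row per region does.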
Parallelization through Partitioning of Tables
  SELECT * FROM CUSTOMERS WHERE LOCATION = 'New York'
[Figure: a database server with a master and three processors]
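A minimal sketch of the idea, assuming the CUSTOMERS table is split over three partitions and a master merges the partial results. The rows and function names are hypothetical. Threads are used only to show the structure; in CPython they do not speed up CPU-bound scans.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical CUSTOMERS rows, split over three partitions (one per processor).
PARTITIONS = [
    [("Ann", "New York"), ("Bob", "London")],
    [("Cid", "New York"), ("Dee", "Paris")],
    [("Eve", "London"), ("Fay", "New York")],
]

def scan_partition(partition, location):
    """Worker: evaluate the WHERE clause on one partition only."""
    return [row for row in partition if row[1] == location]

def select_customers(location):
    """Master: send the same scan to every partition, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        partial_results = pool.map(
            scan_partition, PARTITIONS, [location] * len(PARTITIONS)
        )
        merged = []
        for rows in partial_results:
            merged.extend(rows)
    return merged
```

Calling select_customers("New York") plays the role of the SELECT statement above: each partition is scanned independently and the master concatenates the partial results.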
Where Does Query Processing Take Place?
Classic DBMS
[Figure: application, query, master, and workers in a classic DBMS]
Effect of Partitions on Query Response
[Figure: total throughput plotted against the number of partitions/processors, with a bottleneck marked]
Impact of Query Complexity on Performance
[Figure: response time plotted against the complexity of the query, with a bottleneck marked]
MapReduce to the Rescue?
Google’s MapReduce
MapReduce is a programming model introduced by Google
• Aimed at processing requests on large data sets where the processing can be distributed over a large number of nodes using parallel capabilities
MapReduce is not a programming language
MapReduce ≠ MapReduce
• Each implementation can be different
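The programming model itself can be sketched in a few lines of plain Python. This is a single-process illustration, not any particular implementation; real implementations distribute the map and reduce work over many nodes. The word-count task and all function names are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to each record; map_fn yields (key, value) pairs."""
    for record in records:
        yield from map_fn(record)

def shuffle_phase(pairs):
    """Group all values by key, as a framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and its list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def count_words_map(line):
    """Map function: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def count_words_reduce(word, counts):
    """Reduce function: add up the counts collected for one word."""
    return sum(counts)

def word_count(lines):
    pairs = map_phase(lines, count_words_map)
    return reduce_phase(shuffle_phase(pairs), count_words_reduce)
```

The developer writes only the map and reduce functions; because each record and each key is handled independently, both phases can be spread over many nodes.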
Where Does Query Processing Take Place?
Classic DBMS versus MapReduce
[Figure: application, query, master, and workers, comparing a classic DBMS with MapReduce]
Effect of Partitions on Query Response
[Figure: total throughput plotted against the number of partitions/processors for MapReduce and for a classic DBMS, with a bottleneck marked]
Impact of Query Complexity on Performance
[Figure: response time plotted against the complexity of the query for a classic DBMS and for MapReduce, with a bottleneck marked]
When to Unravel Unstructured Data?
[Figure: two alternatives]
• Incoming unstructured data → loading application (unravel unstructured data) → structured data → reporting & analytics
• Incoming unstructured data → loading application → unstructured data → reporting & analytics (unravel unstructured data)
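The two alternatives can be sketched as follows. The log format, the sample lines, and all names are invented for this illustration; "unravel" here simply means parsing a text line into fields.

```python
import re

# Hypothetical log format, invented for this illustration.
LINE_PATTERN = re.compile(r"(?P<user>\w+) bought (?P<item>\w+) for (?P<price>\d+)")

RAW_LINES = [
    "ann bought radio for 40",
    "bob bought lamp for 15",
    "ann bought lamp for 15",
]

def unravel(line):
    """Turn one unstructured line into a structured record."""
    match = LINE_PATTERN.fullmatch(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return {
        "user": match["user"],
        "item": match["item"],
        "price": int(match["price"]),
    }

def spend_per_user(records):
    """Report that works on structured records."""
    totals = {}
    for record in records:
        totals[record["user"]] = totals.get(record["user"], 0) + record["price"]
    return totals

# Alternative 1: unravel while loading; the store holds structured data.
structured_store = [unravel(line) for line in RAW_LINES]
report_from_structured = spend_per_user(structured_store)

# Alternative 2: load the lines untouched; unravel each time a report runs.
unstructured_store = list(RAW_LINES)
report_from_unstructured = spend_per_user(
    unravel(line) for line in unstructured_store
)
```

Both reports are identical. The first alternative pays the parsing cost once at load time, the second pays it on every report but keeps the original data intact.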
NoSQL Database Servers to the Rescue?
Is NoSQL the Answer?
What was the question again?
Categories of NoSQL Database Servers
Category | Examples
Document stores | Apache’s CouchDB and Jackrabbit; MongoDB; Terrastore; ThruDB; and RavenDB
Wide column stores | Apache’s Hadoop, Apache’s Cassandra, Cloudera’s Hadoop, Hypertable (modeled on Google’s Bigtable), and Amazon’s SimpleDB
Key/value stores | Microsoft’s Azure Table Storage, Oracle’s Berkeley DB, hamsterdb, and illuminate’s iStore
Multi value stores | OpenQM, Rocket U2, and IBM UniVerse and IBM UniData
Graph data stores | InfiniteGraph, Open Query Graph engine, AllegroGraph, and Neo4J
… | …
Data as Documents: A Blog Post
  {
    _id: "A4304",                        // primary key
    author: "nosh",                      // simple value
    date: "22/6/2010",
    title: "Intro to MongoDB",
    text: "MongoDB is an open source..",
    tags: ["webinar", "opensource"],     // array
    comments: [                          // embedded documents
      { author: "mike", date: "11/18/2010", txt: "Did you see the…", votes: 7 },
      …
    ]
  }
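To show how an application works with such a document, the sketch below holds the blog post as a plain Python dict. No MongoDB driver is used, and the helper functions has_tag and total_votes are hypothetical names introduced for the example.

```python
import json

# The blog post from the slide as a plain Python dict. The slide elides the
# remaining comments ("…"), so only the one comment shown is included here.
post = {
    "_id": "A4304",                      # primary key
    "author": "nosh",                    # simple value
    "date": "22/6/2010",
    "title": "Intro to MongoDB",
    "text": "MongoDB is an open source..",
    "tags": ["webinar", "opensource"],   # array
    "comments": [                        # embedded documents
        {"author": "mike", "date": "11/18/2010",
         "txt": "Did you see the…", "votes": 7},
    ],
}

def has_tag(document, tag):
    """True if the document's tag array contains the given tag."""
    return tag in document.get("tags", [])

def total_votes(document):
    """Sum the votes over all embedded comment documents."""
    return sum(c.get("votes", 0) for c in document.get("comments", []))

# A document serializes to JSON as one self-contained unit, with no joins.
serialized = json.dumps(post, ensure_ascii=False)
```

The post, its tags, and its comments travel together as one unit, where a relational design would typically spread them over several tables.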
SQL versus NoSQL
[Figure: an application on a SQL database server next to an application on a NoSQL solution]
The Price to be Paid for NoSQL
No real SQL interface
• HiveQL = Hadoop MapReduce job generator
• Not supported by most BI tools
Limited use of classic reporting and analytical tools
• Many support SQL
Low-level programming environment
• Bad for productivity and maintenance
The developer is the optimizer
No support for built-in analytics
• Program-it-yourself analytics
Non-ACID transaction management
Marriage of SQL and MapReduce to the Rescue?
Marriage of SQL and MapReduce (1)
SQL-MR is a set of built-in and user-defined external table functions
Example:
  SELECT   *
  FROM     GET_NEXT_FLIGHT_1HR (ON DEPARTURES PARTITION BY DESTINATION)
  WHERE    DESTINATION = 'London'
  ORDER BY DEPARTURE_TIME
All the SQL-MR function processing is parallelized
• Including complex group-by operations, time-series analytics, and so on
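The shape of such a statement can be mimicked in plain Python: partition the input rows, run a function on each partition independently, then filter and sort the combined output. This is only a sketch. The slides do not define what GET_NEXT_FLIGHT_1HR computes, so the logic below (flights leaving within one hour of a reference time) is an assumption, and the flight data and function names are invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical DEPARTURES rows: (flight, destination, departure_time).
DEPARTURES = [
    ("KL1001", "London", datetime(2012, 1, 1, 9, 10)),
    ("BA0431", "London", datetime(2012, 1, 1, 9, 45)),
    ("KL1009", "London", datetime(2012, 1, 1, 12, 0)),
    ("AF1241", "Paris", datetime(2012, 1, 1, 9, 30)),
]

def partition_by(rows, key_index):
    """PARTITION BY: group rows so each partition can be processed independently."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    return partitions

def next_flights_1hr(partition, now):
    """Assumed per-partition logic: keep flights leaving within one hour of now."""
    limit = now + timedelta(hours=1)
    return [row for row in partition if now <= row[2] <= limit]

def query(destination, now):
    """Mimics the SELECT: run the function per partition, then filter and sort."""
    rows = []
    for partition in partition_by(DEPARTURES, 1).values():
        rows.extend(next_flights_1hr(partition, now))
    selected = [row for row in rows if row[1] == destination]
    return sorted(selected, key=lambda row: row[2])
```

Because each partition is processed on its own, the per-partition calls are the part a parallel engine could run on different nodes; the loop here runs them one after another.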
Marriage of SQL and MapReduce (2)
An SQL-MR function can contain the most complex analytical logic
Programmers of SQL don’t need to learn a new language; Java, C++, Python, and many more can be used
The SQL statements invoking SQL-MR functions are still declarative and storage-independent
• The functions themselves are not
Usable by any BI tool supporting SQL
Different Database Workloads
[Figure: workloads along a time axis, each shown with a specialized database and a sql database]
• OLTP: pre-relational database, sql database
• OLCP: OO database, sql database
• OLAP: OLAP database, sql database
• OLXP: xml database, sql database
• OLBDP: nosql database, sql ?
Why Do We Need NoSQL Database Servers?
[Figure: performance plotted against time, with curves for maximum needs, average needs, and minimum needs and for the performance offered by SQL database servers; a Performance Gap is marked]
Big Data: Hype or Reality?
MapReduce is not hype
• MapReduce ≠ MapReduce
• MapReduce doesn’t exclude SQL
Big data is not hype
• Applications already operational
The buzz around Hadoop and other NoSQL database servers may turn out to be hype for most companies
• Can you handle the workload with a SQL-based product? If not …