© Copyright 2012 EMC Corporation. All rights reserved.
THE ROAD TO BIG DATA ANALYTICS
Introduction to Greenplum Database and HD (Hadoop)
First There Was The Data Warehouse
• A new architecture to host data from multiple sources to support decision-making
• Why the data warehouse exists:
– Centralization of high-value data
– Tools to process data into information
– Highly regulated environment
[Figure: Legacy EDW]
Then The MPP Database Was Introduced
• A new approach to databases was required to handle the new analytics environment
• Why the MPP database exists:
– Data got larger
– Queries got uglier
– Performance became critical
– R/SAS/statistical languages need to run in-database
Now There Is Hadoop
• Traditional systems weren't built to handle the storage/processing needs of Web 2.0
• Why Hadoop exists:
– Data volumes moved to the PB range
– Raw (unstructured) forms of data needed to be processed
– Cost needed to be low
– Processing must scale with storage
Value Of Data Co-Processing With Hadoop
Hadoop And MPP Represent A Paradigm Shift
• Requires a different approach to how you leverage data
• Removes limitations around what data is worth storing or analyzing
• Augments analysis capabilities to create competitive advantages
Initially Used For Web Logs But Now…
• Healthcare – EMR/Claims data
• Financials – Ticker/Social media data
• Retail – Transaction/Customer sentiment data
• Insurance/Automobile – Telemetry data
Different Tools Have Different Strengths
[Figure: spectrum from STRUCTURED to UNSTRUCTURED data. Labels shown: Partitioning, SQL, Indexing, RDBMS, BI Tools, Tables and Schemas, GP MapReduce]
Different Tools Have Different Strengths
[Figure: spectrum from STRUCTURED to UNSTRUCTURED data. Labels shown: Hive, Pig, Schema on load, MapReduce, SequenceFile, Java, Directories, XML, JSON, …, Flat files, No ETL]
Big Data Analytics Requires Both
[Figure: spectrum from STRUCTURED to UNSTRUCTURED data. Labels shown: Partitioning, Indexing, SQL, Hive, RDBMS, BI Tools, Tables and Schemas, Pig, Schema on load, Flat files, MapReduce, SequenceFile, Java, Directories, XML, JSON, …, No ETL, GP MapReduce]
Delivered in a Unified Platform
• One system for multi-structured analysis
• MPP Performance for data load and query
• Massive Scale
• Unified Collaboration, Management & Monitoring
GREENPLUM DATABASE
Industry-Leading Massively Parallel Processing (MPP) Performance
Extreme Performance for Analytics
Greenplum Database
• Optimized for BI and analytics
– Deep integration with statistical packages
– High performance parallel implementations
• Simple and automatic
– Just load and query like any database
– Tables are automatically distributed across nodes
• Extremely scalable
– MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
A Mature Enterprise Platform
CLIENT ACCESS & TOOLS
• CLIENT ACCESS: ODBC, JDBC, OLEDB, MapReduce, etc.
• 3rd PARTY TOOLS: BI Tools, ETL Tools, Data Mining, etc.
• ADMIN TOOLS: Greenplum Command Center, Greenplum Package Manager
PRODUCT FEATURES
• LOADING & EXT. ACCESS: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access, External Table Support
• STORAGE & DATA ACCESS: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes – Btree, Bitmap, etc.
• LANGUAGE SUPPORT: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics
GREENPLUM DATABASE ADAPTIVE SERVICES
• Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)
• Online System Expansion
• Workload Management
• Analytics Extensions
CORE MPP ARCHITECTURE
• Shared-Nothing MPP
• Parallel Dataflow Engine
• Parallel Query Optimizer
• gNet™ Software Interconnect
• Polymorphic Data Storage™
• Scatter/Gather Streaming™ Data Loading
Performance Through Parallelism
Greenplum Database
• Scale-out architecture on standard commodity hardware
• Automatic parallelization
– Load and query like any database
– Automatically distributed tables across all nodes
– No need for manual partitioning or tuning
• Extremely scalable MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
– On-line expansion when adding nodes
[Figure labels: Interconnect, Loading]
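The automatic distribution of tables across nodes can be illustrated with a minimal Python sketch. It assumes rows are placed by hashing a key column, which the slide does not spell out. The names `segment_for`, `distribute`, and `parallel_count`, the segment count of 4, and the use of CRC32 are illustrative choices, not Greenplum internals.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumption: a small cluster with four segments


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution-key value to a segment (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments=NUM_SEGMENTS):
    """Place each row on exactly one segment, chosen by hashing its key."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return segments


def parallel_count(segments):
    """Each segment counts its own rows; the partial counts are then summed."""
    return sum(len(rows) for rows in segments.values())


# Hypothetical table with 1,000 rows keyed by customer_id.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
segments = distribute(rows, "customer_id")
```

Because every row lives on exactly one segment, each segment can scan its own share independently and only the small partial results need to be combined.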
Most Powerful Data Loading Capabilities
Greenplum Database
• Industry-leading performance at 10+ TB per hour per rack
• Scatter-Gather Streaming™ provides true linear scaling
• Support for both large-batch and continuous real-time loading strategies
• Enable complex data transformations "in-flight"
• Transparent interfaces to loading via support files, application, and services
Greenplum load rates scale linearly with the number of racks; others do not. For example, two racks = >20 TB/h.
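The linear-scaling claim reduces to simple arithmetic, sketched below in Python. The 10 TB/h per-rack figure is the lower bound quoted on the slide; the function names and the 100 TB data set are hypothetical.

```python
def aggregate_load_rate_tb_per_hour(racks: int, per_rack_tb_per_hour: float = 10.0) -> float:
    """Under linear scaling, the total rate is the per-rack rate times the rack count."""
    if racks < 0:
        raise ValueError("racks must be non-negative")
    return racks * per_rack_tb_per_hour


def hours_to_load(data_tb: float, racks: int, per_rack_tb_per_hour: float = 10.0) -> float:
    """Time to load a data set of the given size at the aggregate rate."""
    return data_tb / aggregate_load_rate_tb_per_hour(racks, per_rack_tb_per_hour)


two_rack_rate = aggregate_load_rate_tb_per_hour(2)   # the slide's two-rack example
load_time = hours_to_load(100, racks=2)              # hypothetical 100 TB data set
```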
Polymorphic Table Storage™
Greenplum Database
[Figure: TABLE 'CUSTOMER' partitioned by month, Mar '11 through Nov '11, with column-oriented storage for COLD DATA and row-oriented storage for HOT DATA]
• Enable Information Lifecycle Management (ILM)
• Storage types can be mixed within a table or database
– Four table types: heap, row-oriented AO, column-oriented, external
– Block compression: Gzip (levels 1-9), QuickLZ
• Provide the choice of processing model for any table or partition
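One way to picture mixing storage types within a single table is a per-partition policy based on age. The Python sketch below is an illustration only: the four-month threshold, the function names, and the reference date are assumptions, and the slide does not state how the hot/cold boundary is chosen.

```python
from datetime import date

HOT_MONTHS = 4  # assumption: hypothetical cutoff between hot and cold data


def months_between(older: date, newer: date) -> int:
    """Whole calendar months from `older` to `newer`."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def storage_for(partition_month: date, today: date, hot_months: int = HOT_MONTHS) -> str:
    """Recent partitions stay row-oriented; older ones become column-oriented."""
    age = months_between(partition_month, today)
    return "row-oriented" if age < hot_months else "column-oriented"


# Monthly partitions of the CUSTOMER table, Mar '11 through Nov '11.
partitions = [date(2011, month, 1) for month in range(3, 12)]
layout = {
    p.isoformat()[:7]: storage_for(p, today=date(2011, 11, 15)) for p in partitions
}
```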
Parallel Query Optimizer
Greenplum Database
• Cost-based optimization looks for the most efficient plan
• Physical plan contains scans, joins, sorts, aggregations, etc.
• Global planning avoids sub-optimal 'SQL pushing' to segments
• Directly inserts 'motion' nodes for inter-segment communication
PHYSICAL EXECUTION PLAN FROM SQL OR MAPREDUCE
[Figure: physical execution plan tree. Nodes shown: Gather Motion 4:1 (Slice 3); Sort; HashAggregate; HashJoin; Redistribute Motion 4:4 (Slice 1); Hash; HashJoin; HashJoin; Seq Scan on lineitem; Hash; Seq Scan on orders; Seq Scan on customer; Hash; Broadcast Motion 4:4 (Slice 2); Seq Scan on motion]
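The three motion types named in the plan (redistribute, broadcast, gather) can be sketched in a few lines of Python. This is a single-process illustration of the data movement only, assuming four segments and hypothetical `orders` and `customers` rows; it is not how Greenplum implements motions.

```python
from collections import defaultdict

NUM_SEGMENTS = 4  # matches the "4:4" and "4:1" labels in the plan


def redistribute(segments, key, num_segments=NUM_SEGMENTS):
    """Redistribute Motion N:N: rehash every row on `key` to pick its new segment."""
    out = [[] for _ in range(num_segments)]
    for rows in segments:
        for row in rows:
            out[hash(row[key]) % num_segments].append(row)
    return out


def broadcast(segments):
    """Broadcast Motion N:N: every segment receives a full copy of all rows."""
    all_rows = [row for rows in segments for row in rows]
    return [list(all_rows) for _ in segments]


def gather(segments):
    """Gather Motion N:1: collect the rows of all segments in one place."""
    return [row for rows in segments for row in rows]


def local_hash_join(left, right, key):
    """Hash join executed independently on one segment."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical data: orders are stored by order_id, customers by customer_id.
customers = [{"customer_id": c, "name": f"c{c}"} for c in range(5)]
orders = [{"order_id": o, "customer_id": o % 5} for o in range(20)]
orders_by_order = redistribute([orders], "order_id")
customers_by_cust = redistribute([customers], "customer_id")

# Option 1: redistribute orders on the join key so matching rows are co-located.
orders_by_cust = redistribute(orders_by_order, "customer_id")
joined_redistribute = gather(
    [local_hash_join(o, c, "customer_id") for o, c in zip(orders_by_cust, customers_by_cust)]
)

# Option 2: broadcast the small table so every segment can join locally.
joined_broadcast = gather(
    [local_hash_join(o, c, "customer_id")
     for o, c in zip(orders_by_order, broadcast(customers_by_cust))]
)
```

Both options produce the same joined rows. Which one is cheaper depends on the table sizes: broadcasting suits a small table, while redistribution avoids copying a large one to every segment.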
gNet Software Interconnect
A supercomputing-based "soft-switch" responsible for:
– Efficiently pumping streams of data between motion nodes during query-plan execution
– Delivering messages, moving data, collecting results, and coordinating work among the segments in the system
High Performance gNet for Hadoop
– Parallel query access
– Parallel data exchange
High Availability
Greenplum Database
Master Server Data Protection
– Replicated transaction logs for server failure
– Optional RAID protection for drive failures
Upon server failure
– Standby server activated
– Administrator alerted
– Orchestrated failover
Segment Server Data Protection
– Mirrored segments for server failures
– Optional RAID protection for drive failures
Upon server failure
– Mirrored segments take over with no loss of service
– Fast online differential recovery
[Figure labels: Master (×2), Segment (×4)]
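The segment mirroring behavior can be sketched as follows. This Python example is a toy model, assuming each segment's mirror sits on the next host in a ring; the host names `sdw1` to `sdw4`, the `Segment` class, and `fail_host` are hypothetical and do not reflect Greenplum's actual catalog or failover tooling.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    content_id: int
    primary_host: str
    mirror_host: str
    active_role: str = "primary"

    @property
    def serving_host(self) -> str:
        """Host that currently answers queries for this segment."""
        return self.primary_host if self.active_role == "primary" else self.mirror_host


def fail_host(segments, failed_host):
    """Promote the mirror of every segment whose primary ran on the failed host."""
    promoted = []
    for seg in segments:
        if seg.active_role == "primary" and seg.primary_host == failed_host:
            seg.active_role = "mirror"
            promoted.append(seg.content_id)
    return promoted


# Assumed layout: four hosts, each segment mirrored on the next host in the ring.
hosts = ["sdw1", "sdw2", "sdw3", "sdw4"]
segments = [
    Segment(i, hosts[i], hosts[(i + 1) % len(hosts)]) for i in range(len(hosts))
]
promoted = fail_host(segments, "sdw2")
```

After the simulated failure, every segment is still served by a surviving host, which is the "no loss of service" property the slide describes. Recovery of the failed copy is not modeled here.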
Simple To Manage
• Greenplum Command Center
– Complete platform management and control
• Greenplum Package Manager
– Automates install, uninstall, update, and query for analytics extensions
– Supports package migration during upgrade, segment recovery, expansion, and standby initialization
In-Database Analytics
Bringing the power of parallelism to commonly used modeling and analytics functions
In-database analytics
– SAS – HPA, Access, and Scoring Accelerator
– MADlib – An open-source library of advanced analytics functions
– Analytics extensions supported, including
▪ PostGIS - Geospatial support, PL/R - Statistical Computing, PL/Java, PL/Perl, etc.
SAS and Greenplum Partnership
Deliver High-Performance Computing and MAD Analytics
• Access relational data sets for agile analysis
– SAS/ACCESS provides fast, transparent and secure access to Greenplum data.
• Leverage database scalability for rapid model deployment
– SAS Scoring Accelerator publishes models for execution in parallel across the Greenplum cluster.
• Build complex models at massive scales
– SAS HPA Appliance combines SAS In-Memory Analytics with Greenplum parallelism to produce record-breaking scalability and performance.
GREENPLUM HD
Hadoop For The Enterprise
People And Skills Challenges
Greenplum HD
• Establish a strategic vision
– Roadmap for Hadoop and unified analytics
• Hadoop Architecture Services
– POC planning and deployment
– Installation and best practices
• GPHD Training & Education
– Business, Developer, Data Scientist, Administration
• Access to Analytics Workbench
Greenplum HD Platform Delivery
• Simple, efficient and scalable
• Proven at scale in 1,000 node test environment (AWB) with worldwide EMC support
• Purpose-built Hadoop infrastructure
• Pluggable storage layer
• Management & monitoring at scale
Greenplum HD Platform Delivery
[Figure: platform stack. Layers shown: Greenplum Chorus; Greenplum Command Center; Hadoop Tools (Pig, Hive, HBase, Zookeeper, Mahout, etc.); MapReduce Layer; Pluggable Storage Layer (HDFS API); Apache HDFS; Isilon OneFS]
Greenplum HD Platform Delivery
• Spring Hadoop: Integrates Spring and Hadoop frameworks
• Mahout: Scalable machine learning libraries
• HBase: Database for random, real-time read/write access
• Hive: System for SQL-like querying of data on top of HDFS
• Pig: Procedural language that abstracts MapReduce
• Zookeeper: Highly reliable distributed coordination
• MapReduce: Framework for writing scalable data applications
• HDFS: Hadoop Distributed File System
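The MapReduce programming model listed above can be sketched with a word count in plain Python. This runs in a single process and only illustrates the map, shuffle, and reduce phases; it is not the Hadoop API, and the function names and input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data needs big tools", "data moves to the tools"])
```

In a real cluster the map and reduce calls run on many nodes at once, and the shuffle moves data between them; the logic per record stays this small.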
Productivity with Hadoop
• Establish Chorus connection to GPHD cluster
• Browse HDFS files
• Leverage gNet integration to parse HDFS using SQL interface
– Determine inherent data structure
• Collaboration with business, analytics and infrastructure
Integration with Existing Technologies
• Create end-to-end workflows
• Leverage existing skills
[Figure: Data Access & Query Layer. Labels shown: ODBC, JDBC, SQL, Java/Perl/Python, Command Line, HQL, PigLatin, HDFS, PARALLEL QUERY INTEGRATION, PARALLEL IMPORT/EXPORT, GREENPLUM DATABASE, GREENPLUM HD, Greenplum gNet]
Big Data Analytics Requires Both
[Figure labels: STRUCTURED, UNSTRUCTURED]
Greenplum Delivers Big Data in a Unified Analytics Platform