Copyright 2012 EMC Corporation. All rights reserved

Author: Brianne Ellis

THE ROAD TO BIG DATA ANALYTICS Introduction to Greenplum Database and HD (Hadoop)


First There Was The Data Warehouse
• A new architecture to host data from multiple sources to support decision-making
• Why the data warehouse exists:
  – Centralization of high-value data
  – Tools to process data into information
  – Highly regulated environment

[Figure: Legacy EDW]

Then The MPP Database Was Introduced
• A new approach to databases was required to handle the new analytics environment
• Why the MPP database exists:
  – Data got larger
  – Queries got uglier
  – Performance became critical
  – R/SAS/statistical languages need to run in-database

Now There Is Hadoop
• Traditional systems weren't built to handle the storage/processing needs of Web 2.0
• Why Hadoop exists:
  – Data volumes moved to the PB range
  – Raw (unstructured) forms of data needed to be processed
  – Cost needed to be low
  – Processing must scale with storage

Value Of Data Co-Processing With Hadoop


Hadoop And MPP Represent A Paradigm Shift
• Requires a different approach to how you leverage data
• Removes limitations around what data is worth storing or analyzing
• Augments analysis capabilities to create competitive advantages

Initially Used For Web Logs But Now…
• Healthcare – EMR/Claims data
• Financials – Ticker/Social media data
• Retail – Transaction/Customer sentiment data
• Insurance/Automobile – Telemetry data

Different Tools Have Different Strengths
[Figure: database tools placed on a spectrum from STRUCTURED to UNSTRUCTURED]
• Partitioning, SQL, Indexing, RDBMS, BI Tools, Tables and Schemas
• GP MapReduce

Different Tools Have Different Strengths
[Figure: Hadoop tools placed on a spectrum from STRUCTURED to UNSTRUCTURED]
• Hive, Pig
• Schema on load, MapReduce, SequenceFile, Java, Directories, XML, JSON, …, Flat files, No ETL

Big Data Analytics Requires Both
[Figure: both tool sets combined on one spectrum from STRUCTURED to UNSTRUCTURED]
• Partitioning, Indexing, SQL, RDBMS, BI Tools, Tables and Schemas
• Hive, Pig
• Schema on load, Flat files, MapReduce, SequenceFile, Java, Directories, XML, JSON, …, No ETL
• GP MapReduce

Delivered in a Unified Platform
• One system for multistructured analysis
• MPP performance for data load and query
• Massive scale
• Unified collaboration, management & monitoring

GREENPLUM DATABASE Industry-Leading Massively Parallel Processing (MPP) Performance


Extreme Performance for Analytics
Greenplum Database
• Optimized for BI and analytics
  – Deep integration with statistical packages
  – High performance parallel implementations
• Simple and automatic
  – Just load and query like any database
  – Tables are automatically distributed across nodes
• Extremely scalable
  – MPP shared-nothing architecture
  – All nodes can scan and process in parallel
  – Linear scalability by adding nodes
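The idea that tables are "automatically distributed across nodes" can be illustrated with a small hash-placement sketch. This is an assumption-laden toy, not Greenplum's implementation: the function names `segment_for` and `distribute`, the `customer_id` key, and the use of CRC32 as the hash are all hypothetical choices made for the example.

```python
import zlib
from collections import defaultdict


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution-key value to a segment index with a stable hash."""
    # CRC32 is used only because it is deterministic across runs;
    # it is a stand-in, not the hash function the database uses.
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments):
    """Assign each row (a dict) to a segment based on its key column."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return segments


# Hypothetical table of 1000 rows spread over 4 segments.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
placement = distribute(rows, "customer_id", num_segments=4)
for seg in sorted(placement):
    print(f"segment {seg}: {len(placement[seg])} rows")
```

Because placement depends only on the key value, every node can work on its own slice without consulting the others, which is the property a shared-nothing design relies on.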

A Mature Enterprise Platform
[Figure: Greenplum Database architecture stack]

CLIENT ACCESS & TOOLS
• Client access: ODBC, JDBC, OLEDB, MapReduce, etc.
• 3rd party tools: BI Tools, ETL Tools, Data Mining, etc.
• Admin tools: Greenplum Command Center, Greenplum Package Manager

PRODUCT FEATURES
• Loading & ext. access: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access, External Table Support
• Storage & data access: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes (Btree, Bitmap, etc.)
• Language support: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics, Analytics Extensions

GREENPLUM DATABASE ADAPTIVE SERVICES
• Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)
• Online System Expansion
• Workload Management

CORE MPP ARCHITECTURE
• Shared-Nothing MPP
• Parallel Dataflow Engine
• Parallel Query Optimizer
• gNet™ Software Interconnect
• Polymorphic Data Storage™
• Scatter/Gather Streaming™ Data Loading

Performance Through Parallelism
Greenplum Database
• Scale-out architecture on standard commodity hardware
• Automatic parallelization
  – Load and query like any database
  – Automatically distributed tables across all nodes
  – No need for manual partitioning or tuning
• Extremely scalable MPP shared-nothing architecture
  – All nodes can scan and process in parallel
  – Linear scalability by adding nodes
  – On-line expansion when adding nodes

[Figure labels: Interconnect, Loading]
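"All nodes can scan and process in parallel" amounts to a scatter-gather pattern: each segment computes a partial result over its own rows, and a coordinator combines the partials. The sketch below shows that structure only. The names `scan_segment` and `parallel_average` and the sample data are hypothetical, and Python threads are used merely to mirror the shape of the computation, not to claim real multi-node speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(rows):
    """Local work on one segment: scan its rows, return a partial (sum, count)."""
    total = sum(r["amount"] for r in rows)
    return total, len(rows)


def parallel_average(segments):
    """Run the local scan on every segment at once, then combine the partials."""
    with ThreadPoolExecutor(max_workers=max(1, len(segments))) as pool:
        partials = list(pool.map(scan_segment, segments))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None


# Four hypothetical segments, each holding its own slice of the table.
segments = [
    [{"amount": 10}, {"amount": 20}],
    [{"amount": 30}],
    [{"amount": 40}, {"amount": 50}, {"amount": 60}],
    [],
]
print(parallel_average(segments))  # 35.0
```

The average is computed from per-segment sums and counts rather than per-segment averages, so the combined answer is exact even when segments hold different numbers of rows.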

Most Powerful Data Loading Capabilities
Greenplum Database
• Industry-leading performance at 10+ TB per hour per rack
• Scatter-Gather Streaming™ provides true linear scaling
• Support for both large-batch and continuous real-time loading strategies
• Enable complex data transformations "in-flight"
• Transparent interfaces to loading via support files, application, and services

Greenplum load rates scale linearly with the number of racks; others do not. For example, two racks = >20 TB/h.
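The linear-scaling claim is simple arithmetic, sketched here under the assumption that each rack contributes exactly the slide's lower bound of 10 TB per hour. The function names and the 100 TB dataset are hypothetical, and real rates depend on hardware and data.

```python
def aggregate_load_rate(racks: int, per_rack_tb_per_hour: float = 10.0) -> float:
    """Total load rate in TB/h if every rack adds the same throughput."""
    if racks < 1:
        raise ValueError("need at least one rack")
    return racks * per_rack_tb_per_hour


def hours_to_load(dataset_tb: float, racks: int, per_rack_tb_per_hour: float = 10.0) -> float:
    """Time to load a dataset at the aggregate rate."""
    return dataset_tb / aggregate_load_rate(racks, per_rack_tb_per_hour)


# A hypothetical 100 TB dataset on 1, 2 and 4 racks.
for racks in (1, 2, 4):
    rate = aggregate_load_rate(racks)
    print(f"{racks} rack(s): {rate} TB/h, {hours_to_load(100.0, racks)} h for 100 TB")
```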

Polymorphic Table Storage™
Greenplum Database
[Figure: TABLE 'CUSTOMER' partitioned by month, Mar '11 through Nov '11, with column-oriented storage for COLD DATA and row-oriented storage for HOT DATA]
• Enable Information Lifecycle Management (ILM)
• Storage types can be mixed within a table or database
  – Four table types: heap, row-oriented AO, column-oriented, external
  – Block compression: Gzip (levels 1-9), QuickLZ
• Provide the choice of processing model for any table or partition
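One way to picture the hot/cold split in the figure is a rule that picks a storage orientation from a partition's age. The sketch below is illustrative only: the four-month threshold, the reference date, and the function names are assumptions, since the slide does not say where the boundary falls or how it is chosen.

```python
from datetime import date


def months_between(older: date, newer: date) -> int:
    """Whole calendar months from `older` to `newer`."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def choose_storage(partition_month: date, today: date, hot_months: int = 4) -> str:
    """Recent (hot) partitions stay row-oriented; older (cold) ones go columnar."""
    # hot_months is a hypothetical policy knob, not a product setting.
    age = months_between(partition_month, today)
    return "row-oriented" if age < hot_months else "column-oriented"


today = date(2011, 11, 1)
partitions = [date(2011, month, 1) for month in range(3, 12)]  # Mar '11 .. Nov '11
for p in partitions:
    print(p.isoformat(), choose_storage(p, today))
```

With these assumed settings the four most recent months are treated as hot and everything older as cold.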


Parallel Query Optimizer
Greenplum Database
• Cost-based optimization looks for the most efficient plan
• Physical plan contains scans, joins, sorts, aggregations, etc.
• Global planning avoids sub-optimal 'SQL pushing' to segments
• Directly inserts 'motion' nodes for inter-segment communication

[Figure: PHYSICAL EXECUTION PLAN FROM SQL OR MAPREDUCE. Plan nodes shown: Gather Motion 4:1 (Slice 3), Sort, HashAggregate, HashJoin (three), Hash (three), Redistribute Motion 4:4 (Slice 1), Broadcast Motion 4:4 (Slice 2), Seq Scan on lineitem, Seq Scan on orders, Seq Scan on customer, Seq Scan on motion]
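The three motion types named in the plan can be mimicked on plain Python lists to show what each one moves. This is a conceptual sketch under assumptions, not the optimizer's code: the helpers `redistribute`, `broadcast`, `gather`, and `local_hash_join`, the CRC32 bucketing, and the sample tables are all hypothetical.

```python
import zlib


def bucket(value, n):
    """Stable stand-in hash used to pick a target segment."""
    return zlib.crc32(str(value).encode("utf-8")) % n


def redistribute(segments, key):
    """Redistribute Motion N:N: rehash rows so equal keys meet on one segment."""
    out = [[] for _ in segments]
    for rows in segments:
        for row in rows:
            out[bucket(row[key], len(segments))].append(row)
    return out


def broadcast(segments):
    """Broadcast Motion N:N: every segment receives a full copy of the input."""
    everything = [row for rows in segments for row in rows]
    return [list(everything) for _ in segments]


def gather(segments):
    """Gather Motion N:1: all segments send their rows to a single receiver."""
    return [row for rows in segments for row in rows]


def local_hash_join(left, right, key):
    """Join done independently on one segment: build a hash table, then probe."""
    table = {}
    for row in right:
        table.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


# Hypothetical data on 4 segments, not yet placed by the join key.
orders_raw = [[{"order_id": i, "cust": i % 5} for i in range(s, 20, 4)] for s in range(4)]
customers_raw = [[{"cust": c, "name": f"customer-{c}"} for c in range(5)], [], [], []]

# Strategy A: redistribute both inputs on the join key, join locally, gather.
orders = redistribute(orders_raw, "cust")
customers = redistribute(customers_raw, "cust")
joined_a = gather([local_hash_join(o, c, "cust") for o, c in zip(orders, customers)])

# Strategy B: broadcast the small table and leave the large one in place.
customers_everywhere = broadcast(customers_raw)
joined_b = gather([local_hash_join(o, c, "cust") for o, c in zip(orders_raw, customers_everywhere)])

print(len(joined_a), len(joined_b))
```

Both strategies return the same rows. Which one is cheaper depends on table sizes, which is the kind of decision the slide attributes to cost-based optimization.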

gNet Software Interconnect
• A supercomputing-based "soft-switch" responsible for
  – Efficiently pumping streams of data between motion nodes during query-plan execution
  – Delivering messages, moving data, collecting results, and coordinating work among the segments in the system
• High Performance gNet for Hadoop
  – Parallel query access
  – Parallel data exchange

High Availability
Greenplum Database

Master Server Data Protection
• Replicated transaction logs for server failure
• Optional RAID protection for drive failures
Upon server failure
• Standby server activated
• Administrator alerted
• Orchestrated failover

Segment Server Data Protection
• Mirrored segments for server failures
• Optional RAID protection for drive failures
Upon server failure
• Mirrored segments take over with no loss of service
• Fast online differential recovery

[Figure: two Master servers above four Segment servers]
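The segment-mirroring behaviour described above can be pictured with a toy model in which each piece of data has a primary copy on one host and a mirror on another. Everything here is an assumption for illustration: the `SegmentPair` class, the `fail_host` function, the host names `sdw1` to `sdw4`, and the ring-style mirror layout are hypothetical and do not describe the product's actual failover logic.

```python
from dataclasses import dataclass


@dataclass
class SegmentPair:
    content_id: int
    primary_host: str
    mirror_host: str
    active: str = "primary"  # which copy currently serves queries

    def serving_host(self) -> str:
        return self.primary_host if self.active == "primary" else self.mirror_host

    def standby_host(self) -> str:
        return self.mirror_host if self.active == "primary" else self.primary_host


def fail_host(pairs, failed_hosts, host):
    """Mark a host as failed; switch affected pairs to their surviving copy.

    Returns the content ids that have no live copy left.
    """
    failed_hosts.add(host)
    unavailable = []
    for pair in pairs:
        if pair.serving_host() in failed_hosts:
            if pair.standby_host() in failed_hosts:
                unavailable.append(pair.content_id)
            else:
                pair.active = "mirror" if pair.active == "primary" else "primary"
    return unavailable


# Hypothetical layout: segment i has its mirror on the next host in a ring.
hosts = ["sdw1", "sdw2", "sdw3", "sdw4"]
pairs = [SegmentPair(i, hosts[i], hosts[(i + 1) % 4]) for i in range(4)]
failed = set()

print(fail_host(pairs, failed, "sdw2"))   # [] : the mirror on sdw3 takes over
print([p.serving_host() for p in pairs])  # ['sdw1', 'sdw3', 'sdw3', 'sdw4']
```

A single host failure leaves every segment served. In this toy layout, losing a second adjacent host would leave one segment with no live copy, which is why the slide lists RAID and mirroring as separate layers of protection.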

Simple To Manage
• Greenplum Command Center
  – Complete platform management and control
• Greenplum Package Manager
  – Automates install, uninstall, update, and query for analytics extensions
  – Supports package migration during upgrade, segment recovery, expansion, and standby initialization

In-Database Analytics
• Bringing the power of parallelism to commonly-used modeling and analytics functions
• In-database analytics
  – SAS: HPA, Access, and Scoring Accelerator
  – MADlib: an open-source library of advanced analytics functions
  – Analytics extensions supported, including
    ▪ PostGIS (geospatial support), PL/R (statistical computing), PL/Java, PL/Perl, etc.

SAS and Greenplum Partnership
Deliver High-Performance Computing and MAD Analytics
• Access relational data-sets for agile analysis
  – SAS/ACCESS provides fast, transparent and secure access to Greenplum data.
• Leverage database scalability for rapid model deployment
  – SAS Scoring Accelerator publishes models for execution in parallel across the Greenplum cluster.
• Build complex models at massive scales
  – SAS HPA Appliance combines SAS In-Memory Analytics with Greenplum parallelism to produce record-breaking scalability and performance.

GREENPLUM HD Hadoop For The Enterprise


People And Skills Challenges
Greenplum HD
• Establish a strategic vision
  – Roadmap for Hadoop and unified analytics
• Hadoop Architecture Services
  – POC planning and deployment
  – Installation and best practices
• GPHD Training & Education
  – Business, Developer, Data Scientist, Administration
• Access to Analytics Workbench

Greenplum HD Platform Delivery
• Simple, efficient and scalable
• Proven at scale in 1,000 node test environment (AWB) with worldwide EMC support
• Purpose-built Hadoop infrastructure
• Pluggable storage layer
• Management & monitoring at scale

Greenplum HD Platform Delivery
[Figure: Greenplum HD stack]
• Greenplum Chorus
• Greenplum Command Center
• Hadoop Tools (Pig, Hive, HBase, Zookeeper, Mahout, etc.)
• MapReduce Layer
• Pluggable Storage Layer (HDFS API)
• Storage options: Apache HDFS, Isilon OneFS

Greenplum HD Platform Delivery
• Spring Hadoop: integrates Spring and Hadoop frameworks
• Mahout: scalable machine learning libraries
• HBase: database for random, real-time read/write access
• Hive: system for SQL-like querying of data on top of HDFS
• Pig: procedural language that abstracts MapReduce
• Zookeeper: highly reliable distributed coordination
• MapReduce: framework for writing scalable data applications
• HDFS: Hadoop Distributed File System
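For readers new to the MapReduce layer listed above, the programming model can be shown in a few lines of single-process Python. This is a teaching sketch of the map, shuffle, and reduce steps, not Hadoop's Java API. The function names and sample input are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


print(word_count(["big data", "Big analytics", "data data"]))
# {'big': 2, 'data': 3, 'analytics': 1}
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data across the network. The sketch keeps only the data flow.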

Productivity with Hadoop
• Establish Chorus connection to GPHD cluster
• Browse HDFS files
• Leverage gNet integration to parse HDFS using SQL interface
  – Determine inherent data structure
• Collaboration with business, analytics and infrastructure

Integration with Existing Technologies
• Create end-to-end workflows
• Leverage existing skills

[Figure: Data Access & Query Layer]
• Access methods: ODBC, JDBC, SQL, Java/Perl/Python, Command Line, HQL, PigLatin, HDFS
• Platforms: GREENPLUM DATABASE and GREENPLUM HD
• Greenplum gNet connects the two: PARALLEL QUERY INTEGRATION, PARALLEL IMPORT/EXPORT

Big Data Analytics Requires Both
[Figure: STRUCTURED and UNSTRUCTURED]

Greenplum Delivers Big Data in a Unified Analytics Platform