The Do's and Don'ts of BI on Hadoop Predictive Analytics & Big Data 2015 November 16, 2015

Author: Everett Cross

The Do’s and Don’ts of BI on Hadoop Predictive Analytics & Big Data 2015 November 16, 2015 David Mariani CEO [email protected]

Hadoop Cheat Sheet

Term: Definition

Hadoop: Software framework that supports running applications on clusters of commodity hardware. The name is also used for the whole ecosystem (HDFS, Hive, MapReduce, etc.).

HDFS: Hadoop Distributed File System. Stores files across multiple machines, replicates data across machines, and understands what data is being processed, when, and by whom.

MapReduce: Programming model for distributed data processing. Map() filters and sorts, while Reduce() performs summary operations.

Spark: Multi-stage in-memory computing framework that has been shown to be up to 100x faster than MapReduce.

YARN: Yet Another Resource Negotiator. Resource management system that manages computing resources on a cluster.

Hive: Data warehouse infrastructure for summarization, query, and analysis.
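To make the MapReduce entry concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the task. It only illustrates the programming model; the function names are hypothetical and nothing here calls Hadoop itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values for one key (here, a sum)."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Spark reads Hadoop data"])
```

On a real cluster the map and reduce calls would run in parallel on different machines; the grouping step in the middle is what the framework provides.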

© 2015 ATSCALE, INC. ALL RIGHTS RESERVED. CONFIDENTIAL & PROPRIETARY

Top Hadoop Myths
1. Hadoop is Slow
2. Hadoop is Immature
3. Hadoop is only relevant for large data sets
4. Hadoop alone can replace your EDW

Top Use Cases for Hadoop
1. Risk modeling
2. Fraud analytics
3. Customer churn
4. Recommendations
5. Ad targeting
6. Transactional analysis
7. Threat analysis
8. Search quality

Hadoop Maturity Survey

Over 60% think of Hadoop as “Game Changing” or “Strategic”

Hadoop’s next phase

Business Intelligence will be the focus of Hadoop’s second phase. Companies that provide Self-Service on Hadoop are 50% more likely to gain value from Big Data.

Source: AtScale Hadoop Maturity Survey (Oct 2015), in partnership with Cloudera, Hortonworks, MapR & Tableau

Top Hadoop Riders

Using Hadoop: Tableau is #1 BI tool for Hadoop
Planning on Hadoop: Excel is #1 BI tool for future Hadoopers

Assess your company @ atscale.com/survey

The Do’s & Don’ts

Don’t | Do
Move & copy data | Query data in place
Have multiple definitions of reality | Create a single semantic layer
Scale up with proprietary hardware | Scale out with your Hadoop cluster
Do relational schemas | Leverage Hadoop’s schema-on-demand
Lock yourself into proprietary stacks | Use open source engines & any BI tool

DON’T:

MOVE DATA

Write Many = Bad

[Diagram layers, top to bottom: ANALYSIS TOOLS, QUERY ENGINE, three MARTs, ETL, DATA WAREHOUSE, INPUT DATA]

• Highly complex
• Lots of people & skillsets
• Multiple copies of data
• Stale data
• Rigid schema
• Tough to change

Write Once = Good

[Diagram layers, top to bottom: ANALYSIS TOOLS, HADOOP, INPUT DATA]

• Simple
• Single layer
• No data copies
• Real time
• Dynamic schema
• Easy to change
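The "write once" idea with a dynamic schema can be sketched in a few lines: raw records are stored a single time, and each reader applies the schema it needs at query time. This is a minimal illustration under stated assumptions; the event records, field names, and the `read_with_schema` helper are hypothetical, not taken from the deck.

```python
import json

# Hypothetical raw events, written once and never reshaped or copied.
RAW_EVENTS = [
    '{"user": "a", "action": "click", "ms": 120}',
    '{"user": "b", "action": "view"}',
    '{"user": "a", "action": "click", "ms": 80, "device": "mobile"}',
]


def read_with_schema(raw_lines, schema):
    """Project each raw record onto `schema` (field -> default) at read time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: record.get(field, default) for field, default in schema.items()}


# Two readers use two different schemas over the same stored data.
v1 = list(read_with_schema(RAW_EVENTS, {"user": None, "action": None}))
v2 = list(read_with_schema(RAW_EVENTS, {"user": None, "device": "unknown"}))
```

Adding the `device` field required no reload and no new copy of the data; only the read-time schema changed.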

DO:

SINGLE SOURCE OF TRUTH

Data Mart Hell


Hadoop = Data Lake


Don’t Forget about the BI Layer

I.T. needs Control & Consistency

The Business Interface for Hadoop

The Business needs Freedom & Self-Service

Querying Data in Place

Leveraging One Semantic Layer
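One way to picture a single semantic layer is a small registry where each business term is defined exactly once and compiled into SQL for whichever tool asks. The sketch below is an assumption-laden illustration, not AtScale's implementation: the `MEASURES` and `DIMENSIONS` registries, the `build_query` helper, and the `sales` table are all hypothetical, and SQLite stands in for a SQL-on-Hadoop engine.

```python
import sqlite3

# Hypothetical semantic layer: every business term has exactly one definition.
MEASURES = {"revenue": "SUM(price * quantity)", "orders": "COUNT(*)"}
DIMENSIONS = {"region": "region", "ship_mode": "ship_mode"}


def build_query(measures, dimensions, table="sales"):
    """Compile business terms into SQL so every tool shares the same definitions."""
    dim_cols = [f"{DIMENSIONS[d]} AS {d}" for d in dimensions]
    measure_cols = [f"{MEASURES[m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(dim_cols + measure_cols)} FROM {table}"
    if dimensions:
        group_cols = ", ".join(DIMENSIONS[d] for d in dimensions)
        sql += f" GROUP BY {group_cols} ORDER BY {group_cols}"
    return sql


# SQLite is only a stand-in engine for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, ship_mode TEXT, price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("east", "air", 10.0, 2), ("east", "rail", 5.0, 1), ("west", "air", 4.0, 5)],
)
rows = conn.execute(build_query(["revenue", "orders"], ["region"])).fetchall()
```

Because "revenue" is defined in one place, two dashboards that ask for it cannot disagree about what it means.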

DO:

SCALE OUT NOT UP

Yahoo! TAO Platform Architecture: How did we load so much so quickly?

[Diagram components: Hadoop (data aggregation & ETL, 2PB cluster); Oracle 11G RAC (data archive & staging); SQL Server Analysis Services 2008 R2 (BI server). Files 1 to N map to partitions 1 to N in each downstream stage. Volumes shown: 1.2TB/day, 135GB/day compressed, 24TB cube per quarter.]

Hadoop: MPP-like performance

Query Run Times (Impala vs. HANA), 60 Million Rows, Time in Seconds

Select Statement | HANA Small | Impala (1 Node) Parquet | Impala (3 Nodes) Parquet | Impala (1 Node) Text | Impala (3 Nodes) Text
select count(*) from lineitem | 1 | 3 | 1 | 74 | 31
select count(*), sum(l_extendedprice) from lineitem | 4 | 12 | 3 | 73 | 29
select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode | 8 | 23 | 5 | 74 | 28
select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus | 10 | 32 | 7 | 74 | 28
Size | (5 Part.) 1.9Gb | (40 files x 80mb) 3.2Gb | (40 files x 80mb) 3.2Gb | (1 file – No Compression) 7.2Gb | (1 file – No Compression) 7.2Gb
Est. Monthly Cost of Production Environment on AWS | $1022 | $175 | $350 | $175 | $350

(HANA m2.xlarge, Impala m1.medium)

Source: http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
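The benchmark figures are easier to compare as ratios. The sketch below computes speedups and a cost ratio from the run times and monthly costs quoted in the table; the assignment of each number to a column is an assumption based on the order in which the values appear, so treat the results as illustrative.

```python
# Run times in seconds for the four queries, in table order (assumed column mapping).
SECONDS = {
    "HANA Small": [1, 4, 8, 10],
    "Impala 1 node, Parquet": [3, 12, 23, 32],
    "Impala 3 nodes, Parquet": [1, 3, 5, 7],
    "Impala 1 node, Text": [74, 73, 74, 74],
    "Impala 3 nodes, Text": [31, 29, 28, 28],
}
# Estimated monthly AWS cost in USD, as quoted in the table.
MONTHLY_COST_USD = {"HANA Small": 1022, "Impala 3 nodes, Parquet": 350}


def speedup(slow, fast):
    """Per-query ratio of the slow configuration's time to the fast one's."""
    return [round(s / f, 1) for s, f in zip(SECONDS[slow], SECONDS[fast])]


# Effect of file format on the same 3-node cluster.
parquet_vs_text = speedup("Impala 3 nodes, Text", "Impala 3 nodes, Parquet")
# Effect of scaling out from 1 to 3 nodes on Parquet.
three_vs_one = speedup("Impala 1 node, Parquet", "Impala 3 nodes, Parquet")
# How many times more the HANA setup costs per month.
cost_ratio = round(
    MONTHLY_COST_USD["HANA Small"] / MONTHLY_COST_USD["Impala 3 nodes, Parquet"], 1
)
```

Under this reading, the 3-node Parquet setup matches or beats HANA Small on every query at roughly a third of the monthly cost, and the columnar format matters more than the node count.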

Hadoop: Linear & Predictable
• Two clusters of the same hardware: one of 18 nodes and one of 36 nodes
• 15TB on 18 nodes, 30TB on 36 nodes (TPC-DS data sets)
• A multi-user workload of TPC-DS queries

Source: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

Performance Consistency and Management

Hadoop: Scale and Speed

• Usability
• Agility
• Simplicity

[Chart: Spark SQL, SSAS, and AtScale compared at 3 dimensions and 7 dimensions; one data label reads 4s]

DO:

SCHEMA-ON-DEMAND

Star Schema = Unnatural!

Example: Key-Values using Maps
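A map-typed column lets each row carry its own set of keys, so new attributes do not require a new dimension table or a schema change. The following is a minimal sketch of grouping by a map key at query time; the rows, the `attrs` column name, and the `count_by_key` helper are hypothetical.

```python
from collections import Counter

# Hypothetical rows: one fixed column plus one map-typed column ("attrs").
ROWS = [
    {"event_id": 1, "attrs": {"browser": "firefox", "country": "US"}},
    {"event_id": 2, "attrs": {"browser": "chrome"}},
    {"event_id": 3, "attrs": {"browser": "chrome", "campaign": "fall"}},
]


def count_by_key(rows, key):
    """Group by attrs[key] at query time; rows lacking the key are skipped."""
    return Counter(row["attrs"][key] for row in rows if key in row["attrs"])


by_browser = count_by_key(ROWS, "browser")
by_campaign = count_by_key(ROWS, "campaign")
```

The `campaign` key appears on only one row, yet it can be queried like any other dimension without touching the table definition.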

DEMO

MOBA Game Analytics

Demo: DOTA 2 – What the User Sees

Demo: DOTA 2 – Raw Data (JSON)

Match Details
Player Details
Player Profile

As Easy As 1, 2, 3

Complex Data Types

DEMO: Complex data types

Demo: DOTA 2 – Use Case 1

Question: Who are the most popular heroes?
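A question like this can be answered directly against nested match records, without flattening them into a star schema first. The sketch below counts hero picks across the players array of each match; the JSON layout and field names are hypothetical and do not reproduce the real DOTA 2 feed.

```python
import json
from collections import Counter

# Hypothetical nested match records; field names are illustrative only.
RAW_MATCHES = """
[{"match_id": 1, "radiant_win": true,
  "players": [{"hero": "Pudge", "team": "radiant"}, {"hero": "Lina", "team": "dire"}]},
 {"match_id": 2, "radiant_win": false,
  "players": [{"hero": "Pudge", "team": "radiant"}, {"hero": "Sniper", "team": "dire"}]}]
"""


def hero_popularity(matches):
    """Count how often each hero appears across all players of all matches."""
    return Counter(player["hero"] for match in matches for player in match["players"])


popularity = hero_popularity(json.loads(RAW_MATCHES))
top_hero, picks = popularity.most_common(1)[0]
```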

Demo: DOTA 2 – Use Case 2

Question: Which heroes have the highest win rate?
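Win rate needs two facts joined per player: which team the hero was on and which team won the match. A minimal sketch, assuming hypothetical match records with a `radiant_win` flag and a `team` field on each player:

```python
from collections import defaultdict

# Hypothetical match records; field names are illustrative only.
MATCHES = [
    {"radiant_win": True,
     "players": [{"hero": "Pudge", "team": "radiant"}, {"hero": "Lina", "team": "dire"}]},
    {"radiant_win": False,
     "players": [{"hero": "Pudge", "team": "radiant"}, {"hero": "Lina", "team": "dire"}]},
    {"radiant_win": False,
     "players": [{"hero": "Sniper", "team": "radiant"}, {"hero": "Lina", "team": "dire"}]},
]


def win_rates(matches):
    """Return wins divided by games played for every hero seen in the matches."""
    wins, games = defaultdict(int), defaultdict(int)
    for match in matches:
        winner = "radiant" if match["radiant_win"] else "dire"
        for player in match["players"]:
            games[player["hero"]] += 1
            wins[player["hero"]] += int(player["team"] == winner)
    return {hero: wins[hero] / games[hero] for hero in games}


rates = win_rates(MATCHES)
best_hero = max(rates, key=rates.get)
```

In practice a minimum number of games per hero would be a sensible filter, since a hero with one lucky match would otherwise top the list.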

Demo: DOTA 2 – Use Case 3

Question: What are the top 3 items associated with the best win rate?
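This question involves an array nested inside each player record (the items carried), which is where complex data types help. The sketch below ranks items by the win rate of the players who held them; the rows and the `top_items` helper are hypothetical, and the sample outcomes are made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-player rows with an array-typed "items" column.
PLAYER_ROWS = [
    {"won": True, "items": ["Blink Dagger", "Power Treads"]},
    {"won": True, "items": ["Blink Dagger", "Black King Bar"]},
    {"won": False, "items": ["Power Treads", "Hand of Midas"]},
    {"won": True, "items": ["Black King Bar", "Power Treads"]},
    {"won": False, "items": ["Hand of Midas"]},
]


def top_items(rows, n=3):
    """Rank items by win rate of the players holding them; ties break by name."""
    wins, uses = defaultdict(int), defaultdict(int)
    for row in rows:
        for item in set(row["items"]):  # count an item once per player
            uses[item] += 1
            wins[item] += int(row["won"])
    rates = {item: wins[item] / uses[item] for item in uses}
    return sorted(rates.items(), key=lambda pair: (-pair[1], pair[0]))[:n]


top3 = top_items(PLAYER_ROWS)
top3_names = [name for name, _ in top3]
```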

DO:

STAY OPEN

Don’t Lock Yourself In
1. Avoid “All in One” Platforms
2. Avoid solutions that move data off cluster
3. “In memory” = “doesn’t scale”
4. Avoid proprietary formats – stick with Hive

So Many SQL on Hadoop Choices…

Feature | Spark SQL | Impala | Hive/Tez | Drill | Presto
Interactive performance | Good | Very Good | Poor | Good | Good
Deployment | YARN | Daemon | YARN | Daemon | Daemon
Low latency queries | Yes | Yes | No | Yes | Yes
Hive compatibility | High | Low | High | Low | Medium
Supports query federation | No | No | No | Yes | Yes
Preferred file format | Parquet | Parquet | ORC | Parquet | RC
Sponsor | Databricks | Cloudera | Hortonworks | MapR | Facebook

SQL & MDX Support

DEMO: Complex data types

The Do’s & Don’ts

Don’t | Value | Do
Move & copy data | Agility | Query data in place
Have multiple definitions of reality | Consistency | Create a single semantic layer
Scale up with proprietary hardware | Scalability | Scale out with your Hadoop cluster
Do relational schemas | Flexibility | Leverage Hadoop’s schema-on-demand
Lock yourself into proprietary stacks | Futureproof | Use open source engines & any BI tool

Thank You! www.atscale.com