The Do’s and Don’ts of BI on Hadoop
Predictive Analytics & Big Data 2015
November 16, 2015
David Mariani, CEO
Hadoop Cheat Sheet
Term | Definition
Hadoop | Software framework that supports running applications on clusters of commodity hardware. Also used to refer to the whole ecosystem (HDFS, Hive, MapReduce, etc.)
HDFS | Hadoop Distributed File System. Stores files across multiple machines. It replicates data across machines and understands what data is being processed, when, and by whom.
MapReduce | Programming model for distributed data processing. Map() filters and sorts while Reduce() performs summary operations.
Spark | Multi-stage in-memory computing framework that has been shown to be up to 100X faster than MapReduce.
YARN | Yet Another Resource Negotiator. Resource management system that manages computing resources on a cluster.
Hive | Data warehouse infrastructure that provides summarization, query and analysis.
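The MapReduce entry in the cheat sheet can be made concrete with a word count, the usual introductory example. This is a minimal single-process sketch of the programming model only, not Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(lines):
    """Map(): turn each input line into (word, 1) pairs."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce(): summarize each key's values (here, by summing them)."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three stages in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))
```

On a real cluster the map and reduce stages run in parallel on many machines; here they run in sequence so the data flow is easy to follow.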
© 2015 ATSCALE, INC. ALL RIGHTS RESERVED. CONFIDENTIAL & PROPRIETARY
Top Hadoop Myths
1. Hadoop is Slow
2. Hadoop is Immature
3. Hadoop is only relevant for large data sets
4. Hadoop alone can replace your EDW
Top Use Cases for Hadoop
1. Risk modeling
2. Fraud analytics
3. Customer churn
4. Recommendations
5. Ad targeting
6. Transactional analysis
7. Threat analysis
8. Search quality
Hadoop Maturity Survey
Over 60% of respondents think of Hadoop as “Game Changing” or “Strategic”
Hadoop’s next phase
Business Intelligence will be the focus of Hadoop’s second phase.
Companies that provide Self-Service on Hadoop are 50% more likely to gain value from Big Data.
Source: AtScale Hadoop Maturity Survey (Oct 2015), in partnership with Cloudera, Hortonworks, MapR & Tableau
Top Hadoop Riders
Using Hadoop: Tableau is the #1 BI tool for Hadoop
Planning on Hadoop: Excel is the #1 BI tool for future Hadoopers
Assess your company @ atscale.com/survey
The Do’s & Don’ts
Don’t | Do
Move & copy data | Query data in place
Have multiple definitions of reality | Create a single semantic layer
Scale up with proprietary hardware | Scale out with your Hadoop cluster
Do relational schemas | Leverage Hadoop’s schema-on-demand
Lock yourself into proprietary stacks | Use open source engines & any BI tool
DON’T:
MOVE DATA
Write Many = Bad
[Diagram, top to bottom: analysis tools, query engine, three data marts, ETL, data warehouse, input data]
• Highly complex
• Lots of people & skillsets
• Multiple copies of data
• Stale data
• Rigid schema
• Tough to change
Write Once = Good
[Diagram, top to bottom: analysis tools, Hadoop, input data]
• Simple
• Single layer
• No data copies
• Real time
• Dynamic schema
• Easy to change
DO:
SINGLE SOURCE OF TRUTH
Data Mart Hell
Hadoop = Data Lake
Don’t Forget about the BI Layer
The Business Interface for Hadoop
• I.T. needs Control & Consistency
• The Business needs Freedom & Self-Service
Querying Data in Place
Leveraging One Semantic Layer
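One way to picture a single semantic layer is as one shared set of dimension and measure definitions that every BI tool's request is compiled against, so that "revenue" cannot end up with two meanings. The sketch below is an illustration under that assumption, not AtScale's implementation; the table, column, and measure names and the `build_query` helper are all hypothetical.

```python
# Hypothetical semantic model: every name here is illustrative.
SEMANTIC_LAYER = {
    "table": "sales",
    "dimensions": {"region": "region", "ship_mode": "ship_mode"},
    "measures": {"revenue": "SUM(extended_price)", "orders": "COUNT(*)"},
}


def build_query(layer, dimensions, measures):
    """Compile a request for named dimensions/measures into one SQL string."""
    dim_cols = [layer["dimensions"][d] for d in dimensions]
    select_parts = dim_cols + [
        f'{layer["measures"][m]} AS {m}' for m in measures
    ]
    sql = f'SELECT {", ".join(select_parts)} FROM {layer["table"]}'
    if dim_cols:
        sql += f' GROUP BY {", ".join(dim_cols)}'
    return sql


# Two different front ends asking for "revenue" get the same definition.
print(build_query(SEMANTIC_LAYER, ["region"], ["revenue"]))
print(build_query(SEMANTIC_LAYER, [], ["revenue", "orders"]))
```

Because the measure expressions live in one place, changing the definition of a measure changes it for every tool that queries through the layer.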
DO:
SCALE OUT NOT UP
Yahoo! TAO Platform Architecture
How did we load so much so quickly?
[Diagram of three stages:]
• Data Aggregation & ETL: Hadoop, 2PB cluster; File 1, File 2, … File N; 1.2TB/day
• Data Archive & Staging: Oracle 11G RAC; Partition 1, Partition 2, … Partition N; 135GB/day compressed
• BI Server: SQL Server Analysis Services 2008 R2; Partition 1, Partition 2, … Partition N; 24TB cube/qtr
Hadoop: MPP-like performance
Query Run Times (Impala vs. HANA), 60 Million Rows, Time in Seconds

Q1: select count(*) from lineitem
Q2: select count(*), sum(l_extendedprice) from lineitem
Q3: select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode
Q4: select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus

Configuration | Q1 | Q2 | Q3 | Q4 | Size | Est. Monthly Cost of Production Environment on AWS
HANA Small | 1 | 4 | 8 | 10 | 1.9 GB (5 Part.) | $1022
Impala (1 Node) Parquet | 3 | 12 | 23 | 32 | 3.2 GB (40 files x 80 MB) | $175
Impala (3 Nodes) Parquet | 1 | 3 | 5 | 7 | 3.2 GB (40 files x 80 MB) | $350
Impala (1 Node) Text | 74 | 73 | 74 | 74 | 7.2 GB (1 file, no compression) | $175
Impala (3 Nodes) Text | 31 | 29 | 28 | 28 | 7.2 GB (1 file, no compression) | $350

Instance types: HANA m2.xlarge, Impala m1.medium
Source: http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
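The run times and monthly costs above can be reduced to a few ratios. The sketch below copies the figures from the slide; summing the four query times into one total is a simplification made here for illustration, not a metric from the cited benchmark, and the helper names are assumptions.

```python
# Run times in seconds for the four queries, copied from the slide.
RUN_TIMES = {
    "HANA Small": [1, 4, 8, 10],
    "Impala (1 node) Parquet": [3, 12, 23, 32],
    "Impala (3 nodes) Parquet": [1, 3, 5, 7],
    "Impala (1 node) Text": [74, 73, 74, 74],
    "Impala (3 nodes) Text": [31, 29, 28, 28],
}

# Estimated monthly AWS cost in USD, copied from the slide.
MONTHLY_COST_USD = {
    "HANA Small": 1022,
    "Impala (1 node) Parquet": 175,
    "Impala (3 nodes) Parquet": 350,
    "Impala (1 node) Text": 175,
    "Impala (3 nodes) Text": 350,
}


def total_seconds(config):
    """Sum of the four query run times for one configuration."""
    return sum(RUN_TIMES[config])


def speedup(slow, fast):
    """How many times faster `fast` finishes the four queries than `slow`."""
    return total_seconds(slow) / total_seconds(fast)


def cost_ratio(expensive, cheap):
    """Monthly cost of `expensive` relative to `cheap`."""
    return MONTHLY_COST_USD[expensive] / MONTHLY_COST_USD[cheap]


print(speedup("Impala (3 nodes) Text", "Impala (3 nodes) Parquet"))
print(speedup("HANA Small", "Impala (3 nodes) Parquet"))
print(cost_ratio("HANA Small", "Impala (3 nodes) Parquet"))
```

The first ratio isolates the effect of the file format on the same three-node cluster; the other two compare the three-node Parquet setup with HANA on time and on cost.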
Hadoop: Linear & Predictable
• Two clusters of the same hardware: one of 18 nodes and one of 36 nodes
• 15TB on 18 nodes, 30TB on 36 nodes TPC-DS data sets
• A multi-user workload of TPC-DS queries
Source: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
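"Linear" here means that doubling both the data and the node count should leave run time roughly unchanged. A rough way to check this is to compare per-node throughput between the two clusters. The cluster sizes below come from the slide; the run times are placeholders invented for illustration and are not results from the cited benchmark.

```python
def throughput_per_node(data_tb, nodes, seconds):
    """Terabytes processed per node per second."""
    return data_tb / nodes / seconds


def scaling_efficiency(small, large):
    """Per-node throughput of the large cluster relative to the small one.

    A value near 1.0 indicates linear scaling; values below 1.0 indicate
    that adding nodes yields less than proportional benefit.
    """
    return throughput_per_node(*large) / throughput_per_node(*small)


# (data in TB, nodes, run time in seconds). The sizes are from the slide;
# the 100-second run times are hypothetical placeholders.
small_cluster = (15, 18, 100.0)
large_cluster = (30, 36, 100.0)

print(scaling_efficiency(small_cluster, large_cluster))
```

Substituting measured run times for the placeholders gives a single number to track as a cluster grows.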
[Figure: Performance; Consistency and Management; Usability; Agility; Simplicity]
Hadoop: Scale and Speed
[Chart comparing Spark SQL, SSAS, and AtScale at 3 dimensions and 7 dimensions; labeled value: 4s]
DO:
SCHEMA-ON-DEMAND
Star Schema = Unnatural!
Example: Key-Values using Maps
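Schema-on-demand with key-value maps can be sketched as follows: raw records keep an open-ended map of attributes, and each query chooses which keys to project at read time, so new keys need no schema change. The record layout, the `attributes` field name, and the `project` helper are assumptions made for illustration; they are not taken from the demo.

```python
import json

# Hypothetical raw events, stored once as JSON lines. Each record carries
# a free-form map of attributes; records need not share the same keys.
RAW_EVENTS = [
    '{"user": "a", "attributes": {"device": "ios", "plan": "pro"}}',
    '{"user": "b", "attributes": {"device": "android"}}',
    '{"user": "c", "attributes": {"device": "ios", "referrer": "ad"}}',
]


def project(raw_lines, keys):
    """Apply a schema at read time: pull the requested keys out of the map.

    Keys missing from a record come back as None instead of failing,
    which is what lets new attributes appear without a table rewrite.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        attrs = record.get("attributes", {})
        row = {"user": record["user"]}
        for key in keys:
            row[key] = attrs.get(key)
        rows.append(row)
    return rows


print(project(RAW_EVENTS, ["device"]))
print(project(RAW_EVENTS, ["plan", "referrer"]))
```

The same stored data answers both queries; only the projection differs.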
DEMO
MOBA Game Analytics
Demo: DOTA 2 – What the User Sees
Demo: DOTA 2 – Raw Data (JSON)
• Match Details
• Player Details
• Player Profile
As Easy As 1, 2, 3
Complex Data Types
DEMO: Complex data types
Demo: DOTA 2 – Use Case 1
Question: Who are the most popular heroes?
Demo: DOTA 2 – Use Case 2
Question: Which heroes have the highest win rate?
Demo: DOTA 2 – Use Case 3
Question: What are the top 3 items associated with the best win rate?
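The three demo questions can be answered directly against nested match records, without first flattening them into a star schema. The sketch below uses a tiny invented data set; the field names (`players`, `hero`, `win`, `items`) and the placeholder hero and item names are assumptions and do not reflect the actual DOTA 2 JSON layout.

```python
from collections import Counter, defaultdict

# Hypothetical nested match records with placeholder names.
MATCHES = [
    {"match_id": 1, "players": [
        {"hero": "hero_a", "win": True, "items": ["item_x", "item_y"]},
        {"hero": "hero_b", "win": False, "items": ["item_z"]},
    ]},
    {"match_id": 2, "players": [
        {"hero": "hero_a", "win": False, "items": ["item_z"]},
        {"hero": "hero_c", "win": True, "items": ["item_x"]},
    ]},
    {"match_id": 3, "players": [
        {"hero": "hero_a", "win": True, "items": ["item_x"]},
        {"hero": "hero_b", "win": True, "items": ["item_y"]},
    ]},
]


def players(matches):
    """Flatten the nested player arrays at read time."""
    for match in matches:
        yield from match["players"]


def hero_popularity(matches):
    """Use case 1: heroes ordered by number of picks."""
    return Counter(p["hero"] for p in players(matches)).most_common()


def hero_win_rates(matches):
    """Use case 2: wins divided by picks for each hero."""
    picks, wins = Counter(), Counter()
    for p in players(matches):
        picks[p["hero"]] += 1
        wins[p["hero"]] += int(p["win"])
    return {hero: wins[hero] / picks[hero] for hero in picks}


def top_items_by_win_rate(matches, n=3):
    """Use case 3: the n items whose holders won most often."""
    results = defaultdict(list)
    for p in players(matches):
        for item in p["items"]:
            results[item].append(p["win"])
    rates = {item: sum(r) / len(r) for item, r in results.items()}
    # Sort by win rate descending, then by name so ties are deterministic.
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

A real analysis would also require a minimum number of picks per hero or item, since win rates based on one or two matches are not meaningful.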
DO:
STAY OPEN
Don’t Lock Yourself In
1. Avoid “All in One” Platforms
2. Avoid solutions that move data off cluster
3. “In memory” = “doesn’t scale”
4. Avoid proprietary formats – stick with Hive
So Many SQL on Hadoop Choices…
Feature | Spark SQL | Impala | Hive/Tez | Drill | Presto
Interactive performance | Good | Very Good | Poor | Good | Good
Deployment | YARN | Daemon | YARN | Daemon | Daemon
Low latency queries | Yes | Yes | No | Yes | Yes
Hive compatibility | High | Low | High | Low | Medium
Supports query federation | No | No | No | Yes | Yes
Preferred file format | Parquet | Parquet | ORC | Parquet | RC
Sponsor | Databricks | Cloudera | Hortonworks | MapR | Facebook
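The comparison above can be treated as data and filtered by requirement. The sketch below encodes the slide's 2015 ratings as they appear; the dictionary keys and the `shortlist` helper are illustrative names, and the ratings are the presenter's assessment rather than independent measurements.

```python
# Feature matrix transcribed from the slide (2015 assessment).
ENGINES = {
    "Spark SQL": {"interactive": "Good", "deployment": "YARN",
                  "low_latency": True, "hive_compat": "High",
                  "federation": False, "file_format": "Parquet"},
    "Impala": {"interactive": "Very Good", "deployment": "Daemon",
               "low_latency": True, "hive_compat": "Low",
               "federation": False, "file_format": "Parquet"},
    "Hive/Tez": {"interactive": "Poor", "deployment": "YARN",
                 "low_latency": False, "hive_compat": "High",
                 "federation": False, "file_format": "ORC"},
    "Drill": {"interactive": "Good", "deployment": "Daemon",
              "low_latency": True, "hive_compat": "Low",
              "federation": True, "file_format": "Parquet"},
    "Presto": {"interactive": "Good", "deployment": "Daemon",
               "low_latency": True, "hive_compat": "Medium",
               "federation": True, "file_format": "RC"},
}


def shortlist(**requirements):
    """Return the engines whose features match every stated requirement."""
    return [
        name for name, features in ENGINES.items()
        if all(features[key] == value for key, value in requirements.items())
    ]


print(shortlist(federation=True))
print(shortlist(low_latency=True, hive_compat="High"))
print(shortlist(deployment="YARN"))
```

Keeping the matrix as data makes it straightforward to revise when an engine's capabilities change.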
SQL & MDX Support
DEMO: Complex data types
The Do’s & Don’ts
Don’t | Value | Do
Move & copy data | Agility | Query data in place
Have multiple definitions of reality | Consistency | Create a single semantic layer
Scale up with proprietary hardware | Scalability | Scale out with your Hadoop cluster
Do relational schemas | Flexibility | Leverage Hadoop’s schema-on-demand
Lock yourself into proprietary stacks | Futureproof | Use open source engines & any BI tool
Thank You! www.atscale.com