SAS DATA LOADER FOR HADOOP CUSTOMER CHALLENGES AND SOLUTION BENEFITS TASS SEPTEMBER 2015 JAMES WAITE

Author: Eugene Stevens

Copyright © 2012, SAS Institute Inc. All rights reserved.

SAS DATA LOADER FOR HADOOP

AGENDA

• What Is Hadoop?
• Big Data Challenges
• Hadoop Challenges
• Data Loader for Hadoop
• Demo
• Additional Resources

WHAT IS HADOOP?


HADOOP WHAT IT PROVIDES

Open-source Software
• Free to download, use and contribute to

Framework
• All program elements, connections, etc. are provided by the software

Massive Storage
• The framework breaks big data into blocks, which are stored on clusters of commodity hardware

Processing Power
• Concurrently processes large amounts of data using multiple low-cost computers
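The "Massive Storage" point can be illustrated with a toy model. The sketch below is not Hadoop code: the class name, node names, block size, replication factor and round-robin placement are all assumptions chosen for illustration (real HDFS placement is rack-aware). It requires Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of block storage: split a file into fixed-size blocks and place replicas on nodes. */
final class BlockPlacementSketch {

    /** One block of a file plus the (hypothetical) nodes that hold its replicas. */
    record Block(int index, long sizeBytes, List<String> replicaNodes) {}

    static List<Block> place(long fileSizeBytes, long blockSizeBytes,
                             List<String> nodes, int replication) {
        if (blockSizeBytes <= 0 || replication < 1 || replication > nodes.size()) {
            throw new IllegalArgumentException("invalid block size or replication factor");
        }
        List<Block> blocks = new ArrayList<>();
        int index = 0;
        for (long offset = 0; offset < fileSizeBytes; offset += blockSizeBytes, index++) {
            // The last block may be smaller than the configured block size.
            long size = Math.min(blockSizeBytes, fileSizeBytes - offset);
            List<String> replicas = new ArrayList<>();
            // Round-robin placement: an illustrative policy, not the real HDFS one.
            for (int r = 0; r < replication; r++) {
                replicas.add(nodes.get((index + r) % nodes.size()));
            }
            blocks.add(new Block(index, size, List.copyOf(replicas)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        // Assumed values: a 300 MB file, 128 MB blocks, 3 replicas per block.
        for (Block b : place(300 * mb, 128 * mb, nodes, 3)) {
            System.out.println(b);
        }
    }
}
```

Because every block lives on several nodes, losing one machine does not lose data, and each node can process its local blocks in parallel.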


HADOOP WHAT IT OFFERS

Computing Power
• Distributed computing

Flexibility
• No need to preprocess data

Fault Tolerance
• Processing failover, data redundancy

Low Cost
• Open source, runs on commodity hardware

Scalability
• Add unlimited nodes, little administration

TERMINOLOGY TRADITIONAL RDBMS

Primary Key, Foreign Key, Relationship, Normalize, Index, Table, SQL, Constraint, Database, Schema

TERMINOLOGY HADOOP

Cluster, Hadoop, NameNode, DataNode, Block, HDFS, MapReduce, JobTracker, YARN, Hive, Pig, Cloudera

TERMINOLOGY “IT’S ALL GREEK” TO ME (MOST)!

Greco-Roman. It's all Greek to me. The god of thunder. Paradise island. Beautiful architecture. Ancient temples. Yogurt. Salad. Mediterranean. Olympic Games. Greats of literature and philosophy. Tragedy.

BIG DATA DRIVERS AND CUSTOMER CHALLENGES


CHALLENGE HADOOP SKILLS SHORTAGE

Performing even the simplest tasks in Hadoop typically requires mastering disparate tools and writing hundreds of lines of code.

Fact: There are a limited number of users with the necessary Hadoop skills:
• MapReduce
• Pig Latin
• HiveQL
• HDFS
• Sqoop and Oozie

SAS & INTEL STUDY HADOOP ADOPTION & CHALLENGES

Research summary: SAS and Intel asked more than 300 IT managers from the largest companies in Denmark, Finland, Norway and Sweden about their adoption of big data analytics and Hadoop.
http://nordichadoopsurvey.com

Results & Key Findings

Primary reason for considering Hadoop
• 60% cited advanced analytics, data discovery, or use as an analytical lab
• 22% would like to speed up processing

Adoption / Obstacles
• 35% cited “Resources and Competencies”

HADOOP BIG DATA CHALLENGES

Source: Gartner (Sep 2014), “Big Data Investment Grows but Deployments Remain Scarce in 2014”, by Nick Heudecker and Lisa Kart


CHALLENGE HADOOP SKILLS SHORTAGE

/* Sort by subject ID so that BY-group processing can be used. */
proc sort data=dsn out=temp;
  by usubjid;
run;

/* As written, keeps the last row of each usubjid that occurs more than once. */
data unique;
  set temp;
  by usubjid;
  if not first.usubjid and last.usubjid;
run;

/* Keeps the first row of each usubjid, i.e. one row per subject. */
data nodups;
  set temp;
  by usubjid;
  if first.usubjid;
run;
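For readers unfamiliar with SAS BY-group processing, the sketch below mimics the `first.usubjid` / `last.usubjid` flags in plain Java. It is an illustration only: the class, record and method names are hypothetical, the sample rows are made up, and it requires Java 16 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Mimics SAS first./last. BY-group flags on an in-memory list. */
final class ByGroupSketch {

    record Row(String usubjid, String payload) {}

    /** A row together with the flags SAS would expose as first.usubjid / last.usubjid. */
    record Flagged(Row row, boolean first, boolean last) {}

    static List<Flagged> flag(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing(Row::usubjid)); // stable sort by key
        List<Flagged> out = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            String key = sorted.get(i).usubjid();
            boolean first = i == 0 || !sorted.get(i - 1).usubjid().equals(key);
            boolean last = i == sorted.size() - 1 || !sorted.get(i + 1).usubjid().equals(key);
            out.add(new Flagged(sorted.get(i), first, last));
        }
        return out;
    }

    /** Counterpart of "if first.usubjid;": one row per key. */
    static List<Row> noDups(List<Row> rows) {
        return flag(rows).stream().filter(Flagged::first).map(Flagged::row).toList();
    }

    /** Counterpart of "if not first.usubjid and last.usubjid;". */
    static List<Row> lastOfRepeated(List<Row> rows) {
        return flag(rows).stream()
                .filter(f -> !f.first() && f.last())
                .map(Flagged::row)
                .toList();
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("S2", "a"), new Row("S1", "b"),
                new Row("S2", "c"), new Row("S3", "d"));
        System.out.println(noDups(rows));
        System.out.println(lastOfRepeated(rows));
    }
}
```

With the sample rows, only `S2` repeats, so `lastOfRepeated` returns a single row while `noDups` returns one row for each of the three subjects.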

CHALLENGE HADOOP SKILLS SHORTAGE

package org.myorg;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CalculateDistinct {

  // Mapper: emit each input line as a key with a count of 1.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text("");

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      word.set(value.toString());
      output.collect(word, one);
    }
  }

  // Reducer: count how many times each distinct line occurred.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += 1;
        values.next();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Driver: configure the job and submit it to the cluster.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CalculateDistinct.class);
    conf.setJobName("Calculate Distinct");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

javac -classpath hadoop-0.20.1-dev-core.jar -d CalculateDistinct/ CalculateDistinct.java
jar -cvf CalculateDistinct.jar -C CalculateDistinct/ .
hadoop jar CalculateDistinct.jar org.myorg.CalculateDistinct /user/john/in/abc.txt /user/john/out
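To see what the job computes without a cluster, the same map, group and reduce steps can be sketched in memory with Java streams. This is a single-machine illustration under assumed names (`DistinctCountSketch`, `countDistinct`), not a replacement for the MapReduce job.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** In-memory counterpart of the CalculateDistinct job: count occurrences of each distinct line. */
final class DistinctCountSketch {

    static Map<String, Long> countDistinct(List<String> lines) {
        return lines.stream()
                // "Map" step: each line is its own key.
                // "Shuffle" step: groupingBy collects equal keys together.
                // "Reduce" step: counting() tallies the rows in each group.
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the lines of the HDFS input file.
        List<String> lines = List.of("alice", "bob", "alice", "carol", "bob", "alice");
        countDistinct(lines).forEach((value, count) ->
                System.out.println(value + "\t" + count));
    }
}
```

The logic fits in one statement here; the length of the Hadoop version comes from the job wiring, typing and packaging around it, which is the point the slide makes.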


CHALLENGE HADOOP SKILLS

The skill sets required to leverage the many benefits of a Hadoop-driven data environment are substantial, and often require training in many areas.

http://hortonworks.com/training/class/applying-data-science-using-apachehadoop/
http://university.cloudera.com/instructor-led-training/introduction-to-data-science--building-recommender-systems

CHALLENGE USER TOOLS ARE NOT BIG DATA ENABLED

Big data brings new requirements:

• Access to HDFS
• Parallel loads
• New native file types
• Knowledge of file structures
• New languages & code
• Need to transform data in-cluster

User tools are not engineered to process data inside Hadoop.
• Tools are not optimized for Hadoop
• Users move data out of Hadoop to do data management and data quality
• This requires more processing time
• Data is duplicated and more storage is required
• Users do not use the Hadoop platform as it was designed

SOLUTION SAS & HADOOP

SAS has worked closely with the industry leaders in Hadoop development, and has developed tools and solutions that make it easier to use SAS with Hadoop. Rather than asking users to adapt to entirely new languages to leverage Hadoop, SAS has adapted traditional SAS routines and procedures to leverage Hadoop, the end result being “SAS users can stay in SAS”.
• SAS/ACCESS Interface to Hadoop
• DS2 Programming
• SAS Data Loader

SOLUTION SAS & HADOOP

Rather than asking users to adapt to entirely new languages to leverage Hadoop, SAS has adapted traditional SAS routines and procedures to leverage Hadoop, the end result being “SAS users can stay in SAS”.

DS2 Programming: Essentials
https://support.sas.com/edu/schedules.html?id=1798&ctry=CA

DS2 Programming Essentials with Hadoop
https://support.sas.com/edu/schedules.html?id=2468&ctry=CA

THE KEY CHALLENGE CLOSING THE GAPS IN THE DATA TO DECISION LIFECYCLE

[Figure: data-to-decision lifecycle. Roles: Business Analyst, Data Scientist / Statistician, IT Systems / Management, Business Manager. Labels: Time to Decision, Value Captured. Gaps: User Tools Are Not Hadoop Enabled, Hadoop Skill Shortage.]

BIG DATA MANAGEMENT ANALYSTS TAKE

Recommendation: “Use self-service interactive data preparation tools to enhance analyst productivity” and “improve the quality of data.”
Source: Gartner, “Data Preparation Is Not an Afterthought”

THE KEY CHALLENGE CLOSING THE GAPS IN THE DATA TO DECISION LIFECYCLE

[Figure: data-to-decision lifecycle. Roles: Business Analyst, Data Scientist / Statistician, IT Systems / Management, Business Manager. Labels: Time to Decision, Value Captured. Gap: Hadoop Skill Shortage.]

MARKET TRENDS SELF-SERVICE DATA PREPARATION

The rise of self-service data-preparation tools … is putting data management directly into the hands of analysts

SAS Data Loader for Hadoop showcases the company's solid engineering talent and reputation for building high-quality software

Typically, data preparation is 70-80% of the work involved in any analytic project. That number increases as the complexity of the data environment increases.

SAS DATA LOADER FOR HADOOP SOLUTION OVERVIEW


SAS DATA LOADER FOR HADOOP

KEY FEATURES

• Point-and-click UI designed for self-service data preparation
• Leverage existing skills to prepare data on Hadoop as used on other data sources
• Consistency & reuse: apply existing DQ standards on Hadoop data
• Familiar toolset for the end-to-end analytical lifecycle
• Purpose-built to run on Hadoop, keeps it simple and focused
• Enables parallel data movement and data quality tasks without writing code
• Loads data to the SAS LASR Analytic Server
• Big Compute: moves the processing to the data

SAS DATA LOADER FOR HADOOP…

A “purpose-built”, easy-to-use data management solution that specifically addresses acquiring, structuring, cleaning and transforming data inside Hadoop.

SAS Data Loader for Hadoop is a smart approach that turns the Hadoop environment into a productive one, where barriers are removed and data is accessible and usable.

SAS DATA LOADER ENABLES ORGANIZATIONS TO…

• Manage data inside Hadoop
• Reduce complexity of Hadoop
• Accelerate business user adoption

CAPABILITIES - SAS DATA LOADER FOR HADOOP

1 ACQUIRE DATA / DISCOVER DATA
Access data, move it into Hadoop, and assess the data structure and content
• Copy Data to Hadoop
• Profile Data
• Identification Analysis
• Query

2 TRANSFORM DATA
Select data of interest, manipulate it, and structure it into the data format desired
• Query
• Select Columns
• Apply Filters
• Map Columns
• Sort / Order
• Calculate Columns
• Transpose data
• Aggregate
• Transform data

3 CLEANSE DATA
Put data into a consistent format
• Validate
• Parse
• Standardize

4 INTEGRATE DATA
Combine datasets, including data that has no common key, remove duplicate data, and create new data points through aggregation
• Join
• Create Match codes
• Sort & De-duplicate
• Aggregate

5 DELIVER DATA
Load datasets into the SAS LASR in-memory analytic server, create new Hadoop tables, and deliver data to other databases and apps
• Load SAS LASR
• Create tables
• Create views
• Copy from Hadoop
• Run a SAS program
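As a rough idea of what "Profile Data" means in the first stage, the sketch below computes row, missing-value and distinct-value counts per column for a small in-memory table. It is a hypothetical illustration, not how SAS Data Loader implements profiling: the class, record and column names are made up, and it requires Java 16 or later.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Minimal column profile: row count, missing count and distinct count per column. */
final class ColumnProfileSketch {

    record ColumnProfile(long rows, long missing, long distinct) {}

    static Map<String, ColumnProfile> profile(List<String> header, List<String[]> rows) {
        Map<String, ColumnProfile> result = new LinkedHashMap<>();
        for (int c = 0; c < header.size(); c++) {
            long missing = 0;
            Set<String> distinct = new HashSet<>();
            for (String[] row : rows) {
                // Treat short rows, nulls and blank strings as missing values.
                String value = c < row.length ? row[c] : null;
                if (value == null || value.isBlank()) {
                    missing++;
                } else {
                    distinct.add(value.trim());
                }
            }
            result.put(header.get(c), new ColumnProfile(rows.size(), missing, distinct.size()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> header = List.of("id", "city");
        List<String[]> rows = List.of(
                new String[] {"1", "Toronto"},
                new String[] {"2", ""},
                new String[] {"3", "Toronto"},
                new String[] {"4", null});
        profile(header, rows).forEach((column, p) -> System.out.println(column + ": " + p));
    }
}
```

A profile like this is what tells a user, before any transformation, which columns need cleansing and which are usable as join keys.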

INTRODUCING

SAS DATA LOADER FOR HADOOP

Self-service big data preparation for business users

Certified by Hortonworks and Cloudera

PRIMARY AUDIENCE: WHO IS SAS DATA LOADER DESIGNED FOR?

• Business Users
• Data Analysts, Data Scientists, Statisticians
• Data Management Specialists

SAS DATA LOADER FOR HADOOP

BENEFITS
• Users of all skill levels can manage data in Hadoop
• Users can manipulate Hadoop data to fit their specific needs
• No need to write code
• Increases worker productivity and improves data quality
• Leverages the Hadoop cluster, including
  • Parallel processing
  • Minimizes data movement
• Enables reuse of skills you already have
• Unlocks and accesses many types of data

ADDITIONAL RESOURCES


FOR MORE INFORMATION
• Learn more about SAS Data Loader for Hadoop: SAS Data Loader for Hadoop
• Learn more about SAS Data Management: SAS Data Management
• Learn more about SAS Hadoop offerings: SAS Solutions for Hadoop
• Follow us on Twitter: @sasdatamgmt
• Like us on Facebook: SAS Software

TRAINING

THANK YOU!

Big Data Matters Webinar Series:
• Big Data On-Demand Webinar Series

SAS Training:
• Introduction to SAS and Hadoop
• DS2 Programming Essentials with Hadoop
• Data Science: Building Recommender Systems with SAS and Hadoop

USE CASES



USERS BUSINESS USERS

Activities:
• Self-service access to data
• Query and manipulate data
• Copy data to/from Hadoop
• Load data into SAS LASR

USERS DATA ANALYSTS, DATA SCIENTISTS, & STATISTICIANS

Activities:
• Create an analytics-ready dataset
• Discover new data sources
• Transform and manipulate data
• Optional: Write SAS DS2 code
• Load data into SAS LASR server

[Figure labels: Log files, Customer data, Event data; Data Preparation; Analytics-ready dataset]

USERS DATA MANAGEMENT SPECIALISTS

Activities:
• Apply enterprise data management practices to Hadoop
• Manage data with discipline inside Hadoop
• Reuse data quality standards inside Hadoop
• Copy data to/from Hadoop
• Optimize SAS code to run in Hadoop
• Learn from Hadoop data discoveries
• Apply knowledge gained in enterprise environment

BUSINESS USER USE CASE SELF SERVICE BIG DATA ON-BOARDING, EXPLORATION AND DISCOVERY

Activities
• User copies data from a data source into Hadoop
• User profiles the table to learn the structure/content of the data
• User queries the data and creates a new table specific to their needs
• User loads the new table into SAS LASR

[Figure label: SAS® LASR Analytic Server]

BUSINESS USER USE CASE CONTINUES EXPLORATION AND DISCOVERY USING SAS VA…

[Figure labels: SAS Data Loader for Hadoop, SAS Visual Analytics]

DATA SCIENTIST USE CASE BIG DATA PREPARATION FOR ADVANCED ANALYTICS

Activities
• User accesses a previously run profile report showing table information
• User defines a new table
• Creates new columns using calculations
• Pivots / transposes the table
• Uses functions to aggregate variables
• Writes a SAS DS2 program to append records with a calculated score
• Sorts the data and applies filters
• User then loads the table into SAS LASR
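Several of these activities (calculated columns, aggregation, filtering, sorting) can be sketched as one small in-memory pipeline. The example below is illustrative only and does not show Data Loader directives or DS2: the record names, fields, sample data and threshold are all assumptions, and the transpose and scoring steps are left out. It requires Java 16 or later.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Calculate a column, aggregate per customer, filter, and sort, all in memory. */
final class PrepPipelineSketch {

    record Txn(String customer, int quantity, double unitPrice) {}

    record CustomerTotal(String customer, double total) {}

    static List<CustomerTotal> prepare(List<Txn> txns, double minTotal) {
        // Calculated column (quantity * unitPrice) aggregated per customer.
        Map<String, Double> totals = txns.stream()
                .collect(Collectors.groupingBy(
                        Txn::customer,
                        Collectors.summingDouble(t -> t.quantity() * t.unitPrice())));
        return totals.entrySet().stream()
                .map(e -> new CustomerTotal(e.getKey(), e.getValue()))
                .filter(ct -> ct.total() >= minTotal)                                  // apply filter
                .sorted(Comparator.comparingDouble(CustomerTotal::total).reversed())   // sort descending
                .toList();
    }

    public static void main(String[] args) {
        List<Txn> txns = List.of(
                new Txn("A", 2, 10.0),
                new Txn("B", 1, 3.0),
                new Txn("A", 1, 5.0),
                new Txn("C", 4, 10.0));
        prepare(txns, 10.0).forEach(System.out::println);
    }
}
```

The resulting list is the kind of analytics-ready table that would be loaded into an in-memory server as the final step.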


THANK YOU !


www.SAS.com