SAS DATA LOADER FOR HADOOP CUSTOMER CHALLENGES AND SOLUTION BENEFITS TASS – SEPTEMBER 2015 JAMES WAITE
Copyright © 2012, SAS Institute Inc. All rights reserved.
SAS DATA LOADER FOR HADOOP
AGENDA
• What Is Hadoop?
• Big Data Challenges
• Hadoop Challenges
• Data Loader for Hadoop
• Demo
• Additional Resources
WHAT IS HADOOP?
HADOOP WHAT IT PROVIDES
Open-source Software
• Free to download, use and contribute to
Framework
• All program elements, connections, etc. are provided by the software
Massive Storage
• Framework breaks big data into blocks, which are stored on clusters of commodity hardware
Processing Power
• Concurrently processes large amounts of data using multiple low-cost computers
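As a rough illustration of the Massive Storage point, the sketch below computes how many blocks a file occupies and how much raw capacity it consumes once every block is replicated. The 128 MB block size and the replication factor of 3 are assumed values (both are configurable per cluster), and the class and method names are hypothetical.

```java
final class BlockMath {
    // Assumed values for illustration only; real clusters set these in configuration.
    static final long BLOCK_SIZE_BYTES = 128L * 1024 * 1024;
    static final int REPLICATION = 3;

    // Number of blocks needed to hold a file (the last block may be partial).
    static long blockCount(long fileSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        return (fileSizeBytes + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    static long rawBytesStored(long fileSizeBytes) {
        return Math.max(fileSizeBytes, 0) * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB));     // 8
        System.out.println(rawBytesStored(oneGiB)); // 3221225472
    }
}
```

Under these assumptions a 1 GiB file is split into 8 blocks, and the cluster stores three copies of each.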
HADOOP WHAT IT OFFERS
Computing Power
• Distributed computing
Flexibility
• No need to preprocess data
Fault Tolerance
• Processing failover, data redundancy
Low Cost
• Open source, runs on commodity hardware
Scalability
• Add nodes as needed, with little administration
TERMINOLOGY TRADITIONAL RDBMS
Terms shown: Primary Key, Foreign Key, Relationship, Normalize, Index, Table, SQL, Constraint, Database, Schema
TERMINOLOGY HADOOP
Terms shown: Cluster, Hadoop, NameNode, DataNode, Block, Hive, Pig, JobTracker, YARN, HDFS, MapReduce, Cloudera
TERMINOLOGY “IT’S ALL GREEK” TO ME (MOST)!
Greco-Roman. It's all Greek to me. The god of thunder. Paradise island. Beautiful architecture. Ancient temples. Yogurt. Salad. Mediterranean. Olympic Games. Greats of literature and philosophy. Tragedy.
BIG DATA DRIVERS AND CUSTOMER CHALLENGES
CHALLENGE HADOOP SKILLS SHORTAGE
Performing even the simplest tasks in Hadoop typically requires mastering disparate tools and writing hundreds of lines of code.
Fact: There are a limited number of users with the necessary Hadoop skills
• MapReduce
• Pig Latin
• HiveQL
• HDFS
• Sqoop and Oozie
SAS & INTEL STUDY
HADOOP ADOPTION & CHALLENGES
Research summary: SAS and Intel asked more than 300 IT managers from the largest companies in Denmark, Finland, Norway and Sweden about their adoption of Big Data analytics and Hadoop.
http://nordichadoopsurvey.com
Results & Key Findings
Primary reason for considering Hadoop
• 60% cited advanced analytics, data discovery, or use as an analytical lab
• 22% would like to speed up processing
Adoption / Obstacles
• 35% cited “Resources and Competencies”
HADOOP BIG DATA CHALLENGES
Source: Gartner (Sep 2014), Big Data Investment Grows but Deployments Remain Scarce in 2014, by Nick Heudecker and Lisa Kart
CHALLENGE HADOOP SKILLS SHORTAGE
/* Sort by subject ID so that BY-group processing works */
proc sort data=dsn out=temp;
  by usubjid;
run;

/* Subjects that occur exactly once (first and last row of the BY group coincide) */
data unique;
  set temp;
  by usubjid;
  if first.usubjid and last.usubjid;
run;

/* One row per subject: keep the first row of each BY group */
data nodups;
  set temp;
  by usubjid;
  if first.usubjid;
run;
CHALLENGE HADOOP SKILLS SHORTAGE
// Package name matches the class path used in the "hadoop jar" command below.
package org.myorg;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class CalculateDistinct {

    // Mapper: emits (line, 1) for every input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text("");

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            word.set(value.toString());
            output.collect(word, one);
        }
    }

    // Reducer: writes each distinct key once, with the number of times it occurred.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += 1;
                values.next();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the job; args[0] is the input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CalculateDistinct.class);
        conf.setJobName("Calculate Distinct");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

javac -classpath hadoop-0.20.1-dev-core.jar -d CalculateDistinct/ CalculateDistinct.java
jar -cvf CalculateDistinct.jar -C CalculateDistinct/ .
hadoop jar CalculateDistinct.jar org.myorg.CalculateDistinct /user/john/in/abc.txt /user/john/out
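To make the three phases of the MapReduce job easier to follow, here is a minimal in-memory sketch in plain Java with no Hadoop dependency. It only mimics the map, shuffle and reduce steps on a single machine; the class name `DistinctCountSketch` and its methods are hypothetical and are not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DistinctCountSketch {
    // Map phase: emit a (line, 1) pair for every input line.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(Map.entry(line, 1));
        }
        return pairs;
    }

    // Shuffle phase: group the emitted values by key, in sorted key order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: one output record per distinct key, with its occurrence count.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((key, values) -> counts.put(key, values.size()));
        return counts;
    }

    // Runs the three phases end to end on an in-memory list of lines.
    static Map<String, Integer> run(List<String> lines) {
        return reducePhase(shuffle(mapPhase(lines)));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b", "a", "c", "a"))); // {a=3, b=1, c=1}
    }
}
```

The keys of the result are the distinct input values, which corresponds to what the `nodups` step does in the SAS version.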
CHALLENGE HADOOP SKILLS
The skill sets required to leverage the many benefits of a Hadoop-driven data environment are substantial and often require training in many areas.
http://hortonworks.com/training/class/applying-data-science-using-apachehadoop/
http://university.cloudera.com/instructor-led-training/introduction-to-data-science--building-recommender-systems
CHALLENGE USER TOOLS ARE NOT BIG DATA ENABLED
Big data brings new requirements:
• Access to HDFS
• Parallel loads
• New native file types
• Knowledge of file structures
• New languages & code
• Need to transform data in-cluster
User tools are not engineered to process data inside Hadoop.
• Tools are not optimized for Hadoop
• Users move data out of Hadoop to do data management and data quality
• This requires more processing time
• Data is duplicated and more storage is required
• Users do not use the Hadoop platform as it was designed
SOLUTION SAS & HADOOP
SAS has worked closely with the industry leaders in Hadoop development and has developed tools and solutions that make it easier to use SAS with Hadoop. Rather than asking users to adapt to entirely new languages to leverage Hadoop, SAS has adapted traditional SAS routines and procedures to leverage Hadoop, the end result being “SAS users can stay in SAS”.
• SAS/ACCESS Interface to Hadoop
• DS2 Programming
• SAS Data Loader
SOLUTION SAS & HADOOP
Rather than asking users to adapt to entirely new languages to leverage Hadoop, SAS has adapted traditional SAS routines and procedures to leverage Hadoop, the end result being “SAS users can stay in SAS”.
DS2 Programming: Essentials
https://support.sas.com/edu/schedules.html?id=1798&ctry=CA
DS2 Programming Essentials with Hadoop
https://support.sas.com/edu/schedules.html?id=2468&ctry=CA
THE KEY CHALLENGE CLOSING THE GAPS IN THE DATA TO DECISION LIFECYCLE
Diagram labels:
• Gaps: User Tools Are Not Hadoop Enabled; Hadoop Skill Shortage
• Roles: Business Analyst, Data Scientist / Statistician, IT Systems / Management, Business Manager
• Other labels: Time to Decision, Value Captured
BIG DATA MANAGEMENT ANALYSTS TAKE
Recommendation “Use self-service interactive data preparation tools to enhance analyst productivity.” and “improve the quality of data” – Gartner, “Data Preparation Is Not an Afterthought”
THE KEY CHALLENGE CLOSING THE GAPS IN THE DATA TO DECISION LIFECYCLE
Diagram labels:
• Gaps: Hadoop Skill Shortage
• Roles: Business Analyst, Data Scientist / Statistician, IT Systems / Management, Business Manager
• Other labels: Time to Decision, Value Captured
MARKET TRENDS SELF-SERVICE DATA PREPARATION
The rise of self-service data-preparation tools … is putting data management directly into the hands of analysts
SAS Data Loader for Hadoop showcases the company's solid engineering talent and reputation for building high-quality software
Typically, data preparation is 70-80% of the work involved in any analytic project. That number increases as complexities of the data environment increase.
SAS DATA LOADER FOR HADOOP SOLUTION OVERVIEW
SAS DATA LOADER FOR HADOOP
KEY FEATURES
• Point-and-click UI designed for self-service data preparation
• Reuse the skills you already apply to other data sources to prepare data on Hadoop
• Consistency & reuse: apply existing DQ standards on Hadoop data
• Familiar toolset for the end-to-end analytical lifecycle
• Purpose-built to run on Hadoop, which keeps it simple and focused
• Enables parallel data movement and data quality tasks without writing code
• Loads data to the SAS LASR Analytic Server
• Big Compute: moves the processing to the data
SAS DATA LOADER FOR HADOOP…
A “purpose-built”, easy-to-use data management solution that specifically addresses acquiring, structuring, cleaning and transforming data inside Hadoop.
SAS Data Loader for Hadoop is a smart approach: it turns the Hadoop environment into a productive environment where barriers are removed and data is accessible and usable.
SAS DATA LOADER ENABLES ORGANIZATIONS TO…
Manage data inside Hadoop
Reduce Complexity of Hadoop
Accelerate Business user adoption
CAPABILITIES - SAS DATA LOADER FOR HADOOP
1 ACQUIRE DATA / DISCOVER DATA: Access data, move it into Hadoop, and assess the data structure and content
2 TRANSFORM DATA: Select data of interest, manipulate it, and structure it into the data format desired
3 CLEANSE DATA: Put data into a consistent format
4 INTEGRATE DATA: Combine datasets, including data that has no common key, remove duplicate data, and create new data points through aggregation
5 DELIVER DATA: Load datasets into the SAS LASR in-memory analytic server, create new Hadoop tables, and deliver data to other databases and apps
Tasks across the five steps:
• Copy Data to Hadoop
• Query
• Validate
• Join
• Load SAS LASR
• Profile Data
• Select Columns
• Parse
• Create Match codes
• Create tables
• Identification Analysis
• Apply Filters
• Standardize
• Sort & De-duplicate
• Create views
• Query
• Map Columns
• Aggregate
• Copy from Hadoop
• Sort / Order
• Run a SAS program
• Calculate Columns
• Transpose data
• Aggregate
• Transform data
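The cleanse and integrate steps can be pictured with a small, self-contained Java sketch. It is only an illustration of the ideas (standardize a value, de-duplicate on the standardized key, aggregate by it) and is not how SAS Data Loader implements them; the class `PrepSketch`, its methods, and the simple trim-and-uppercase rule are assumptions made for this example, far simpler than real match codes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class PrepSketch {
    // Cleanse: put a free-text value into one consistent format
    // (trim, collapse runs of whitespace, upper-case).
    static String standardize(String raw) {
        return raw.trim().replaceAll("\\s+", " ").toUpperCase(Locale.ROOT);
    }

    // Integrate: remove duplicates by keeping one entry per standardized key,
    // in order of first appearance.
    static List<String> deduplicate(List<String> rawValues) {
        Set<String> seen = new LinkedHashSet<>();
        for (String raw : rawValues) {
            seen.add(standardize(raw));
        }
        return new ArrayList<>(seen);
    }

    // Integrate: create a new data point by counting rows per standardized key.
    static Map<String, Integer> countByKey(List<String> rawValues) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String raw : rawValues) {
            counts.merge(standardize(raw), 1, Integer::sum);
        }
        return counts;
    }
}
```

For example, `" acme  corp"` and `"ACME CORP"` standardize to the same key, so they collapse to one entry in `deduplicate` and count as two rows in `countByKey`.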
INTRODUCING
SAS DATA LOADER FOR HADOOP
Self-service big data preparation for business users
Certified by Hortonworks and Cloudera
PRIMARY AUDIENCE: WHO IS SAS DATA LOADER DESIGNED FOR?
Business Users
Data Analysts, Data Scientists, Statisticians
Data Management Specialists
SAS DATA LOADER FOR HADOOP
BENEFITS
• Users of all skill levels can manage data in Hadoop
• Users can manipulate Hadoop data to fit their specific needs
• No need to write code
• Increases worker productivity and improves data quality
• Leverages the Hadoop cluster, including
  • Parallel processing
  • Minimizes data movement
• Enables reuse of skills you already have
• Unlocks and accesses many types of data
ADDITIONAL RESOURCES
FOR MORE INFORMATION
• Learn more about SAS Data Loader for Hadoop:
  • SAS Data Loader for Hadoop
• Learn more about SAS Data Management:
  • SAS Data Management
• Learn more about SAS Hadoop offerings:
  • SAS Solutions for Hadoop
• Follow us on Twitter: @sasdatamgmt
• Like us on Facebook: SAS Software
TRAINING
• Big Data Matters Webinar Series:
  • Big Data On-Demand Webinar Series
• SAS Training:
  • Introduction to SAS and Hadoop
  • DS2 Programming Essentials with Hadoop
  • Data Science: Building Recommender Systems with SAS and Hadoop
THANK YOU!
USE CASES
USERS BUSINESS USERS
Activities:
• Self-service access to data
• Query and manipulate data
• Copy data to/from Hadoop
• Load data into SAS LASR
USERS DATA ANALYSTS, DATA SCIENTISTS, & STATISTICIANS
Activities:
• Create an analytics-ready dataset
• Discover new data sources
• Transform and manipulate data
• Optional: Write SAS DS2 code
• Load data into SAS LASR server
Diagram: log files, customer data, and event data go through data preparation to produce an analytics-ready dataset.
USERS DATA MANAGEMENT SPECIALISTS
Activities:
• Apply enterprise data management practices to Hadoop
• Manage data with discipline inside Hadoop
• Reuse data quality standards inside Hadoop
• Copy data to/from Hadoop
• Optimize SAS code to run in Hadoop
• Learn from Hadoop data discoveries
• Apply knowledge gained in enterprise environment
BUSINESS USER USE CASE SELF SERVICE BIG DATA ON-BOARDING, EXPLORATION AND DISCOVERY
Activities
• User copies data from a data source into Hadoop
• User profiles the table to learn the structure/content of the data
• User queries the data and creates a new table specific to their needs
• User loads the new table into SAS LASR
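To give a feel for what profiling a table means in the activities above, the sketch below computes three basic statistics for one column: row count, missing values, and distinct values. It is a hypothetical, in-memory illustration and not the profile report that SAS Data Loader produces; the class `ColumnProfile` and its fields are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ColumnProfile {
    final int rowCount;      // total number of values in the column
    final int missingCount;  // values that are null or blank
    final int distinctCount; // distinct non-missing values

    ColumnProfile(int rowCount, int missingCount, int distinctCount) {
        this.rowCount = rowCount;
        this.missingCount = missingCount;
        this.distinctCount = distinctCount;
    }

    // Profile one column held in memory as a list of strings.
    static ColumnProfile of(List<String> values) {
        int missing = 0;
        Set<String> distinct = new HashSet<>();
        for (String value : values) {
            if (value == null || value.trim().isEmpty()) {
                missing++;
            } else {
                distinct.add(value);
            }
        }
        return new ColumnProfile(values.size(), missing, distinct.size());
    }
}
```

A high missing count or an unexpectedly low distinct count is the kind of signal a user would look for before querying the data into a new table.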
SAS® LASR ANALYTIC SERVER
BUSINESS USER USE CASE CONTINUES EXPLORATION AND DISCOVERY USING SAS VA…
SAS Data Loader for Hadoop
SAS Visual Analytics
DATA SCIENTIST USE CASE BIG DATA PREPARATION FOR ADVANCED ANALYTICS
Activities
• User accesses a previously run profile report showing table information
• User defines a new table
• Creates new columns using calculations
• Pivots / transposes the table
• Uses functions to aggregate variables
• Writes a SAS DS2 program to append records with a calculated score
• Sorts the data and applies filters
• User then loads the table into SAS LASR
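The pivot and aggregate activities can be sketched in a few lines of plain Java. This is an assumed, in-memory illustration of turning a "long" table (one row per id and variable) into a "wide" one (one record per id), summing repeated values along the way; the class `TransposeSketch` and its `Row` type are hypothetical and unrelated to the SAS DS2 or Data Loader implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TransposeSketch {
    // One row of a "long" table: an identifier, a variable name, and a value.
    static final class Row {
        final String id;
        final String variable;
        final double value;

        Row(String id, String variable, double value) {
            this.id = id;
            this.variable = variable;
            this.value = value;
        }
    }

    // Pivot long rows into one wide record per id.
    // Repeated (id, variable) pairs are aggregated by summing their values.
    static Map<String, Map<String, Double>> transpose(List<Row> rows) {
        Map<String, Map<String, Double>> wide = new TreeMap<>();
        for (Row row : rows) {
            wide.computeIfAbsent(row.id, k -> new TreeMap<>())
                .merge(row.variable, row.value, Double::sum);
        }
        return wide;
    }
}
```

Each inner map plays the role of one wide row, with variable names as column headers.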
THANK YOU !
www.SAS.com