Big Data Analytics and Performance Optimization
Rokia Missaoui
LARIM, Université du Québec en Outaouais, Gatineau, Canada
http://w3.uqo.ca/missaoui
Rokia Missaoui
Big Data Analytics & Performance Optimization May 17, 2016 1
Outline
Introduction
  Databases
  Knowledge bases
  Business intelligence: data mining & warehousing
Big data
  Volume, Variety, Velocity, Value, Veracity
Performance and analytics
Challenges and technologies
Conclusion
Bibliography
Introduction
Operational vs informational systems (Welge, 2003)
  Operational systems: databases (OLTP)
  Informational systems (business intelligence, BI): data warehouses (OLAP) and data mining
Data Mining
Extraction of interesting patterns from large amounts of data
Non‐trivial, implicit, previously unknown and potentially useful knowledge
A step of the knowledge discovery process
A confluence of many technologies
Many techniques and algorithms
Synonyms
  Data dredging, data analytics, …
Data Analytics (Welge, 2003)
Decision trees
Association rules
Clustering
Bayesian networks
Time series
Neural networks
Big Data
Large and diverse digital data sets generated from different sources
The Vs of Big Data
Volume
  Massive data to store, manage and analyze
Velocity
  Continuous data streams (sensors, mobile devices)
Variety
  Different data formats: graphs, MDD, images, …
Value
  Analyze data to get business value and make decisions
Veracity
  Trust and integrity
Big Data Everywhere!
User’s Needs of Big Data
Totality
  Growing interest in processing and analyzing all the available data
Exploration and iteration
  No predefined query or goal, but data and knowledge browsing
  Many iterations through the analysis process
Frequency
  Increase the rate of analysis
  Get more accurate and timely patterns for decision making
Why Big Data?
Digital world
  Sensors, Internet, social networks, satellites, mobile devices, …
Overwhelming amounts of data
  Images, video, text, …
Increasing storage capacity
  Petabytes, exabytes, zettabytes, …
Need for high‐value information and knowledge
  Business intelligence (BI)
  Medical applications
  Marketing, CRM
Big Data Processing
Aggregation and statistics
  Data warehouse and OLAP
Indexing, searching, and querying
  Keyword-based search
  Pattern matching (XML/RDF)
Knowledge discovery
  Data mining
  Statistical modeling
Main Big Data Technologies [Fyfe, 2013]
Hadoop
  Low cost, reliable scale-out architecture
  Distributed computing
  Proven success in Fortune 500 companies
  Exploding interest
NoSQL Databases
  Huge horizontal scaling and high availability
  Highly optimized for retrieval and appending
  Types
    Document stores
    Key-value stores
    Graph databases
Analytic Databases (Analytic RDBMS)
  Optimized for bulk-load and fast aggregate query workloads
  Types
    Column-oriented
    MPP
    In-memory
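To make concrete why a column-oriented layout suits aggregate query workloads, here is a minimal sketch in Python. It is not taken from any product named in the slides: the table contents and the helper names `to_columns` and `total_by` are hypothetical. The point is only that an aggregate over one attribute touches one contiguous list instead of every full record.

```python
# Hypothetical sales table stored row by row (one dict per record).
rows = [
    {"region": "East", "product": "A", "amount": 120},
    {"region": "West", "product": "B", "amount": 50},
    {"region": "East", "product": "C", "amount": 80},
]


def to_columns(records):
    """Pivot row-oriented records into one list per attribute (column layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_by(columns, group_col, value_col):
    """Sum value_col per distinct value of group_col, reading only two columns."""
    totals = {}
    for group, value in zip(columns[group_col], columns[value_col]):
        totals[group] = totals.get(group, 0) + value
    return totals


columns = to_columns(rows)
grand_total = sum(columns["amount"])          # scans a single column
totals = total_by(columns, "region", "amount")  # "product" is never read
```

Real analytic databases add compression and vectorized execution on top of this layout; the sketch only illustrates the access pattern.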
Performance Optimization & Analytics
Main big data requirements
  Performance
  Scalability
  Analytics
Performance Optimization & Analytics
Performance optimization
  Efficient database design
    Conceptual and logical modeling based on the users’ needs and the system constraints
    Physical organization (indices, clusters, hashing, compression)
  Query formulation
  Parallel and distributed processing
  Etc.
Parallel & Distributed Processing
Objectives (Valduriez, 2014)
  High performance
    High throughput for OLTP operations in DBs
    Low response time for OLAP queries in data warehouses
  High availability and reliability using data replication
  Scalability
Parallel Processing
  Massively parallel computers with many CPUs, RAM and disk devices
Parallel & Distributed Processing
Distributed Processing
Data partitioning
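Data partitioning can be illustrated with a small hash-partitioning sketch in Python. The slides do not say which partitioning scheme is meant, so hash partitioning is an assumption here, and the record layout and the function names `partition_for` and `hash_partition` are hypothetical. A CRC32 checksum is used instead of Python's built-in `hash()` so that the same key maps to the same partition on every run.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a string key to a partition index with a stable CRC32 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(records, key_field, num_partitions):
    """Distribute records over num_partitions lists according to key_field."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        index = partition_for(record[key_field], num_partitions)
        partitions[index].append(record)
    return partitions


# Hypothetical customer records spread over three servers.
customers = [
    {"id": "c1", "city": "Gatineau"},
    {"id": "c2", "city": "Ottawa"},
    {"id": "c3", "city": "Montreal"},
    {"id": "c4", "city": "Toronto"},
]
partitions = hash_partition(customers, "id", 3)
```

A lookup by `id` then needs to contact only the server given by `partition_for`, whereas a query on another attribute has to visit every partition.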
Hadoop
Two components
  A distributed file system
    The Hadoop distributed file system (HDFS), which supports data in structured (relational), unstructured (text), and semi‐structured (XML) forms
  A data processing engine to handle very large volumes of data in any structure
    The MapReduce programming paradigm for managing applications on multiple distributed servers
Focus on
  Redundancy through replication
  Distributed architectures and parallel processing
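The idea of redundancy through replication can be sketched in a few lines of Python. This is an illustration only: the round-robin placement below is a hypothetical simplification and not the placement policy HDFS actually uses; the node names and the functions `place_replicas` and `available_blocks` are invented for the example. The replication factor of 3 matches the HDFS default.

```python
def place_replicas(block_ids, nodes, replication_factor=3):
    """Assign each block to replication_factor distinct nodes, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            nodes[(i + r) % len(nodes)] for r in range(replication_factor)
        ]
    return placement


def available_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on at least one live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["b0", "b1", "b2", "b3", "b4"], nodes)

# With three replicas per block, losing any two nodes leaves every block readable.
survivors = available_blocks(placement, failed_nodes={"node2", "node3"})
```

The cost of this fault tolerance is storage: each block occupies three times its size across the cluster.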
MapReduce
Parallel programming framework
  Proprietary to Google
  Open-source implementation in Hadoop
  For data analysis of very large data sets
  Data structured as (key, value) pairs
  Functional programming style
Typical usage
  URL access frequencies
  Most important words in documents
  Text pattern matching
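The URL access frequency use case can be sketched as a single-process Python imitation of the map, shuffle and reduce steps. This is not the Hadoop API: the function names, the sample log and the assumption that the URL is the first field of each log line are all hypothetical. In a real deployment the framework runs the map and reduce functions in parallel on many servers and performs the shuffle itself.

```python
from collections import defaultdict


def map_phase(log_line):
    """Map: emit a (url, 1) pair for one access-log line."""
    url = log_line.split()[0]  # assumed format: URL is the first field
    yield (url, 1)


def shuffle(pairs):
    """Shuffle: group all values by key between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one URL."""
    return (key, sum(values))


def url_access_frequencies(log_lines):
    """Run map, shuffle and reduce in sequence over the log lines."""
    pairs = (pair for line in log_lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample_log = [
    "/home 200",
    "/about 200",
    "/home 304",
]
frequencies = url_access_frequencies(sample_log)
```

Because `map_phase` looks at one line at a time and `reduce_phase` at one key at a time, both steps can be spread over many machines without coordination, which is what the (key, value) structure and the functional style make possible.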
Big Data and Analytics
Source: Business Intelligence Strategy: A Framework for Achieving BI Excellence
The Big Data platform Manifesto imperatives and underlying technologies
IBM’s Big Data Platform
Yahoo! ‐ PNUTS
Parallel and distributed database system
Mainly for Web applications
  No need for complex queries
  Aims for performance, scalability and high availability
Used internally at Yahoo!
  DB management
  Social network analysis
  Metadata processing
  Marketing applications
Big Data – SAP HANA (Faye, 2015)
Conclusion
Big data
  A reality
  A buzz‐word
Analytics and performance of Big Data
  Tools and platforms will get better
  New paradigms to expect
  Less hand coding to get value
  Better scalability and performance
  Expect Hadoop, in‐memory and cloud computing to become common
New competencies are needed
  Hiring, training and learning
References
The 4 Vs. IBM. http://www-01.ibm.com/software/data/bigdata/
Abdourahmane Faye. Le Big Data en Entreprise : Révolution et Evolutions. February 2015.
Ian Fyfe. BI for Big Data ‐ Beyond the Hype. Pentaho, 2013.
What is HADOOP, and what are its limitations for Big Data? http://www.paraccel.com/resources/Whitepapers/Hadoop-Limitations-for-Big-Data-ParAccel-Whitepaper.pdf
Patrick Valduriez. Parallel Techniques for Big Data. 2014. http://www-sop.inria.fr/members/Patrick.Valduriez/pmwiki/Patrick/uploads//Conferences/bigdata
M. Welge. Knowledge Discovery from Databases. Talk, Nov. 2003.