HSD. W Business Analytics (M.Sc.) IT in Business Analytics

Hochschule Düsseldorf University of Applied Scienses Fachbereich Wirtschaftswissenschaften Faculty of Business Studies HSD W Business Analytics (M.S...
Author: Chloe Gilmore
5 downloads 3 Views 4MB Size
Hochschule Düsseldorf University of Applied Scienses

Fachbereich Wirtschaftswissenschaften Faculty of Business Studies

HSD W Business Analytics (M.Sc.) IT in Business Analytics

IT Applications in Business Analytics SS2016 / Lecture 03 – Tools, Technologies, Data Sources Thomas Zeutschler

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

1

Let’s get started…

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

2

Sequence of Lectures 1

Introduction 1st April 2016

2

Methodology and process model for analytics (CRISP DM)

3

Tools, technologies and data sources

4

The R Programming Language

5

KNIME

6

Case Study 1

9

Case Study 2

12 Case Study 3 15 Wrap Up HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

8th

July 2016

Theory

Tools Training Hands On Case Studies

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

3

Database Systems – Introduction

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

4

Basic Concept  A Database is a collection of data organized in some way  grouped on what they refer to (business), and on data types (technology)  relationships between pieces of data

EmployeeID  Example: Think of records (paper, electronic) describing employee LastName FirstName Phone  The organization of data is explained by Metadata.  Data = symbols for recording and communication

Number Text Text Number

 Employee is a Class of data (what data refer to), which has 1…N Attributes, and each attribute belongs to a specific data type (text, numeric…).

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

5

Basic Concept – Database Management System  A Database Management System (DBMS) is a software for creating databases, storing & retrieving data, creating user interface (forms), creating reports, and administering a DB system (security, access, etc.).  DBMS Product (a specific DBMS software; same as “DBMS Brand”).  Database System (DBMS Application, Application)*:  An implementation of a DBMS product including one or more databases, logic (business rules), some code, and user interface.  Broader meaning: Supporting various information needs.  An information system that results from “applying” a DBMS or a database.  Narrower meaning: Supporting specific information needs.

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

6

Basic Concept – Attributes  The set of allowed values for each attribute is called the Domain of the attribute  Attribute values are (normally) required to be atomic; that is, indivisible  The special value null is a member of every domain. Indicated that the value is “unknown”  The null value is problematic in the definition of many database operations. HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

attributes (or columns)

tuples (or rows)

7

Basic Concept – Attribute Relations  A1, A2, …, An are attributes  R = (A1, A2, …, An ) is a relation schema

Example:

instructor = (ID, name, dept_name, salary)

 Formally, given sets D1, D2, …. Dn, a relation r is a subset of

D1 x D2 x … x Dn Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai  Di  The current values (relation instance) of a relation are specified by a table  An element t of r is a tuple, represented by a row in a table

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

8

Basic Concept – Tuples

 Order of tuples is irrelevant.  Tuples may be stored in an arbitrary order. Example: Instructor relation with unordered tuples.

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

9

Basic Concept – Keys  Let K  R  K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation r(R)  Example: {ID} and {ID,name} are both superkeys of instructor.

 Superkey K is a candidate key if K is minimal  Example: {ID} is a candidate key for Instructor

 One of the candidate keys is selected to be the primary key.  Which one?

 Foreign key constraint: Value in one relation must appear in another  Referencing relation  Referenced relation  Example – dept_name in instructor is a foreign key from instructor referencing department

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

10

Basic Concept – Database Schema

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

11

Basic Concept – SQL  SQL Structured Query Language is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS)  SQL is based upon relational algebra and tuple relational calculus. https://en.wikipedia.org/wiki/Relational_algebra  SQL defines 3 language aspects:  data definition language (DDL) …to define database schemas  data manipulation language … selecting, inserting, deleting and updating data  data control language …to control access rights in databases

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

12

Basic Concept – SQL Statement

Online RDBMS: http://sqlfiddle.com/ Online SQL editor: http://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

13

Basic Concept – SQL can get complex

Both queries return the same result. Do they? HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

14

Basic Concept – SQL Joins

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

15

Basic Concept – Learning SQL http://www.w3schools.com/sql  SQL is an essential skill in data science and business analytics.  SQL is not only used in relational database management systems but also for many other data management technologies like Hadoop and No-SQL databases.

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

16

Database System Classification

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

17

Database System Classification – Overview  Relational Store – RDBMS, as discussed before  Column Store each storage block contains data from only one column  Document Store stores documents made up of tagged elements  Key-Value Store hash table of keys  XML Databases  Graph Databases  Codasyl Databases  Object Oriented Databases and other…

NoSQL

Databases

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

18

Database System Classification SQL Databases  Predefined Schema  Standard definition and interface language  Tight consistency  Well defined semantics NoSQL Database  No predefined Schema  Per-product definition and interface language  Getting an answer quickly is more important than getting a correct answer HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

19

Graph Database A Graph Database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data.

An Ontology is a formal naming and definition of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse.

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

20

Graph Database – Sample

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

21

Graph Database Sample Largest, most actively used graph today.

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

22

OLAP Database Online analytical processing, or OLAP is a computer-based technique of analyzing data to look for insights by answering multidimensional analytical (MDA) queries easily.

An OLAP cube is a term that typically refers to multi-dimensional array of data. The term cube here refers to a multi-dimensional dataset, which is also sometimes called a hypercube if the number of dimensions is greater than 3. HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

23

OLAP Database

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

24

Hadoop Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. Hadoop was designed by the web company Yahoo out of necessity for efficiently storing and analyzing the inherently enormous data set of web scale applications. HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

25

Hadoop Map Reduce MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform.  The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).  The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

26

Hadoop Map Reduce – WordCount Sample

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

27

Hadoop Hardware Cluster

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

28

Spark

Apache Spark™ is a fast and general engine for large-scale data processing.

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

29

Big Data Framework A pre-customized and pre-compiled collection of tools and technologies required for big data processing based on Hadoop.

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

30

Analytical Data Processing

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

31

Analytical Data Processing Platform  An Analytical Data Processing Platform is a software that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics.  It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development and supports all steps of the data mining process including data preparation, results visualization, validation and optimization. HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

32

Analytics Data Processing – Sample: Knime

www.knime.org

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

33

Analytics Data Processing – Sample

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

34

Analytics Data Processing – Sample

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

35

Cloud Platforms

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

36

Cloud Platforms – Sample

And many other (smaller) vendors… HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

37

Data Management Platform – Microsoft From Data to Decision and Action

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

38

Cloud Platform Infrastructure Data centers around the globe

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

39

Cloud Data Center Microsoft Azure Data Center in Dublin

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

40

Cloud Data Center Inside a Data Center

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

41

Cloud Data Management Platform – Microsoft

Applications

Orchestration

Relational

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

Reports

Dashboards

Information management

Complex event processing

Non-relational

NoSQL

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

Natural language query

Mobile

Modeling

Machine learning

Streaming

Internal & external 42

Cloud Data Management Platform – Microsoft Big Data for IoT (Internet of Things)

Business apps

People

Custom apps

Sensors and devices

DATA

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

Automated Systems

INTELLIGENCE SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

ACTION 43

Cloud Data Management Platform – Microsoft Azure Data Factory – Data Orchestration  Connect to data sources  Orchestrate data movement & data processing  Publish to users as a searchable data view  Operationalize workflows (schedule, manage, debug)  Lifecycle management, monitoring

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

MapReduce Hive Pig C# Stored Procedures Azure Machine Learning

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

44

Cloud Data Management Platform – Microsoft Azure Stream Analytics – Data Orchestration

 Consumes millions of real-time events from Event Hub collected from devices, sensors, infrastructure, and applications

 Performs time-sensitive analysis using SQL-like language against multiple real-time streams and reference data  Outputs to persistent stores, dashboards or back to devices

POS Terminals

Smart Phones

Kiosks

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

Servers

ATM

Security Kinect

PCs/ Laptops Self Checkout Stations

Slates/ Tablets

Automation Devices

Point of Service Devices

Digital Signs

Vending Machines Thin Clients Handhelds

Logic Controllers

Remote Medical Monitors

Specialized Devices

Diagnostic Equipment

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

45

Cloud Data Management Platform – Microsoft Azure Machine Learning Get started with just a browser Requires no provisioning; simply log on to your Azure subscription or try it for free off azure.com/ml Experience the power of choice Choose from hundreds of algorithms and packages from R and Python or drop in your own custom code Take advantage of business-tested algorithms from Xbox and Bing Deploy solutions in minutes With the click of a button, deploy the finished model as a web service that can connect to any data, anywhere

Microsoft Azure Machine Learning Studio Modeling environment (shown) Microsoft Azure Machine Learning API service Model in production as a web service

Microsoft Azure Machine Learning Marketplace APIs and solutions for broad use

Connect to the world Brand and monetize solutions on our global Machine Learning Marketplace https://datamarket.azure.com/

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

46

Data Sources

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

47

Data Sources – Public Data Sets  Das Datenportal für Deutschland https://www.govdata.de  European Union Open Data Portal https://open-data.europa.eu  Google Public Data http://www.google.com/publicdata/directory?hl=en_US  Where can I find (very) large public datasets? https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

48

Data Sources – Public Data Sets  Kaggle.com https://www.kaggle.com/datasets  Google – Search for Data Sets https://cse.google.com/cse/publicurl?cx=002720237717066476899:v2wv26idk7m  Data Sets for Classroom Use http://www2.stetson.edu/~jrasp/data.htm  List of free data sets https://r-dir.com/reference/datasets.html  List of Reference Data Sets for Statistics https://r-dir.com/reference/datasets.html

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

49

Lecture Summary & Homework

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

50

Lessons Learned  The basic concept of relational databases is known.  New data management technologies are emerging.  Hadoop as a platform for big data management.  Analytical Data Processing requires adequate tools and processes.  Cloud as platform for data storage and computing.

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

51

Resources Database Design (eBook) Chapter 1 - 7 https://opentextbc.ca/dbdesign/ Database Management Systems (eBook) http://sirpabs.ilahas.com/ebooks/Computer%20&%20Technology/Databas e%20Mgmt/Database.Management.Systems.2nd.Edition.pdf

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

52

Get Prepared (Homework) Train yourself in R.  Take the course: http://tryr.codeschool.com/ Train yourself in SQL.  Take the course: http://www.w3schools.com/sql  Open a second window with the online SQL database: http://www.w3schools.com/sql/trysql.asp?filename=trysql_select_al  Take the SQL Quiz and try to get at least 20 points. http://www.w3schools.com/sql/sql_quiz.asp

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

53

Any Questions?

HSD

Faculty of Business Studies Thomas Zeutschler Associate Lecturer

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

54

Suggest Documents