Probability & Statistical Inference Lecture 1. MSc in Computing (Data Analytics)

Author: William Barnett
Lecture Outline
- Introduction
  - General Info
  - Questionnaire
- Introduction to Statistics
  - Statistics at work
  - The Analytics Process
  - Descriptive Statistics & Distributions
  - Graphs and Visualisation

Introduction
- Name: Aoife D'Arcy
- Email: [email protected]
- Bio: Managing Director and Chief Consultant at the Analytics Store; holds degrees in statistics, computer science, and financial & industrial mathematics. With over 10 years of experience in analytics consultancy with major national and international companies in banking, finance, insurance, manufacturing and gaming, Aoife has developed particular expertise in risk analytics, fraud analytics, and customer insight analytics.
- Lecture Notes: will be available online on www.comp.dit.ie/bmacnamee and later on Webcourses

Programme Overview

Modules in the programme (the original slide is a programme map whose legend marks each module as a Core Module, an Option Module, or a Pre-requisite):
- TMP-0 Probability & Statistical Inference
- TMP-1 Data Mining
- TMP-2 Data & Database Design for Data Analytics
- TMP-3 Data Management
- TMP-4 Case Studies in Computing
- TMP-5 Research Writing & Scientific Literature
- TMP-6 Research Methods & Proposal Writing
- TMP-7 Research Project & Dissertation
- TMP-9 Language Technology
- TMP-10 Designing and Building Semantic Web Applications
- SPEC 9160 Problem Solving, Communication & Innovation
- SPEC9260 Geographic Information Systems
- SPEC 9270 Machine Learning
- SPEC9290 Universal Design for Knowledge Management
- BUS9290 Legal Issues for Knowledge Management
- MATH 4807 Financial Mathematics - I
- MATH 4809 Linear Programming
- MATH 4810 Queuing Theory & Markov Processes
- MATH 4814 Decision Theory & Games
- MATH 4818 Financial Mathematics - II
- MATH 4821 Industrial & Commercial Statistics
- SENG X01 Software Project Management
- INTC9221 Strategic Issues in IT
- INTC9231 Internet Systems
- INTC 9141 Enterprise Systems Integration
- TECH9250 Complex and Adaptive Agent Based Computation
- TECH9280 Security
- TECH9290 Ubiquitous Computing

Course Outline

Week     Topic
1        Introduction to Statistics
2 & 3    Probability Theory
4        Introduction to SAS Enterprise Guide
5        Probability Distributions
6        Confidence Intervals
7 & 8    Hypothesis Testing
9        Assignment
10 - 12  Regression Analysis
13       Revision

Exam & Assignment

Exam
- The end-of-term exam accounts for 60% of the overall mark.

Assignment
- The assignment is worth 40% of the overall mark.
- The assignment will be handed out in week 5.
- Week 9's class will be dedicated to working on the assignment.

Software  

SAS Enterprise Guide will be the software used during the course.

Recommended Reading
- Douglas C. Montgomery, Applied Statistics and Probability for Engineers, John Wiley & Sons
- David Collett, Modelling Binary Data, Chapman & Hall
- R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye, Probability and Statistics for Engineers and Scientists, Pearson Education
- G. Grimmett & D. Stirzaker, Probability and Random Processes, Oxford University Press
- George Casella, Statistical Inference, Brooks/Cole

Questionnaire

Section 1: Statistics at work

Statistics in Everyday Life

With the increase in the amount of data available and advancements in the power of computers, statistics are being used more and more frequently. We are constantly reading about surveys where 3 out of 5 people prefer brand X, or research showing that having tomatoes in your diet can reduce the risk of disease Y.

Is it good that statistics are used so much, and what happens when statistics are misused?

Statistics can be misleading

An ad claimed: "9 out of 10 dentists prefer Colgate." What is wrong with this statement?

During the Obama presidential election the following was stated: "According to the Advertising Project, one out of three McCain ads has been negative, criticizing Obama. Nine out of 10 Obama ads have been positive, stressing his own background and ideas." What is wrong with this statement?

Misinterpreted Statistics can be Devastating

In 1999 Sally Clark was wrongly convicted of the murder of two of her sons. The case was widely criticised because of the way statistical evidence was misrepresented in the original trial, particularly by paediatrician Professor Sir Roy Meadow. He claimed that, for an affluent non-smoking family like the Clarks, the probability of a single cot death was 1 in 8,543, so the probability of two cot deaths in the same family was around "1 in 73 million" (8,543 × 8,543). What is wrong with this assumption?
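To see the flaw concretely, here is a short Python sketch (not part of the original lecture) of the calculation behind the "1 in 73 million" figure. The arithmetic itself is reproduced faithfully; the problem is the hidden assumption that the two deaths are independent events:

```python
# Meadow's calculation: square the single-event probability.
# Multiplying probabilities like this is only valid for INDEPENDENT events.
p_single = 1 / 8543              # claimed probability of one cot death
p_double_naive = p_single ** 2   # assumes the two deaths are independent
print(f"naive figure: 1 in {round(1 / p_double_naive):,}")  # 1 in 72,982,849
```

Because siblings share genetic and environmental risk factors, cot deaths within one family are not independent: the probability of a second death, given a first, is far higher than 1 in 8,543, so the 73 million figure dramatically overstates the evidence.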

Video

Challenges

As an analytics practitioner you will face a number of challenges:
- Creating insight from data
- Interpreting statistics correctly
- Communicating statistically driven insight in a way that is clearly understood

The Analytics Process & Statistics

Section Overview
- Statistics and Analytics
- Introduction to CRISP-DM

Predictive Analytics Is Multidisciplinary

Predictive analytics draws on many overlapping fields: Statistics, Pattern Recognition, Data Warehousing, Neurocomputing, Machine Learning, Databases, KDD and AI.

CRISP-DM Evolution

Over 200 members of the CRISP-DM SIG worldwide:
- DM vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc.
- System suppliers/consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc.
- End users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc.

CRISP-DM 2.0 is due…

Complete information on CRISP-DM is available at: http://www.crisp-dm.org/

CRISP-DM

Features of CRISP-DM:
- Non-proprietary
- Application/industry neutral
- Tool neutral
- Focus on business issues as well as technical analysis
- Framework for guidance
- Experience base: templates for analysis

The CRISP-DM cycle: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment, with Data at the centre of the process.

Phases & Generic Tasks: Business Understanding

Generic tasks in this phase: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan.

Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Phases & Generic Tasks: Data Understanding

Generic tasks in this phase: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality.

Data Understanding The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.

Phases & Generic Tasks: Data Preparation

Generic tasks in this phase: Select Data; Clean Data; Construct Data; Integrate Data; Format Data.

Data Preparation The data preparation phase covers all activities to construct the data that will be fed into the modelling tools from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools.

Phases & Generic Tasks: Modelling

Generic tasks in this phase: Select Modelling Technique; Generate Test Design; Build Model; Assess Model.

Modelling In this phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.

Phases & Generic Tasks: Evaluation

Generic tasks in this phase: Evaluate Results; Review Process; Determine Next Steps.

Evaluation Before proceeding to final deployment of a model, it is important to thoroughly evaluate it and review the steps executed to construct it to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Phases & Generic Tasks: Deployment

Generic tasks in this phase: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project.

Deployment Creation of a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.

CRISP-DM phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modelling
- Evaluation
- Deployment

CRISP-DM – Areas covered in this course:
- Business Understanding
- Data Understanding
- Data Preparation
- Modelling
- Evaluation
- Deployment

Section 2: Descriptive Statistics & Distributions

Topics
1. Introduction to Statistics
2. The Basics
3. Measures of location: Mean, Median & Mode
4. Measures of location & Skew
5. Measures of dispersion: range, standard deviation (variance) & interquartile range

Introduction to Statistics

According to The Random House College Dictionary, statistics is "the science that deals with the collection, classification, analysis and interpretation of numerical facts or data." In short, statistics is the science of data.

There are two main branches of statistics:
- Descriptive Statistics: the branch devoted to the organisation, summarisation and description of data sets.
- Inferential Statistics: the branch concerned with using sample data to make an inference about a larger set of data.

Process of Data Analysis

A statistical population is a data set that is our target of interest. A sample is a subset of data selected from the target population. We describe the sample using sample statistics, and then use a representative sample to make inferences about the population.

If your sample is not representative, then it is referred to as being biased.

Types of Data

There are a number of data types that we will be considering. These can be split into a hierarchy of 4 levels of measurement:
1. Categorical
   a) Nominal
   b) Ordinal
2. Interval
   a) Discrete
   b) Continuous

Describing Distributions

Measures of Location (Central Tendency)
- Numbers that attempt to express the location of data on the number line
- Variable(s) are said to be distributed over the number line, so we talk of distributions of numbers

Arithmetic Mean (Average)

The mean of a data set is one of the most commonly used statistics. It is a measure of the central tendency of the data set. The mean of a sample is denoted by x̄ (pronounced "x bar") and the mean of a population is denoted by µ (pronounced "mu"). Both x̄ and µ are computed using the same formula: sum the values and divide by the number of values.

Arithmetic Mean - Example

Example: ages of students in a 1st year History of Art degree course:
18, 18, 18, 18, 19, 19, 20, 20, 58
The mean of the ages here is 23.11 – but this is not a 'typical' value, or a value around which the observed values cluster.

The same thing tends to happen with values that are strictly positive: average salaries, house prices, etc.

We say that the mean is sensitive to extreme values.

Median

The middle value of the ordered set of values, i.e. 50% higher and 50% lower.

Example: the class age data again: 18, 18, 18, 18, 19, 19, 20, 20, 58

The data is ordered, and n = 9, so the middle number is the (n+1)/2 = (9+1)/2 = 5th value = 19

=> median = 19 years

Median
- Robust with regard to extreme values
- Often a real value in the distribution, or close to two real values – in that sense it tends to be more typical of the actually observed values

Mode

The most commonly occurring value in a distribution.

Example: the class age data again: 18, 18, 18, 18, 19, 19, 20, 20, 58
The mode is 18 years, as it occurs more than any other value.

The mode tends to show where the data is concentrated.

Summary for the class age data – Mode: 18, Median: 19, Mean: 23.11
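These three measures can be checked with Python's standard library; a minimal sketch (not from the lecture) using the class age data:

```python
import statistics

ages = [18, 18, 18, 18, 19, 19, 20, 20, 58]
print(round(statistics.mean(ages), 2))  # 23.11 — pulled up by the extreme value 58
print(statistics.median(ages))          # 19
print(statistics.mode(ages))            # 18
```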

Skew – The Shape of a Distribution

There are a number of ways of describing the shape of a distribution. We will consider only one – skew. Skew is a measure of how asymmetric a distribution is.

Symmetric distributions: skew is zero.

Positive skew: there are a few very large data points which create a 'tail' going to the right (i.e. up the number line). There is no axis of symmetry, and skew > 0 (i.e. it is positive). Examples: lifetimes of people, house prices.

Negative skew: there is no axis of symmetry, and skew < 0 (i.e. it is negative). Examples: examination scores, reaction times for drivers.

Skew & Measures of Location

Symmetry: the mean, median & mode are all the same and are found in the middle. (Example dot plot: the values 3–9 with frequencies 1, 2, 3, 4, 3, 2, 1, centred on 6.)

Positive skew (example dot plot: three 5s, five 6s, three 7s, two 8s, two 9s, a 10 and an 11): Mode = 6, Median = 7, Mean = 121/17 ≈ 7.12. In general: Mode < Median < Mean.

Negative skew (example dot plot: a 1, a 2, two 3s, two 4s, three 5s, five 6s and three 7s): Mode = 6, Median = 5, Mean = 83/17 ≈ 4.88. In general: Mode > Median > Mean.
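These orderings can be verified with Python's standard library; a sketch (not from the lecture) using the dot-plot data from the slides:

```python
import statistics

# Positively skewed data: the tail stretches to the right
pos_skew = [5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 11]
# Negatively skewed data: the tail stretches to the left
neg_skew = [6, 6, 5, 6, 7, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7]

for name, data in (("positive", pos_skew), ("negative", neg_skew)):
    print(name, "skew:",
          "mode =", statistics.mode(data),
          "median =", statistics.median(data),
          "mean =", round(statistics.mean(data), 2))
```

For the positive-skew data, mode 6 < median 7 < mean ≈ 7.12; for the negative-skew data the order reverses: mode 6 > median 5 > mean ≈ 4.88.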

Measures of Spread (Dispersion)
- The mean, mode and median are all 250 for both companies (referring to two histograms of phone bills, one per company)
- But the distributions are not the same – look at the difference in the 'spread' of the bills
- We need a measure of spread (dispersion) as well as location to describe a distribution

Range

Simplest measure of spread: range = largest − smallest.

Example for the data in the histograms:
- Esat: Largest = €335, Smallest = €180, Range = €335 − €180 = €155
- Meteor: Largest = €295, Smallest = €210, Range = €295 − €210 = €85

Pros: very simple to compute; easy to interpret.
Cons: does not use all the data; subject to the effect of extreme values.

Range

Example: the class age data again: 18, 18, 18, 18, 19, 19, 20, 20, 58
Range: 58 − 18 = 40 years. Is this really indicative of the spread of ages?

If the mature student were not there, the range would be 2 years – so just one extreme value has a huge effect on the range.

Typical Deviation – Average Deviation

Consider the following data:

OBS    Data    Mean    Deviation
1      3       5       -2
2      4       5       -1
3      8       5        3
Sum    15      15       0
Mean   5       5        0

The deviations from the mean always sum to zero, so their average is always zero and cannot measure spread.

Typical Deviation – Average Squared Deviation (Variance)

Consider the following data:

OBS    Data    Deviation    (Deviation)²
1      3       -2           4
2      4       -1           1
3      8        3           9
Sum    15       0           14
Mean   5        0           14/3

Variance – the formula
1. Square the deviations around the mean before summing. NB: the quantities will be in squared units, e.g. cm², not the original scale.
2. Divide by n − 1 to get the 'average' of the squared deviations (dividing by n − 1 rather than n makes the sample variance an unbiased estimate of the population variance).

Standard Deviation
Take the square root of the variance. The value is in the original units.
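A minimal Python sketch (not from the lecture) of the variance and standard deviation calculation, using the data 3, 4, 8 from the table above:

```python
import statistics

data = [3, 4, 8]
mean = statistics.mean(data)                # 5
deviations = [x - mean for x in data]       # [-2, -1, 3]; these always sum to 0
squared = [d ** 2 for d in deviations]      # [4, 1, 9]
variance = sum(squared) / (len(data) - 1)   # 14 / 2 = 7.0, using the n-1 divisor
print(variance)                             # 7.0
print(round(variance ** 0.5, 3))            # standard deviation ≈ 2.646
```

`statistics.variance` and `statistics.stdev` use the same n − 1 divisor, so they give the same answers.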

Quantiles

The nth quantile is a value that has a proportion n of the sample taking values smaller than it, and 1 − n taking values larger than it.

For example: if your grade in an industrial engineering class was located at the 84th percentile, then 84% of the grades were lower than your grade and 16% were higher.

The median is the 50th percentile. The 25th percentile and the 75th percentile are called the lower (1st) quartile and upper (3rd) quartile respectively. The difference between the lower and upper quartiles is called the inter-quartile range.

Example

Class age data: 18, 18, 18, 18, 19, 19, 20, 20, 58
Order no:       1   2   3   4   5   6   7   8   9

1st Quartile = 1 + (n−1)/4 = 1 + 8/4 = 3rd score => 18
Median = 1 + 2×(n−1)/4 = 1 + 2×8/4 = 5th score => 19
3rd Quartile = 1 + 3×(n−1)/4 = 1 + 3×8/4 = 7th score => 20
Interquartile range: 20 − 18 = 2 years
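The same 1 + k(n−1)/4 positions are what Python's `statistics.quantiles` interpolates with `method="inclusive"`, which gives a quick check of the hand calculation (a sketch, not from the lecture):

```python
import statistics

ages = [18, 18, 18, 18, 19, 19, 20, 20, 58]
# "inclusive" interpolates at the 1 + k(n-1)/4 positions used above
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
print(q1, q2, q3)        # 18.0 19.0 20.0
print("IQR =", q3 - q1)  # IQR = 2.0
```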

Coefficient of Variation

A problem with s is that it is scale specific – i.e. comparison of values of s calculated on different scales is hard to do.

Example:
Distribution A: 8, 9, 10, 11, 12, 13, 14 (mean = 11, deviations from mean: −3 to 3)
Distribution B: 1008, 1009, 1010, 1011, 1012, 1013, 1014 (mean = 1011, deviations from mean: −3 to 3)

Using two of the measures of spread we have:
- Range: A: 14 − 8 = 6; B: 1014 − 1008 = 6
- Standard deviation: s for A = 2.16; s for B = 2.16

Coefficient of Variation

C.V. = (s / mean) × 100%
- C.V. is unit-less (i.e. scale-less)
- Can compare different measurement systems and standardise for differences in scale

E.g. for the data above:
A: C.V. = (2.16 / 11) × 100% => 19.6%
B: C.V. = (2.16 / 1011) × 100% => 0.2%
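A sketch of the same comparison in Python (not from the lecture; values match the slide to rounding):

```python
import statistics

def cv(data):
    # Coefficient of variation: standard deviation as a percentage of the mean
    return statistics.stdev(data) / statistics.mean(data) * 100

a = [8, 9, 10, 11, 12, 13, 14]
b = [1008, 1009, 1010, 1011, 1012, 1013, 1014]
print(round(statistics.stdev(a), 2), round(statistics.stdev(b), 2))  # 2.16 2.16
print(round(cv(a), 1), round(cv(b), 2))                              # 19.6 0.21
```

Identical spread on very different scales gives identical s but very different coefficients of variation.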

Section 3: Graphs and Visualisation

Graphical Displays

Bar Charts
- Used to display categorical data, or discrete data with a modest number of values
- A bar is drawn to represent each category
- The bar height represents the frequency or % in each category
- Allows for visual comparison of relative frequencies
- Need to draw up a frequency distribution table first

Table 1. Counts in each exercise category
Exercise    Frequency
V. High     32
High        30
Medium      52
Low         32
None        36

So, 5 categories => 5 bars, and the heights of the bars are the frequencies. The hierarchy in frequency is clear to see, and one can make a guess at the relative percentages between categories. E.g. 'Low' looks about 60% of 'Medium'; actual = (32/52) × 100% = 61.53%.

Note the appropriate title and axis labels.

Do NOT use 3D effects – the angling loses information. Colouring effects can also distract.

More than one set of bars can be used to subdivide groups, e.g. the same data subdivided by gender:

Table 2. Exercise Level by Gender
Exercise    Female    Male
V. High     13        19
High        12        18
Medium      22        30
Low         16        16
None        8         28

Component bar charts

Histogram
- Histograms are among the most widely used methods for displaying continuous data
- Has similarities with a bar chart – but is definitely not the same!
- A rectangle is drawn to represent the frequency in each class of a grouped frequency distribution table
- Components: 2 axes, x and y

Histogram

Table 3. Heights
cm          Frequency
>150–155    3
>155–160    10
>160–165    29
>165–170    37
>170–175    44
>175–180    34
>180–185    19
>185–190    6
Total       182

Be careful with the choice of intervals, as the shape can change.

Scatterplots

Time Series Plots
- Used for plotting data over time
- X-axis is a time line
- Y-axis is the value changing over time

Profit – £0,000's
Quarter    1992    1993    1994
1          114     116     128
2          142     150     158
3          155     153     169
4          136     140     159

Shows 'Trend' & 'Seasonality'
