Probability & Statistical Inference Lecture 1 MSc in Computing (Data Analytics)
Lecture Outline
Introduction
General Info Questionnaire
Introduction to Statistics
Statistics at Work
The Analytics Process
Descriptive Statistics & Distributions
Graphs and Visualisation
Introduction
Name: Aoife D’Arcy
Email: [email protected]
Bio: Managing Director and Chief Consultant at the Analytics Store; has degrees in statistics, computer science, and financial & industrial mathematics. With over 10 years of experience in analytics consultancy with major national and international companies in banking, finance, insurance, manufacturing and gaming, Aoife has developed particular expertise in risk analytics, fraud analytics, and customer insight analytics.
Lecture Notes: Will be available online at www.comp.dit.ie/bmacnamee and later on Webcourses
Programme Overview (module map; modules are marked as core, option, or pre-requisite)
TMP-5 Research Writing & Scientific Literature
TMP-6 Research Methods & Proposal Writing
SPEC 9160 Problem Solving Communication & Innovation
TMP-7 Research Project & Dissertation
TMP-1 Data Mining
TMP-4 Case Studies in Computing
MATH 4814 Decision Theory & Games
TMP-3 Data Management
TMP-2 Data & Database Design for Data Analytics
SPEC9260 Geographic Information Systems
BUS9290 Legal Issues for Knowledge Management
TMP-10 Designing and Building Semantic Web Applications
SPEC9290 Universal Design for Knowledge Management
SPEC 9270 Machine Learning
TMP-0 Probability & Statistical Inference
MATH 4821 Industrial & Commercial Statistics
SENG X01 Software Project Management
MATH 4807 Financial Mathematics - I
INTC9221 Strategic Issues in IT
MATH 4809 Linear Programming
INTC9231 Internet Systems
TECH9290 Ubiquitous Computing
INTC 9141 Enterprise Systems Integration
TECH9280 Security
MATH 4810 Queuing Theory & Markov Processes
TECH9250 Complex and Adaptive Agent Based Computation
TMP-9 Language Technology
MATH 4818 Financial Mathematics - II
Course Outline

Week     Topic
1        Introduction to Statistics
2 & 3    Probability Theory
4        Introduction to SAS Enterprise Guide
5        Probability Distributions
6        Confidence Intervals
7 & 8    Hypothesis Testing
9        Assignment
10 - 12  Regression Analysis
13       Revision
Exam & Assignment

Exam
The end of term exam accounts for 60% of the overall mark.

Assignment
The assignment is worth 40% of the overall mark. It will be handed out in week 5, and week 9’s class will be dedicated to working on it.
Software
SAS Enterprise Guide will be used throughout the course.
Recommended Reading
Douglas C. Montgomery, Applied Statistics and Probability for Engineers, John Wiley & Sons
David Collett, Modelling Binary Data, Chapman & Hall
R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye, Probability and Statistics for Engineers and Scientists, Pearson Education
G. Grimmett & D. Stirzaker, Probability and Random Processes, Oxford University Press
George Casella, Statistical Inference, Brooks/Cole
Questionnaire
Section 1: Statistics at work
Statistics in Everyday Life
With the increase in the amount of data available and advancements in the power of computers, statistics are being used more and more frequently. We constantly read about surveys where 3 out of 5 people prefer brand X, or research showing that having tomatoes in your diet can reduce the risk of disease Y.
Is it good that statistics are used so much and what happens when statistics are misused?
Statistics can be misleading
An ad claimed: “9 Out of 10 Dentists prefer Colgate” What is wrong with this statement?
During the Obama presidential election the following was stated: “According to the Advertising Project, one out of three McCain ads has been negative, criticizing Obama. Nine out of 10 Obama ads have been positive, stressing his own background and ideas.” What is wrong with this statement?
Misinterpreted Statistics can be Devastating
In 1999 Sally Clark was wrongly convicted of the murder of two of her sons. The case was widely criticised because of the way statistical evidence was misrepresented in the original trial, particularly by paediatrician Professor Sir Roy Meadow. He claimed that, for an affluent non-smoking family like the Clarks, the probability of a single cot death was 1 in 8,543, so the probability of two cot deaths in the same family was around "1 in 73 million" (8543 × 8543). What is wrong with this assumption?
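The arithmetic behind the quoted figure is easy to reproduce (a minimal Python sketch for illustration; the flaw is not the multiplication itself but the independence assumption it relies on):

```python
# The calculation presented at trial: squaring the single-death probability.
# This step is ONLY valid if the two deaths are independent events.
p_single = 1 / 8543          # claimed P(one cot death) for such a family
p_double = p_single ** 2     # assumes independence

print(8543 ** 2)             # 72982849 -- the quoted "1 in 73 million"
```

Because genetic and environmental factors are shared within a family, a second cot death is not independent of the first, so the true probability of two deaths is far higher than 1 in 73 million.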
Video
Challenges
As an Analytics practitioner you will face a number of challenges:
• Create insight from data
• Interpret statistics correctly
• Communicate statistically driven insight in a way that is clearly understood
The Analytics Process & Statistics
Section Overview
Statistics and Analytics Introduction to CRISP
Predictive Analytics is Multidisciplinary
Predictive analytics draws on many fields: statistics, pattern recognition, data warehousing, neurocomputing, machine learning, databases, KDD and AI.
CRISP-DM Evolution
Over 200 members of the CRISP-DM SIG worldwide
DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc.
System Suppliers/Consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc.
End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc.
Crisp-DM 2.0 is due…
Complete information on CRISP-DM is available at: http://www.crisp-dm.org/
CRISP-DM
Features of CRISP-DM:
• Non-proprietary
• Application/industry neutral
• Tool neutral
• Focus on business issues as well as technical analysis
• Framework for guidance
• Experience base
• Templates for analysis
The CRISP-DM reference model cycles through six phases around the data: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment.
Phases & Generic Tasks – Business Understanding
Determine Business Objectives
Assess Situation
Determine Data Mining Goals
Produce Project Plan
Business Understanding This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives
Phases & Generic Tasks – Data Understanding
Collect Initial Data
Describe Data
Explore Data
Verify Data Quality
Data Understanding The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.
Phases & Generic Tasks – Data Preparation
Select Data
Clean Data
Construct Data
Integrate Data
Format Data
Data Preparation The data preparation phase covers all activities to construct the data that will be fed into the modelling tools from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools.
Phases & Generic Tasks – Modelling
Select Modelling Technique
Generate Test Design
Build Model
Assess Model
Modelling In this phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.
Phases & Generic Tasks – Evaluation
Evaluate Results
Review Process
Determine Next Steps
Evaluation Before proceeding to final deployment of a model, it is important to thoroughly evaluate it and review the steps executed to construct it to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Phases & Generic Tasks – Deployment
Plan Deployment
Plan Monitoring & Maintenance
Produce Final Report
Review Project
Deployment Creation of a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
CRISP-DM
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
CRISP-DM – Areas covered in this course
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
Section 2: Descriptive Statistics & Distributions
Topics
1. Introduction to Statistics
2. The Basics
3. Measures of location: mean, median & mode
4. Measures of location & skew
5. Measures of dispersion: range, standard deviation (variance) & interquartile range
Introduction to Statistics
According to The Random House College Dictionary, statistics is “the science that deals with the collection, classification, analysis and interpretation of numerical facts or data.” In short, statistics is the science of data. There are two main branches of Statistics:
The branch of statistics devoted to the organisation, summarization and the description of data sets is called Descriptive Statistics. The branch of statistics concerned with using sample data to make an inference about a large set of data is called Inferential Statistics.
Process of Data Analysis
Population
A statistical population is a data set that is our target of interest. A sample is a subset of data selected from the target population: we describe the sample with a sample statistic and use it to make an inference about the population. If your sample is not representative then it is referred to as being biased.
Types of Data
There are a number of data types that we will be considering. These can be split into a hierarchy of 4 levels of measurement.
1. Categorical
   a) Nominal
   b) Ordinal
2. Interval
   a) Discrete
   b) Continuous
Describing Distributions
Measures of Location (Central Tendency) Numbers that attempt to express the location of data on the number line Variable(s) are said to be distributed over the number line - so we talk of distributions of numbers
Arithmetic Mean (average)
The mean of a data set is one of the most commonly used statistics. It is a measure of the central tendency of the data set. The mean of a sample is denoted by x̄ (pronounced ‘x bar’) and the mean of a population is denoted by µ (pronounced ‘mew’). Both x̄ and µ are computed using the same formula: the sum of the values divided by the number of values, x̄ = Σxᵢ / n.
Arithmetic Mean - Example
Example: Ages of students in a 1st year History of Art degree course: 18, 18, 18, 18, 19, 19, 20, 20, 58. The mean of the ages here is 23.11 – but this is not a ‘typical’ value, or a value around which the observed values cluster.
The same thing tends to happen with values that are strictly positive: average salaries, house prices etc.
We say that the mean is sensitive to extreme values
Median
The middle value of the ordered set of values, i.e. 50% higher and 50% lower.
Example: The class age data again 18, 18, 18, 18, 19, 19, 20, 20, 58
The data is ordered, and n = 9, so the middle number is (n+1)/2 = (9+1)/2 = 5th value = 19
=> median = 19 years
Median
• Robust with regard to extreme values
• Often a real value in the distribution, or close to 2 real values – in that sense it tends to be more typical of actually observed values
Mode
The most commonly occurring value in a distribution
Example: The class age data again 18, 18, 18, 18, 19, 19, 20, 20, 58 The mode is 18 years as it occurs more than any other
Tends to show where the data is concentrated
Mode: 18 Mean: 23.11 Median: 19
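The three measures of location for the class age data can be checked with Python’s statistics module (a sketch for illustration; the course software is SAS Enterprise Guide):

```python
from statistics import mean, median, mode

ages = [18, 18, 18, 18, 19, 19, 20, 20, 58]

print(round(mean(ages), 2))  # 23.11 -- dragged up by the single extreme value
print(median(ages))          # 19 -- robust to the outlier
print(mode(ages))            # 18 -- the most commonly occurring value
```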
Skew – The Shape of a Distribution
There are a number of ways of describing the shape of a distribution. We will consider only one – skew. Skew is a measure of how asymmetric a distribution is. Symmetric distributions have a skew of zero.
Positive Skew
There are a few very large data points which create a ‘tail’ going to the right (i.e. up the number line). Note: there is no axis of symmetry here – skew > 0 (i.e. it is positive). Examples: lifetimes of people, house prices.
Negative Skew
Note: No axis of symmetry here - skew < 0 (i.e. it is negative) Examples: Examination Scores, reaction times for drivers
Skew & Measures of Location – Symmetry
In a symmetric distribution the mean, median & mode are the same and are found in the middle.

Positive Skew
Example data: 5 5 5 6 6 6 6 6 7 7 7 8 8 9 9 10 11
Mode = 6, Median = 7, Mean = 121/17 ≈ 7.12
In general: Mode < Median < Mean
Negative Skew
Example data: 1 2 3 3 4 4 5 5 5 6 6 6 6 6 7 7 7
Mode = 6, Median = 5, Mean = 83/17 ≈ 4.88
In general: Mode > Median > Mean
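Skew can be computed directly from its definition. The sketch below uses the population form (third central moment divided by the cubed standard deviation); other conventions apply small-sample corrections:

```python
def skewness(xs):
    # population skew: E[(x - mean)^3] / sd^3
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n   # population variance
    m3 = sum((x - m) ** 3 for x in xs) / n   # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))                           # 0.0 -- symmetric
print(skewness([18, 18, 18, 18, 19, 19, 20, 20, 58]) > 0)  # True -- right tail
```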
Measures of Spread (Dispersion)
• The mean, mode and median are all 250 for both companies
• But the distributions are not the same – look at the difference in ‘spread’ of bills
• We need a measure of spread (dispersion) as well as location to describe a distribution
Range
Simplest measure of spread = largest - smallest
Example for data in histograms:
Esat: largest = €335, smallest = €180, range = €335 − €180 = €155
Meteor: largest = €295, smallest = €210, range = €295 − €210 = €85
Pros: very simple to compute; easy to interpret.
Cons: does not use all the data; subject to the effect of extreme values.
Range
Example: The class age data again 18, 18, 18, 18, 19, 19, 20, 20, 58 Range: 58-18 = 40 years Is this really indicative of the spread of ages?
=> if the mature student was not there, range would be 2 years - so just 1 extreme value has huge effect on range
Typical Deviation – Average Deviation

Consider the following data:

OBS    Data   Mean   Deviation
1      3      5      -2
2      4      5      -1
3      8      5       3
Sum    15     15      0
Mean   5      5       0

The deviations around the mean always sum to zero, so their average cannot measure spread.
Typical Deviation – Average Squared Deviation (Variance)

Consider the following data:

OBS    Data   Deviation   (Deviation)²
1      3      -2          4
2      4      -1          1
3      8       3          9
Sum    15      0          14
Mean   5       0          14/3
Variance – the formula
1. Square the deviations around the mean before summing. NB: the quantities will be in squared units, e.g. cm² – not the original scale.
2. Divide by n − 1 to get the ‘average’ of the squared deviations:
   s² = Σ(xᵢ − x̄)² / (n − 1)
Standard Deviation
Take the square root of the variance: s = √s². The value is in the original units.
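The slides above can be sketched in a few lines; note that the table’s naive average of the squared deviations divides by n (giving 14/3), while the sample variance divides by n − 1:

```python
import math

data = [3, 4, 8]
m = sum(data) / len(data)                 # 5.0
deviations = [x - m for x in data]        # [-2.0, -1.0, 3.0]
print(sum(deviations))                    # 0.0 -- why we must square first

variance = sum(d ** 2 for d in deviations) / (len(data) - 1)  # 14/2 = 7.0
std_dev = math.sqrt(variance)             # back in the original units
print(variance, round(std_dev, 2))
```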
Quantiles
The quantile for a proportion p is a value that has a proportion p of the sample taking values smaller than it and 1 − p taking values larger than it. For example: if your grade in an industrial engineering class was located at the 84th percentile, then 84% of the grades were lower than your grade and 16% were higher. The median is the 50th percentile. The 25th percentile and the 75th percentile are called the lower (1st) quartile and upper (3rd) quartile respectively. The difference between the lower and upper quartiles is called the inter-quartile range.
Example
Class age data (ordered), n = 9:
18, 18, 18, 18, 19, 19, 20, 20, 58
1st quartile = 1 + (n−1)/4 = 1 + 8/4 = 3rd score => 18
Median = 1 + 2(n−1)/4 = 1 + 16/4 = 5th score => 19
3rd quartile = 1 + 3(n−1)/4 = 1 + 24/4 = 7th score => 20
Interquartile range: 20 − 18 = 2 years
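The 1 + k(n − 1)/4 positional rule can be sketched as below. For the class age data every quartile lands on a whole position; the linear interpolation for fractional positions is an added assumption, since the slide does not specify one:

```python
ages = sorted([18, 18, 18, 18, 19, 19, 20, 20, 58])

def quartile(xs, k):
    # kth quartile at 1-based position 1 + k(n-1)/4 in the ordered data
    pos = 1 + k * (len(xs) - 1) / 4
    lo = int(pos)
    frac = pos - lo                       # interpolate if pos is fractional
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1]) if frac else xs[lo - 1]

q1, q3 = quartile(ages, 1), quartile(ages, 3)
print(q1, quartile(ages, 2), q3, q3 - q1)   # 18 19 20 2 (IQR = 2 years)
```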
Coefficient of Variation
A problem with s is that it is scale specific – i.e. comparing values of s calculated on different scales is hard to do.
Example:
Distribution A: 8, 9, 10, 11, 12, 13, 14 (mean = 11)
Distribution B: 1008, 1009, 1010, 1011, 1012, 1013, 1014 (mean = 1011)
Both distributions have the same deviations from their means (−3, −2, −1, 0, 1, 2, 3), so our two measures of spread agree:
Range for A: 14 − 8 = 6; Range for B: 1014 − 1008 = 6
s for A = 2.16; s for B = 2.16
Coefficient of Variation
C.V. = (s / x̄) × 100%
• C.V. is unit-less (i.e. scale-less)
• Can compare different measurement systems and standardise for differences in scale
E.g. for the data above:
A: C.V. = (2.16 / 11) × 100% => 19.6%
B: C.V. = (2.16 / 1011) × 100% => 0.2%
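A minimal sketch of the comparison: both distributions share the same standard deviation, but dividing by the mean exposes how different the relative spreads are:

```python
from statistics import mean, stdev

a = [8, 9, 10, 11, 12, 13, 14]
b = [1008, 1009, 1010, 1011, 1012, 1013, 1014]

def cv(xs):
    # coefficient of variation: sample s over the mean, as a percentage
    return stdev(xs) / mean(xs) * 100

print(round(stdev(a), 2), round(stdev(b), 2))  # 2.16 2.16 -- identical spread
print(round(cv(a), 1), round(cv(b), 1))        # 19.6 0.2 -- very different C.V.
```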
Section 3: Graphs and Visualisation
Graphical Displays
Bar Charts
Used to display categorical data, or discrete data with a modest number of values. A bar is drawn to represent each category, and the bar height represents the frequency or % in each category. This allows for visual comparison of relative frequencies. You need to draw up a frequency distribution table first.
Table 1. Counts in each exercise category

Exercise   Frequency
V. High    32
High       30
Medium     52
Low        32
None       36
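The frequency distribution table is the first step before drawing the bars. A sketch with collections.Counter, using hypothetical raw responses reconstructed to match the slide’s totals:

```python
from collections import Counter

# hypothetical raw survey responses (totals match Table 1)
responses = (["V. High"] * 32 + ["High"] * 30 + ["Medium"] * 52
             + ["Low"] * 32 + ["None"] * 36)

freq = Counter(responses)
for category in ["V. High", "High", "Medium", "Low", "None"]:
    print(f"{category:8s}{freq[category]:4d}")

# 'Low' as a percentage of 'Medium'
print(round(freq["Low"] / freq["Medium"] * 100, 1))  # 61.5
```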
So: 5 categories => 5 bars, with the heights of the bars being the frequencies. The hierarchy in frequency is clear to see, and one can make a guess at relative percentages between categories – e.g. ‘Low’ looks about 60% of ‘Medium’; actual = (32/52) × 100% ≈ 61.5%. Note the appropriate title and axis labels.
Do NOT use 3D effects – the angling loses information. Colouring effects can also distract.
Can use more than one set of bars to subdivide groups e.g. same data – subdivided by gender
Table 2. Exercise Level by Gender

Exercise   Female   Male
V. High    13       19
High       12       18
Medium     22       30
Low        16       16
None       8        28
Component bar charts
Histogram
Histograms are among the most widely used methods for displaying continuous data. They have similarities with bar charts – but are definitely not the same! A rectangle is drawn to represent the frequency in each class of a grouped frequency distribution table. Components: 2 axes, x and y.
Histogram
Table 3. Heights

Height (cm)   Frequency
>150-155      3
>155-160      10
>160-165      29
>165-170      37
>170-175      44
>175-180      34
>180-185      19
>185-190      6
Total         182
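Grouping continuous data into the table’s half-open intervals (>150-155 means greater than 150, up to and including 155) can be sketched as below; the heights list is hypothetical illustration data, not the 182 observations behind Table 3:

```python
from collections import Counter

def bin_label(height, width=5, start=150):
    # place height in the half-open interval (lo, lo + width]
    lo = start + ((height - start - 1e-9) // width) * width
    return f">{int(lo)}-{int(lo + width)}"

heights = [152.0, 158.5, 161.2, 163.7, 168.4, 172.9, 175.0, 181.3]
freq = Counter(bin_label(h) for h in heights)
print(freq[">170-175"])  # 2 -- includes the boundary value 175.0
```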
Be careful with choice of intervals as shape can change.
Scatterplots
Time Series Plots
Used for plotting data over time. The x-axis is a time line; the y-axis is the value changing over time.

Profit (£0,000s)

Quarter   1992   1993   1994
1         114    116    128
2         142    150    158
3         155    153    169
4         136    140    159
Shows ‘Trend’ & ‘Seasonality’
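Trend and seasonality can both be read off with simple averages of the profit table: yearly means rise steadily (trend), while quarterly means across years show the same peak and trough each year (seasonality). A minimal sketch:

```python
profit = {                       # quarter -> profit (£0,000s) for 1992-1994
    1: [114, 116, 128],
    2: [142, 150, 158],
    3: [155, 153, 169],
    4: [136, 140, 159],
}

yearly_means = [sum(profit[q][y] for q in profit) / 4 for y in range(3)]
quarterly_means = {q: sum(v) / 3 for q, v in profit.items()}

print(yearly_means)      # rising year on year -> trend
print(quarterly_means)   # Q3 highest, Q1 lowest every time -> seasonality
```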