Data Analysis Exploratory Data Analysis

What is data exploration? A preliminary exploration of the data to better understand its characteristics. 

Key motivations of data exploration include – Helping to select the right tool for preprocessing or analysis – Making use of humans’ abilities to recognize patterns: people can recognize patterns not captured by data analysis tools



Related to the area of Exploratory Data Analysis (EDA) – Created by statistician John Tukey – Seminal book is Exploratory Data Analysis by Tukey – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook http://www.itl.nist.gov/div898/handbook/index.htm

Techniques Used In Data Exploration 

In EDA, as originally defined by Tukey – The focus was on visualization – Clustering and anomaly detection were viewed as exploratory techniques – In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory



In our discussion of data exploration, we focus on – Summary statistics – Visualization – Online Analytical Processing (OLAP)

Data and Exploratory Data Analysis (EDA) 

The first step in analyzing data is to summarize and plot. – Initial summary of data – includes checking for unusual or erroneous values, identifying missing items

– Preliminary analysis of data – Preliminary interpretation of data 



The techniques used are called Exploratory Data Analysis. Data exploration involves: – Visualization: The objective of data visualization is to obtain a high-level understanding of the sample and its observed (measured) characteristics. – Summary statistics: To make the data more manageable, we reduce the amount of information in meaningful ways so that we can focus on the key aspects of the data.

Visualization 

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.



Visualization of data is one of the most powerful and appealing techniques for data exploration. – Humans have a well developed ability to analyze large amounts of information that is presented visually – Can detect general patterns and trends – Can detect outliers and unusual patterns

Example: Sea Surface Temperature 

The following shows the Sea Surface Temperature (SST) for July 1982 – Tens of thousands of data points are summarized in a single figure

Data exploration

Using data exploration techniques, we can learn about the distribution of a variable. Informally, the distribution of a variable tells us

– the possible values it can take,
– the chance of observing those values,
– how often we expect to see them in a random sample from the population.

Through data exploration, we might detect previously unknown patterns and relationships that are worth further investigation. We can also identify possible data issues, such as unexpected or unusual measurements, known as outliers.

Summarizing Data 

Examine the entire data set using basic techniques before starting a formal statistical analysis.
– Familiarize yourself with the data.
– Find possible errors and anomalies.
– Examine the distribution of values for each variable.
– Create simple graphical and numerical summaries.
– Look at simple associations between variables.

Summarizing Data

Statistics is “making sense of data”. Raw data have to be processed and summarized before one can make sense of them. A summary can take the form of

– Summary index: using a single value to summarize data from a study variable – Tables – Diagrams

Summarizing Variables 

Categorical variables – Frequency tables - how many observations in each category? – Relative frequency table - percent in each category. – Bar chart and other plots.



Numeric variables – Bin the observations (create categories, e.g., 0-10, 11-20, etc.), then treat as ordered categorical. – Plots specific to numeric variables.



The goal for both categorical and numeric data is data reduction while preserving/extracting key information about the process under investigation.

Frequency Table and Mode 

Frequency Table: Categories with counts

The number of times a specific category is observed is called its frequency. The sum of the frequencies for all categories is equal to the total sample size. The relative frequency is the sample proportion for each possible category. It is obtained by dividing the frequencies by the total number of observations. For a categorical variable, the mode is the most common value, i.e., the value with the highest frequency.
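The definitions of frequency, relative frequency and mode can be sketched in a few lines of Python. The sample below is hypothetical (it does not come from the slides) and is only meant to show how the three quantities relate.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (not from the original slides).
blood_types = ["A", "O", "B", "O", "AB", "O", "A", "O", "A", "B"]

# Frequency: how many times each category is observed.
frequency = Counter(blood_types)

# Relative frequency: frequency divided by the total number of observations.
n = len(blood_types)
relative_frequency = {category: count / n for category, count in frequency.items()}

# Mode: the category with the highest frequency.
mode = frequency.most_common(1)[0][0]

for category in sorted(frequency):
    print(f"{category:>2}  count={frequency[category]}  share={relative_frequency[category]:.0%}")
print("mode:", mode)
```

The frequencies add up to the sample size and the relative frequencies add up to 1, which is a quick sanity check on any frequency table.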







Graphing a Frequency Table - Bar Chart: 

Plot the number of observations in each category:

Categorical Data Summaries 

Summary: – Frequency tables and relative frequency tables are used for describing categorical data – Bar plots are often used to display data. – Other types of plots such as pie charts are probably not useful in scientific settings

Summarizing measurement data 

Distribution patterns – Symmetrical (bell-shaped) distribution, e.g. normal distribution – Skewed distribution – Bimodal and multimodal distribution



Indices of central tendency (Location) – Mean, median



Indices of dispersion (Spread) – Variance, standard deviation, coefficient of variation

Numeric Data – Grouping Example: Ages of 10 people: 35, 40, 52, 27, 31, 42, 43, 28, 50, 35  One option is to group these ages into decades and create a categorical age variable: 

Frequency Table 

Create a frequency table for age groups:



Reminder: The percent symbol literally means “per 100” or, equivalently, divide by 100.
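As a rough sketch, the age-group frequency table can be computed as follows. The decade labels (20-29, 30-39, ...) are an assumption about how the slide grouped the ages; the ages themselves are the ten values listed above.

```python
from collections import Counter

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

def decade_label(age):
    """Return a label such as '30-39' for the decade containing `age`."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

counts = Counter(decade_label(a) for a in ages)
n = len(ages)

# Each row: age group, frequency, relative frequency in percent.
table = [(group, counts[group], 100 * counts[group] / n) for group in sorted(counts)]
for group, freq, pct in table:
    print(f"{group}  {freq}  {pct:.0f}%")
```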

Histograms 





A histogram is a bar chart constructed using the frequencies or relative frequencies of a grouped (or “binned”) continuous variable. It discards some information (the exact values), retaining only the frequencies in each “bin”. Figure: age histogram of 10 people.

Age Histogram 

Age histogram for a subset of 1395 subjects



Here, age is treated as a continuous variable and “binned” by ranges of values (of width 5 years). Because all bins have the same width, both the area and the height of each “block” are proportional to the relative frequency in that category.



Age histogram 

We can force the size of the bins to be smaller (2.5 years)



This generally makes the histogram rougher, may reveal hidden patterns, but also may be too “sparse” to identify patterns.
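The effect of bin width can be tried on the ten patient ages from the earlier example. This is only a sketch: the helper `bin_counts` is a hypothetical name, and the widths 10 and 5 are chosen for the small sample (the slides use 5 and 2.5 on a much larger data set).

```python
from collections import Counter

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

def bin_counts(values, width, start):
    """Count values in bins [start, start+width), [start+width, start+2*width), ...

    Assumes every value is >= start.
    """
    counts = Counter((v - start) // width for v in values)
    last = max(counts)
    # Include empty bins so gaps in the histogram stay visible.
    return [(start + i * width, counts.get(i, 0)) for i in range(last + 1)]

for width in (10, 5):
    print(f"bin width = {width}")
    for left, count in bin_counts(ages, width, start=20):
        print(f"  {left:>2}-{left + width - 1:<2} {'#' * count}")
```

With the narrower bins the text histogram becomes rougher and some bins are empty, which is the "sparse" behaviour described above.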

Terms used to describe distributions 

Symmetric means that a distribution has approximately the same shape on the left and the right.

Modality 

Distributions of numeric (and ordered categorical) variables can be described as unimodal, bimodal, trimodal, etc.

Unimodal vs. bimodal

The histograms, whether symmetric or skewed, have one thing in common: they all have one peak (or mode). We call such histograms (and their corresponding distributions) unimodal. Sometimes histograms have multiple modes. The bimodal histogram appears to be a combination of two unimodal histograms. Indeed, in many situations bimodal histograms (and multimodal histograms in general) indicate that the underlying population is not homogeneous and may include two (or more in case of multimodal histograms) subpopulations.

Unimodal vs. bimodal

Histogram of a bimodal distribution. A smooth curve is superimposed so that the two peaks are more evident

Histogram of protein consumption in 25 European countries for white meat in Protein dataset. The histogram is bimodal, which indicates that the sample might be comprised of two subgroups

Distribution patterns

Skewness

Skewness is the degree to which the large (small) observations are more spread out than the small (large) observations. Skewness indicates asymmetry in the data. In a histogram, dotplot or (horizontal) strip plot:

– The large observations make up the right tail. The small observations make up the left tail.
– A distribution is skewed to the right when there is a long right tail.
– A distribution is skewed to the left when there is a long left tail.

Skewed histograms

In many situations, we find that a histogram is stretched to the left or right. We call such histograms skewed. More specifically, we call them left-skewed if they are stretched to the left, or right-skewed if they are stretched to the right.

Histogram of variable lwt in the birthwt data set. The histogram is right-skewed.

Skewness
– right skewed → mean > median
– left skewed → mean < median
– symmetric → mean ≈ median
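The relation between skewness and the mean/median ordering can be checked on small made-up samples. The three lists below are hypothetical and were chosen only to illustrate this rule of thumb; it is typical behaviour, not a guarantee for every data set.

```python
from statistics import mean, median

# Hypothetical samples chosen only to illustrate the rule of thumb.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]   # long right tail
left_skewed = [-v for v in right_skewed]          # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for name, sample in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(f"{name:>9}: mean={mean(sample):.2f} median={median(sample):.2f}")
```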

More About Histograms

– The mean is where the dotplot balances.
– For approximately bell-shaped data, approx. 68% of the data fall between (mean - sd) and (mean + sd).
– Approx. 95% of the data fall between (mean - 2sd) and (mean + 2sd).
– Almost exactly 1/2 the data are above the median and 1/2 are below it.
– Almost exactly 50% of the data are between the first and third quartiles.

For example 

A numeric variable where the mean is 15.1 and the standard deviation is 5.5.
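For this example the one- and two-standard-deviation intervals work out as below. The helper name `interval` is hypothetical; the calculation is plain arithmetic on the stated mean and standard deviation, and the coverage figures only apply to roughly bell-shaped data.

```python
mean_x = 15.1
sd_x = 5.5

def interval(k):
    """Return (mean - k*sd, mean + k*sd), rounded to one decimal place."""
    return (round(mean_x - k * sd_x, 1), round(mean_x + k * sd_x, 1))

# Roughly two thirds of the data are expected within 1 sd, about 95% within 2 sd.
print("within 1 sd:", interval(1))
print("within 2 sd:", interval(2))
```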

Percentiles The p-th percentile is the value that is greater than or equal to approximately p percent of the observations and less than or equal to approximately (100 − p) percent of the observations  The most commonly used percentile is the median.  The median is the ‘middle’ observation. One-half of the observations are less than or equal to it and one-half are greater than or equal to it. It is the 50th percentile. 

The Median 

Finding the median: – odd n – the middle value from the ordered list – even n – the mean of the two middle observations from the ordered list

Example: Patient ages, n = 10. The data are: 27, 28, 31, 35, 35, 40, 42, 43, 50, 52. There is an even number of observations, so we take the mean of the 5th and 6th: (35 + 40)/2 = 37.5.

Percentiles and Quartiles

Quartiles are specially named percentiles. Assuming the data are sorted from smallest to largest and that n = number of observations, the position of the p-th percentile is (p/100)(n + 1).
– If the position rounds to an integer k, use the kth observation.
– If the position rounds to ‘k.5’, use the average of the kth observation and the (k + 1)th observation.

Percentiles & Quartiles (cont.) 

There are many, many ways to define the exact number that is the pth percentile. – R has 9 different methods – The definition above is different from the default in R (but the results are close enough). – The definition above is what Bland uses.



Patient ages 27, 28, 31, 35, 35, 40, 42, 43, 50, 52 – Q1: position 0.25(10 + 1) = 2.75 ≈ 3. Use 3rd observation = 31 – Q2 = median: position 0.5(10 + 1) = 5.5. Use mean of 5th and 6th observation = 37.5 – Q3: position 0.75(10 + 1) = 8.25 ≈ 8. Use 8th observation = 43
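A minimal sketch of this percentile rule in Python follows. The function name `percentile` and the exact rounding are assumptions: a position ending in exactly .5 averages the two neighbouring observations, and any other position is rounded to the nearest integer. This matches the worked example but is not R's default method.

```python
import math

def percentile(sorted_values, p):
    """Percentile using position p/100 * (n + 1) on 1-based sorted data.

    Assumption: a position ending in exactly .5 averages the two neighbours;
    any other position is rounded to the nearest integer.
    """
    n = len(sorted_values)
    position = p / 100 * (n + 1)
    k = math.floor(position)
    if abs(position - k - 0.5) < 1e-9:
        # Position k.5: average the kth and (k+1)th observations.
        lower = sorted_values[max(k, 1) - 1]
        upper = sorted_values[min(k + 1, n) - 1]
        return (lower + upper) / 2
    # Otherwise use the nearest observation, clamped to the valid range.
    k = min(max(round(position), 1), n)
    return sorted_values[k - 1]

ages = sorted([35, 40, 52, 27, 31, 42, 43, 28, 50, 35])
q1, q2, q3 = (percentile(ages, p) for p in (25, 50, 75))
print(q1, q2, q3)
```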

Five-number summary 







Patient ages 27, 28, 31, 35, 35, 40, 42, 43, 50, 52
– The first quartile is greater than or equal to approximately 25% and smaller than or equal to approximately 75% of the observations: Q1 = 31, 30% ≤ 31, 80% ≥ 31.
– The third quartile is larger than or equal to approximately 75% and smaller than or equal to approximately 25% of the observations: Q3 = 43, 80% ≤ 43, 30% ≥ 43.
– A five-number summary consists of min, Q1, median, Q3, max = 27, 31, 37.5, 43, 52.
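The percentages quoted for Q1 and Q3 can be verified directly. In this sketch the quartiles are taken at the positions used in the slides (3rd observation, mean of 5th and 6th, 8th observation); the helper names are hypothetical.

```python
ages = sorted([35, 40, 52, 27, 31, 42, 43, 28, 50, 35])
n = len(ages)

def share_at_most(value):
    """Fraction of observations less than or equal to `value`."""
    return sum(a <= value for a in ages) / n

def share_at_least(value):
    """Fraction of observations greater than or equal to `value`."""
    return sum(a >= value for a in ages) / n

# Quartiles at the positions used in the slides.
q1 = ages[2]                       # 3rd observation
med = (ages[4] + ages[5]) / 2      # mean of 5th and 6th observations
q3 = ages[7]                       # 8th observation

five_number = (ages[0], q1, med, q3, ages[-1])
print("five-number summary:", five_number)
print("Q1:", share_at_most(q1), share_at_least(q1))
print("Q3:", share_at_most(q3), share_at_least(q3))
```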

Numerical Summaries – Measures of location 

The location or central tendency of a variable – Where is center of the data?



There are two commonly used location statistics. – The median: The middle observation in ordered data (or the mean of the middle two) – The mean: The sum of the observations divided by the number of observations.



Assume we have a data vector called x of length n. – Let x1 be the first observation in the data vector – Let x2 be the second observation in the data vector • etc.



Define

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$

then

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Example: Patient Ages 

Age: 27, 28, 31, 35, 35, 40, 42, 43, 50, 52 – Mean: 383/10 = 38.3 years – Median: (35 + 40)/2 = 37.5 years

Miscellaneous facts: – The mean is the place where the dotplot balances. – $\bar{x}$ is shorthand for mean(x) (pronounced “x bar”) – “Average” just means typical. It can be either mean or median (or mode).

Robust measure of central tendency

– Mean is sensitive to extreme values
– Example: blood pressure reading
– Median: The number separating the higher half of a sample or a population from the lower half
– Median is less sensitive to extreme values
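The sensitivity of the mean can be seen with a small made-up example. The blood pressure readings below are hypothetical (the slide's own figure is not available in the text), with one value deliberately mistyped.

```python
from statistics import mean, median

# Hypothetical systolic blood pressure readings (mmHg); not from the slides.
readings = [118, 120, 122, 124, 126]
# Same readings with the last value mistyped as 1260 instead of 126.
with_error = [118, 120, 122, 124, 1260]

print("clean:      mean =", mean(readings), " median =", median(readings))
print("with error: mean =", mean(with_error), " median =", median(with_error))
```

A single extreme value pulls the mean far away from the bulk of the data, while the median does not move.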

Measures of Variability 

Variability (or dispersion): How spread out are the data?
– The range: difference between the maximum and the minimum values
– The inter-quartile range (IQR): difference between the 3rd and 1st quartiles
– The variance: average of the squared distances between the observations and the mean
– The standard deviation: the square root of the mean-squared distance from the mean (denoted sd)

Indices of dispersion
– Summarize the dispersion of individual values from some central value like the mean
– Give a measure of variation

Variance & Standard Deviation 

Definitions using math symbols:
– x is our variable. It has n observations.
– The ith observation is written $x_i$
– $(x_i - \bar{x})^2$ is the squared distance of observation $x_i$ from the mean, $\bar{x}$
– The variance is (approximately) the mean of the squared distances between the observations and the mean; the sample version divides by n − 1:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

– The standard deviation is the square root of the variance:

$$s = \sqrt{s^2}$$

Variance & Standard Deviation 

More shorthand symbols – variance(x) = var(x) = $s^2$ – standard deviation(x) = sd(x) = s

The standard deviation is a measure of the “degree of scatter” in the data around the central value (mean = $\bar{x}$). We typically use the standard deviation rather than the variance since the units for the standard deviation are the same as the data’s (years for the age variable), which makes it easier to interpret. We use the variance in some calculations because it has handy properties.

Example (cont.) 

Patient Age (n = 10): 27, 28, 31, 35, 35, 40, 42, 43, 50, 52
– Variance = $s^2$ = 74.7 years²
– Standard deviation = s = 8.6 years
– IQR = 43 − 31 = 12 years
– Range = 52 − 27 = 25 years
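These four values can be reproduced with Python's standard library. Note that `statistics.variance` and `statistics.stdev` divide by n − 1, which is what the reported 74.7 corresponds to; the quartiles are taken at the positions used earlier in the slides.

```python
from statistics import variance, stdev

ages = sorted([35, 40, 52, 27, 31, 42, 43, 28, 50, 35])

s2 = variance(ages)                  # sample variance (divides by n - 1)
s = stdev(ages)                      # square root of the sample variance
iqr = ages[7] - ages[2]              # Q3 - Q1, using the 8th and 3rd observations
value_range = max(ages) - min(ages)  # maximum minus minimum

print(f"variance = {s2:.1f} years^2")
print(f"sd       = {s:.1f} years")
print(f"IQR      = {iqr} years")
print(f"range    = {value_range} years")
```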

Coefficient of variation

The coefficient of variation (CV, sometimes written “coefficient of variance”) expresses the standard deviation relative to its mean:

$$cv = \frac{s}{\bar{X}}$$

Weights of newborn elephants (kg)    Weights of newborn mice (kg)
929    853                           0.72    0.42
878    939                           0.63    0.31
895    972                           0.59    0.38
937    841                           0.79    0.96
801    826                           1.06    0.89
n = 10, mean = 887.1                 n = 10, mean = 0.68
s = 56.50, cv = 0.0637               s = 0.255, cv = 0.375

Mice show greater birthweight variation.

Copyright 2013 © Limsoon Wong

When to use coefficient of variation

When comparison groups have very different means  CV is suitable as it expresses the standard deviation relative to its corresponding mean



When different units of measurements are involved, e.g. group 1 unit is mm, and group 2 unit is mg  CV is suitable for comparison as it is unit- free



In such cases, standard deviation should not be used for comparison
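The elephant and mouse comparison can be recomputed as a sketch. The function name `coefficient_of_variation` is hypothetical, and the values are the newborn weights from the slide's table. The mouse CV comes out at about 0.378 here rather than 0.375, because the slide divides by the rounded mean 0.68.

```python
from statistics import mean, stdev

# Newborn weights as listed in the slide's table.
elephant_weights = [929, 853, 878, 939, 895, 972, 937, 841, 801, 826]
mouse_weights = [0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    return stdev(values) / mean(values)

cv_elephants = coefficient_of_variation(elephant_weights)
cv_mice = coefficient_of_variation(mouse_weights)

print(f"elephants: sd = {stdev(elephant_weights):.2f}, cv = {cv_elephants:.4f}")
print(f"mice:      sd = {stdev(mouse_weights):.3f}, cv = {cv_mice:.3f}")
```

The elephants have by far the larger standard deviation, yet the mice have the larger CV, which is why the standard deviation alone is misleading when the means differ this much.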

Box Plots one more way to display a numeric variable  A Box Plot shows: 

– minimum, maximum, median, quartiles and the IQR – extreme data points.

Box Plot 



Example: Car emissions data The data consists of the emissions of three different pollutants from 48 car engines. We are interested in HC = Hydrocarbon

Box Plot

Components of the Box Plot

– Upper box boundary = Q3 or 75th percentile
– Line in box = Median = Q2
– Lower box boundary = Q1 or 25th percentile
– Inter-quartile range (IQR) = Q3 − Q1, used to define the upper and lower fences
– Upper fence (not drawn) = Q3 + (1.5 × IQR)
– Lower fence (not drawn) = Q1 − (1.5 × IQR)
– Whiskers are drawn to the smallest and largest observations that are at or within the fences
– Extreme data points beyond the fence locations (high or low) are usually plotted individually
– Sometimes (but not in R) the mean is plotted as a point (star, circle, etc.)

Example: Drawing the Box

– Patient data: 27, 28, 31, 35, 35, 40, 42, 43, 50, 52
– Upper and lower boundaries for “box” at Q3 = 43 and Q1 = 31 years.
– Median = 37.5 years ⇒ bar through box at 37.5.

Drawing the Whiskers

– Data: 27, 28, 31, 35, 35, 40, 42, 43, 50, 52
– IQR = Q3 − Q1 = 43 − 31 = 12 years
– Upper fence location = 43 + 1.5 × 12 = 61; whisker drawn to 52 (this is the largest value that is ≤ 61)
– Lower fence location = 31 − 1.5 × 12 = 13; whisker drawn to 27 (this is the smallest value that is ≥ 13)
– No extreme values in this example.

Patient ages (cont.) 



Suppose the data set contained ages 12 and 62 (instead of 27 and 52). Data are now 12, 28, 31, 35, 35, 40, 42, 43, 50, 62

Patient ages (cont.)

Median, Q1, Q3, IQR and fence locations are the same. Whiskers change:

– Upper fence location = 61 ⇒ whisker to 50 (was 52)
– Lower fence location = 13 ⇒ whisker to 28 (was 27)

Two extreme points: 12 and 62. What would happen if we changed the 12 to 13? The lower whisker would extend down to 13 (equal to the lower fence).
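The fence and whisker rules can be sketched as a small function. The name `box_plot_stats` is hypothetical, and the quartiles are passed in explicitly (Q1 = 31, Q3 = 43, as computed in the slides) so that the example does not depend on a particular quartile definition.

```python
def box_plot_stats(sorted_values, q1, q3):
    """Return fences, whisker ends and extreme points for a box plot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach the smallest and largest values at or within the fences.
    inside = [v for v in sorted_values if lower_fence <= v <= upper_fence]
    # Anything beyond the fences is plotted individually.
    outside = [v for v in sorted_values if v < lower_fence or v > upper_fence]
    return {
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),
        "extreme": outside,
    }

original = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
modified = [12, 28, 31, 35, 35, 40, 42, 43, 50, 62]
shifted = [13] + modified[1:]   # the 12 changed to 13

for name, data in [("original", original), ("modified", modified), ("shifted", shifted)]:
    print(name, box_plot_stats(data, q1=31, q3=43))
```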

Box Plots Summary 

Box plots are useful and popular because they are a simple graphical display that shows:
– The quartiles of the data
– The minimum and maximum
– The IQR
– Skewness
– Extreme data points

To create a boxplot in R Commander:
– have a data set with numeric variables loaded
– use the menus: Graphs → Boxplot
– choose a numeric variable
– click “OK”

Exploring Relationships between variables

Relationships between variables 

A. two or more numeric variables – scatter plot – scatter plot matrix – 3d scatter plot



B. two or more categorical variables: – contingency tables (tables of counts, frequency tables) – grouped bar graph.



C. numeric and categorical variables – tables of summary statistics – side-by-side or stacked: box plots, strip plots, dot plots, histograms – color and/or shape coded scatter plots – additional plotting techniques.

Introduction

So far, we have focused on using graphs and summary statistics to explore the distribution of individual variables. In this lecture we discuss using graphs and summary statistics to investigate relationships between two or more variables. We want to develop a high-level understanding of the type and strength of relationships between variables. We start by exploring relationships between two numerical variables. We then look at the relationship between two categorical variables. Finally, we discuss the relationships between a categorical variable and a numerical variable.

Association Among Numeric Variables 

Frequently we wish to assess the nature of a relationship between two numeric variables, for example: – Height and weight – Blood pressure and cholesterol – Angina pain perception and age – National breast cancer incidence and average dietary fat intake – Hydrocarbon and nitrogen dioxide emissions.



The first thing to do is to plot the two variables in a scatterplot.

Example: Breast Cancer Dataset

Scatter Plot 

A scatterplot of fat vs breast cancer incidence (number of new cases per 100,000 per year).

Scatter Plot 

The theory is that fat intake affects breast cancer – fat intake is the predictor variable, plot on the x-axis – breast cancer incidence is the response variable, plot on the y-axis – The choice (of which variable is the predictor and which the response) is obvious in some cases and more arbitrary in others

In R Commander
– use the menu “Graphs” → “Scatterplot”
– choose two variables, one for the x axis and one for the y axis
– click on the options tab: unclick all the “Plot Options”; under “Identify Points” select “Do not identify”
– click “Apply” until you are happy with the plot
– then click “OK”

Scatterplot

Using scatterplots, we can detect possible relationships between two numerical variables. In the above examples, we can see that changes in one variable coincide with substantial systematic changes (increase or decrease) in the other variable. Since the overall relationship can be represented by a straight line, we say that the two variables have a linear relationship. We say that percent body fat and abdomen circumference have a positive linear relationship. In contrast, we say that annual mortality rate due to malignant melanoma and latitude have a negative linear relationship.

Scatterplot 



In some cases, the two variables are related, but the relationship is not linear (left plot). In some other cases, there is no relationship (linear or non-linear) between the two variables (right plot).

Scatterplot 



Left panel: The scatterplot of percent body fat by height from the bodyfat data set. The isolated point at the left of the graph is an outlier, which has a drastic influence on the overall pattern. Right panel: The scatterplot of percent body fat by height after removing the outlier. The two variables seem to be unrelated

Scatterplots – Summary 

A scatter plot is a graphical tool for presenting the distribution of two numeric variables simultaneously. – One point per observation: one variable on the x axis, one variable on the y axis – Every observation is presented – Allows us to visualize the relationship between the two variables

Plotting more than 2 numeric variables at a time 

There are 2 options – 1) for 3 variables: 3D plot – 2) for 3 or more variables: scatter plot matrix



A scatter plot matrix shows all 2 way scatter plots between the variables. – The axes are labeled on the outside of the plot – The upper right portion of the plot is a mirror image of the lower left. – The names of the variables (and optionally other information) are in the diagonal boxes



Example: iris data set included in R. – Three iris species – Lengths and widths of petals and sepals (green, leaf-like and protect the petals before opening)

Scatter plot matrix for the versicolor iris species

To accomplish this in Rcmdr: 

To load the iris data into Rcmdr: – choose menu: Data → Data in packages → Read data set from an attached package. – Type “iris” in the “Enter name of data set” box. – hit return or click “OK”.



To create the scatterplot matrix:
– menu: Graphs → Scatterplot Matrix
– select all 4 numeric variables
– in the “Subset expression” box specify Species == "versicolor"
– click on the options tab: click “histogram” under “On Diagonal”; unclick all “Other Options”
– click “Apply” until you are happy with the plot
– click “OK”

Scatter Plot Matrix

In Rcmdr:
• Graphs → Scatterplot matrix
• Choose variables to be plotted
• Click on “Plot by groups” and pick grouping variable for color coding
• In the Options tab unselect everything.
• click “Apply”
• when you have the plot as you want it click “OK”

3D plot with grouping - Iris Data

In Rcmdr:
– Graphs → 3D graph → 3D scatterplot
– Under Explanatory variables, select the first 2 variables (Petal.Length & Petal.Width)
– Under Response variable, select the third variable (Sepal.Length)
– Select Grouping variable to be Species
– click on the options tab: unselect everything under surfaces to fit; under Identify Points select “Do not identify”
– click “Apply” until you have the plot you want
– click “OK”
– use your cursor to rotate the plot

Correlation

To quantify the strength and direction of a linear relationship between two numerical variables, we can use Pearson's correlation coefficient, r, as a summary statistic. The values of r are always between -1 and +1. The relationship is strong when r approaches -1 or +1. The sign of r shows the direction (negative or positive) of the linear relationship. For observed pairs of values $(x_1, y_1), \ldots, (x_n, y_n)$:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$

where $s_x$ and $s_y$ are the sample standard deviations of x and y.

Correlation 

Calculating Pearson’s correlation coefficient for height and weight

Correlation 

click Statistics → Summaries → Correlation matrix

Obtaining and viewing the correlation between percent body fat and abdomen circumference in R-Commander

Correlation matrix for most of the numerical variables in the Protein data set

Visually Evaluating Correlation

Scatter plots showing correlations ranging from −1 to 1.

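Sample Covariance

If the standard deviations are removed from the denominator, the statistic is called the sample covariance,

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Therefore

$$r = \frac{s_{xy}}{s_x s_y}$$

where $s_x$ and $s_y$ are the sample standard deviations of x and y.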
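The relation between the sample covariance and Pearson's r can be sketched without any external library. The function names and the height/weight values below are hypothetical (the slide's own data are not reproduced in the text); the formulas use the n − 1 denominator.

```python
from math import sqrt

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Sample covariance divided by the product of the standard deviations."""
    s_x = sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical heights (cm) and weights (kg); not taken from the slides.
height = [150, 160, 165, 170, 180, 185]
weight = [52, 58, 63, 66, 77, 80]

print("covariance:", round(sample_covariance(height, weight), 2))
print("r:", round(pearson_r(height, weight), 3))
```

Dividing by the two standard deviations removes the units, so r stays between −1 and +1 regardless of whether height is in centimetres or metres.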

Two categorical variables 

We now discuss techniques for exploring relationships between categorical variables.



We usually use contingency tables to summarize such data.



As an example, we consider the five-year study to investigate whether regular aspirin intake reduces the risk of cardiovascular disease.



Each cell shows the frequency of one possible combination of disease status (heart attack or no heart attack) and experiment group (placebo or aspirin).



Using these frequencies, we can calculate the sample proportion of people who suffered from heart attack in each experiment group separately.

Two categorical variables 

Sample Proportion: – There were 11034 people in the placebo group, of which 189 had a heart attack. The proportion of people who suffered a heart attack in the placebo group is therefore p1 = 189/11034 = 0.0171. – The proportion of people who suffered a heart attack in the aspirin group is p2 = 104/11037 = 0.0094. – We refer to this as the risk (here, the sample proportion is used to measure risk) of heart attack.



Difference of Proportion: A substantial difference between the sample proportions of heart attack in the two experiment groups could lead us to believe that the treatment and disease status are related. – One way of measuring the strength of the relationship is to calculate the difference of proportions, p2 − p1. – Here, the difference of proportions is p2 − p1 = −0.0077. – The proportion of people who suffered a heart attack is lower by 0.0077 in the aspirin group compared to the placebo group.

Two categorical variables 

We can present this difference as a percentage using the sample proportion (risk) in the placebo group as the baseline:

$$\frac{p_2 - p_1}{p_1} = \frac{-0.0077}{0.0171} = -0.45$$

This means that the risk of heart attack is 45% lower in the aspirin group than in the placebo group.

Two categorical variables 

Relative Proportion: Another common summary statistic for comparing sample proportions is the relative proportion p2/p1. – Since the sample proportions in this case are related to the risk of heart attack, we refer to the relative proportion as the relative risk.

– Here, the relative risk of suffering a heart attack is p2/p1 = 0.0094/0.0171 = 0.55 – This means that the risk of a heart attack in the aspirin group is 0.55 times the risk in the placebo group.

If the two sample proportions are equal, the relative proportion (risk) is equal to 1, which is interpreted as no relationship between the two categorical variables. Values of the relative proportion away from 1 (either below 1 or above 1) indicate that the relationship is strong.
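The risk calculations for the aspirin study can be reproduced from the four counts quoted above. This is a sketch with hypothetical variable names; the results agree with the rounded values in the slides.

```python
# Counts quoted in the slides.
placebo_attacks, placebo_total = 189, 11034
aspirin_attacks, aspirin_total = 104, 11037

p1 = placebo_attacks / placebo_total   # risk in the placebo group
p2 = aspirin_attacks / aspirin_total   # risk in the aspirin group

difference = p2 - p1                   # difference of proportions
relative_change = difference / p1      # change relative to the placebo baseline
relative_risk = p2 / p1                # relative proportion (relative risk)

print(f"p1 = {p1:.4f}, p2 = {p2:.4f}")
print(f"difference = {difference:.4f}")
print(f"relative change = {relative_change:.0%}")
print(f"relative risk = {relative_risk:.2f}")
```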

Two categorical variables 

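It is more common to compare the sample odds,

$$o = \frac{p}{1 - p}$$

The odds of a heart attack in the placebo group, o1, and in the aspirin group, o2, are

$$o_1 = \frac{0.0171}{1 - 0.0171} = 0.0174, \qquad o_2 = \frac{0.0094}{1 - 0.0094} = 0.0095$$

We usually compare the sample odds using the sample odds ratio

$$OR = \frac{o_2}{o_1} = \frac{0.0095}{0.0174} = 0.55$$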
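A sketch of the odds and odds-ratio calculation for the aspirin study follows. The function name `odds` is hypothetical, and taking the aspirin odds over the placebo odds (rather than the reverse) is an assumption made for consistency with the relative risk p2/p1.

```python
def odds(events, total):
    """Odds = events / non-events, which equals p / (1 - p) with p = events / total."""
    return events / (total - events)

o1 = odds(189, 11034)   # placebo group
o2 = odds(104, 11037)   # aspirin group

# Assumption: aspirin odds over placebo odds, mirroring p2 / p1.
odds_ratio = o2 / o1

print(f"o1 = {o1:.4f}, o2 = {o2:.4f}, odds ratio = {odds_ratio:.2f}")
```

Because heart attacks are rare in both groups, the odds are close to the risks and the odds ratio is close to the relative risk.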

Summarizing categorical data

• A Proportion is a type of fraction in which the numerator is a subset of the denominator
– proportion dead = 35/86 = 0.41
• Odds are fractions where the numerator is not part of the denominator
– Odds in favor of death = 35/51 = 0.69
• A Ratio is a comparison of two numbers
– ratio of dead: alive = 35: 51
• Odds ratio: commonly used in case-control study
– Odds in favor of death for females = 12/25 = 0.48
– Odds in favor of death for males = 23/26 = 0.88
– Odds ratio = 0.88/0.48 = 1.84

Relationships Between Numerical and Categorical Variables 

Very often, we are interested in the relationship between a categorical variable and a numerical random variable.

Dot plots of vitamin C content (numerical) by cultivar (categorical) for the cabbages data set from the MASS package.

Strip chart (dot plot) for vitamin C content (VitC) by cultivar (Cult) from the cabbages data set using R Commander

Relationships Between Numerical and Categorical Variables 

A more common way of visualizing the relationship between a numerical variable and a categorical variable is to create boxplots.

       mean    sd      0%    25%    50%    75%    100%    N
C39    51.5    7.123   41    46     51     54     68      30
C52    64      8.455   47    58     64     70     84      30

Summary statistics of vitamin C content (VitC) by cultivar (Cult) from the cabbages data set

Boxplot of vitamin C content for different cultivars.

Relationships Between Numerical and Categorical Variables 







In general, we say that two variables are related if the distribution of one of them changes as the other one varies. We can measure changes in the distribution of the numerical variable by obtaining its summary statistics for different levels of the categorical variable. It is common to use the difference of means when examining the relationship between a numerical variable and a categorical variable. In the above example, the difference of means of vitamin C content is 64.4 − 51.5 = 12.9 between the two cultivars.
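Computing a summary statistic per level of a categorical variable amounts to grouping the observations first. The sketch below uses a handful of hypothetical (group, value) pairs, not the cabbages data, and only the mean as the summary.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (group, value) observations; not the cabbages data.
observations = [("A", 50), ("A", 54), ("A", 49), ("B", 62), ("B", 66), ("B", 64)]

# Collect the numerical values separately for each level of the categorical variable.
by_group = defaultdict(list)
for group, value in observations:
    by_group[group].append(value)

group_means = {group: mean(values) for group, values in by_group.items()}
difference_of_means = group_means["B"] - group_means["A"]

print(group_means)
print("difference of means:", difference_of_means)
```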

Relationships Between Numerical and Categorical Variables 

When the categorical variable has multiple levels (categories), it is easier to compare the means across different levels using the plot of means.

Plotting the means of bp for different weight groups (which are defined based on BMI).

Visualization Techniques: Contour Plots 

Contour plots – Useful when a continuous attribute is measured on a spatial grid – They partition the plane into regions of similar values – The contour lines that form the boundaries of these regions connect points with equal values – The most common example is contour maps of elevation – Can also display temperature, rainfall, air pressure, etc. 

An example for Sea Surface Temperature (SST) is provided on the next slide

Contour Plot Example: SST Dec, 1998 (temperature in degrees Celsius)

Visualization Techniques: Parallel Coordinates 

Parallel Coordinates – Used to plot the attribute values of high-dimensional data – Instead of using perpendicular axes, use a set of parallel axes – The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line – Thus, each object is represented as a line – Often, the lines representing a distinct class of objects group together, at least for some attributes – Ordering of attributes is important in seeing such groupings

Parallel Coordinates Plots for Iris Data

Other Visualization Techniques 

Star Plots – Similar approach to parallel coordinates, but axes radiate from a central point – The line connecting the values of an object is a polygon



Chernoff Faces – Approach created by Herman Chernoff – This approach associates each attribute with a characteristic of a face – The values of each attribute determine the appearance of the corresponding facial characteristic – Each object becomes a separate face – Relies on humans’ ability to distinguish faces

Star Plots for Iris Data (panels: Setosa, Versicolour, Virginica)

Chernoff Faces for Iris Data (panels: Setosa, Versicolour, Virginica)

Side-by-side plots: box plots 

Box plot of systolic B.P. by age group



It’s also possible to include a second categorical variable in a box plot.

There are many more types of plots 

E.g. Heat maps to display a numerical variable in two dimensions:

A different kind of heat map.

There are many ways to convey information graphically 

Example: A map of the world divided into areas containing 10 to 11 million people each: Displays human density.

DATA EXPLORATION EXAMPLE

Iris Sample Data Set 

Many of the exploratory data techniques are illustrated with the Iris Plant data set. – Can be obtained from the UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html

– From the statistician R. A. Fisher – Three flower types (classes): Setosa, Virginica, Versicolour

– Four (non-class) attributes: sepal width and length, petal width and length

Visualization Techniques: Histograms 

Histogram – Usually shows the distribution of values of a single variable – A histogram is similar to a bar graph after the values of the variable are grouped (binned) into a finite number of intervals (bins). – Divide the values into bins and show a bar plot of the number of objects in each bin. – The height of each bar indicates the number of objects – Shape of histogram depends on the number of bins



Example: Petal Width (10 and 20 bins, respectively)

Example of Box Plots 

Box plots can be used to compare attributes

Scatter Plot Array of Iris Attributes