Data Analysis: Exploratory Data Analysis
What is data exploration? A preliminary exploration of the data to better understand its characteristics.
Key motivations of data exploration include:
– Helping to select the right tool for preprocessing or analysis
– Making use of humans' ability to recognize patterns; people can recognize patterns not captured by data analysis tools
Related to the area of Exploratory Data Analysis (EDA):
– Created by statistician John Tukey
– Seminal book is Exploratory Data Analysis by Tukey
– A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/index.htm
Techniques Used In Data Exploration
In EDA, as originally defined by Tukey – The focus was on visualization – Clustering and anomaly detection were viewed as exploratory techniques – In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
In our discussion of data exploration, we focus on – Summary statistics – Visualization – Online Analytical Processing (OLAP)
Data and Exploratory Data Analysis (EDA)
The first step in analyzing data is to summarize and plot. – Initial summary of data – includes checking for unusual or erroneous values, identifying missing items
– Preliminary analysis of data – Preliminary interpretation of data
The techniques used are called Exploratory Data Analysis. Data exploration involves:
– Visualization: The objective of data visualization is to obtain a high-level understanding of the samples and their observed (measured) characteristics.
– Summary statistics: To make the data more manageable, we need to further reduce the amount of information in some meaningful way so that we can focus on the key aspects of the data.
Visualization
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
Visualization of data is one of the most powerful and appealing techniques for data exploration. – Humans have a well developed ability to analyze large amounts of information that is presented visually – Can detect general patterns and trends – Can detect outliers and unusual patterns
Example: Sea Surface Temperature
The following shows the Sea Surface Temperature (SST) for July 1982 – Tens of thousands of data points are summarized in a single figure
Data exploration Using data exploration techniques, we can learn about the distribution of a variable. Informally, the distribution of a variable tells us
– the possible values it can take, – the chance of observing those values – how often we expect to see them in a random sample from the population.
Through data exploration, We might detect previously unknown patterns and relationships that are worth further investigation. We can also identify possible data issues, such as unexpected or unusual measurements, known as outliers.
Summarizing Data
Examine the entire data set using basic techniques before starting a formal statistical analysis:
– Familiarize yourself with the data.
– Find possible errors and anomalies.
– Examine the distribution of values for each variable.
– Create simple graphical and numerical summaries.
– Look at simple associations between variables.
Summarizing Data
Statistics is "making sense of data". Raw data have to be processed and summarized before one can make sense of them. A summary can take the form of:
– Summary index: using a single value to summarize data from a study variable – Tables – Diagrams
Summarizing Variables
Categorical variables – Frequency tables - how many observations in each category? – Relative frequency table - percent in each category. – Bar chart and other plots.
Numeric variables – Bin the observations (create categories, e.g., (0-10), (11-20), etc.), then treat as ordered categorical. – Plots specific to numeric variables.
The goal for both categorical and numeric data is data reduction while preserving/extracting key information about the process under investigation.
Frequency Table and Mode
Frequency Table: Categories with counts
The number of times a specific category is observed is called its frequency. The sum of the frequencies for all categories is equal to the total sample size. The relative frequency is the sample proportion for each possible category; it is obtained by dividing the frequencies by the total number of observations. For a categorical variable, the mode is the most common value, i.e., the value with the highest frequency.
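The slides describe these summaries in terms of R Commander output; as a language-agnostic sketch, the same frequency table, relative frequencies, and mode can be computed in a few lines of Python (the blood-type sample below is hypothetical, chosen only for illustration):

```python
from collections import Counter

# Hypothetical sample of a categorical variable (blood types)
observations = ["A", "O", "B", "O", "A", "O", "AB", "A", "O", "B"]

counts = Counter(observations)                        # frequency table
n = sum(counts.values())                              # total sample size
rel_freq = {cat: c / n for cat, c in counts.items()}  # relative frequencies
mode = counts.most_common(1)[0][0]                    # category with the highest frequency

print(counts["O"], rel_freq["O"], mode)
```

Note that the relative frequencies always sum to 1 (or 100% when expressed as percentages), which is a quick sanity check on any frequency table.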
Graphing a Frequency Table - Bar Chart:
Plot the number of observations in each category:
Categorical Data Summaries
Summary: – Frequency tables and relative frequency tables are used for describing categorical data – Bar plots are often used to display data. – Other types of plots such as pie charts are probably not useful in scientific settings
Summarizing measurement data
Distribution patterns – Symmetrical (bell-shaped) distribution, e.g. normal distribution – Skewed distribution – Bimodal and multimodal distribution
Indices of central tendency (Location) – Mean, median
Indices of dispersion (Spread) – Variance, standard deviation, coefficient of variation
Numeric Data – Grouping Example: Ages of 10 people: 35, 40, 52, 27, 31, 42, 43, 28, 50, 35 One option is to group these ages into decades and create a categorical age variable:
Frequency Table
Create a frequency table for age groups:
Reminder: The percent symbol literally means “per 100” or, equivalently, divide by 100.
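The decade-grouping step above can be sketched directly; the ages are the ten values from the slide, and the decade labels are a hypothetical naming choice:

```python
from collections import Counter

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

def decade(age):
    # Map an age to its decade label, e.g. 35 -> "30-39"
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

groups = Counter(decade(a) for a in ages)            # frequency table for age groups
rel = {g: c / len(ages) for g, c in groups.items()}  # relative frequencies (x100 = percent)
```

Treating the binned ages as an ordered categorical variable, the frequency table here is 20-29: 2, 30-39: 3, 40-49: 3, 50-59: 2.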
Histograms
A histogram is a bar chart constructed using the frequencies or relative frequencies of a grouped (or "binned") continuous variable. It discards some information (the exact values), retaining only the frequencies in each "bin". Age histogram of 10 people
Age Histogram
Age histogram for a subset of 1395 subjects
Here, age is treated as a continuous variable and “binned” by ranges of values (of width 5 years). The area and height of each “block” is proportional to the relative frequency in that category.
Age histogram
We can force the size of the bins to be smaller (2.5 years)
This generally makes the histogram rougher; it may reveal hidden patterns, but the bins may also become too "sparse" to identify patterns.
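The effect of bin width can be seen numerically without drawing anything; a sketch using NumPy's `histogram` on the ten patient ages from earlier (bin ranges 20-60 are an assumption chosen to cover the data):

```python
import numpy as np

ages = np.array([35, 40, 52, 27, 31, 42, 43, 28, 50, 35])

# Width-10 bins over [20, 60] vs width-5 bins over the same range
counts10, _ = np.histogram(ages, bins=np.arange(20, 61, 10))
counts5, _ = np.histogram(ages, bins=np.arange(20, 61, 5))

print(counts10.tolist())  # 4 bins: [2, 3, 3, 2]
print(counts5.tolist())   # 8 bins: rougher, some bins empty
```

Both binnings account for all 10 observations, but the narrower bins scatter them into a rougher, sparser shape, as the slide notes.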
Terms used to describe distributions
Symmetric means that a distribution has approximately the same shape on the left and the right.
Modality
Distributions of numeric (and ordered categorical) variables can be described as unimodal, bimodal, trimodal, etc.
Unimodal vs. bimodal
The histograms, whether symmetric or skewed, have one thing in common: they all have one peak (or mode). We call such histograms (and their corresponding distributions) unimodal. Sometimes histograms have multiple modes. The bimodal histogram appears to be a combination of two unimodal histograms. Indeed, in many situations bimodal histograms (and multimodal histograms in general) indicate that the underlying population is not homogeneous and may include two (or more in case of multimodal histograms) subpopulations.
Unimodal vs. bimodal
Histogram of a bimodal distribution. A smooth curve is superimposed so that the two peaks are more evident
Histogram of protein consumption in 25 European countries for white meat in Protein dataset. The histogram is bimodal, which indicates that the sample might be comprised of two subgroups
Distribution patterns
Skewness
Skewness is the degree to which the large (small) observations are more spread out than the small (large) observations. Skewness indicates asymmetry in the data. In a histogram, dotplot or (horizontal) strip plot:
– The large observations make up the right tail. The small observations make up the left tail – A distribution is skewed to the right when there is a long right tail. – A distribution is skewed to the left when there is a long left tail.
Skewed histograms
In many situations, we find that a histogram is stretched to the left or right. We call such histograms skewed. More specifically, we call them left-skewed if they are stretched to the left, or right-skewed if they are stretched to the right.
Histogram of variable lwt in the birthwt data set. The histogram is right-skewed
Skewness right skewed → mean > median left skewed → mean < median symmetric → mean ≈ median
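The mean-median relationship above is easy to verify on a small sample; the right-skewed values below are hypothetical, with one large value forming a long right tail:

```python
import statistics

# A small hypothetical right-skewed sample (long right tail)
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]

mean = statistics.mean(right_skewed)      # pulled toward the long right tail
median = statistics.median(right_skewed)  # unaffected by the extreme value

print(mean > median)  # True: right skewed -> mean > median
```

Reversing the tail (e.g. one very small value) would give mean < median, matching the left-skewed case.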
More About Histograms
The mean is where the dotplot balances. For roughly bell-shaped data, approx. 68% of the data fall between (mean − sd) and (mean + sd), and approx. 95% fall between (mean − 2sd) and (mean + 2sd). Almost exactly half the data are above and half below the median. Almost exactly 50% of the data are between the first and third quartiles.
For example
A numeric variable where the mean is 15.1 and the standard deviation is 5.5.
Percentiles The p-th percentile is the value that is greater than or equal to approximately p percent of the observations and less than or equal to approximately (100 − p) percent of the observations The most commonly used percentile is the median. The median is the ‘middle’ observation. One-half of the observations are less than or equal to it and one-half are greater than or equal to it. It is the 50th percentile.
The Median
Finding the median: – odd n – the middle value from the ordered list – even n – the mean of the two middle observations from the ordered list
Example: Patient ages, n = 10: The data are: 27, 28, 31, 35, 35, 40, 42, 43, 50, 52. There is an even number of observations, so we take the mean of the 5th and 6th: (35 + 40)/2 = 37.5.
Percentiles and Quartiles
Quartiles are specially named percentiles: Q1 is the 25th percentile, Q2 the 50th (the median), and Q3 the 75th. Assuming the data are sorted from smallest to largest and that n = number of observations, compute the position p(n + 1)/100 for the percentile of interest. If the position rounds to an integer k, use the kth observation. If the position lands on 'k.5', use the average of the kth observation and the (k + 1)th observation.
Percentiles & Quartiles (cont.)
There are many, many ways to define the exact number that is the pth percentile. – R has 9 different methods – The definition above is different from the default in R (but the results are close enough). – The definition above is what Bland uses.
Patient ages 27, 28, 31, 35, 35, 40, 42, 43, 50, 52 – Q1 = 0.25(10 + 1) ≈ 3. Use 3rd observation = 31 – Q2 =median=0.5(10+1)≈5.5. Use mean of 5th and 6th observation = 37.5 – Q3 = 0.75(10 + 1) ≈ 8. Use 8th observation = 43
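The (n + 1) positioning rule used above (the one Bland uses, which differs slightly from R's default) can be sketched in Python; the function name `percentile` is just an illustrative choice:

```python
def percentile(values, p):
    # p-th percentile via the (n + 1) positioning rule described above
    data = sorted(values)
    pos = p * (len(data) + 1) / 100.0
    k = int(pos)
    if abs(pos - k - 0.5) < 1e-9:           # position lands on "k.5"
        return (data[k - 1] + data[k]) / 2  # average k-th and (k+1)-th observations
    k = round(pos)                          # otherwise round to the nearest rank
    return data[k - 1]

ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
q1 = percentile(ages, 25)   # position 2.75 -> 3rd observation
q2 = percentile(ages, 50)   # position 5.5  -> mean of 5th and 6th
q3 = percentile(ages, 75)   # position 8.25 -> 8th observation
```

This reproduces the slide's Q1 = 31, median = 37.5, Q3 = 43 for the patient ages.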
Five-number summary
Patient ages: 27, 28, 31, 35, 35, 40, 42, 43, 50, 52
The first quartile is greater than or equal to approximately 25% and smaller than or equal to approximately 75% of the observations: Q1 = 31 (30% of observations are ≤ 31, 80% are ≥ 31).
The third quartile is larger than or equal to approximately 75% and smaller than or equal to approximately 25% of the observations: Q3 = 43 (80% of observations are ≤ 43, 30% are ≥ 43).
A five-number summary consists of min, Q1, median, Q3, max = 27, 31, 37.5, 43, 52.
Numerical Summaries – Measures of location
The location or central tendency of a variable – Where is center of the data?
There are two commonly used location statistics. – The median: The middle observation in ordered data (or the mean of the middle two) – The mean: The sum of the observations divided by the number of observations.
Assume we have a data vector called x of length n. – Let x1 be the first observation in the data vector – Let x2 be the second observation in the data vector • etc.
Define the sum T = x1 + x2 + ... + xn, then the sample mean is x̄ = T/n.
Example: Patient Ages
Age: 27, 28, 31, 35, 35, 40, 42, 43, 50, 52
– Mean: (27 + 28 + 31 + 35 + 35 + 40 + 42 + 43 + 50 + 52)/10 = 383/10 = 38.3 years
– Median: mean of the 5th and 6th ordered observations = (35 + 40)/2 = 37.5 years
Miscellaneous facts: – The mean is the place where the dotplot balances. – x̄ is shorthand for mean(x) (pronounced "x bar") – "Average" just means typical. It can be either mean or median (or mode).
Robust Measures of Central Tendency
The mean is sensitive to extreme values. Example: blood pressure readings.
The median is the number separating the higher half of a sample (or population) from the lower half. The median is less sensitive to extreme values, so it is a robust measure of central tendency.
Measures of Variability
Variability (or dispersion): How spread out are the data?
– The range: difference between the maximum and the minimum values
– The inter-quartile range (IQR): difference between the 3rd and 1st quartiles
– The variance: average of the squared distances between the observations and the mean
– The standard deviation (sd): the square root of the mean-squared distance from the mean
Indices of dispersion summarize the dispersion of individual values from some central value, like the mean.
Variance & Standard Deviation
Definitions using math symbols:
– x is our variable. It has n observations.
– The ith observation is written xi.
– (xi − x̄)² is the squared distance of observation xi from the mean x̄.
– The variance is (approximately) the mean of the squared distances between the observations and the mean: s² = Σ (xi − x̄)² / (n − 1)
– The standard deviation is the square root of the variance: s = √(s²)
Variance & Standard Deviation
More shorthand symbols – variance(x ) = var(x ) = s2 – standard deviation(x ) = sd(x ) = s
The standard deviation is a measure of the "degree of scatter" in the data around the central value (the mean, x̄). We typically use the standard deviation rather than the variance since the units of the standard deviation are the same as the data's (years for the age variable), which makes it easier to interpret. We use the variance in some calculations because it has handy properties.
Example (cont.)
Patient Age (n = 10): 27, 28, 31, 35, 35, 40, 42, 43, 50, 52
– Variance = s² = 74.7 years²
– Standard deviation = s = 8.6 years
– IQR = 43 − 31 = 12 years
– Range = 52 − 27 = 25 years
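These dispersion measures can be reproduced directly from the definitions (note the n − 1 denominator in the sample variance, which is what yields the slide's 74.7):

```python
import math

ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
n = len(ages)
mean = sum(ages) / n                                # 38.3 years
var = sum((x - mean) ** 2 for x in ages) / (n - 1)  # sample variance, n - 1 denominator
sd = math.sqrt(var)
iqr = 43 - 31                                       # Q3 - Q1 from the quartile example
rng = max(ages) - min(ages)

print(round(var, 1), round(sd, 1), iqr, rng)  # 74.7 8.6 12 25
```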
Coefficient of Variation
The coefficient of variation (CV) expresses the standard deviation relative to the mean:
cv = s / x̄
Weights of newborn elephants (kg): 929, 853, 878, 939, 895, 972, 937, 841, 801, 826
n = 10, x̄ = 887.1, s = 56.50, cv = 0.0637
Weights of newborn mice (kg): 0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89
n = 10, x̄ = 0.68, s = 0.255, cv = 0.375
Mice show greater birth-weight variation.
Copyright 2013 © Limsoon Wong
When to use the coefficient of variation
When comparison groups have very different means, the CV is suitable as it expresses the standard deviation relative to its corresponding mean.
When different units of measurement are involved, e.g. group 1 unit is mm and group 2 unit is mg, the CV is suitable for comparison as it is unit-free.
In such cases, standard deviation should not be used for comparison
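The elephant/mouse comparison above can be checked from the raw weights; a small sketch (the function name `cv` is an illustrative choice):

```python
import math

def cv(data):
    # Coefficient of variation: sample sd divided by the sample mean (unit-free)
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return s / m

elephants_kg = [929, 853, 878, 939, 895, 972, 937, 841, 801, 826]
mice_kg = [0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89]

print(round(cv(elephants_kg), 4))  # ~0.0637
print(round(cv(mice_kg), 3))       # ~0.378 (the slide's 0.375 uses the rounded mean 0.68)
```

Even though elephants have a far larger standard deviation in absolute terms (56.5 kg vs 0.255 kg), the unit-free CV shows the mice are relatively much more variable.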
Box Plots: one more way to display a numeric variable. A box plot shows:
– minimum, maximum, median, quartiles and the IQR – extreme data points.
Box Plot
Example: Car emissions data The data consists of the emissions of three different pollutants from 48 car engines. We are interested in HC = Hydrocarbon
Box Plot
Components of the Box Plot
– Upper box boundary = Q3 (75th percentile)
– Line in box = median = Q2
– Lower box boundary = Q1 (25th percentile)
– Inter-quartile range (IQR) = Q3 − Q1, used to define the upper and lower fences
– Upper fence (not drawn) = Q3 + (1.5 × IQR)
– Lower fence (not drawn) = Q1 − (1.5 × IQR)
– Whiskers are drawn to the smallest and largest observations that are at or within the fences
– Extreme data points beyond the fences (high or low) are usually plotted individually
– Sometimes (but not in R) the mean is plotted as a point (star, circle, etc.)
Example: Drawing the Box
Patient data: 27,28,31,35,35,40,42,43,50,52 Upper and lower boundaries for “box” at Q3 = 43 and Q1 = 31 years. Median = 37.5 years =⇒ bar through box at 37.5.
Drawing the Whiskers
Patient data: 27, 28, 31, 35, 35, 40, 42, 43, 50, 52
IQR = Q3 − Q1 = 43 − 31 = 12 years
Upper fence location = 43 + 1.5 × 12 = 61; whisker drawn to 52 (the largest value that is ≤ 61)
Lower fence location = 31 − 1.5 × 12 = 13; whisker drawn to 27 (the smallest value that is ≥ 13)
No extreme values in this example.
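The fence and whisker arithmetic above can be sketched directly from the quartiles (the variable names are illustrative):

```python
ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
q1, q3 = 31, 43                    # quartiles from the earlier example
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr       # 61 (not drawn)
lower_fence = q1 - 1.5 * iqr       # 13 (not drawn)
# Whiskers reach the most extreme observations at or within the fences
upper_whisker = max(x for x in ages if x <= upper_fence)
lower_whisker = min(x for x in ages if x >= lower_fence)
outliers = [x for x in ages if x > upper_fence or x < lower_fence]
```

For this sample the whiskers land on 52 and 27 and `outliers` is empty, matching the slide; swapping 27 and 52 for 12 and 62 would push both beyond the fences and plot them individually.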
Patient ages (cont.)
Suppose the data set contained ages 12 and 62 (instead of 27 and 52). Data are now 12, 28, 31, 35, 35, 40, 42, 43, 50, 62
Patient ages (cont.) Median, Q1, Q3, IQR and fence locations are the same. Whiskers change
– Upper fence location = 61 =⇒ whisker to 50 (was 52) – Lower fence location = 13 =⇒ whisker to 28 (was 27)
Two extreme points. What would happen if we changed the 12 to 13? The lower whisker would extend down to 13 (equal to the lower fence).
Box Plots Summary
Box plots are useful and popular because they are a simple graphical display that shows:
– The quartiles of the data
– The minimum and maximum
– The IQR
– Skewness
– Extreme data points
To create a boxplot in R Commander:
– have a data set with numeric variables loaded
– use the menus: Graphs → Boxplot
– choose a numeric variable
– click "OK"
Exploring Relationships between variables
Relationships between variables
A. Two or more numeric variables:
– scatter plot
– scatter plot matrix
– 3D scatter plot
B. two or more categorical variables: – contingency tables (tables of counts, frequency tables) – grouped bar graph.
C. Numeric and categorical variables:
– tables of summary statistics
– side-by-side or stacked: box plots, strip plots, dot plots, histograms
– color and/or shape coded scatter plots
– additional plotting techniques.
Introduction
So far, we have focused on using graphs and summary statistics to explore the distribution of individual variables. In this lecture we discuss using graphs and summary statistics to investigate relationships between two or more variables. We want to develop a high-level understanding of the type and strength of relationships between variables. We start by exploring relationships between two numerical variables. We then look at the relationship between two categorical variables. Finally, we discuss the relationships between a categorical variable and a numerical variable.
Association Among Numeric Variables
Frequently we wish to assess the nature of a relationship between two numeric variables, for example: – Height and weight – Blood pressure and cholesterol – Angina pain perception and age – National breast cancer incidence and average dietary fat intake – Hydrocarbon and nitrogen dioxide emissions.
The first thing to do is to plot the two variables in a scatterplot.
Example: Breast Cancer Dataset
Scatter Plot
A scatterplot of fat vs breast cancer incidence (number of new cases per 100,000 per year).
Scatter Plot
The theory is that fat intake affects breast cancer – fat intake is the predictor variable, plotted on the x-axis
– breast cancer incidence is the response variable, plot on the y axis – The choice (of which variable is the predictor and which the response) is obvious in some cases and more arbitrary in others
In R Commander:
– use the menu Graphs → Scatterplot
– choose two variables, one for the x axis and one for the y axis
– click on the Options tab
  - unclick all the "Plot Options"
  - under "Identify Points" select "Do not identify"
– click "Apply" until you are happy with the plot, then click "OK"
Scatterplot
Using scatterplots, we can detect possible relationships between two numerical variables. In the examples above, we can see that changes in one variable coincide with substantial systematic changes (increases or decreases) in the other variable. Since the overall relationship can be represented by a straight line, we say that the two variables have a linear relationship. We say that percent body fat and abdomen circumference have a positive linear relationship. In contrast, annual mortality rate due to malignant melanoma and latitude have a negative linear relationship.
Scatterplot
In some cases, the two variables are related, but the relationship is not linear (left plot). In some other cases, there is no relationship (linear or non-linear) between the two variables (right plot).
Scatterplot
Left panel: The scatterplot of percent body fat by height from the bodyfat data set. The isolated point at the left of the graph is an outlier, which has a drastic influence on the overall pattern. Right panel: The scatterplot of percent body fat by height after removing the outlier. The two variables seem to be unrelated
Scatterplots – Summary
A scatter plot is a graphical tool for presenting the distribution of two numeric variables simultaneously. – One point per observation
one variable on the x axis
one variable on the y axis
– Every observation is presented – Allows us to visualize the relationship between the two variables
Plotting more than 2 numeric variables at a time
There are 2 options – 1) for 3 variables: 3D plot – 2) for 3 or more variables: scatter plot matrix
A scatter plot matrix shows all 2-way scatter plots between the variables. – The axes are labeled on the outside of the plot – The upper right portion of the plot is a mirror image of the lower left. – The names of the variables (and optionally other information) are in the diagonal boxes
Example: iris data set included in R. – Three iris species – Lengths and widths of petals and sepals (green, leaf-like and protect the petals before opening)
Scatter plot matrix for the versicolor iris species
To accomplish this in Rcmdr:
To load the iris data into Rcmdr: – choose menu: Data → Data in packages → Read data set from an attached package. – Type “iris” in the “Enter name of data set” box. – hit return or click “OK”.
To create the scatterplot matrix: – menu: Graphs → Scatterplot Matrix – select all 4 numeric variables
– in the "Subset expression" box specify Species == "versicolor"
– click on the Options tab
  - click "histogram" under "On Diagonal"
  - unclick all "Other Options"
– click "Apply" until you are happy with the plot
– click "OK"
Scatter Plot Matrix
In Rcmdr:
– Graphs → Scatterplot matrix
– Choose the variables to be plotted
– Click "Plot by groups" and pick a grouping variable for color coding
– In the Options tab unselect everything
– Click "Apply"
– When you have the plot as you want it, click "OK"
3D plot with grouping - Iris Data In Rcmdr:
Graphs → 3D graph → 3D scatterplot
Under Explanatory variables select the first two variables (Petal.Length & Petal.Width)
Under Response variable select the third variable (Sepal.Length)
Select Grouping variable to be Species
Click on the Options tab:
– unselect everything under "Surfaces to Fit"
– under "Identify Points" select "Do not identify"
Click "Apply" until you have the plot you want, then click "OK". Use your cursor to rotate the plot.
Correlation
To quantify the strength and direction of a linear relationship between two numerical variables, we can use Pearson's correlation coefficient, r, as a summary statistic. The values of r are always between −1 and +1. The relationship is strong when r approaches −1 or +1. The sign of r shows the direction (negative or positive) of the linear relationship. For observed pairs of values (x1, y1), ..., (xn, yn):
r = Σ (xi − x̄)(yi − ȳ) / [(n − 1) sx sy]
where sx and sy are the sample standard deviations of x and y.
Correlation
Calculating Pearson’s correlation coefficient for height and weight
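The slide walks through this calculation in R-Commander; the same computation from the definition, with hypothetical height/weight pairs, looks like:

```python
import math

def pearson_r(x, y):
    # Pearson's correlation coefficient for paired observations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - my) ** 2 for yi in y))
    return num / den

# Hypothetical height (cm) and weight (kg) pairs
height = [160, 165, 170, 175, 180, 185]
weight = [55, 60, 66, 70, 77, 82]
r = pearson_r(height, weight)  # close to +1: strong positive linear relationship
```

(The (n − 1) factors in the numerator and denominator of the formula cancel, which is why they do not appear explicitly here.)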
Correlation
click Statistics → Summaries → Correlation matrix
Obtaining and viewing the correlation between percent body fat and abdomen circumference in R-Commander
Correlation matrix for most of the numerical variables in the Protein data set
Visually Evaluating Correlation
Scatter plots showing correlations ranging from −1 to +1.
Sample Covariance
If the standard deviations are removed from the denominator, the statistic is called the sample covariance:
cov(x, y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)
Therefore r = cov(x, y) / (sx sy).
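The relationship between covariance and correlation can be demonstrated numerically; the paired data below are hypothetical:

```python
import math

def sample_cov(x, y):
    # Sample covariance: the correlation numerator without dividing by sx * sy
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def sample_sd(x):
    n = len(x)
    m = sum(x) / n
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))

# Hypothetical paired data: r is recovered as cov / (sx * sy)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
r = sample_cov(x, y) / (sample_sd(x) * sample_sd(y))
```

Unlike r, the covariance carries the product of the two variables' units, so its magnitude is hard to interpret on its own; dividing by the standard deviations standardizes it to the [−1, +1] scale.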
Two categorical variables
We now discuss techniques for exploring relationships between categorical variables.
We usually use contingency tables to summarize such data.
As an example, we consider a five-year study investigating whether regular aspirin intake reduces the risk of cardiovascular disease.
Each cell shows the frequency of one possible combination of disease status (heart attack or no heart attack) and experiment group (placebo or aspirin).
Using these frequencies, we can calculate the sample proportion of people who suffered from heart attack in each experiment group separately.
Two categorical variables
Sample Proportion: – There were 11034 people in the placebo group, of which 189 had a heart attack. The proportion of people who suffered a heart attack in the placebo group is therefore p1 = 189/11034 = 0.0171. – The proportion of people who suffered a heart attack in the aspirin group is p2 = 104/11037 = 0.0094. – We refer to this as the risk (here, the sample proportion is used to measure risk) of heart attack.
Difference of Proportions: A substantial difference between the sample proportions of heart attack in the two experiment groups could lead us to believe that the treatment and disease status are related. – One way of measuring the strength of the relationship is to calculate the difference of proportions, p2 − p1. – Here, the difference of proportions is p2 − p1 = −0.0077. – The proportion of people who suffered a heart attack is reduced by 0.0077 in the aspirin group compared to the placebo group.
Two categorical variables
We can present this difference as a percentage using the sample proportion (risk) in the placebo group as the baseline: (p2 − p1)/p1 = −0.0077/0.0171 ≈ −0.45.
This means that the risk of heart attack is reduced by about 45% in the aspirin group compared to the placebo group.
Two categorical variables
Relative Proportion: Another common summary statistic for comparing sample proportions is the relative proportion p2/p1. – Since the sample proportions in this case are related to the risk of heart attack, we refer to the relative proportion as the relative risk.
– Here, the relative risk of suffering a heart attack is p2/p1 = 0.0094/0.0171 = 0.55 – This means that the risk of a heart attack in the aspirin group is 0.55 times the risk in the placebo group.
If the two sample proportions are equal, the relative proportion (risk) is equal to 1, which is interpreted as no relationship between the two categorical variables. Values of the relative proportion away from 1 (either below 1 or above 1) indicate that the relationship is strong.
Two categorical variables
It is more common to compare the sample odds, i.e., the number of people who had a heart attack divided by the number who did not.
The odds of a heart attack in the placebo group, o1, and in the aspirin group, o2, are o1 = 189/(11034 − 189) = 0.0174 and o2 = 104/(11037 − 104) = 0.0095.
We usually compare the sample odds using the sample odds ratio: OR = o2/o1 = 0.0095/0.0174 ≈ 0.55.
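All of the summary statistics for this 2×2 table follow directly from the four counts given in the text:

```python
# Counts from the aspirin study described above
placebo_n, placebo_ha = 11034, 189
aspirin_n, aspirin_ha = 11037, 104

p1 = placebo_ha / placebo_n     # risk in the placebo group
p2 = aspirin_ha / aspirin_n     # risk in the aspirin group
diff = p2 - p1                  # difference of proportions
relative_risk = p2 / p1

o1 = placebo_ha / (placebo_n - placebo_ha)  # odds, placebo group
o2 = aspirin_ha / (aspirin_n - aspirin_ha)  # odds, aspirin group
odds_ratio = o2 / o1
```

Because heart attacks are rare in both groups, the odds ratio (≈0.55) is nearly identical to the relative risk here; for common outcomes the two can differ noticeably.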
Summarizing categorical data
– A Proportion is a type of fraction in which the numerator is a subset of the denominator: proportion dead = 35/86 = 0.41
– Odds are fractions where the numerator is not part of the denominator: odds in favor of death = 35/51 = 0.69
– A Ratio is a comparison of two numbers: ratio of dead : alive = 35 : 51
– Odds ratios are commonly used in case-control studies:
  odds in favor of death for females = 12/25 = 0.48
  odds in favor of death for males = 23/26 = 0.88
  odds ratio = 0.88/0.48 = 1.84
Relationships Between Numerical and Categorical Variables
Very often, we are interested in the relationship between a categorical variable and a numerical random variable.
Dot plots of vitamin C content (numerical) by cultivar (categorical) for the cabbages data set from the MASS package.
Strip chart (dot plot) for vitamin C content (VitC) by cultivar (Cult) from the cabbages data set using R Commander
Relationships Between Numerical and Categorical Variables
A more common way of visualizing the relationship between a numerical variable and a categorical variable is to create boxplots.
Cult   mean   sd      0%   25%   50%   75%   100%   N
C39    51.5   7.123   41   46    51    54    68     30
C52    64     8.455   47   58    64    70    84     30
Summary statistics of vitamin C content (VitC) by cultivar (Cult) from the cabbages data set
Boxplot of vitamin C content for different cultivars.
Relationships Between Numerical and Categorical Variables
In general, we say that two variables are related if the distribution of one of them changes as the other one varies. We can measure changes in the distribution of the numerical variable by obtaining its summary statistics for different levels of the categorical variable. It is common to use the difference of means when examining the relationship between a numerical variable and a categorical variable. In the above example, the difference of means of vitamin C content is 64.4 -51.5 = 12.9 between the two cultivars.
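The difference-of-means comparison can be sketched with per-group summaries; the values below are hypothetical numbers chosen to mimic the cabbages summary (not the real MASS data):

```python
from statistics import mean

# Hypothetical vitamin C measurements by cultivar
vitc = {
    "C39": [41, 46, 51, 54, 55, 62],
    "C52": [47, 58, 64, 70, 72, 73],
}
group_means = {g: mean(v) for g, v in vitc.items()}
diff_of_means = group_means["C52"] - group_means["C39"]
```

Computing a summary statistic of the numerical variable at each level of the categorical variable, then comparing across levels, is exactly the pattern the slide describes; the same idea extends to medians, sds, or full five-number summaries per group.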
Relationships Between Numerical and Categorical Variables
When the categorical variable has multiple levels (categories), it is easier to compare the means across different levels using the plot of means.
Plotting the means of bp for different weight groups (which are defined based on BMI).
Visualization Techniques: Contour Plots
Contour plots – Useful when a continuous attribute is measured on a spatial grid – They partition the plane into regions of similar values – The contour lines that form the boundaries of these regions connect points with equal values – The most common example is contour maps of elevation – Can also display temperature, rainfall, air pressure, etc.
An example for Sea Surface Temperature (SST) is provided on the next slide
Contour Plot Example: SST Dec, 1998
Celsius
Visualization Techniques: Parallel Coordinates
Parallel Coordinates – Used to plot the attribute values of high-dimensional data – Instead of using perpendicular axes, use a set of parallel axes – The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line – Thus, each object is represented as a line – Often, the lines representing a distinct class of objects group together, at least for some attributes – Ordering of attributes is important in seeing such groupings
Parallel Coordinates Plots for Iris Data
Other Visualization Techniques
Star Plots – Similar approach to parallel coordinates, but axes radiate from a central point – The line connecting the values of an object is a polygon
Chernoff Faces – Approach created by Herman Chernoff – This approach associates each attribute with a characteristic of a face – The values of each attribute determine the appearance of the corresponding facial characteristic – Each object becomes a separate face – Relies on humans' ability to distinguish faces
Star Plots for Iris Data
Setosa
Versicolour
Virginica
Chernoff Faces for Iris Data Setosa
Versicolour
Virginica
Side-by-side plots: box plots
Box plot of systolic B.P. by age group
It’s also possible to include a second categorical variable in a box plot.
There are many more types of plots
E.g. Heat maps to display a numerical variable in two dimensions:
A different kind of heat map.
There are many ways to convey information graphically
Example: A map of the world divided into areas containing 10 to 11 million people each: Displays human density.
DATA EXPLORATION EXAMPLE
Iris Sample Data Set
Many of the exploratory data techniques are illustrated with the Iris Plant data set. – Can be obtained from the UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R.A. Fisher – Three flower types (classes): Setosa, Virginica, Versicolour
– Four (non-class) attributes Sepal width and length Petal width and length
Visualization Techniques: Histograms
Histogram – Usually shows the distribution of values of a single variable – A histogram is similar to a bar graph after the values of the variable are grouped (binned) into a finite number of intervals (bins). – Divide the values into bins and show a bar plot of the number of objects in each bin. – The height of each bar indicates the number of objects – Shape of histogram depends on the number of bins
Example: Petal Width (10 and 20 bins, respectively)
Example of Box Plots
Box plots can be used to compare attributes
Scatter Plot Array of Iris Attributes