Display one categorical variable
1) Frequency Table / Relative Frequency Table
2) Bar Chart / Relative Frequency Bar Chart / Pie Chart
Display two categorical variables
1) Contingency Table
2) Side-by-side bar chart
Distribution / Marginal distribution / Conditional distribution / Independence
Motivation
• The Titanic example
The following is part of a data table showing four variables for seven people aboard the Titanic.

Survival  Age    Sex     Class
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Alive     Adult  Female  First
Dead      Adult  Male    Third

Q: Is there a relationship between “Survival” and “Class”?
Motivation

Survival  Age    Sex     Class
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Alive     Adult  Female  First
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Alive     Adult  Female  Third
Dead      Adult  Male    Crew
Alive     Adult  Female  Second
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Third
Dead      Adult  Male    Second

Q: How about this?
We need new ways to show the PATTERNS, RELATIONSHIPS, TRENDS, and EXCEPTIONS in the data.
Frequency tables
• We pile together things that seem to go together, so we can see how cases distribute across different categories.
• For categorical data, piling simply means counting the number of cases corresponding to each category.
Frequency tables
• Example: pile the “Class” variable

Data Table
Survival  Age    Sex     Class
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Alive     Adult  Female  First
Dead      Adult  Male    Third

Class   Count
First   1
Second  0
Third   3
Crew    3

This new table is called a frequency table.
Frequency tables
• Frequency table
A frequency table lists all the categories of a categorical variable and gives the count of cases in each category. Frequency tables are a common tabular technique for displaying a single categorical variable.

Class   Count
First   1
Second  0
Third   3
Crew    3
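The counting step can be sketched in a few lines of Python. This is only an illustration: the list `classes` is typed in by hand from the seven-person data table, and the fixed `categories` order is an assumption made so that a category with no cases (Second) still appears with a count of 0.

```python
from collections import Counter

# Class values of the seven people in the sample data table (entered by hand).
classes = ["Third", "Crew", "Third", "Crew", "Crew", "First", "Third"]

# Fixed category order, so that categories with zero cases are still listed.
categories = ["First", "Second", "Third", "Crew"]

# Counter returns 0 for categories that never occur.
counts = Counter(classes)
frequency_table = {category: counts[category] for category in categories}

print(f"{'Class':<8}{'Count':>5}")
for category, count in frequency_table.items():
    print(f"{category:<8}{count:>5}")
```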
Relative frequency tables
• Sometimes we want to know the fraction or proportion of the data in each category.
• Relative frequency table
A relative frequency table displays the percentages (the counts divided by the total number of cases), rather than the counts, of the values in each category.

Frequency Table and Relative Frequency Table
Class   Count   Calculation           %
First   1       1/7 ∙ 100% = 14.29%   14.29
Second  0       0/7 ∙ 100% = 0%       0
Third   3       3/7 ∙ 100% = 42.86%   42.86
Crew    3       3/7 ∙ 100% = 42.86%   42.86
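The division by the total can be written out as a short Python sketch. The dictionary `counts` simply restates the frequency table of Class; rounding to two decimals is a presentation choice, not part of the definition.

```python
# Counts from the frequency table of Class (seven people in total).
counts = {"First": 1, "Second": 0, "Third": 3, "Crew": 3}

total = sum(counts.values())

# Relative frequency in percent: count / total * 100, rounded to two decimals.
relative = {category: round(100 * n / total, 2) for category, n in counts.items()}

for category, pct in relative.items():
    print(f"{category:<8}{pct:>6.2f}%")
```

Because each percentage is rounded separately, the rounded values may not add up to exactly 100%.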
Relative frequency tables
• Both types of tables show how cases are distributed across the categories. In this way, they describe the distribution of a categorical variable.

Frequency Table and Relative Frequency Table
Class   Count   %
First   1       14.29
Second  0       0
Third   3       42.86
Crew    3       42.86

Distribution:
1. The most common category
2. The least common category
3. Comparison
Exercises from the book 3.11
Bar chart
• Bar chart
A bar chart displays the distribution of a categorical variable, showing the counts (frequencies) for each category next to each other for easy comparison.

Frequency Table
Class   Count
First   1
Second  0
Third   3
Crew    3

[Figure: Bar Chart. x-axis: Class (First, Second, Third, Crew); y-axis: Frequency (0 to 4)]
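As a rough illustration of the idea, the sketch below draws a text-only bar chart in Python, using one `#` per case. The helper name `text_bar_chart` is made up for this example; a plotting library would normally be used for a real chart.

```python
# Counts from the frequency table of Class.
counts = {"First": 1, "Second": 0, "Third": 3, "Crew": 3}

def text_bar_chart(counts):
    """Return one line per category; the bar length equals the count."""
    return [f"{category:<8}{'#' * n} ({n})" for category, n in counts.items()]

chart = text_bar_chart(counts)
print("\n".join(chart))
```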
Relative frequency bar chart
• We can also replace the counts with percentages and use a relative frequency bar chart.

Relative Frequency Table
Class   %
First   14.29
Second  0
Third   42.86
Crew    42.86

[Figure: Relative Freq. Bar Chart. x-axis: Class (First, Second, Third, Crew); y-axis: Percentage (0% to 50%)]
Comparison between two bar charts

[Figures: the relative frequency bar chart (y-axis: Percentage) and the bar chart (y-axis: Frequency)]

• Similarity:
1) Both are graphical techniques for displaying a single categorical variable.
2) They show the same pattern.
3) The higher the bar, the more frequent the category.
• Difference: A bar chart uses counts, while a relative frequency bar chart uses percentages.
Pie chart
• Pie charts show the whole group of cases as a circle. They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.
• Angle size = 360° × percentage
• In other words, the larger the piece, the more frequent the category.

Frequency Table
Class   Count
First   1
Second  0
Third   3
Crew    3

[Figure: Pie Chart with slices First, Second, Third, Crew]
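The angle rule can be checked with a small Python sketch. It assumes the same counts as the frequency table of Class and uses fractions of the whole (count divided by total) as the percentage.

```python
# Counts from the frequency table of Class.
counts = {"First": 1, "Second": 0, "Third": 3, "Crew": 3}
total = sum(counts.values())

# Angle of each slice in degrees: 360 * (fraction of the whole).
angles = {category: 360 * n / total for category, n in counts.items()}

for category, angle in angles.items():
    print(f"{category:<8}{angle:7.2f} degrees")
```

A category with no cases (Second) gets an angle of 0 and so has no visible slice, and the angles add up to the full 360°.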
Comparison between bar chart and pie chart
• Advantage of a pie chart
Pie charts give a quick impression of how a whole group is partitioned into smaller groups. The categories must not overlap.
• Advantage of a bar chart
It is easy to compare categories.
Exercises from the book 3.5, 3.7, 3.13
Display two categorical variables

Data Table
Survival  Class
Dead      First
Dead      Crew
Alive     Second
Alive     First
Dead      Third
Alive     Crew
Alive     Crew
Dead      Second
Alive     First
Dead      Crew
Alive     Third
Alive     Third

Frequency Table of “Survival”
Survival  Count
Alive     7
Dead      5

Frequency Table of “Class”
Class   Count
First   3
Second  2
Third   3
Crew    4

Is there a relationship between “Survival” and “Class”?
We need new techniques to combine the two frequency tables.
Contingency table

Data Table
Survival  Class
Dead      First
Dead      Crew
Alive     Second
Alive     First
Dead      Third
Alive     Crew
Alive     Crew
Dead      Second
Alive     First
Dead      Crew
Alive     Third
Alive     Third

Contingency Table (rows: variable Survival; columns: variable Class)
Survival  First  Second  Third  Crew  Total
Alive     2      1       2      2     7
Dead      1      1       1      2     5
Total     3      2       3      4     12
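One way to build such a table from the raw pairs is sketched below in Python. The `records` list is copied by hand from the twelve-person data table, and the variable names are illustrative only.

```python
from collections import Counter

# The twelve (Survival, Class) pairs from the data table (entered by hand).
records = [
    ("Dead", "First"), ("Dead", "Crew"), ("Alive", "Second"), ("Alive", "First"),
    ("Dead", "Third"), ("Alive", "Crew"), ("Alive", "Crew"), ("Dead", "Second"),
    ("Alive", "First"), ("Dead", "Crew"), ("Alive", "Third"), ("Alive", "Third"),
]
rows = ["Alive", "Dead"]
cols = ["First", "Second", "Third", "Crew"]

# Each cell counts the cases with one combination of the two variables.
cell_counts = Counter(records)
table = {r: [cell_counts[(r, c)] for c in cols] for r in rows}
col_totals = [sum(table[r][j] for r in rows) for j in range(len(cols))]

print(f"{'Survival':<10}" + "".join(f"{c:>8}" for c in cols) + f"{'Total':>8}")
for r in rows:
    print(f"{r:<10}" + "".join(f"{n:>8}" for n in table[r]) + f"{sum(table[r]):>8}")
print(f"{'Total':<10}" + "".join(f"{n:>8}" for n in col_totals) + f"{sum(col_totals):>8}")
```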
Contingency table
• Because the table shows how cases are distributed along each variable, contingent on the value of the other variable, such a table is called a two-way contingency table.
• With the help of a two-way contingency table, we can analyze the relationship between the two variables.

Survival  First  Second  Third  Crew  Total
Alive     2      1       2      2     7
Dead      1      1       1      2     5
Total     3      2       3      4     12
(rows: Survival; columns: Class)
Contingency table
• Structure of the contingency table
Cells: Each cell of the table gives the count for a combination of values of the two variables.
Marginal counts: The margins give the row and column totals of the cell counts.

Survival  First  Second  Third  Crew  Total
Alive     2      1       2      2     7
Dead      1      1       1      2     5
Total     3      2       3      4     12
(rows: Survival; columns: Class)

Frequency Table of “Survival”
Survival  Count
Alive     7
Dead      5

Frequency Table of “Class”
Class   Count
First   3
Second  2
Third   3
Crew    4
Contingency table
• Marginal distribution
The margins of the contingency table store the same distributions as the two frequency tables. When presented like this, in the margins of a contingency table, the frequency distribution of one of the variables is called its marginal distribution.

Survival  First  Second  Third  Crew  Total
Alive     2      1       2      2     7
Dead      1      1       1      2     5
Total     3      2       3      4     12
(rows: Survival; columns: Class)
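The link between the cells and the margins can be sketched in Python as follows. The cell counts are typed in from the small twelve-person table; the names `survival_marginal` and `class_marginal` are illustrative.

```python
cols = ["First", "Second", "Third", "Crew"]

# Cell counts of the twelve-person contingency table (margins not included).
cells = {"Alive": [2, 1, 2, 2], "Dead": [1, 1, 1, 2]}

# Row totals give the marginal distribution of Survival.
survival_marginal = {row: sum(values) for row, values in cells.items()}

# Column totals give the marginal distribution of Class.
class_marginal = {
    c: sum(values[j] for values in cells.values()) for j, c in enumerate(cols)
}

print(survival_marginal)
print(class_marginal)
```

The two results match the frequency tables of “Survival” and “Class” computed from the data table directly.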
Contingency table
• The following is a contingency table of Class and Survival for all passengers on the Titanic.

Survival  First  Second  Third  Crew  Total
Alive     203    118     178    212   711
Dead      122    167     528    673   1490
Total     325    285     706    885   2201
(rows: Survival; columns: Class)

Q1: How many passengers are there in total on the Titanic?
Q2: How many Second class passengers are there in total?
Q3: How many Alive passengers are there in total?
Q4: Among the Alive passengers, how many of them are Crew?
Q5: Among the Third class passengers, how many of them are dead?
Q6: Are first-class passengers more likely to survive than third-class passengers?
Contingency table
• Total / Column / Row percentage
From the first cell, we know that 203 first-class passengers survived. Now we want to express this cell as a percentage.

Percentage #1: percent of the total number of passengers (203/2201 ∙ 100% = 9.22%): Total (Table) Percent
Percentage #2: percent of first-class passengers (203/325 ∙ 100% = 62.46%): Column Percent
Percentage #3: percent of survivors (203/711 ∙ 100% = 28.55%): Row Percent

Survival  First  Second  Third  Crew  Total
Alive     203    118     178    212   711
Dead      122    167     528    673   1490
Total     325    285     706    885   2201
(rows: Survival; columns: Class)
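The three percentages differ only in the denominator. A minimal Python sketch for the first cell, with the four numbers read off the table above:

```python
cell = 203          # first-class passengers who survived
column_total = 325  # all first-class passengers
row_total = 711     # all survivors
grand_total = 2201  # everyone in the table

total_percent = round(100 * cell / grand_total, 2)    # share of everyone
column_percent = round(100 * cell / column_total, 2)  # share of first class
row_percent = round(100 * cell / row_total, 2)        # share of survivors

print(total_percent, column_percent, row_percent)
```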
Contingency table – total (table) percent
• Percentage #1: total percent
Divide the count in each cell and margin by the overall total to get the total percent.

Survival  First     Second    Third     Crew      Total
Alive     203/2201  118/2201  178/2201  212/2201  711/2201
Dead      122/2201  167/2201  528/2201  673/2201  1490/2201
Total     325/2201  285/2201  706/2201  885/2201  2201/2201

Survival  First   Second  Third   Crew    Total
Alive     9.22%   5.36%   8.09%   9.63%   32.30%
Dead      5.54%   7.59%   23.99%  30.58%  67.70%
Total     14.77%  12.95%  32.08%  40.21%  100.00%
The contingency table with total percent
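The whole table of total percents can be produced in one pass. The Python sketch below assumes the counts of the full Titanic table and rounds to two decimals.

```python
cols = ["First", "Second", "Third", "Crew"]
counts = {"Alive": [203, 118, 178, 212], "Dead": [122, 167, 528, 673]}

# Overall total: the sum of all cell counts.
grand_total = sum(sum(values) for values in counts.values())

# Total percent: every cell divided by the overall total.
total_percent = {
    row: [round(100 * n / grand_total, 2) for n in values]
    for row, values in counts.items()
}

for row, percents in total_percent.items():
    print(f"{row:<6}" + "".join(f"{p:>8.2f}%" for p in percents))
```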
Contingency table – total percent
• Percentage #1: total percent
The total percentages tell us what percent of all passengers belong to each combination of column and row category.

Survival  First   Second  Third   Crew    Total
Alive     9.22%   5.36%   8.09%   9.63%   32.30%
Dead      5.54%   7.59%   23.99%  30.58%  67.70%
Total     14.77%  12.95%  32.08%  40.21%  100.00%
(rows: Survival; columns: Class)
Contingency table – column percent
• Percentage #2: Column Percent
Divide the count in each cell and margin by the corresponding column total to get the column percent.

Survival  First    Second   Third    Crew     Total
Alive     203/325  118/285  178/706  212/885  711/2201
Dead      122/325  167/285  528/706  673/885  1490/2201
Total     325/325  285/285  706/706  885/885  2201/2201

Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%
The contingency table with column percent
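A possible Python sketch of the column percent calculation, again assuming the counts of the full Titanic table:

```python
cols = ["First", "Second", "Third", "Crew"]
counts = {"Alive": [203, 118, 178, 212], "Dead": [122, 167, 528, 673]}

# Column totals: 325, 285, 706, 885.
column_totals = [sum(values[j] for values in counts.values()) for j in range(len(cols))]

# Column percent: every cell divided by the total of its own column.
column_percent = {
    row: [round(100 * n / t, 2) for n, t in zip(values, column_totals)]
    for row, values in counts.items()
}

for row, percents in column_percent.items():
    print(f"{row:<6}" + "".join(f"{p:>8.2f}%" for p in percents))
```

Within each column the Alive and Dead percentages add up to 100%.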
Contingency table
• Conditional Distribution

Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%
The conditional distribution of Survival given Class

Q5: What percent of third-class passengers are alive?
• By focusing on each column separately, we see the distribution of Survival conditional on the level of Class.
• For example, by focusing on the second-class column, we see that 41.40% of second-class passengers were alive and 58.60% were dead.
• Such distributions are called conditional distributions, because they show the distribution of one variable (Survival) for just those cases that satisfy a condition on another variable (Class).
Contingency table – row percent
• Percentage #3: Row Percent
Divide the count in each cell and margin by the corresponding row total to get the row percent.

Survival  First     Second    Third     Crew      Total
Alive     203/711   118/711   178/711   212/711   711/711
Dead      122/1490  167/1490  528/1490  673/1490  1490/1490
Total     325/2201  285/2201  706/2201  885/2201  2201/2201

Survival  First   Second  Third   Crew    Total
Alive     28.55%  16.60%  25.04%  29.82%  100.00%
Dead      8.19%   11.21%  35.44%  45.17%  100.00%
Total     14.77%  12.95%  32.08%  40.21%  100.00%
The contingency table with row percent
This table shows the conditional distribution of Class given Survival.
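The row percent calculation follows the same pattern, with the row total as the denominator. A short Python sketch, assuming the counts of the full Titanic table:

```python
cols = ["First", "Second", "Third", "Crew"]
counts = {"Alive": [203, 118, 178, 212], "Dead": [122, 167, 528, 673]}

# Row percent: every cell divided by the total of its own row (711 or 1490).
row_percent = {
    row: [round(100 * n / sum(values), 2) for n in values]
    for row, values in counts.items()
}

for row, percents in row_percent.items():
    print(f"{row:<6}" + "".join(f"{p:>8.2f}%" for p in percents))
```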
Contingency table

The contingency table with total percent
Survival  First   Second  Third   Crew    Total
Alive     9.22%   5.36%   8.09%   9.63%   32.30%
Dead      5.54%   7.59%   23.99%  30.58%  67.70%
Total     14.77%  12.95%  32.08%  40.21%  100.00%
This table shows the overall distribution of the combinations of Survival and Class.

The contingency table with column percent
Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%
This table shows the conditional distribution of Survival conditioned on Class.

The contingency table with row percent
Survival  First   Second  Third   Crew    Total
Alive     28.55%  16.60%  25.04%  29.82%  100.00%
Dead      8.19%   11.21%  35.44%  45.17%  100.00%
Total     14.77%  12.95%  32.08%  40.21%  100.00%
This table shows the conditional distribution of Class conditioned on Survival.
Contingency table
• Example: The following table classifies movies released in 2005 by genre and MPAA rating:

Genre             G   PG  PG-13  R   Total
Action/Adventure  4   5   17     9   35
Comedy            2   12  20     4   38
Drama             0   3   8      17  28
Thriller/Horror   0   0   11     8   19
Total             6   20  56     38  120
(rows: Genre; columns: Rating)

Q1: What percentage of these movies were comedies?
Q2: What percentage of the PG-rated movies were comedies?
Q3: What percentage of the dramas were R-rated?
Q4: What percentage of 2005 movies were PG-rated?
Exercises from the book 3.25, 3.27
Contingency table – independence
• Are Class and Survival associated, or are they independent?
• In a contingency table, when the distribution of one variable is the same for all categories of another variable, we say that the variables are independent, or that there is no association between these variables.

Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%
(rows: Survival; columns: Class)

The distribution of Survival is different for different categories of Class. We can conclude that Class and Survival are NOT independent. In other words, there is an association between Survival and Class.
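The comparison of conditional distributions can be sketched in Python as below. This is a descriptive check only, not a formal statistical test: the function names, the `tolerance` of one percentage point, and the `made_up` table are all assumptions introduced for illustration.

```python
def conditional_by_column(counts):
    """Column percentages: distribution of the row variable within each column."""
    n_cols = len(next(iter(counts.values())))
    totals = [sum(values[j] for values in counts.values()) for j in range(n_cols)]
    return {
        row: [100 * values[j] / totals[j] for j in range(n_cols)]
        for row, values in counts.items()
    }

def looks_independent(counts, tolerance=1.0):
    """True if each row's percentage varies by at most `tolerance` points across columns."""
    dist = conditional_by_column(counts)
    return all(max(p) - min(p) <= tolerance for p in dist.values())

# Counts from the full Titanic table (columns: First, Second, Third, Crew).
titanic = {"Alive": [203, 118, 178, 212], "Dead": [122, 167, 528, 673]}

# Made-up table in which every column has the same 25% / 75% split.
made_up = {"Alive": [10, 20, 30, 40], "Dead": [30, 60, 90, 120]}

print(looks_independent(titanic))
print(looks_independent(made_up))
```

For the Titanic counts the Alive percentage ranges from about 24% to about 62% across classes, so the check reports an association. The made-up table has identical conditional distributions in every column, so the check reports independence.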
Contingency table – independence
• Are genre and MPAA rating independent?

Genre             G   PG  PG-13  R   Total
Action/Adventure  4   5   17     9   35
Comedy            2   12  20     4   38
Drama             0   3   8      17  28
Thriller/Horror   0   0   11     8   19
Total             6   20  56     38  120
(rows: Genre; columns: Rating)

Genre             G       PG      PG-13   R       Total
Action/Adventure  11.43%  14.29%  48.57%  25.71%  100.00%
Comedy            5.26%   31.58%  52.63%  10.53%  100.00%
Drama             0.00%   10.71%  28.57%  60.71%  100.00%
Thriller/Horror   0.00%   0.00%   57.89%  42.11%  100.00%
Total             5.00%   16.67%  46.67%  31.67%  100.00%
(row percent)
Side-by-side bar chart
• Graphic technique for two categorical variables: side-by-side bar chart.

Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%
(rows: Survival; columns: Class)

[Figure: Side-by-side bar chart. x-axis: Class (First, Second, Third, Crew); y-axis: percentage (0% to 80%); bars: Alive, Dead]
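A rough text-only version of the side-by-side layout is sketched below in Python. The percentages are copied from the column percent table; the `bar` helper and its scale of one `#` per two percentage points are arbitrary choices for this illustration.

```python
cols = ["First", "Second", "Third", "Crew"]

# Column percentages: conditional distribution of Survival given Class.
column_percent = {
    "Alive": [62.46, 41.40, 25.21, 23.95],
    "Dead": [37.54, 58.60, 74.79, 76.05],
}

def bar(percent, scale=2):
    """One '#' per `scale` percentage points, rounded to a whole number of marks."""
    return "#" * round(percent / scale)

# For each class, place the Alive bar and the Dead bar next to each other.
lines = []
for j, c in enumerate(cols):
    for outcome, percents in column_percent.items():
        lines.append(f"{c:<7}{outcome:<6}{bar(percents[j])} {percents[j]:.2f}%")

print("\n".join(lines))
```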
Exercises from the book
3.33