
Author: Joan Gilbert
Display one categorical variable
1) Frequency table / relative frequency table
2) Bar chart / relative frequency bar chart / pie chart
Display two categorical variables
1) Contingency table
2) Side-by-side bar chart
Distribution / marginal distribution / conditional distribution / independence

Motivation
• The Titanic example
The following is part of a data table showing four variables for seven people aboard the Titanic.

Survival  Age    Sex     Class
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Alive     Adult  Female  First
Dead      Adult  Male    Third

Q: Is there a relationship between “Survival” and “Class”?

Motivation

Survival  Age    Sex     Class
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Alive     Adult  Female  First
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Alive     Adult  Female  Third
Dead      Adult  Male    Crew
Alive     Adult  Female  Second
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Third
Dead      Adult  Male    Second

Q: How about this? We need new ways to show the PATTERNS, RELATIONSHIPS, TRENDS, and EXCEPTIONS from the data.

Frequency tables
• We pile together things that seem to go together, so we can see how cases are distributed across the categories.
• For categorical data, piling simply means counting the number of cases in each category.

Frequency tables
• Example: pile the “Class” variable

Data table:
Survival  Age    Sex     Class
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Third
Dead      Adult  Male    Crew
Dead      Adult  Male    Crew
Alive     Adult  Female  First
Dead      Adult  Male    Third

Counts by Class:
Class   Count
First   1
Second  0
Third   3
Crew    3

This new table is called a frequency table.

Frequency tables
• Frequency table: A frequency table lists all the categories of a categorical variable and gives the count of cases for each category. Frequency tables are a common tabular technique for displaying a single categorical variable.

Class   Count
First   1
Second  0
Third   3
Crew    3
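The piling step can be sketched in a few lines of Python. This is only an illustration: the list `classes` is typed in by hand from the seven-person sample, and the names `categories` and `frequency_table` are made up for this sketch.

```python
from collections import Counter

# Class values of the seven people in the small Titanic sample (entered by hand).
classes = ["Third", "Crew", "Third", "Crew", "Crew", "First", "Third"]

# A fixed category order, so that an empty category (Second) still gets a row.
categories = ["First", "Second", "Third", "Crew"]

# Counter tallies each value; a category that never occurs counts as 0.
counts = Counter(classes)
frequency_table = {category: counts[category] for category in categories}

for category, count in frequency_table.items():
    print(f"{category:<8}{count}")
```

Listing the categories explicitly is a design choice here: counting alone would silently drop Second, because no one in the sample belongs to it.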

Relative frequency tables
• Sometimes we want to know the fraction or proportion of the data in each category.
• Relative frequency table: A relative frequency table displays the percentages (the counts divided by the total number of cases), rather than the counts, of the values in each category.

From the frequency table to the relative frequency table:
Class   Count  Calculation           %
First   1      1/7 ∙ 100% = 14.29%   14.29
Second  0      0/7 ∙ 100% = 0%       0
Third   3      3/7 ∙ 100% = 42.86%   42.86
Crew    3      3/7 ∙ 100% = 42.86%   42.86
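As a rough sketch, the division by the total can be written in Python as follows. The dictionary `frequency_table` repeats the counts of the seven-person sample, and rounding to two decimals is an assumption made only for display.

```python
# Counts from the frequency table of Class (seven people in total).
frequency_table = {"First": 1, "Second": 0, "Third": 3, "Crew": 3}

total = sum(frequency_table.values())

# Relative frequency in percent: count / total * 100, rounded to two decimals.
relative_frequency = {
    category: round(count / total * 100, 2)
    for category, count in frequency_table.items()
}

for category, percent in relative_frequency.items():
    print(f"{category:<8}{percent:6.2f}%")
```

Because each percentage is rounded separately, the rounded values may add up to slightly more or less than 100%.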

Relative frequency tables
• Both types of tables show how cases are distributed across the categories. In this way, they describe the distribution of a categorical variable.

Class   Count  %
First   1      14.29
Second  0      0
Third   3      42.86
Crew    3      42.86

Distribution:
1. The most common category
2. The least common category
3. Comparison

Exercises from the book 3.11

Bar chart
• Bar chart: A bar chart displays the distribution of a categorical variable, showing the counts (frequencies) for each category next to each other for easy comparison.

Frequency table:
Class   Count
First   1
Second  0
Third   3
Crew    3

[Figure: bar chart of Frequency (vertical axis, 0 to 4) by Class (First, Second, Third, Crew)]
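For readers without plotting software, the idea of a bar chart can be imitated in plain text. The sketch below is only an illustration: the helper `text_bar_chart` and its `symbol` parameter are hypothetical names, and each bar is drawn as one character per case.

```python
# Counts from the frequency table of Class.
frequency_table = {"First": 1, "Second": 0, "Third": 3, "Crew": 3}


def text_bar_chart(table, symbol="#"):
    """Return one line per category, with a bar whose length equals the count."""
    width = max(len(name) for name in table)
    return [
        f"{name:<{width}} | {symbol * count} {count}"
        for name, count in table.items()
    ]


for line in text_bar_chart(frequency_table):
    print(line)
```

The bars for Third and Crew come out equally long and the bar for Second is empty, which is the same comparison a drawn bar chart makes at a glance.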

Relative frequency bar chart
• We can also replace the counts with percentages and use a relative frequency bar chart.

Relative frequency table:
Class   %
First   14.29
Second  0
Third   42.86
Crew    42.86

[Figure: relative frequency bar chart of Percentage (vertical axis, 0% to 50%) by Class (First, Second, Third, Crew)]

Comparison between two bar charts

[Figures: the relative frequency bar chart (Percentage) and the bar chart (Frequency) shown side by side]

• Similarity:
1) Both are graphical techniques for displaying a single categorical variable.
2) They show the same pattern.
3) The higher the bar, the more frequent the category.
• Difference: A bar chart uses counts, while a relative frequency bar chart uses percentages.

Pie chart
• Pie charts show the whole group of cases as a circle. They slice the circle into pieces whose sizes are proportional to the fraction of the whole in each category.
• Angle size: 360° × percentage
• In other words, the larger the piece, the more frequent the category.

Frequency table:
Class   Count
First   1
Second  0
Third   3
Crew    3

[Figure: pie chart with slices for First, Second, Third, and Crew]
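The angle rule can be checked with a short calculation. This sketch assumes the counts of the seven-person sample and treats the percentage as the fraction count / total; the name `angles` is made up for the illustration.

```python
# Counts from the frequency table of Class.
frequency_table = {"First": 1, "Second": 0, "Third": 3, "Crew": 3}
total = sum(frequency_table.values())

# Angle of each slice in degrees: 360 * (count / total).
angles = {category: 360 * count / total for category, count in frequency_table.items()}

for category, angle in angles.items():
    print(f"{category:<8}{angle:7.2f} degrees")
```

A category with no cases, such as Second, gets an angle of 0 and therefore no visible slice, while the angles of all slices add up to the full circle.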

Comparison between bar chart and pie chart

• Advantage of a pie chart: Pie charts give a quick impression of how a whole group is partitioned into smaller groups. The categories must not overlap.
• Advantage of a bar chart: It is easy to make comparisons between categories.

Exercises from the book 3.5, 3.7, 3.13

Display two categorical variables

Data table:
Survival  Class
Dead      First
Dead      Crew
Alive     Second
Alive     First
Dead      Third
Alive     Crew
Alive     Crew
Dead      Second
Alive     First
Dead      Crew
Alive     Third
Alive     Third

Frequency table of “Survival”:
Survival  Count
Alive     7
Dead      5

Frequency table of “Class”:
Class   Count
First   3
Second  2
Third   3
Crew    4

Is there a relationship between “Survival” and “Class”?
We need new techniques to combine the two frequency tables.

Contingency table

Data table:
Survival  Class
Dead      First
Dead      Crew
Alive     Second
Alive     First
Dead      Third
Alive     Crew
Alive     Crew
Dead      Second
Alive     First
Dead      Crew
Alive     Third
Alive     Third

Contingency table (rows: variable Survival; columns: variable Class):
Survival  First  Second  Third  Crew  Total
Alive     2      1       2      2     7
Dead      1      1       1      2     5
Total     3      2       3      4     12

Contingency table

Because the table shows how cases are distributed along each variable, contingent on the value of the other variable, such a table is called a two-way contingency table. With the help of a two-way contingency table, we can analyze the relationship between the two variables.

Rows: Survival. Columns: Class.
Survival  First  Second  Third  Crew  Total
Alive     2      1       2      2     7
Dead      1      1       1      2     5
Total     3      2       3      4     12
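One way to build such a table from the raw data is to count (Survival, Class) pairs. The following is a minimal sketch: the list `people` is typed in by hand from the twelve-person data table, and all variable names are made up for the illustration.

```python
from collections import Counter

# (Survival, Class) pairs for the twelve people in the data table (entered by hand).
people = [
    ("Dead", "First"), ("Dead", "Crew"), ("Alive", "Second"), ("Alive", "First"),
    ("Dead", "Third"), ("Alive", "Crew"), ("Alive", "Crew"), ("Dead", "Second"),
    ("Alive", "First"), ("Dead", "Crew"), ("Alive", "Third"), ("Alive", "Third"),
]

rows = ["Alive", "Dead"]
columns = ["First", "Second", "Third", "Crew"]

# Each cell counts the cases with one particular combination of the two values.
cell_counts = Counter(people)
table = {r: {c: cell_counts[(r, c)] for c in columns} for r in rows}

# The margins are the row totals, the column totals, and the overall total.
row_totals = {r: sum(table[r].values()) for r in rows}
column_totals = {c: sum(table[r][c] for r in rows) for c in columns}
grand_total = sum(row_totals.values())

print("Survival " + " ".join(f"{c:>6}" for c in columns) + "  Total")
for r in rows:
    print(f"{r:<9}" + " ".join(f"{table[r][c]:>6}" for c in columns) + f"{row_totals[r]:>7}")
print(f"{'Total':<9}" + " ".join(f"{column_totals[c]:>6}" for c in columns) + f"{grand_total:>7}")
```

The row totals reproduce the frequency table of Survival and the column totals reproduce the frequency table of Class.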

Contingency table
• Structure of the contingency table

Cells: Each cell of the table gives the count for a combination of values of the two variables.
Marginal counts: The margins give the row and column totals of the cell counts.

Rows: Survival. Columns: Class.
Survival  First  Second  Third  Crew  Total
Alive     2      1       2      2     7
Dead      1      1       1      2     5
Total     3      2       3      4     12

Frequency table of “Survival”:
Survival  Count
Alive     7
Dead      5

Frequency table of “Class”:
Class   Count
First   3
Second  2
Third   3
Crew    4

Contingency table
• Marginal distribution
The margins of the contingency table store the same distributions as the two frequency tables. When presented like this, in the margins of a contingency table, the frequency distribution of one of the variables is called its marginal distribution.

Rows: Survival. Columns: Class.
Survival  First  Second  Third  Crew  Total
Alive     2      1       2      2     7
Dead      1      1       1      2     5
Total     3      2       3      4     12

Contingency table
• The following is a contingency table of Class and Survival for everyone aboard the Titanic.

Rows: Survival. Columns: Class.
Survival  First  Second  Third  Crew  Total
Alive     203    118     178    212   711
Dead      122    167     528    673   1490
Total     325    285     706    885   2201

Q1: How many people were aboard the Titanic in total?
Q2: How many second-class passengers were there in total?
Q3: How many people survived in total?
Q4: Among the survivors, how many were crew?
Q5: Among the third-class passengers, how many died?
Q6: Were first-class passengers more likely to survive than third-class passengers?

Contingency table
• Total / column / row percentage
The first cell shows that 203 first-class passengers survived. Now we want to express this cell as a percentage.

Percentage #1, total (table) percent: percent of the total number of passengers (203/2201 × 100% = 9.22%)
Percentage #2, column percent: percent of first-class passengers (203/325 × 100% = 62.46%)
Percentage #3, row percent: percent of survivors (203/711 × 100% = 28.55%)

Rows: Survival. Columns: Class.
Survival  First  Second  Third  Crew  Total
Alive     203    118     178    212   711
Dead      122    167     528    673   1490
Total     325    285     706    885   2201

Contingency table – total (table) percent
• Percentage #1: total percent
Divide the count in each cell and margin by the overall total to get the total percent.

Survival  First     Second    Third     Crew      Total
Alive     203/2201  118/2201  178/2201  212/2201  711/2201
Dead      122/2201  167/2201  528/2201  673/2201  1490/2201
Total     325/2201  285/2201  706/2201  885/2201  2201/2201

The contingency table with total percent:
Survival  First   Second  Third   Crew    Total
Alive     9.22%   5.36%   8.09%   9.63%   32.30%
Dead      5.54%   7.59%   23.99%  30.58%  67.70%
Total     14.77%  12.95%  32.08%  40.21%  100.00%

Contingency table – total percent
• Percentage #1: total percent
The total percentages tell us what percent of all passengers belong to each combination of column and row category.

Rows: Survival. Columns: Class.
Survival  First   Second  Third   Crew    Total
Alive     9.22%   5.36%   8.09%   9.63%   32.30%
Dead      5.54%   7.59%   23.99%  30.58%  67.70%
Total     14.77%  12.95%  32.08%  40.21%  100.00%
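The total percent can be reproduced with a short script. This is a sketch under simple assumptions: the nested dictionary `counts` holds the cell counts of the Titanic table, and the rounding to two decimals and all variable names are choices made for this illustration.

```python
# Cell counts of the Titanic contingency table (rows: Survival, columns: Class).
counts = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

# The overall total is the sum of all cell counts.
grand_total = sum(sum(row.values()) for row in counts.values())

# Total percent: every cell divided by the overall total.
total_percent = {
    survival: {cls: round(n / grand_total * 100, 2) for cls, n in row.items()}
    for survival, row in counts.items()
}

# The margins are divided by the same overall total.
row_margin_percent = {
    survival: round(sum(row.values()) / grand_total * 100, 2)
    for survival, row in counts.items()
}

for survival, row in total_percent.items():
    cells = "  ".join(f"{cls}: {p:.2f}%" for cls, p in row.items())
    print(f"{survival:<6}{cells}  Total: {row_margin_percent[survival]:.2f}%")
```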

Contingency table – column percent
• Percentage #2: column percent
Divide the count in each cell and margin by the corresponding column total to get the column percent.

Survival  First    Second   Third    Crew     Total
Alive     203/325  118/285  178/706  212/885  711/2201
Dead      122/325  167/285  528/706  673/885  1490/2201
Total     325/325  285/285  706/706  885/885  2201/2201

The contingency table with column percent:
Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%

Contingency table
• Conditional distribution

The conditional distribution of Survival given Class (column percent):
Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%

Q5: What percent of third-class passengers are alive?
• By focusing on each column separately, we see the distribution of Survival under the condition of a given class.
• For example, by focusing on the second-class column, we see that 41.40% of second-class passengers were alive and 58.60% were dead.
• Such distributions are called conditional distributions, because they show the distribution of one variable (Survival) for just those cases that satisfy a condition on another variable (Class).
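A conditional distribution of Survival given Class can be computed column by column. The sketch below assumes the Titanic cell counts; the name `survival_given_class` and the rounding to two decimals are choices made for this illustration.

```python
# Cell counts of the Titanic contingency table (rows: Survival, columns: Class).
counts = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}
classes = ["First", "Second", "Third", "Crew"]

# Column totals: the number of people in each class.
column_totals = {c: sum(counts[s][c] for s in counts) for c in classes}

# Column percent: within each class, the share of each Survival value.
survival_given_class = {
    c: {s: round(counts[s][c] / column_totals[c] * 100, 2) for s in counts}
    for c in classes
}

for c in classes:
    alive = survival_given_class[c]["Alive"]
    dead = survival_given_class[c]["Dead"]
    print(f"{c:<7} Alive {alive:6.2f}%   Dead {dead:6.2f}%")
```

Each class is treated as its own group of 100%, which is why the two percentages within a class add up to 100.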

Contingency table – row percent
• Percentage #3: row percent
Divide the count in each cell and margin by the corresponding row total to get the row percent.

Survival  First     Second    Third     Crew      Total
Alive     203/711   118/711   178/711   212/711   711/711
Dead      122/1490  167/1490  528/1490  673/1490  1490/1490
Total     325/2201  285/2201  706/2201  885/2201  2201/2201

The contingency table with row percent:
Survival  First   Second  Third   Crew    Total
Alive     28.55%  16.60%  25.04%  29.82%  100.00%
Dead      8.19%   11.21%  35.44%  45.17%  100.00%
Total     14.77%  12.95%  32.08%  40.21%  100.00%

This table gives the conditional distribution of Class given Survival.
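The row percent follows the same pattern, with the row total as the divisor. This is a sketch that assumes the Titanic cell counts; the name `class_given_survival` is made up for the illustration.

```python
# Cell counts of the Titanic contingency table (rows: Survival, columns: Class).
counts = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

# Row totals: the number of survivors and the number of deaths.
row_totals = {survival: sum(row.values()) for survival, row in counts.items()}

# Row percent: within each Survival value, the share of each class.
class_given_survival = {
    survival: {cls: round(n / row_totals[survival] * 100, 2) for cls, n in row.items()}
    for survival, row in counts.items()
}

for survival, row in class_given_survival.items():
    cells = "  ".join(f"{cls}: {p:.2f}%" for cls, p in row.items())
    print(f"{survival:<6}{cells}")
```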

Contingency table

The contingency table with total percent:
Survival  First   Second  Third   Crew    Total
Alive     9.22%   5.36%   8.09%   9.63%   32.30%
Dead      5.54%   7.59%   23.99%  30.58%  67.70%
Total     14.77%  12.95%  32.08%  40.21%  100.00%
This table gives the overall distribution of the combinations of Survival and Class.

The contingency table with column percent:
Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%
This table gives the conditional distribution of Survival conditioned on Class.

The contingency table with row percent:
Survival  First   Second  Third   Crew    Total
Alive     28.55%  16.60%  25.04%  29.82%  100.00%
Dead      8.19%   11.21%  35.44%  45.17%  100.00%
Total     14.77%  12.95%  32.08%  40.21%  100.00%
This table gives the conditional distribution of Class conditioned on Survival.

Contingency table
• Example: The following table classifies movies released in 2005 by genre and MPAA rating:

Rows: Genre. Columns: Rating.
Genre             G   PG   PG-13   R    Total
Action/Adventure  4   5    17      9    35
Comedy            2   12   20      4    38
Drama             0   3    8       17   28
Thriller/Horror   0   0    11      8    19
Total             6   20   56      38   120

Q1: What percentage of these movies were comedies?
Q2: What percentage of the PG-rated movies were comedies?
Q3: What percentage of the dramas were R-rated?
Q4: What percentage of 2005 movies were PG-rated?

Exercises from the book 3.25, 3.27

Contingency table – independence

• Are Class and Survival associated, or are they independent?
• In a contingency table, when the distribution of one variable is the same for all categories of another variable, we say that the variables are independent, or that there is no association between them.

Column percent (the conditional distribution of Survival given Class):
Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%

The distribution of Survival differs across the categories of Class. We can conclude that Class and Survival are NOT independent. In other words, there is an association between Survival and Class.
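The comparison of conditional distributions can be made explicit with a small calculation. This is only an illustrative sketch: the names `alive_rate`, `overall_alive_rate`, and `spread` are made up, and the slides do not say how large a difference has to be before the variables are called associated.

```python
# Cell counts of the Titanic contingency table (rows: Survival, columns: Class).
counts = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}
classes = ["First", "Second", "Third", "Crew"]

# Share of survivors within each class (the "Alive" row of the column percent table).
alive_rate = {
    c: counts["Alive"][c] / (counts["Alive"][c] + counts["Dead"][c]) for c in classes
}

# Share of survivors among everyone aboard (the marginal distribution of Survival).
grand_total = sum(sum(row.values()) for row in counts.values())
overall_alive_rate = sum(counts["Alive"].values()) / grand_total

# If the variables were exactly independent, every class would share the overall rate
# and the spread between the highest and lowest rate would be 0.
spread = max(alive_rate.values()) - min(alive_rate.values())

for c in classes:
    print(f"{c:<7}{alive_rate[c] * 100:6.2f}% alive")
print(f"Overall{overall_alive_rate * 100:6.2f}% alive")
print(f"Spread between classes: {spread * 100:.2f} percentage points")
```

Here the survival rates range from roughly 24% to roughly 62%, far from a common value, which matches the conclusion that Class and Survival are associated.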

Contingency table – independence
• Are genre and MPAA rating independent?

Counts (rows: Genre; columns: Rating):
Genre             G   PG   PG-13   R    Total
Action/Adventure  4   5    17      9    35
Comedy            2   12   20      4    38
Drama             0   3    8       17   28
Thriller/Horror   0   0    11      8    19
Total             6   20   56      38   120

Row percent (the conditional distribution of Rating given Genre):
Genre             G       PG      PG-13   R       Total
Action/Adventure  11.43%  14.29%  48.57%  25.71%  100.00%
Comedy            5.26%   31.58%  52.63%  10.53%  100.00%
Drama             0.00%   10.71%  28.57%  60.71%  100.00%
Thriller/Horror   0.00%   0.00%   57.89%  42.11%  100.00%
Total             5.00%   16.67%  46.67%  31.67%  100.00%

Side-by-side bar chart

• Graphical technique for two categorical variables: the side-by-side bar chart.

Column percent (rows: Survival; columns: Class):
Survival  First    Second   Third    Crew     Total
Alive     62.46%   41.40%   25.21%   23.95%   32.30%
Dead      37.54%   58.60%   74.79%   76.05%   67.70%
Total     100.00%  100.00%  100.00%  100.00%  100.00%

[Figure: side-by-side bar chart of the Alive and Dead percentages (vertical axis, 0% to 80%) for each Class (First, Second, Third, Crew)]

Exercises from the book

3.33