REGRESSION ANALYSIS PROJECT
VA R I A B L E S P R E D I C T I N G L I T E R A C Y R AT E AND W O R L D ' S L I T E R A C Y R AT E DISTRIBUTION
CHIA-CHIEN CHOU
1
Introduction 1.1
Background for the study
Literacy is typically described as the ability to read and write. The United Nations Educational, Scientific and Cultural Organization (UNESCO) has drafted a definition of literacy as the "ability to identify, understand, interpret, create, communicate, compute and use printed and written materials associated with varying contexts. Literacy involves a continuum of learning in enabling individuals to achieve their goals, to develop their knowledge and potential, and to participate fully in their community and wider society."[2] "Since the 1990s, when the Internet came into wide use in the United States, some have asserted that the definition of literacy should include the ability to use tools such as web browsers, word processing programs, and text messages."[2] Many policy analysts consider literacy rates a crucial measure of a region's human capital. This claim is made on the fact that literate people can be trained less expensively than illiterate people and generally have a higher socioeconomic status and enjoy better health and employment prospects. "Human capital refers to the stock of skills and knowledge embodied in the ability to perform labor so as to produce economic value. It is the skills and knowledge gained by a worker through education and experience."[3] Policy makers also argue that literacy increases job opportunities and access to higher education. A case study from India[4] demonstrated that improvements in female literacy have a direct effect on reducing fertility. Biology researcher has also stated "many studies have shown that people with lower socio-economic status have higher mortality rates", "our health care system places high literacy demands on patients, so limited literacy likely impedes access to health care and chronic disease management. Poor understanding of how to take medication or how to manage chronic disease, not to mention being unable to navigate through the complex health care system, could also cause increased mortality."[5] As a result, we can conclude that all of the five predictor variables: Birth Rate, Death Rate, GDP-per capita(PPP), Unemployment Rate, and Internet Usage Ratio to some extends have correlation with Literacy Rate either positively or negatively.
1.2
Purpose of the study
This study consists of two parts. The main purposes of part I of this study are: (a) to conduct a statistical analysis of the relationship between the five predictor variables and Literacy Rate, (b) to determine how these five variables are significantly affected by Literacy Rate, and (c) to assess the relative importance of all five variables in the full regression model. The main purposes of part II of this study are: (a) to conduct a statistical analysis of the relationship between the six indicator variables and Literacy Rate, and (b) to study the distribution of Literacy Rate between regions around the World.
2
Dataset 2.1
Data Description & Usage
This study draws upon publicly accessible data from The CIA - World Factbook 2009.
2.1.1 Part I of the study In part I of this study, Literacy Rate is the response variable. The five predictor variables are: Birth Rate, Death Rate, GDP-per capita(PPP), Unemployment Rate, and Internet Usage Ratio.
Literacy Rate - The rate of age 15 and over can read and write. Y = Literacy (%)
Unemployment Rate - It contains the percent of the labor force that is without jobs. Substantial underemployment might be noted. X1 = Unemployment Rate (%)
GDP – per capita (PPP) - It gives GDP growth on an annual basis adjusted for inflation and expressed as a percent. X2 = GDP – per capita (PPP)(%)
Death Rate - It gives the average annual number of deaths during a year per 1,000 populations at mid-year; also known as crude death rate. X3 = Death Rate (%)
Internet Usage Ratio - It gives the percentage of users within a country that access the Internet. X4 = Internet Usage (%)
Birth Rate - It gives the average annual number of births during a year per 1,000 persons in the population at midyear; also known as crude birth rate. X5 = Birth Rate (%)
2.1.2 Part II of the study In part II of this study, Literacy Rate is the response variable. The six indicator variables are: Asia, Oceania, Europe, South America, North America and Africa represented by combination of five dummy variables are X1, X2, X3, X4, and X5, These variables are different from the five predictor variables described in part I of this study.
Literacy rate. Y = Literacy Rate
Six regions around the World represented using dummy variable = (X1, X2, X3, X4, X5) Africa (X1, X2, X3, X4, X5) = (1,0,0,0,0) South America (X1, X2, X3, X4, X5) = (0,1,0,0,0) Europe (X1, X2, X3, X4, X5) = (0,0,1,0,0) Asia (X1, X2, X3, X4, X5) = (0,0,0,1,0) Oceania (X1, X2, X3, X4, X5) = (0,0,0,0,1) North America (X1, X2, X3, X4, X5) = (0,0,0,0,0)
1. 2. 3. 4. 5. 6.
2.2
Literature of Analysis Performed
Nowadays, literacy is considered to be a fundamental and essential skill. Literacy skills have been used as both predictors and indicators in various studies. In the introduction section, we have stated that there are research and studies supporting the fact that there is either positive or negative correlation between Literacy Rate and these five variables: Birth Rate, Death Rate, GDPper capita (PPP), Unemployment Rate, and Internet Usage Ratio. There is also strong evidence in the literature that suggests that Literacy Rate predicts GDP and Employment Rate in following publications: Lenoir, Gloria., Bellemeur, Jeannette., Illescas-Glascok, Maria Luisa.; and Lim, Soojin. "Factors that Inform International Literacy Rates"[6] and Pant, Mohan. "Does Literacy Predict Economic Growth? A Multiple Regression Analysis"[7] In current literature indicates that Literacy Rate is related to socio-economic background are key determinants for these five variables: Birth Rate, Death Rate, GDP-per capita (PPP), Unemployment Rate, and Internet Usage Ratio. This study seeks to examine these variables at the macro-level by looking at the correlation between each of them and Literacy Rate. In addition, it will determine whether or not correlations can be expanded to include the Literacy Rate in each country grouped by region. This study’s goal is to add to the literature on educational comparisons among countries by region. Figure1: World's Literacy Rate Distribution - Literacy rate by country based on CIA World Factbook 2009 data
definition: age 15 and over can read and write total population: 82% male: 87% female: 77% note: over two-thirds of the world's 785 million illiterate adults are found in only eight countries (Bangladesh, China, Egypt, Ethiopia, India, Indonesia, Nigeria, and Pakistan); of all the illiterate adults in the world, two-thirds are women; extremely low literacy rates are concentrated in three regions, the Arab states, South and West Asia, and Sub-Saharan Africa, where around one-third of the men and half of all women are illiterate (2005 est.)
3
Methodology 3.1
Multiple Regression
Multiple regression analysis involves the formation of an equation with response variable Y and the predictor variables Xi used in the model and then analyzing the significance of each predictor variable in predicting the response variable. The equation for multiple regression model with more than one predictor is given by: Yi= β0+ β1X1+ β2X2+ … + βp-1Xp-1+ ϵi where: β0, β1, …,βp-1 are parameters X1, X2, …,Xp-1 are known constants ϵi are independent N(0,σ2) i = 1, …, n
3.2
ANOVA
The analysis of variance (ANOVA) is a collection of statistical models, and their associated procedures, in which the observed variance is partitioned into components due to different explanatory variables. The analysis of variance approach is based on the partitioning of sums of squares and degrees of freedom associated with the response variable Y.
3.3
Stepwise Regression
Stepwise regression includes regression models in which the choice of predictor variables is carried out by an automatic procedure. Usually, this takes the form of a sequence of F-tests, but other techniques are possible, such as t-tests or Adjusted R-square. The main approaches are: Forward selection, which involves starting with no variables in the model, trying out the predictor variables one by one and including them to test if they are statistically significant. Backward elimination, which involves starting with all predictor variables as possible candidate in the model and testing them one by one for statistical significance, deleting any that are not significant. Methods that are a combination of the above approaches, testing at each stage for predictor variables to be included or excluded.
4 Analysis We used R and Minitab to conduct the following analysis.
4.1
Part I of the study
We collected all the data for each of our predictor variables from The CIA – The World Factbook website to form our dataset. We observed that not all countries are covered in each one of the data. As a result, we decided to exclude countries with one or more missing data from the dataset. After careful selection and elimination, we have finalized our dataset, which contains 171 countries around the World.
4.1.1 Multiple Regression We conducted Multiple Regression Analysis: y versus x1, x2, x3, x4, x5 The regression equation is y = 111 + 0.0760 x1 + 0.00443 x2 - 0.253 x3 + 0.0323 x4 - 1.30 x5 S = 10.9863
R-Sq = 57.5%
R-Sq(adj) = 56.3%
Scatter plot of each predictor variables: Birth Rate, Death Rate, GDP-per capita (PPP), Unemployment Rate, and Internet Usage Ratio against Literacy Rate.
20
40
60
80
0
200
400
600
800
1000
100 40 20 5
10
X2: GDP(ppp)
15
20
25
30
0
20
X3: Death Rate
40
60
80
X4: Internet Usage(%)
60 20
40
Literacy(%)
80
100
X1: Unemployment Rate
60
Literacy(%)
80 60
Literacy(%)
40 20
20
0
80
100
100 80 60
Literacy(%)
40
60 20
40
Literacy(%)
80
100
From the plots, we can conclude that there is relationship between Birth Rate and Literacy (%) however we can't say for sure about the other predictor variables.
10
20
30
40
50
X5: Birth Rate
Scatter plot of Yˆ against Residual does not suggest any systematic deviations from the response plane, nor that does the variance of the error terms vary with the level of Yˆ .
50
60
70
80
90
100
40
60
80
20 0 -10
Residual
-20 -30
x4
60
80
10
20
30 x5
40
50
20 0 -10
Residual
-20 -30 0
200
400
600 x2
10
20 10 0
Residual
-10 -20
40
10
20 20
x1
-30
20
0 -20 -30
0
y.hat
0
-10
Residual
10
20 10 0 -30
-20
-10
Residual
0 -10 -30
-20
Residual
10
20
Scatter plots of each predictor variables: Birth Rate, Death Rate, GDP-per capita (PPP), Unemployment Rate, and Internet Usage Ratio against Residual; these plots do not show any special pattern, indicating the good fit by the response function and constant variance of the error terms.
800
1000
5
10
15 x3
20
25
30
Scatter plot of cross-product term of predictor variables: Birth Rate, Death Rate, GDP-per capita (PPP), Unemployment Rate, and Internet Usage Ratio against Residual.
20 10 0 -30
-20
-10
Residual
0 -10 -30
-20
Residual
10
20
All of the cross-product term plots do not exhibit any clear systematic pattern; hence, we cannot yet conclude if there is any interaction effects reflected by each of the corresponding model term βXiXj appear to be present.
0
2000
6000
10000
0
500
20 10 0 -20 -30
500
1000
1500
2000
0
500
1500
3500
20 10 0 -30
-20
-10
Residual
10 0 -10 -30
-20
Residual
2500
X1X5
20
X1X4
0
1000
3000
5000
7000
0
10000
30000
50000
10 0 -10 -20 -30
-30
-20
-10
0
Residual
10
20
X2X4
20
X2X3
Residual
1500
-10
Residual
10 0 -10
Residual
-20 -30
0
0
5000
10000
15000
0
200
400
600
800
10 0 -10 -20 -30
-30
-20
-10
0
Residual
10
20
X3X4
20
X2X5
Residual
1000 X1X3
20
X1X2
0
200
400 X3X5
600
800
0
500
1000
1500
X4X5
Residual plots of Y: Normal Probability Plot of Residual, Yˆ (fitted values) against Residual, Histogram of Residual and Observation Order against Residual. From the Normal Probability Plot, we can observe that the pattern is moderately linear. Histogram chart shows that the residual is normally distributed with mean about 0. This helps to confirm the reasonableness of the conclusion that the error terms are fairly normally distributed.
Residual Plots for y No rmal Pro b ab ility Plo t
Versu s Fits
99.9 99
20 Res i d u al
Percen t
90 50 10
0 -20
1
-40
0.1
-40
-20
0 Res i d u al
20
40
40
60
Histo g ram
80 Fi t t ed V al u e
100
Versu s Ord er
60 45 Res i d u al
Freq u en cy
20
30 15 0
0 -20 -40
-30
-20
-10 0 Res i d u al
10
4.1.2
20
1
20
40
60 80 100 120 O b s erv at i o n O rd er
140
160
Stepwise Regression
Here is a flowchart of our decision model for Stepwise Regression analysis using forward selection approach with the following condition: 1) To include the term if the alpha value of the new model is less than 0.15 2) To exclude the term if the alpha value of the new model is less than 0.15
Stage 1: We conducted Stepwise Regression analysis starting with constant term only and no variable terms. Stage 2: We added test each of the linear terms: x1, x2, x3, x4, and x5; one by one to determine if we will include it into our model. Here is the result after Stage 2: The regression equation is Literacy (%) = 113 - 1.35 Birth Rate
Note: Statistics Details are in Table 1 of Appendix.
R esid u al Plo ts f o r Liter acy ( %) Nor m a l P r oba bility P lot
Ve r sus F its
99.9 99
20
R es i d u al
P er cen t
90 50 10
0 -20
1 0.1
-40 -40
-20
0 R es i d u al
20
40
40
60
H istogr a m
80 F i t t ed Val u e
100
V e r sus Or de r 20
45
R es i d u al
F r eq u en cy
60
30 15
0 -20 -40
0
-40
-30
-20
-10 0 R es i d u al
10
20
30
1
20
40
60 80 100 120 Ob s er v at i o n Or d er
140
160
Stage 3: We added test each of the cross-product terms: x1x2, x1x3, x1x4, x1x5, x2x3, x2x4, x2x5, x3x4, x3x5, and x4x5; one by one to determine if we include it into our model. Here is the result after Stage 3: The regression equation is Literacy (%) = 113 - 1.53 Birth Rate + 0.0312 x4x5 - 0.646 Internet User (%) + 0.0305 x3x4 + 0.000964 x2x5
Note: Statistics Details are in Table 2 of Appendix. R esid u al Plo ts f o r Liter acy ( %) Nor m a l P r oba bility P lot
Ve r sus F its
99.9
20
99
R es i d u al
P er cen t
90 50 10
0 -20
1 0.1
-40 -40
-20
0 R es i d u al
20
40
40
60
H istogr a m 20
36
R es i d u al
F r eq u en cy
100
V e r sus Or de r
48
24 12 0
80 F i t t ed Val u e
0 -20 -40
-30
-20
-10 0 R es i d u al
10
20
1
20
40
60 80 100 120 Ob s er v at i o n Or d er
140
160
Stage 4: We added test each of the 2nd order terms: x1^2, x2^2,x3^2, x4^2, and x5^2; one by one to determine if we include it into our model. Here is the result after Stage 4: The regression equation is Literacy (%) = 110 - 1.46 Birth Rate + 0.0317 x4x5 + 0.00494 x2x3 0.000323 x2x4 - 0.395 Internet User (%)
Note: Statistics Details are in Table 3 of Appendix.
R esid u al Plo ts f o r Liter acy ( %) Nor m a l P r oba bility P lot
Ve r sus F its
99.9
20
99
R es i d u al
P er cen t
90 50 10
0 -20
1 0.1
-40 -40
-20
0 R es i d u al
20
40
40
60
H istogr a m
100
V e r sus Or de r
48
20
36
R es i d u al
F r eq u en cy
80 F i t t ed Val u e
24 12
0 -20 -40
0
-30
-20
-10 0 R es i d u al
10
20
1
20
40
60 80 100 120 Ob s er v at i o n Or d er
140
160
Outlier: We observed that there are some outliers where these observations which have a large standardized residual. As a result, we are eliminating these observations from our dataset. Again, we conducted Multiple Regression analysis with the selected terms with the new dataset. The regression equation is Literacy (%) = 91.8 - 0.0222 x5^2 + 0.00140 x1x2 + 0.00350 x1^2 + 0.0150 x4x5 - 0.00341 x4^2 + 0.0253 x3x4 - 0.00795 x1x4 - 0.0129 x3x5
Note: Statistics Details are in Table 4 of Appendix. R esid u al Plo ts f o r Liter acy ( %) Nor m a l P r oba bility P lot
Ve r sus F its
99.9
20
99
10
R es i d u al
P er cen t
90 50 10
-20
1 0.1
0 -10
-20
-10
0 R es i d u al
10
20
40
60 80 F i t t ed Val u e
H istogr a m
V e r sus Or de r 20 10
36
R es i d u al
F r eq u en cy
48
24 12 0
100
0 -10 -20
-24
-16
-8 0 R es i d u al
4.1.3
8
16
1
20
40
60 80 100 120 Ob s er v at i o n Or d er
140
160
ANOVA
ANOVA output for Multiple Regression Analysis: y versus x1, x2, x3, x4, x5 The regression equation is y = 111 + 0.0760 x1 + 0.00443 x2 - 0.253 x3 + 0.0323 x4 - 1.30 x5 S = 10.9863
R-Sq = 57.5%
R-Sq(adj) = 56.3%
ANOVA output for Multiple Regression Analysis: y versus Birth Rate, x4x5, x2x3, x2x4, Internet User The regression equation is Literacy (%) = 110 - 1.46 Birth Rate + 0.0317 x4x5 + 0.00494 x2x3 0.000323 x2x4 - 0.395 Internet User (%) S = 10.2059
R-Sq = 63.4%
R-Sq(adj) = 62.3%
ANOVA output for Multiple Regression Analysis without outliers: y versus x5^2, x1x2, x1^2, x4x5, x4^2, x3x4, x1x4, x3x5 The regression equation is Literacy (%) = 91.8 - 0.0222 x5^2 + 0.00140 x1x2 + 0.00350 x1^2 + 0.0150 x4x5 - 0.00341 x4^2 + 0.0253 x3x4 - 0.00795 x1x4 - 0.0129 x3x5 S = 7.78190
R-Sq = 71.9%
R-Sq(adj) = 70.4%
#4(TO DO: Comparison on ANOVA output) eg Adjusted R square…..
4.2
Part II of study 4.2.1
Multiple Regression
Taking this analysis further, we categorized the data into 6 groups, namely Asia, Oceania, Europe, South America, North America and Africa. From the Chart below, we noticed that the Literacy rate is lowest in Africa and followed by Asia. As a result, we will do further analysis in these two regions to see how it's Literacy Rate compares to the average of all other regions and how Literacy rate is distributed within these regions itself. Regions Asia Oceania Europe South America North America Africa
Average Literacy Rate 84 90.07 98,44 92.99 91.83 70.30
Scatterplot of Literacy vs Dummy x1(Africa) 110 100 90
Literacy (%)
80 70 60
Literacy (%) = 90.7 - 22.6 x1
50 40 30 20 0.0
0.2
0.4 0.6 Dum m y x1(Africa)
0.8
1.0
Literacy Rate in Africa is lower than the average of all other regions.
Scatterplot of Literacy vs Dummy z2(North Africa) 100
90
90
80
80
70
70
60
Literacy (%)
Literacy (%)
Scatterplot of Literacy vs Dummy z1(East Africa) 100
Literacy (%) = 67.0 + 8.85 z1
50
60
40
30
30
20
Literacy (%) = 68.7 - 2.55 z2
50
40
20 0.0
0.2
0.4 0.6 Dum m y z1(East Africa)
0.8
1.0
0.0
100
90
90
80
80
70
70
60
Literacy (%) = 63.7 + 14.2 z3
50
0.4 0.6 Dum m y z2(North Africa)
0.8
1.0
Scatterplot of Literacy vs Dummy z4(West Africa)
100
Literacy (%)
Literacy (%)
Scatterplot of Literacy vs Dummy z3(South Africa)
0.2
60 50
40
40
30
30
20
Literacy (%) = 72.4 - 13.9 z4
20 0.0
0.2
0.4 0.6 Dum m y z3(South Africa)
0.8
1.0
0.0
0.2
0.4 0.6 Dum m y z4(W est Africa)
0.8
1.0
Scatterplot of Literacy vs Dummy z5(Central Africa) 100 90
Literacy (%)
80 70 60 50
Literacy (%) = 68.7 - 20.1 z5
40 30 20 0.0
0.2
0.4 0.6 Dum m y z5(Central Africa)
0.8
1.0
Scatterplot of Literacy vs Dummy x4(Asia) 110 100 90
Literacy (%)
80 70
Literacy (%) = 87.5 - 3.52 x4
60 50 40 30 20 0.0
0.2
0.4 0.6 Dum m y x4(Asia)
0.8
1.0
Literacy Rate in Asia is lower than the average of all other regions. Scatterplot of Literacy vs Dummy w2(East Asia) 100
90
90
80
80
70
Literacy(%)
Literacy(%)
Scatterplot of Literacy vs Dummy w1(Central Asia) 100
Literacy (%) = 82.4 + 16.8 w1
60 50 40
Literacy (%) = 82.2 + 13.0 w2
70 60 50 40
30
30
20
20 0.0
0.2
0.4 0.6 Dum m y w1(Central Asia)
0.8
1.0
0.0
0.2
0.4 0.6 Dum m y w2(East Asia)
0.8
1.0
Scatterplot of Literacy vs Dummy w4(South Asia) 100
90
90
80
80
70 60
Literacy(%)
Literacy(%)
Scatterplot of Literacy vs Dummy w3(Southeasten Asia) 100
Literacy (%) = 84.1 - 0.23 w3
50 40
60
Literacy (%) = 88.7 - 30.0 w4
50 40
30
30
20
20 0.0
0.2
0.4 0.6 Dum m y w3(Southeaten Asia)
0.8
1.0
Scatterplot of Literacy vs Dummy w5(West Asia) 100 90 80
Literacy(%)
70
70
Literacy (%) = 82.4 + 4.13 w5
60 50 40 30 20 0.0
0.2
0.4 0.6 Dum m y w5(W est Asia)
0.8
1.0
0.0
0.2
0.4 0.6 Dum m y w4(South Asia)
0.8
1.0