NONPARAMETRIC BOOTSTRAPPING FOR MULTIPLE LOGISTIC REGRESSION MODEL USING R

Ahmed Hossain
Department of Public Health Sciences, University of Toronto, Ontario, Canada

and

H. T. Abdullah Khan
Department of Statistics, University of Dhaka, Dhaka, Bangladesh

ABSTRACT

The use of explanatory variables or covariates in a regression model is an important way to represent heterogeneity in a population. Bootstrapping is rapidly becoming a popular tool in a broad range of standard applications, including multiple regression. The nonparametric bootstrap allows us to estimate the sampling distribution of a statistic empirically, without making assumptions about the form of the population and without deriving the sampling distribution explicitly. The main objective of this study is to discuss the nonparametric bootstrapping procedure for the multiple logistic regression model using Davison and Hinkley's (1997) "boot" library in R.

Key words: Nonparametric, Bootstrapping, Sampling, Logistic Regression, Covariates.

I. INTRODUCTION

Bootstrapping is a general approach to statistical inference based on building a sampling distribution for a statistic by resampling from the data at hand. Efron (1979) discussed a bootstrap procedure that can be applied to estimate the sampling distributions of estimators for the multiple regression model. A common approach to statistical inference is to make assumptions about the structure of the population (e.g., an assumption of normality) and, along with the stipulation of random sampling, to use these assumptions to derive the sampling distribution on which classical inference is based. This is the parametric approach. In certain instances the exact distribution may be intractable, and so we instead derive its asymptotic distribution. The parametric approach has two potentially important deficiencies:

• If the assumptions about the population are wrong, then the corresponding sampling distribution of the statistic may be seriously inaccurate. On the other hand, if asymptotic results are relied upon, these may not hold to the required level of accuracy in a relatively small sample.

• The approach requires sufficient mathematical prowess to derive the sampling distribution of the statistic of interest. In some cases, such a derivation may be prohibitively difficult.
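The resampling idea described at the start of this section can be illustrated with a minimal R sketch. It is not taken from the paper: the simulated sample `x`, the choice of the median as the statistic, the seed and the number of replications `B` are all assumptions made only for illustration.

```r
# Minimal sketch: nonparametric bootstrap of the sample median (base R only).
set.seed(1)                  # assumed seed, only for reproducibility
x <- rexp(50, rate = 1)      # hypothetical skewed sample of size n = 50
B <- 2000                    # assumed number of bootstrap replications

# Resample the data with replacement and recompute the statistic each time.
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))

# Summaries of the empirical (bootstrap) sampling distribution.
boot_se   <- sd(boot_medians)                         # bootstrap standard error
boot_bias <- mean(boot_medians) - median(x)           # bootstrap bias estimate
perc_ci   <- quantile(boot_medians, c(0.025, 0.975))  # 95% percentile interval
```

No distributional form is assumed for `x` and no sampling distribution is derived analytically; the B recomputed medians stand in for it.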

II. NONPARAMETRIC BOOTSTRAPPING APPROACH FOR REGRESSION MODELS

The bootstrap method can be applied to much more general situations (Efron, 1982), but all of the essential elements of the method are clearly seen by concentrating on the familiar multiple regression model

$$y = X\beta + \varepsilon \qquad (2.1)$$

where $X$ and $\beta$ are fixed $(n \times k)$ and $(k \times 1)$ matrices, $X$ has full rank, and $n \ge k$. The components of $\varepsilon$ are independent, identically distributed random variables with zero mean and common variance $\sigma^2$.

The nonparametric bootstrap estimate of the sampling distribution of an estimator $\hat{\beta}$ of $\beta$ is generated by repeatedly drawing with replacement from the residual vector

$$\varepsilon^* = y - X\beta^* \qquad (2.2)$$

If $e_b$ is an $(n \times 1)$ vector of $n$ independent draws from $\varepsilon^*$, then the corresponding bootstrap dependent variable is given by

$$y_b = X\beta^* + e_b \qquad (2.3)$$

For each vector $y_b$ the estimator is recomputed, and the sampling distribution of the estimator is estimated by the empirical distribution of these estimates computed over a large number of $y_b$.

III. DATA

The kyphosis data frame has 81 rows representing data on 81 children who have had corrective spinal surgery, taken from the book Statistical Models in S, Wadsworth and Brooks, Pacific Grove, CA, 1992, p. 200. The outcome Kyphosis is a binary variable and the other three selected variables (columns) are numeric. Kyphosis is a factor telling whether a postoperative deformity (kyphosis) is "present" or "absent". Age represents the age of the child in months. Number represents the number of vertebrae involved in the operation. Start represents the beginning of the range of vertebrae involved in the operation.

In this paper, the generalized linear model (GLM) tool is used to fit the logistic regression model using the R statistical software.

IV. RESULTS

A logistic regression model is fitted in R to examine the influence of the three selected covariates on kyphosis, using the following command:

glm(formula = Kyphosis ~ Age + Start + Number, family = binomial, data = Kyphosis)

The results of the logistic regression are given in Table 1.

Table 1: Logistic Regression Coefficients.

Coefficients    Value       Std. Error   t value
(Intercept)     -2.03693    1.44918      -1.40557
Age              0.01093    0.00644       1.69617
Start           -0.20651    0.06768      -3.05104
Number           0.41060    0.22478       1.82662

(Dispersion Parameter for Binomial family taken to be 1)
Null Deviance: 83.23447 on 80 degrees of freedom
Residual Deviance: 61.37998 on 77 degrees of freedom
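The fit reported in Table 1 can be sketched in R as follows. This is an illustration under an assumption rather than the authors' exact script: the data are taken here from the `kyphosis` data frame shipped with the `rpart` package, and the object names `fit`, `coef_table`, `dev` and `dfs` are hypothetical.

```r
# Sketch: logistic regression of Kyphosis on Age, Start and Number.
# Assumption: the data come from the 'rpart' package, whose 'kyphosis'
# data frame has the columns Kyphosis, Age, Number and Start.
library(rpart)

fit <- glm(Kyphosis ~ Age + Start + Number, family = binomial, data = kyphosis)

# Estimates, asymptotic standard errors and Wald statistics.
coef_table <- summary(fit)$coefficients
print(round(coef_table, 5))

# Null and residual deviance with their degrees of freedom.
dev <- c(null = fit$null.deviance, residual = fit$deviance)
dfs <- c(null = fit$df.null, residual = fit$df.residual)
print(rbind(deviance = dev, df = dfs))
```

The printed coefficient table can be compared with Table 1. For a binomial family R labels the Wald statistic "z value" rather than "t value", and it adds a column of two-sided p-values.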

Table 1 shows that all three covariates have the expected directions. Judged by the t values, Start is statistically significant at the 5% level, while Age and Number are significant only at about the 10% level. Table 2 shows the partial correlation between the covariates.

Table 2: Correlation Matrix.

         Start       Number
Age      -0.28495    0.23210
Start                0.11075

The coefficient standard errors reported by glm rely on asymptotic approximations and may not be trustworthy. Therefore, let us turn to the bootstrap. Here we want to fit a regression model with response variable $y$ and predictors $x_1, x_2, \ldots, x_k$. We have a sample of $n$ observations $z_i' = (y_i, x_{i1}, x_{i2}, \ldots, x_{ik})$, $i = 1, 2, \ldots, n$. We simply select $B$ bootstrap samples of the $z_i'$, fitting the model and saving the coefficients from each bootstrap sample. We then construct confidence intervals for the regression coefficients using the methods discussed by Davison and Hinkley (1997).

Table 3: Bootstrap Statistics for Selected Variables.

               Original    Bias        Std. error
(Intercept)    -2.03671    -0.51139    2.92852
Age             0.01093     0.00214    0.00981
Start          -0.20650    -0.02979    0.11878
Number          0.41056     0.11311    0.51974

Figures 1-3 show the histograms and normal quantile-comparison plots for the bootstrap replications of the Age (Figure 1), Start (Figure 2) and Number (Figure 3) coefficients in the kyphosis data. The broken vertical line in each histogram shows the location of the regression coefficient for the model fitted to the original sample.

Figure 1: Age Coefficient

The bootstrap statistics in Table 3 show that the bias is small for the covariates Age and Start, but not for Number. Figures 1-3 suggest that the bootstrap replications are approximately normally distributed, which in turn helps to justify the usefulness of the bootstrapping technique.

Tables 4-9 show confidence intervals for the coefficients of the logistic regression model. The confidence intervals are observed to be very close for the covariates Age and Start; for Number, on the other hand, they are wider. So the application of bootstrapping provides us with a better understanding and better results.
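The case-resampling procedure described above, in which B bootstrap samples of the observations are drawn and the model is refitted to each, can be sketched with the "boot" library as follows. This is a minimal illustration, not the authors' script: the names `coef_fun`, `boot_fit` and `ci_age`, the seed, the choice of 999 replications and the use of the `kyphosis` data frame from the `rpart` package are all assumptions.

```r
# Sketch: case-resampling bootstrap of the logistic regression coefficients.
# Assumptions: 'kyphosis' comes from the 'rpart' package; coef_fun, boot_fit,
# the seed and R = 999 replications are illustrative choices.
library(boot)
library(rpart)

# Statistic for boot(): refit the model on the resampled rows and
# return the vector of estimated coefficients.
coef_fun <- function(data, indices) {
  d <- data[indices, ]
  fit <- glm(Kyphosis ~ Age + Start + Number, family = binomial, data = d)
  coef(fit)
}

set.seed(123)
boot_fit <- boot(data = kyphosis, statistic = coef_fun, R = 999)

# Original estimates with bootstrap bias and standard errors.
print(boot_fit)

# Normal-theory and percentile intervals for the Age coefficient (index 2).
ci_age <- boot.ci(boot_fit, type = c("norm", "perc"), index = 2)
print(ci_age)
```

Because each replication refits the model on resampled rows, some replications may trigger glm warnings about fitted probabilities of 0 or 1. The bias and standard errors will vary with the seed and the number of replications, so they need not reproduce Table 3 exactly.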

ORDINARY NONPARAMETRIC BOOTSTRAP

> boot.ci(boot.out = boot.k, type = c("norm", "perc", "bca"), index = 2)

>boot.h function(data, indices) { data