Current status linear regression

Current status linear regression Piet Groeneboom and Kim Hendrickx Delft University and University of Hasselt

September 9, 2016

Figure: Kim Hendrickx

Current status regression

Consider the regression model
$$Y_i = \alpha + \beta' X_i + \varepsilon_i,$$
where $X_i$ is a $k$-dimensional covariate and $\varepsilon_i$ an observation error with expectation zero. We cannot observe $Y_i$! Instead, we observe $(X_1, T_1, \Delta_1), \ldots, (X_n, T_n, \Delta_n)$, where $T_i$ is an observation time and $\Delta_i = 1_{\{Y_i \le T_i\}}$. The $\varepsilon_i$ are i.i.d. variables, independent of the $T_i$ and $X_i$, with expectation zero; $T_i$ and $X_i$ are also taken to be independent. We want to estimate $\alpha$ and $\beta$. How to do this?
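The observation scheme can be sketched in a few lines. The following simulation uses the talk's later example ($X_i, T_i$ uniform on $[0,2]$, $\beta = 1/2$, error density $6x(1-x)$ rescaled to $[3/8, 5/8]$, i.e. a rescaled Beta(2,2)); the function name `simulate` and the seed are illustrative only.

```python
import numpy as np

# Sketch of the current status observation scheme (X_i, T_i, Delta_i).
# Y_i = beta * X_i + eps_i is latent; only the indicator 1{Y_i <= T_i}
# is recorded at the observation time T_i.
rng = np.random.default_rng(1)

def simulate(n, beta=0.5):
    X = rng.uniform(0, 2, n)                # covariate
    T = rng.uniform(0, 2, n)                # observation time, indep. of X
    eps = 3 / 8 + 0.25 * rng.beta(2, 2, n)  # error on [3/8, 5/8], mean 1/2
    Y = beta * X + eps                      # latent response, never observed
    Delta = (Y <= T).astype(int)            # current status indicator
    return X, T, Delta

X, T, Delta = simulate(1000)
```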

Maximum likelihood

The log likelihood of the observations is
$$\sum_{i=1}^n \bigl\{\Delta_i \log F_{Y_i}(T_i) + (1 - \Delta_i) \log\{1 - F_{Y_i}(T_i)\}\bigr\},$$
where
$$F_{Y_i}(t) = P\bigl(\alpha + \beta' X_i + \varepsilon_i \le t\bigr).$$
If $F$ is the distribution function of the $\varepsilon_i$, we can write the log likelihood in the form
$$\sum_{i=1}^n \bigl\{\Delta_i \log F(T_i - \alpha - \beta' X_i) + (1 - \Delta_i) \log\{1 - F(T_i - \alpha - \beta' X_i)\}\bigr\}.$$
We also have the extra condition that the expectation of $\varepsilon_i$ is zero, which amounts to:
$$\int x \, dF(x) = 0.$$

Reduction to simpler model

We drop the condition that the expectation of $\varepsilon_i$ is zero, and reduce the model to $Y_i = \beta' X_i + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. with unknown expectation $\alpha$. Then the log likelihood becomes
$$\sum_{i=1}^n \bigl\{\Delta_i \log F(T_i - \beta' X_i) + (1 - \Delta_i) \log\{1 - F(T_i - \beta' X_i)\}\bigr\},$$
where $F$ is the distribution function of $\varepsilon_i$. We now first estimate $F$ and $\beta$ by some estimates $\hat F$ and $\hat\beta$, and then estimate $\alpha$ by $\int x \, d\hat F(x)$ (or a variation on this).
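The final step, estimating $\alpha$ by $\int x \, d\hat F(x)$, is just a sum over the jumps of the step function $\hat F$, weighted by the jump sizes. A minimal sketch; the jump locations and sizes below are made up for illustration, not taken from the talk.

```python
import numpy as np

# alpha_hat = integral of x dFhat(x): a weighted sum over the jump
# points of the step-function estimate Fhat.
x = np.array([0.40, 0.45, 0.50, 0.55, 0.60])   # jump locations of Fhat
dF = np.array([0.1, 0.2, 0.4, 0.2, 0.1])       # jump sizes (sum to 1)
alpha_hat = float(np.sum(x * dF))
print(alpha_hat)  # approximately 0.5 for this symmetric example
```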

Example

Example: $X_i$ and $T_i$ are uniform on $[0, 2]$, $\beta = 1/2$, $\alpha = E\varepsilon_i = 1/2$, and suppose $\varepsilon_i$ has as density a rescaled version of the density $6x(1 - x)$ on $[0, 1]$. We rescale it to a density on $[3/8, 5/8]$.

Figure: (a) density of $\varepsilon_i$; (b) distribution function of $\varepsilon_i$.

We can try to straightforwardly maximize w.r.t. $F$ and $\beta$:
$$\sum_{i=1}^n \bigl\{\Delta_i \log F(T_i - \beta X_i) + (1 - \Delta_i) \log\{1 - F(T_i - \beta X_i)\}\bigr\}.$$

Example (continued)

Example for $n = 1000$: $\hat\beta = 0.514828$. For $n = 10{,}000$: $\hat\beta = 0.499999$.

Figure: (a) MLE of $F$, $n = 1000$; (b) MLE of $F$, $n = 10{,}000$.

Profile likelihood

Profile likelihood: pick a $\beta$ and maximize, for this $\beta$, the log likelihood
$$\sum_{i=1}^n \bigl\{\Delta_i \log F(T_i - \beta' X_i) + (1 - \Delta_i) \log\{1 - F(T_i - \beta' X_i)\}\bigr\}$$
over $F$; this gives $\hat F_\beta$ via a one-step (convex minorant) algorithm. Now try to find an argmax over $\beta$ of the log likelihood at the $\hat F_\beta$ so obtained. We use Brent's algorithm for the maximization, with $\hat F_\beta$ as parameter.

Figure: log likelihood for $\hat F_\beta$, as a function of $\beta$, $n = 100$.
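The inner maximization over $F$ for a fixed $\beta$ can be computed with the pool-adjacent-violators algorithm, one standard way to realize the convex-minorant step mentioned above. A sketch for a scalar covariate; the names `pava` and `profile_loglik` are illustrative, and clipping away from 0 and 1 is added to keep the logs finite.

```python
import numpy as np

def pava(y):
    # Pool-adjacent-violators: nondecreasing least-squares fit to y
    # (equal weights); equivalent to the convex-minorant construction.
    blocks = []  # stack of [level, weight]
    for v in y:
        blocks.append([float(v), 1.0])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([[v] * int(round(w)) for v, w in blocks])

def profile_loglik(beta, X, T, Delta, eps=1e-10):
    # Fhat_beta: isotonic MLE of F, from the Delta's ordered by the
    # residuals T_i - beta * X_i; then plug into the log likelihood.
    order = np.argsort(T - beta * X)
    d = Delta[order].astype(float)
    F = np.clip(pava(d), eps, 1 - eps)
    return float(np.mean(d * np.log(F) + (1 - d) * np.log(1 - F)))
```

Maximizing `profile_loglik` over $\beta$ (e.g. with Brent's method, as in the talk) then gives the profile MLE of $\beta$.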

Properties of the maximum likelihood estimator

What are the properties of the MLE of $\beta$ so obtained? Unknown! It is unknown whether the MLE of $\beta$ is $\sqrt{n}$-consistent. Li and Zhang (1998) conjecture that the MLE will give a $\sqrt{n}$-consistent but inefficient estimate. Murphy, van der Vaart, and Wellner (1999) prove that, in a model in which one only has observations from a part of $F_0$ where $F_0$ stays away from 0 and 1, the MLE gives an $n^{1/3}$-consistent estimate of $\beta_0$.

Interlude: binary choice model

In econometrics, the binary choice model has been studied: a special case of current status regression, where $T_i$ is degenerate at 0 and the log likelihood is of the form
$$\sum_{i=1}^n \bigl\{\Delta_i \log F(\beta' X_i) + (1 - \Delta_i) \log\{1 - F(\beta' X_i)\}\bigr\}.$$
The regression parameter $\beta$ is only identifiable under an extra condition, for example $\|\beta\| = 1$. Example: whether a person $i$ says yes ($\Delta_i = 1$) or no ($\Delta_i = 0$) to a question in a questionnaire is predicted by his (socio-economic) covariate $X_i$. The accompanying theory is also still shrouded in mystery. Literature: Cosslett (1983, 2007), Klein and Spady (1993), Dominitz and Sherman (2005), lecture notes Bruce Hansen (2009).

Difficulties

Why does estimating $\beta$ seem so difficult? $\beta$ is a parameter which appears as argument of a function $F$ which we cannot estimate nonparametrically at rate $\sqrt{n}$. Nevertheless, we expect that it is possible to estimate $\beta$ at rate $\sqrt{n}$. So we have to "bypass" the non-$\sqrt{n}$-estimable parameter $F$.

Compare with the proportional hazards model under current status: in this case the log likelihood is of the form
$$\sum_{i=1}^n \Bigl\{\Delta_i \log\Bigl(1 - \exp\bigl\{-\Lambda(T_i)e^{\beta X_i}\bigr\}\Bigr) - (1 - \Delta_i)\Lambda(T_i)e^{\beta X_i}\Bigr\},$$
where $\Lambda$ is the baseline cumulative hazard function. Now $\beta$ is not hiding inside a function $F$ which is not $\sqrt{n}$-estimable! Ordinary maximum likelihood is efficient (Huang (1996))!

Smoothing

Consider restricting the estimates of $F$ to a smooth plug-in estimator (Nadaraya-Watson statistic):
$$F_{nh,\beta}(t - \beta x) = \frac{\int \delta\, K_h(t - \beta x - u + \beta y)\, dP_n(u, y, \delta)}{\int K_h(t - \beta x - u + \beta y)\, dG_n(u, y)}, \tag{1}$$
where, for a smooth symmetric kernel $K$: $K_h(u) = h^{-1} K(u/h)$. Another way of writing $F_{nh,\beta}$ is to use ordinary sums:
$$F_{nh,\beta}(t - \beta x) = \frac{\sum_{j=1}^n \Delta_j\, K_h(t - \beta x - T_j + \beta X_j)}{\sum_{j=1}^n K_h(t - \beta x - T_j + \beta X_j)}.$$
This type of estimate has also been considered in the econometrics literature. Note: $F_{nh,\beta}$ is not necessarily monotone, so it need not be a distribution function!
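The sum form of the plug-in estimator is a one-liner. A sketch for a scalar covariate, using a Gaussian kernel as an example of a smooth symmetric $K$ (the talk does not fix a particular kernel); the function names are illustrative.

```python
import numpy as np

def K_h(u, h):
    # Gaussian kernel, scaled as K_h(u) = K(u / h) / h.
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2 * np.pi) * h)

def F_plugin(t, x, beta, T, X, Delta, h):
    # Nadaraya-Watson plug-in F_{nh,beta}(t - beta*x): kernel-weighted
    # average of the Delta_j, weights K_h(t - beta*x - T_j + beta*X_j).
    w = K_h(t - beta * x - T + beta * X, h)
    return float(np.sum(Delta * w) / np.sum(w))
```

Because this is a ratio of kernel averages rather than a fitted distribution function, `F_plugin` need not be monotone in $t$, matching the caveat above.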

Example plug-in estimator

Figure: Plug-in estimator, $n = 1000$ and $\hat\beta = 0.49532$.

Maxima plug-in estimator

Figure: (a) log likelihood w.r.t. $\beta$; (b) scores of plug-in w.r.t. $\beta$.

$$\frac{\partial}{\partial\beta}\,\text{log likelihood}(F_{nh,\beta}) = \text{scores of plug-in as function of } \beta = \frac{1}{n}\sum_{i=1}^n \frac{\{\Delta_i - F_{nh,\beta}(T_i - \beta X_i)\}\,\frac{\partial}{\partial\beta} F_{nh,\beta}(T_i - \beta X_i)}{F_{nh,\beta}(T_i - \beta X_i)\{1 - F_{nh,\beta}(T_i - \beta X_i)\}}.$$

Questions

"Shrouded in mystery": How should we choose the bandwidth $h$? Can we choose $h \asymp n^{-1/5}$? Can we use ordinary kernels? Cosslett (2007): we should have $n^{-1/5} < h < n^{-1/8}$ (excluding $h \asymp n^{-1/5}$ and $h \asymp n^{-1/8}$). Klein and Spady (1993): $n^{-1/6} < h < n^{-1/8}$, and one should take higher-order (4th-order) kernels. Lecture notes Bruce Hansen (2009): "It is unclear to me if these are merely technical sufficient conditions, or if there is a substantive difference with the semiparametric regression case." Note: the Klein-Spady estimates of the finite-dimensional regression parameter have been proved to be $\sqrt{n}$-consistent and to achieve the information lower bound.

Efficiency of plug-in estimators

Theorem (Piet Groeneboom and Kim Hendrickx (2016)). Let $F_{nh,\hat\beta_n}$ maximize the truncated log likelihood
$$\sum_{F(T_i - \beta' X_i) \in [\epsilon,\, 1-\epsilon]} \bigl[\Delta_i \log F(T_i - \beta' X_i) + (1 - \Delta_i) \log\{1 - F(T_i - \beta' X_i)\}\bigr]$$
over all plug-in estimators (ordinary kernels). As $n \to \infty$, with $h \asymp n^{-1/5}$, then
$$\sqrt{n}\,(\hat\beta_n - \beta_0) \xrightarrow{d} N\bigl(0, I(\beta_0)^{-1}\bigr),$$
where $I(\beta_0)$ is the matrix with elements
$$\int_{F_0(u) \in [\epsilon,\, 1-\epsilon]} \frac{\mathrm{covar}(X_i, X_j \mid T - \beta_0' X = u)\, f_0(u)^2}{F_0(u)\{1 - F_0(u)\}}\, f_{T - \beta_0' X}(u)\, du.$$

Penalized estimates

Murphy, van der Vaart, and Wellner (1999) consider the penalized maximum likelihood estimator, obtained by maximizing
$$\sum_{i=1}^n \bigl\{\Delta_i \log F(T_i - \beta X_i) + (1 - \Delta_i) \log\{1 - F(T_i - \beta X_i)\}\bigr\} - \lambda_n \int F''(u)^2\, du,$$
where
$$1/\lambda_n = O_p\bigl(n^{2/5}\bigr), \qquad \lambda_n^2 = o_p\bigl(n^{-1/2}\bigr),$$
and observations are only on a region where $\epsilon < F_0(u) < 1 - \epsilon$ for some $\epsilon > 0$. Translated into bandwidth choice ($h_n \asymp \sqrt{\lambda_n}$), the penalty conditions correspond to: $n^{-1/5} \le h < n^{-1/8}$.

Is smoothing really necessary?

Consider the following estimator: $\hat\beta_n$ is a value of $\beta$ such that
$$\sum_{i=1}^n X_i\bigl\{\Delta_i - \hat F_{n,\beta}(T_i - \beta X_i)\bigr\} = 0,$$
where $\hat F_{n,\beta}$ is the MLE of $F_0$ based on the values $T_i - \beta X_i$.
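This simple score is cheap to evaluate: for each $\beta$, sort the residuals, compute the isotonic MLE $\hat F_{n,\beta}$ (here via a small pool-adjacent-violators routine), and form the sum. A sketch for a scalar covariate; a zero-crossing in $\beta$ can then be located by bisection or a grid search, since the score is piecewise constant in $\beta$. The function names are illustrative.

```python
import numpy as np

def pava(y):
    # Pool-adjacent-violators: nondecreasing fit, equal weights.
    blocks = []
    for v in y:
        blocks.append([float(v), 1.0])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([[v] * int(round(w)) for v, w in blocks])

def simple_score(beta, X, T, Delta):
    # sum_i X_i { Delta_i - Fhat_{n,beta}(T_i - beta X_i) }, where
    # Fhat_{n,beta} is the isotonic MLE at the ordered residuals.
    order = np.argsort(T - beta * X)
    F = pava(Delta[order].astype(float))
    return float(np.sum(X[order] * (Delta[order] - F)))
```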

Theorem (Piet Groeneboom and Kim Hendrickx (2016)). $\sqrt{n}\bigl(\hat\beta_n - \beta_0\bigr) \xrightarrow{D} N(0, \sigma^2)$, where
$$\sigma^2 = \frac{\int \mathrm{var}(X \mid T - \beta_0 X = u)\, F_0(u)\{1 - F_0(u)\}\, f_{T - \beta_0 X}(u)\, du}{\bigl\{\int \mathrm{var}(X \mid T - \beta_0 X = u)\, f_0(u)\, f_{T - \beta_0 X}(u)\, du\bigr\}^2}.$$
Consequence: one can construct $\sqrt{n}$-consistent estimates of $\beta_0$ on the basis of the non-smoothed MLEs $\hat F_{n,\beta}$.

Simple score function for MLE

Figure: Simple score function for MLE, $n = 1000$.

Maximum rank estimator of Aragón and Quiroz (1995)

The regression parameter $\beta$ is estimated by the maximizer $\hat\beta_n$ of
$$\Gamma_n(\beta) = \sum_{i=1}^n \Delta_i R_i(\beta),$$
where $R_i(\beta)$ is the rank of $T_i - \beta X_i$ in $T_1 - \beta X_1, \ldots, T_n - \beta X_n$. This is similar to Han's maximum rank correlation estimator (Han (1987)) in the econometric literature. Abrevaya (1999) proves that $\hat\beta_n$ is $\sqrt{n}$-consistent and asymptotically normal, using the methods of Sherman (1993) for treating Han's maximum rank correlation estimator.
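$\Gamma_n(\beta)$ is straightforward to compute. A sketch for a scalar covariate (ties in the residuals are broken arbitrarily by the sort); the function name is illustrative.

```python
import numpy as np

def Gamma_n(beta, X, T, Delta):
    # Aragon-Quiroz criterion: sum, over observations with Delta_i = 1,
    # of the rank R_i(beta) of T_i - beta*X_i among all n residuals.
    resid = T - beta * X
    ranks = np.argsort(np.argsort(resid)) + 1  # ranks 1..n
    return int(np.sum(Delta * ranks))
```

Since $\Gamma_n$ is piecewise constant (it only changes when two residuals swap order), $\hat\beta_n$ is in practice found by maximizing over a grid of $\beta$ values rather than by smooth optimization.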

Proof for the maximum rank estimator

Use Theorem 1 in Sherman (1993): $\|\hat\beta_n - \beta_0\| = O_p(n^{-1/2})$ if $\hat\beta_n$ is the maximizer of $\Gamma_n(\beta)$, with population equivalent $\Gamma(\beta)$, and
(a) there exists a neighborhood $N$ of $\beta_0$ and a constant $k > 0$ such that $\Gamma(\beta) - \Gamma(\beta_0) \le -k\|\beta - \beta_0\|^2$ for $\beta \in N$, and
(b) uniformly over $o_p(1)$ neighborhoods of $\beta_0$,
$$\Gamma_n(\beta) - \Gamma_n(\beta_0) = \Gamma(\beta) - \Gamma(\beta_0) + O_p\bigl(\|\beta - \beta_0\|/\sqrt{n}\bigr) + o_p\bigl(\|\beta - \beta_0\|^2\bigr) + O_p\bigl(n^{-1}\bigr).$$

Figure: The log likelihood (red) and the criterion function $\Gamma_n$ (blue) for sample sizes $n = 50, 100, 500$ and $1000$. $\beta_0 = 1$ is the true regression parameter. $T_i$, $X_i$ and $\varepsilon_i$ are independent and standard normal.

Cumulative sum diagram for different $\beta$

The MLE $\hat F_{n,\beta}$ has jumps at a subset of $\{T_i - \beta' X_i,\ i = 1, \ldots, n\}$.

Figure: cumulative sum diagrams and MLEs $\hat F_{n,\beta}$ for $\beta = 0.4$, $\beta = 0.5$, $\beta = 0.6$.

$$\hat F_{n,\beta}(u) \xrightarrow{P} P\bigl(\Delta_i = 1 \mid T_i - \beta' X_i = u\bigr) = \int F_0\bigl(u + (\beta - \beta_0)' x\bigr)\, f_{X \mid T - \beta' X}(x \mid T - \beta' X = u)\, dx$$

Efficiency

The efficient variance for estimating $\beta$ is:
$$\int \frac{\mathrm{var}(X \mid T - \beta_0 X = u)\, f_0(u)^2}{F_0(u)\{1 - F_0(u)\}}\, f_{T - \beta_0 X}(u)\, du.$$
We can change $\hat\beta_n$ to a value of $\beta$ such that
$$\sum_{i=1}^n X_i\, \phi_n(T_i - \beta X_i)\bigl\{\Delta_i - \hat F_{n,\beta}(T_i - \beta X_i)\bigr\} = 0,$$
where $X_i \phi_n(T_i - \beta X_i)$ estimates the efficient score, for example
$$X_i \phi_n(T_i - \beta X_i) = \frac{X_i\, \hat f_{n,h}(T_i - \beta X_i)}{\hat F_{n,\beta}(T_i - \beta X_i)\{1 - \hat F_{n,\beta}(T_i - \beta X_i)\}},$$
where $\hat f_{n,h}(t) = \int K_h(t - u)\, d\hat F_{n,\beta}(u)$.
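The smoothed density $\hat f_{n,h}(t) = \int K_h(t - u)\, d\hat F_{n,\beta}(u)$ is just a kernel sum over the jumps of the step-function MLE. A sketch with a Gaussian kernel (the talk does not fix a particular $K$); the jump data in the test are illustrative.

```python
import numpy as np

def f_hat(t, jump_locs, jump_sizes, h):
    # fhat_{n,h}(t) = sum_j K_h(t - u_j) * (jump of Fhat at u_j):
    # the kernel smoother applied to the step-function MLE Fhat_{n,beta}.
    u = (t - jump_locs) / h
    Kh = np.exp(-0.5 * u ** 2) / (np.sqrt(2 * np.pi) * h)
    return float(np.sum(Kh * jump_sizes))
```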

Theorem (Piet Groeneboom and Kim Hendrickx (2016)). Let $\hat\beta_n$ be a value of $\beta$ such that
$$\int_{\hat F_{n,\beta} \in [\epsilon,\, 1-\epsilon]} x\, \phi_n(t - \beta' x)\bigl\{\delta - \hat F_{n,\beta}(t - \beta' x)\bigr\}\, dP_n = 0,$$
where
$$\phi_n(t - \beta' x) = \frac{\hat f_{n,h}(t - \beta' x)}{\hat F_{n,\beta}(t - \beta' x)\{1 - \hat F_{n,\beta}(t - \beta' x)\}}$$
and $\hat f_{n,h}(t) = \int K_h(t - u)\, d\hat F_{n,\beta}(u)$. Then
$$\sqrt{n}\bigl(\hat\beta_n - \beta_0\bigr) \xrightarrow{D} N\bigl(0, I(\beta_0)^{-1}\bigr),$$
$$I(\beta_0) = \int_{F_0 \in [\epsilon,\, 1-\epsilon]} \frac{\mathrm{var}(X \mid T - \beta_0' X = u)\, f_0(u)^2}{F_0(u)\{1 - F_0(u)\}}\, f_{T - \beta_0' X}(u)\, du.$$

Figure: Efficient score (red) and the simple score (blue) for sample sizes $n = 50, 100, 500$ and $1000$. The true regression parameter $\beta_0 = 1$.

Performance of estimates of $\beta$

          Plug-in, β via argmax     MLE, β via score=0        MLE, β via argmax
  n       mean(β̂n)   n·var(β̂n)    mean(β̂n)   n·var(β̂n)    mean(β̂n)   n·var(β̂n)
  100     0.499562    0.245172      0.502964    0.362099      0.487057    0.405507
  500     0.498857    0.191857      0.500167    0.212293      0.498111    0.306398
  1000    0.499502    0.192223      0.500178    0.194511      0.499528    0.296097
  5000    0.500314    0.181421      0.500058    0.176652      0.500162    0.227073
  10000   0.500120    0.172043      0.499989    0.174698      0.500159    0.233758
  20000   0.500096    0.174197      0.500001    0.169884      0.500050    0.236070

Table: $h = 0.5\, n^{-1/5}$ and $N = 1000$ simulation runs ($I(\beta_0)^{-1} = 0.158699$, $\epsilon = 0.001$).

Efficiency

Conclusion: We can construct an asymptotically efficient estimate with the non-smoothed MLE, using an estimate of the density based on the MLE. This uses isotonic estimators and reordering properties.

Bibliography

Jason Abrevaya. Rank regression for current-status data: asymptotic normality. Statist. Probab. Lett., 43(3):275-287, 1999. http://dx.doi.org/10.1016/S0167-7152(98)00267-3

Jorge Aragón and Adolfo J. Quiroz. Rank regression for current status data. Statist. Probab. Lett., 24(3):251-256, 1995. http://dx.doi.org/10.1016/0167-7152(94)00180-G

Stephen R. Cosslett. Distribution-free maximum likelihood estimator of the binary choice model. Econometrica, 51(3):765-782, 1983. http://dx.doi.org/10.2307/1912157

Stephen R. Cosslett. Efficient estimation of semiparametric models by smoothed maximum likelihood. Internat. Econom. Rev., 48(4):1245-1272, 2007. http://dx.doi.org/10.1111/j.1468-2354.2007.00461.x

Jeff Dominitz and Robert P. Sherman. Some convergence theory for iterative estimation procedures with an application to semiparametric estimation. Econometric Theory, 21(4):838-863, 2005. http://dx.doi.org/10.1017/S0266466605050425

Aaron K. Han. Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator. Journal of Econometrics, 35(2-3):303-316, 1987.

Jian Huang. Efficient estimation for the proportional hazards model with interval censoring. Ann. Statist., 24(2):540-568, 1996. http://dx.doi.org/10.1214/aos/1032894452

Roger W. Klein and Richard H. Spady. An efficient semiparametric estimator for binary response models. Econometrica, 61(2):387-421, 1993. http://dx.doi.org/10.2307/2951556

Gang Li and Cun-Hui Zhang. Linear regression with interval censored data. Ann. Statist., 26(4):1306-1327, 1998. http://dx.doi.org/10.1214/aos/1024691244

S. A. Murphy, A. W. van der Vaart, and J. A. Wellner. Current status regression. Math. Methods Statist., 8(3):407-425, 1999.

Robert P. Sherman. The limiting distribution of the maximum rank correlation estimator. Econometrica, 61(1):123-137, 1993. http://dx.doi.org/10.2307/2951780