ADAPTIVE ESTIMATION IN AUTOREGRESSION OR β-MIXING REGRESSION VIA MODEL SELECTION

The Annals of Statistics 2001, Vol. 29, No. 3, 839–875

ADAPTIVE ESTIMATION IN AUTOREGRESSION OR β-MIXING REGRESSION VIA MODEL SELECTION

By Y. Baraud, F. Comte and G. Viennet

École Normale Supérieure, Université Paris VI and Université Paris VII

We study the problem of estimating some unknown regression function in a β-mixing dependent framework. To this end, we consider some collection of models which are finite dimensional spaces. A penalized least-squares estimator (PLSE) is built on a data-driven selected model among this collection. We state nonasymptotic risk bounds for this PLSE and give several examples where the procedure can be applied (autoregression, regression with arithmetically β-mixing design points, regression with mixing errors, estimation in additive frameworks, estimation of the order of the autoregression). In addition we show that under a weak moment condition on the errors, our estimator is adaptive in the minimax sense simultaneously over some family of Besov balls.

1. Introduction. We consider the problem of estimating an unknown function f, from ℝ^k into ℝ, based on the observation of n (possibly) dependent data (X_i, Y_i), 1 ≤ i ≤ n, arising from the model

(1.1)  Y_i = f(X_i) + ε_i,  1 ≤ i ≤ n.

We assume that (X_i)_{1≤i≤n} is a stationary sequence of random vectors in ℝ^k, and we denote by µ the common law of the X_i's. The ε_i's are unobservable identically distributed centered random variables admitting a finite variance denoted by σ₂². Throughout the paper we assume that σ₂² is a known quantity (or that a bound on it is known). In this introduction, we assume that the ε_i's are independent random variables. As an example of model (1.1), consider the regression case of a random design X_i with values in [0,1], with a regression function f assumed to satisfy the Hölderian regularity condition

(1.2)  |f(x) − f(y)| ≤ |f|_α |y − x|^α,  0 ≤ x, y ≤ 1,  with |f|_α < +∞.

For x > 1, choose the penalty term to be equal to

pen(m) = x³ (D_m/n) σ₂²  for all m ∈ 𝓜_n,

except in Section 3.3, where the penalty term is chosen in a different way. In each case, we give sufficient conditions for f̃ = f̂_m̂ to achieve the best trade-off (up to a constant) between the bias and the variance term among the collection of estimators {f̂_m, m ∈ 𝓜_n}. Namely, we show that for any ρ in ]1, x[,

(3.1)  E[ ‖f_{|A} − f̃‖_n² ] ≤ ((x + ρ)/(x − ρ))² inf_{m∈𝓜_n} { ‖f_{|A} − f_m‖_µ² + 2x³ (D_m/n) σ₂² } + R/n,

for some constant R = R(ρ, x) to be specified. With no loss of generality we shall assume that A = [0,1]^k. Those results, proved in Section 6, derive from our main theorems, which are to be found in Sections 4 and 5.

3.1. Autoregression framework. We deal with a particular feature of the regression framework (1.1), the autoregression framework of order 1 given by

(3.2)  Y_i = X_i = f(X_{i−1}) + ε_i,  i = 1, …, n.

The process is initialized with some real valued random variable X₀.


We assume the following:

(H_AR1) The random variable X₀ is independent of the ε_i's. The ε_i's are i.i.d. centered random variables admitting a density h_ε with respect to the Lebesgue measure and satisfying σ₂² = E[ε₁²] < ∞. The density h_ε is a positive, bounded and continuous function, and the function f satisfies, for some 0 ≤ a < 1 and b ∈ ℝ,

(3.3)  |f(u)| ≤ a|u| + b  for all u ∈ ℝ.

The sequence of random variables X_i is stationary of common law µ. The existence of a stationary law µ derives from the assumptions on the ε_i's and f. To estimate f we use the collection of models given below.

Collection of piecewise polynomials. Let r be some positive integer and m_n the largest integer such that r 2^{m_n} ≤ n/ln³(n), that is, m_n = int[ln(n/(r ln³ n))/ln 2] (int(u) denotes the integer part of u). Let 𝓜_n be the set of integers {0, …, m_n}; for each m ∈ 𝓜_n we define S_m as the linear span of piecewise polynomials of degree less than r based on the dyadic grid {j/2^m, j = 0, …, 2^m − 1} ⊂ [0,1]. The result on f̃ is the following.

Proposition 1. Consider the autoregression framework (3.2) and assume that (H_AR1) holds. If σ_p^p = E[|ε_i|^p] < ∞ for some p > 6, then (3.1) holds for some constant R that depends on p, x, ρ, h_ε, σ_p², r and ‖f_{|A} − ∫f_{|A}dx‖_∞.

To obtain results in probability on ‖f_{|A} − f̃‖_n², it is actually enough to assume E[|ε_i|^p] < ∞ for some p > 2; we refer to (4.7) and the comment given there.

3.2. Regression framework. We give an illustration of Theorem 1 in the case of regression with arithmetically β-mixing design points. Of course, the case of autoregression with arithmetically β-mixing X_i's can be treated similarly. Let us consider the regression model

(3.4)  Y_i = f(X_i) + ε_i,  i = 1, …, n.
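The estimation recipe used throughout Sections 2 and 3 is always the same: compute the least-squares estimator f̂_m on each model S_m, then minimize the penalized contrast γ_n(f̂_m) + pen(m). The following is a minimal numerical sketch of that recipe (our own illustration, not the authors' code): piecewise-constant models (r = 1) on dyadic grids, a known noise level σ, the choice x = 2, and an i.i.d. uniform design standing in for the mixing one.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):                        # an arbitrary test regression function
    return np.sin(2 * np.pi * x)

# Simulate model (3.4): Y_i = f(X_i) + eps_i with an i.i.d. design on [0, 1).
n, sigma = 2000, 0.3
x_des = rng.uniform(size=n)
y = f_true(x_des) + sigma * rng.standard_normal(n)

def contrast(m):
    """Empirical contrast gamma_n(hat f_m) for S_m: piecewise constants
    (degree < 1) on the dyadic grid {j/2^m}."""
    J = 2 ** m
    bins = np.minimum((x_des * J).astype(int), J - 1)
    coef = np.array([y[bins == j].mean() if np.any(bins == j) else 0.0
                     for j in range(J)])
    return np.mean((y - coef[bins]) ** 2)

# Penalized selection: pen(m) = x^3 * D_m * sigma^2 / n with D_m = 2^m.
x_pen, m_max = 2.0, 6
crit = [contrast(m) + x_pen ** 3 * (2 ** m / n) * sigma ** 2
        for m in range(m_max + 1)]
m_hat = int(np.argmin(crit))
print("selected m:", m_hat)
```

With a smooth signal and n = 2000, the criterion picks an intermediate grid: coarse grids pay in bias, fine grids in penalty, which is exactly the trade-off quantified by (3.1).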

In this section, we consider a sequence (ε_i)_{i∈ℤ}, and we take the X_i's to be generated by a standard time series model:

(3.5)  X_i = Σ_{k=0}^{+∞} a_k ε_{i−1−2k}.

Then we make the following assumption:

(H_Reg) The ε_i's are i.i.d. Gaussian random variables. The a_j's are such that a₀ = 1, Σ_{j=0}^{+∞} a_j z^{2j} ≠ 0 for all z with |z| ≤ 1, and for all j ≥ 1, |a_j| ≤ C j^{−d} for some constants C > 0 and d > 17.


The value 17 as a bound for d is certainly not sharp. The model (3.5) for the X_i's, together with the assumptions on the coefficients a_j, aims at ensuring that (H_XY) is fulfilled with arithmetically β-mixing variables. Of course, any other model implying the same property would suit. We introduce the following collection of models.

Collection of wavelets. For any integer j, let Λ(j) = {(j, k), k = 1, …, 2^j}, and let

{φ_{J₀,k}, (J₀, k) ∈ Λ(J₀)} ∪ {ϕ_{j,k}, (j, k) ∈ ∪_{J=J₀}^{+∞} Λ(J)}

be an L²([0,1], dx)-orthonormal system of compactly supported wavelets of regularity r built by Cohen, Daubechies and Vial (1993). For some positive integer J_n > J₀, let 𝓢_n be the space spanned by the φ_{J₀,k}'s for (J₀, k) ∈ Λ(J₀) and by the ϕ_{j,k}'s for (j, k) ∈ ∪_{J=J₀}^{J_n−1} Λ(J). The integer J_n is chosen in such a way that dim(𝓢_n) = 2^{J_n} is of order n^{4/5}/ln(n). We set 𝓜_n = {J₀, …, J_n − 1}, and for each m ∈ 𝓜_n we define S_m as the linear span of the φ_{J₀,k}'s for (J₀, k) ∈ Λ(J₀) and the ϕ_{j,k}'s for (j, k) ∈ ∪_{J=J₀}^{m} Λ(J). For a precise description and use of these wavelet systems, see Donoho and Johnstone (1998). These functions derive from Daubechies' wavelets (1992) in the interior of [0,1] and are boundary corrected at the "edges."

Proposition 2. Assume that ‖f_{|A}‖_∞ < ∞ and that for all m ∈ 𝓜_n the constant functions belong to S_m. If (H_Reg) is satisfied, then (3.1) holds true for some constant R depending on x, ρ, h₀, h₁, σ₂², C, d and ‖f_{|A} − ∫f_{|A}dx‖_∞.

3.3. Regression with dependent errors. We consider the regression framework

(3.6)  Y_i = f(X_i) + ε_i,  ε_i = a ε_{i−1} + u_i,  i = 1, …, n.

We observe the pairs (Y_i, X_i) for i = 1, …, n. We assume that:

(H_Rd) The real number a satisfies 0 ≤ a < 1, and the u_i's are i.i.d. centered random variables admitting a common finite variance. The law of the ε_i's is assumed to be stationary, admitting a finite variance σ₂². The sequence of the X_i's is geometrically β-mixing [i.e., satisfying (6.1)], and the sequences of the X_i's and the ε_i's are independent.

Geometrically β-mixing X_i's can be generated by an autoregressive model with a regression function g and errors η_i satisfying an assumption of the same kind as (H_AR1) in Section 3.1.


The main difference between this framework and the previous one lies in the dependency between the ε_i's. To deal with it, we need to modify the penalty term.

Proposition 3. Assume that ‖f_{|A}‖_∞ < ∞, that (H_X) and (H_Rd) hold and that E[|ε₁|^p] < ∞ for some p > 6. Let x > 1. If the penalty term pen satisfies

(3.7)  pen(m) ≥ x³ (1 + 2a/(1 − a)) (D_m/n) σ₂²  for all m ∈ 𝓜_n,

then, by using the collection of piecewise polynomials described in Section 3.1 and applying the estimation procedure given in Section 2, the estimator f̃ satisfies, for any ρ ∈ ]1, x[,

(3.8)  E[ ‖f_{|A} − f̃‖_n² ] ≤ ((x + ρ)/(x − ρ))² inf_{m∈𝓜_n} { ‖f_{|A} − f_m‖_µ² + 2 pen(m) } + R/n,

where R depends on a, p, σ_p, ‖f_{|A} − ∫f_{|A}dx‖_∞, x, ρ, h₀, h₁, Γ and θ.

In contrast with the results of the previous examples, we cannot give a choice of penalty term which would work for any value of a. An unknown lower bound for the choice of the penalty term seems to be the price to pay when the ε_i's are no longer independent. This example shows how this lower bound varies with respect to the unknown number a, this number quantifying in some sense a discrepancy from independence (independence corresponds to a = 0). We also see that a choice of the penalty term of the form pen(m) = κ (D_m/n) σ₂² with κ large is safer than a choice of κ close to 1. This should be kept in mind every time the independence of the ε_i's is debatable (we refer the reader to the comments following Theorem 2).
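The inflation factor 1 + 2a/(1 − a) appearing in (3.7) is exactly the price of the AR(1) correlation: for stationary AR(1) errors, Cov(ε_i, ε_j) = a^{|i−j|}σ₂², so the variance of a block sum never exceeds the independent-case variance multiplied by that factor. A quick numerical check (our own illustration):

```python
import numpy as np

def var_sum_ar1(q, a, sigma2=1.0):
    """Exact variance of eps_1 + ... + eps_q for stationary AR(1) errors
    eps_i = a*eps_{i-1} + u_i, i.e. Cov(eps_i, eps_j) = sigma2 * a**|i-j|."""
    lags = np.arange(1, q)
    return sigma2 * (q + 2 * np.sum((q - lags) * a ** lags))

q, a = 50, 0.6
exact = var_sum_ar1(q, a)
bound = q * (1 + 2 * a / (1 - a))   # the inflated variance of (3.7), sigma2 = 1
print(exact, bound)
```

The bound follows from 2Σ_{i<j} a^{j−i} ≤ 2q Σ_{l≥1} a^l = 2qa/(1 − a); at a = 0 the factor collapses to 1 and the independent-error penalty is recovered.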

3.4. Additive models. We consider the additive regression models, widely used in Economics, described by

(3.9)  Y_i = e_f + f₁(X_i^(1)) + f₂(X_i^(2)) + ··· + f_k(X_i^(k)) + ε_i,

where the ε_i's are i.i.d. and e_f denotes a constant. Model (3.9) follows from model (1.1) with X_i = (X_i^(1), X_i^(2), …, X_i^(k)) and the additive function f: f(x₁, …, x_k) = e_f + f₁(x₁) + ··· + f_k(x_k). For identifiability, we assume that ∫₀¹ f_i(x) dx = 0 for i = 1, …, k. Such a model assumes that the effects on Y of the variables X^(j) are additive. Our aim is to estimate f on A = [0,1]^k. The estimation method allows one to build estimators of f₁, …, f_k in different spaces. Let ℓ be some integer. We define S_ℓ^(1) as the linear space of piecewise polynomials t of degree less than r, r ≥ 1, based on the dyadic grid {j/2^ℓ, j = 0, …, 2^ℓ} ⊂ [0,1] and satisfying ∫₀¹ t(x) dx = 0, and S_ℓ^(2) as the linear span


of the functions ψ_{2j−1}(x) = √2 cos(2πjx) and ψ_{2j}(x) = √2 sin(2πjx) for j = 1, …, 2^ℓ. Now we set m_n^(1) [respectively m_n^(2)] the largest integer such that dim(S_ℓ^(1)) [respectively dim(S_ℓ^(2))] is smaller than √n/ln³(n). Finally, 𝓜_n^(1) and 𝓜_n^(2) denote respectively the sets of integers {0, …, m_n^(1)} and {0, …, m_n^(2)}. We propose to estimate the f_i's either by piecewise or by trigonometric polynomials. To do so, we introduce a choice function g from {1, …, k} into {1, 2} and consider the following collection of models.

Mixed additive collection of models. We set 𝓜_n = 𝓜_n^(k) = {m = (m₁, …, m_k), m_j ∈ 𝓜_n^(g(j))}, and for each m = (m₁, …, m_k) ∈ 𝓜_n we define

S_m = { t(x₁, …, x_k) = a + Σ_{i=1}^k t_i(x_i) : (a, t₁, …, t_k) ∈ ℝ × Π_{i=1}^k S_{m_i}^(g(i)) }.



The performance of f̃ is given by the following result.

Proposition 4. Assume that ‖f_{|A}‖_∞ < ∞, that the sequence of the (X_i, Y_i)'s is geometrically β-mixing, that is, satisfies (6.1), and that (H_X), (H_ε) and (H_Xε) are fulfilled. Consider the additive regression framework (3.9) with the above collection of models. If σ_p^p = E[|ε|^p] < ∞ for some p > 6, then f̃ satisfies (3.1) for some constant R depending on k, p, σ_p, ‖f_{|A} − ∫f_{|A}dx‖_∞, x, h₀, h₁, Γ and θ.

We can deduce from Proposition 4 that our procedure is adaptive in the minimax sense. The point of interest is that the additive framework avoids the curse of dimensionality in the rate of convergence; that is, we can derive similar rates of convergence for k ≥ 2 as for k = 1. Let α > 0 and l ≥ 2; we recall that a function f from [0,1] into ℝ belongs to the Besov space B_{α,l,∞} if it satisfies

‖f‖_{α,l} = sup_{y>0} y^{−α} w_d(f, y)_l < +∞,  d = [α] + 1,

where w_d(f, y)_l denotes the modulus of smoothness. For a precise definition of those notions we refer to DeVore and Lorentz [(1993), Chapter 2, Section 7]. Since for l ≥ 2, B_{α,l,∞} ⊂ B_{α,2,∞}, we now restrict ourselves to the case where l = 2. In the sequel, for any L > 0, B_{α,2,∞}(L) denotes the set of functions which belong to B_{α,2,∞} and satisfy ‖f‖_{α,2} ≤ L. Then the following result holds.

Proposition 5. Consider model (3.9) with k ≥ 2. Let L > 0; assume that ‖f_{|A}‖_∞ ≤ L and that for all i = 1, …, k, f_i ∈ B_{α_i,2,∞}(L) for some α_i > 1/2. Assume that for all i = 1, …, k such that g(i) = 1, α_i ≤ r. Set α = min(α₁, …, α_k). If E[|ε₁|^p] < ∞ for some p > 6, then under the assumptions of Proposition 4,

(3.10)  E[ ‖f_{|A} − f̃‖_n² ] ≤ C(k, L, α, R) n^{−2α/(2α+1)}.
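A least-squares fit in such an additive space is a single linear regression on stacked coordinate-wise bases. Below is our own minimal sketch (not the authors' implementation): the zero-integral constraint defining S_ℓ^(1) is imposed by centering an indicator (piecewise-constant) basis, and the component functions are arbitrary test choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 3
X = rng.uniform(size=(n, k))
f1 = lambda x: np.sin(2 * np.pi * x)   # each integrates to 0 on [0, 1]
f2 = lambda x: x - 0.5
f3 = lambda x: np.cos(2 * np.pi * x)
y = 1.0 + f1(X[:, 0]) + f2(X[:, 1]) + f3(X[:, 2]) + 0.1 * rng.standard_normal(n)

def centered_hist_basis(x, m):
    """Indicator basis of the dyadic grid with 2^m cells, centered so that
    each basis function integrates to zero over [0, 1]."""
    J = 2 ** m
    bins = np.minimum((x * J).astype(int), J - 1)
    B = np.zeros((len(x), J))
    B[np.arange(len(x)), bins] = 1.0
    return B - 1.0 / J                 # subtract the cell measure

# Design matrix: intercept + centered basis per coordinate (here m_j = 3).
blocks = [np.ones((n, 1))] + [centered_hist_basis(X[:, j], 3) for j in range(k)]
D = np.hstack(blocks)
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
fitted = D @ coef
print("residual std:", np.std(y - fitted))
```

The centering makes each block's columns sum to zero, so the intercept absorbs e_f; `lstsq` handles the resulting rank deficiency. The fit uses k one-dimensional bases, 1 + k(2^m − 1) free parameters in total, which is the dimension saving behind Proposition 5's one-dimensional rate.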


Comments. (i) In the case where k = 1, by using the collection of piecewise polynomials described in Section 3.1, (3.10) holds under the weaker assumption that α > 0; we refer the reader to the proof of Proposition 5.

(ii) A result of the same flavor can be established in probability; this would require a weaker moment condition on the ε_i's. Namely, using (4.7) we show similarly that for any η > 0 there exists a positive constant C(η) (also depending on k, L, α and R) such that

‖f_{|A} − f̃‖_n ≤ C(η) n^{−α/(2α+1)}

with probability greater than or equal to 1 − η, as soon as E[|ε₁|^p] < ∞ for some p > 2.

3.5. Estimation of the order of an additive autoregression. Consider an additive autoregression framework,

(3.11)  X_i = e_f + f₁(X_{i−1}) + f₂(X_{i−2}) + ··· + f_k(X_{i−k}) + ε_i,

where the ε_i's are i.i.d. and e_f denotes a constant. Under suitable assumptions ensuring that the X_i = (X_{i−1}, …, X_{i−k})'s are stationary and geometrically β-mixing, the estimation of f₁, …, f_k can be handled in the same way as in the previous section. The aim of this section is to provide an estimator of the order of the autoregression, that is, an estimator of the integer k₀ (k₀ ≤ k, k being known) satisfying f_{k₀} ≠ 0 and f_i = 0 for all i > k₀. To do so, let 𝓜_n = ∪_{j=0}^k 𝓜_n^(j) (we use the notations introduced in Section 3.4) and consider the collection of models {S_m, m ∈ 𝓜_n}. We estimate k₀ by k̂₀ = k̂₀(x), defined as the first coordinate of m̂, m̂ being given by

m̂ = arg min_{m∈𝓜_n} [ γ_n(f̂_m) + x³ (D_m/n) σ₂² ].
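The order-selection rule above can be sketched in a toy case where the additive components are linear, so that each candidate order corresponds to an ordinary autoregressive least-squares fit; the criterion is the penalized contrast displayed, with our own choices of x, σ and coefficients (true order 2).

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 3000, 5, 0.5
# Autoregression of true order k0 = 2 (linear components for simplicity).
X = np.zeros(n + k)
for i in range(k, n + k):
    X[i] = 0.4 * X[i - 1] - 0.3 * X[i - 2] + sigma * rng.standard_normal()

Y = X[k:]
lags = np.column_stack([X[k - j:n + k - j] for j in range(1, k + 1)])

x_pen = 2.0
crits = []
for order in range(k + 1):
    D = np.hstack([np.ones((n, 1)), lags[:, :order]])   # intercept + lags
    resid = Y - D @ np.linalg.lstsq(D, Y, rcond=None)[0]
    dim = order + 1
    crits.append(np.mean(resid ** 2) + x_pen ** 3 * dim * sigma ** 2 / n)
k0_hat = int(np.argmin(crits))
print("estimated order:", k0_hat)
```

Useless extra lags typically lower the empirical contrast by O(σ²/n), less than the penalty step x³σ²/n with x > 1, which is why the first coordinate of m̂ recovers the order.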

We measure the performance of k̂₀ via that of f̃ = f̂_m̂, the latter being known, under the assumptions of Theorem 1, to achieve the best trade-off (up to a constant) between the bias term and the variance term among the collection of least-squares estimators {f̂_m, m ∈ 𝓜_n}.

4. The main result. In this section, we give our main result concerning the estimation of a regression function from dependent data. Although this result considers the case of particular collections of models, extensions to very general collections are to be found in the comments following the theorem.

4.1. The main theorem. Let 𝓢_n be some finite dimensional linear subspace of the A-supported functions of L²(ℝ^k, dx). Let (φ_λ)_{λ∈Λ_n} be an orthonormal basis of 𝓢_n ⊂ L²(A, dx) and set D_n = |Λ_n| = dim(𝓢_n). We assume that there exists some positive constant Φ₁ ≥ 1 such that for all λ ∈ Λ_n,

(H_𝓢n)  ‖φ_λ‖_∞ ≤ Φ₁ √D_n  and  |{λ′ : ‖φ_λ φ_λ′‖_∞ ≠ 0}| ≤ Φ₁.


The second condition means that for each λ, the support of φ_λ is disjoint from the supports of all but at most Φ₁ of the φ_λ′'s. We shall see in Section 10 that those conditions imply that (2.2) holds with Φ₀² = Φ₁³. In addition, we assume some constraint on the dimension of 𝓢_n:

(H_Dn,Θ,b) There exists an increasing function Θ mapping ℝ⁺ into ℝ⁺, satisfying for some K > 0 and b ∈ ]0, 1/4[

ln(u) ∨ 1 ≤ Θ(u) ≤ K u^b  for all u ≥ 1,

such that

(4.1)  D_n ≤ n/(Θ(n) ln(n)).

Theorem 1. Consider model (1.1) with f an unknown function from ℝ^k into ℝ such that ‖f_{|A}‖_∞ < ∞, and assume that Conditions (H_X), (H_ε) and (H_Xε) are fulfilled. Consider a family (S_m)_{m∈𝓜_n} of linear subspaces of 𝓢_n. Assume that (S_m)_{m∈𝓜_n} satisfies (H_S) and that 𝓢_n satisfies (H_𝓢n) and (H_Dn,Θ,b). Suppose that (H_XY) is fulfilled for a sequence of β-mixing coefficients satisfying

(4.2)  β_q ≤ M [Θ^{−1}(Bq)]^{−3}  for all q ≥ 1,

for some M > 0 and for the constant B given by (7.14). For any x > 1, let pen be a penalty function such that

pen(m) ≥ x³ (D_m/n) σ₂²  for all m ∈ 𝓜_n.

Let ρ ∈ ]1, x[. For any p̄ ∈ ]0, 1], if there exists p > p₀ = 2(1 + 2p̄)/(1 − 4b) such that σ_p^p = E[|ε₁|^p] < ∞, then the PLSE f̃ defined by

(4.3)  f̃ = arg min_{m∈𝓜_n} [ γ_n(f̂_m) + pen(m) ]  with  γ_n(g) = (1/n) Σ_{i=1}^n [Y_i − g(X_i)]²

satisfies

(4.4)  ( E[ ‖f_{|A} − f̃‖_n^{2p̄} ] )^{1/p̄} ≤ ((x + ρ)/(x − ρ))² inf_{m∈𝓜_n} { ‖f_{|A} − f_m‖_µ² + 2 pen(m) } + C R_n/n,

where C is a constant depending on p, x, ρ, p̄, Φ₀, h₀, h₁, M, K, and R_n is given by

(4.5)  R_n = σ_p^{2p̄} ( Σ_{m∈𝓜_n} D_m^{−p/2+p̄} + n/n^{(1/4−b)(p−p₀)} + ( ‖f_{|A}‖_∞/σ_p )^{2p̄} ).


Comments. 1. The functions Θ of particular interest are either of the form Θ(u) = ln(u) or Θ(u) = u^c with 0 < c < 1/4. In the first case, (4.2) is equivalent to a geometric decay of the β-mixing coefficients (then we say that the variables are geometrically β-mixing); in the second case, (4.2) is equivalent to an arithmetic decay (the sequence is then arithmetically β-mixing).

2. A choice of D_n small in front of n allows one to deal with stronger dependency between the (X_i, Y_i)'s. In return, choosing D_n too small may lead to a serious drawback with regard to the performance of the PLSE. Indeed, in the case of nested models, the smaller D_n, the smaller the collection of models and the poorer the performance of f̃.

3. Assumption (H_𝓢n) is fulfilled when 𝓢_n is generated by piecewise polynomials of degree r on [0,1] (in that case Φ₁ = 2r + 1 suits) or by wavelets such as those described in Section 3.2 (a suitable basis is obtained by rescaling the father wavelets φ_{J₀,k}).

4. We shall see in Section 10 that the result of Theorem 1 holds for a larger class of linear spaces 𝓢_n [i.e., for 𝓢_n's which do not satisfy (H_𝓢n)], provided that (4.1) is replaced by

(4.6)  D_n² ≤ n/(ln(n) Θ(n)).

5. Take p̄ = 1; the main term involved in the right-hand side of (4.4) is usually

inf_{m∈𝓜_n} { ‖f_{|A} − f_m‖_µ² + 2 pen(m) }.

It is worth noticing that the constant in front of this term, that is,

C₁(x, ρ) = ((x + ρ)/(x − ρ))²,

only depends on x and ρ, and not on unpleasant quantities such as h₀, h₁. While Theorem 1 gives no precise recommendation on the choice of x to optimize the performance of the PLSE, it suggests, in contrast, that a choice of x close to 1 is certainly not a good one, since it makes the constant C₁(x, ρ) blow up (we recall that ρ must belong to ]1, x[). For fixed ρ, we see that C₁(x, ρ) decreases to 1 as x becomes large, the negative effect of choosing x large being that it increases the value of the penalty term.

6. Why does Theorem 1 give a result for values of p̄ ≠ 1? By using Markov's inequality, we can derive from (4.4) a result in probability saying that for any τ > 0,

(4.7)  P( ‖f_{|A} − f̃‖_n² > τ [ inf_{m∈𝓜_n} { ‖f_{|A} − f_m‖_µ² + 2 pen(m) } + R_n/n ] ) ≤ C′/τ^{p̄},

where C′ depends on x, ρ, p̄ and C. If E[|ε₁|^p] < ∞ for some p > 2 and if it is possible to choose Θ(u) of order a power of ln(u) [this is the case


when the (X_i, Y_i)'s are geometrically β-mixing], then one can choose both b in (H_Dn,Θ,b) and p̄ small enough to ensure that p > 2(1 + 2p̄)/(1 − 4b). Consequently we get that (4.7) holds true under the weak assumption that E[|ε₁|^p] < ∞ for some p > 2. Lastly, we mention that an analogue of (4.7) where ‖f_{|A} − f̃‖_n² is replaced by ‖f_{|A} − f̃‖_µ² can be obtained. This can be derived from the fact that, under the assumptions of Theorem 1, the (semi)norms ‖·‖_µ and ‖·‖_n are equivalent on 𝓢_n on a set of probability close to 1 [we refer to the proof of Theorem 1 and, for further details, to Baraud (2001)].

7. For adequate collections of models, the quantity R_n remains bounded by some number R not depending on n. In addition, if for all m ∈ 𝓜_n the constants belong to S_m, then the quantity ‖f_{|A}‖_∞ involved in R_n can be replaced by the smaller one ‖f_{|A} − ∫f_{|A}dx‖_∞.

5. Generalization of Theorem 1. In this section we give an extension of Theorem 1 by relaxing the independence of the ε_i's and by weakening Assumption (H_Xε). In particular, the next result shows that the procedure is robust to possible dependency (to some extent) of the ε_i's. We assume that:

(H′_ε) The ε_i's satisfy, for some positive number ϑ,

(5.1)  sup_{‖t‖_µ≤1} E[ ( Σ_{i=1}^q ε_i t(X_i) )² ] ≤ qϑ

for any 1 ≤ q ≤ n.

In addition, Assumption (H_Xε) is replaced by a milder one:

(H′_Xε) For all i ∈ {1, …, n}, X_i and ε_i are independent.

Then the following result holds.

Theorem 2. Consider the assumptions of Theorem 1 and replace (H_ε) by (H′_ε) and (H_Xε) by (H′_Xε). For any x > 1, let pen be a penalty function such that

pen(m) ≥ x³ (D_m/n) ϑ  for all m ∈ 𝓜_n.

Then the result (4.4) of Theorem 1 holds for a constant C that also depends on ϑ.

Comments. 1. In the case of i.i.d. ε_i's and under Assumption (H_Xε) [which clearly implies (H′_Xε)], it is straightforward that (5.1) holds with ϑ = σ₂². Indeed, under Condition (H_Xε), for all t ∈ L²(ℝ^k, µ),

E[ ( Σ_{i=1}^q ε_i t(X_i) )² ] = Σ_{i=1}^q E[ ε_i² t²(X_i) ] + 0 = q σ₂² ‖t‖_µ².

Then we recover Theorem 1.


2. Assume that the sequences (X_i)_{i=1,…,n} and (ε_i)_{i=1,…,n} are independent [which clearly implies (H′_Xε)] and that the ε_i's are β-mixing. Then we know from Viennet (1997) that there exists a function d_β, depending on the β-mixing coefficients of the ε_i's, such that for all t ∈ L²(ℝ^k, µ),

E[ ( Σ_{i=1}^q ε_i t(X_i) )² ] ≤ q E[ ε₁² d_β(ε₁) ] ‖t‖_µ²,

which amounts to taking ϑ = ϑ_β = E[ε₁² d_β(ε₁)] in (5.1). Roughly speaking, ϑ_β is close to σ₂² when the β-mixing coefficients of the ε_i's are close to 0, which corresponds to the independence of the ε_i's. Thus, in this context the result of Theorem 2 can be understood as a result of robustness, since ϑ_β is unknown. Indeed, the penalized procedure described in Theorem 1 with a penalty term satisfying, for some κ > 1,

pen(m) ≥ κ (D_m/n) σ₂²  for all m ∈ 𝓜_n,

still works if ϑ_β < κσ₂². This also means that if the independence of the ε_i's is debatable, it is safer to increase the value of the penalty term.

6. Proofs of the propositions of Section 3.

Proof of Proposition 1. The result is a consequence of Theorem 1; let us show that under (H_AR1) its assumptions are fulfilled. Condition (H_ε) is direct. Under (3.3) it is clear that ‖f_{|[0,1]}‖_∞ < ∞ holds true. We now set 𝓢_n = S_{m_n} and Θ(x) = ln²(x). Since

dim(𝓢_n) = D_n ≤ n/ln³(n),

(H_Dn,Θ,b) holds for any b > 0 and for some constant K = K(b). As to Conditions (H_S) and (H_𝓢n), they hold with Φ₀ = r [we refer to Birgé and Massart (1998)]. Under Condition (3.3), we know from Duflo (1997) that the process (X_i)_{i∈ℕ} admits a stationary law µ. Furthermore, we know that if the ε_i's admit a positive bounded continuous density with respect to the Lebesgue measure, then so does µ. This can easily be deduced from the connection between h_X and h_ε given by

h_X(y) = ∫ h_ε(y − f(x)) h_X(x) dx  for all y ∈ ℝ.

Then we can derive the existence of positive numbers h₁ and h₀ bounding the density h_X from above and from below on [0,1], and thus (H_X) is true. In addition, we know from Doukhan (1994) that under (3.3) the X_i's are geometrically β-mixing; that is, there exist two positive constants Γ and θ such that

(6.1)  β_q ≤ Γ e^{−θq}  for all q ≥ 1.


Since Θ^{−1}(u) = exp(√u), clearly there exists some constant M = M(Γ, θ) > 0 such that

β_q ≤ Γ e^{−θq} ≤ M e^{−3√(Bq)}  for all q ≥ 1.

Lastly, the ε_i's being independent of the sequence (X_j)_{j≤i−1}, (H_Xε) is fulfilled. With p̄ = 1, the moment condition of Theorem 1 requires p > 6/(1 − 4b); this is true for b small enough, and then (3.1) follows from (4.4) with R_n bounded uniformly in n:

R_n ≤ σ_p² ( Σ_{m=0}^{+∞} (r 2^m)^{−2} + sup_{n≥1} ln(n)/n^{(1/4−b)(p−6/(1−4b))} + ‖f_{|[0,1]}‖_∞²/σ_p² ) = R′.

Take R = CR′, where C is the constant involved in (4.4), to complete the proof of Proposition 1. ✷

Proof of Proposition 2. Conditions (H_S) and (H_𝓢n) are fulfilled [we refer to Birgé and Massart (1998)]. Next we check that (H_XY) holds true, and more precisely that the sequence (ε_i, X_i)_{1≤i≤n} is arithmetically β-mixing with β-mixing coefficients satisfying

(6.2)  β_q ≤ Γ q^{−θ}  for all q ∈ {1, …, n},

for some constants Γ > 0 and θ > 15. For that purpose, simply write (ε_t, X_t)′ = Σ_{j=0}^∞ A_j e(t − j) with e(t − j) = (ε_{t−2j}, ε_{t−1−2j})′, where, for j ≥ 0, A₀ is the 2 × 2 identity matrix and

A_j = ( 0 0 ; 0 a_j ).

Then Pham and Tran's (1985) Theorem 2.1 implies, under (H_Reg), that (ε_t, X_t) is absolutely regular with coefficients β_n ≤ K Σ_{j=n}^{+∞} Σ_{k≥j} |a_k| ≤ KC/((d − 1)(d − 2)) n^{−d+2}. This implies (6.2) with θ = d − 2 > 15. In addition, it can be proved that if a_j = j^{−d} then β_n ≥ C(d) n^{−d}, which shows that we do not reach the geometric rate of mixing. Clearly the other assumptions of Theorem 1 are satisfied, and it remains to apply it with p = 30 (a moment of order 30 exists since the ε_i's are Gaussian), Θ(u) = u^{1/5} and p̄ = 1. An upper bound for R_n which does not depend on n can be established in the same way as in the proof of Proposition 1. ✷

Proof of Proposition 3. The line of proof is similar to that of Proposition 1, the difference lying in the fact that we need to check the assumptions


of Theorem 2. Most of them are clearly fulfilled; we only check (H_XY) and (H′_ε). We note that the pairs (X_i, Y_i) are geometrically β-mixing (which shows that (H_XY) holds true), since both sequences, the X_i's and the ε_i's, are geometrically β-mixing (the ε_i's being drawn from a "nice" autoregression model; we refer to Section 3.1) and are independent. Next we show that (H′_ε) holds true with ϑ = (1 + 2a/(1 − a))σ₂², which will end the proof of Proposition 3. For all t ∈ L²(ℝ^k, µ), using that E[ε_i ε_j] = a^{j−i}σ₂² and |E[t(X_i)t(X_j)]| ≤ ‖t‖_µ² for i < j,

E[ ( Σ_{i=1}^q ε_i t(X_i) )² ] ≤ Σ_{i=1}^q ‖t‖_µ² σ₂² + 2 Σ_{i<j} E[ε_i ε_j] E[t(X_i)t(X_j)]
  ≤ q‖t‖_µ² σ₂² + 2‖t‖_µ² σ₂² Σ_{i<j} a^{j−i} ≤ q (1 + 2a/(1 − a)) σ₂² ‖t‖_µ²,

since a ≥ 0. ✷

7. Proofs of Theorems 1 and 2. The proof of Theorem 2 is clear from the proof of Theorem 1. Indeed, the assumptions (H_Xε) and (H_ε) are only needed in (8.6) and (8.10); for the rest of the proof, assuming (H′_Xε) is enough. It remains to notice that an analogue of (8.6) and (8.10) is easily obtained from Assumption (H′_ε). Now we prove Theorem 1. The proof is divided into consecutive claims.

Claim 1. For all m ∈ 𝓜_n,

(7.1)  ‖f_{|A} − f̃‖_n² ≤ ‖f_{|A} − f_m‖_n² + (2/n) Σ_{i=1}^n ε_i (f̃ − f_m)(X_i) + pen(m) − pen(m̂).

Proof. By definition of f̃ we know that for all m ∈ 𝓜_n and t ∈ S_m,

γ_n(f̃) + pen(m̂) ≤ γ_n(t) + pen(m).

In particular this holds for t = f_m, and algebraic computations lead to

(7.2)  ‖f − f̃‖_n² ≤ ‖f − f_m‖_n² + (2/n) Σ_{i=1}^n ε_i (f̃ − f_m)(X_i) + pen(m) − pen(m̂).

ε f˜ − fm X ˆ n i=1 i

Note that the relation f − t2n = f|A − t2n + f − f|A 2n is satisfied for any A-supported function t. Applying this identity respectively ˜ to  t = f and t = fm (those functions being A-supported as elements of m ∈n Sm ), we derive (7.1) from (7.2). ✷ Claim 2. Let qn , qn1 be integers such that 0 ≤ qn1 ≤ qn /2, qn ≥ 1. Set  i , i = 1  n, then there exist random variables u∗ = ε∗  X  ∗ , ui = εi  X i i i i = 1  n satisfying the following properties! (i) For 2 = 1  2n = n/qn , the random vectors     ∗  21 = u2−1q +1 

u2−1q +q  ∗21 = u∗ U and U 

u 2−1qn +1 2−1qn +qn1 n n n1


have the same distribution, and so have the random vectors

U_{ℓ,2} = (u_{(ℓ−1)q_n+q_{n,1}+1}, …, u_{ℓq_n})  and  U*_{ℓ,2} = (u*_{(ℓ−1)q_n+q_{n,1}+1}, …, u*_{ℓq_n}).

(ii) For ℓ = 1, …, ℓ_n,

u2q and U 

u

U 2q 2−1qn +qn1 +1 n n1 n n (ii) For 2 = 1  2n ,      ∗21 ≤ βq −q  and  U  22 = U  ∗22 ≤ βq

 21 = U (7.3)  U n n1 n1  ∗2 δ are independent.  ∗1δ   U (iii) For each δ ∈ 1 2, the random vectors U n Proof. The claim is a corollary of Berbee’s coupling lemma (1979) [see Doukhan et al. (1995)] together with (HXY ). For further details about the construction of the u∗i ’s we refer to Viennet (1997); see Proposition 5.1 and its proof page 484. ✷ We set A0 = h20 1 − 1/ρ2 /80!41 h1 

(7.4)

and we choose q_n = int(A₀Θ(n)/4) + 1 ≥ 1 (int(u) denotes the integer part of u) and q_{n,1} = q_{n,1}(x) to satisfy √(q_{n,1}/q_n) + 1 − q_{n,1}/q_n ≤ x; namely, q_{n,1} of order ((x − 1)² ∧ 1) q_n/2 works. For the sake of simplicity, we assume q_n to divide n, that is, n = ℓ_n q_n, and we introduce the sets B* and B_ρ defined as follows:

B* = { (ε_i, X_i) = (ε_i*, X_i*) for all i = 1, …, n }

and, for ρ ≥ 1,
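The block construction behind Claim 2 splits {1, …, n} into ℓ_n = n/q_n consecutive blocks, each cut into a first piece of length q_{n,1} (the U_{ℓ,1} part) and a remainder of length q_n − q_{n,1} (the U_{ℓ,2} part). A small index-bookkeeping sketch of that partition (our own illustration; the coupling itself, which replaces each block by an independent copy, is not reproduced here):

```python
def split_blocks(n, q, q1):
    """Partition 1..n into n//q blocks of length q; cut each block into its
    first q1 indices and the remaining q - q1 indices."""
    assert q >= 1 and 0 <= q1 <= q // 2 and n % q == 0
    first, second = [], []
    for l in range(n // q):
        start = l * q
        block = list(range(start + 1, start + q + 1))   # 1-based indices
        first.append(block[:q1])
        second.append(block[q1:])
    return first, second

U1, U2 = split_blocks(n=12, q=4, q1=2)
print(U1, U2)
```

Within each of the two families, blocks are separated by at least q_n − q_{n,1} (respectively q_{n,1}) time steps, which is what makes the β-mixing coupling bounds of (7.3) available.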

 Bρ =

t2µ



ρt2n 

∀t ∈

 mm ∈n

 Sm + S m

We denote by B*_ρ the set B* ∩ B_ρ. From now on, the index m denotes a minimizer of the quantity ‖f_{|A} − f_{m′}‖_µ² + pen(m′) for m′ ∈ 𝓜_n. Therefore m is fixed and, for the sake of simplicity, the index m is omitted in the three following notations. Let B_{m′}(µ) be the unit ball of S̄_{m′} = S_{m′} + S_m with respect to ‖·‖_µ, that is,

B_{m′}(µ) = { t ∈ S_{m′} + S_m : ‖t‖_µ² = E[ (1/n) Σ_{i=1}^n t²(X_i) ] ≤ 1 }.

For each m′ ∈ 𝓜_n, we set D̄(m′) = dim(S̄_{m′}).

Claim 3. Let x, ρ be numbers satisfying x > ρ > 1. If pen is chosen to satisfy

(7.5)

2 2  t Xi  ≤ 1

Bm  µ = t ∈ Sm + Sm / tµ = Ɛ n i=1 For each m ∈ n , we set Dm  = dimSm . Claim 3. Let x ρ be numbers satisfying x > ρ > 1. If pen is chosen to satisfy (7.5)

penm  ≥ x3

Dm 2 σ  n 2


then (7.6)

˜ 2 |B ∗ f|A − f n ρ   xx + ρ −2 ≤ C1 x ρ f|A − fm 2n + 2penm + ˆ n Wn m x−ρ

where Wn m  is defined by  Wn m  = 

sup

n 

t∈Bm µ i=1



2

 ∗i  εi∗ tX

− x2 nDm σ22   +

for m′ ∈ 𝓜_n, and where C₁(x, ρ) = (x + ρ)²/(x − ρ)² > 1.

Proof.

The following inequalities hold on Bρ∗ . Starting from (7.1) we get

˜ 2 ≤ f|A − fm 2 + f|A − f n n

n  ∗  f˜ − fm X 2 ˜ i f − fm µ εi∗ ˜ n f − fm µ i=1

+ penm − penm ˆ ≤ f|A − fm 2n +

n  2 ˜  ∗i  εi∗ tX f − fm µ sup n t∈Bmµ ˆ i=1

+ penm − penm

ˆ Using the elementary inequality 2ab ≤ xa2 + x−1 b2 , which holds for any positive numbers a b, we have  2 n  2 2 −1 ˜ 2 −2 ∗ ∗ ˜  sup εi tXi  f|A − fn ≤ f|A − fm n + x f − fm µ + n x t∈Bmµ ˆ i=1

+penm − penm

ˆ On Bρ∗ ⊂ Bρ , we know that for all t ∈



m ∈n

Sm + Sm , t2µ ≤ ρt2n , hence 

˜ 2 ≤ f|A − fm 2 + x−1 ρf˜ − fm 2 + n−2 x f|A − f n n n

sup

n 

t∈Bmµ ˆ i=1

2

 ∗i  εi∗ tX

+penm − penm ˆ

2  ≤ f|A − fm 2n + x−1 ρ f˜ − f|A n + f|A − fm n  2 n  −2 ∗ ∗  +n x sup εi tXi  + penm − penm ˆ t∈Bmµ ˆ i=1

by the triangular inequality. Since for all y > 0 (y is chosen at the end of the proof)  2 f˜ − f|A n + f|A − fm n ≤ 1 + yf˜ − f|A 2n + 1 + y−1 f|A − fm 2n 


we obtain

 1+y 2 ˜ f|A − fn 1 − ρ x



2 n  1 + y−1 2 −2 ∗ ∗  ≤ f|A − fm n 1 + ρ + n x sup εi tXi  x t∈Bmµ ˆ i=1 +penm − penm ˆ

 D + Dmˆ 2 1 + y−1 2 ≤ f|A − fm n 1 + ρ + penm + x3 m σ2 x n

 2 n  x  ∗i  − x2 nDmσ sup −penm ˆ + 2 εi∗ tX ˆ 22  n t∈Bmµ ˆ + i=1 using that Dm ˆ ≤ Dmˆ + Dm . Since the penalty function pen satisfies (7.5) for all m ∈ n , we obtain that on Bρ∗

  1+y 1+y−1 2 2 ˜ f|A − fn 1−ρ ˆ ≤ f|A − fm n 1+ρ +2penm + xn−2 Wn m x x which gives the claim by choosing y = x − ρ/x + ρ. ✷ Claim 4.

For p ≥ 21 + 2p/1 ¯ − 4b we have,

  ˜ 2p¯ |B∗ Ɛ f|A − f n ρ  p¯ p¯ ≤ C1 x ρ f|A − fm 2µ + 2penm      C  −1/2 p 2p¯ −p/2+p¯ n  + p¯ !0 h0 σp Dm + 2Kp 1−4bp−21+2p/1−4b ¯ n n m ∈ n

where C is a constant that depends on x ρ p p. ¯ Proof. By taking the power p¯ ≤ 1 of the right- and left-hand side of (7.6) we obtain ˜ 2p¯ |B∗ f|A − f n ρ  p¯ p¯ ≤ C1 x ρ f|A − fm 2n + 2penm +  p¯ p¯ ≤ C1 x ρ f|A − fm 2n + 2penm +



xx + ρ n2 x − ρ xx + ρ n2 x − ρ

p¯ p¯

¯ Wp ˆ n m

 m ∈

¯

Wp n m 

n


By taking the expectation on both sides of the inequality and using Jensen’s inequality we obtain that   ˜ 2p¯ |B∗ Ɛ f|A − f n ρ  p¯ p¯ ≤ C1 x ρ f|A − fm 2µ + 2penm

 xx + ρ p¯   p¯  + Ɛ Wn m 

n2 x − ρ m ∈

(7.7)

n

We now use the following result, Proposition 6. Under the assumptions of Theorem 1   ¯  Cp p ¯ −1 Ɛ Wp n m  m ∈n





−1

≤ Cp p ¯

Ɛ 

m ∈n

sup

n 

t∈Bm µ i=1

2

 ∗i  εi∗ tX

 * qn1 2 qn1 −x + 1− nDm σ22 qn qn   %p−p $ 1/3 ¯ −1/2 p 2p¯ ¯ ≤ xp/3 np¯ !0 h0 σp x −1   p  qn n −p/2+p¯ × Dm + pp−2/4p−1−p¯

n m ∈n

*





+

The proof of the second inequality is delayed to Section 8, the first one is a straightforward consequence of our choice of qn1 . Using Proposition 6 we derive from (7.7) that   ˜ 2np¯ |B∗ Ɛ f|A − f ρ  p¯  Cx p p ¯  −1/2 p 2p¯ p¯ 2 ≤ C x ρ f| − f  + 2penm + h σp ! A m 0 µ 1 0 (7.8) np¯   p  qn n −p/2+p¯ × Dm + pp−2/4p−1−p¯

n m ∈n Since A0 ≤ 1 and 1 ≤ 8n ≤ Knb we have p p p bp qp n ≤ 2 8n ≤ 2K n

hence by using the inequality pp − 2/4p − 1 ≥ p − 2/4 we get p

(7.9)

qn n

npp−2/4p−1−p¯

≤ 2Kp

n

1/4−bp−21+2 p/1−4b ¯ n


Note that the power of n, (1/4 − b)(p − 2(1 + 2p̄)/(1 − 4b)), is positive for p > 2(1 + 2p̄)/(1 − 4b). The result follows by combining (7.8) and (7.9). ✷

Claim 5.

Under the assumptions of Theorem 1 we have    Bρ∗c ≤ 2M + e16/A0 n−2

(7.10) and

  $ % ¯ ˜ 2p¯ |B∗c ≤ 2M + e16/A0 1−2p/p f|A 2∞p¯ + σp2p¯ n−p¯

(7.11) Ɛ f|A − f n ρ Proof. For the proof of (7.11) we refer to Baraud (2000) [see proof of Theorem 6.1, (49) with q = p¯ and β = 2] noticing that p ≥ 21 + 2p/1 ¯ − 4b > 4p/2 ¯ − p ¯ (p¯ ≤ 1). By examining the proof, it is easy to check that if the constants belong to the Sm ’s then f|A ∞ can be replaced by f|A − f|A ∞ . To prove (7.10) we use the following Proposition, which is proved in Section 9. Proposition 7. (7.12)

Under the assumptions of Theorem 1, for all $\rho>1$,

(7.12)
\[
\mathbb P\bigl(B^{*c}_\rho\bigr)\le 2n^2\exp\Bigl(-A_0\frac{\ell_n\ln(n)}{q_n}\Bigr)+2n\beta_{q_{n,1}}.
\]

Since $q_n=\mathrm{int}(A_0\ell_n/4)+1\le A_0\ell_n/4+1$ we have

(7.13)
\[
2n^2\exp\Bigl(-A_0\frac{\ell_n\ln(n)}{q_n}\Bigr)
\le 2n^2\exp\Bigl(4\ln(n)\Bigl(-1+\frac{4}{A_0\ell_n+4}\Bigr)\Bigr)\le 2e^{16/A_0}\,n^{-2},
\]

$\ell_n$ being larger than $\ln(n)$. Now, set

(7.14)
\[
B=A_0\bigl[(x-1)^2\wedge 1\bigr]/8
=h_0^2\bigl[(x-1)^2\wedge 1\bigr](1-1/\rho)^2\bigl(640\,\Phi_0^3h_1\bigr)^{-1}.
\]

Since $q_n\ge A_0\ell_n/4$, under Condition (4.2) we have

(7.15)
\[
2n\beta_{q_{n,1}}\le 2nM\bigl(Bq_n\bigr)^{-3}\le 2M\,n^{-2}.
\]

Claim 5 is proved by combining (7.13) and (7.15). ✷

The proof of Theorem 1 is completed by combining Claim 4 and Claim 5.
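The computation (7.13) only uses $q_n\le A_0\ell_n/4+1$ and $\ell_n\ge\ln n$. A numerical spot-check under these constraints (all parameter values below are hypothetical, chosen only to exercise the inequality):

```python
import math

# Check of the reconstructed bound (7.13):
#   2 n^2 exp(-A0 * l_n * ln(n) / q_n) <= 2 e^{16/A0} n^{-2}
# with q_n = int(A0 * l_n / 4) + 1 and any l_n >= ln(n).
def check(n, A0, ln_mult):
    l_n = ln_mult * math.log(n)          # any l_n >= ln(n); ln_mult >= 1
    q_n = int(A0 * l_n / 4) + 1
    lhs = 2 * n**2 * math.exp(-A0 * l_n * math.log(n) / q_n)
    rhs = 2 * math.exp(16 / A0) / n**2
    return lhs <= rhs * (1 + 1e-12)

print(all(check(n, A0, m)
          for n in (50, 1000, 10**5)
          for A0 in (0.5, 1.0)
          for m in (1.0, 2.0, 7.0)))
```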


8. Proof of Proposition 6. We decompose the proof into two steps.

Step 1. For all $m'\in\mathcal M_n$,

(8.1)
\[
\mathbb E\Bigl[\Bigl(\sup_{t\in B_{m'}(\mu)}\sum_{i=1}^n\varepsilon_i^*\,t(\vec X_i^*)
-\Bigl(\sqrt{\tfrac{q_{n,1}}{q_n}}+\sqrt{1-\tfrac{q_{n,1}}{q_n}}\Bigr)\sqrt{nD_{m'}}\,\sigma_2\Bigr)_+^{p}\Bigr]
\le C(p)\,\sigma_p^p\Bigl(n^{p/2}+\bigl(\Phi_0h_0^{-1/2}\bigr)^p\,q_n^p\,n^{p^2/(4(p-1))}\,D_{m'}^{p/2}\Bigr).
\]

Proof. Using the result of Claim 2, we have the following decomposition:
\[
\sum_{i=1}^n\varepsilon_i^*\,t(\vec X_i^*)
=\sum_{\ell=1}^{p_n}\sum_{i\in I^1_\ell}\varepsilon_i^*\,t(\vec X_i^*)
+\sum_{\ell=1}^{p_n}\sum_{i\in I^2_\ell}\varepsilon_i^*\,t(\vec X_i^*),
\]
where, for $\ell=1,\dots,p_n$ (the number of blocks, so that $n=p_nq_n$),
\[
I^1_\ell=\{(\ell-1)q_n+1,\dots,(\ell-1)q_n+q_{n,1}\},\qquad
I^2_\ell=\{(\ell-1)q_n+q_{n,1}+1,\dots,\ell q_n=(\ell-1)q_n+q_{n,1}+(q_n-q_{n,1})\}.
\]
Denoting $E^*_1=\sqrt{p_nq_{n,1}D_{m'}}\,\sigma_2$ and $E^*_2=\sqrt{p_n(q_n-q_{n,1})D_{m'}}\,\sigma_2$ we have
\[
\mathbb E\Bigl[\Bigl(\sup_{t\in B_{m'}(\mu)}\sum_{i=1}^n\varepsilon_i^*\,t(\vec X_i^*)-E^*_1-E^*_2\Bigr)_+^p\Bigr]
\le 2^{p-1}\,\mathbb E\Bigl[\Bigl(\sup_{t\in B_{m'}(\mu)}\Bigl|\sum_{\ell=1}^{p_n}\sum_{i\in I^1_\ell}\varepsilon_i^*\,t(\vec X_i^*)\Bigr|-E^*_1\Bigr)_+^p\Bigr]
+2^{p-1}\,\mathbb E\Bigl[\Bigl(\sup_{t\in B_{m'}(\mu)}\Bigl|\sum_{\ell=1}^{p_n}\sum_{i\in I^2_\ell}\varepsilon_i^*\,t(\vec X_i^*)\Bigr|-E^*_2\Bigr)_+^p\Bigr].
\]
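Concretely, the half-block construction above can be sketched in code (the numerical values of $q_n$, $q_{n,1}$, $p_n$ are hypothetical, chosen only to illustrate the index bookkeeping):

```python
# Illustrative construction of the index blocks I_l^1, I_l^2 used in the proof:
# block l covers {(l-1)q_n + 1, ..., l q_n}; its first q_{n,1} indices form I_l^1,
# the remaining q_n - q_{n,1} form I_l^2.  Values below are examples only.
q_n, q_n1, p_n = 6, 3, 4          # n = p_n * q_n = 24
n = p_n * q_n

I1 = [list(range((l - 1) * q_n + 1, (l - 1) * q_n + q_n1 + 1)) for l in range(1, p_n + 1)]
I2 = [list(range((l - 1) * q_n + q_n1 + 1, l * q_n + 1)) for l in range(1, p_n + 1)]

flat = sorted(i for block in I1 + I2 for i in block)
print(flat == list(range(1, n + 1)))  # the half-blocks partition {1, ..., n}
```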

Since the two terms can be bounded in the same way, we only show how to bound the first one. To do so, we use a moment inequality proved in Baraud [(2000), Theorem 5.2, page 478]: consider the sequence $\vec U^*_1,\dots,\vec U^*_{p_n}$ of independent random vectors of $(\mathbb R\times\mathbb R^k)^{q_{n,1}}$ defined by $\vec U^*_\ell=(\varepsilon_i^*,\vec X_i^*)_{i\in I^1_\ell}$ for $\ell=1,\dots,p_n$, and consider $\mathcal G_{m'}=\{g_t,\ t\in B_{m'}(\mu)\}$, the set of functions $g_t$ mapping $(\mathbb R\times\mathbb R^k)^{q_{n,1}}$ into $\mathbb R$ defined by
\[
g_t\bigl((e_1,x_1),\dots,(e_{q_{n,1}},x_{q_{n,1}})\bigr)=\sum_{i=1}^{q_{n,1}}e_i\,t(x_i).
\]


By applying the moment inequality with the $\vec U^*_\ell$'s and the class of functions $\mathcal G_{m'}$ we find, for all $p\ge 2$,

(8.2)
\[
C(p)^{-1}\,\mathbb E\Bigl[\Bigl(\sup_{t\in B_{m'}(\mu)}\Bigl|\sum_{\ell=1}^{p_n}\sum_{i\in I^1_\ell}\varepsilon_i^*\,t(\vec X_i^*)\Bigr|-E^*_1\Bigr)_+^p\Bigr]
\le \mathbb E\Bigl[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{p_n}\Bigl|\sum_{i\in I^1_\ell}\varepsilon_i^*\,t(\vec X_i^*)\Bigr|^p\Bigr]
+\mathbb E^{p/2}\Bigl[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{p_n}\Bigl(\sum_{i\in I^1_\ell}\varepsilon_i^*\,t(\vec X_i^*)\Bigr)^2\Bigr]
= V_p+V_2^{p/2},
\]

provided that

(8.3)
\[
\mathbb E\Bigl[\sup_{t\in B_{m'}(\mu)}\Bigl|\sum_{\ell=1}^{p_n}\sum_{i\in I^1_\ell}\varepsilon_i^*\,t(\vec X_i^*)\Bigr|\Bigr]\le E^*_1=\sqrt{p_nq_{n,1}D_{m'}}\,\sigma_2.
\]

Throughout this section we denote by $G_\ell(t)$ the random process
\[
G_\ell(t)=\sum_{i\in I^1_\ell}\varepsilon_i^*\,t(\vec X_i^*),
\]
which is repeatedly involved in our computations. It is worth noticing that it is linear with respect to the argument $t$. We first show that (8.3) is true. Let $(\varphi_j)_{j=1,\dots,D_{m'}}$ be an orthonormal basis of $S_m+S_{m'}\subset L^2(A,\mu)$. For each $t\in B_{m'}(\mu)$ we have the decomposition

(8.4)
\[
t=\sum_{j=1}^{D_{m'}}a_j\varphi_j,\qquad \sum_{j=1}^{D_{m'}}a_j^2\le 1.
\]

By the Cauchy–Schwarz inequality we know that
\[
\Bigl|\sum_{\ell=1}^{p_n}G_\ell(t)\Bigr|=\Bigl|\sum_{j=1}^{D_{m'}}a_j\sum_{\ell=1}^{p_n}G_\ell(\varphi_j)\Bigr|
\le\Bigl[\sum_{j=1}^{D_{m'}}\Bigl(\sum_{\ell=1}^{p_n}G_\ell(\varphi_j)\Bigr)^2\Bigr]^{1/2}.
\]

Thus, by using Jensen's inequality we obtain

(8.5)
\[
\mathbb E\Bigl[\sup_{t\in B_{m'}(\mu)}\Bigl|\sum_{\ell=1}^{p_n}G_\ell(t)\Bigr|\Bigr]
\le\Bigl[\sum_{j=1}^{D_{m'}}\mathbb E\Bigl(\sum_{\ell=1}^{p_n}G_\ell(\varphi_j)\Bigr)^2\Bigr]^{1/2}
=\Bigl[\sum_{j=1}^{D_{m'}}\sum_{\ell=1}^{p_n}\mathbb E\bigl[G_\ell^2(\varphi_j)\bigr]\Bigr]^{1/2},
\]


the random variables $(G_\ell(\varphi_j))_{\ell=1,\dots,p_n}$ being independent and centered for each $j=1,\dots,D_{m'}$. Now, for each $\ell=1,\dots,p_n$, we know that the laws of the vectors $(\varepsilon_i^*,\vec X_i^*)_{i\in I^1_\ell}$ and $(\varepsilon_i,\vec X_i)_{i\in I^1_\ell}$ are the same; therefore, under Condition (H$_{X\varepsilon}$),

(8.6)
\[
\mathbb E\bigl[G_\ell^2(\varphi_j)\bigr]=\mathbb E\Bigl[\Bigl(\sum_{i\in I^1_\ell}\varepsilon_i\varphi_j(\vec X_i)\Bigr)^2\Bigr]
=\sum_{i\in I^1_\ell}\mathbb E\bigl[\varepsilon_i^2\bigr]\,\mathbb E\bigl[\varphi_j^2(\vec X_i)\bigr]\le q_{n,1}\sigma_2^2,
\]

which together with (8.5) proves (8.3). Let us now bound $V_p$ and $V_2$ respectively. The connection between $\|\cdot\|_\infty$ and $\|\cdot\|_\mu$ over $S_m+S_{m'}$ allows to write, for all $t\in B_{m'}(\mu)$,

(8.7)
\[
\|t\|_\infty\le \Phi_0h_0^{-1/2}\sqrt{D_{m'}}.
\]

Thus,
\[
V_p=\mathbb E\Bigl[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{p_n}\Bigl|\sum_{i\in I^1_\ell}\varepsilon_i^*\,t(\vec X_i^*)\Bigr|^p\Bigr]
\le |I^1_\ell|^{p-1}\,\mathbb E\Bigl[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{p_n}\sum_{i\in I^1_\ell}|\varepsilon_i^*|^p\,|t(\vec X_i^*)|^p\Bigr]
\le q_{n,1}^{p-1}\bigl(\Phi_0h_0^{-1/2}\sqrt{D_{m'}}\bigr)^{p-2}\,
\mathbb E\Bigl[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{p_n}\sum_{i\in I^1_\ell}|\varepsilon_i^*|^p\,t^2(\vec X_i^*)\Bigr],
\]
using (8.7). Using (8.4) and the Cauchy–Schwarz inequality we get
\[
V_p\le q_{n,1}^{p-1}\bigl(\Phi_0h_0^{-1/2}\bigr)^{p-2}D_{m'}^{(p-2)/2}\,
\mathbb E\Bigl[\sum_{\ell=1}^{p_n}\sum_{j=1}^{D_{m'}}\sum_{i\in I^1_\ell}|\varepsilon_i^*|^p\,\varphi_j^2(\vec X_i^*)\Bigr]
\le q_n^{p-1}\bigl(\Phi_0h_0^{-1/2}\bigr)^{p-2}\,n\,D_{m'}^{p/2}\,\sigma_p^p,
\]
recalling that $p_nq_{n,1}\le p_nq_n\le n$. Since, for $p\ge 2$, $p^2/(4(p-1))\ge 1$, one also has

(8.8)
\[
V_p\le q_n^p\bigl(\Phi_0h_0^{-1/2}\bigr)^p\,\sigma_p^p\,D_{m'}^{p/2}\,n^{p^2/(4(p-1))}.
\]
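The passage to (8.8) only uses the elementary fact $p^2/(4(p-1))\ge 1$ for $p\ge 2$ (so that $n\le n^{p^2/(4(p-1))}$); a one-line numerical check:

```python
# p^2/(4(p-1)) >= 1 for all p >= 2, with equality exactly at p = 2:
vals = [(p, p * p / (4 * (p - 1))) for p in [2 + k / 10 for k in range(0, 81)]]
print(min(v for _, v in vals))  # -> 1.0, attained at p = 2
```

Indeed $p^2/(4(p-1))-1=(p-2)^2/(4(p-1))\ge 0$.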




We now bound $V_2$. A symmetrization argument [see Giné and Zinn (1984)] gives

(8.9)
\[
V_2=\mathbb E\Bigl[\sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{p_n}G_\ell^2(t)\Bigr]
\le \sup_{t\in B_{m'}(\mu)}\sum_{\ell=1}^{p_n}\mathbb E\bigl[G_\ell^2(t)\bigr]
+4\,\mathbb E\Bigl[\sup_{t\in B_{m'}(\mu)}\Bigl|\sum_{\ell=1}^{p_n}\xi_\ell\,G_\ell^2(t)\Bigr|\Bigr]
\le n\sigma_2^2+4\,\mathbb E\Bigl[\sup_{t\in B_{m'}(\mu)}\Bigl|\sum_{\ell=1}^{p_n}\xi_\ell\,G_\ell^2(t)\Bigr|\Bigr],
\]

where the $\xi_\ell$'s are i.i.d. centered random variables, independent of the $\vec X_i^*$'s and the $\varepsilon_i^*$'s, satisfying $\mathbb P(\xi_1=\pm 1)=1/2$. It remains to bound the last term in the right-hand side of (8.9). To do so, we use a truncation argument. We set $M_\ell=\max_{i\in I^1_\ell}|\varepsilon_i^*|$. For any $c>0$, we have

(8.10)
\[
\mathbb E\Bigl[\sup_{t}\Bigl|\sum_{\ell}\xi_\ell\,G_\ell^2(t)\Bigr|\Bigr]
\le \mathbb E\Bigl[\sup_{t}\Bigl|\sum_{\ell}\xi_\ell\,G_\ell^2(t)\,\mathbf 1_{M_\ell\le c}\Bigr|\Bigr]
+\mathbb E\Bigl[\sup_{t}\Bigl|\sum_{\ell}\xi_\ell\,G_\ell^2(t)\,\mathbf 1_{M_\ell> c}\Bigr|\Bigr].
\]

We apply a comparison theorem [Theorem 4.12, page 112, in Ledoux and Talagrand (1991)] to bound the first term of the right-hand side of (8.10): we know that, for each $t\in B_{m'}(\mu)$, the random variables $G_\ell(t)\mathbf 1_{M_\ell\le c}$ are bounded by $\bar B=q_{n,1}\Phi_0h_0^{-1/2}\sqrt{D_{m'}}\,c$ [using (8.7)] and are independent of the $\xi_\ell$'s. The function $x\mapsto x^2$ defined on $[-\bar B,\bar B]$ being Lipschitz with Lipschitz constant smaller than $2\bar B$, we obtain ($\mathbb E_\xi$ denotes the conditional expectation with respect to the $\varepsilon_i^*$'s and the $\vec X_i^*$'s)
\[
\mathbb E_\xi\Bigl[\sup_t\Bigl|\sum_\ell\xi_\ell\,G_\ell^2(t)\,\mathbf 1_{M_\ell\le c}\Bigr|\Bigr]
\le 4\bar B\,\mathbb E_\xi\Bigl[\sup_t\Bigl|\sum_\ell\xi_\ell\,G_\ell(t)\,\mathbf 1_{M_\ell\le c}\Bigr|\Bigr]
\le 4\bar B\,\mathbb E_\xi\Bigl[\sum_{j=1}^{D_{m'}}\Bigl(\sum_\ell\xi_\ell\,G_\ell(\varphi_j)\,\mathbf 1_{M_\ell\le c}\Bigr)^2\Bigr]^{1/2}
\le 4\bar B\Bigl[\sum_{j=1}^{D_{m'}}\sum_\ell G_\ell^2(\varphi_j)\Bigr]^{1/2}.
\]
We now decondition with respect to the $\varepsilon_i^*$'s and the $\vec X_i^*$'s, and using (8.6) we get

(8.11)
\[
\mathbb E\Bigl[\sup_t\Bigl|\sum_\ell\xi_\ell\,G_\ell^2(t)\,\mathbf 1_{M_\ell\le c}\Bigr|\Bigr]
\le 4q_{n,1}\Phi_0h_0^{-1/2}D_{m'}\sigma_2\sqrt n\,c
\le 4q_{n,1}^2\Phi_0^2h_0^{-1}D_{m'}\sigma_p\sqrt n\,c,
\]

noticing that $q_{n,1}$ and $\Phi_0h_0^{-1/2}$ are both greater than 1. Now we bound the second term of the right-hand side of (8.10). We have
\[
\mathbb E\Bigl[\sup_t\Bigl|\sum_\ell\xi_\ell\,G_\ell^2(t)\,\mathbf 1_{M_\ell>c}\Bigr|\Bigr]
\le \mathbb E\Bigl[\sup_t\sum_\ell G_\ell^2(t)\,\mathbf 1_{M_\ell>c}\Bigr]
\le \mathbb E\Bigl[\sum_{j=1}^{D_{m'}}\sum_\ell G_\ell^2(\varphi_j)\,\mathbf 1_{M_\ell>c}\Bigr]
\le q_{n,1}\,\mathbb E\Bigl[\sum_\ell M_\ell^2\,\mathbf 1_{M_\ell>c}\sum_{j=1}^{D_{m'}}\sum_{i\in I^1_\ell}\varphi_j^2(\vec X_i^*)\Bigr]
\le q_{n,1}^2\,\Phi_0^2h_0^{-1}D_{m'}\,c^{2-p}\sum_\ell\mathbb E\bigl[M_\ell^p\bigr],
\]
using (2.4). Lastly, since $M_\ell^p\le\sum_{i\in I^1_\ell}|\varepsilon_i^*|^p$, we get

(8.12)
\[
\mathbb E\Bigl[\sup_t\Bigl|\sum_\ell\xi_\ell\,G_\ell^2(t)\,\mathbf 1_{M_\ell>c}\Bigr|\Bigr]
\le q_{n,1}^2\,\Phi_0^2h_0^{-1}\,nD_{m'}\,\sigma_p^p\,c^{2-p}.
\]

By gathering (8.11) and (8.12) we obtain, for all $c>0$,
\[
\mathbb E\Bigl[\sup_t\Bigl|\sum_\ell\xi_\ell\,G_\ell^2(t)\Bigr|\Bigr]
\le 4q_{n,1}^2\Phi_0^2h_0^{-1}\sigma_p\,D_{m'}\bigl(\sqrt n\,c+n\,\sigma_p^{p-1}c^{2-p}\bigr).
\]
We choose $c=\sigma_p\,n^{1/(2(p-1))}$, and thus from (8.9) we get

(8.13)
\[
V_2\le n\sigma_2^2+8q_n^2\Phi_0^2h_0^{-1}\sigma_p^2\,D_{m'}\,n^{p/(2(p-1))},
\]

which straightforwardly proves Step 1 by combining (8.2), (8.8) and (8.13). ✷
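The truncation level $c$ is chosen precisely to balance the two terms $\sqrt n\,c$ and $n\,\sigma_p^{p-1}c^{2-p}$ of the display preceding (8.13); with $c=\sigma_p n^{1/(2(p-1))}$ both equal $\sigma_p n^{p/(2(p-1))}$. A quick check (parameter values hypothetical):

```python
import math

# Balance check for the truncation level c = sigma_p * n**(1/(2(p-1))):
# sqrt(n)*c and n * sigma_p**(p-1) * c**(2-p) should coincide.
def terms(n, p, sigma_p):
    c = sigma_p * n ** (1 / (2 * (p - 1)))
    return math.sqrt(n) * c, n * sigma_p ** (p - 1) * c ** (2 - p)

t1, t2 = terms(10_000, 4.0, 1.5)
print(abs(t1 - t2) / t1 < 1e-9)
```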

Step 2. For all $x>1$, $m'\in\mathcal M_n$ and $2\bar p<p$,
\[
n^{-\bar p}\,\mathbb E\Bigl[\Bigl(\Bigl(\sup_{t\in B_{m'}(\mu)}\sum_{i=1}^n\varepsilon_i^*\,t(\vec X_i^*)\Bigr)^2
-x\Bigl(\sqrt{\tfrac{q_{n,1}}{q_n}}+\sqrt{1-\tfrac{q_{n,1}}{q_n}}\Bigr)^2 nD_{m'}\sigma_2^2\Bigr)_+^{\bar p}\Bigr]
\le C(p,x)\,\bigl(\Phi_0h_0^{-1/2}\bigr)^p\,\sigma_p^{2\bar p}
\Bigl[D_{m'}^{-(p/2-\bar p)}+q_n^p\,n^{\bar p-p(p-2)/(4(p-1))}\Bigr].
\]

Proof. We set $Z_n(m')=\sup_{t\in B_{m'}(\mu)}\sum_{i=1}^n\varepsilon_i^*\,t(\vec X_i^*)\ge 0$ and
\[
E^*=\Bigl(\sqrt{\tfrac{q_{n,1}}{q_n}}+\sqrt{1-\tfrac{q_{n,1}}{q_n}}\Bigr)\sqrt{nD_{m'}}\,\sigma_2\ \ge\ \sqrt{nD_{m'}}\,\sigma_2.
\]
Since $x>1$, there exists $\eta>0$ such that $x=(1+\eta)^3$ (i.e., $\eta=x^{1/3}-1$). Thus, for all $\tau>0$,
\[
\mathbb P\bigl(Z_n^2(m')\ge(1+\eta)^3\bigl(E^{*2}+\tau\bigr)\bigr)
\le \mathbb P\Bigl(Z_n(m')\ge(1+\eta)E^*+\sqrt{\tau(1+\eta)^{-1}}\Bigr)
\le \mathbb P\Bigl(Z_n(m')-E^*\ge \eta E^*+\sqrt{\tau(1+\eta)^{-1}}\Bigr)
\]
\[
\le \mathbb P\Bigl(Z_n(m')-E^*\ge\sqrt{\eta^2E^{*2}+\tau(1+\eta)^{-1}}\Bigr)
\le \mathbb E\bigl[(Z_n(m')-E^*)_+^p\bigr]\bigl(\eta^2E^{*2}+\tau(1+\eta)^{-1}\bigr)^{-p/2}
\le \Bigl(\frac{x^{1/3}}{x^{1/3}-1}\Bigr)^{p/2}
\frac{\mathbb E\bigl[(Z_n(m')-E^*)_+^p\bigr]}{\bigl[(x^{1/3}-1)x^{1/3}\,nD_{m'}\sigma_2^2+\tau\bigr]^{p/2}},
\]
using Markov's inequality.

p/2 Ɛ Zn m  − Ɛ∗ p x1/3 + ≤ $ %p/2  x1/3 − 1 x1/3 − 1x1/3 nDm σ22 + τ using Markov’s inequality. Now, for each p¯ such that 2p¯ < p, the integration with respect to the variable τ leads to  p¯  Ɛ Z2n m  − x Ɛ∗ 2 +

=

#

+∞

0



 ¯ pτ ¯ p−1  Z2n m  − x Ɛ∗ 2 ≥ τ dτ

p/2  x1/3 p ≤ Ɛ Zn m  − Ɛ∗ + 1/3 x −1 # +∞ ¯ pτ ¯ p−1 × $ %p/2 dτ 0 x1/3 − 1x1/3 nDm σ22 + τ  %p¯  $ 1/3 1/3 x x − 1 Ɛ Zn m  − Ɛ∗ p p + ≤ $ %p/2−p¯  p p − 2p¯ x1/3 − 1 nDm σ 2 2

and using Step 1, we get  p¯  2

∗ 2 Ɛ Zn m  − x Ɛ  +

$ ≤C

%p¯

x1/3 x1/3 − 1 x1/3 − 1

p

−1/2 p

!0 h0

2p−p ¯

 σ2

σpp np¯

870

Y. BARAUD, F. COMTE AND G. VIENNET

  ¯

p¯ −pp−2/4p−1 × Dm −p/2−p + qp n Dm  n $ ≤C

%p¯

x1/3 x1/3 − 1 x1/3 − 1

p

−1/2 p

!0 h0

 σp2p¯ np¯

  ¯ p−pp−2/4p−1 ¯  × Dm −p/2−p + qp nn since Dm  = dim Sm + Sm  ≤ n. The constant C depends on p and p. ¯ ✷ It is now easy to prove Proposition 6 by summing up over m in n . 9. Proof of Proposition 7. Since Bρ∗c  = Bρc ∩ B∗  + B∗c  and since it is clear from Claim 2 that   B∗c  ≤ 2n βqn −qn1  + βqn1 ≤ 2nβqn1  (9.1) the result holds if we prove Bρc

(9.2)

 8n lnn

∩ B  ≤ 2n exp −A0 qn ∗

2

In fact, we prove a more general result, namely,  n h2 1 − 1/ρ2 c ∗ 2 Bρ ∩ B  ≤ 2Dn exp − 0 (9.3) 16h1 qn Lφ where Lφ is a quantity specific to the orthonormal basis φλ λ∈#n , defined as follows. Let φλ λ∈#n be a 2 dx-orthonormal basis of n and as in Baraud (2001) define the quantities * # V= φ2λ xφ2λ xdx  B = φλ φλ ∞ λλ ∈#n ×#n  A

λλ ∈#n ×#n

and for any symmetric matrix A = Aλλ ,  ρA ¯ = sup aλ aλ Aλλ



aλ 

λ

a2λ =1 λλ

We set (9.4)

Lφ = maxρ¯ 2 V ρB

¯

Then, to complete the proof of Proposition 7, it remains to check that n Lφ ≤ K (9.5)  8n lnn for some constant K independent of n (we shall show the result for K = !41 ). Under (Hn ), Lemma 2 in Section 10 ensures that Lφ ≤ !41 Dn 

871

ADAPTIVE ESTIMATION IN AUTOREGRESSION

which together with (4.1) leads to (9.5). Now we prove inequality (9.3). First note that if ρ > 1,  t2µ −νn t2  1 sup ≥1−  ≥ ρ ⇔ sup 2 2 tµ ρ t∈n /0 tn t∈n /0 where νn u = 1/n cess. Then for ρ > 1,  ∗



n



i=1 uXi  − Ɛµ u

sup

t2µ

2 t∈n /0 tn

denotes the centered empirical pro-

 ∗

≥ρ ≤

sup νn t2  ≥ 1 −

µ t∈Bn 01

1 ρ

where we denote by ∗ A the probability A ∩ B∗ , and by Bnµ 0 1 = t ∈ n  tµ ≤ 1.   µ For t ∈ Bn 0 1, t = λ∈#n aλ φλ with λ∈#n a2λ ≤ h−1 0 , and we have        sup νn t2  ≤ sup h−1 a a ν φ φ 

  λ λ n λ λ 0  2 µ  

2 t∈Bn 01 a ≤1 λλ ∈# λ λ ≤ sup 

λ

a2λ ≤1

h−1 0



n

λλ ∈#n2

aλ aλ νn φλ φλ 

Let x  = h20 1 − 1/ρ2 /16h1 Lφ. Then on the set ∀λ λ  ∈ #n2 /νn φλ φλ  ≤ 2Vλλ 2h1 x + 2Bλλ x, we have   sup νn t2  ≤ 2h−1 2h1 xρV ¯ + xρB ¯ 0 µ

t∈Bn 01



1/2

2 1 ¯ h 1 − 1/ρ ρB ρ¯ V ≤ 1 − 1/ρ √ + 0 8h1 Lφ 2 Lφ

 1 1 ≤ 1 − 1/ρ √ + ≤ 1 − 1/ρ

2 8 The proof of inequality (9.3) is then achieved by using the following claim. Claim 6. Let φλ λ∈#n be an 2 A dx basis of n . Then, for all x ≥ 0 and all integers q, 1 ≤ q ≤ n, 

   nx ∗ ∃λ λ  ∈ #n2 / νn φλ φλ  > 2Vλλ 2h1 x + 2Bλλ x ≤ 2D2n exp −

qn This implies that



Bρc ∩ B∗  ≤ 2D2n exp − and thus inequality (9.3) holds true.

n h20 1 − 1/ρ2  16h1 qn Lφ
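For a symmetric matrix with nonnegative entries, which is the case of $\bar V$ and $\bar B$, the quantity $\bar\rho(A)$ coincides with the largest (Perron) eigenvalue, since the supremum may be restricted to nonnegative vectors. A small illustration on a toy band matrix (the matrix is hypothetical, mimicking a localized basis with few overlaps):

```python
# rho_bar(A) = sup_{sum a^2 = 1} sum |a_l||a_l'| A_{ll'} equals the Perron
# eigenvalue of an entrywise-nonnegative symmetric A; computed here by a
# plain power iteration with sup-norm normalization.
def rho_bar(A, iters=200):
    d = len(A)
    v = [1.0] * d
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam

# toy overlap matrix: tridiagonal, as for a basis where J(lambda) only
# contains nearest neighbours
A = [[2.0 if i == j else (1.0 if abs(i - j) == 1 else 0.0) for j in range(6)]
     for i in range(6)]
print(round(rho_bar(A), 3))  # largest eigenvalue of tridiag(1, 2, 1), size 6
```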


Proof of Claim 6. Let $\nu_n^*(\varphi_\lambda\varphi_{\lambda'})=\nu_{n,1}^*(\varphi_\lambda\varphi_{\lambda'})+\nu_{n,2}^*(\varphi_\lambda\varphi_{\lambda'})$ be defined by
\[
\nu_{n,k}^*(\varphi_\lambda\varphi_{\lambda'})=\frac1{2p_n}\sum_{\ell=0}^{p_n-1}Z^*_{\ell,k}(\varphi_\lambda,\varphi_{\lambda'}),\qquad k=1,2,
\]
where, for $0\le\ell\le p_n-1$,
\[
Z^*_{\ell,k}(\varphi_\lambda,\varphi_{\lambda'})=\frac1{q_{n,k}}\sum_{i\in I^k_{\ell+1}}\bigl[\varphi_\lambda(\vec X_i^*)\varphi_{\lambda'}(\vec X_i^*)-\mathbb E_\mu(\varphi_\lambda\varphi_{\lambda'})\bigr],\qquad k=1,2.
\]
We have
\[
\mathbb P^*\bigl(|\nu_n(\varphi_\lambda\varphi_{\lambda'})|>2\bar V_{\lambda\lambda'}\sqrt{2h_1x}+2\bar B_{\lambda\lambda'}x\bigr)
\le \mathbb P^*\bigl(|\nu_{n,1}^*(\varphi_\lambda\varphi_{\lambda'})|>\bar V_{\lambda\lambda'}\sqrt{2h_1x}+\bar B_{\lambda\lambda'}x\bigr)
+\mathbb P^*\bigl(|\nu_{n,2}^*(\varphi_\lambda\varphi_{\lambda'})|>\bar V_{\lambda\lambda'}\sqrt{2h_1x}+\bar B_{\lambda\lambda'}x\bigr)
=\mathbb P_1+\mathbb P_2.
\]
Now we bound $\mathbb P_1$ and $\mathbb P_2$ by using Bernstein's inequality [see Lemma 8, page 366, in Birgé and Massart (1998)] applied to the independent variables $Z^*_{\ell,k}$, which satisfy $\|Z^*_{\ell,k}\|_\infty\le \bar B_{\lambda\lambda'}$ and $\mathbb E^{1/2}\bigl[(Z^*_{\ell,k})^2\bigr]\le\sqrt{h_1}\,\bar V_{\lambda\lambda'}$. Then we obtain $\mathbb P_1+\mathbb P_2\le 2\exp(-xp_n)$, which proves Claim 6, recalling that $p_nq_n=n$. ✷

10. Constraints on the dimension of $\mathcal S_n$. Most elements of the following proof can be found in Baraud (2001), but we recall them for the paper to be self-contained. Let $\mathcal S_n$ be the linear subspace defined at the beginning of Section 4. We recall that $\mathcal S_n$ is generated by an orthonormal basis $(\varphi_\lambda)_{\lambda\in\Lambda_n}$ and that $D_n=|\Lambda_n|$. In the previous section the conditions on $\mathcal S_n$ [given by (H$_n$)] and on $D_n$ [given by (4.1)] are used to prove (9.5). To obtain (9.5) we proceed in two steps: first, under some particular characteristics of the basis $(\varphi_\lambda)_{\lambda\in\Lambda_n}$ [in the case of Theorem 1 these characteristics are given by (H$_n$)], we state an upper bound on $L(\varphi)$ depending on $\Phi_1$ (or $\Phi_0$) and $D_n$. Second, starting from this bound, we specify a constraint on $D_n$ for (9.5) to hold. In the next lemma we consider various cases of linear spaces $\mathcal S_n$ (including those considered in Theorem 1) and provide upper bounds on $L(\varphi)$ according to the characteristics of one of their orthonormal bases.

Lemma 2.

Let $L(\varphi)$ be the quantity defined by (9.4).

1. If $\mathcal S_n$ satisfies (2.2), then $L(\varphi)\le\Phi_0^2D_n^2$.
2. Under (H$_n$), $L(\varphi)\le\Phi_1^4D_n$. Moreover, (2.2) holds true with $\Phi_0^2=\Phi_1^3$.

We obtain from 1 and 2 that the constraints on $D_n$ given by (4.6) and (4.1) lead to (9.5).


Proof of 1. On the one hand, by the Cauchy–Schwarz inequality we have
\[
\bar\rho^2(\bar V)\le\sum_{(\lambda,\lambda')\in\Lambda_n^2}\int\varphi_\lambda^2\varphi_{\lambda'}^2\,dx
=\int\Bigl(\sum_{\lambda\in\Lambda_n}\varphi_\lambda^2\Bigr)^2dx
\le\Bigl\|\sum_{\lambda'\in\Lambda_n}\varphi_{\lambda'}^2\Bigr\|_\infty\sum_{\lambda\in\Lambda_n}\int\varphi_\lambda^2\,dx
\le\Phi_0^2D_n^2,
\]
using (2.4). On the other hand, by (2.2) we know that $\|\varphi_\lambda\|_\infty\le\Phi_0\sqrt{D_n}$. Thus, using similar arguments, one gets $\bar\rho(\bar B)\le\Phi_0^2D_n^2$, which leads to $L(\varphi)\le\Phi_0^2D_n^2$. ✷

Proof of 2.

φ2λ x ≤ !1 φλ 2∞ ≤ !31 Dn

λ∈#n

thus, (2.4) holds true with !20 = !31 . Under (Hn ), Jλ = λ ∈ #n / φλ φλ ≡ 0 satisfies Jλ ≤ !1 and # ∀λ ∈ #n  ∀λ ∈ Jλ φ2λ φ2λ ≤ !21 Dn

Therefore, ρV ¯ =



aλ λ 

≤ =

 

sup



λ

!21 Dn

a2λ =1 λ λ ∈Jλ



sup aλ λ 



λ

#

aλ aλ

a2λ =1 λ



φ2λ φ2λ

 λ ∈Jλ

1/2



!21 Dn Wn

Besides, ∀λ ∈ #n  ∀λ ∈ Jλ, φλ φλ ∞ ≤ !21 Dn and thus ρB ¯ = sup aλ aλ φλ φλ ∞ ≤ !21 Dn Wn



a2λ =1

λ

Finally, W2n

≤ sup 



2 λ aλ =1 λ∈#n

= !1 sup 

≤ !21

2 λ aλ =1



 λ ∈Jλ

2







λ ∈#

λ∈Jλ 

n

≤ !1 sup 



2

λ aλ =1 λ∈#n λ ∈Jλ

a2λ = !1 sup 



2 λ aλ =1



λ ∈#

a2λ

Jλ  a2λ n

874

Y. BARAUD, F. COMTE AND G. VIENNET

 In other words, ρV ¯ ≤ !21 Dn and ρB ¯ ≤ !31 Dn , which gives the bound 4 Lφ ≤ !1 Dn since !1 ≥ 1. ✷ Acknowledgments. The authors are deeply grateful to Lucien Birg´e for numbers of constructive suggestions and thank Pascal Massart for helpful comments. REFERENCES Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory (P. N. Petrov and F. Csaki, eds.) 267–281. Akademia Kiado, Budapest. Akaike, H. (1984). A new look at the statistical model identification. IEEE Trans. Automatic Control 19 716–723. Baraud, Y. (1998). S´election de mod`eles et estimation adaptative dans diff´erents cadres de r´egression. Ph.D. thesis, Univ. Paris-Sud. Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Related Fields 117 467–493. Baraud, Y. (2001). Model selection for regression on a random design. Preprint 01-10, DMA, Ecole Normale Sup´erieure, Paris. Barron, A. R. (1991). Complexity regularization with application to artificial neural networks. In Proceedings of the NATO Advanced Study Institute on Nonparametric Functional Estimation (G. Roussas, ed.) 561–576. Kluwer, Dordrecht. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function processes. IEEE Trans. Inform. Theory 39 930–945. ´ L. and Massart, P. (1999). Risks bounds for model selection via penalization. Barron, A., Birge, Probab. Theory Related Fields 113 301–413. Barron, A. R. and Cover, T. M. (1991). Minimum complexity density estimation. IEEE Trans. Inform. Theory 37 1034–1054. Berbee, H. C. P. (1979). Random walks with stationary increments and renewal theory. Math. Centre Tract 112. Math. Centrum, Amsterdam. ´ L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Birge, Lucien Lecam: Research Papers in Probability and Statistics (D. Pollard, E. Torgensen and G. Yangs, eds.) 55–87. Springer, New York. ´ L. 
and Massart, P. (1998). Exponential bounds for minimum contrast estimators on sieves. Birge, Bernoulli 4 329–375. Cohen, A., Daubechies, I. and Vial, P. (1993). Wavelet and fast wavelet transform on an interval. Appl. Comp. Harmon. Anal. 1 54–81. Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia. Devore, R. A. and Lorentz, C. G. (1993). Constructive Approximation. Springer, New York. Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. Ann. Statist. 26 879–921. Doukhan, P. (1994). Mixing properties and Examples. Springer, New York. Doukhan, P., Massart, P. and Rio, E. (1995). Invariance principle for absolutely regular empirical processes. Ann. Inst. H. Poincar´e Probab. Statist. 31 393–427. Duflo, M. (1997). Random Iterative Models. Springer, New-York. ´ E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12 929– Gine, 989. Hoffmann, M. (1999). On nonparametric estimation in nonlinear AR(1)-models. Statist. Probab. Lett. 44 29–45. Kolmogorov, A. R. and Rozanov, Y. A. (1960). On the strong mixing conditions for stationary gaussian sequences. Theor. Probab. Appl. 5 204–207. Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer, New York.

ADAPTIVE ESTIMATION IN AUTOREGRESSION

875

Li, K. C. (1987). Asymptotic optimality for Cp , Cl cross-validation and genralized cross-validation: discrete index set. Ann. Statist. 15 958–975. Mallows, C. L. (1973). Some comments on Cp . Technometrics 15 661–675. Modha, D. S. and Masry, E. (1996) Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inform. Theory 42 2133–2145. Modha, D. S. and Masry, E. (1998). Memory-universal prediction of stationary random processes. IEEE Trans. Inform. Theory 44 117-133. Neumann, M. and Kreiss, J.-P. (1998). Regression-type inference in nonparametric autoregression. Ann. Statist. 26 1570–1613. Pham, D. T. and Tran, L. T. (1985). Some mixing properties of time series models. Stochastic Process. Appl. 19 297–303. Polyak, B. T. and Tsybakov, A. (1992). A family of asymptotically optimal methods for choosing the order of a projective regression estimate. Theory Probab. Appl. 37 471–481. Rissanen, J. (1984). Universal coding, information, prediction and estimation. IEEE Trans. Inform. Theory 30 629–636. Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika 63 117-126. Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68 45–54. Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505– 563. Viennet, G. (1997). Inequalities for absolutely regular processes: application to density estimation. Probab. Theory Related Fields 107 467–492. Y. Baraud ´ Ecole Normale Superieure DMA 45 rue d’Ulm 75230 Paris Cedex 05 France E-mail: [email protected]

F. Comte ´ Laboratoire de Probabilites ` ´ et Modeles Aleatoires Boite 188 Universite´ Paris 6 4, place Jussieu 75252 Paris Cedex 05 France G. Viennet ´ Laboratoire de Probabilites ` ´ et Modeles Aleatoires Boite 7012 Universite´ Paris 7 2, place Jussieu 75251 Paris Cedex 05 France

Suggest Documents