## Non-Linear Regression

Author: Shannon Simon
You have seen models that are linear in the parameters. For example:

- A model which is linear in the regressors and linear in the parameters,
  $$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i.$$
- A model which is linear in the parameters and nonlinear in the regressors,
  $$Y_i = \beta_0 + \beta_1 X_{1i} X_{2i} + \beta_2 X_{1i}^2 + \epsilon_i.$$

What happens if the model is non-linear in the parameters? For example,
$$Y_i = \alpha X_{1i}^{\beta_1} X_{2i}^{\beta_2} e^{\epsilon_i}.$$

- We could consider re-parameterizing the model as
  $$\log(Y_i) = \log(\alpha) + \beta_1 \log(X_{1i}) + \beta_2 \log(X_{2i}) + \epsilon_i.$$
  Note that this model is equivalent to the nonlinear model above.
- We could approximate a nonlinear model by a linear model on a small interval.


### Nonlinear Models, Cont.

However, we will sometimes need to consider nonlinear models because:

- Many models cannot be made linear using a transformation:
  $$Y_i = \alpha X_{1i}^{\beta_1} X_{2i}^{\beta_2} + \epsilon_i.$$

- We don't wish to re-parameterize because there is a "natural" or "theoretical" parameterization. For example, the $\beta$s make physical sense, but the transformed $\beta$s do not.
- Some models are not well approximated by any simple linear model. For instance,
  $$Y_i = \alpha / \{1 + \exp(-\beta_0 - \beta_1 X_i)\}.$$

[Figure: logistic curve, $y$ versus $x$]

  - We cannot approximate this with a polynomial, since polynomials go to $\pm\infty$.
  - No simple linear model fits "well".

### Models

We will consider intrinsically nonlinear models of the form
$$Y_i = f(x_i'; \theta) + \epsilon_i,$$
where $f(x_i'; \theta)$ is a nonlinear function relating $E(Y)$ to the independent variables $x_i$. Here,

- $x_i$ is a $k \times 1$ vector of independent variables (fixed).
- $\theta$ is a $p \times 1$ vector of parameters.
- The $\epsilon_i$ are iid random variables with mean 0 and variance $\sigma^2$. Note that these will sometimes be normal, but sometimes not.

Often, there is scientific information to suggest the form of the model. Some examples are:

- Exponential growth model: $Y_i = \theta_1 \exp(\theta_2 t_i) + \epsilon_i$.
- Gompertz growth model: $Y_i = \theta_1 \exp\{-\theta_2 \exp(\theta_3 t_i)\} + \epsilon_i$.
- Weibull model: $Y_i = \theta_1 \exp\{-(X_i/\theta_2)^{\theta_3}\} + \epsilon_i$.
- Segmented polynomial model:
  $$Y_i = \begin{cases} \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i, & X_i \le \alpha, \\ \gamma_0 + \gamma_1 X_i + \epsilon_i, & X_i > \alpha. \end{cases}$$
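The example mean functions above can be written as short Python sketches (the function names and argument orders are my own, not from the notes):

```python
import numpy as np

# Sketches of the example mean functions f(x; theta) listed above.
# Names and argument orders are illustrative, not from the notes.

def exponential_growth(t, th1, th2):
    return th1 * np.exp(th2 * t)

def gompertz(t, th1, th2, th3):
    return th1 * np.exp(-th2 * np.exp(th3 * t))

def weibull(x, th1, th2, th3):
    return th1 * np.exp(-((x / th2) ** th3))

def segmented(x, b0, b1, b2, g0, g1, a):
    # quadratic below the change point a, linear above it
    return np.where(x <= a, b0 + b1 * x + b2 * x**2, g0 + g1 * x)
```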

### Fitting Nonlinear Regression Models

Least Squares:

- The LS estimate of $\theta$, $\hat\theta$, is the set of parameters that minimizes the sum of squared residuals:
  $$SSR(\hat\theta) = \sum_{i=1}^n \{Y_i - f(x_i'; \hat\theta)\}^2.$$
- We could equivalently consider this from a matrix perspective by writing
  $$SSR(\hat\theta) = [Y - f(\hat\theta)]'[Y - f(\hat\theta)],$$
  where $f(\hat\theta) = [f(x_1'; \hat\theta), f(x_2'; \hat\theta), \cdots, f(x_n'; \hat\theta)]'$ and $Y = [Y_1, Y_2, \cdots, Y_n]'$.
- To obtain the normal equations we take the partial derivative of $SSR(\hat\theta)$ with respect to each $\hat\theta_j$ and set it equal to zero. This gives us a system of $p$ equations, each of the form
  $$\frac{\partial SSR(\hat\theta)}{\partial \hat\theta_j} = -2 \sum_{i=1}^n \{Y_i - f(x_i'; \hat\theta)\} \left[\frac{\partial f(x_i'; \hat\theta)}{\partial \hat\theta_j}\right] = 0.$$

Unfortunately, for nonlinear models the partial derivatives are themselves functions of the parameters. Thus, the normal equations are nonlinear and an explicit solution for $\hat\theta$ cannot be obtained.

### Simple Example

Exponential Growth Model: Recall that this model has the form $Y_i = \theta_1 \exp(\theta_2 t_i) + \epsilon_i$. The derivatives are
$$\frac{\partial f}{\partial \theta_1} = \exp(\theta_2 t_i) \quad\text{and}\quad \frac{\partial f}{\partial \theta_2} = \theta_1 t_i \exp(\theta_2 t_i).$$
So, the normal equations are
$$\sum_{i=1}^n [Y_i - \hat\theta_1 \exp(\hat\theta_2 t_i)][\exp(\hat\theta_2 t_i)] = 0, \quad\text{and}$$
$$\sum_{i=1}^n [Y_i - \hat\theta_1 \exp(\hat\theta_2 t_i)][\hat\theta_1 t_i \exp(\hat\theta_2 t_i)] = 0.$$

Even for this simple example, there is no analytic solution for $\hat\theta$. We will have to consider numerical methods.

Numerical Methods for Solving Nonlinear LS:

Grid Search:

- Consider a "grid" of possible parameter values and find those that minimize the SSR.
- Use progressively finer grids to achieve accuracy.
- Inefficient, and not practical when $p$ is large.
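A minimal grid search for the exponential growth model might look like the following sketch (the grid ranges and simulated data are assumptions for illustration):

```python
import numpy as np

# Grid search for Y = th1*exp(th2*t) + eps.
# Grid ranges and simulated data are assumptions for illustration.
def ssr(th1, th2, t, y):
    return np.sum((y - th1 * np.exp(th2 * t)) ** 2)

rng = np.random.default_rng(1)
t = np.linspace(0, 2, 50)
y = 3.0 * np.exp(0.8 * t) + rng.normal(0, 0.1, t.size)

grid1 = np.linspace(1, 5, 41)   # candidate th1 values (step 0.1)
grid2 = np.linspace(0, 2, 41)   # candidate th2 values (step 0.05)
best = min((ssr(a, b, t, y), a, b) for a in grid1 for b in grid2)
_, th1_hat, th2_hat = best
# A second, finer grid centered at (th1_hat, th2_hat) would refine the answer.
```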

### Numerical Solutions to Nonlinear LS, Cont.

Gauss-Newton Algorithm:

- Suppose that we have $\hat\theta^{(0)}$, an initial estimate of $\theta$.
- Consider a Taylor expansion of $f(x_i'; \theta)$, as a function of $\theta$, about the point $\hat\theta^{(0)}$:
  $$Y_i = f(x_i'; \theta) + \epsilon_i = f(x_i'; \hat\theta^{(0)}) + \sum_{j=1}^p \left\{\frac{\partial f(x_i'; \hat\theta^{(0)})}{\partial \theta_j}\right\} (\theta_j - \hat\theta_j^{(0)}) + \text{remainder} + \epsilon_i.$$
- Let
  $$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \quad f(\hat\theta^{(0)}) = \begin{bmatrix} f(x_1'; \hat\theta^{(0)}) \\ \vdots \\ f(x_n'; \hat\theta^{(0)}) \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix},$$
  and
  $$F(\hat\theta^{(0)}) = \begin{bmatrix} \frac{\partial f(x_1'; \hat\theta^{(0)})}{\partial \theta_1} & \cdots & \frac{\partial f(x_1'; \hat\theta^{(0)})}{\partial \theta_p} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(x_n'; \hat\theta^{(0)})}{\partial \theta_1} & \cdots & \frac{\partial f(x_n'; \hat\theta^{(0)})}{\partial \theta_p} \end{bmatrix}.$$
- So we can write
  $$Y - f(\hat\theta^{(0)}) = F(\hat\theta^{(0)})(\theta - \hat\theta^{(0)}) + \epsilon + \text{remainder}.$$
- We will assume that the remainder is "small".

### Gauss-Newton Algorithm, Cont.

- We could then solve for $\delta = \theta - \hat\theta^{(0)}$ using ordinary least squares:
  $$\hat\delta^{(j)} = \{F(\hat\theta^{(j-1)})' F(\hat\theta^{(j-1)})\}^{-1} F(\hat\theta^{(j-1)})' [Y - f(\hat\theta^{(j-1)})].$$
- Finally, since $\delta = \theta - \hat\theta^{(0)}$, we could solve for $\hat\theta^{(1)} = \hat\theta^{(0)} + \hat\delta^{(1)}$. We can view this as finding $\hat\theta^{(1)}$, the new estimate of $\theta$.
- This new estimate, $\hat\theta^{(1)}$, can then be used to find another new estimate, $\hat\theta^{(2)}$, and so on.

Let's find the values of $f(\hat\theta^{(0)})$ and $F(\hat\theta^{(0)})$ for the function $f(x_i; \theta_1, \theta_2) = \theta_1 \exp(\theta_2 x_i)$, where there are three observations, $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$.


### Gauss-Newton Algorithm, Cont.

The Gauss-Newton algorithm proceeds according to the following steps:

1. Obtain an initial estimate.
2. Perform the Taylor series expansion, and calculate $f(\hat\theta^{(j)})$ and $F(\hat\theta^{(j)})$.
3. Use OLS to find $\hat\delta^{(j+1)}$, the adjustment to the current estimate.
4. Find a new estimate of $\theta$. Use this estimate to return to step 2.
5. Continue until
   (a) there is only a minor change in the objective function,
   $$\frac{|SSR(\hat\theta^{(j+1)}) - SSR(\hat\theta^{(j)})|}{SSR(\hat\theta^{(j)})} < \gamma_1,$$
   and
   (b) there is only a minor change in the parameter estimates,
   $$|\hat\theta^{(j+1)} - \hat\theta^{(j)}| < \gamma_2.$$
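Steps 1–5 can be sketched for the exponential growth model as follows (the starting values, tolerance, and simulated data are my own choices, not from the notes):

```python
import numpy as np

# Gauss-Newton for Y = th1*exp(th2*t) + eps, following steps 1-5.
# Starting values, tolerance, and simulated data are illustrative choices.
def gauss_newton(t, y, theta0, tol=1e-10, max_iter=50):
    th = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        f = th[0] * np.exp(th[1] * t)                        # f(theta_hat)
        F = np.column_stack([np.exp(th[1] * t),              # df/d th1
                             th[0] * t * np.exp(th[1] * t)]) # df/d th2
        delta = np.linalg.solve(F.T @ F, F.T @ (y - f))      # OLS adjustment
        th = th + delta
        if np.max(np.abs(delta)) < tol:                      # minor change: stop
            break
    return th

rng = np.random.default_rng(2)
t = np.linspace(0, 2, 60)
y = 3.0 * np.exp(0.8 * t) + rng.normal(0, 0.05, t.size)
th_hat = gauss_newton(t, y, theta0=[2.5, 0.9])
```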

### Gauss-Newton Algorithm, Cont.

Notice that
$$\frac{\partial SSR(\theta)}{\partial \theta} = -2 F(\theta)'[Y - f(\theta)].$$
Thus,
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \hat\delta^{(j+1)} = \hat\theta^{(j)} + [F(\hat\theta^{(j)})' F(\hat\theta^{(j)})]^{-1} F(\hat\theta^{(j)})'[Y - f(\hat\theta^{(j)})] = \hat\theta^{(j)} - \frac{1}{2}[F(\hat\theta^{(j)})' F(\hat\theta^{(j)})]^{-1} \frac{\partial SSR(\hat\theta^{(j)})}{\partial \theta}.$$

We can view the term $\partial SSR(\hat\theta^{(j)})/\partial \theta$ as a gradient vector, which points in the direction in which the SSR increases most rapidly. This is the path of steepest ascent. The term $[F(\hat\theta^{(j)})' F(\hat\theta^{(j)})]^{-1}$ tells us how far to move, and the $-1/2$ serves to move us in the direction of steepest descent. Let's consider this for a simple univariate case.

Note: there is no guarantee that $SSR(\hat\theta^{(j+1)}) < SSR(\hat\theta^{(j)})$. This means that we may not always get closer to the correct parameter value using this procedure.

### Modified Gauss-Newton Algorithm

We can avoid overstepping using the modified Gauss-Newton algorithm. In this case, we define a new proposal for $\theta$ as
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \alpha_j \hat\delta^{(j+1)}, \quad 0 < \alpha_j \le 1.$$
Here, the value of $\alpha_j$ is used to modify the step length. There are many different ways that we could choose $\alpha_j$. The simplest is step halving: set
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \frac{1}{2^k} \hat\delta^{(j+1)},$$
where $k$ is the smallest non-negative integer such that
$$SSR\left(\hat\theta^{(j)} + \frac{1}{2^k} \hat\delta^{(j+1)}\right) < SSR(\hat\theta^{(j)}).$$
That is, try $\hat\delta^{(j+1)}$, then $\hat\delta^{(j+1)}/2$, then $\hat\delta^{(j+1)}/4$, etc. SAS PROC NLIN uses this procedure for $k = 0, 1, \cdots, 10$.

Other Algorithms: In general, convergence algorithms have the form
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j A_j \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta},$$
where $A_j$ is some positive definite matrix and $\partial Q(\hat\theta^{(j)})/\partial \theta$ is the gradient, based upon some function $Q$, which is typically the SSR.
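The step-halving rule above can be sketched as follows (the model, data, and deliberately overlong step are illustrative assumptions):

```python
import numpy as np

# Step halving inside one modified Gauss-Newton update.
# SAS PROC NLIN tries k = 0..10; the same cap is used here.
def ssr(th, t, y):
    return np.sum((y - th[0] * np.exp(th[1] * t)) ** 2)

def halved_step(th, delta, t, y, kmax=10):
    for k in range(kmax + 1):
        cand = th + delta / 2**k
        if ssr(cand, t, y) < ssr(th, t, y):
            return cand              # first k that decreases the SSR
    return th                        # no acceptable step found

t = np.linspace(0, 2, 30)
y = 3.0 * np.exp(0.8 * t)            # noiseless, for illustration
th = np.array([1.0, 0.8])
delta = np.array([7.0, 0.0])         # deliberately overshoots th1
th_new = halved_step(th, delta, t, y)
# full step (th1 = 8.0) raises the SSR; the half step (th1 = 4.5) lowers it
```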

### Other Algorithms, Cont.

The modified Gauss-Newton algorithm satisfies this form with $A_j = [F(\hat\theta^{(j)})' F(\hat\theta^{(j)})]^{-1}$ and $Q = SSR$:
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j [F(\hat\theta^{(j)})' F(\hat\theta^{(j)})]^{-1} \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta}.$$

Steepest Descent: The steepest descent algorithm proceeds according to
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j I_{p \times p} \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta}.$$
This algorithm is slow to converge, but it moves rapidly initially. Although it is seldom used anymore, it can be useful when starting values are not well known.

Levenberg-Marquardt: The Levenberg-Marquardt algorithm has the form
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j \left[F(\hat\theta^{(j)})' F(\hat\theta^{(j)}) + \tau I_{p \times p}\right]^{-1} \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta}.$$
This method is a compromise between the Gauss-Newton and steepest descent methods. It performs best when $F(\hat\theta^{(j)})' F(\hat\theta^{(j)})$ is nearly singular ($F(\hat\theta^{(j)})$ is not of full rank). This is essentially a ridge regression. How do we choose $\tau$? SAS starts with $\tau = 10^{-3}$. If $SSR(\hat\theta^{(j+1)}) < SSR(\hat\theta^{(j)})$, then $\tau$ is divided by 10 for the next iteration; otherwise, $\tau$ is multiplied by 10.

### Other Algorithms, Cont.

Newton-Raphson: This method (also called Newton's method) has the form
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j \left[\frac{\partial^2 Q(\hat\theta^{(j)})}{\partial\theta\,\partial\theta'}\right]^{-1} \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta}.$$
Note that
$$\frac{\partial^2 Q(\hat\theta^{(j)})}{\partial\theta\,\partial\theta'} = 2 F(\hat\theta^{(j)})' F(\hat\theta^{(j)}) - 2 \sum_{i=1}^n [Y_i - f(x_i; \theta)] \frac{\partial^2 f(x_i; \theta)}{\partial\theta\,\partial\theta'}.$$

This contains the same term that Gauss-Newton uses, combined with one containing the second partial derivatives of $f(\cdot)$. There is no guarantee that the Hessian, $\partial^2 Q(\hat\theta^{(j)})/\partial\theta\,\partial\theta'$, is nonsingular. Additionally, we must supply the second partial derivatives to the computer (and they can sometimes be very difficult to calculate). Let's do this for $f(x_i; \theta_1, \theta_2) = \theta_1 \exp(\theta_2 x_i)$.


### Other Algorithms, Cont.

Quasi-Newton: For the quasi-Newton method, we update $\theta$ according to
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} - \alpha_j H_j^{-1} \frac{\partial Q(\hat\theta^{(j)})}{\partial \theta},$$
where $H_j$ is a positive definite approximation of the Hessian, which gets closer as $j \to \infty$. $H_j$ is computed iteratively, and this method is best among first-order methods (only first derivatives are required).

Derivative-Free Methods: For example, the secant method is like the Gauss-Newton method, but it calculates the derivatives numerically, from past iterations. These methods generally work well.

Practical Considerations: There are several practical considerations that you must be aware of when using these techniques.

- Failure to converge:
  - $SSR(\theta)$ may be "flat" in a neighborhood of the minimum.
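The derivative-free idea can be sketched by replacing the analytic Jacobian in Gauss-Newton with forward differences (a sketch of the general idea, not the exact secant scheme used by any particular package):

```python
import numpy as np

# Gauss-Newton with a forward-difference Jacobian in place of analytic
# derivatives -- a sketch of the derivative-free idea, not the exact
# secant scheme used by any particular package.
def num_jacobian(f, th, h=1e-6):
    f0 = f(th)
    J = np.empty((f0.size, th.size))
    for j in range(th.size):
        step = np.zeros_like(th)
        step[j] = h
        J[:, j] = (f(th + step) - f0) / h   # forward difference in th_j
    return J

t = np.linspace(0, 2, 60)
y = 3.0 * np.exp(0.8 * t)                   # noiseless, for illustration
model = lambda th: th[0] * np.exp(th[1] * t)

th = np.array([2.5, 0.9])
for _ in range(30):
    F = num_jacobian(model, th)
    th = th + np.linalg.solve(F.T @ F, F.T @ (y - model(th)))
```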

### Practical Considerations, Cont.

- Failure to converge:
  - Using better starting values can alleviate this problem.
  - This may also suggest that the model is too complex for the data, and a simpler model should be considered.
- Convergence to a local minimum:
  - Linear least squares has the property that $SSR(\beta) = (Y - X\beta)'(Y - X\beta)$, which is quadratic and therefore has a unique minimum.
  - Nonlinear least squares need not have a unique minimum.
  - Using multiple starting values can identify this problem if it exists.
  - If the dimension of $\theta$ is low, graph $SSR(\theta)$ as a function of $\theta_i$.
- Starting values:
  - To converge, algorithms need good initial estimates.
  - Prior or theoretical information can be used.

### Practical Considerations, Cont.

- Starting values, cont.:
  - A grid search or a graph of $SSR(\theta)$ can provide initial estimates if no prior or theoretical information is available.
  - Use OLS to get starting values. For example, consider the model
    $$Y_i = \theta_0 x_{1i}^{\theta_1} x_{2i}^{\theta_2} + \epsilon_i.$$
    We could approximate this by
    $$\log Y_i \approx \log\theta_0 + \theta_1 \log x_{1i} + \theta_2 \log x_{2i} + \text{error}.$$
    We can use OLS to estimate $\widehat{\log\theta_0}$, $\hat\theta_1$, and $\hat\theta_2$ from the above model and use them as starting values.
  - Model interpretation: look at the slope, inflection points, asymptotes, or behavior at special design points, such as 0. For example, consider the logistic curve,
    $$f(x; \theta) = \frac{\theta_0}{1 + \exp(-\theta_1 - \theta_2 x)}.$$
    Notice that for this model, the maximum value that $Y$ can take is $\theta_0$. Thus, we could set $\hat\theta_0^{(0)} = Y_{\max}$.
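The OLS starting-value trick for $Y_i = \theta_0 x_{1i}^{\theta_1} x_{2i}^{\theta_2} + \epsilon_i$ can be sketched as follows (simulated data; the approximation treats the additive error as if it were multiplicative):

```python
import numpy as np

# OLS starting values for Y = th0 * x1^th1 * x2^th2 + eps via the log
# scale. Simulated data; the approximation treats the additive error
# as if it were multiplicative.
rng = np.random.default_rng(3)
n = 150
x1 = rng.uniform(1, 4, n)
x2 = rng.uniform(1, 4, n)
y = 1.5 * x1**0.6 * x2**1.2 + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
b, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
start = np.array([np.exp(b[0]), b[1], b[2]])   # (th0, th1, th2) starting values
```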

### Practical Considerations, Cont.

- Starting values, cont.:
  - Model interpretation, cont.: Furthermore, when $x = 0$, then $Y = \theta_0/\{1 + \exp(-\theta_1)\}$. Thus, an estimate of $\theta_1$ can be obtained by solving
    $$\frac{\hat\theta_0^{(0)}}{1 + \exp(-\hat\theta_1^{(0)})} = Y(0).$$
    Finally, notice that $Y$ takes on the value $\theta_0/2$ when $x = -\theta_1/\theta_2$. Thus, we can find the $x$ value which corresponds to the "average" $Y$, and use it to solve
    $$-\frac{\hat\theta_1^{(0)}}{\hat\theta_2^{(0)}} = x.$$
    We could then use $\hat\theta_0^{(0)}$, $\hat\theta_1^{(0)}$, and $\hat\theta_2^{(0)}$ as starting values.
- Constrained parameters: often we have constraints on parameters, like $\theta_i > a$ or $a < \theta_i < b$.
  - Try fitting the model ignoring the constraints and see if the converged parameter estimates satisfy the constraints. If the results satisfy the constraints, no further action is needed.
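Reading the logistic starting values off a data set might look like this sketch (simulated data; the true parameter values are assumptions for illustration):

```python
import numpy as np

# Starting values for f(x) = th0 / (1 + exp(-th1 - th2*x)) read off the
# data as described above. Simulated data with (th0, th1, th2) = (5, -2, 1).
rng = np.random.default_rng(4)
x = np.linspace(0, 8, 81)
y = 5.0 / (1 + np.exp(2.0 - x)) + rng.normal(0, 0.05, x.size)

th0_0 = y.max()                                # asymptote: largest observed Y
y_at_0 = y[np.argmin(np.abs(x))]               # observed Y nearest x = 0
th1_0 = -np.log(th0_0 / y_at_0 - 1.0)          # solve th0/(1+exp(-th1)) = Y(0)
x_half = x[np.argmin(np.abs(y - th0_0 / 2))]   # x where Y is about th0/2
th2_0 = -th1_0 / x_half                        # from -th1/th2 = x_half
```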

### Practical Considerations, Cont.

- Constrained parameters, cont.:
  - Try re-parameterizing. For example, for the constraint $\theta_i > a$ we could write
    $$\theta_i = a + (\theta_i - a) = a + \exp\{\log(\theta_i - a)\} = a + \exp(\alpha_i).$$
    Here $-\infty < \alpha_i < \infty$, so this parameterization is unconstrained. Or, for the constraint $a < \theta_i < b$ we could write
    $$\theta_i = a + \frac{b - a}{1 + \exp(-\gamma_i)}.$$
    Again, $-\infty < \gamma_i < \infty$, so this parameterization is unconstrained.
- Estimation of $\sigma^2$: by analogy to the linear case, we will use
  $$s^2 = \frac{\sum_{i=1}^n [Y_i - f(x_i; \hat\theta)]^2}{n - p},$$
  where $p$ is the number of parameters in $\theta$ (not the number of regressors in $x_i$).


### Inference in Nonlinear Models

Recall, $Y_i = f(x_i; \theta) + \epsilon_i$, where $\epsilon_i \sim \text{iid}(0, \sigma^2)$. We then obtain the $\hat\theta$ that minimizes $\sum_{i=1}^n \{Y_i - f(x_i; \theta)\}^2$ and estimate
$$s^2 = \frac{\sum_{i=1}^n \{Y_i - f(x_i; \hat\theta)\}^2}{n - p}.$$

Suppose that we wish to draw inference upon $\theta$ based upon our sample.

Linear Functions of the Parameters: First, consider the case where we are interested in a linear function of the parameters. If we assume that $\epsilon_i \sim N(0, \sigma^2)$, then we can show that
$$\hat\theta \sim AN(\theta, \sigma^2 [F(\theta)' F(\theta)]^{-1}),$$
where $AN$ indicates asymptotic normality. Then, applying one of the rules from Chapter A, for any linear function of the parameter estimates, $a'\hat\theta$,
$$a'\hat\theta \sim AN(a'\theta, \sigma^2 a'[F(\theta)' F(\theta)]^{-1} a).$$
We can use this distribution, jointly with the fact that $a'\hat\theta$ is asymptotically independent of $s^2$ (to order $1/n$), to construct an approximate statistic for large $n$:
$$\frac{a'\hat\theta - a'\theta}{s\,(a'[F(\theta)' F(\theta)]^{-1} a)^{1/2}} \,\dot\sim\, t_{n-p}.$$

### Linear Functions of Parameters, Cont.

Using this statistic, we can form an approximate $100(1-\alpha)\%$ confidence interval for $a'\theta$ as
$$a'\hat\theta \pm t_{(1-\alpha/2,\, n-p)}\, s\,(a'[F(\theta)' F(\theta)]^{-1} a)^{1/2}.$$
To evaluate this interval, we can substitute $\hat\theta$ into $F(\theta)$. As a specific case, suppose that $a' = (0, 0, \cdots, 0, 1, 0, \cdots, 0)$, where the $j$th element of $a$ is 1 and all other elements are 0. Then, a confidence interval for the $j$th element of $\theta$ is
$$\hat\theta_j \pm t_{(1-\alpha/2,\, n-p)}\, s\, \sqrt{\hat c_{jj}},$$
where $\hat c_{jj}$ is the $j$th diagonal element of $[F(\hat\theta)' F(\hat\theta)]^{-1}$.

Nonlinear Functions of the Parameters: More often, we are interested in nonlinear functions of $\theta$. Suppose that $h(\theta)$ is some nonlinear function of the parameters. Note that
$$h(\hat\theta) \approx h(\theta) + h'(\hat\theta - \theta),$$
where $h = (\partial h/\partial\theta_1, \cdots, \partial h/\partial\theta_p)'$ is the gradient of $h$. Then, $E\{h(\hat\theta)\} \approx h(\theta)$ and $\text{var}\{h(\hat\theta)\} \approx \sigma^2 h'[F(\theta)' F(\theta)]^{-1} h$. Thus,
$$h(\hat\theta) \sim AN(h(\theta), \sigma^2 h'[F(\theta)' F(\theta)]^{-1} h),$$
and an approximate $100(1-\alpha)\%$ confidence interval for $h(\theta)$ is
$$h(\hat\theta) \pm t_{(1-\alpha/2,\, n-p)}\, s\,(h'[F(\theta)' F(\theta)]^{-1} h)^{1/2}.$$
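A sketch of the per-parameter interval for the exponential growth model (simulated data; the true values stand in for LS estimates, and the normal quantile 1.96 stands in for the exact $t$ quantile to stay within numpy):

```python
import numpy as np

# Approximate 95% CIs for (th1, th2) in the exponential growth model.
# Simulated data; the true values stand in for LS estimates, and the
# normal quantile 1.96 stands in for the exact t quantile.
rng = np.random.default_rng(5)
t = np.linspace(0, 2, 80)
th = np.array([3.0, 0.8])
y = th[0] * np.exp(th[1] * t) + rng.normal(0, 0.1, t.size)

f = th[0] * np.exp(th[1] * t)
F = np.column_stack([np.exp(th[1] * t), th[0] * t * np.exp(th[1] * t)])
n, p = t.size, th.size
s2 = np.sum((y - f) ** 2) / (n - p)            # s^2 on n - p df
C = np.linalg.inv(F.T @ F)                     # [F'F]^{-1}, evaluated at th
half = 1.96 * np.sqrt(s2 * np.diag(C))         # s * sqrt(c_jj) half-widths
ci = np.column_stack([th - half, th + half])   # one row per parameter
```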

### Confidence Intervals for Prediction

Suppose that we are interested in a prediction interval for $Y$ at $x = x_0$. Notice that
$$y_0 = f(x_0; \theta) + \epsilon_0, \quad \epsilon_0 \sim N(0, \sigma^2), \quad\text{and}\quad \hat y_0 = f(x_0; \hat\theta).$$
When $n$ is large, $\hat\theta$ is close to $\theta$, so we will use the Taylor series expansion
$$f(x_0; \hat\theta) \approx f(x_0; \theta) + f_0(\theta)'(\hat\theta - \theta),$$
where $f_0(\theta) = (\partial f(x_0; \theta)/\partial\theta_1, \cdots, \partial f(x_0; \theta)/\partial\theta_p)'$. Then,
$$y_0 - \hat y_0 \approx y_0 - f(x_0; \theta) - f_0(\theta)'(\hat\theta - \theta) = \epsilon_0 - f_0(\theta)'(\hat\theta - \theta).$$
From this form, we can see that
$$E(y_0 - \hat y_0) \approx E(\epsilon_0) - f_0(\theta)' E(\hat\theta - \theta) = 0,$$
and, since $\epsilon_0$ is independent of $\hat\theta$,
$$\text{var}(y_0 - \hat y_0) \approx \text{var}(\epsilon_0) + \text{var}\{f_0(\theta)'(\hat\theta - \theta)\} = \sigma^2 + \sigma^2 f_0(\theta)'[F(\theta)' F(\theta)]^{-1} f_0(\theta) = \sigma^2\{1 + f_0(\theta)'[F(\theta)' F(\theta)]^{-1} f_0(\theta)\}.$$
Thus,
$$y_0 - \hat y_0 \sim AN(0, \sigma^2\{1 + f_0(\theta)'[F(\theta)' F(\theta)]^{-1} f_0(\theta)\}).$$
Now, since $s$ is asymptotically independent of $\hat\theta$ (and thus of $y_0 - \hat y_0$),

### Prediction Intervals, Cont.

$$\frac{y_0 - \hat y_0}{s\,\sqrt{1 + f_0(\theta)'[F(\theta)' F(\theta)]^{-1} f_0(\theta)}} \sim \text{asymptotic } t_{n-p}.$$

Therefore, an asymptotic $100(1-\alpha)\%$ prediction interval for $y_0$ has the form
$$\hat y_0 \pm t_{(1-\alpha/2,\, n-p)}\, s\,\{1 + f_0(\hat\theta)'[F(\hat\theta)' F(\hat\theta)]^{-1} f_0(\hat\theta)\}^{1/2}.$$

Wald Statistic: Recall from traditional linear models that the test of the general linear hypothesis $H_0: L\beta = 0$ vs. $H_a: L\beta \ne 0$, where $L_{q \times p}$ has rank $q$, can be formulated as
$$\frac{(L\hat\beta)'\{L(X'X)^{-1}L'\}^{-1}(L\hat\beta)}{s^2 q} \sim F_{q,\, n-p}$$
under the null hypothesis.

For a nonlinear model, suppose that there are $q$ nonlinear functions of the parameters in which we are interested: $h(\theta) = \{h_1(\theta), \cdots, h_q(\theta)\}'$. Define the Jacobian, $H(\theta)$, of rank $q$, as
$$H_{q \times p}(\theta) = \begin{bmatrix} \frac{\partial h_1(\theta)}{\partial\theta_1} & \cdots & \frac{\partial h_1(\theta)}{\partial\theta_p} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_q(\theta)}{\partial\theta_1} & \cdots & \frac{\partial h_q(\theta)}{\partial\theta_p} \end{bmatrix}.$$

### Wald Statistic, Cont.

Then, the null hypothesis $H_0: h(\theta) = 0$ can be tested against the two-sided alternative using the Wald statistic,
$$W = \frac{h(\hat\theta)'\{H(\hat\theta)[F(\hat\theta)' F(\hat\theta)]^{-1} H(\hat\theta)'\}^{-1} h(\hat\theta)}{s^2 q} \sim F_{q,\, n-p}$$
under the null hypothesis. Note: likelihood methods are often better (see, for example, Gallant (1987)).

Model/Estimation Adequacy: How appropriate are the linearization assumptions used to develop nonlinear estimation and inference? The answer depends upon the degree of nonlinearity. Bates and Watts (1980, JRSSB) assess nonlinearity in terms of two components of curvature:

1. Intrinsic nonlinearity: the degree of bending and twisting in $f(\theta)$.
2. Parameter-effects nonlinearity: the degree to which the curvature is affected by the choice of parameterization $\theta$.

Intrinsic nonlinearity:

- Hard to identify. Some methods are based upon differential geometry, and can be slow to converge.
- Solution: use higher-order Taylor expansions.
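A sketch of the Wald statistic for $H_0: \theta_2 = 0.8$ (so $q = 1$) in the exponential growth model (simulated data; the "estimates" are assumed values for illustration, not an actual LS fit):

```python
import numpy as np

# Wald statistic for H0: h(theta) = th2 - 0.8 = 0 (q = 1) in the
# exponential growth model. Simulated data; the "estimates" below are
# assumed values for illustration, not an actual LS fit.
rng = np.random.default_rng(7)
t = np.linspace(0, 2, 80)
y = 3.0 * np.exp(0.8 * t) + rng.normal(0, 0.1, t.size)

th = np.array([3.0, 0.85])                     # pretend LS estimates
F = np.column_stack([np.exp(th[1] * t), th[0] * t * np.exp(th[1] * t)])
s2 = np.sum((y - th[0] * np.exp(th[1] * t)) ** 2) / (t.size - th.size)
C = np.linalg.inv(F.T @ F)

h = np.array([th[1] - 0.8])                    # h(theta_hat), q = 1
H = np.array([[0.0, 1.0]])                     # Jacobian of h
q = 1
W = (h @ np.linalg.inv(H @ C @ H.T) @ h) / (s2 * q)
# compare W to the F(q, n - p) reference distribution
```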

### Model/Estimation Adequacy, Cont.

Parameter-effects nonlinearity:

- Hougaard's measure (1985, JRSSB; available in SAS) measures the skewness in the distribution of $\hat\theta_i$:

  | $\|HM\|$ | Interpretation |
  |---|---|
  | $\|HM\| < 0.1$ | $\hat\theta_i$ close to linear |
  | $0.1 < \|HM\| < 0.25$ | $\hat\theta_i$ reasonably close to linear |
  | $0.25 < \|HM\| < 1$ | $\hat\theta_i$ has very apparent skewness |
  | $\|HM\| > 1$ | $\hat\theta_i$ has considerable nonlinearity |

- Solution: try to reparameterize. For example, there is often parameter-effects nonlinearity in $E(Y_i) = f(x_i; \theta) = \exp(\theta x_i)$. However, if we let $\phi = \exp(\theta)$, then $E(Y_i) = \phi^{x_i}$ is better behaved.
- The text by Seber and Wild (1989) has more details.

Model Adequacy and Diagnostics:

- $R^2$ (regression SS over total SS): this diagnostic is not useful in the nonlinear case, as the regression and error sums of squares need not add to the total sum of squares.
- Residual plots: standardize, analogous to OLS. This procedure is useful if the intrinsic curvature is small. Use the studentized residuals,
  $$r_i = \frac{e_i}{s\sqrt{1 - \hat c_{ii}}},$$
  where $\hat c_{ii}$ is the $i$th diagonal element of the tangent-plane hat matrix $\hat H = F(\hat\theta)[F(\hat\theta)' F(\hat\theta)]^{-1} F(\hat\theta)'$.
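The studentized residuals can be sketched as follows (simulated data; the true parameter values stand in for LS estimates):

```python
import numpy as np

# Studentized residuals for the exponential growth model, using the
# diagonal of the tangent-plane hat matrix. Simulated data; the true
# parameter values stand in for LS estimates.
rng = np.random.default_rng(8)
t = np.linspace(0, 2, 60)
th = np.array([3.0, 0.8])
y = th[0] * np.exp(th[1] * t) + rng.normal(0, 0.1, t.size)

e = y - th[0] * np.exp(th[1] * t)              # raw residuals
F = np.column_stack([np.exp(th[1] * t), th[0] * t * np.exp(th[1] * t)])
Hhat = F @ np.linalg.inv(F.T @ F) @ F.T        # hat matrix F [F'F]^{-1} F'
s = np.sqrt(np.sum(e**2) / (t.size - th.size))
r = e / (s * np.sqrt(1 - np.diag(Hhat)))       # studentized residuals
```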

### Model Adequacy and Diagnostics, Cont.

Possible problems and solutions:

- Collinearity: the condition number of $[F(\hat\theta)' F(\hat\theta)]^{-1}$ should be less than 30. See Magel and Hertsgaard (1987).
- Leverage: $\hat H = F(\hat\theta)[F(\hat\theta)' F(\hat\theta)]^{-1} F(\hat\theta)'$. See St. Laurent and Cook (1992).
- Heterogeneous errors: weighted nonlinear least squares (SAS can do this).
- Correlated errors: generalized nonlinear least squares. See the Gallant textbook (1987).
- More work needs to be done on nonlinear diagnostics.
