Stationary processes


Author: Opal Higgins


Stationary processes Matteo Pelagatti October 24, 2013

1 Definitions

Let us define the main object of this book: the time series.

Definition 1 (Time Series). A time series is a sequence of observations ordered with respect to a time index $t$, taking values in an index set $S$. If the set $S$ contains a finite or countable number of elements we speak of a discrete-time time series and the generic observation is indicated with the symbol $y_t$, while if $S$ is a continuum we have a continuous-time time series, whose generic observation is represented as $y(t)$.

Even though continuous-time time series are becoming very rare in a world dominated by digital computers [1], continuous-time models are nevertheless very popular in many disciplines. Indeed, observations may be taken at approximately equispaced points in time, in which case the discrete-time framework is the most natural, or they may be non-equispaced, in which case continuous-time models are usually more appropriate. This book concentrates on discrete-time time series, but most of the models covered here have a continuous-time counterpart.

Since future values of real time series are generally unknown and cannot be predicted without error, a quite natural mathematical model to describe their behaviour is that of a stochastic process.

[1] Digital computers are finite-state machines and, thus, cannot record continuous-time time series.

Definition 2 (Stochastic Process). A stochastic process is a sequence of random variables defined on a probability space $(\Omega, \mathcal{F}, P)$ and ordered with respect to a time index $t$, taking values in an index set $S$.

Again, when $S$ is countable one speaks of a discrete-time process and denotes it as $\{Y_t\}_{t\in S}$; when $S$ is a continuum we have a continuous-time process and represent it as $\{Y(t)\}_{t\in S}$ (sometimes also $\{Y_t\}_{t\in S}$). By definition of random variable, for each fixed $t$, $Y_t$ is a function $Y_t(\cdot)$ on the sample space $\Omega$, while for each fixed simple event $\omega \in \Omega$, $Y_\cdot(\omega)$ is a function on $S$, or a realisation (also sample-path) of the stochastic process. As customary in modern time series analysis, in this book we consider a time series as a finite realisation (or sample-path) of a stochastic process.

There is a fundamental difference between classical statistical inference and time series analysis. The set-up of classical inference consists of a random variable or vector $X$ and a random selection scheme to extract simple events, say $\{\omega_1, \omega_2, \ldots, \omega_n\}$, from the sample space $\Omega$. The observations, then, consist of the random variable values corresponding to the selected simple events: $\{x_1, x_2, \ldots, x_n\}$, where $x_i := X(\omega_i)$ for $i = 1, 2, \ldots, n$. In time series analysis, instead, we have a stochastic process $\{Y_t\}_{t\in S}$ and observe only one finite realisation of it through the extraction of a single event, say $\omega_1$, from the sample space $\Omega$: we have the time series $\{y_1, y_2, \ldots, y_n\}$, with $y_t := Y_t(\omega_1)$ for $t = 1, \ldots, n$.

Therefore, while in classical inference we have a sample of $n$ observations for the random variable $X$, in time series analysis we usually have to deal with a sample of dimension 1 with $n$ observations coming from different time points of the process $\{Y_t\}$. This means that, if we cannot assume some kind of time-homogeneity of the process making the single sample-path "look like" a classical sample, then we cannot make any sensible inference and prediction from a time series. In Section 2 we introduce the classes of stationary and integrated processes, which are the most important time-homogeneous processes used in time series analysis.

We end this section by defining an important class of stochastic processes.

Definition 3 (Gaussian process). The process $\{Y_t\}_{t\in S}$ is Gaussian if for all the finite subsets $\{t_1, t_2, \ldots, t_m\}$ of time points in $S$ the joint distribution of $(Y_{t_1}, \ldots, Y_{t_m})$ is normal.
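To make the distinction between a stochastic process and a time series concrete, the following minimal sketch treats the seed of a pseudo-random generator as a stand-in for the simple event $\omega$. The function name `sample_path` and the choice of independent standard normal variables are assumptions made only for this illustration.

```python
import random


def sample_path(omega, n):
    """Return (y_1, ..., y_n) = (Y_1(omega), ..., Y_n(omega)).

    The integer `omega` seeds the generator, so it plays the role of
    the simple event: fixing it fixes the whole realisation.
    Assumption: the Y_t are independent standard normal variables.
    """
    rng = random.Random(omega)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Classical inference: one random variable, five different events.
classical_sample = [sample_path(omega, 1)[0] for omega in range(1, 6)]

# Time series analysis: five time points, one single event (omega = 1).
time_series = sample_path(1, 5)
```

Calling `sample_path(1, 5)` again returns the same list, which mirrors the fact that a realisation is a deterministic function of $t$ once $\omega$ is fixed.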

2 Stationary processes

As we saw in Section 1, we treat a time series as a finite sample-path of a stochastic process. Unfortunately, unlike in statistical inference based on repeated random sampling, in time series analysis we have only one observation, the time series, from the data generating process. Thus, we have to base our inference on a sample of dimension one. Generally, we do have more than one observation in the sample-path but, unless we assume some kind of time-homogeneity of the data generating process, every observation $y_t$ in the time series is drawn from a different random variable $Y_t$.

Most social and natural phenomena seem to evolve smoothly rather than by abrupt changes, and therefore modelling them by time-homogeneous processes is a reasonable approximation, at least for a limited period of time.

The most important form of time-homogeneity used in time series analysis is stationarity, which is defined as time-invariance of the whole probability distribution of the data generating process (strict stationarity), or just of its first two moments (weak stationarity).

Definition 4 (Strict stationarity). The process $\{Y_t\}$ is strictly stationary if for all $k \in \mathbb{N}$, $h \in \mathbb{Z}$, and $(t_1, t_2, \ldots, t_k) \in \mathbb{Z}^k$,

$$(Y_{t_1}, Y_{t_2}, \ldots, Y_{t_k}) \overset{d}{=} (Y_{t_1+h}, Y_{t_2+h}, \ldots, Y_{t_k+h}),$$

where $\overset{d}{=}$ denotes equality in distribution.

Definition 5 (Weak stationarity). The process $\{Y_t\}$ is weakly stationary (or covariance stationary) if, for all $h, t \in \mathbb{Z}$,

$$E(Y_t) = \mu, \qquad \operatorname{Cov}(Y_t, Y_{t-h}) = \gamma(h),$$

with $\gamma(0) < \infty$.

As customary in time series analysis, in the rest of the book the terms stationarity and stationary will be used with the meaning of weak stationarity and weakly stationary respectively.

Theorem 1 (Relation between strict and weak stationarity). Let $\{Y_t\}$ be a stochastic process:

1. if $\{Y_t\}$ is strictly stationary, then it is also weakly stationary if and only if $\operatorname{Var}(Y_t) < \infty$;

2. if $\{Y_t\}$ is a Gaussian process, then strict and weak stationarity are equivalent (i.e. one form of stationarity implies the other).

Proof. Trivial.

Notice that the above definitions of stationarity assume that the process $\{Y_t\}$ is defined for $t \in \mathbb{Z}$ (i.e. the process originates in the infinite past and ends in the infinite future). This is a mathematical abstraction that is useful to derive some results (e.g. limit theorems), but the definitions can be easily adapted to the case of time series with $t \in \mathbb{N}$ or $t \in \{1, 2, \ldots, n\}$ by changing the domains of $t$, $h$ and $k$ accordingly.

The most elementary (non-trivial) stationary process is the white noise.

Definition 6 (White noise). A stochastic process is white noise if it has zero mean, finite variance $\sigma^2$, and covariance function

$$\gamma(h) = \begin{cases} \sigma^2 & \text{for } h = 0, \\ 0 & \text{for } h \neq 0. \end{cases}$$

As the next example clarifies, white noise processes and independent identically distributed (i.i.d.) sequences are not equivalent.

Example 1 (White noise and i.i.d. sequences). Let $\{X_t\}$ be a sequence of independent identically distributed (i.i.d.) random variables and $\{Z_t\}$ be white noise. The process $\{X_t\}$ is strictly stationary since the joint distribution for any $k$-tuple of time points is the product (by independence) of the common marginal distribution (by identical distribution), say $F(\cdot)$,

$$\Pr\{X_{t_1} \le x_1, X_{t_2} \le x_2, \ldots, X_{t_k} \le x_k\} = \prod_{i=1}^{k} F(x_i),$$

and this does not depend on $t$. $\{X_t\}$ is not necessarily weakly stationary since its first two moments may not exist (e.g. when $X_t$ is Cauchy-distributed). The process $\{Z_t\}$ is weakly stationary since mean and covariance are time-independent, but it is not necessarily strictly stationary since its marginal and joint distributions may depend on $t$ even when the first two moments are time-invariant.

The function $\gamma(h)$, which characterises a weakly stationary process, is called autocovariance function and enjoys the following properties.
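Before stating them, the distinction drawn in Example 1 can be illustrated numerically. The sketch below builds one possible white noise that is not identically distributed: independent draws that are standard normal at even $t$ and uniform on $[-\sqrt{3}, \sqrt{3}]$ at odd $t$. The construction, the function names and the sample size are assumptions chosen for this illustration; both laws have mean 0 and variance 1, but their fourth moments (3 and 9/5) differ.

```python
import math
import random


def white_noise_not_iid(n, seed=0):
    """Independent draws with mean 0 and variance 1 whose law depends on t.

    Even t: standard normal. Odd t: uniform on [-sqrt(3), sqrt(3)],
    which also has mean 0 and variance 1.
    """
    rng = random.Random(seed)
    a = math.sqrt(3.0)
    return [rng.gauss(0.0, 1.0) if t % 2 == 0 else rng.uniform(-a, a)
            for t in range(n)]


def moment(values, p):
    """Sample p-th raw moment."""
    return sum(v ** p for v in values) / len(values)


z = white_noise_not_iid(200_000)
even, odd = z[0::2], z[1::2]

# Second moments agree (both close to 1) ...
print(moment(even, 2), moment(odd, 2))
# ... but fourth moments do not (theoretical values 3 and 9/5),
# so the marginal distribution changes with t.
print(moment(even, 4), moment(odd, 4))
```

The first two moments are time-invariant, so the sequence is white noise, while the differing fourth moments show that it cannot be strictly stationary.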

Theorem 2 (Properties of the autocovariance function). Let $\gamma(\cdot)$ be the autocovariance function of a stationary process. Then:

1. (Positivity of variance) $\gamma(0) \ge 0$,

2. (Cauchy-Schwarz inequality) $|\gamma(h)| \le \gamma(0)$,

3. (Symmetry) $\gamma(h) = \gamma(-h)$,

4. (Nonnegative definiteness) $\sum_{i=1}^{m} \sum_{j=1}^{m} a_i \gamma(i - j) a_j \ge 0$, $\forall m \in \mathbb{N}$ and $(a_1, \ldots, a_m) \in \mathbb{R}^m$.

For the proof of this theorem and in many other places in this book, we will make use of the covariance matrix of the vector of $n$ consecutive observations of a stationary process, say $Y := (Y_1, Y_2, \ldots, Y_n)^\top$:

$$\Gamma_n := \begin{bmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(n-1) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(n-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(n-1) & \gamma(n-2) & \cdots & \gamma(0) \end{bmatrix}. \tag{1}$$

As any covariance matrix, $\Gamma_n$ is symmetric with respect to the main diagonal and nonnegative definite but, as the reader can easily verify, $\Gamma_n$ is also symmetric with respect to the secondary diagonal. Furthermore, the element of the matrix with indexes $(i, j)$ equals the element with indexes $(i + 1, j + 1)$ (i.e. $\Gamma_n$ is a Toeplitz matrix).

Proof. The first two properties are well-known properties of variance and covariance. The third property follows from stationarity and the symmetry of the arguments of the covariance:

$$\gamma(h) = \operatorname{Cov}(Y_t, Y_{t-h}) = \operatorname{Cov}(Y_{t+h}, Y_t) = \operatorname{Cov}(Y_t, Y_{t+h}) = \gamma(-h).$$

As for the fourth property, let $y := (Y_1, Y_2, \ldots, Y_m)^\top$ be $m$ consecutive observations of the stationary process with autocovariance $\gamma(\cdot)$; then for any real $m$-vector of constants $a$, the random variable $a^\top y$ has variance

$$\operatorname{Var}(a^\top y) = a^\top \Gamma_m a = \sum_{i=1}^{m} \sum_{j=1}^{m} a_i \gamma(i - j) a_j,$$

which, being a variance, is nonnegative. A stronger result asserts that any function on Z that satisfies the properties of Theorem 2 is the autocovariance of a stationary process, since it is always possible to build a Gaussian process with joint distributions based on such an autocovariance function. The autocorrelation function (ACF) is the scale-independent version of the autocovariance function.
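Before moving on to the ACF, the structure of $\Gamma_n$ described above can be checked numerically. The sketch below assumes the example autocovariance $\gamma(h) = 0.5^{|h|}$, which is not taken from the text, and the helper names are hypothetical. Evaluating the quadratic form at random vectors is only a spot check of nonnegative definiteness, not a proof.

```python
import random


def toeplitz_cov(gamma, n):
    """n x n matrix with (i, j) element gamma(i - j), as in equation (1)."""
    return [[gamma(i - j) for j in range(n)] for i in range(n)]


def quad_form(a, G):
    """a' G a, i.e. the variance of the linear combination a'Y."""
    n = len(a)
    return sum(a[i] * G[i][j] * a[j] for i in range(n) for j in range(n))


def gamma(h):
    """Assumed autocovariance, used only as an example."""
    return 0.5 ** abs(h)


n = 5
G = toeplitz_cov(gamma, n)

# Symmetry with respect to the main diagonal.
symmetric = all(G[i][j] == G[j][i] for i in range(n) for j in range(n))
# Symmetry with respect to the secondary diagonal.
persymmetric = all(G[i][j] == G[n - 1 - j][n - 1 - i]
                   for i in range(n) for j in range(n))
# Constant along each diagonal (Toeplitz structure).
toeplitz = all(G[i][j] == G[i + 1][j + 1]
               for i in range(n - 1) for j in range(n - 1))

# Spot check of nonnegative definiteness on random vectors a.
rng = random.Random(0)
forms = [quad_form([rng.uniform(-1, 1) for _ in range(n)], G)
         for _ in range(1000)]
print(symmetric, persymmetric, toeplitz, min(forms) >= 0)
```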


Definition 7 (Autocorrelation function (ACF)). If $\{Y_t\}$ is a stationary process with autocovariance $\gamma(\cdot)$, then its ACF is

$$\rho(h) := \operatorname{Cor}(Y_t, Y_{t-h}) = \gamma(h)/\gamma(0).$$

By Theorem 2 the ACF satisfies the following properties:

1. $\rho(0) = 1$,

2. $|\rho(h)| \le 1$,

3. $\rho(h) = \rho(-h)$,

4. $\sum_{i=1}^{m} \sum_{j=1}^{m} a_i \rho(i - j) a_j \ge 0$, $\forall m \in \mathbb{N}$ and $(a_1, \ldots, a_m) \in \mathbb{R}^m$.

Another summary of the linear dependence of a stationary process can be obtained from the partial autocorrelation function (PACF). The PACF measures the correlation between $Y_t$ and $Y_{t-h}$ after their linear dependence on the intervening random variables $Y_{t-1}, \ldots, Y_{t-h+1}$ has been removed.

Definition 8 (Partial autocorrelation function (PACF)). The partial autocorrelation function of the stationary process $\{Y_t\}$ is the set of correlations

$$\alpha(h) := \operatorname{Cor}\big(Y_t - P(Y_t \mid Y_{t-1:t-h+1}),\; Y_{t-h} - P(Y_{t-h} \mid Y_{t-1:t-h+1})\big)$$

as a function of the nonnegative integer $h$, where $Y_{t-1:t-h+1} := (Y_{t-1}, \ldots, Y_{t-h+1})^\top$.

As the following theorem shows, the PACF can be derived from the autocovariance function.

Theorem 3 (Durbin-Levinson algorithm). Let $\{Y_t\}$ be a stationary process with mean $\mu$ and autocovariance function $\gamma(h)$; then its PACF is given by

$$\alpha(0) = 1, \qquad \alpha(1) = \gamma(1)/\gamma(0),$$

$$\alpha(h) = \frac{\gamma(h) - \sum_{j=1}^{h-1} \phi_{h-1,j}\, \gamma(h - j)}{\gamma(0) - \sum_{j=1}^{h-1} \phi_{h-1,j}\, \gamma(j)}, \qquad \text{for } h = 2, 3, \ldots \tag{2}$$

where $\phi_{h-1,j}$ denotes the $j$-th element of the vector $\phi_{h-1} := \gamma_{h-1}^\top \Gamma_{h-1}^{-1}$ with $\gamma_{h-1} := [\gamma(1), \ldots, \gamma(h-1)]^\top$ and $\Gamma_{h-1}$ as in equation (1). The coefficients $\phi_h$ can be recursively computed as

$$\phi_{h,h} = \alpha(h), \qquad \phi_{h,j} = \phi_{h-1,j} - \alpha(h)\, \phi_{h-1,h-j}, \qquad \text{for } j = 1, \ldots, h - 1.$$

Furthermore, if we call $v_{h-1}$ the denominator of $\alpha(h)$ in equation (2), we can use the recursion $v_0 = \gamma(0)$, $v_h = v_{h-1}(1 - \alpha(h)^2)$ to compute it.

The first part of the theorem shows how partial autocorrelations relate to autocovariances, while the second part provides recursions to efficiently compute the PACF without the need to explicitly invert the matrices $\Gamma_{h-1}$.

Proof. The correlation of a random variable with itself is 1, and so $\alpha(0) = 1$; as no variables intervene between $Y_t$ and $Y_{t-1}$, $\alpha(1) = \rho(1) = \gamma(1)/\gamma(0)$. In order to lighten the notation, let us assume without loss of generality that $E Y_t = 0$. First, notice that by the properties of the optimal linear predictor (1. and 2. of Theorem ??),

$$E\big(Y_t - P[Y_t \mid Y_{t-1:t-h+1}]\big)\big(Y_{t-h} - P[Y_{t-h} \mid Y_{t-1:t-h+1}]\big) = E\big(Y_t - P[Y_t \mid Y_{t-1:t-h+1}]\big) Y_{t-h}.$$

Thus, by definition of PACF,

$$\alpha(h) = \frac{E\big(Y_t - P[Y_t \mid Y_{t-1:t-h+1}]\big) Y_{t-h}}{E\big(Y_t - P[Y_t \mid Y_{t-1:t-h+1}]\big) Y_t} = \frac{\gamma(h) - \sum_{j=1}^{h-1} \phi_{h-1,j}\, \gamma(h - j)}{\gamma(0) - \sum_{j=1}^{h-1} \phi_{h-1,j}\, \gamma(j)}, \tag{3}$$

since $P[Y_t \mid Y_{t-1:t-h+1}] = \sum_{j=1}^{h-1} \phi_{h-1,j} Y_{t-j}$ with $\phi_{h-1,j}$ the $j$-th element of the vector $\phi_{h-1} = [\gamma(1), \ldots, \gamma(h-1)]\, \Gamma_{h-1}^{-1}$.

Let us concentrate on the numerator of equation (3), but for $\alpha(h + 1)$. Using the updating formula for the optimal linear predictor (Theorem ??, Property 7.) we can write the numerator of $\alpha(h + 1)$ as

$$\begin{aligned}
&E\Big(Y_t - P[Y_t \mid Y_{t-1:t-h+1}] - P\big[Y_t - P[Y_t \mid Y_{t-1:t-h+1}] \,\big|\, Y_{t-h} - P[Y_{t-h} \mid Y_{t-1:t-h+1}]\big]\Big) Y_{t-h-1} \\
&\quad = \gamma(h + 1) - \sum_{j=1}^{h-1} \phi_{h-1,j}\, \gamma(h + 1 - j) - \alpha(h) \left( \gamma(1) - \sum_{j=1}^{h-1} \phi_{h-1,h-j}\, \gamma(h + 1 - j) \right),
\end{aligned}$$

since, as can be easily checked,

$$\begin{aligned}
&P\big[Y_t - P[Y_t \mid Y_{t-1:t-h+1}] \,\big|\, Y_{t-h} - P[Y_{t-h} \mid Y_{t-1:t-h+1}]\big] \\
&\quad = \frac{E\big(Y_t - P[Y_t \mid Y_{t-1:t-h+1}]\big)\big(Y_{t-h} - P[Y_{t-h} \mid Y_{t-1:t-h+1}]\big)}{E\big(Y_{t-h} - P[Y_{t-h} \mid Y_{t-1:t-h+1}]\big)^2} \big(Y_{t-h} - P[Y_{t-h} \mid Y_{t-1:t-h+1}]\big) \\
&\quad = \alpha(h) \big(Y_{t-h} - P[Y_{t-h} \mid Y_{t-1:t-h+1}]\big) = \alpha(h) \left( Y_{t-h} - \sum_{j=1}^{h-1} \phi_{h-1,h-j}\, Y_{t-j} \right).
\end{aligned}$$

But, by equation (3), we have the alternative formula for the numerator of $\alpha(h + 1)$,

$$\gamma(h + 1) - \sum_{j=1}^{h} \phi_{h,j}\, \gamma(h + 1 - j),$$

and equating the coefficients with the same order of autocovariance we obtain

$$\phi_{h,h} = \alpha(h), \qquad \phi_{h,j} = \phi_{h-1,j} - \alpha(h)\, \phi_{h-1,h-j} \qquad \text{for } j = 1, \ldots, h - 1.$$

Let us denote with $v_{h-1}$ the denominator of $\alpha(h)$ in equation (3) and repeat the reasoning for the denominator of $\alpha(h + 1)$, which will be named $v_h$:

$$\begin{aligned}
v_h &= E\Big(Y_t - P[Y_t \mid Y_{t-1:t-h+1}] - P\big[Y_t - P[Y_t \mid Y_{t-1:t-h+1}] \,\big|\, Y_{t-h} - P[Y_{t-h} \mid Y_{t-1:t-h+1}]\big]\Big) Y_t \\
&= v_{h-1} - \alpha(h) \left( \gamma(h) - \sum_{j=1}^{h-1} \phi_{h-1,j}\, \gamma(h - j) \right) \\
&= v_{h-1} - \alpha(h)^2 \left( \gamma(0) - \sum_{j=1}^{h-1} \phi_{h-1,j}\, \gamma(j) \right) \\
&= v_{h-1} \big(1 - \alpha(h)^2\big).
\end{aligned}$$
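The recursions of Theorem 3 translate almost line by line into code. The following is a minimal sketch, not a reference implementation: the function name `durbin_levinson` and the convention of passing the autocovariances as a list indexed by lag are assumptions made for this example.

```python
def durbin_levinson(gamma, max_lag):
    """Partial autocorrelations alpha(0), ..., alpha(max_lag).

    `gamma` is a list with gamma[h] the autocovariance at lag h,
    for h = 0, ..., max_lag. Returns the list of alpha(h) and the
    last value v_{max_lag} of the recursion for the denominator.
    """
    alpha = [1.0]       # alpha(0) = 1
    phi = []            # phi[j - 1] holds phi_{h-1,j}
    v = gamma[0]        # v_0 = gamma(0)
    for h in range(1, max_lag + 1):
        # Numerator of alpha(h); the sum is empty when h = 1.
        num = gamma[h] - sum(phi[j - 1] * gamma[h - j] for j in range(1, h))
        a = num / v     # here v equals v_{h-1}
        # phi_{h,j} = phi_{h-1,j} - alpha(h) phi_{h-1,h-j}; phi_{h,h} = alpha(h)
        phi = [phi[j - 1] - a * phi[h - j - 1] for j in range(1, h)] + [a]
        v *= 1.0 - a * a  # v_h = v_{h-1} (1 - alpha(h)^2)
        alpha.append(a)
    return alpha, v


# Assumed example: gamma(h) = 0.5 ** h, for which alpha(1) = 0.5 and
# every higher-order partial autocorrelation vanishes.
pacf, v_last = durbin_levinson([0.5 ** h for h in range(5)], 4)
print(pacf)
```

No matrix is inverted: each step costs a number of operations proportional to $h$, which is the computational point of the second part of the theorem.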

Since the population mean $\mu$, the autocovariances $\gamma(h)$, the autocorrelations $\rho(h)$ and the partial autocorrelations $\alpha(h)$ are generally unknown quantities, they need to be estimated from a time series. If we do not have a specific parametric model for our time series, the natural estimators for $\mu$ and $\gamma(h)$ are their sample counterparts:

$$\bar{Y}_n := \frac{1}{n} \sum_{t=1}^{n} Y_t, \qquad \hat{\gamma}(h) := \frac{1}{n} \sum_{t=h+1}^{n} (Y_t - \bar{Y}_n)(Y_{t-h} - \bar{Y}_n).$$

Note that in the sample autocovariance the divisor is $n$ and not $n - h$ (or $n - h - 1$) as one would expect from classical statistical inference. Indeed, the latter divisor does not guarantee that the sample autocovariance function is nonnegative definite. Instead, the sample autocovariance matrix, whose generic element is the above defined $\hat{\gamma}(i - j)$, can be expressed as the product of a matrix times its transpose and, therefore, is always nonnegative definite.
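A minimal numerical sketch of this factorisation follows, assuming one particular choice of the factor: a matrix whose columns are progressively shifted, zero-padded copies of the demeaned series. The function names and the data are made up for the illustration.

```python
def sample_autocov(y, h):
    """Sample autocovariance at lag h with divisor n (not n - h)."""
    n = len(y)
    mean = sum(y) / n
    h = abs(h)
    return sum((y[t] - mean) * (y[t - h] - mean) for t in range(h, n)) / n


def shifted_matrix(y, k):
    """(n + k - 1) x k matrix whose column c is the demeaned series
    shifted down by c rows and padded with zeros."""
    n = len(y)
    mean = sum(y) / n
    d = [v - mean for v in y]
    return [[d[r - c] if 0 <= r - c < n else 0.0 for c in range(k)]
            for r in range(n + k - 1)]


y = [2.0, 5.0, 3.0, 8.0, 6.0, 4.0]   # made-up data
n, k = len(y), 3
Y = shifted_matrix(y, k)

# (i, j) element of Y'Y / n ...
gram = [[sum(row[i] * row[j] for row in Y) / n for j in range(k)]
        for i in range(k)]
# ... compared with the sample autocovariance at lag i - j.
direct = [[sample_autocov(y, i - j) for j in range(k)] for i in range(k)]
max_gap = max(abs(gram[i][j] - direct[i][j])
              for i in range(k) for j in range(k))
print(max_gap)   # zero up to floating-point rounding
```

Because the matrix of sample autocovariances is a Gram matrix, any quadratic form in it is a sum of squares divided by $n$, which is why the divisor $n$ preserves nonnegative definiteness.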


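For example, define the matrix with $k$ columns,

$$\mathbf{Y} := \begin{bmatrix}
Y_1 - \bar{Y}_n & 0 & 0 & \cdots & 0 \\
Y_2 - \bar{Y}_n & Y_1 - \bar{Y}_n & 0 & \cdots & 0 \\
Y_3 - \bar{Y}_n & Y_2 - \bar{Y}_n & Y_1 - \bar{Y}_n & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
Y_n - \bar{Y}_n & Y_{n-1} - \bar{Y}_n & Y_{n-2} - \bar{Y}_n & \cdots & Y_{n-k+1} - \bar{Y}_n \\
0 & Y_n - \bar{Y}_n & Y_{n-1} - \bar{Y}_n & \cdots & Y_{n-k+2} - \bar{Y}_n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & Y_n - \bar{Y}_n
\end{bmatrix}.$$

The autocovariance matrix containing the first $k - 1$ sample autocovariances can be computed as

$$\hat{\Gamma}_{k-1} = n^{-1}\, \mathbf{Y}^\top \mathbf{Y},$$

which is always nonnegative definite.

We summarise the properties of the sample mean of a stationary process in the following theorem.

Theorem 4 (Properties of the sample mean). Let $\{Y_t\}$ be a weakly stationary process with mean $\mu$ and autocovariance function $\gamma(h)$; then for the sample mean $\bar{Y}_n$ the following properties hold:

1. (Unbiasedness) $E \bar{Y}_n = \mu$;

2. (Variance) $\operatorname{Var}(\bar{Y}_n) = \frac{1}{n} \sum_{h=-n+1}^{n-1} \left(1 - \frac{|h|}{n}\right) \gamma(h)$;

3. (Normality) if $\{Y_t\}$ is a Gaussian process, then $\bar{Y}_n \sim N(\mu, \operatorname{Var}(\bar{Y}_n))$;

4. (Consistency) if $\gamma(h) \to 0$ as $h \to \infty$, then $E[\bar{Y}_n - \mu]^2 \to 0$;

5. (Asymptotic variance) if $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$, then $n E[\bar{Y}_n - \mu]^2 \to \sum_{h=-\infty}^{\infty} \gamma(h)$;

6. (Asymptotic normality) if $Y_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}$ with $Z_t \sim \mathrm{IID}(0, \sigma^2)$, $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ and $\sum_{j=-\infty}^{\infty} \psi_j \neq 0$, then

$$\sqrt{n}\,(\bar{Y}_n - \mu) \xrightarrow{d} N\!\left(0, \sum_{h=-\infty}^{\infty} \gamma(h)\right).$$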
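Properties 2 and 5 can be checked numerically for a given autocovariance function. The sketch below assumes the example $\gamma(h) = 0.5^{|h|}$ (not taken from the text), whose sum over all integer lags is $1 + 2 \cdot 0.5/(1 - 0.5) = 3$; the function names are hypothetical.

```python
def var_sample_mean(gamma, n):
    """Var(Ybar_n) = (1/n) * sum_{|h| < n} (1 - |h|/n) * gamma(h)."""
    return sum((1 - abs(h) / n) * gamma(h) for h in range(-n + 1, n)) / n


def var_sample_mean_direct(gamma, n):
    """Same quantity from the double sum (1/n^2) sum_i sum_j gamma(i - j)."""
    return sum(gamma(i - j) for i in range(n) for j in range(n)) / n ** 2


def gamma(h):
    """Assumed autocovariance; its sum over all lags equals 3."""
    return 0.5 ** abs(h)


# The two expressions for the variance agree.
print(var_sample_mean(gamma, 50), var_sample_mean_direct(gamma, 50))

# n * Var(Ybar_n) approaches the sum of the autocovariances as n grows.
for n in (10, 100, 1000):
    print(n, n * var_sample_mean(gamma, n))
```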


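Proof.

Unbiasedness. $E \bar{Y}_n = n^{-1} \sum_{t=1}^{n} E Y_t = \mu$.

Variance.

$$\begin{aligned}
\operatorname{Var}(\bar{Y}_n) &= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(Y_i, Y_j) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \gamma(i - j) \\
&= \frac{1}{n^2} \sum_{h=-n+1}^{n-1} (n - |h|)\, \gamma(h) = \frac{1}{n} \sum_{h=-n+1}^{n-1} \left(1 - \frac{|h|}{n}\right) \gamma(h).
\end{aligned}$$

Normality. The normality of the mean follows from the assumption of joint Gaussianity of $(Y_1, \ldots, Y_n)$.

Consistency. In order to obtain the (mean-square) consistency of $\bar{Y}_n$, the quantity $E(\bar{Y}_n - \mu)^2 = \operatorname{Var}(\bar{Y}_n)$ has to converge to zero. A sufficient condition for this to happen is $\gamma(h) \to 0$ as $h$ diverges. In fact, in this case we can always fix a small positive $\varepsilon$ and find a positive integer $N$ such that for all $h > N$, $|\gamma(h)| < \varepsilon$. Therefore, for $n > N + 1$,

$$\begin{aligned}
\operatorname{Var}(\bar{Y}_n) &= \frac{1}{n} \sum_{h=-n+1}^{n-1} \left(1 - \frac{|h|}{n}\right) \gamma(h) \le \frac{1}{n} \sum_{h=-n+1}^{n-1} |\gamma(h)| \\
&= \frac{1}{n} \sum_{h=-N}^{N} |\gamma(h)| + \frac{2}{n} \sum_{h=N+1}^{n-1} |\gamma(h)| \le \frac{1}{n} \sum_{h=-N}^{N} |\gamma(h)| + 2\varepsilon.
\end{aligned}$$

As $n$ diverges the first addend converges to zero (it is a finite quantity divided by $n$), while the second addend can be made arbitrarily small, and so, by the very definition of limit, $\operatorname{Var}(\bar{Y}_n) \to 0$.

Asymptotic variance. After multiplying the variance of $\bar{Y}_n$ by $n$, we have the following inequality:

$$n E[\bar{Y}_n - \mu]^2 = \sum_{h=-n+1}^{n-1} \left(1 - \frac{|h|}{n}\right) \gamma(h) \le \sum_{h=-n+1}^{n-1} |\gamma(h)|.$$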

Therefore, a sufficient condition for the asymptotic variance to converge as $n \to \infty$ is that $\sum_{h=-n+1}^{n-1} |\gamma(h)|$ converges. Furthermore, by Cesàro's theorem,

$$\lim_{n\to\infty} \sum_{h=-n+1}^{n-1} \left(1 - \frac{|h|}{n}\right) \gamma(h) = \lim_{n\to\infty} \sum_{h=-n+1}^{n-1} \gamma(h).$$

Asymptotic normality. This is the central limit theorem for linear processes; for the proof refer to Brockwell and Davis (1991, Sec. 7.3), for instance.

Of course, if $\{Y_t\}$ is a Gaussian process, also the sample mean is Gaussian. If the process is not Gaussian, the distribution of the sample mean can be approximated by a normal only if some central limit theorem (CLT) for dependent processes applies; in the Theorem we provide one for linear processes, but there are alternative CLTs under weaker conditions (in particular under mixing conditions).

For the sample autocorrelation $\hat{\rho}(h) := \hat{\gamma}(h)/\hat{\gamma}(0)$ we provide the following result without proof, which is rather lengthy and cumbersome and can be found in Brockwell and Davis (2002, Sec. 7.3).

Theorem 5 (Asymptotic distribution of the sample ACF). Let $\{Y_t\}$ be the stationary process

$$Y_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}, \qquad Z_t \sim \mathrm{IID}(0, \sigma^2).$$

If $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ and either one of the following conditions holds

• $E Z_t^4 < \infty$,

• $\sum_{j=-\infty}^{\infty} \psi_j^2 |j| < \infty$;

then for each $h \in \{1, 2, \ldots\}$

$$\sqrt{n} \left( \begin{bmatrix} \hat{\rho}(1) \\ \vdots \\ \hat{\rho}(h) \end{bmatrix} - \begin{bmatrix} \rho(1) \\ \vdots \\ \rho(h) \end{bmatrix} \right) \xrightarrow{d} N(0, V),$$

with the generic $(i, j)$-th element of the covariance matrix $V$ being

$$v_{ij} = \sum_{k=1}^{\infty} \big[\rho(k + i) + \rho(k - i) - 2\rho(i)\rho(k)\big] \big[\rho(k + j) + \rho(k - j) - 2\rho(j)\rho(k)\big].$$
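As a worked special case of the formula for $v_{ij}$, the short derivation below assumes that the process is $\mathrm{IID}(0, \sigma^2)$, so that $\rho(0) = 1$ and $\rho(k) = 0$ for every $k \neq 0$. Under this assumption the covariance matrix $V$ reduces to the identity.

```latex
% Assumption: Y_t ~ IID(0, sigma^2), so rho(0) = 1 and rho(k) = 0 for k != 0.
% For i, j >= 1 and k >= 1 we have rho(k + i) = 0 and rho(i) rho(k) = 0,
% so only rho(k - i) = 1{k = i} survives in each bracket.
\begin{align*}
v_{ij} &= \sum_{k=1}^{\infty} \rho(k-i)\,\rho(k-j)
        = \sum_{k=1}^{\infty} \mathbf{1}\{k=i\}\,\mathbf{1}\{k=j\}
        = \begin{cases} 1 & \text{if } i=j,\\ 0 & \text{if } i\neq j. \end{cases}
\end{align*}
% Hence V is the h x h identity matrix.
```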


A corollary to this theorem is that, when a process is $\mathrm{IID}(0, \sigma^2)$, the sample autocorrelations at different lags are asymptotically independent and $\sqrt{n}\,\hat{\rho}(h)$ converges in distribution to a standard normal. This result is used in the following portmanteau test statistics for the null hypothesis $Y_t \sim \mathrm{IID}(0, \sigma^2)$:

Box-Pierce: $Q_{BP}(h) = n \sum_{k=1}^{h} \hat{\rho}(k)^2$;

Ljung-Box: $Q_{LB}(h) = n(n + 2) \sum_{k=1}^{h} \hat{\rho}(k)^2/(n - k)$.

Corollary 6 (Portmanteau tests). Under the hypothesis $Y_t \sim \mathrm{IID}(0, \sigma^2)$, the test statistics $Q_{BP}$ and $Q_{LB}$ converge in distribution to a chi-square with $h$ degrees of freedom.
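The two statistics can be computed directly from the sample autocorrelations. The following is a minimal sketch using only the standard library; the function names, the simulated data and the choice of 10 lags are assumptions made for the example.

```python
import random


def sample_acf(y, max_lag):
    """Sample autocorrelations rho_hat(1), ..., rho_hat(max_lag)."""
    n = len(y)
    mean = sum(y) / n
    d = [v - mean for v in y]
    c0 = sum(v * v for v in d) / n
    return [sum(d[t] * d[t - k] for t in range(k, n)) / n / c0
            for k in range(1, max_lag + 1)]


def box_pierce(y, h):
    """Q_BP(h) = n * sum_{k=1}^{h} rho_hat(k)^2."""
    n = len(y)
    return n * sum(r * r for r in sample_acf(y, h))


def ljung_box(y, h):
    """Q_LB(h) = n (n + 2) * sum_{k=1}^{h} rho_hat(k)^2 / (n - k)."""
    n = len(y)
    rho = sample_acf(y, h)
    return n * (n + 2) * sum(r * r / (n - k)
                             for k, r in enumerate(rho, start=1))


rng = random.Random(42)
y = [rng.gauss(0.0, 1.0) for _ in range(500)]   # simulated i.i.d. data
q_bp, q_lb = box_pierce(y, 10), ljung_box(y, 10)
# Under the null hypothesis both statistics are compared with a chi-square
# with 10 degrees of freedom, whose 95% quantile is about 18.31.
print(q_bp, q_lb)
```

Since $(n + 2)/(n - k) > 1$ for every lag, the Ljung-Box statistic is never smaller than the Box-Pierce statistic computed on the same data.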

Proof. Since $Q_{BP}(h)$ is the sum of the squares of $h$ asymptotically independent standard normal random variables, it converges in distribution to a chi-square with $h$ degrees of freedom. The statistic $Q_{LB}(h)$ is asymptotically equivalent to $Q_{BP}(h)$; in fact, for each $h$:

$$\sqrt{\frac{n(n + 2)}{n - h}}\, \hat{\rho}(h) - \sqrt{n}\, \hat{\rho}(h) = \sqrt{n}\, \hat{\rho}(h) \left( \sqrt{\frac{n + 2}{n - h}} - 1 \right),$$

which converges in probability to zero as $n$ diverges.

The Ljung-Box statistic is more popular than the Box-Pierce since it approximates the asymptotic distribution better in small samples.

We conclude this section on stationary processes with a celebrated result, used also as a justification for the use of the class of ARMA models as an approximation to any stationary process. Before presenting the result we need the concept of (linearly) deterministic processes.

Definition 9 (Deterministic process). The stationary process $\{V_t\}$ is (linearly) deterministic if

$$\lim_{k\to\infty} E\big(V_t - P[V_t \mid V_{t-1}, V_{t-2}, \ldots, V_{t-k}]\big)^2 = 0.$$

In other words, a stationary process is deterministic if it can be predicted without error by a linear function of its (possibly) infinite past.

Example 2 (Two deterministic processes). Let

$$W_t = W, \qquad V_t = X \cos(\lambda t) + Y \sin(\lambda t),$$

where $W$, $X$ and $Y$ are random variables with zero means and finite variances; furthermore, $\operatorname{Var}(X) = \operatorname{Var}(Y) = \sigma^2$ and $\operatorname{Cov}(X, Y) = 0$. The two processes are stationary: their mean is zero for all $t$ and their autocovariance functions are

$$E W_t W_{t-h} = E W^2,$$

$$E V_t V_{t-h} = \sigma^2 \big[\cos(\lambda t) \cos(\lambda(t - h)) + \sin(\lambda t) \sin(\lambda(t - h))\big] = \sigma^2 \cos(\lambda h),$$

which are invariant with respect to $t$. Their linear predictions based on the past are

$$P[W_t \mid W_{t-1}] = W, \qquad P[V_t \mid V_{t-1}, V_{t-2}] = 2 \cos(\lambda) V_{t-1} - V_{t-2}$$

(the reader is invited to derive the latter formula by computing the optimal linear prediction and applying trigonometric identities). For the first process it is evident that the prediction ($W$) is identical to the outcome ($W$). In the second case, we need to show that the prediction is equal to the process outcome:

$$\begin{aligned}
P[V_t \mid V_{t-1}, V_{t-2}] &= 2 \cos(\lambda) V_{t-1} - V_{t-2} \\
&= 2 \cos(\lambda) \big[X \cos(\lambda t - \lambda) + Y \sin(\lambda t - \lambda)\big] - \big[X \cos(\lambda t - 2\lambda) + Y \sin(\lambda t - 2\lambda)\big] \\
&= X \big[2 \cos(\lambda) \cos(\lambda t - \lambda) - \cos(\lambda t - 2\lambda)\big] + Y \big[2 \cos(\lambda) \sin(\lambda t - \lambda) - \sin(\lambda t - 2\lambda)\big] \\
&= X \cos(\lambda t) + Y \sin(\lambda t),
\end{aligned}$$

where the last line is obtained by applying well-known trigonometric identities.
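The identity used for the second process can also be checked numerically. The sketch below draws $X$ and $Y$ once, which corresponds to fixing one realisation, and compares $V_t$ with $2\cos(\lambda)V_{t-1} - V_{t-2}$; the value of $\lambda$, the seed and the normal distribution of $X$ and $Y$ are arbitrary assumptions.

```python
import math
import random

rng = random.Random(7)
lam = 0.9                                        # arbitrary frequency
X, Y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # one draw of X and Y


def V(t):
    """V_t = X cos(lambda t) + Y sin(lambda t) for the drawn X and Y."""
    return X * math.cos(lam * t) + Y * math.sin(lam * t)


# Largest absolute gap between V_t and its linear prediction.
max_error = max(abs(V(t) - (2 * math.cos(lam) * V(t - 1) - V(t - 2)))
                for t in range(2, 200))
print(max_error)   # zero up to floating-point rounding
```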


Theorem 7 (Wold decomposition). Let $Y_t$ be a stationary process; then

$$Y_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t,$$

where

1. $\psi_0 = 1$, $\sum_{j=0}^{\infty} \psi_j^2 < \infty$,

2. $Z_t$ is white noise,

3. $\operatorname{Cov}(Z_t, V_s) = 0$ for all $t$ and $s$,

4. $V_t$ is (linearly) deterministic,

5. $\lim_{k\to\infty} E(Z_t - P[Z_t \mid Y_t, Y_{t-1}, \ldots, Y_{t-k}])^2 = 0$;

6. $\lim_{k\to\infty} E(V_t - P[V_t \mid Y_s, Y_{s-1}, \ldots, Y_{s-k}])^2 = 0$ for all $t$ and $s$.

The message of this theorem is that every stationary process can be seen as the sum of two orthogonal components: one, the deterministic, is perfectly predictable using a linear function of the past of the process (point 6.); the other, the purely non-deterministic, is expressible as a (possibly) infinite linear combination of past and present observations of a white noise process $Z_t$. This process is the prediction error of $Y_t$ based on its (possibly) infinite past,

$$Z_t = Y_t - P[Y_t \mid Y_{t-1}, Y_{t-2}, \ldots],$$

and is generally termed innovation of the process $Y_t$. Thus, the coefficients $\psi_j$ are the projection coefficients of $Y_t$ on its past innovations $Z_{t-j}$:

$$\psi_j = E[Y_t Z_{t-j}] / E[Z_{t-j}^2].$$
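The projection formula for $\psi_j$ can be illustrated by simulation. The sketch below assumes a purely non-deterministic process with $\psi_0 = 1$, $\psi_1 = 0.6$ and $\psi_j = 0$ for $j > 1$, chosen only for the example; the innovations are available here because they are simulated, whereas with real data they are not observed.

```python
import random

rng = random.Random(3)
n = 100_000
psi1 = 0.6                                   # assumed coefficient
z = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]

# Y_t = Z_t + psi1 * Z_{t-1}: psi_0 = 1, psi_1 = 0.6, psi_j = 0 for j > 1.
y = [z[t] + psi1 * z[t - 1] for t in range(1, n + 1)]
zt = z[1:]                                   # innovations aligned with y


def psi_hat(j):
    """Sample version of E[Y_t Z_{t-j}] / E[Z_{t-j}^2]."""
    num = sum(y[t] * zt[t - j] for t in range(j, n)) / (n - j)
    den = sum(v * v for v in zt[: n - j]) / (n - j)
    return num / den


print([round(psi_hat(j), 2) for j in range(4)])
```

The estimates should be close to 1, 0.6, 0 and 0, up to sampling error.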

As for the deterministic component, point 6. implies that it can be predicted without error also using the infinite past of $\{Y_t\}$ and not only using its own past $\{V_s\}_{s<t}$ [...]

[...] absolute summability of the autocovariance function [3] that, according to Theorem 4, is sufficient for the consistency of the sample mean and the existence of its asymptotic variance.

[2] This is true because the convergence of $\sum_{j=1}^{\infty} |\psi_j|$ [...] $N$ are smaller than 1 and $\sum_{j=1}^{N} |\psi_j|$ is a finite quantity. Then, for all $j > N$, $\psi_j^2 < |\psi_j|$ and the convergence of $\sum_{j=N+1}^{\infty} |\psi_j|$ implies the convergence of $\sum_{j=N+1}^{\infty} \psi_j^2$.

References

Brockwell, P. J. and R. A. Davis (1991). Time Series: Theory and Methods (2nd ed.). Springer.

Brockwell, P. J. and R. A. Davis (2002). Introduction to Time Series and Forecasting (2nd ed.). Springer.

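[3] To see why (all summation indexes range from $-\infty$ to $\infty$):

$$\sigma^2 \sum_{h} \Big| \sum_{j} \psi_j \psi_{j+h} \Big| \le \sigma^2 \sum_{j} \sum_{h} |\psi_j|\, |\psi_{j+h}| \le \sigma^2 \sum_{j} |\psi_j| \sum_{i} |\psi_i| = \sigma^2 \Big( \sum_{j} |\psi_j| \Big)^2.$$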