Notes on Maximum Likelihood and Time Series Antonis Demos (Athens University of Economics and Business) December 2006


0.1 Conditional Probability and Independence

In many statistical applications we have variables X and Y (or events A and B) and want to explain or predict Y or A from X or B. We are then interested not only in marginal probabilities but in conditional ones as well, i.e., we want to incorporate some information into our predictions. Let A and B be two events in A and P(.) a probability function. The conditional probability of A given the event B is denoted by P[A|B] and is defined as follows:

Definition 1 The probability of an event A given an event B, denoted by P(A|B), is given by

P(A|B) = P(A ∩ B) / P(B)   if P(B) > 0,

and is left undefined if P(B) = 0.

From the above formula it is evident that P[AB] = P[A|B]P[B] = P[B|A]P[A] if both P[A] and P[B] are nonzero. Notice that when speaking of conditional probabilities we are conditioning on some given event B; that is, we are assuming that the experiment has resulted in some outcome in B. B, in effect, then becomes our "new" sample space. All probability properties of the previous section apply to conditional probabilities as well, i.e. P(·|B) is a probability measure. In particular:

1. P(A|B) ≥ 0
2. P(S|B) = 1
3. P(∪_{i=1}^{∞} Ai | B) = Σ_{i=1}^{∞} P(Ai | B) for any pairwise disjoint events {Ai}_{i=1}^{∞}.

Note that if A and B are mutually exclusive events, P(A|B) = 0. When A ⊆ B, P(A|B) = P(A)/P(B) ≥ P(A), with strict inequality unless P(B) = 1. When B ⊆ A, P(A|B) = 1. However, there is an additional property (Law), called the Law of Total Probability, which states that:

LAW OF TOTAL PROBABILITY: P(A) = P(A ∩ B) + P(A ∩ Bᶜ).


For a given probability space (Ω, A, P[.]), if B1, B2, ..., Bn is a collection of mutually exclusive events in A satisfying ∪_{i=1}^{n} Bi = Ω and P[Bi] > 0 for i = 1, 2, ..., n, then for every A ∈ A,

P[A] = Σ_{i=1}^{n} P[A|Bi] P[Bi].

Another important theorem in probability is the so-called Bayes' Theorem, which states:

BAYES RULE: Given a probability space (Ω, A, P[.]), if B1, B2, ..., Bn is a collection of mutually exclusive events in A satisfying ∪_{i=1}^{n} Bi = Ω and P[Bi] > 0 for i = 1, 2, ..., n, then for every A ∈ A for which P[A] > 0 we have:

P[Bj|A] = P[A|Bj] P[Bj] / Σ_{i=1}^{n} P[A|Bi] P[Bi].

Notice that for events A and B ∈ A which satisfy P[A] > 0 and P[B] > 0 we have:

P(B|A) = P(A|B)P(B) / ( P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ) ).

This follows from the definition of conditional probability and the law of total probability. The probability P(B) is a prior probability and P(A|B) is frequently a likelihood, while P(B|A) is the posterior. Finally, the Multiplication Rule states: Given a probability space (Ω, A, P[.]), if A1, A2, ..., An are events in A for which P[A1 A2 ... A_{n−1}] > 0, then:

P[A1 A2 ... An] = P[A1] P[A2|A1] P[A3|A1 A2] ... P[An|A1 A2 ... A_{n−1}].

Example: A plant has two machines. Machine A produces 60% of the total output with a fraction defective of 0.02. Machine B produces the rest of the output with a fraction defective of 0.04. If a single unit of output is observed to be defective, what is the probability that this unit was produced by machine A?


Let A be the event that the unit was produced by machine A, B the event that it was produced by machine B, and D the event that the unit is defective. We ask what is P[A|D]. But P[A|D] = P[AD]/P[D]. Now P[AD] = P[D|A]P[A] = 0.02 × 0.6 = 0.012. Also P[D] = P[D|A]P[A] + P[D|B]P[B] = 0.012 + 0.04 × 0.4 = 0.028. Consequently, P[A|D] = 0.012/0.028 = 0.429. Notice that P[B|D] = 1 − P[A|D] = 0.571. We can also use a tree diagram to evaluate P[AD] and P[BD].

Example: A marketing manager believes the market demand potential of a new product to be high with probability 0.30, average with probability 0.50, or low with probability 0.20. From a sample of 20 employees, 14 indicated a very favorable reception to the new product. In the past such an employee response (14 out of 20 favorable) has occurred with the following probabilities: if the actual demand is high, the probability of a favorable reception is 0.80; if the actual demand is average, the probability of a favorable reception is 0.55; and if the actual demand is low, the probability of a favorable reception is 0.30. Given a favorable reception, what is the probability of actual high demand? Again what we ask for is P[H|F] = P[HF]/P[F]. Now P[F] = P[H]P[F|H] + P[A]P[F|A] + P[L]P[F|L] = 0.24 + 0.275 + 0.06 = 0.575. Also P[HF] = P[F|H]P[H] = 0.24. Hence P[H|F] = 0.24/0.575 = 0.4174.
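Both calculations are short enough to script. The sketch below is only an illustration (the helper function and its name are not part of the notes); it recomputes the two posteriors with Bayes' rule, using the numbers from the examples above.

    # Recompute the posteriors of the two worked examples via Bayes' rule.
    def posterior(priors, likelihoods):
        """Posterior probabilities for mutually exclusive events B1,...,Bn."""
        joint = [p * l for p, l in zip(priors, likelihoods)]
        total = sum(joint)                     # law of total probability
        return [j / total for j in joint]

    # Machines: P[A]=0.6, P[B]=0.4, P[D|A]=0.02, P[D|B]=0.04
    print(posterior([0.6, 0.4], [0.02, 0.04]))        # [0.4285..., 0.5714...]

    # Demand: P[H]=0.3, P[A]=0.5, P[L]=0.2 with P[F|.]=0.80, 0.55, 0.30
    print(posterior([0.3, 0.5, 0.2], [0.80, 0.55, 0.30]))   # first entry ≈ 0.4174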

Example: There are five boxes, numbered 1 to 5. Each box contains 10 balls. Box i has i defective balls and 10 − i non-defective balls, i = 1, 2, ..., 5. Consider the following random experiment: first a box is selected at random, and then a ball is selected at random from the selected box. 1) What is the probability that a defective ball will be selected? 2) If we have already selected the ball and noted that it is defective, what is the probability that it came from box 5? Let A denote the event that a defective ball is selected and Bi the event that box i is selected, i = 1, 2, ..., 5. Note that P[Bi] = 1/5 for i = 1, 2, ..., 5, and P[A|Bi] = i/10. Question 1) asks what is P[A]? Using the theorem of total probability we have:


P[A] = Σ_{i=1}^{5} P[A|Bi] P[Bi] = Σ_{i=1}^{5} (i/10)(1/5) = 3/10.

Notice that the total number of defective balls is 15 out of 50. Hence in this case we can say that P[A] = 15/50 = 3/10.

This is true as the probability of choosing each of the 5 boxes is the same. Question 2) asks what is P[B5|A]. Since box 5 contains more defective balls than box 4, which contains more defective balls than box 3 and so on, we expect to find that P[B5|A] > P[B4|A] > P[B3|A] > P[B2|A] > P[B1|A]. We apply Bayes' theorem:

P[B5|A] = P[A|B5] P[B5] / Σ_{i=1}^{5} P[A|Bi] P[Bi] = ( (1/2)(1/5) ) / (3/10) = 1/3.

Similarly

P[Bj|A] = P[A|Bj] P[Bj] / Σ_{i=1}^{5} P[A|Bi] P[Bi] = ( (j/10)(1/5) ) / (3/10) = j/15

for j = 1, 2, ..., 5. Notice that unconditionally all the Bi's were equally likely.

Let A and B be two events in A and P(.) a probability function. Events A and B are defined to be independent if and only if one of the following conditions is satisfied:
(i) P[AB] = P[A]P[B];
(ii) P[A|B] = P[A] if P[B] > 0;
(iii) P[B|A] = P[B] if P[A] > 0.
These are equivalent definitions, except that (i) does not really require P(A), P(B) > 0. Notice that the independence of two events A and B and the property that A and B are mutually exclusive are distinct, though related, properties. We know that if A and B are mutually exclusive then P[AB] = 0. Now if these events are also independent then P[AB] = P[A]P[B], and consequently P[A]P[B] = 0, which means that either P[A] = 0 or P[B] = 0. Hence two mutually exclusive events are independent only if P[A] = 0 or P[B] = 0. On the other hand, if P[A] ≠ 0 and P[B] ≠ 0, then if A and B are independent they cannot be mutually exclusive, and conversely, if they are mutually exclusive they cannot be independent. Also notice that independence is not transitive, i.e., A independent of B and B independent of C does not imply that A


is independent of C.

Example: Consider tossing two dice. Let A denote the event of an odd total, B the event of an ace on the first die, and C the event of a total of seven. We ask the following: (i) Are A and B independent? (ii) Are A and C independent? (iii) Are B and C independent?
(i) P(A|B) = 1/2 and P(A) = 1/2, hence P(A|B) = P(A) and consequently A and B are independent. (ii) P(A|C) = 1 ≠ P(A) = 1/2, hence A and C are not independent. (iii) P(C|B) = 1/6 = P(C), hence B and C are independent. Notice that although A and B are independent and C and B are independent, A and C are not independent.

Let us extend the independence of two events to several ones. For a given probability space (Ω, A, P[.]), let A1, A2, ..., An be n events in A. Events A1, A2, ..., An are defined to be independent if and only if:

P[Ai Aj] = P[Ai]P[Aj] for i ≠ j,
P[Ai Aj Ak] = P[Ai]P[Aj]P[Ak] for i ≠ j, i ≠ k, k ≠ j,
and so on, up to
P[∩_{i=1}^{n} Ai] = Π_{i=1}^{n} P[Ai].

Notice that pairwise independence does not imply independence, as the following example shows.

Example: Consider tossing two dice. Let A1 denote the event of an odd face on the first die, A2 the event of an odd face on the second die, and A3 the event of an odd total. Then we have: P[A1]P[A2] = (1/2)(1/2) = 1/4 = P[A1 A2], P[A1]P[A3] = 1/4 = P[A3|A1]P[A1] = P[A1 A3], and P[A2 A3] = 1/4 = P[A2]P[A3], hence A1, A2, A3 are pairwise independent. However, notice that P[A1 A2 A3] = 0 ≠ (1/2)(1/2)(1/2) = 1/8 = P[A1]P[A2]P[A3]. Hence A1, A2, A3 are not independent.
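Since the sample space is finite, these claims can be verified by brute-force enumeration. The sketch below is illustrative only; the event functions A1, A2, A3 mirror the example.

    # Enumerate the 36 equally likely outcomes of two dice and check independence.
    from itertools import product
    from fractions import Fraction

    space = list(product(range(1, 7), repeat=2))

    def prob(event):
        return Fraction(sum(1 for w in space if event(w)), len(space))

    A1 = lambda w: w[0] % 2 == 1                 # odd face on first die
    A2 = lambda w: w[1] % 2 == 1                 # odd face on second die
    A3 = lambda w: (w[0] + w[1]) % 2 == 1        # odd total
    both = lambda e, f: (lambda w: e(w) and f(w))

    print(prob(both(A1, A2)), prob(A1) * prob(A2))    # 1/4 and 1/4: pairwise independent
    print(prob(both(A1, A3)), prob(A1) * prob(A3))    # 1/4 and 1/4
    print(prob(both(A2, A3)), prob(A2) * prob(A3))    # 1/4 and 1/4
    triple = lambda w: A1(w) and A2(w) and A3(w)
    print(prob(triple), prob(A1) * prob(A2) * prob(A3))   # 0 and 1/8: not independent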

0.2 Random Variables, Distribution Functions, and Densities

The probability space (S, A, P) is not particularly easy to work with. In practice, we often need to work with spaces with some structure (metric spaces). It is convenient therefore to work with a cardinalization of S by using the notion of a random variable. Formally, a random variable X is just a mapping from the sample space to the real line, i.e., X : S → R, with a certain property: it is a measurable mapping, i.e.

AX = { X⁻¹(B) : B ∈ B } ⊆ A,

that is, for any B in B the inverse image belongs to A, where B is a sigma-algebra on R. The probability measure PX can then be defined by

PX(X ∈ B) = P( X⁻¹(B) ).

It is straightforward to show that AX is a σ-algebra whenever B is. Therefore, PX is a probability measure obeying Kolmogorov's axioms. Hence we have transferred (S, A, P) → (R, B, PX), where B is the Borel σ-algebra when X(S) = R or any uncountable set, and B is P(X(S)), the power set, when X(S) is finite. The function X(.) must be such that the set Ar, defined by Ar = {ω : X(ω) ≤ r}, belongs to A for every real number r, as B is generated by intervals of the form (−∞, r]. The important part of the definition is that in terms of a random experiment, S is the totality of outcomes of that random experiment, and the function, or random variable, X(.) with domain S makes some real number correspond to each outcome of the experiment. The fact that we also require the collection of ω's for which X(ω) ≤ r to be an event (i.e. an element of A) for each real number r is not much of a restriction, since the use of random variables is, in our case, to describe only events.

Example: Consider the experiment of tossing a single coin. Let the random variable X denote the number of heads. In this case S = {head, tail}, and X(ω) = 1


if ω = head, and X(ω) = 0 if ω = tail. So the random variable X associates a real number with each outcome of the experiment. To show that X satisfies the definition we should show that {ω : X(ω) ≤ r} belongs to A for every real number r. Here A = {φ, {head}, {tail}, S}. Now if r < 0, {ω : X(ω) ≤ r} = φ; if 0 ≤ r < 1, then {ω : X(ω) ≤ r} = {tail}; and if r ≥ 1, then {ω : X(ω) ≤ r} = {head, tail} = S. Hence, for each r the set {ω : X(ω) ≤ r} belongs to A and consequently X(.) is a random variable. In the above example the random variable is described in terms of the random experiment as opposed to its functional form, which is the usual case. We can now work with (R, B, PX), which has metric structure and algebra. For example, we toss two dice, in which case the sample space is S = {(1, 1), (1, 2), ..., (6, 6)}. We can define two random variables, the Sum and the Product, with ranges

Sum:     X(S) = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
Product: X(S) = {1, 2, 3, 4, 5, 6, 8, 9, 10, ..., 36}.

The simplest form of random variable is the indicator IA:

IA(s) = 1 if s ∈ A,   IA(s) = 0 if s ∉ A.

This has associated sigma-algebra in S

{φ, S, A, Aᶜ}.

Finally, we give a formal definition of a continuous real-valued random variable.

Definition 2 A random variable is continuous if its probability measure PX is absolutely continuous with respect to Lebesgue measure, i.e., PX(A) = 0 whenever λ(A) = 0.


0.2.1 Distribution Functions

Associated with each random variable there is the distribution function FX(x) = PX(X ≤ x), defined for all x ∈ R. This function effectively replaces PX. Note that we can reconstruct PX from FX.

EXAMPLE. S = {H, T}, X(H) = 1, X(T) = 0, (p = 1/2). If x < 0, FX(x) = 0; if 0 ≤ x < 1, FX(x) = 1/2; if x ≥ 1, FX(x) = 1.

EXAMPLE. The logit c.d.f. is

FX(x) = 1 / (1 + e^{−x}).

It is continuous everywhere, strictly increasing, and asymptotes to 0 and 1 at −∞ and +∞ respectively. Note that the distribution function FX(x) of a continuous random variable is a continuous function, while the distribution function of a discrete random variable is a step function.

Theorem 3 A function F(·) is a c.d.f. of a random variable X if and only if the following three conditions hold:
1. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1;
2. F is a nondecreasing function of x;
3. F is right-continuous, i.e., for all x0, lim_{x→x0+} F(x) = F(x0).
In addition, any c.d.f. F is continuous except at a set of points of Lebesgue measure zero.

0.2.2 Continuous Random Variables

A random variable X is called continuous if there exists a function fX(.) such that FX(x) = ∫_{−∞}^{x} fX(u)du for every real number x. In such a case FX(x) is the cumulative distribution and the function fX(.) is the density function.


Notice that according to the above definition the density function is not uniquely determined: if a function is changed at a few points, its integral is unchanged. Furthermore, notice that fX(x) = dFX(x)/dx. The notations for discrete and continuous density functions are the same, yet they have different interpretations. We know that for discrete random variables fX(x) = P[X = x], which is not true for continuous random variables. Furthermore, for discrete random variables fX(.) is a function with domain the real line and counterdomain the interval [0, 1], whereas for continuous random variables fX(.) is a function with domain the real line and counterdomain the interval [0, ∞). Note that for a continuous r.v.

P(X = x) ≤ P(x − ε ≤ X ≤ x) = FX(x) − FX(x − ε) → 0 as ε → 0,

by the continuity of FX(x). The set {X = x} is an example of a set of measure (in this case the measure is P or PX) zero. In fact, any countable set is of measure zero under a distribution which is absolutely continuous with respect to Lebesgue measure. Because the probability of a singleton is zero, P(a ≤ X ≤ b) = P(a ≤ X < b) = P(a < X < b) for any a, b.

Example: Let X be the random variable representing the length of a telephone conversation. One could model this experiment by assuming that the distribution of X is given by FX(x) = 1 − e^{−λx}, where λ is some positive number and the random variable can take values only in the interval [0, ∞). The density function is dFX(x)/dx = fX(x) = λe^{−λx}. If we assume that telephone conversations are measured in minutes,

P[5 < X ≤ 10] = ∫_{5}^{10} fX(x)dx = ∫_{5}^{10} λe^{−λx}dx = e^{−5λ} − e^{−10λ},

and for λ = 1/5 we have that P[5 < X ≤ 10] = e^{−1} − e^{−2} = 0.23.

The example above indicates that the density functions of continuous random variables are used to calculate probabilities of events defined in terms of the corresponding continuous random variable X, i.e. P[a < X ≤ b] = ∫_{a}^{b} fX(x)dx. Again, we can give the definition of the density function without any reference to the random variable, i.e. any function f(.) with domain the real line and counterdomain [0, ∞) is defined to be a probability density function iff
(i) f(x) ≥ 0 for all x, and
(ii) ∫_{−∞}^{∞} f(x)dx = 1.
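As a quick numerical illustration of the telephone-conversation example, the sketch below (assuming λ = 1/5 as in the text) evaluates P[5 < X ≤ 10] both from the c.d.f. and as an integral of the density; the Riemann-sum step size is an arbitrary choice.

    import math

    lam = 1 / 5
    F = lambda x: 1 - math.exp(-lam * x)      # exponential c.d.f.
    f = lambda x: lam * math.exp(-lam * x)    # its density f_X(x) = dF_X(x)/dx

    print(F(10) - F(5))                       # 0.2325..., i.e. P[5 < X <= 10]

    # the same probability as a crude Riemann sum of the density over (5, 10]
    n = 100_000
    h = 5 / n
    print(sum(f(5 + (i + 0.5) * h) for i in range(n)) * h)   # ≈ 0.2325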

In practice, when we refer to a certain distribution of a random variable, we state its density or cumulative distribution function. However, notice that not all random variables are either discrete or continuous.

0.3 Expectations and Moments of Random Variables

An extremely useful concept in problems involving random variables or distributions is that of expectation.

0.3.1 Mean or Expectation

Let X be a random variable. The mean or the expected value of X, denoted by E[X] or μX, is defined by:

(i) E[X] = Σ_j xj P[X = xj] = Σ_j xj fX(xj), if X is a discrete random variable with counterdomain the countable set {x1, ..., xj, ...};

(ii) E[X] = ∫_{−∞}^{∞} x fX(x)dx, if X is a continuous random variable with density function fX(x) and if either ∫_{0}^{∞} x fX(x)dx < ∞ or |∫_{−∞}^{0} x fX(x)dx| < ∞ or both;

(iii) E[X] = ∫_{0}^{∞} [1 − FX(x)]dx − ∫_{−∞}^{0} FX(x)dx for an arbitrary random variable X.

(i) and (ii) are used in practice to find the mean for discrete and continuous random variables, respectively; (iii) is used for the mean of a random variable that is neither discrete nor continuous. Notice that in the above definition we assume that the sum and the integrals exist. Also, the summation in (i) runs over the possible values of j, and the


jth term is the value of the random variable multiplied by the probability that the random variable takes this value. Hence E[X] is an average of the values that the random variable takes on, where each value is weighted by the probability that the random variable takes this value. Values that are more probable receive more weight. The same is true in the integral form in (ii): there the value x is multiplied by the approximate probability that X equals the value x, i.e. fX(x)dx, and then integrated over all values.

Notice that in the definition of the mean of a random variable, only density functions or cumulative distributions were used. Hence we have really defined the mean for these functions without reference to random variables. We then call the defined mean the mean of the cumulative distribution or the appropriate density function. Hence, we can speak of the mean of a distribution or density function as well as the mean of a random variable. Notice that E[X] is the center of gravity (or centroid) of the unit mass that is determined by the density function of X. So the mean of X is a measure of where the values of the random variable are centered or located, i.e. it is a measure of central location.

Example: Consider the experiment of tossing two dice. Let X denote the total of the upturned faces. Then for this case we have E[X] = Σ_{i=2}^{12} i fX(i) = 7.

Example: Consider an X that can take only two possible values, 1 and −1, each with probability 0.5. Then the mean of X is E[X] = 1 × 0.5 + (−1) × 0.5 = 0. Notice that the mean in this case is not one of the possible values of X.

Example: Consider a continuous random variable X with density function fX(x) = λe^{−λx} for x ∈ [0, ∞). Then E[X] = ∫_{−∞}^{∞} x fX(x)dx = ∫_{0}^{∞} x λe^{−λx}dx = 1/λ.

Example: Consider a continuous random variable X with density function fX(x) = x^{−2} for x ∈ [1, ∞). Then E[X] = ∫_{−∞}^{∞} x fX(x)dx = ∫_{1}^{∞} x · x^{−2}dx = lim_{b→∞} log b = ∞,

so we say that the mean does not exist, or that it is infinite.

Median of X: When FX is continuous and strictly increasing, we can define the median of X, denoted m(X), as the unique solution of FX(m) = 1/2. Since in this case FX⁻¹(·) exists, we can alternatively write m = FX⁻¹(1/2). For a discrete r.v. there may be many m that satisfy this, or there may be none. Suppose X takes the values 0, 1, 2, each with probability 1/3; then there does not exist an m with FX(m) = 1/2. Also, if X takes the values 0, 1, 2, 3, each with probability 1/4, then any 1 ≤ m ≤ 2 is an adequate median.

Note that if E(X^n) exists, then so does E(X^{n−1}), but not vice versa (n > 0). Also, when the support is infinite the expectation does not necessarily exist:

If ∫_{0}^{∞} x fX(x)dx = ∞ but ∫_{−∞}^{0} x fX(x)dx > −∞, then E(X) = ∞.
If ∫_{0}^{∞} x fX(x)dx = ∞ and ∫_{−∞}^{0} x fX(x)dx = −∞, then E(X) is not defined.

Example: [Cauchy] fX(x) = (1/π) · 1/(1 + x²). This density function is symmetric about zero, and one is tempted to say that E(X) = 0. But ∫_{0}^{∞} x fX(x)dx = ∞ and ∫_{−∞}^{0} x fX(x)dx = −∞, so E(X) does not exist according to the above definition.
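A short simulation makes the point concrete: because the Cauchy mean does not exist, sample averages fail to settle down as the sample grows. The sketch below is illustrative only; the seed and sample sizes are arbitrary choices.

    import random
    import math

    random.seed(0)

    def cauchy_draw():
        # inverse-c.d.f. method: if U ~ U(0,1) then tan(pi*(U - 1/2)) is standard Cauchy
        return math.tan(math.pi * (random.random() - 0.5))

    for n in (10**2, 10**4, 10**6):
        sample = [cauchy_draw() for _ in range(n)]
        print(n, sum(sample) / n)   # the sample means keep jumping around; they do not converge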

Now consider Y = g(X), where g is a (piecewise) monotonic continuous function. Then

E(Y) = ∫_{−∞}^{∞} y fY(y)dy = ∫_{−∞}^{∞} g(x) fX(x)dx = E(g(X)).
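This identity (the expectation of a function of X can be computed directly from fX, without deriving fY) is easy to check by simulation. In the sketch below, which is only an illustration, g(x) = x² and X is exponential with rate λ, so the exact answer is E(X²) = var(X) + (E X)² = 2/λ².

    import random

    random.seed(1)
    lam = 0.5
    n = 500_000
    draws = [random.expovariate(lam) for _ in range(n)]

    print(sum(x * x for x in draws) / n)   # Monte Carlo estimate of E(g(X)) = E(X^2)
    print(2 / lam**2)                      # exact value, here 8.0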


Theorem 4 Expectation has the following properties:
1. [Linearity] E(a1 g1(X) + a2 g2(X) + a3) = a1 E(g1(X)) + a2 E(g2(X)) + a3.
2. [Monotonicity] If g1(x) ≥ g2(x) for all x, then E(g1(X)) ≥ E(g2(X)).
3. [Jensen's inequality] If g(x) is a weakly convex function, i.e., g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y) for all x, y and all λ with 0 ≤ λ ≤ 1, then E(g(X)) ≥ g(E(X)).

An Interpretation of Expectation. We claim that E(X) is the unique minimizer of E(X − θ)² with respect to θ, assuming that the second moment of X is finite.

Theorem 5 Suppose that E(X²) exists and is finite. Then E(X) is the unique minimizer of E(X − θ)² with respect to θ.

This theorem says that the expectation is the constant closest to X in mean square error.

0.3.2 Variance

Let X be a random variable and μX = E[X]. The variance of X, denoted by σX² or var[X], is defined by:

(i) var[X] = Σ_j (xj − μX)² P[X = xj] = Σ_j (xj − μX)² fX(xj), if X is a discrete random variable with counterdomain the countable set {x1, ..., xj, ...};

(ii) var[X] = ∫_{−∞}^{∞} (x − μX)² fX(x)dx, if X is a continuous random variable with density function fX(x);

(iii) var[X] = ∫_{0}^{∞} 2x[1 − FX(x) + FX(−x)]dx − μX² for an arbitrary random variable X.

The variances are defined only if the series in (i) is convergent or if the integrals in (ii) or (iii) exist. Again, the variance of a random variable is defined in terms of


the density function or cumulative distribution function of the random variable, and consequently variance can be defined in terms of these functions without reference to a random variable.

Notice that variance is a measure of spread, since if the values of the random variable X tend to be far from their mean, the variance of X will be larger than the variance of a comparable random variable whose values tend to be near their mean. It is clear from (i), (ii) and (iii) that the variance is a nonnegative number. If X is a random variable with variance σX², then the standard deviation of X, denoted by σX, is defined as √var(X). The standard deviation of a random variable, like the variance, is a measure of spread or dispersion of the values of a random variable. In many applications it is preferable to the variance since it has the same measurement units as the random variable itself.

Example: Consider the experiment of tossing two dice. Let X denote the total of the upturned faces. Then for this case we have (μX = 7): var[X] = Σ_{i=2}^{12} (i − μX)² fX(i) = 210/36.

Example: Consider an X that can take only two possible values, 1 and −1, each with probability 0.5. Then the variance of X is (μX = 0): var[X] = 0.5 × 1² + 0.5 × (−1)² = 1.

Example: Consider an X that can take only two possible values, 10 and −10, each with probability 0.5. Then we have μX = E[X] = 10 × 0.5 + (−10) × 0.5 = 0 and var[X] = 0.5 × 10² + 0.5 × (−10)² = 100. Notice that in the last two examples the two random variables have the same mean but different variances, the larger variance belonging to the random variable whose values lie further away from the mean.

Example: Consider a continuous random variable X with density function fX(x) = λe^{−λx} for x ∈ [0, ∞). Then (μX = 1/λ): var[X] = ∫_{−∞}^{∞} (x − μX)² fX(x)dx = ∫_{0}^{∞} (x − 1/λ)² λe^{−λx}dx = 1/λ².

Example: Consider a continuous random variable X with density function fX(x) = x^{−2} for x ∈ [1, ∞). Then we know that the mean of X does not exist; consequently, we cannot define the variance. Notice that

Var(X) = E[(X − E(X))²] = E(X²) − E²(X),   Var(aX + b) = a² Var(X),

SD = √Var,   SD(aX + b) = |a| SD(X),

i.e., SD(X) changes proportionally with the scale of X. Variance and standard deviation measure dispersion: the higher the variance, the more spread out the distribution. The interquartile range, FX⁻¹(3/4) − FX⁻¹(1/4), the range of the middle half, always exists and is an alternative measure of dispersion.

0.3.3 Higher Moments of a Random Variable

If X is a random variable, the rth raw moment of X, denoted by μ′r, is defined as μ′r = E[X^r], if this expectation exists. Notice that μ′1 = E[X] = μX, the mean of X. If X is a random variable, the rth central moment of X about α is defined as E[(X − α)^r]. If α = μX, we have the rth central moment of X about μX, denoted by μr, which is μr = E[(X − μX)^r].

We also have measures defined in terms of quantiles to describe some of the characteristics of random variables or density functions. The qth quantile of a random variable X or of its corresponding distribution is denoted by ξq and is defined as the smallest number ξ satisfying FX(ξ) ≥ q. If X is a continuous random variable, then the qth quantile of X is the smallest number ξ satisfying FX(ξ) = q.


The median of a random variable X, denoted by medX or med(X), or ξ_{0.5}, is the 0.5th quantile. Notice that if X is a continuous random variable, the median of X satisfies

∫_{−∞}^{med(X)} fX(x)dx = 1/2 = ∫_{med(X)}^{∞} fX(x)dx,

so the median of X is any number that has half the mass of X to its right and the other half to its left. The median and the mean are measures of central location.

The third moment about the mean, μ3 = E(X − E(X))³, is called a measure of asymmetry, or skewness. Symmetrical distributions can be shown to have μ3 = 0. Distributions can be skewed to the left or to the right. However, knowledge of the third moment gives no clue as to the shape of the distribution; it could be the case that μ3 = 0 but the distribution is far from symmetrical. The ratio μ3/σ³ is unitless and is called the coefficient of skewness. An alternative measure of skewness is provided by the ratio (mean − median)/(standard deviation).

The fourth moment about the mean, μ4 = E(X − E(X))⁴, is used as a measure of kurtosis, the degree of flatness of a density near its center. The coefficient of kurtosis is defined as μ4/σ⁴ − 3; positive values are sometimes used to indicate that a density function is more peaked around its center than the normal (leptokurtic distributions), while negative values indicate a distribution which is flatter around its center than the standard normal (platykurtic distributions). This measure suffers from the same failing as the measure of skewness, i.e. it does not always measure what it is supposed to.

While a particular moment, or a few of the moments, may give little information about a distribution, the entire set of moments will determine the distribution exactly. In applied statistics the first two moments are of great importance, but the third and fourth are also useful.
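The two coefficients are straightforward to estimate from data. The sketch below is an illustrative helper, not a library routine; it computes sample versions of μ3/σ³ and μ4/σ⁴ − 3. For exponential data the population values are 2 and 6, which the estimates should approximate.

    import random

    def skew_kurt(xs):
        n = len(xs)
        m = sum(xs) / n
        m2 = sum((x - m) ** 2 for x in xs) / n      # second central moment
        m3 = sum((x - m) ** 3 for x in xs) / n      # third central moment
        m4 = sum((x - m) ** 4 for x in xs) / n      # fourth central moment
        return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

    random.seed(2)
    print(skew_kurt([random.gauss(0, 1) for _ in range(200_000)]))        # ≈ (0, 0)
    print(skew_kurt([random.expovariate(1.0) for _ in range(200_000)]))   # ≈ (2, 6)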


0.3.4 Moment Generating Functions Finally we turn to the moment generating function (mgf) and characteristic Function (cf). The mgf is defined as ¡ ¢ MX (t) = E etX =

Z



etx fX (x)dx

−∞

for any real t, provided this integral exists in some neighborhood of 0. It is the Laplace transform of the function fX (·) with argument −t. We have the useful inversion formula fX (x) =

Z



MX (t) e−tx dt

−∞

The mgf is of limited use, since it does not exist for many r.v. the cf is applicable more generally, since it always exists: Z Z ∞ ¡ itX ¢ itx e fX (x)dx = ϕX (t) = E e = −∞



cos (tx) fX (x)dx+i

−∞

Z



sin (tx) fX (x)dx

−∞

This essentially is the Fourier transform of the function fX (·) and there is a well defined inversion formula 1 fX (x) = √ 2π

Z



e−itx ϕX (t) dt

−∞

If X is symmetric about zero, the complex part of cf is zero. Also, ¡ r r itX ¢ dr ϕ (0) = E iX e ↓t=0 = ir E (X r ) , X r dt

r = 1, 2, 3, ..

Thus the moments of X are related to the derivative of the cf at the origin. If c (t) =

Z



exp (itx) dF (x)

−∞

notice that dr c (t) = dtr and

Z



(ix)r exp (itx) dF (x)

−∞

¯ ¯ Z ∞ r ¯ d c (t) dr c (t) ¯¯ r r / r / ¯ = (ix) dF (x) = (i) μ ⇒ μ = (−i) r r dtr ¯t=0 dtr ¯t=0 −∞


the rth uncenterd moment. Now expanding c (t) in powers of t we get ¯ ¯ r dr c (t) ¯¯ dr c (t) ¯¯ (t)r / / (it) c (t) = c (0) + + ... = 1 + μ1 (it) + ... + μr + ... t + ... + dtr ¯t=0 dtr ¯t=0 r! r! The cummulants are defined as the coefficients κ1 , κ2 , ..., κr of the identity in it ! à (it)2 (it)r (it)r / + ... + κr + ... = 1 + μ1 (it) + ... + μ/r + ... exp κ1 (it) + κ2 2! r! r! Z ∞ = c (t) = exp (itx) dF (x) −∞

The cumulant-moment connection: Suppose X is a random variable with n moments a1, ..., an. Then X has n cumulants k1, ..., kn and

a_{r+1} = Σ_{j=0}^{r} C(r, j) a_j k_{r+1−j}   for r = 0, ..., n − 1,

where a_0 = 1 and C(r, j) denotes the binomial coefficient. Writing this out for r = 0, ..., 3 produces:

a1 = k1
a2 = k2 + a1 k1
a3 = k3 + 2a1 k2 + a2 k1
a4 = k4 + 3a1 k3 + 3a2 k2 + a3 k1.

These recursive formulas can be used to calculate the a's efficiently from the k's, and vice versa. When X has mean 0, that is, when a1 = 0 = k1, aj becomes μj = E((X − E(X))^j), so the above formulas simplify to:

μ2 = k2
μ3 = k3
μ4 = k4 + 3k2².
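The recursion is easy to implement directly. The sketch below (an illustrative helper) converts a list of cumulants into raw moments; the standard normal, with cumulants (0, 1, 0, 0), is used as a check since its first four moments are 0, 1, 0, 3.

    from math import comb

    def cumulants_to_moments(k):
        """k = [k1, k2, ..., kn]; returns the raw moments [a1, a2, ..., an]."""
        n = len(k)
        a = [1.0]                                   # a_0 = 1
        for r in range(n):                          # builds a_{r+1}
            a.append(sum(comb(r, j) * a[j] * k[r - j] for j in range(r + 1)))
        return a[1:]

    print(cumulants_to_moments([0.0, 1.0, 0.0, 0.0]))   # [0.0, 1.0, 0.0, 3.0]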


0.3.5 Expectations of Functions of Random Variables

Product and Quotient. Let f (X, Y ) =

X , Y

E (X) = μX and E (Y ) = μY . Then, expanding f (X, Y ) =

X Y

around (μX , μY ), we have f (X, Y ) = as

∂f ∂X

=

1 Y

μ μ 1 μX 1 + (X − μX )− X 2 (Y − μY )+ X 3 (Y − μY )2 − (X − μX ) (Y − μY ) μY μY (μY ) (μY ) (μY )2 ,

∂f ∂Y

= − YX2 ,

∂2f ∂X 2

= 0,

∂2f ∂X∂Y

=

∂2f ∂Y ∂X

= − Y12 , and

∂2f ∂Y 2

= 2 YX3 . Taking

expectations we have µ ¶ X μ 1 μ E Cov (X, Y ) . = X + X 3 V ar (Y ) − Y μY (μY ) (μY )2 For the variance, take again the variance of the Taylor expansion and keeping only terms up to order 2 we have: µ ¶ ∙ ¸ (μX )2 V ar (X) V ar (Y ) Cov (X, Y ) X V ar . = + −2 Y μX μY (μY )2 (μX )2 (μY )2 0.3.6 Continuous Distributions UNIFORM ON [a, b]. A very simple distribution for a continuous random variable is the uniform distribution. Its density function is:

f (x|a, b) = and F (x|a, b) =

⎧ ⎨ ⎩

Z

1 b−a

if

x ∈ [a, b]

,

0 otherwise x

f (z|a, b) dz =

a

x−a , b−a

where −∞ < a < b < ∞. Then the random variable X is defined to be uniformly distributed over the interval [a, b]. Now if X is uniformly distributed over [a, b] then a+b , E (X) = 2

a+b median = , 2

(b − a)2 V ar (X) = . 12


If X ∼ U[a, b], then X − a ∼ U[0, b − a] and (X − a)/(b − a) ∼ U[0, 1]. Notice that if

a random variable is uniformly distributed over one of the following intervals [a, b), (a, b], (a, b) the density function, expected value and variance does not change. Exponential Distribution If a random variable X has a density function given by:

fX (x) = fX (x; λ) = λe−λx

f or 0 ≤ x < ∞

where λ > 0 then X is defined to have an (negative) exponential distribution. Now this random variable X we have E[X] =

1 λ

and

var[X] =

1 λ2

Pareto-Levy or Stable Distributions The stable distributions are a natural generalization of the normal in that, as their name suggests, they are stable under addition, i.e. a sum of stable random variables is also a random variable of the same type. However, nonnormal stable distributions have more probability mass in the tail areas than the normal. In fact, the nonnormal stable distributions are so fat-tailed that their variance and all higher moments are infinite. Closed form expressions for the density functions of stable random variables are available for only the cases of normal and Cauchy. If a random variable X has a density function given by:

fX (x) = fX (x; γ, δ) =

γ 1 2 π γ + (x − δ)2

for

−∞ 0

α is shape parameter, β is a scale parameter. Here Γ (α) = Gamma function, Γ (n) = n!. The χ2k is when α = k, and β = 1.

R∞ 0

tα−1 e−t dt is the


Notice that we can approximate the Poisson and Binomial distributions by the normal, in the sense that if a random variable X is distributed as Poisson with parameter λ, then (X − λ)/√λ is distributed approximately as standard normal. On the other hand, if Y ∼ Binomial(n, p), then (Y − np)/√(np(1 − p)) ∼ N(0, 1) approximately.

The standard normal is an important distribution for another reason as well. Assume that we have a sample of n independent random variables, x1, x2, ..., xn, coming from the same distribution with mean m and variance s². Then we have, approximately for large n,

(1/√n) Σ_{i=1}^{n} (xi − m)/s ∼ N(0, 1).

This is the well-known Central Limit Theorem for independent observations.
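A small simulation illustrates the statement. In the sketch below (illustrative choices of distribution, sample size and number of replications) the standardized means of uniform draws behave like standard normal draws.

    import random
    import statistics

    random.seed(3)
    m, s = 0.5, (1 / 12) ** 0.5          # mean and s.d. of a U(0,1) variable
    n, reps = 50, 20_000

    z = []
    for _ in range(reps):
        xs = [random.random() for _ in range(n)]
        xbar = sum(xs) / n
        z.append((xbar - m) / (s / n ** 0.5))    # the standardized sample mean

    print(statistics.mean(z), statistics.stdev(z))     # ≈ 0 and ≈ 1
    print(sum(abs(v) < 1.96 for v in z) / reps)        # ≈ 0.95, as for a standard normal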

0.4 Multivariate Random Variables

We now consider the extension to multiple r.v., i.e., X = (X1 , X2 , .., Xk ) ∈ Rk The joint pmf, fX (x), is a function with P (X ∈ A) =

X

fX (x)

x∈A

The joint pdf, fX (x), is a function with P (X ∈ A) =

Z

fX (x)dx

x∈A

This is a multivariate integral, and in general difficult to compute. If A is a rectangle A = [a1 , b1 ] × ... × [ak , bk ], then Z

x∈A

fX (x)dx =

Zbk

ak

...

Zb1

a1

fX (x)dx1 ..dxk


The joint c.d.f. is defined similarly FX (x) =

X

fX (z1 , z2 , ..., zk )

z1 ≤x1 ,...,zk ≤xk

FX (x) = P (X1 ≤ x1 , ..., Xk ≤ xk ) =

Zx1

−∞

...

Zxn

fX (z1 , z2 , ..., zk )dz1 ..dzk

−∞

The multivariate c.d.f. has similar coordinate-wise properties to a univariate c.d.f. For continuously differentiable c.d.f.’s fX (x) =

∂ k FX (x) ∂x1 ∂x2 ..∂xk

0.4.1 Conditional Distributions and Independence We defined conditional probability P (A|B) = P (A∩B)/P (B) for events with P (B) 6= 0. We now want to define conditional distributions of Y |X. In the discrete case there is no problem fY |X (y|x) = P (Y = y|X = x) =

f (y, x) fX (x)

when the event {X = x} has nonzero probability. Likewise we can define P Y ≤y f (y, x) FY |X (y|x) = P (Y ≤ y|X = x) = fX (x) Note that fY |X (y|x) is a density function and FY |X (y|x) is a c.d.f. 1) fY |X (y|x) ≥ 0 for all y P P f (y,x) 2) y fY |X (y|x) = fyX (x) =

fX (x) fX (x)

=1

In the continuous case, it appears a bit anomalous to talk about the P (y ∈

A|X = x), since {X = x} itself has zero probability of occurring. Still, we define the conditional density function fY |X (y|x) =

f (y, x) fX (x)

in terms of the joint and marginal densities. It turns out that fY |X (y|x) has the properties of p.d.f.


1) fY |X (y|x) ≥ 0 R∞ 2) −∞ fY |X (y|x)dy =


R∞

f (y,x)dy fX (x)

−∞

=

fX (x) fX (x)

= 1.

We can define Expectations within the conditional distribution R∞ Z ∞ yf (y, x)dy E(Y |X = x) = yfY |X (y|x)dy = R−∞ ∞ f (y, x)dy −∞ −∞

and higher moments of the conditional distribution 0.4.2 Independence

We say that Y and X are independent (denoted by ⊥⊥) if P (Y ∈ A, X ∈ B) = P (Y ∈ A)P (X ∈ B) for all events A, B, in the relevant sigma-algebras. This is equivalent to the cdf’s version which is simpler to state and apply. FY X (y, x) = F (y)F (x) In fact, we also work with the equivalent density version f (y, x) = f (y)f (x) for all y, x fY |X (y|x) = f (y) for all y fX|Y (x|y) = f (x) f or all x If Y ⊥⊥ X, then g(X) ⊥⊥ h(Y ) for any measurable functions g, and h. We can generalise the notion of independence to multiple random variables. Thus Y , X, and Z are mutually independent if: f (y, x, z) = f (y)f (x)f (z) f (y, x) = f (y)f (x) f or all y, x f (x, z) = f (x)f (z) for all x, z f (y, z) = f (y)f (z) f or all y, z for all y, x, z.


0.4.3 Examples of Multivariate Distributions

Multivariate Normal. We say that X = (X1, X2, ..., Xk) ∼ MVN_k(μ, Σ) when

fX(x|μ, Σ) = (2π)^{−k/2} [det(Σ)]^{−1/2} exp( −(1/2)(x − μ)′ Σ⁻¹ (x − μ) ),

where Σ is the k × k covariance matrix with elements σ_ij (first row σ_11, σ_12, ..., σ_1k, down to σ_kk) and det(Σ) is the determinant of Σ.

Theorem 6 (a) If X ∼ MVN_k(μ, Σ), then Xi ∼ N(μi, σ_ii) (this is shown by integration of the joint density with respect to the other variables).
(b) The conditional distributions of X = (X1, X2) are normal too:

fX1|X2(x1|x2) ∼ N(μ_{X1|X2}, Σ_{X1|X2}),   where
μ_{X1|X2} = μ1 + Σ12 Σ22⁻¹ (x2 − μ2),   Σ_{X1|X2} = Σ11 − Σ12 Σ22⁻¹ Σ21.

(c) If and only if Σ is diagonal, X1, X2, ..., Xk are mutually independent. In this case det(Σ) = σ_11 σ_22 ⋯ σ_kk and

−(1/2)(x − μ)′ Σ⁻¹ (x − μ) = −(1/2) Σ_{j=1}^{k} (xj − μj)²/σ_jj,

so that

fX(x|μ, Σ) = Π_{j=1}^{k} (2π σ_jj)^{−1/2} exp( −(1/2)(xj − μj)²/σ_jj ).
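Part (b) of the theorem is easy to check numerically in the bivariate case. The sketch below (arbitrary illustrative numbers) computes the conditional mean and variance from the formulas and compares them with a crude Monte Carlo estimate obtained by conditioning on a narrow band around x2.

    import numpy as np

    mu = np.array([1.0, 2.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])
    x2 = 3.0

    cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])   # μ_{X1|X2}
    cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]        # Σ_{X1|X2}
    print(cond_mean, cond_var)                                     # 1.8, 1.36

    # Monte Carlo check: keep draws with X2 close to x2 and look at the X1 moments.
    rng = np.random.default_rng(0)
    draws = rng.multivariate_normal(mu, Sigma, size=2_000_000)
    kept = draws[np.abs(draws[:, 1] - x2) < 0.02, 0]
    print(kept.mean(), kept.var())                                 # ≈ 1.8 and ≈ 1.36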


0.4.4 More on Conditional Distributions We now consider the relationship between two, or more, r.v. when they are not independent. In this case, conditional density fY |X and c.d.f. FY |X is in general varying with the conditioning point x. Likewise for conditional mean E (Y |X), conditional ¡ ¢ median M (Y |X), conditional variance V (Y |X), conditional cf E eitY |X , and other functionals, all of which characterize the relationship between Y and X. Note that

this is a directional concept, unlike covariance, and so for example E (Y |X) can be very different from E (X|Y ).

Regression Models: We start with random variable (Y, X). We can write for any such random variable m(X)

z }| { E (Y |X) | {z }

Y =

systematic

part

ε

z }| { + Y − E (Y |X) {z } | rand om

part

By construction ε satisfies E (ε|X) = 0, but ε is not necessarily independent of X. For example, V ar (ε|X) = V ar (Y − E (Y |X) |X) = V ar (Y |X) = σ 2 (X) can be expected to vary with X as much as m (X) = E (Y |X) . A convenient and popular simplification is to assume that E (Y |X) = α + βX V ar (Y |X) = σ2 For example, in the bivariate normal distribution Y |X has σY E (Y |X) = μY + ρY X (X − μX ) σX ¡ ¢ V ar (Y |X) = σ 2Y 1 − ρ2Y X and in fact ε ⊥⊥ X. We have the following result about conditional expectations


Theorem 7 (1) E (Y ) = E [E (Y |X)] £ ¤ (2) E (Y |X) minimizes E (Y − g (X))2 over all measurable functions g (·)

(3) V ar (Y ) = E [V ar (Y |X)] + V ar [E (Y |X)]

R Proof. (1) Write fY X (y, x) = fY |X (y|x) fX (x) then we have E (Y ) = yfY (y)dy = R ¡R ¢ R ¡R ¢ y fY X (y, x)dx dy = y fY |X (y|x) fX (x) dx dy = ¢ R ¡R R = yfY |X (y|x) dy fX (x) dx = [E(Y |X = x] fX (x) dx = E (E (Y |X)) £ ¤ £ ¤ (2) E (Y − g (X))2 = E [Y − E (Y |X) + E (Y |X) − g (X)]2

= E [Y − E (Y |X)]2 +2E [[Y − E (Y |X)] [E (Y |X) − g (X)]]+E [E (Y |X) − g (X)]2 £ ¤ as now E (Y E (Y |X)) = E (E (Y |X))2 , and E (Y g (X)) = E (E (Y |X) g (X)) we £ ¤ get that E (Y − g (X))2 = E [Y − E (Y |X)]2 +E [E (Y |X) − g (X)]2 ≥ E [Y − E (Y |X)]2 . (3) V ar (Y ) = E [Y − E (Y )]2 = E [Y − E (Y |X)]2 + E [E (Y |X) − E (Y )]2 +2E [[Y − E (Y |X)] [E (Y |X) − E (Y )]]

£ ¤ The first term is E [Y − E (Y |X)]2 = E{E [Y − E (Y |X)]2 |X } = E [V ar (Y |X)] The second term is E [E (Y |X) − E (Y )]2 = V ar [E (Y |X)]

The third term is zero as ε = Y − E (Y |X) is such that E (ε|X) = 0, and E (Y |X) − E (Y ) is measurable with respect to X. Covariance Cov (X, Y ) = E [X − E (X)] E [Y − E (Y )] = E (XY ) − E (X) E (Y ) Note that if X or Y is a constant then Cov (X, Y ) = 0. Also Cov (aX + b, cY + d) = acCov (X, Y ) An alternative measure of association is given by the correlation coefficient ρXY =

Cov (X, Y ) σX σY

Note that ρaX+b,cY +d = sign (a) × sign (c) × ρXY


If E(Y|X) = a = E(Y) almost surely, then Cov(X, Y) = 0. Also, if X and Y are independent r.v. then Cov(X, Y) = 0. Both the covariance and the correlation of random variables X and Y are measures of a linear relationship of X and Y in the following sense: cov[X, Y] will be positive when (X − μX) and (Y − μY) tend to have the same sign with high probability, and cov[X, Y] will be negative when (X − μX) and (Y − μY) tend to have opposite signs with high probability. The actual magnitude of cov[X, Y] does not convey much about how strong the linear relationship between X and Y is, because the variability of X and Y also matters. The correlation coefficient does not have this problem, as we divide the covariance by the product of the standard deviations. Furthermore, the correlation is unitless and −1 ≤ ρ ≤ 1.

These properties are very useful for evaluating the expected return and standard deviation of a portfolio. Assume ra and rb are the returns on assets A and B, and their variances are σa² and σb², respectively. Assume that we form a portfolio of the two assets with weights wa and wb, respectively. If the correlation of the returns of these assets is ρ, find the expected return and standard deviation of the portfolio.

If Rp is the return of the portfolio, then Rp = wa ra + wb rb. The expected portfolio return is E[Rp] = wa E[ra] + wb E[rb]. The variance of the portfolio is

var[Rp] = var[wa ra + wb rb] = E[(wa ra + wb rb)²] − (E[wa ra + wb rb])²
= wa² E[ra²] + wb² E[rb²] + 2 wa wb E[ra rb] − wa²(E[ra])² − wb²(E[rb])² − 2 wa wb E[ra]E[rb]
= wa² {E[ra²] − (E[ra])²} + wb² {E[rb²] − (E[rb])²} + 2 wa wb {E[ra rb] − E[ra]E[rb]}
= wa² var[ra] + wb² var[rb] + 2 wa wb cov[ra, rb]
= wa² σa² + wb² σb² + 2 wa wb ρ σa σb.

In vector format we have E[Rp] = (wa, wb)(E[ra], E[rb])′ and var[Rp] = (wa, wb) Σ (wa, wb)′, where Σ is the 2 × 2 matrix with diagonal elements σa², σb² and off-diagonal elements ρ σa σb.
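In matrix form the portfolio calculation is a one-liner. The sketch below (arbitrary illustrative weights, returns, volatilities and correlation) computes E[Rp] = w′μ and var[Rp] = w′Σw.

    import numpy as np

    w = np.array([0.6, 0.4])                 # portfolio weights w_a, w_b
    mu = np.array([0.08, 0.12])              # expected returns E[r_a], E[r_b]
    sd = np.array([0.15, 0.25])              # standard deviations σ_a, σ_b
    rho = 0.3                                # correlation of the two returns

    Sigma = np.array([[sd[0] ** 2,          rho * sd[0] * sd[1]],
                      [rho * sd[0] * sd[1], sd[1] ** 2         ]])

    exp_ret = w @ mu                         # w_a E[r_a] + w_b E[r_b]
    var_p = w @ Sigma @ w                    # w_a^2 σ_a^2 + w_b^2 σ_b^2 + 2 w_a w_b ρ σ_a σ_b
    print(exp_ret, var_p, var_p ** 0.5)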


From the above example we can see that var[aX + bY] = a² var[X] + b² var[Y] + 2ab cov[X, Y] for random variables X and Y and constants a and b. In fact we can generalize this formula to several random variables X1, X2, ..., Xn and constants a1, a2, ..., an, i.e.

var[a1 X1 + a2 X2 + ... + an Xn] = Σ_{i=1}^{n} ai² var[Xi] + 2 Σ_{i<j} ai aj cov[Xi, Xj].

0.5 Sampling Theory and Sample Statistics

Notice that Γ(t + 1) = tΓ(t),as Γ(t + 1) =

Z∞ 0

t −x

x e dx = −

Z∞ 0

t

−x

x de

=−

¯∞ xt e−x ¯0

+t

Z∞

xt−1 de−x = tΓ(t)

0

and if t is an integer then Γ(t + 1) = t!. Also if t is again an integer then Γ(t + 12 ) = √ 1∗3∗5∗...(2t−1) √ π. Finally Γ( 12 ) = π. 2t Recall that if X is a random variable with density µ ¶k/2 k 1 1 1 fX (x) = x 2 −1 e− 2 x f or 0 < x < ∞ Γ(k/2) 2


where Γ(.) is the gamma function, then X is defined to have a chi-square distribution with k degrees of freedom. Notice that X is distributed as above then: E[X] = k

and

var[X] = 2k

We can prove the following theorem Theorem 12 If the random variables Xi , i = 1, 2, .., k are normally and independently distributed with means μi and variances σ 2i then U=

¶2 k µ X Xi − μ i

i=1

σi

has a chi-square distribution with k degrees of freedom. Proof omitted. Furthermore, Theorem 13 If the random variables Xi , i = 1, 2, .., k are normally and indepenn P 1 dently distributed with mean μ and variance σ 2 , and let S 2 = n−1 (Xi − X n )2 i=1

then

U=

(n − 1)S 2 v χ2n−1 σ2

where χ2n−1 is the chi-square distribution with n−1 degrees of freedom. Proof omitted. 0.5.4 The F Distribution If X is a random variable with density x 2 −1 Γ[(m + n)/2] ³ m ´m/2 fX (x) = Γ(m/2)Γ(n/2) n [1 + (m/n)x](m+n)/2 m

f or 0 < x < ∞

where Γ(.) is the gamma function, then X is defined to have an F distribution with m and n degrees of freedom. Notice that if X is distributed as above, then E[X] = n/(n − 2) (for n > 2) and var[X] = 2n²(m + n − 2) / [m(n − 2)²(n − 4)] (for n > 4).


Theorem 14 If the random variables U and V are independently distributed as chisquare with m and n degrees of freedom, respectively i.e. U v χ2m and V v χ2n independently, then U/m = X v Fm,n V /n where Fm,n is the F distribution with m, n degrees of freedom. Proof omitted. 0.5.5 The Student-t Distribution If X is a random variable with density fX (x) =

Γ[(k + 1)/2] 1 1 √ 2 Γ(k/2) kπ [1 + x /k](k+1)/2

f or

−∞ f (x; p/ )

where p/ is any other value in the interval 0 ≤ p ≤ 1. The likelihood function of n random variables X1 , X2 , ..., Xn is defined to be the joint density of the n random variables, say fX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ; θ), which is considered to be a function of θ. In particular, if X1 , X2 , ..., Xn is a random sample from the density f (x; θ), then the likelihood function is f (x1 ; θ)f (x2 ; θ).....f (xn ; θ). To think of the likelihood function as a function of θ, we shall use the notation L(θ; x1 , x2 , ..., xn ) or L(•; x1 , x2 , ..., xn ) for the likelihood function in general. The likelihood is a value of a density function. Consequently, for discrete random variables it is a probability. Suppose for the moment that θ is known, denoted by θ0 . The particular value of the random variables which is “most likely to occur” /

/

/

is that value x1 , x2 , ..., xn such that fX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ; θ0 ) is a maximum. for


example, for simplicity let us assume that n = 1 and X1 has the normal density with mean 0 and variance 1. Then the value of the random variable which is most /

likely to occur is X1 = 0. By “most likely to occur” we mean the value x1 of X1 /

such that φ0,1 (x1 ) > φ0,1 (x1 ). Now let us suppose that the joint density of n random variables is fX1 ,X2 ,...,Xn (x1 , x2 , ..., xn ; θ), where θ is known. Let the particular values /

/

/

which are observed be represented by x1 , x2 , ..., xn . We want to know from which density is this particular set of values most likely to have come. We want to know /

/

/

from which density (what value of θ) is the likelihood largest that the set x1 , x2 , ..., xn was obtained. in other words, we want to find the value of θ in the admissible set, denoted by b θ, which maximizes the likelihood function L(θ; x1 , x2 , ..., xn ). The value /

/

/

b θ which maximizes the likelihood function is, in general, a function of x1 , x2 , ..., xn , say b θ=b θ(x1 , x2 , ..., xn ). Hence we have the following definition:

Let L(θ) = L(θ; x1 , x2 , ..., xn ) be the likelihood function for the random vari-

ables X1 , X2 , ..., Xn . If b θ [where b θ=b θ(x1 , x2 , ..., xn ) is a function of the observations b= x1 , x2 , ..., xn ] is the value of θ in the admissible range which maximizes L(θ), then Θ

b θ(X1 , X2 , ..., Xn ) is the maximum likelihood estimator of θ. b θ=b θ(x1 , x2 , ..., xn ) is the maximum likelihood estimate of θ for the sample x1 , x2 , ..., xn .

The most important cases which we shall consider are those in which X1 , X2 , ..., Xn is a random sample from some density function f (x; θ), so that the likelihood function is L(θ) = f (x1 ; θ)f (x2 ; θ).....f (xn ; θ) Many likelihood functions satisfy regularity conditions so the maximum likelihood estimator is the solution of the equation dL(θ) =0 dθ Also L(θ) and logL(θ) have their maxima at the same value of θ, and it is sometimes easier to find the maximum of the logarithm of the likelihood. Notice also that if the likelihood function contains k parameters then we find the estimator from the solution of the k first order conditions.


Example: Let a random sample of size n be drawn from the Bernoulli distribution f(x; p) = p^x (1 − p)^{1−x}, where 0 ≤ p ≤ 1. The sample values x1, x2, ..., xn will be a sequence of 0s and 1s, and the likelihood function is

L(p) = Π_{i=1}^{n} p^{xi} (1 − p)^{1−xi} = p^{Σ xi} (1 − p)^{n − Σ xi}.

Letting y = Σ xi, we obtain

log L(p) = y log p + (n − y) log(1 − p)

and

d log L(p)/dp = y/p − (n − y)/(1 − p).

Setting this expression equal to zero we get

p̂ = y/n = (1/n) Σ xi = x̄,

which is intuitively what the estimate of this parameter should be.

Example: Let a random sample of size n be drawn from the normal distribution with density

f(x; μ, σ²) = (2πσ²)^{−1/2} exp( −(x − μ)²/(2σ²) ).

The likelihood function is

L(μ, σ²) = Π_{i=1}^{n} (2πσ²)^{−1/2} exp( −(xi − μ)²/(2σ²) ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^{n} (xi − μ)² ),

and the logarithm of the likelihood function is

log L(μ, σ²) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^{n} (xi − μ)².

To find the maximum with respect to μ and σ² we compute

∂ log L/∂μ = (1/σ²) Σ_{i=1}^{n} (xi − μ)

and

∂ log L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (xi − μ)²,

and setting these derivatives equal to 0 and solving the resulting equations we find the estimates

μ̂ = (1/n) Σ_{i=1}^{n} xi = x̄   and   σ̂² = (1/n) Σ_{i=1}^{n} (xi − x̄)²,

which turn out to be the sample moments corresponding to μ and σ².
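As a quick check of the normal example, the sketch below (simulated data, illustrative only) evaluates the log-likelihood at the closed-form estimates μ̂ = x̄ and σ̂² = (1/n)Σ(xi − x̄)² and at nearby parameter values; the closed-form point gives the largest value.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=500)

    def loglik(mu, sigma2):
        n = x.size
        return (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2)
                - 0.5 * np.sum((x - mu) ** 2) / sigma2)

    mu_hat, s2_hat = x.mean(), x.var()      # closed-form MLEs (var() divides by n)
    print(mu_hat, s2_hat)

    # the log-likelihood is lower at nearby parameter values
    print(loglik(mu_hat, s2_hat))
    print(loglik(mu_hat + 0.1, s2_hat), loglik(mu_hat, s2_hat * 1.1))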

0.6.4 Properties of Point Estimators

One needs to define criteria by which various estimators can be compared. One of these is unbiasedness. An estimator T = t(X1, X2, ..., Xn) is defined to be an unbiased estimator of τ(θ) if and only if Eθ[T] = Eθ[t(X1, X2, ..., Xn)] = τ(θ) for all θ in the admissible space. Other criteria are consistency, mean squared error, etc.

0.7 Maximum Likelihood Estimation

Let the observations be x = (x1 , x2 , ..., xn ), and the Likelihood Function be denoted by L (θ) = d (x; θ), θ ∈ Θ ⊂ z (θ0 ) − . n 2

from the relationship that if |x| < d => −d < x < d. Adding the above two inequalities we have that for Tδ ⇒

µ ¶ ∧ z θ > z (θ0 ) − δ.


Substituting out δ, employing (1) we get µ ¶ ∧ z θ > max z (θ) .

f or Tδ ⇒

θ∈A

Hence ∧







θ∈ /A⇒θ∈ / N ∩ Θ ⇒ θ ∈ N ∩ Θ ⇒ θ ∈ N. ∧

Consequently we have shown that when Tδ is true then θ ∈ N. This implies that ¶ µ ∧ Pr θ ∈ N ≥ Pr (Tδ ) and taking limits, as n → ∞ we have lim Pr (kθ − θ0 k < ε) ≥ lim Pr (Tδ ) = 1

n→∞

n→∞

by the definition of N and by assumption A3. Hence, as ε is any small positive number, we have ∀ε > 0

lim Pr (kθ − θ0 k < ε) = 1

n→∞

which is the definition of probability limit. ¥ When the observations x = (x1 , x2 , ..., xn ) are randomly sampled, we have : (θ) =

n X

ln d (xi ; θ)

i=1

s (θ) =

∂ (θ) X ∂ ln d (xi ; θ) = ∂θ ∂θ i=1 n

∂ 2 (θ) X ∂ 2 ln d (xi ; θ) H (θ) = = . / ∂θ∂θ/ ∂θ∂θ i=1 n

Given now that the observations xi are independent and have the same distribution, with density d (xi ; θ), the same is true for the vectors ∂2

ln d(xi ;θ) . ∂θ∂θ/

∂ ln d(xi ;θ) ∂θ

and the matrices

Consequently we can apply a Central Limit Theorem to get: −1/2

n

−1/2

s (θ) = n

n X ∂ ln d (xi ; θ) i=1

∂θ

³ _ ´ d → N 0, J (θ)


and from the Law of Large Numbers −1

−1

n H (θ) = n

n X ∂ 2 ln d (xi ; θ) i=1

∂θ∂θ/

where _

−1

−1

_

p

→ −J (θ)

−1

J (θ) = n J (θ) = n J (θ) = −n E

µ

¶ ∂ 2 (θ) , ∂θ∂θ/

i.e., the average Information matrix. However, the two asymptotic results apply even if the observations are dependent or identically distributed. To avoid a lengthy exhibition of various cases we make the following assumptions. As n → ∞ we have: Assumption A4.

and

³ _ ´ d n−1/2 s (θ) → N 0, J (θ)

Assumption A5. p

_

n−1 H (θ) → −J (θ) where _ ¡ ¢ J (θ) = J (θ) /n = E ss/ /n = −E (H) /n.

We can now state the following Theorem Theorem 22 Under assumptions A2. and A3.the above two assumptions and identification we have:



¶ µ ∙ ³_ ´−1 ¸ √ ∧ d n θ − θ0 → N 0, J (θ0 )

where θ is the MLE and θ0 the true parameter values. ∧

Proof: As θ maximises the Likelihood Function we have from the first order conditions that µ ¶ ∧ s θ =

µ ¶ ∧ ∂ θ ∂θ

= 0.


From the Mean Value Theorem, around θ0 , we have that ¶ µ ∧ s (θ0 ) + H (θ∗ ) θ − θ0 = 0

(3)

° ° ¸ ∙ ∧ ° °∧ °. θ − θ where θ∗ ∈ θ, θ0 , i.e. kθ∗ − θ0 k ≤ ° 0 ° °

Now from the consistency of the MLE we have that ∧

θ = θ0 + op (1) whereo ∙ p (1)¸ is a random variable that goes to 0 in probability as n → ∞. As now ∧ θ∗ ∈ θ, θ0 , we have that θ∗ = θ0 + op (1)

as well. Hence from 3 we have that ¶ µ √ √ ∧ n θ − θ0 = − [H (θ∗ ) /n]−1 s (θ0 ) / n. As now θ∗ = θ0 + op (1) and under the second assumption we have that ¶ h_ µ i−1 √ √ ∧ n θ − θ0 = J (θ0 ) s (θ0 ) / n + op (1) .

¶ µ √ ∧ d Under now the first assumption the above equation implies that n θ − θ0 → ∙ ³_ ´−1 ¸ . ¥ N 0, J (θ0 )

Example: Let y v N (μ, σ ) i.i.d for t = 1, ..., T . Then t



where θ = ⎝ and

μ σ

2

2

T T T ¡ 2¢ 1 X (θ) = − ln (2π) − ln σ − 2 (yt − μ)2 2 2 2σ t=1 ⎞

⎠. Now

T ∂ 1 X = 2 (yt − μ) ∂μ σ t=1 T ∂ −T 1 X = 2+ 4 (yt − μ)2 . ∂σ 2 2σ 2σ t=1


Now

µ ¶ ∧ ∂ θ ∂μ

Hence ∧

μ= Now 2

H (θ) =

T 1X yt T t=1



∂ (θ) ⎣ = ∂θ∂θ/ − σ14

=

µ ¶ ∧ ∂ θ ∂σ 2



and σ 2 =

= 0.

T ´ 1 X³ ∧ 2 yt − μ . T t=1

− σT2 PT t=1 (yt − μ)

− σ14

T 2σ 4



⎤ (y − μ) t t=1 ⎦, PT 2 t=1 (yt − μ)

PT

1 σ6



and consequently evaluating H (θ) at θ we have ⎡ ⎤ T µ ¶ − 0 ∧ ⎢ ∧ ⎥ H θ = ⎣ σ2 ⎦ T 0 − ∧ 2σ4

which is clearly negative definite. Now the Information matrix is: ⎛⎡ ⎤⎞ PT T 1 t=1 (yt − μ) σ2 σ4 ⎦⎠ J (θ) = −E (H (θ)) = E ⎝⎣ P PT T 2 1 T 1 (y − μ) − + (y − μ) t t t=1 t=1 σ4 2σ 4 σ6 ⎤ ⎡ T 0 2 ⎦ = ⎣ σ 0 2σT 4 0.8

The Classical Tests

Let the null hypothesis be represented by Ω = {θ ∈ Θ : ϕ (θ) = 0} where θ is the vector of parameters and ϕ (θ) = 0 are the restrictions. Consequently the Neyman ratio test is given by: µ ¶ ∧ L θ supθ∈Θ L (θ) = ³ ´ λ (x) = supθ∈Ω L (θ) L e θ

The Classical Tests


where L (θ) is the Likelihood function. As now the ln (·) is a monotonic, strictly increasing, an equivalent test can be based on ∙ µ ¶ ³ ´¸ ∧ LR = 2 ln (λ (x)) = 2 θ − e θ

where where LR is the well known Likelihood Ratio test and (θ) is the log-likelihood function. Using a Taylor expansion of Theorem we get:

³ ´ ∧ e θ around θ and employing the Mean Value

µ ¶ ∧ µ ¶ ∂ θ µ µ ¶ ¶/ ¶ µ ³ ´ ∧ ∧ 1 e ∧ ∂ 2 (θ∗ ) e ∧ e e θ = θ−θ + θ−θ θ + θ−θ 2 ∂θ/ ∂θ∂θ/ µ ¶ ∧ ° ° ° ° µ ¶ ∂ θ ∧° ∧° ∧ ° ° ° ≤ °e °. Now, θ θ − where ° − θ θ = s θ = 0 due to the fact the the first ∗ / ° ° ° ° ∂θ ∧

order conditions are satisfied by the ML estimator θ. Consequently, the LR test is

given by: ∙ µ ¶ µ µ ¶ ¶ ³ ´¸ ∧ ∧ / ∂ 2 (θ ) ∧ ∗ e LR = 2 θ−θ . θ − e θ =− e θ−θ ∂θ∂θ/

Now we know that ¾ h_ h_ i−1 i−1 s (θ ) ´ ½ √ ³ −1 / e √ 0 + op (1) n θ − θ0 = Ik − J (θ0 ) F (θ0 ) [P (θ0 )] F (θ0 ) J (θ0 ) n

and

Hence

¶ h_ µ i−1 s (θ ) √ ∧ √ 0 + op (1) . n θ − θ0 = J (θ0 ) n

µ ¶ h_ i−1 h_ i−1 s (θ ) ∧ √ −1 / e √ 0 + op (1) n θ − θ = − J (θ0 ) F (θ0 ) [P (θ0 )] F (θ0 ) J (θ0 ) n

and consequently µh _ i−1 h_ i−1 s (θ ) ¶/ µ 1 ∂ 2 (θ ) ¶ ∗ −1 / √0 − J (θ0 ) F (θ0 ) [P (θ0 )] F (θ0 ) J (θ0 ) LR = n ∂θ∂θ/ n i−1 h_ i−1 s (θ ) h_ √ 0 + op (1) . J (θ0 ) F / (θ0 ) [P (θ0 )]−1 F (θ0 ) J (θ0 ) n


Now from assumption A5. we have _

n−1 H (θ) = −J (θ) + op (1) , h_ i−1 θ∗ = θ0 + op (1) and P (θ0 ) = F (θ0 ) J (θ0 ) F / (θ0 ) .

Hence LR =

µ

s (θ0 ) √ n

¶/ h _ i−1 h_ i−1 s (θ ) √ 0 + op (1) . J (θ0 ) F / (θ0 ) [P (θ0 )]−1 F (θ0 ) J (θ0 ) n

We can now state the following Theorem:

Theorem 23 Under the usual assumptions and under the null Hypothesis we have that

where

Now

∙ µ ¶ ³ ´¸ ∧ d LR = 2 θ − e θ → χ2r .

Proof: The Likelihood Ratio is written as h i−1 / / LR = (ξ 0 )/ Z0 Z0 Z0 Z0 ξ 0 + op (1) h_ i−1/2 s (θ ) √ 0 = ξ0, J (θ0 ) n

h_ i−1/2 and Z0 = J (θ0 ) F / (θ0 ) .

h_ i−1/2 s (θ ) d √ 0 → N (0, Ik ) J (θ0 ) n

h i−1 / / and Z0 Z0 Z0 Z0 is symmetric idempotent. Hence µ h µh ¶ µ h i−1 ¶ i−1 ¶ i−1 / / / / / / Z0 = tr Z0 Z0 Z0 Z0 = tr Z0 Z0 Z0 Z0 = tr (Ir ) = r. r Z0 Z0 Z0 Consequently, we get the result. ¥

µ ¶ The Wald test is based on the idea that if the restrictions are correct the vector ∧ ϕ θ should be close to zero. µ ¶ ∧ Expanding ϕ θ around ϕ (θ0 ) we get: µ ¶ ¶ µ µ ¶ ∧ ∧ ∂ϕ (θ∗ ) ∧ θ − θ0 = F (θ∗ ) θ − θ0 ϕ θ = ϕ (θ0 ) + ∂θ/


as under the null ϕ (θ0 ) = 0. Hence ¶ µ ¶ µ ∧ ∧ √ √ nϕ θ = nF (θ∗ ) θ − θ0 and consequently ¶ µ ¶ µ ∧ ∧ √ √ nϕ θ = nF (θ0 ) θ − θ0 + op (1) . Furthermore recall that ¶ µ ∙ ³_ ´−1 ¸ √ ∧ d . n θ − θ0 → N 0, J (θ0 ) Hence,

µ ¶ ∙ ¸ ³_ ´−1 ∧ √ d / nϕ θ → N 0, F (θ0 ) J (θ0 ) F (θ0 ) .

Let us now consider the following quadratic:

¸−1 µ ¶ ∙ µ ¶¸/ ∙ ³_ ´−1 ∧ ∧ / F (θ0 ) J (θ0 ) F (θ0 ) ϕ θ , n ϕ θ

µ ¶ ∧ √ which is the square of the Mahalanobis distance of the nϕ θ vector. However the

above quantity can not be considered as a statistic as it is a function of the unknown parameter θ0 . The Wald test is given by the above quantity if the unknown vector of ∧

parameters θ0 is substituted by the ML estimator θ, i.e. ∙ µ ¶¸/ " µ ¶ µ _ µ ¶¶−1 µ ¶#−1 µ ¶ ∧ ∧ ∧ ∧ ∧ W = ϕ θ F θ nJ θ F/ θ ϕ θ µ ¶#−1 µ ¶ ∙ µ ¶¸/ " µ ¶ µ µ ¶¶−1 ∧ ∧ ∧ ∧ ∧ F θ J θ F/ θ ϕ θ , = ϕ θ

µ ¶ µ ¶ ∧ ∧ where J θ is the estimated information matrix. In case that J θ does not have

an explicit formula it can be substituted by a consistent estimator, e.g. by µ ¶ ∧ 2 ∂ θ n X ∧ J =− / i=1 ∂θ∂θ


or by the asymptotically equivalent ∧

J=

µ ¶ µ ¶ ∧ ∧ ∂ ∂ θ θ n X i=1

∂θ

∂θ/

.

Hence the Wald statistic is given by ∙ µ ¶¸/ " µ ¶ µ ¶−1 µ ¶#−1 µ ¶ ∧ ∧ ∧ ∧ ∧ W = ϕ θ F θ J F/ θ ϕ θ . Now we can prove the following Theorem: Theorem 24 Under the usual regularity assumptions and the null hypothesis we that ∙ µ ¶¸ " µ ¶ µ ¶ µ ¶#−1 µ ¶ /



W = ϕ θ

F



θ



J

−1



F/ θ



ϕ θ

d

→ χ2r .

Proof: For any consistent estimator of $\theta_0$ we have that
$$F(\hat\theta)\,\hat J^{-1}F'(\hat\theta) = F(\theta_0)\left(n\bar J(\theta_0)\right)^{-1}F'(\theta_0) + o_p(1).$$
Hence
$$W = n\left[\varphi(\hat\theta)\right]'\left[F(\theta_0)\left[\bar J(\theta_0)\right]^{-1}F'(\theta_0)\right]^{-1}\varphi(\hat\theta) + o_p(1).$$
Furthermore,
$$\sqrt{n}\,\varphi(\hat\theta) \overset{d}{\to} N\left(0, F(\theta_0)\left[\bar J(\theta_0)\right]^{-1}F'(\theta_0)\right),$$
and the result follows. ¥
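A minimal numerical sketch of the Wald statistic (my own illustration, not from the notes) is given below, again for $H_0:\mu=0$ in an i.i.d. $N(\mu,\sigma^2)$ sample, with the information matrix estimated by the outer-product form $\hat J$ above. The example data and seed are assumptions of the sketch.

```python
# A minimal sketch: Wald statistic for H0: mu = 0 with an OPG estimate of the information.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=0.3, scale=2.0, size=500)
n = y.size

mu_hat, sig2_hat = y.mean(), y.var()            # unrestricted MLEs

# Per-observation scores d l_i / d(mu, sigma^2), evaluated at the MLE
s_mu = (y - mu_hat) / sig2_hat
s_sig2 = -0.5 / sig2_hat + 0.5 * (y - mu_hat) ** 2 / sig2_hat ** 2
S = np.column_stack([s_mu, s_sig2])
J_hat = S.T @ S                                 # outer-product (OPG) estimate of J

phi = np.array([mu_hat])                        # phi(theta) = mu, so F = [1, 0]
F = np.array([[1.0, 0.0]])
W = phi @ np.linalg.inv(F @ np.linalg.inv(J_hat) @ F.T) @ phi
print(f"W = {W.item():.3f}  (compare with the chi-square(1) 5% critical value 3.84)")
```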

The Lagrange Multiplier (LM) test considers the distance from zero of the estimated Lagrange Multipliers. Recall that
$$\frac{\tilde\lambda}{\sqrt{n}} \overset{d}{\to} N\left(0, \left[P(\theta_0)\right]^{-1}\right).$$
Consequently, the squared Mahalanobis distance is
$$\left(\frac{\tilde\lambda}{\sqrt{n}}\right)' P(\theta_0)\left(\frac{\tilde\lambda}{\sqrt{n}}\right) = \tilde\lambda'\,F(\theta_0)\left[n\bar J(\theta_0)\right]^{-1}F'(\theta_0)\,\tilde\lambda.$$


Again, the above quantity is not a statistic, as it is a function of the unknown parameters $\theta_0$. However, we can employ the restricted ML estimates of $\theta_0$ to evaluate the unknown quantities, i.e. $\tilde F = F(\tilde\theta)$ and $\tilde J = J(\tilde\theta)$. Hence we can prove the following:

Theorem 25 Under the usual regularity assumptions and the null hypothesis we have
$$LM = \tilde\lambda'\,\tilde F\,\tilde J^{-1}\tilde F'\,\tilde\lambda \overset{d}{\to} \chi^2_r.$$

Proof: Again we have that, for any consistent estimator of $\theta_0$ such as the restricted MLE $\tilde\theta$,
$$LM = \tilde\lambda'\,\tilde F\,\tilde J^{-1}\tilde F'\,\tilde\lambda = \left(\frac{\tilde\lambda}{\sqrt{n}}\right)' P(\theta_0)\left(\frac{\tilde\lambda}{\sqrt{n}}\right) + o_p(1),$$
and by the asymptotic distribution of the Lagrange Multipliers we get the result. ¥

Now, the restricted MLE satisfies the first order conditions of the Lagrangian, i.e.

$$s(\tilde\theta) + F'(\tilde\theta)\,\tilde\lambda = 0.$$
Consequently the LM test can be expressed as:
$$LM = \left[s(\tilde\theta)\right]'\tilde J^{-1}s(\tilde\theta).$$
Rao suggested finding the score vector and the information matrix of the unrestricted model and evaluating them at the restricted MLE. In this form the LM statistic is called the efficient score statistic, as it measures the distance from zero of the score vector evaluated at the restricted MLE.
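The efficient score form is easy to compute, since only the restricted model needs to be estimated. Below is a minimal sketch (my own illustration, not from the notes) for $H_0:\mu=0$ in an i.i.d. $N(\mu,\sigma^2)$ sample, where the score and the expected information of the unrestricted model are evaluated at the restricted MLE; the example data are an assumption of the sketch.

```python
# A minimal sketch: efficient score (LM) statistic for H0: mu = 0 in a normal sample.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=0.3, scale=2.0, size=500)
n = y.size

mu_0, sig2_tilde = 0.0, np.mean(y ** 2)          # restricted MLEs under H0: mu = 0

# Score of the unrestricted log-likelihood evaluated at (mu_0, sig2_tilde)
s = np.array([np.sum(y - mu_0) / sig2_tilde,
              -0.5 * n / sig2_tilde + 0.5 * np.sum((y - mu_0) ** 2) / sig2_tilde ** 2])

# Expected information of the normal model, also evaluated at the restricted MLE
J_tilde = np.diag([n / sig2_tilde, 0.5 * n / sig2_tilde ** 2])

LM = s @ np.linalg.inv(J_tilde) @ s
print(f"LM = {LM:.3f}  (compare with the chi-square(1) 5% critical value 3.84)")
```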

0.8.1 The Linear Regression

Let us consider the classical linear regression model:
$$y = X\beta + u, \qquad u\,|\,X \sim N\left(0, \sigma^2 I_n\right),$$


where $y$ is the $n\times 1$ vector of endogenous variables, $X$ is the $n\times k$ matrix of weakly exogenous explanatory variables, $\beta$ is the $k\times 1$ vector of mean parameters and $u$ is the $n\times 1$ vector of errors. Let us call the vector of parameters $\theta$, i.e. $\theta' = \left(\beta', \sigma^2\right)$, a $(k+1)\times 1$ vector. The log-likelihood function is:
$$\ell(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\left(\sigma^2\right) - \frac{1}{2}\frac{(y - X\beta)'(y - X\beta)}{\sigma^2}.$$
The first order conditions are:
$$\frac{\partial\ell(\theta)}{\partial\beta} = \frac{X'(y - X\beta)}{\sigma^2} = 0$$
and
$$\frac{\partial\ell(\theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2}\frac{(y - X\beta)'(y - X\beta)}{\sigma^4} = 0.$$
Solving the equations we get:
$$\hat\beta = \left(X'X\right)^{-1}X'y, \qquad \hat\sigma^2 = \frac{\hat u'\hat u}{n}, \qquad \hat u = y - X\hat\beta.$$

Notice that the MLE of $\beta$ is the same as the OLS estimator, something which is not true for the MLE of $\sigma^2$. The Hessian is
$$H(\theta) = \frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta'} = \begin{pmatrix} \dfrac{\partial^2\ell(\theta)}{\partial\beta\,\partial\beta'} & \dfrac{\partial^2\ell(\theta)}{\partial\beta\,\partial\sigma^2} \\[2ex] \dfrac{\partial^2\ell(\theta)}{\partial\sigma^2\,\partial\beta'} & \dfrac{\partial^2\ell(\theta)}{\partial\left(\sigma^2\right)^2} \end{pmatrix} = \begin{pmatrix} -\dfrac{1}{\sigma^2}X'X & -\dfrac{1}{\sigma^4}X'u \\[2ex] -\dfrac{1}{\sigma^4}u'X & \dfrac{n}{2\sigma^4} - \dfrac{u'u}{\sigma^6} \end{pmatrix}.$$
Hence the Information matrix is
$$J(\theta) = E\left[-H(\theta)\right] = \begin{pmatrix} \dfrac{1}{\sigma^2}X'X & 0 \\[1ex] 0 & \dfrac{n}{2\sigma^4} \end{pmatrix},$$
and the Cramer-Rao limit is
$$J^{-1}(\theta) = \begin{pmatrix} \sigma^2\left(X'X\right)^{-1} & 0 \\[1ex] 0 & \dfrac{2\sigma^4}{n} \end{pmatrix}.$$
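The sketch below (my own illustration, not from the notes) computes the MLEs and the block-diagonal information matrix above on simulated data; the design, the true coefficients and the seed are assumptions of the sketch.

```python
# A minimal sketch: ML estimation of the classical linear regression and its information matrix.
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # MLE of beta = OLS
u_hat = y - X @ beta_hat
sig2_hat = u_hat @ u_hat / n                     # MLE of sigma^2 (divisor n, not n - k)

# Estimated information matrix: block diagonal between beta and sigma^2
J_beta = X.T @ X / sig2_hat
J_sig2 = n / (2 * sig2_hat ** 2)

# Cramer-Rao limit for beta: sigma^2 (X'X)^{-1}
V_beta = sig2_hat * np.linalg.inv(X.T @ X)
print("beta_hat:", beta_hat.round(3))
print("std errors:", np.sqrt(np.diag(V_beta)).round(3))
```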


Notice that, under normality of the errors, the OLS estimator is asymptotically efficient. Let us now consider $r$ linear constraints on the parameter vector $\beta$, i.e.
$$\varphi(\beta) = Q\beta - q = 0, \qquad (4)$$
where $Q$ is the $r\times k$ matrix of the restrictions (with $r < k$) and $q$ a known vector. Let us now form the Lagrangian, i.e.
$$L = \ell(\theta) + \lambda'\varphi(\beta) = \ell(\theta) + \varphi'(\beta)\lambda = \ell(\theta) + (Q\beta - q)'\lambda,$$
where $\lambda$ is the vector of the $r$ Lagrange Multipliers. The first order conditions are:
$$\frac{\partial L}{\partial\beta} = \frac{\partial\ell(\theta)}{\partial\beta} + Q'\lambda = \frac{X'(y - X\beta)}{\sigma^2} + Q'\lambda = 0, \qquad (5)$$
$$\frac{\partial L}{\partial\sigma^2} = \frac{\partial\ell(\theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2}\frac{(y - X\beta)'(y - X\beta)}{\sigma^4} = 0 \qquad (6)$$
and
$$\frac{\partial L}{\partial\lambda} = Q\beta - q = 0. \qquad (7)$$

Now from (5) we have that $X'y = X'X\beta - \sigma^2 Q'\lambda$, and it follows that
$$Q\left(X'X\right)^{-1}X'y = Q\left(X'X\right)^{-1}X'X\beta - \sigma^2 Q\left(X'X\right)^{-1}Q'\lambda = Q\beta - \sigma^2 Q\left(X'X\right)^{-1}Q'\lambda.$$
Hence
$$Q\hat\beta = Q\beta - QVQ'\lambda, \qquad (8)$$
where
$$\hat\beta = \left(X'X\right)^{-1}X'y \qquad \text{and} \qquad V = \sigma^2\left(X'X\right)^{-1}.$$


Now from (7) we have that $Q\beta = q$. Hence we get
$$\lambda = -\left[QVQ'\right]^{-1}\left(Q\hat\beta - q\right). \qquad (9)$$
Substituting out $\lambda$ from (8), employing the above, and solving for $\beta$ we get:
$$\tilde\beta = \hat\beta - \left(X'X\right)^{-1}Q'\left[Q\left(X'X\right)^{-1}Q'\right]^{-1}\left(Q\hat\beta - q\right).$$
Solving (6) we get that
$$\tilde\sigma^2 = \frac{\tilde u'\tilde u}{n}, \qquad \tilde u = y - X\tilde\beta,$$
and from (9) we get:
$$\tilde\lambda = -\left[Q\tilde VQ'\right]^{-1}\left(Q\hat\beta - q\right), \qquad \tilde V = \tilde\sigma^2\left(X'X\right)^{-1}.$$
The above three formulae give the restricted MLEs.
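A minimal sketch of these formulae (my own illustration, not from the notes) is given below; the data, the restriction matrix $Q$ and the vector $q$ are assumptions of the sketch.

```python
# A minimal sketch: restricted MLEs of the linear regression under Q beta = q.
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

Q = np.array([[0.0, 1.0, 1.0]])      # one restriction: beta_1 + beta_2 = q
q = np.array([-1.5])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                                     # unrestricted MLE
A = np.linalg.inv(Q @ XtX_inv @ Q.T)
beta_tilde = beta_hat - XtX_inv @ Q.T @ A @ (Q @ beta_hat - q)   # restricted MLE
u_tilde = y - X @ beta_tilde
sig2_tilde = u_tilde @ u_tilde / n
lam_tilde = -np.linalg.inv(Q @ (sig2_tilde * XtX_inv) @ Q.T) @ (Q @ beta_hat - q)

print("Q beta_tilde =", (Q @ beta_tilde).round(6), " should equal q =", q)
```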

Now the Wald test for the linear restrictions in (4) is given by
$$W = \left(Q\hat\beta - q\right)'\left[Q\hat VQ'\right]^{-1}\left(Q\hat\beta - q\right), \qquad \hat V = \hat\sigma^2\left(X'X\right)^{-1}.$$
The restricted and unrestricted residuals are given by
$$\tilde u = y - X\tilde\beta \qquad \text{and} \qquad \hat u = y - X\hat\beta.$$
Hence
$$\tilde u = \hat u + X\left(\hat\beta - \tilde\beta\right)$$
and consequently, as $X'\hat u = 0$ by the first order conditions, we have that
$$\tilde u'\tilde u = \hat u'\hat u + \left(\hat\beta - \tilde\beta\right)'X'X\left(\hat\beta - \tilde\beta\right).$$
It follows that
$$\tilde u'\tilde u - \hat u'\hat u = \left(Q\hat\beta - q\right)'\left[Q\left(X'X\right)^{-1}Q'\right]^{-1}\left(Q\hat\beta - q\right).$$
Hence the Wald test is given by
$$W = n\,\frac{\tilde u'\tilde u - \hat u'\hat u}{\hat u'\hat u}.$$


The LR test is given by
$$LR = 2\left[\ell(\hat\theta) - \ell(\tilde\theta)\right] = n\ln\left(\frac{\tilde u'\tilde u}{\hat u'\hat u}\right)$$
and the LM test is
$$LM = n\,\frac{\tilde u'\tilde u - \hat u'\hat u}{\tilde u'\tilde u},$$
as
$$LM = \tilde\lambda'\,\tilde F\,\tilde J^{-1}\tilde F'\,\tilde\lambda = \left(Q\hat\beta - q\right)'\left[Q\tilde VQ'\right]^{-1}\left(Q\hat\beta - q\right).$$
We can now state a well known result.

Theorem 26 Under the classical assumptions of the Linear Regression Model we have that
$$W \ge LR \ge LM.$$
Proof: The three tests can be written as
$$W = n(r - 1), \qquad LR = n\ln(r), \qquad LM = n\left(1 - \frac{1}{r}\right),$$
where $r = \dfrac{\tilde u'\tilde u}{\hat u'\hat u} \ge 1$. Now we know that $\ln(x) \ge \dfrac{x - 1}{x}$, and the result follows by considering $x = r$ and $x = 1/r$.
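The sketch below (my own illustration, not from the notes) computes the three statistics from the restricted and unrestricted residual sums of squares and checks the ordering of Theorem 26; the data and the particular restriction are assumptions of the sketch.

```python
# A minimal sketch: W, LR and LM from residual sums of squares, with W >= LR >= LM.
import numpy as np

rng = np.random.default_rng(5)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.4, -0.6]) + rng.normal(size=n)

Q, q = np.array([[0.0, 1.0, 0.0]]), np.array([0.0])   # H0: beta_1 = 0

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
beta_tilde = beta_hat - XtX_inv @ Q.T @ np.linalg.solve(Q @ XtX_inv @ Q.T, Q @ beta_hat - q)

rss_u = np.sum((y - X @ beta_hat) ** 2)     # u_hat' u_hat
rss_r = np.sum((y - X @ beta_tilde) ** 2)   # u_tilde' u_tilde
r = rss_r / rss_u

W, LR, LM = n * (r - 1), n * np.log(r), n * (1 - 1 / r)
print(f"W = {W:.3f} >= LR = {LR:.3f} >= LM = {LM:.3f}")
```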

0.8.2 Autocorrelation

Apply the LM test to test the hypothesis that $\rho = 0$ in the following model:
$$y_t = x_t'\beta + u_t, \qquad u_t = \rho u_{t-1} + \varepsilon_t, \qquad \varepsilon_t \overset{i.i.d.}{\sim} N\left(0, \sigma^2\right).$$
Discuss the advantages of this LM test over the Wald and LR tests of this hypothesis.

First notice that from $u_t = \rho u_{t-1} + \varepsilon_t$ we get that
$$E(u_t) = \rho E(u_{t-1}) + E(\varepsilon_t) = \rho E(u_{t-1}),$$
as $E(\varepsilon_t) = 0$, and for $|\rho| < 1$ we get that
$$E(u_t) - \rho E(u_{t-1}) = 0 \;\Rightarrow\; E(u_t) = 0,$$


as $E(u_t) = E(u_{t-1})$, independent of $t$. Furthermore,
$$Var(u_t) = E\left(u_t^2\right) = \rho^2 E\left(u_{t-1}^2\right) + E\left(\varepsilon_t^2\right) + 2\rho E(u_{t-1}\varepsilon_t) = \rho^2 E\left(u_{t-1}^2\right) + \sigma^2,$$
where the first equality follows from the fact that $E(u_t) = 0$, and the last from the fact that
$$E(u_{t-1}\varepsilon_t) = E\left[u_{t-1}E\left(\varepsilon_t\,|\,I_{t-1}\right)\right] = E\left[u_{t-1}\cdot 0\right] = 0,$$
where $I_{t-1}$ is the information set at time $t-1$, i.e. the sigma-field generated by $\{\varepsilon_{t-1}, \varepsilon_{t-2}, \dots\}$. Hence
$$E\left(u_t^2\right) - \rho^2 E\left(u_{t-1}^2\right) = \sigma^2 \;\Rightarrow\; E\left(u_t^2\right) = \frac{\sigma^2}{1 - \rho^2},$$
as $E\left(u_t^2\right) = E\left(u_{t-1}^2\right)$, independent of $t$.

Substituting out $u_t$ we get
$$y_t = x_t'\beta + \rho u_{t-1} + \varepsilon_t,$$
and observing that $u_{t-1} = y_{t-1} - x_{t-1}'\beta$ we get
$$y_t = x_t'\beta + \rho\left(y_{t-1} - x_{t-1}'\beta\right) + \varepsilon_t \;\Rightarrow\; \varepsilon_t = y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho,$$
where by assumption the $\varepsilon_t$'s are i.i.d. Hence the log-likelihood function is
$$l(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\left(\sigma^2\right) - \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)^2}{2\sigma^2},$$
where we set $y_0 = 0$ and $x_0 = 0$, as we do not have any observations before $t = 1$. In any case, given that $|\rho| < 1$, the first observation will not affect the distribution of the LM test, which is based on asymptotic theory, i.e. $T \to \infty$.
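For concreteness, a minimal sketch of this conditional log-likelihood (my own illustration, not from the notes) is given below, with the presample values treated as fixed zeros; the simulated data, the true parameter values and the seed are assumptions of the sketch.

```python
# A minimal sketch: evaluating l(theta) for the regression with AR(1) errors.
import numpy as np

def ar1_loglik(beta, rho, sig2, y, X):
    """Conditional Gaussian log-likelihood of y_t = x_t'beta + u_t, u_t = rho*u_{t-1} + eps_t."""
    y_lag = np.concatenate(([0.0], y[:-1]))             # y_0 = 0
    X_lag = np.vstack([np.zeros(X.shape[1]), X[:-1]])   # x_0 = 0
    eps = y - X @ beta - rho * (y_lag - X_lag @ beta)
    T = y.size
    return -0.5 * T * np.log(2 * np.pi * sig2) - 0.5 * np.sum(eps ** 2) / sig2

# Example call with simulated data
rng = np.random.default_rng(6)
T = 300
X = np.column_stack([np.ones(T), rng.normal(size=T)])
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + rng.normal()
y = X @ np.array([1.0, 2.0]) + u
print(ar1_loglik(np.array([1.0, 2.0]), 0.5, 1.0, y, X))
```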

The first order conditions are:
$$\frac{\partial l}{\partial\beta} = \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)\left(x_t - x_{t-1}\rho\right)}{\sigma^2},$$
$$\frac{\partial l}{\partial\rho} = \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)\left(y_{t-1} - x_{t-1}'\beta\right)}{\sigma^2} = \sum_{t=1}^{T}\frac{\varepsilon_t u_{t-1}}{\sigma^2},$$


$$\frac{\partial l}{\partial\sigma^2} = -\frac{T}{2\sigma^2} + \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)^2}{2\sigma^4} = -\frac{T}{2\sigma^2} + \sum_{t=1}^{T}\frac{\varepsilon_t^2}{2\sigma^4}.$$
The second derivatives are:
$$\frac{\partial^2 l}{\partial\beta\,\partial\beta'} = -\sum_{t=1}^{T}\frac{\left(x_t - x_{t-1}\rho\right)\left(x_t' - x_{t-1}'\rho\right)}{\sigma^2},$$
$$\frac{\partial^2 l}{\partial\rho^2} = -\sum_{t=1}^{T}\frac{\left(y_{t-1} - x_{t-1}'\beta\right)^2}{\sigma^2} = -\sum_{t=1}^{T}\frac{u_{t-1}^2}{\sigma^2},$$
$$\frac{\partial^2 l}{\partial\left(\sigma^2\right)^2} = \frac{T}{2\sigma^4} - \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)^2}{\sigma^6} = \frac{T}{2\sigma^4} - \sum_{t=1}^{T}\frac{\varepsilon_t^2}{\sigma^6},$$
$$\frac{\partial^2 l}{\partial\beta\,\partial\rho} = -\sum_{t=1}^{T}\frac{\left(y_{t-1} - x_{t-1}'\beta\right)\left(x_t - x_{t-1}\rho\right) + \left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)x_{t-1}}{\sigma^2} = -\sum_{t=1}^{T}\frac{u_{t-1}\left(x_t - x_{t-1}\rho\right) + \varepsilon_t x_{t-1}}{\sigma^2},$$
$$\frac{\partial^2 l}{\partial\rho\,\partial\sigma^2} = -\sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)\left(y_{t-1} - x_{t-1}'\beta\right)}{\sigma^4} = -\sum_{t=1}^{T}\frac{\varepsilon_t u_{t-1}}{\sigma^4},$$
$$\frac{\partial^2 l}{\partial\beta\,\partial\sigma^2} = -\sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)\left(x_t - x_{t-1}\rho\right)}{\sigma^4} = -\sum_{t=1}^{T}\frac{\varepsilon_t\left(x_t - x_{t-1}\rho\right)}{\sigma^4}.$$
Notice now that the Information Matrix $J$ is
$$J(\theta) = -E\left[H(\theta)\right] = \begin{pmatrix} \displaystyle\sum_{t=1}^{T}\frac{\left(x_t - x_{t-1}\rho\right)\left(x_t' - x_{t-1}'\rho\right)}{\sigma^2} & 0 & 0 \\ 0 & \dfrac{T}{1 - \rho^2} & 0 \\ 0 & 0 & \dfrac{T}{2\sigma^4} \end{pmatrix},$$
as $E\left[\dfrac{\varepsilon_t u_{t-1}}{\sigma^4}\right] = 0$, $E\left[\dfrac{u_{t-1}\left(x_t' - x_{t-1}'\rho\right) + \varepsilon_t x_{t-1}'}{\sigma^2}\right] = 0$, $E\left[\dfrac{\varepsilon_t\left(x_t' - x_{t-1}'\rho\right)}{\sigma^4}\right] = 0$, $E\left[\dfrac{\varepsilon_t^2}{\sigma^6}\right] = \dfrac{1}{\sigma^4}$ and $E\left[\dfrac{u_{t-1}^2}{\sigma^2}\right] = \dfrac{1}{1-\rho^2}$, i.e. the matrix is block diagonal between $\beta$, $\rho$ and $\sigma^2$.

Consequently the LM test has the form
$$LM = s_\rho'\,J_{\rho\rho}^{-1}\,s_\rho = \frac{\left(s_\rho\right)^2}{J_{\rho\rho}},$$


as
$$s_\rho = \sum_{t=1}^{T}\frac{\varepsilon_t u_{t-1}}{\sigma^2}, \qquad J_{\rho\rho} = \frac{T}{1 - \rho^2}.$$
All these quantities are evaluated under the null.

Hence under $H_0: \rho = 0$ we have that
$$J_{\rho\rho} = T \qquad \text{and} \qquad u_t = \varepsilon_t,$$
i.e. there is no autocorrelation. Consequently, we can estimate $\beta$ by simple OLS, as OLS and ML result in the same estimators, and $\sigma^2$ by the ML estimator, i.e.
$$\tilde\beta = \left(\sum_{t=1}^{T}x_t x_t'\right)^{-1}\sum_{t=1}^{T}x_t y_t, \qquad \tilde\sigma^2 = \frac{\sum_{t=1}^{T}\tilde u_t^2}{T} = \frac{\tilde u'\tilde u}{T},$$
where $\tilde u_t = y_t - x_t'\tilde\beta = \tilde\varepsilon_t$ are the OLS residuals. Hence
$$LM = \frac{\left(\sum_{t=1}^{T}\tilde u_t\tilde u_{t-1}/\tilde\sigma^2\right)^2}{T} = T\left(\sum_{t=1}^{T}\tilde u_t\tilde u_{t-1}\right)^2\left(\sum_{t=1}^{T}\tilde u_t^2\right)^{-2}.$$
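The final expression is essentially $T$ times the squared first-order autocorrelation of the OLS residuals. A minimal sketch of its computation (my own illustration, not from the notes) follows; the simulated data and the true value of $\rho$ are assumptions of the sketch.

```python
# A minimal sketch: LM statistic for H0: rho = 0 from OLS residuals (presample residual set to 0).
import numpy as np

rng = np.random.default_rng(7)
T = 400
X = np.column_stack([np.ones(T), rng.normal(size=T)])
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.3 * u[t - 1] + rng.normal()          # data generated with rho = 0.3
y = X @ np.array([1.0, 2.0]) + u

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)      # OLS = restricted MLE of beta
res = y - X @ beta_ols
res_lag = np.concatenate(([0.0], res[:-1]))       # u_tilde_0 = 0

LM = T * np.sum(res * res_lag) ** 2 / np.sum(res ** 2) ** 2
print(f"LM = {LM:.3f}  (chi-square(1) 5% critical value is 3.84)")
```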

0.9 Time Series

0.9.1 Projections (Orthogonal)

Assume the usual linear regression setup, i.e.
$$y = X\beta + u, \qquad u\,|\,X \sim D\left(0, \sigma^2 I_n\right),$$

these mappings are examples of orthogonal projections. A projection is a mapping that takes each point of E n into a point in a subset of E n , while leaving all the points of the subset unchanged, where E n is the usual Euclidean vector space, i.e. the set of all vectors in Rn where the addition, the scalar multiplication and the inner product (hence the norm) are defined. Because of this invariance the subspace is called invariant subspace of the projection. An orthogonal projection maps any


point into the point of the subspace that is closest to it. If a point is already in the invariant subspace, it is mapped into itself. Algebraically, an orthogonal projection on to a given subspace can be performed by premultiplying the vector to be projected by a suitable projection matrix. In the case of OLS, the two projection matrices that yield the vector of fitted values and the vector of residuals, respectively, are
$$P_X = X\left(X'X\right)^{-1}X' \qquad \text{and} \qquad M_X = I_n - P_X = I_n - X\left(X'X\right)^{-1}X'.$$
To see this, notice that the fitted values are
$$\hat y = X\hat\beta = X\left(X'X\right)^{-1}X'y = P_X y.$$
Hence the projection matrix $P_X$ projects on to $S(X)$, i.e. the subspace of $E^n$ spanned by the columns of $X$. Notice that for any vector $\alpha \in \mathbb{R}^k$ the vector $X\alpha$ belongs to $S(X)$. As $X\alpha \in S(X)$, it should be the case, due to the invariance of $P_X$, that $P_X X\alpha = X\alpha$. Indeed,
$$P_X X = X\left(X'X\right)^{-1}X'X = XI_k = X.$$
It is clear that when $P_X$ is applied to $y$ it yields the vector of fitted values. On the other hand, the projection matrix $M_X$ yields the vector of residuals, as
$$M_X y = \left[I_n - X\left(X'X\right)^{-1}X'\right]y = y - P_X y = y - X\hat\beta = \hat u.$$

The image of $M_X$ is $S^{\perp}(X)$, the orthogonal complement of the image of $P_X$. To see this, consider any vector $w \in S^{\perp}(X)$. It must satisfy the condition $X'w = 0$, which implies that $P_X w = 0$, by the definition of $P_X$. Consequently, $\left(I_n - P_X\right)w = M_X w = w,$


so $S^{\perp}(X)$ must be contained in the image of $M_X$, i.e. $S^{\perp}(X) \subseteq \operatorname{Im}(M_X)$. Now consider any point in the image of $M_X$. It must take the form $M_X z$. But then $\left(M_X z\right)'X = z'M_X X = 0$, as $M_X$ is symmetric and $M_X X = 0$. Hence $M_X z$ belongs to $S^{\perp}(X)$ for any $z$. Consequently, $\operatorname{Im}(M_X) \subseteq S^{\perp}(X)$, and hence the image of $M_X$ coincides with $S^{\perp}(X)$.

For any matrix to represent a projection, it must be idempotent: if we project a vector on to the invariant subspace, say $S(X)$, and then project it again, the second projection should have no effect, i.e. $P_X P_X z = P_X z$ for any $z$. It is easy to verify that this is the case with $P_X$ and $M_X$, as
$$P_X P_X = P_X \qquad \text{and} \qquad M_X M_X = M_X.$$
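The sketch below (my own illustration, not from the notes) constructs $P_X$ and $M_X$ for a small random design matrix and verifies these properties numerically, together with the decomposition of $y$ into fitted values and residuals; the data and seed are assumptions of the sketch.

```python
# A minimal sketch: projection matrices P_X and M_X and their basic properties.
import numpy as np

rng = np.random.default_rng(8)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # both idempotent
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))   # P_X X = X, M_X X = 0
print(np.allclose(P @ y + M @ y, y))                  # fitted values + residuals = y
```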

By the definition of $M_X$ it is obvious that
$$M_X = I_n - X\left(X'X\right)^{-1}X' = I_n - P_X \;\Rightarrow\; M_X + P_X = I_n,$$
and consequently for any vector $z \in E^n$ we have $M_X z + P_X z = z$. The pair of projections $M_X$ and $P_X$ are called complementary projections, since the sum of $M_X z$ and $P_X z$ restores the original vector $z$.

Assume that we have the following linear regression model: $y = X\beta + \varepsilon$, where $y$ and $\varepsilon$ are $N\times 1$, $\beta$ is $k\times 1$, and $X$ is $N\times k$. For $k = 2$, and if the first variable is a constant, we have that
$$y_i = \beta_0 + x_i\beta_1 + \varepsilon_i \qquad \text{for } i = 1, 2, \dots, N.$$


Now
$$\begin{pmatrix}\hat\beta_0 \\ \hat\beta_1\end{pmatrix} = \left(X'X\right)^{-1}X'y = \begin{pmatrix} N & \sum x \\ \sum x & \sum x^2 \end{pmatrix}^{-1}\begin{pmatrix}\sum y \\ \sum xy\end{pmatrix} = \frac{1}{N\sum x^2 - \left(\sum x\right)^2}\begin{pmatrix}\sum x^2 & -\sum x \\ -\sum x & N\end{pmatrix}\begin{pmatrix}\sum y \\ \sum xy\end{pmatrix} = \begin{pmatrix}\dfrac{\sum y\sum x^2 - \sum x\sum xy}{N\sum x^2 - \left(\sum x\right)^2} \\[2ex] \dfrac{N\sum xy - \sum x\sum y}{N\sum x^2 - \left(\sum x\right)^2}\end{pmatrix}.$$
Notice however that
$$N\sum x^2 - \left(\sum x\right)^2 = N\left[\sum x^2 - N\bar x^2\right] = N\sum\left(x^2 - 2\bar x x + \bar x^2\right) = N\sum\left(x - \bar x\right)^2$$
and
$$N\sum xy - \sum x\sum y = N\sum\left(x - \bar x\right)\left(y - \bar y\right).$$
Hence
$$\begin{pmatrix}\hat\beta_0 \\ \hat\beta_1\end{pmatrix} = \begin{pmatrix}\bar y - \hat\beta_1\bar x \\[1ex] \dfrac{\sum\left(x - \bar x\right)\left(y - \bar y\right)}{\sum\left(x - \bar x\right)^2}\end{pmatrix}.$$
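A minimal check of these formulae (my own illustration, not from the notes) is given below, comparing the scalar expressions with the general matrix form $(X'X)^{-1}X'y$; the simulated data and seed are assumptions of the sketch.

```python
# A minimal sketch: two-variable OLS formulas versus the matrix expression.
import numpy as np

rng = np.random.default_rng(9)
N = 50
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# Matrix form: beta_hat = (X'X)^{-1} X'y
X = np.column_stack([np.ones(N), x])
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar form: slope from centred cross-products, intercept from the means
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(beta_matrix, (b0, b1))   # the two sets of estimates coincide
```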
