The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis

BRADLEY EFRON*

* Bradley Efron is professor, Department of Statistics, Stanford University, Stanford, Calif. 94305. The author is grateful to Gus Haggstrom of the RAND Corporation for helpful comments.

Journal of the American Statistical Association, December 1975, Volume 70, Number 352, Theory and Methods Section

A random vector $x$ arises from one of two multivariate normal distributions differing in mean but not covariance. A training set $x_1, x_2, \ldots, x_n$ of previous cases, along with their correct assignments, is known. These can be used to estimate Fisher's discriminant by maximum likelihood and then to assign $x$ on the basis of the estimated discriminant, a method known as the normal discrimination procedure. Logistic regression does the same thing but with the estimation of Fisher's discriminant done conditionally on the observed values of $x_1, x_2, \ldots, x_n$. This article computes the asymptotic relative efficiency of the two procedures. Typically, logistic regression is shown to be between one half and two thirds as effective as normal discrimination for statistically interesting values of the parameters.

1. INTRODUCTION AND SUMMARY

Suppose that a random vector $x$ can arise from one of two $p$-dimensional normal populations differing in mean but not in covariance,

$$x \sim \mathcal{N}_p(\mu_1, \Sigma) \ \text{with prob } \pi_1, \qquad x \sim \mathcal{N}_p(\mu_0, \Sigma) \ \text{with prob } \pi_0, \tag{1.1}$$

where $\pi_1 + \pi_0 = 1$.

If the parameters $\pi_1, \pi_0, \mu_1, \mu_0, \Sigma$ are known, then $x$ can be assigned to a population on the basis of Fisher's "linear discriminant function" [1],

$$\lambda(x) = \beta_0 + \beta' x, \qquad (\beta_0, \beta) \equiv \Bigl( \log \frac{\pi_1}{\pi_0} - \tfrac{1}{2} (\mu_1' \Sigma^{-1} \mu_1 - \mu_0' \Sigma^{-1} \mu_0),\ \Sigma^{-1} (\mu_1 - \mu_0) \Bigr). \tag{1.2}$$

The assignment is to population 1 if $\lambda(x) > 0$ and to population 0 if $\lambda(x) < 0$. This method of assignment minimizes the expected probability of misclassification, as is easily shown by applying Bayes' theorem. There is no loss of generality in assuming $\Sigma$ nonsingular as we have done, since singular cases can always be made nonsingular by an appropriate reduction of dimension.
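As a concrete illustration (not part of the original article), here is a minimal sketch of (1.2) in Python; the function names are hypothetical, and the population parameters are assumed known:

```python
import numpy as np

def fisher_discriminant(pi1, pi0, mu1, mu0, Sigma):
    """Coefficients (beta0, beta) of Fisher's linear discriminant (1.2)."""
    Sigma_inv = np.linalg.inv(Sigma)
    beta = Sigma_inv @ (mu1 - mu0)
    beta0 = (np.log(pi1 / pi0)
             - 0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0))
    return beta0, beta

def assign(x, beta0, beta):
    """Assign x to population 1 if lambda(x) > 0, else to population 0."""
    return 1 if beta0 + beta @ x > 0 else 0
```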

In usual practice, the parameters $\pi_1, \pi_0, \mu_1, \mu_0, \Sigma$ will be unknown to the statistician, but a training set $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$ will be available, where $y_j$ indicates which population $x_j$ comes from, so

$$y_j = 1 \ \text{with prob } \pi_1, \qquad y_j = 0 \ \text{with prob } \pi_0, \tag{1.3}$$

and, of course,

$$x_j \mid y_j \sim \mathcal{N}_p(\mu_{y_j}, \Sigma). \tag{1.4}$$

The $(y_j, x_j)$ are assumed independent of each other for $j = 1, 2, \ldots, n$. In this case, maximum likelihood estimates of the parameters are available,

$$\hat{\pi}_1 = n_1/n, \quad \hat{\mu}_1 = \sum_{y_j = 1} x_j / n_1, \qquad \hat{\pi}_0 = n_0/n, \quad \hat{\mu}_0 = \sum_{y_j = 0} x_j / n_0,$$

and

$$\hat{\Sigma} = \Bigl[ \sum_{y_j = 1} (x_j - \hat{\mu}_1)(x_j - \hat{\mu}_1)' + \sum_{y_j = 0} (x_j - \hat{\mu}_0)(x_j - \hat{\mu}_0)' \Bigr] \Big/ n, \tag{1.5}$$

where $n_1 \equiv \sum y_j$ and $n_0 \equiv n - n_1$ are the number of population 1 and population 0 cases observed, respectively. Substituting these into (1.2) gives a version of Anderson's [1] estimated linear discriminant function, say $\bar{\lambda}(x) = \bar{\beta}_0 + \bar{\beta}' x$, and an estimated discrimination procedure which assigns a new $x$ to population 1 or 0 as $\bar{\lambda}(x)$ is greater than or less than zero. This will be referred to as the "normal discrimination procedure."
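A minimal sketch of the normal discrimination procedure follows; it is illustrative only, with hypothetical names, and it uses the maximum likelihood divisor $n$ of (1.5) rather than an unbiased variant:

```python
import numpy as np

def normal_discrimination_fit(X, y):
    """Plug-in estimates (1.5), substituted into (1.2).

    X is an (n, p) array of feature vectors; y is an (n,) array of 0/1 labels.
    Returns (beta0_bar, beta_bar), the estimated discriminant coefficients.
    """
    n = len(y)
    X1, X0 = X[y == 1], X[y == 0]
    pi1_hat, pi0_hat = len(X1) / n, len(X0) / n
    mu1_hat, mu0_hat = X1.mean(axis=0), X0.mean(axis=0)
    # Pooled within-class scatter, divided by n as in (1.5)
    Sigma_hat = ((X1 - mu1_hat).T @ (X1 - mu1_hat)
                 + (X0 - mu0_hat).T @ (X0 - mu0_hat)) / n
    S_inv = np.linalg.inv(Sigma_hat)
    beta_bar = S_inv @ (mu1_hat - mu0_hat)
    beta0_bar = (np.log(pi1_hat / pi0_hat)
                 - 0.5 * (mu1_hat @ S_inv @ mu1_hat - mu0_hat @ S_inv @ mu0_hat))
    return beta0_bar, beta_bar
```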

Bayes' theorem shows that $\lambda(x)$, as given in (1.2), is actually the a posteriori log odds ratio for Population 1 versus Population 0 having observed $x$,

$$\lambda(x) = \log \frac{\pi_1(x)}{\pi_0(x)}, \qquad \pi_i(x) \equiv \operatorname{prob}\{y = i \mid x\}, \quad i = 1, 0. \tag{1.6}$$

To simplify notation we will also write

$$\pi_{ij} \equiv \pi_i(x_j) \quad \text{and} \quad \lambda \equiv \log(\pi_1/\pi_0). \tag{1.7}$$

Given the values $x_1, x_2, \ldots, x_n$, the $y_j$ are conditionally independent binary random variables,

$$\operatorname{prob}\{y_j = 1 \mid x_j\} = \pi_{1j} = \frac{\exp(\beta_0 + \beta' x_j)}{1 + \exp(\beta_0 + \beta' x_j)}, \qquad \operatorname{prob}\{y_j = 0 \mid x_j\} = \pi_{0j} = \frac{1}{1 + \exp(\beta_0 + \beta' x_j)}. \tag{1.8}$$
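Equations (1.6) and (1.8) say that the two-normal model forces the conditional law of $y$ given $x$ into exactly the logistic form, with Fisher's $\lambda(x)$ as the log odds. A quick numerical check of this fact, an illustration under assumed toy parameters rather than the article's code:

```python
import numpy as np
from scipy.stats import multivariate_normal

p = 3
mu1, mu0 = np.ones(p), np.zeros(p)
Sigma = np.eye(p) + 0.3 * np.ones((p, p))
pi1, pi0 = 0.3, 0.7

# Fisher's discriminant (1.2)
S_inv = np.linalg.inv(Sigma)
beta = S_inv @ (mu1 - mu0)
beta0 = np.log(pi1 / pi0) - 0.5 * (mu1 @ S_inv @ mu1 - mu0 @ S_inv @ mu0)

x = np.random.default_rng(0).normal(size=p)
# Posterior prob{y = 1 | x} computed directly from Bayes' theorem
f1 = multivariate_normal.pdf(x, mu1, Sigma)
f0 = multivariate_normal.pdf(x, mu0, Sigma)
posterior = pi1 * f1 / (pi1 * f1 + pi0 * f0)
# Logistic form (1.8) evaluated at lambda(x) = beta0 + beta'x
logistic = 1.0 / (1.0 + np.exp(-(beta0 + beta @ x)))
assert np.isclose(posterior, logistic)  # (1.6): lambda(x) is the log odds
```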


To estimate $(\beta_0, \beta)$, we can maximize the conditional likelihood

$$f(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n) = \prod_{j=1}^{n} \pi_{1j}^{y_j} \pi_{0j}^{1 - y_j} = \prod_{j=1}^{n} \frac{\exp[(\beta_0 + \beta' x_j) y_j]}{1 + \exp(\beta_0 + \beta' x_j)} \tag{1.9}$$

with respect to $(\beta_0, \beta)$. The maximizing values, call them $(\hat{\beta}_0, \hat{\beta})$, give $\hat{\lambda}(x) = \hat{\beta}_0 + \hat{\beta}' x$ as an estimate of the linear discriminant function. The discrimination procedure which chooses Population 1 or 0 as $\hat{\lambda}(x)$ is greater than or less than zero will be referred to as the "logistic regression procedure." An excellent discussion of such procedures is given in Cox's monograph [2].
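Maximizing (1.9) has no closed form, but Newton's method (equivalently, iteratively reweighted least squares) converges quickly. The following sketch is illustrative only and assumes the training data are not linearly separable (otherwise the maximizer of (1.9) does not exist):

```python
import numpy as np

def logistic_fit(X, y, n_iter=25):
    """Maximize the conditional likelihood (1.9) by Newton's method.

    Returns (beta0_hat, beta_hat), defining lambda_hat(x) = beta0_hat + beta_hat'x.
    """
    n, p = X.shape
    Z = np.hstack([np.ones((n, 1)), X])   # column of ones for the intercept
    theta = np.zeros(p + 1)               # (beta0, beta), start at zero
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(Z @ theta)))        # pi_{1j} of (1.8)
        grad = Z.T @ (y - prob)                          # score of log (1.9)
        info = Z.T @ (Z * (prob * (1 - prob))[:, None])  # Fisher information
        theta = theta + np.linalg.solve(info, grad)
    return theta[0], theta[1:]
```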

The logistic regression procedure must be less efficient than the normal discrimination procedure under model (1.1), at least asymptotically as $n$ goes to infinity, since the latter is based on the full maximum likelihood estimator of $\lambda(x)$. This article calculates the asymptotic relative efficiencies (ARE) of the two procedures. The central result is that, under a variety of situations and measures of efficiency, the ARE is given by

$$\mathrm{ARE} = e^{-\Delta^2/8} \int_{-\infty}^{\infty} \frac{1 + \Delta^2 \pi_1 \pi_0}{\pi_1 e^{-\Delta x/2} + \pi_0 e^{\Delta x/2}} \, \frac{e^{-x^2/2}}{\sqrt{2\pi}} \, dx, \tag{1.10}$$

where

$$\Delta \equiv [(\mu_1 - \mu_0)' \Sigma^{-1} (\mu_1 - \mu_0)]^{1/2}, \tag{1.11}$$

the square root of the Mahalanobis distance. Following is a small tabulation of (1.10) for reasonable values of $\Delta$, with $\pi_1 = \pi_0 = \tfrac{1}{2}$ (the case most favorable to the logistic regression procedure):

$$\begin{array}{c|ccccccccc}
\Delta & 0 & .5 & 1 & 1.5 & 2 & 2.5 & 3 & 3.5 & 4 \\
\mathrm{ARE} & 1.000 & 1.000 & .995 & .968 & .899 & .786 & .641 & .486 & .343
\end{array} \tag{1.12}$$
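The tabulation can be reproduced by numerical quadrature of (1.10). The sketch below is an illustration of that computation (scipy's quad is one of several reasonable choices); it returns, for example, $\mathrm{ARE} \approx 0.899$ at $\Delta = 2$:

```python
import numpy as np
from scipy.integrate import quad

def are(delta, pi1=0.5):
    """Asymptotic relative efficiency (1.10) of logistic regression."""
    pi0 = 1.0 - pi1
    def integrand(x):
        numer = (1.0 + delta**2 * pi1 * pi0) * np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
        denom = pi1 * np.exp(-delta * x / 2) + pi0 * np.exp(delta * x / 2)
        return numer / denom
    integral, _ = quad(integrand, -np.inf, np.inf)
    return np.exp(-delta**2 / 8) * integral

for d in (0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4):
    print(d, round(are(d), 3))   # reproduces the tabulation (1.12)
```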

Why use logistic regression at all if it is less efficient (and also more difficult to calculate)? Because it is more robust, at least theoretically, than normal discrimination. The conditional likelihood (1.9) is valid under general exponential family assumptions on the density $f(x)$ of $x$,

$$f(x) = g(\theta_1, \eta)\, h(x, \eta) \exp(\theta_1' x) \ \text{with prob } \pi_1,$$
$$f(x) = g(\theta_0, \eta)\, h(x, \eta) \exp(\theta_0' x) \ \text{with prob } \pi_0, \tag{1.13}$$

where $\pi_1 + \pi_0 = 1$. Here, $\eta$ is an arbitrary nuisance parameter, like $\Sigma$ in (1.1). Equation (1.13) includes (1.1) as a special case.

Unfortunately, (1.12) shows that the statistician pays a moderately steep price for this added generality, assuming, of course, that (1.1) is actually correct. Just when good discrimination becomes possible, for $\Delta$ between 2.5 and 3.5, the ARE of the logistic procedure falls off sharply. The question of how to choose or compromise between the two procedures seems important, but no results are available at this time. Another important unanswered question is the relative efficiency under some model other than (1.1), when we are not playing ball on normal discrimination's home court.

In many situations, the sampling probabilities $\pi_1, \pi_0$ in (1.1) may be systematically distorted from those acting in the population of interest. For example, if Population 1 is murder victims and Population 0 is all other people, a study conducted in a morgue would have $\pi_1$ much larger than in the whole population. Quite often $n_1$ and $n_0$ are set by the experimenter and are not random variables at all. These cases are discussed briefly in Section 5.

Technical details relating to asymptotic normality and consistency are omitted throughout the article. These gaps can be filled in by the application of standard exponential family theory, as presented, say, in [5], to (1.1). For another comparison of normal discrimination and logistic regression, the reader is referred to [4]. In that article, and also in [3], the distributions of $x$ are allowed to have discrete components.

2. EXPECTED ERROR RATE

By means of a linear transformation $\tilde{x} = a + Ax$, we can always reduce (1.1) to the case

$$x \sim \mathcal{N}_p((\Delta/2) e_1, I) \ \text{with prob } \pi_1, \qquad x \sim \mathcal{N}_p(-(\Delta/2) e_1, I) \ \text{with prob } \pi_0, \tag{2.1}$$

where $\pi_1 + \pi_0 = 1$, $e_1' \equiv (1, 0, 0, \ldots, 0)$, $I$ is the $p \times p$ identity matrix, and $\Delta = [(\mu_1 - \mu_0)' \Sigma^{-1} (\mu_1 - \mu_0)]^{1/2}$ as before. The boundary $B \equiv \{x : \lambda(x) = 0\}$ between Fisher's optimum decision regions for the two populations transforms to the new optimum boundary in the obvious way,

$$\tilde{B} \equiv \{\tilde{x} : \tilde{\lambda}(\tilde{x}) = 0\} = \{\tilde{x} : \tilde{x} = a + Ax,\ x \in B\}. \tag{2.2}$$

Moreover, if $x_1, x_2, \ldots, x_n$ is an iid sample from (1.1), and $\tilde{x}_i = a + A x_i$, $i = 1, 2, \ldots, n$, is the transformed sample, then both estimated boundaries, $\{x : \bar{\lambda}(x) = 0\}$ and $\{x : \hat{\lambda}(x) = 0\}$, also transform as in (2.2). In words, then, for both logistic regression and normal discrimination, the estimated discrimination procedure based on the transformed data is the transform of that based on the original data. All of these statements are easy to verify.

Suppose we have regions $R_0$ and $R_1$, a partition of the $p$-dimensional space $E^p$, and we decide for population 0 or population 1 as $x$ falls into $R_0$ or $R_1$, respectively. The error rate of such a partition is the probability of misclassification under assumptions (1.1),

$$\text{Error Rate} = \pi_1 \operatorname{prob}\{x \in R_0 \mid x \sim \mathcal{N}_p(\mu_1, \Sigma)\} + \pi_0 \operatorname{prob}\{x \in R_1 \mid x \sim \mathcal{N}_p(\mu_0, \Sigma)\}. \tag{2.3}$$

When the partition is chosen randomly, as it is by the logistic regression and normal discrimination procedures, error rate is a random variable. For either procedure, it follows from the preceding that error rate will have the same distribution under (1.1) and (2.1).
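To make (2.3) concrete, a small simulation (illustrative, with assumed values $\Delta = 2$ and $\pi_1 = 0.3$) can compare the Monte Carlo error rate of Fisher's rule in the standard situation (2.1) with the exact normal-theory value; the threshold $\tau = -\lambda/\Delta$ anticipates (2.5):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, delta, pi1 = 4, 2.0, 0.3
pi0 = 1.0 - pi1
n = 200_000

# Draw labels and features under the standard situation (2.1)
y = rng.random(n) < pi1
X = rng.normal(size=(n, p))
X[:, 0] += np.where(y, delta / 2, -delta / 2)

# Fisher's rule: decide population 1 iff lambda + delta * x1 > 0,
# i.e., iff x1 exceeds the threshold tau = -lambda/delta
lam = np.log(pi1 / pi0)
tau = -lam / delta
mc_error = np.mean((X[:, 0] > tau) != y)

# Exact error rate (2.3): each class errs on the far side of the threshold
exact = pi1 * norm.cdf(tau - delta / 2) + pi0 * norm.cdf(-tau - delta / 2)
print(mc_error, exact)   # agree to Monte Carlo accuracy
```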


Henceforth, we will work with the simpler assumptions (2.1), calling this the "standard situation" (with the basic random variable referred to as "$x$" rather than "$\tilde{x}$" for convenience). For the standard situation, Fisher's linear discriminant function (1.2) becomes

$$\lambda(x) = \lambda + \Delta x_1. \tag{2.4}$$

The boundary $\lambda(x) = 0$ is the $(p-1)$-dimensional plane orthogonal to the $x_1$ axis and intersecting it at the value

$$\tau \equiv -\lambda/\Delta. \tag{2.5}$$

In the figure, the optimal boundary is labeled $B(0, 0)$. The figure also shows another boundary, labeled $B(d\tau, d\alpha)$, intersecting the $x_1$ axis at $\tau + d\tau$, with normal vector at an angle $d\alpha$ from the $x_1$ axis. The differential notation $d\tau$ and $d\alpha$ indicates small discrepancies from optimal, which will be the case in the large sample theory. The error rate (2.3) of the regions separated by $B(d\tau, d\alpha)$ will be denoted by $\mathrm{ER}(d\tau, d\alpha)$. Letting

$$D_1 \equiv (\Delta/2) - \tau, \qquad D_0 \equiv (\Delta/2) + \tau, \tag{2.6}$$

we see that the error rate of the optimal boundary $B(0, 0)$ is

$$\mathrm{ER}(0, 0) = \pi_1 \Phi(-D_1) + \pi_0 \Phi(-D_0), \tag{2.7}$$

where $\Phi(z) \equiv \int_{-\infty}^{z} \varphi(t)\, dt$ and $\varphi(t) \equiv (2\pi)^{-1/2} \exp(-t^2/2)$, as usual. (We are tacitly assuming that the two regions divided by $B(d\tau, d\alpha)$ are assigned to populations 1 and 0, respectively, in the best way.)

Now, define

$$d_1 \equiv (D_1 - d\tau) \cos(d\alpha), \qquad d_0 \equiv (D_0 + d\tau) \cos(d\alpha), \tag{2.8}$$

the distances from $\mu_1$ and $\mu_0$ to $B(d\tau, d\alpha)$. Then,

$$\mathrm{ER}(d\tau, d\alpha) = \pi_1 \Phi(-d_1) + \pi_0 \Phi(-d_0). \tag{2.9}$$

From the Taylor expansions,

$$\cos(d\alpha) = 1 - (d\alpha)^2/2 + \cdots$$

and

$$\Phi(-D + d\tau) = \Phi(-D) + \varphi(D)\, d\tau + D \varphi(D) (d\tau)^2/2 + \cdots,$$

we get the following lemma.

Lemma 1: Ignoring differential terms of third and higher orders,

$$\mathrm{ER}(d\tau, d\alpha) = \mathrm{ER}(0, 0) + (\Delta/2)\, \pi_1 \varphi(D_1) [(d\tau)^2 + (d\alpha)^2]. \tag{2.10}$$

Equation (2.10) makes use of the fact that, by Bayes' theorem, $\pi_1 \varphi(D_1) / \pi_0 \varphi(D_0) = 1$, or equivalently,

$$\pi_1 \varphi(D_1) = \pi_0 \varphi(D_0). \tag{2.11}$$
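Lemma 1 can be checked numerically: evaluate the exact error rate (2.9) at a small perturbation $(d\tau, d\alpha)$ and compare with the quadratic approximation (2.10). A sketch under the same assumed toy values as above:

```python
import numpy as np
from scipy.stats import norm

delta, pi1 = 2.0, 0.3
pi0 = 1.0 - pi1
tau = -np.log(pi1 / pi0) / delta
D1, D0 = delta / 2 - tau, delta / 2 + tau   # as in (2.6)

def er(dtau, dalpha):
    """Exact error rate (2.9) of the perturbed boundary B(dtau, dalpha)."""
    d1 = (D1 - dtau) * np.cos(dalpha)
    d0 = (D0 + dtau) * np.cos(dalpha)
    return pi1 * norm.cdf(-d1) + pi0 * norm.cdf(-d0)

dtau, dalpha = 0.01, 0.02
excess = er(dtau, dalpha) - er(0.0, 0.0)
lemma1 = (delta / 2) * pi1 * norm.pdf(D1) * (dtau**2 + dalpha**2)  # (2.10)
print(excess, lemma1)   # agree up to third-order differential terms
# Side check of (2.11): pi1 * phi(D1) == pi0 * phi(D0)
assert np.isclose(pi1 * norm.pdf(D1), pi0 * norm.pdf(D0))
```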

Suppose now that the boundary $B(d\tau, d\alpha)$ is given by those $x$ satisfying

$$(\lambda + d\beta_0) + (\Delta e_1 + d\beta)' x = 0, \tag{2.12}$$

with $d\beta_0$ and $d\beta = (d\beta_1, d\beta_2, \ldots, d\beta_p)'$ indicating small discrepancies from the optimal linear function (2.4). Again, ignoring higher-order terms, we have

$$d\tau = \frac{1}{\Delta} \Bigl( -d\beta_0 + \frac{\lambda}{\Delta}\, d\beta_1 \Bigr), \qquad (d\tau)^2 = \frac{1}{\Delta^2} \Bigl( d\beta_0 - \frac{\lambda}{\Delta}\, d\beta_1 \Bigr)^2, \tag{2.13}$$

and

$$d\alpha = \arctan \Bigl[ \bigl( (d\beta_2)^2 + \cdots + (d\beta_p)^2 \bigr)^{1/2} \big/ (\Delta + d\beta_1) \Bigr]$$

gives

$$(d\alpha)^2 = \bigl( (d\beta_2)^2 + (d\beta_3)^2 + \cdots + (d\beta_p)^2 \bigr) / \Delta^2. \tag{2.14}$$
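The differential identities (2.13) and (2.14) can likewise be confirmed against the raw geometry of the perturbed hyperplane (2.12); the sketch below uses assumed small perturbations and hypothetical names:

```python
import numpy as np

delta, pi1, p = 2.0, 0.3, 4
lam = np.log(pi1 / (1.0 - pi1))
tau = -lam / delta
rng = np.random.default_rng(2)
dbeta0 = 0.01
dbeta = 0.01 * rng.normal(size=p)   # (dbeta_1, ..., dbeta_p)

# Differential formulas (2.13) and (2.14)
dtau = (-dbeta0 + (lam / delta) * dbeta[0]) / delta
dalpha_sq = np.sum(dbeta[1:] ** 2) / delta**2

# Direct geometry of the boundary (lam + dbeta0) + (delta*e1 + dbeta)'x = 0
x1_crossing = -(lam + dbeta0) / (delta + dbeta[0])   # this is tau + dtau
dalpha = np.arctan(np.linalg.norm(dbeta[1:]) / (delta + dbeta[0]))
print(x1_crossing - tau, dtau)      # agree to first order
print(dalpha**2, dalpha_sq)         # agree to leading order
```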

Finally, suppose that under some method of estimation, the $(p+1)$ vector of errors $(d\beta_0, d\beta)$ has a limiting normal distribution with mean vector $0$ and covariance matrix $\Sigma_\beta / n$,

$$(d\beta_0, d\beta')' \ \dot{\sim}\ \mathcal{N}_{p+1}(0, \Sigma_\beta / n).$$

[Figure: Optimum boundary $B(0, 0)$, where $\lambda(x) = 0$, in the standard situation, together with a non-optimum boundary $B(d\tau, d\alpha)$; the boundaries cross the $x_1$ axis at $\tau$ and $\tau + d\tau$, respectively.]