Advanced Measurement - Logistic regression for DIF detection

Lian Hortensius

March, 2012

Contents

1 Abstract
2 Literature review
  2.1 What is DIF?
  2.2 Logistic regression as a way to detect DIF
  2.3 Other methods of detecting DIF
  2.4 Comparing Logistic regression to other methods
3 The R code to run logistic regression and generate data
  3.1 Logistic regression - what the program does
  3.2 Logistic regression - instructions for using the function
  3.3 Logistic regression - an example
  3.4 Generating data - what the program does
  3.5 Generating data - instructions for using the function
  3.6 Generating data - an example
  3.7 Adding DIF - what the program does
  3.8 Adding DIF - instructions for using the function
  3.9 Adding DIF - an example
4 Sampling
  4.1 Experimental setup
  4.2 Example of the code
  4.3 Results
5 Conclusion

1 Abstract

This paper starts with a short literature review of the use of logistic regression for detecting differential item functioning (DIF) in psychometrics. I wrote an R program that uses logistic regression to test for DIF in dichotomous items, and a program that uses IRT to generate dichotomous item responses (with or without DIF). This paper explains what these programs do and how to use them, and gives examples. Using these programs I simulated several sets of data to test how effectively the DIF detection function detects DIF, varying the type of DIF, the presence or absence of impact, and the sample size. The program worked as expected.

2 Literature review

2.1 What is DIF?

Differential Item Functioning (DIF) occurs when an item on a test functions differently for different groups, given the ability level. Usually the groups are called the reference group and the focal group, and DIF means that the item has different characteristics for the different groups.

Two types of DIF are usually distinguished: uniform DIF and non-uniform DIF (Mellenbergh, 1982). Uniform DIF occurs when, at every ability level, the test item is easier for one group than it is for the other. In an IRT framework this means that the item characteristic curve for one group lies further to the left (the b/location parameter is lower) than for the other group. Non-uniform DIF can be split into two types (crossing and non-crossing) and occurs when there is an interaction between group membership and ability level. In crossing non-uniform DIF, at one end of the ability spectrum the item is easier for members of one group, whereas at the other end it is easier for members of the other group. In IRT terms this means that the b-parameter is the same (or very similar) while the a/discrimination parameter differs, which causes the item characteristic curves to cross in the middle. In non-crossing non-uniform DIF, the item is of similar difficulty for both groups at one end of the ability spectrum, but of different difficulty at the other end. In an IRT framework this means that both the a-parameter and the b-parameter differ. Although uniform DIF is in general the most common type of DIF, previous applied research has found non-uniform DIF in operational tests as well (e.g. Hambleton and Rogers, 1989). Therefore testing only for uniform DIF is insufficient.

One issue in the detection of DIF is the presence of impact. When the focal group and reference group differ in their underlying ability distribution, i.e. when one group has a higher average ability than the other group, this is called impact. The presence of impact can make the detection of DIF more difficult (e.g. Güler and Penfield, 2009). Regardless of the type of DIF, the issue is that the item does not function the same for members of different groups, which can make a test unfair if the item is treated as functioning the same in both groups.
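To make these definitions concrete, the following small R snippet (with made-up parameter values chosen purely for illustration, not taken from any real test) computes 2PL item characteristic curves for a reference and a focal group under uniform DIF and under crossing non-uniform DIF:

# Illustrative 2PL item characteristic curves (hypothetical parameter values)
theta <- seq(-3, 3, by = 0.5)                      # grid of ability levels
icc <- function(theta, a, b, D = 1.7) plogis(D * a * (theta - b))

# Uniform DIF: same discrimination, but the item is harder for the focal group
p_ref_uniform   <- icc(theta, a = 1.0, b = 0.0)
p_focal_uniform <- icc(theta, a = 1.0, b = 0.5)    # curve shifted to the right

# Crossing non-uniform DIF: same location, different discrimination,
# so the two curves cross near theta = b
p_ref_crossing   <- icc(theta, a = 0.8, b = 0.0)
p_focal_crossing <- icc(theta, a = 1.5, b = 0.0)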

2.2 Logistic regression as a way to detect DIF

Logistic regression as a test of DIF was proposed as an alternative to the Mantel-Haenszel statistic in 1990 by Swaminathan and Rogers [9]. Logistic regression is a generalized linear model for the probability of giving a correct answer to a dichotomous item given a score and group membership, P(Y = 1 | X, G). Group membership is usually defined in terms of a focal group (G = 1) and a reference group (G = 0). X represents the ability score of the examinee, usually the total test score.

P(Y = 1 | X, G) = exp(β0 + β1X + β2G + β3XG) / (1 + exp(β0 + β1X + β2G + β3XG))

In order to determine the presence of DIF, we want to know whether β2 and β3 are significantly different from 0. β2 will be different from 0 when examinees in one group score higher on the item than examinees in the other group, conditional on ability level (uniform DIF). β3 will be different from 0 when there is an interaction effect between group membership and total test score (non-uniform DIF). In order to test the significance of these β's, different models are specified that include neither β2 nor β3, one of them, or both, and then the model fit of these models is compared. Originally Swaminathan and Rogers proposed a combined test of both β2 and β3, but in recent years performing two separate tests has become more popular, since this way the type of DIF (uniform vs. non-uniform) can be identified (e.g. Güler and Penfield, 2009).

The usual way of testing for model significance involves calculating

χ²₃₂ = 2 ln[L(Model 3) / L(Model 2)]

which is equal to

χ²₃₂ = 2[ln L(Model 3) − ln L(Model 2)]

This test statistic χ²₃₂ is compared to values from a χ² distribution, in order to determine whether the difference in model fit is larger than chance would suggest (i.e. is significant). Model 1 uses only β0 and β1. Model 2 also includes β2, for a main group effect. Finally, Model 3 also includes β3, for an interaction effect. When comparing Model 2 to Model 1, or Model 3 to Model 2, this χ² has 1 degree of freedom; when comparing Model 3 to Model 1, it has 2 degrees of freedom.
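As an illustration of this model comparison, the three nested models can be fit with R's glm() function and compared with likelihood-ratio tests. This is a minimal sketch, not the difdetection program described later in this paper; the data below are simulated placeholders only so the code runs.

# Simulated placeholder data: item response, total test score, group membership
set.seed(1)
n     <- 200
group <- rep(0:1, each = n / 2)                 # 0 = reference, 1 = focal
score <- rbinom(n, size = 40, prob = 0.5)       # stand-in for total test score
resp  <- rbinom(n, size = 1, prob = plogis(0.3 * (score - 20)))

# Model 1: ability only; Model 2: + group; Model 3: + group-by-ability interaction
m1 <- glm(resp ~ score,                       family = binomial)
m2 <- glm(resp ~ score + group,               family = binomial)
m3 <- glm(resp ~ score + group + score:group, family = binomial)

# Likelihood-ratio statistics, each compared to a chi-square distribution
chi_uniform    <- as.numeric(2 * (logLik(m2) - logLik(m1)))  # 1 df: uniform DIF
chi_nonuniform <- as.numeric(2 * (logLik(m3) - logLik(m2)))  # 1 df: non-uniform DIF
pchisq(chi_uniform,    df = 1, lower.tail = FALSE)
pchisq(chi_nonuniform, df = 1, lower.tail = FALSE)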

2.3 Other methods of detecting DIF

Apart from logistic regression there are a few other popular methods for detecting DIF, with the Mantel-Haenszel (MH) procedure being by far the most popular. The MH procedure is simple to compute and implement, and allows for significance testing (Rogers and Swaminathan, 1993). However, the MH procedure is designed to detect uniform DIF and may not be good at detecting non-uniform DIF, particularly crossing non-uniform DIF. The MH test statistic is computed by comparing the observed frequencies of correct and incorrect answers, split out by group membership and ability level, to the expected frequencies if there were no DIF. If the difference is significant, the conclusion is that there is DIF. In IRT terms, the MH statistic reflects the size of the area between the two Item Response Functions (IRFs), but in the case of crossing non-uniform DIF the MH procedure can incorrectly conclude there is no difference, because the areas to the left and right of the crossing point cancel each other out.


Another procedure that has received some attention in the past is the Breslow-Day (BD) procedure for DIF detection, proposed by Breslow and Day in 1980. It compares the odds ratios of a correct response for the two groups across ability-level strata; when the odds ratios differ across strata, this is an indicator of non-uniform DIF. The BD procedure can be combined with the MH procedure in what is called the Combined Decision Rule (CDR), because the MH procedure is good at detecting uniform DIF and the BD procedure is good at detecting non-uniform DIF (Penfield, 2003; Güler and Penfield, 2009). Both procedures are carried out with a correction for multiple testing, and if either is significant, DIF is said to be present.

2.4 Comparing Logistic regression to other methods

The original study proposing logistic regression (Swaminathan and Rogers, 1990) found that logistic regression is as powerful as the MH procedure at detecting uniform DIF, and more powerful at detecting non-uniform DIF, particularly crossing non-uniform DIF. However, there was also a larger number of false positives when using logistic regression. These results have been replicated in several studies (e.g. Rogers and Swaminathan, 1993; Narayanan and Swaminathan, 1996). In addition, logistic regression has been found to be more powerful than an IRT-based analysis of variance method at detecting (non-uniform) DIF (Whitmore and Schumacker, 1999). Compared to the Combined Decision Rule, however, logistic regression was found to have lower power across all types of DIF and a higher Type I error rate, particularly when impact is present (Güler and Penfield, 2009). In general, logistic regression has advantages over the standard MH procedure, but it may have a tendency to detect DIF where there is none.

In this project I wrote an R program to test for DIF using logistic regression. In addition, I wrote an IRT-based program to generate dichotomous item responses for both a reference and a focal group, with or without DIF. I will combine these programs to simulate data, varying the presence or absence of impact and using different sample sizes and different types of DIF (uniform vs. non-uniform), and then test the effectiveness of logistic regression by running my R program on these data.


3 The R code to run logistic regression and generate data

3.1 Logistic regression - what the program does

In order to calculate the test statistics for logistic regression, the R program runs a regression using the logistic model and then compares the model fit of model 1 (including just β0 and β1), model 2 (additionally including β2, i.e. a group effect) and model 3 (including β0 through β3, i.e. a group effect and an interaction effect). The model fit is compared using the calculation mentioned earlier, which follows a χ² distribution, and therefore the test statistics are compared to critical values from χ² distributions with 1 and 2 df. A test statistic larger than the critical value when comparing model 2 to model 1 indicates uniform DIF, and a test statistic larger than the critical value when comparing model 3 to model 2 indicates non-uniform DIF.

3.2 Logistic regression - instructions for using the function

The function is called difdetection and takes three arguments: Data, itemnr and alpha. Data is the data matrix of raw item responses, where the rows correspond to examinees and the columns correspond to items, except for the first column which corresponds to group membership (represented by 0s and 1s). Here is an example of such a data matrix:

> datamatrix
   group item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
1      0     1     0     1     0     1     1     1     0     0      1
2      0     1     1     1     0     1     1     1     0     0      1
3      0     1     0     1     0     0     1     0     1     0      1
4      0     1     0     1     0     1     1     1     0     1      1
5      0     0     0     0     0     1     0     0     0     0      0
6      0     0     0     0     0     1     0     0     0     0      0
7      0     0     0     0     0     0     1     0     0     0      1
8      0     1     0     1     1     1     1     0     0     0      1
9      0     0     1     1     0     1     1     1     1     1      1
10     0     0     0     0     0     0     0     0     0     0      1
11     1     1     0     1     0     1     1     1     0     0      1
12     1     0     0     1     0     1     1     0     0     0      1
13     1     1     1     1     0     1     1     0     0     0      1
14     1     0     0     0     0     0     1     0     0     0      1
15     1     0     0     0     0     1     1     0     0     0      1

The matrix must have this format or the program will not work (although the column names are optional). Itemnr is the specific item that you want to test for the presence of DIF. Because the first column of the data matrix must be the group column, column itemnr + 1 will be read for the item (i.e. if you specify itemnr as 5, the program will read the 6th column of the data matrix, which holds the responses to item 5). Finally, the argument alpha can be used to specify a significance level. It is set to .05 (a confidence level of .95) by default. Because the program runs two tests (one for uniform DIF and one for non-uniform DIF), a Bonferroni correction is built in: when alpha is .05, the critical χ² test statistic is calculated for an area of .025. The output of the program specifies whether DIF is present or absent: it can indicate the presence of uniform DIF and/or non-uniform DIF, or the absence of DIF. It also prints the relevant p-value, which was compared to alpha to determine significance, and the relevant χ² test statistic.
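As a small illustration of the built-in Bonferroni correction (assuming the default alpha of .05), the corresponding critical χ² values can be computed in R as follows:

alpha <- .05
# Two tests are run per item, so each test uses alpha / 2 = .025
qchisq(1 - alpha / 2, df = 1)  # critical value for the 1-df model comparisons (about 5.02)
qchisq(1 - alpha / 2, df = 2)  # critical value for comparing model 3 to model 1 (about 7.38)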

3.3 Logistic regression - an example

We will use the matrix above to run the program, testing for DIF in item 1.

> difdetection(Data=datamatrix, itemnr=1)
There is no significant DIF, p = 0.2445 and chi-square = 2.817 for model 3 versus model 1.

Since the data were generated without DIF, this conclusion is correct.


3.4 Generating data - what the program does

Gensample is a function to generate item responses based on item response theory. It first generates ability levels (from a N(0, 1) distribution) and item parameters (discrimination (a) parameters from a Unif(.5, 1.5) distribution and location (b) parameters from a Unif(−2, 2) distribution). It then calculates a matrix of probabilities of responding to each item correctly (based on the familiar two-parameter logistic IRT formula) and compares these probabilities to random deviates from a Unif(0, 1) distribution. If the probability (for a specific person on a specific item) is higher than the deviate, the item response is 1; otherwise the item response is 0.
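The generation step can be sketched as follows. This is an illustrative reimplementation of the logic just described, not the actual gensample code, and the object names are placeholders.

# Sketch of IRT-based response generation: 2PL probabilities vs. uniform draws
set.seed(42)
npersons <- 15; nitems <- 10; D <- 1.7
theta <- rnorm(npersons)               # ability levels from N(0, 1)
a     <- runif(nitems, 0.5, 1.5)       # discrimination (a) parameters
b     <- runif(nitems, -2, 2)          # location (b) parameters

# P[i, j]: probability that person i answers item j correctly under the 2PL model
P <- plogis(D * sweep(outer(theta, b, "-"), 2, a, "*"))

# Response is 1 when the probability exceeds a Unif(0, 1) deviate, else 0
responses <- (P > matrix(runif(npersons * nitems), npersons, nitems)) * 1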

3.5 Generating data - instructions for using the function

The function gensample takes 6 arguments, some of which are optional. The first two arguments, npreference and npfocal, specify the number of people in the reference and focal group (respectively). The third argument, nritems, specifies the number of items. seed is used to set a seed for generating the random numbers, so results can be replicated; when this argument is not specified, a random integer between 0 and 10,000 is generated from a uniform distribution to serve as the seed. impactsize can optionally be used to specify impact: the number entered here is the amount by which the focal group ability level mean is shifted away from 0. The final argument, D, is the D-parameter from the IRT model; it is set to 1.7 by default but can be changed to 1.0 if so preferred. In that case, the a-parameters will be drawn from a Unif(.85, 2.55) distribution (instead of a Unif(.5, 1.5) distribution). The function returns a list of results: the first element is the data matrix, with the first column indicating group membership and the other columns representing items. The second element is a vector of person ability scores, one for each examinee. The third element is a vector of location parameters, one for each item. The fourth element is a vector of discrimination parameters, one for each item.
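For example, a call of the following form (the argument values are chosen purely for illustration) would generate responses for 500 reference and 500 focal examinees on 40 items, with the focal group ability mean shifted to -1:

mydata <- gensample(npreference = 500, npfocal = 500, nritems = 40,
                    seed = 123, impactsize = -1)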


3.6 Generating data - an example

The matrix used above in the difdetection explanation was generated using gensample:

> mydata
$datamatrix
   group item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
1      0     1     0     1     0     1     1     1     0     0      1
2      0     1     1     1     0     1     1     1     0     0      1
3      0     1     0     1     0     0     1     0     1     0      1
4      0     1     0     1     0     1     1     1     0     1      1
5      0     0     0     0     0     1     0     0     0     0      0
6      0     0     0     0     0     1     0     0     0     0      0
7      0     0     0     0     0     0     1     0     0     0      1
8      0     1     0     1     1     1     1     0     0     0      1
9      0     0     1     1     0     1     1     1     1     1      1
10     0     0     0     0     0     0     0     0     0     0      1
11     1     1     0     1     0     1     1     1     0     0      1
12     1     0     0     1     0     1     1     0     0     0      1
13     1     1     1     1     0     1     1     0     0     0      1
14     1     0     0     0     0     0     1     0     0     0      1
15     1     0     0     0     0     1     1     0     0     0      1

$ability
            [,1]
 [1,]  1.1574701
 [2,]  1.2793092
 [3,]  0.9358954
 [4,]  1.8383080
 [5,] -1.0072701
 [6,] -1.3351441
 [7,] -0.4103148
 [8,]  1.3739080
 [9,]  1.2100689
[10,] -1.4944025
[11,]  1.0676441
[12,] -0.4520208
[13,]  0.6931748
[14,] -0.9086439
[15,] -0.5945509

$location
 [1]  0.2402671  1.4646283  0.1107804  1.7933544 -1.4153728 -1.1886457
 [7]  0.6855960  1.8166490  1.5992394 -1.3088213

$discrimination
 [1] 0.6062838 1.0881245 1.2805567 1.1594601 0.5552541 0.9621954 0.9205742
 [8] 1.0082268 1.0926173 1.0199537

The first element of the list is the data matrix. The other elements tell us the (randomly generated) ability levels and item parameters.

3.7 Adding DIF - what the program does

The function adddif is a way to take the output of the gensample function and change one column (the set of responses to one item) to represent DIF. It recalculates the item responses based on changed item parameters. It can add both uniform and nonuniform DIF.
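The recalculation can be sketched as follows. This is an illustrative reimplementation of the idea, not the actual adddif code; the function name and arguments are hypothetical. It regenerates one item's responses after shifting the focal group's item parameters.

# Sketch: regenerate responses to one item with DIF added for the focal group
add_dif_item <- function(theta, group, a, b, unif = 0, nonunif = 0, D = 1.7) {
  a_g <- ifelse(group == 1, a + nonunif, a)  # steeper curve for the focal group
  b_g <- ifelse(group == 1, b + unif, b)     # harder item for the focal group
  p <- plogis(D * a_g * (theta - b_g))       # 2PL probabilities per person
  as.integer(p > runif(length(theta)))       # new 0/1 responses for this item
}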

3.8 Adding DIF - instructions for using the function

The function adddif takes 6 arguments. The first, mydata, must be the list output of the gensample function. The argument itemnr must be the number of the item that will be changed. unifDIFamount is the argument that specifies the amount of uniform DIF: in particular, how much more difficult (how much higher the b-parameter) the item is for members of the focal group. The argument nonunifDIFamount represents the amount of non-uniform DIF: in particular, how much steeper the curve is (how much higher the a-parameter) for members of the focal group. Both amount arguments are set to 0 by default. The D argument and seed argument function the same as in gensample. This function returns a (longer) list of results: the first element is the data matrix, with the first column indicating group membership and the other columns representing items. The second element is the vector of ability scores, one for each examinee. The third element is a vector of (original) difficulty parameters, one for each item. The fourth element is a vector of (original, reference group) discrimination parameters, one for each item. The fifth element is a number, the item number that has been changed. The sixth and seventh elements are the specified amounts of uniform and non-uniform DIF (respectively).
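For example, a call of the following form (argument values chosen purely for illustration) would make item 1 half a point more difficult for members of the focal group:

mydata2 <- adddif(mydata, itemnr = 1, unifDIFamount = 0.5)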

3.9 Adding DIF - an example

We will change two items of the data matrix that we used earlier: we will add uniform DIF to item 1 and nonuniform DIF to item 2.

> mydata3[[1]]
   group item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
1      0     0     0     1     0     1     1     1     0     0      1
2      0     1     1     1     0     1     1     1     0     0      1
3      0     0     0     1     0     0     1     0     1     0      1
4      0     1     1     1     0     1     1     1     0     1      1
5      0     0     0     0     0     1     0     0     0     0      0
6      0     0     0     0     0     1     0     0     0     0      0
7      0     0     0     0     0     0     1     0     0     0      1
8      0     1     0     1     1     1     1     0     0     0      1
9      0     1     1     1     0     1     1     1     1     1      1
10     0     0     0     0     0     0     0     0     0     0      1
11     1     1     1     1     0     1     1     1     0     0      1
12     1     0     0     1     0     1     1     0     0     0      1
13     1     0     0     1     0     1     1     0     0     0      1
14     1     0     0     0     0     0     1     0     0     0      1
15     1     0     0     0     0     1     1     0     0     0      1

When comparing the old data matrix to the new data matrix, one might notice that not just the responses for the focal group have changed. This is because the data are regenerated for the specified item and there is an element of randomness in generating the item responses. However, the item parameters used for the reference group have not changed, whereas those for the focal group have. Finally, we will see if the difdetection function picks up on the newly introduced presence of DIF:

> difdetection(Data=data, itemnr=1)
There is no significant DIF, p = 1 and chi-square = 2.767e-10 for model 3 versus model 1.
> difdetection(Data=data, itemnr=2)
There is no significant DIF, p = 1 and chi-square = 3.96e-10 for model 3 versus model 1.

It does not - however, this is not surprising, given that the sample data matrix used here is small both in the number of people and in the number of items.

4 Sampling

4.1 Experimental setup

In order to test the difdetection function, I generated data using the gensample and adddif functions. I generated six sets of data, each with 40 items: three items had uniform DIF (with changes in the location parameter of -1, .5 and 1), three had crossing non-uniform DIF (with changes in the discrimination parameter of .5, .8 and 1.0), and two had non-crossing non-uniform DIF (one with a location change of -1 and a discrimination change of .5, and one with a location change of .5 and a discrimination change of 1.0). The other 32 items had no DIF. The first dataset had 200 subjects and no impact. The second dataset had 200 subjects and impact of -1 (the average ability level for the focal group was -1 rather than 0). The third and fourth datasets had 500 examinees, with no impact and impact of -1 respectively. The fifth and sixth datasets had 1000 examinees, with no impact and impact of -1 respectively. In each dataset there were equal numbers of reference and focal group examinees.

4.2 Example of the code

The code used to generate the first sample, including DIF, and to test for DIF in the first 5 items, is given next:


> # Condition nr 1: 200 people, no impact
> # Generating the data
> mydata.20.0