Components of reading in first and second language, test item difficulty and overall reading ability

J. Charles Alderson, Tineke Brunfaut, Gareth McCray & Lea Nieminen

Three projects

Components of reading test item difficulty
1. 2009 PISA reading items
2. PTE Academic reading items

Components of reading in a first and a foreign language
3. DIALUKI Project

Components of reading item difficulty

Why predict reading item difficulty?
• To reduce the number of unsuitable items produced by item writers
• To reduce the amount of piloting
• To facilitate a more accurate deployment of items which attempt to measure a specific band of reading ability
• To inform inference of the cognitive processes underlying specific items and, by extension, the construct being measured

The use of regression to predict reading item difficulty

Who                                         | When | What                        | Variables    | Variance explained
Drum, Calfee, and Cook                      | 1981 | Children's reading test     | 10 variables | 55-94%
Pollitt, Hutchinson, Entwistle, and De Luca | 1985 | Scottish O'levels           | 22 variables | 61%
Davey                                       | 1988 | Stanford Achievement Test   | 2 variables  | 41% / 29%
Freedle & Kostin                            | 1991 | Scholastic Aptitude Test    | 8 variables  | 58%
Freedle & Kostin                            | 1992 | Graduate Record Exam        | 7 variables  | 41%
Freedle & Kostin                            | 1993 | TOEFL reading comprehension | 11 variables | 58%

Project 1: PISA 2009
• Programme for International Student Assessment
• 15-year-old students
• 65 countries
• Reading test:
  – Reading in the language of instruction
  – Item types: selected and constructed response

Research question To what extent can item difficulties of the PISA 2009 reading items be predicted from a selection of judgment variables?

Judgment variables

#  | Variable                                           | Refers to
1  | Number of features and conditions                  | The number of elements which the respondent must extract from the text to answer the question correctly.
2  | Proximity of pieces of required information        | How close the required pieces of information are in the text.
3  | Competing information                              | The amount and plausibility of distracting information.
4  | Structural prominence of target information        | The prominence of the location of the necessary information within the text.
5  | Transparency of task                               | The complexity of the nature of the task.
6  | Semantic match between task and target information | The closeness of the semantic link between the information in the task and in the item text.
7  | Concreteness of information                        | The type of information, on a continuum from abstract to concrete, which the reader must identify.
8  | Familiarity of information needed                  | The likelihood that the reader will be familiar with the subject of the text.
9  | Register of the text                               | The degree of formality in the text.
10 | Information from outside the text required         | The degree to which the respondent must draw upon background knowledge to respond correctly to the item.

Data
• 97 PISA reading comprehension items
• Each item was judged on a 4-point scale for each of the 10 variables (Lumley et al., 2009)

E.g. the 4-point scale for the variable Number of features and conditions:
1 – Question provides a single feature to identify/understand.
2 – Question provides two features or conditions to identify/understand.
3 – Question provides three features or conditions to identify/understand.
4 – Task requires more than three features or conditions to identify/understand.

• 3 expert judges, 1 agreed judgment

Exploratory analysis

[Figure: boxplots of item difficulty (Delta) against judged level for each of the ten variables: Number of features and conditions, Proximity of pieces of required information, Competing information, Structural prominence of target information, Transparency of task, Semantic match between task and target information, Concreteness of information, Familiarity of information needed, Register of the text, and Information from outside the text required.]

The problem with stepwise removal
• The stepwise procedure is used to exclude variables with low explanatory power from a statistical model.
• However, for certain types of ordinal variable (particularly variables based on human judgments on a posited underlying scale), this procedure can remove variables that do have explanatory power.
• In such cases, collapsing scale points of the variable would be more useful.

Statistical solution
• Therefore, to find an optimal statistical model, a generalized linear model was run for every possible collapse permutation of every variable (524,288 models).
• The models were assessed for parsimony according to the AIC criterion.
• The model with the lowest AIC was selected as the best (most explanatory power with fewest parameters, based on this criterion).
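The search over collapse permutations can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it uses ordinary least squares on invented toy data, merges adjacent scale points of each 4-point judgment variable according to a binary cut pattern, and keeps the model with the lowest AIC.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_items = 97
# Toy stand-ins: three 4-point judgment variables and an item difficulty (delta).
judgments = rng.integers(1, 5, size=(n_items, 3))
delta = judgments @ np.array([0.5, 0.3, 0.0]) + rng.normal(scale=0.5, size=n_items)

def collapse(col, cuts):
    """Merge adjacent scale points: cuts[i] == 0 removes the boundary
    between points i+1 and i+2 of the original 1-4 scale."""
    out = np.zeros_like(col)
    for boundary, keep in enumerate(cuts, start=2):
        if keep:
            out += (col >= boundary).astype(int)
    return out

def aic(X, y):
    """AIC of an OLS fit with intercept (Gaussian likelihood)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = ((y - A @ beta) ** 2).sum()
    k = A.shape[1] + 1  # regression coefficients + error variance
    return len(y) * np.log(rss / len(y)) + 2 * k

patterns = list(itertools.product((0, 1), repeat=3))  # 8 collapses per variable
best_aic, best_cuts = min(
    (aic(np.column_stack([collapse(judgments[:, j], c[j]) for j in range(3)]),
         delta), c)
    for c in itertools.product(patterns, repeat=3)    # 8**3 = 512 models here
)
print(f"lowest AIC: {best_aic:.2f} with cut patterns {best_cuts}")
```

With ten judgment variables the same exhaustive enumeration yields the hundreds of thousands of candidate models mentioned above; only the number of variables changes.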

Results: Model with Best Fit (AIC)

Variable              | Level   | Coefficient
Intercept             |         | -1.03**
Number of features    | Level 2 | 0.53*
Proximity             | Level 2 | -0.41
Competing info        | Level 1 | -0.95**
                      | Level 2 | 0.15
Structural prominence | Level 2 | 0.37*
Transparency of task  | Level 2 | 0.92***
Semantic match        | Level 2 | 0.59**
Familiarity           | Level 2 | 0.91***
                      | Level 3 | 1.27***
Register              | Level 2 | -0.39*
                      | Level 3 | -0.71**

Adjusted R² = 0.64

[Figure: observed vs. fitted values.]

Extent of item difficulty prediction
• AIC-specified model: item difficulty predicted to within +/- 1.36 logits
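To make the use of such a model concrete: the predicted difficulty of an item is the intercept plus the coefficient of each level at which the item was judged (baseline levels contribute nothing). A minimal sketch using the coefficients reported above; the example item and its judged levels are invented for illustration:

```python
# Coefficients from the best-fitting PISA model (baseline levels contribute 0).
coef = {
    "intercept": -1.03,
    ("number_of_features", 2): 0.53,
    ("proximity", 2): -0.41,
    ("competing_info", 1): -0.95,
    ("competing_info", 2): 0.15,
    ("structural_prominence", 2): 0.37,
    ("transparency", 2): 0.92,
    ("semantic_match", 2): 0.59,
    ("familiarity", 2): 0.91,
    ("familiarity", 3): 1.27,
    ("register", 2): -0.39,
    ("register", 3): -0.71,
}

def predict_difficulty(judged_levels):
    """Sum the intercept and the coefficient of each judged (variable, level)."""
    return coef["intercept"] + sum(
        coef.get((var, lvl), 0.0) for var, lvl in judged_levels.items())

# Hypothetical item: two features/conditions, an opaque task, familiar topic,
# all other variables at their baseline level.
item = {"number_of_features": 2, "transparency": 2, "familiarity": 3}
print(round(predict_difficulty(item), 2))  # predicted difficulty in logits
```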

Conclusion

Who                                         | When | What                        | Variables    | Variance explained
Drum, Calfee, and Cook                      | 1981 | Children's reading test     | 10 variables | 55-94%
Pollitt, Hutchinson, Entwistle, and De Luca | 1985 | Scottish O'levels           | 22 variables | 61%
Davey                                       | 1988 | Stanford Achievement Test   | 2 variables  | 41% / 29%
Freedle & Kostin                            | 1991 | Scholastic Aptitude Test    | 8 variables  | 58%
Freedle & Kostin                            | 1992 | Graduate Record Exam        | 7 variables  | 41%
Freedle & Kostin                            | 1993 | TOEFL reading comprehension | 11 variables | 58%
McCray, Alderson & Brunfaut                 | 2012 | PISA 2009 Reading           | 8 variables  | 64%

Project 2: PTE Academic
• Pearson Test of English - Academic
• Computer-based test measuring all four skills
• Reading item types used in Project 2:
  – Single-answer multiple-choice
  – Multiple-answer multiple-choice
  – Fill in the blank

Research question Can the variables used in Project 1 (PISA 2009) describe reading item difficulty in the L2 context of the PTE Academic?

Data
• 81 PTE Academic reading comprehension items
• Each item was judged on a 4-point scale for each of the 10 variables
• 5 expert judges, 5 judgements

• Very low expert judge agreement using a 4-point scale, so judgments were collapsed to a binary scale to increase rater agreement
• Removal of two variables:
  – Transparency of task: judges found it difficult to understand
  – Number of features: very high correlation with the variable Proximity
• Final variable set: binary-scale judgments on 8 variables
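The effect of collapsing the scale on raw agreement can be illustrated as follows. The two judges and their ratings are invented for the example; the actual judgment data are not reproduced here:

```python
import numpy as np

def raw_agreement(a, b):
    """Proportion of items on which two raters give the identical rating."""
    a, b = np.asarray(a), np.asarray(b)
    return float((a == b).mean())

def to_binary(ratings, threshold=2):
    """Collapse a 1-4 judgment scale to binary: low (<= threshold) vs high."""
    return (np.asarray(ratings) > threshold).astype(int)

# Two hypothetical judges who disagree on exact scale points
# but agree on which half of the scale an item falls in.
judge_a = [1, 2, 3, 4, 1, 2, 4, 3]
judge_b = [2, 1, 4, 3, 2, 1, 3, 4]

print(raw_agreement(judge_a, judge_b))                        # 0.0 on the 4-point scale
print(raw_agreement(to_binary(judge_a), to_binary(judge_b)))  # 1.0 after collapsing
```

Collapsing trades precision for reliability: judges who cannot agree on a fine-grained level may still agree on a coarser one.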

Exploratory analysis

[Figure: boxplots of item difficulty (Delta) against judged level for each of the eight variables: Proximity of pieces of required information, Competing information, Structural prominence of target information, Semantic match between task and target information, Concreteness of information, Familiarity of information needed, Register of the text, and Information from outside the text required.]

Results: Model with Best Fit (AIC)

Variable       | Level   | Coefficient
Intercept      |         | -0.27***
Proximity      | Level 2 | 0.48***
Competing info | Level 2 | 0.46***
Concreteness   | Level 2 | 0.34

Adjusted R² = 0.05

Adjusted R² by judge:
Judge 1 | Judge 2 | Judge 3 | Judge 4 | Judge 5
0.18    | 0.13    | 0.23    | 0.13    | 0.00

[Figure: observed vs. fitted values.]

Conclusion
Can the variables used in Project 1 (PISA 2009) describe reading item difficulty in the L2 context of the PTE Academic?
→ No, but ...

Methodological challenge
To what extent do expert judges agree in language testing?

Bejar (1983)
• What: item difficulty and discrimination prediction; MCQs from the American Scholastic Aptitude Test (SAT)
• Conclusion: pooled expert judgements are not sufficiently accurate to replace empirically gathered values

Alderson (1993)
• What: 1) judgments of skills assessed by items, 10 NS judges; 2) judgments of skills assessed by items, 17 EFL teachers
• Conclusion: judges had difficulty in agreeing as to what skills were being tested
• What: 3) standard-setting judgements (procedure similar to the Angoff method)
• Conclusion: pooled judgements reasonably reflect the true performances of the 20,000 candidates

Bachman et al. (1996)
• What: judgment of salient characteristics of 40 items, using two taxonomical frameworks, Test Methods Characteristics (TM) and Communicative Language Ability (CLA); 5 judges
• Conclusion: inter-judge average raw agreement proportion of 0.64 for the CLA framework and 0.75 for the TM framework (note: values would be 0.43 (CLA) and 0.55 (TM) by chance)

Methodological challenge: inter-judge reliability

Variable       | AC1    | Altman (1991) benchmark
Proximity      | 0.46** | Moderate
Competing info | 0.27** | Fair
Prominence     | 0.47** | Moderate
Semantic match | 0.08   | Poor
Concreteness   | 0.41** | Moderate
Familiarity    | 0.48** | Moderate
Register       | 0.49** | Moderate
Outside info   | 0.60** | Moderate
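The AC1 statistic reported above is Gwet's first-order agreement coefficient, a chance-corrected agreement index that stays stable when category prevalence is skewed. A minimal two-rater implementation for categorical ratings (a sketch of the statistic, not the software actually used for these analyses):

```python
import numpy as np

def gwet_ac1(rater1, rater2):
    """Gwet's AC1 for two raters and categorical ratings.

    AC1 = (pa - pe) / (1 - pe), where pa is raw agreement and
    pe = (1 / (q - 1)) * sum_k pi_k * (1 - pi_k), with pi_k the mean
    proportion of the two raters' uses of category k.
    """
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(r1, r2)
    q = len(categories)
    pa = float((r1 == r2).mean())
    pi = np.array([((r1 == k).mean() + (r2 == k).mean()) / 2 for k in categories])
    pe = (pi * (1 - pi)).sum() / (q - 1)
    return (pa - pe) / (1 - pe)

# Binary example with 3/4 raw agreement.
print(round(gwet_ac1([1, 1, 1, 0], [1, 1, 0, 0]), 3))  # 0.529
```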

Three projects

Components of reading test item difficulty
1. 2009 PISA reading items
2. PTE Academic reading items

Components of reading in a first and a foreign language
3. DIALUKI Project

Acknowledgments: DIALUKI
Other DIALUKI team members, University of Jyväskylä:
• Academic coordinator: Dr Ari Huhta
• Post-doc research fellow: Dr Riikka Ullakonoja
• Research assistant: Eeva-Leena Haapakangas
Co-funded by the UK Economic and Social Research Council, the Academy of Finland and the University of Jyväskylä.

Project 3: DIALUKI
Informants
• Finnish-speaking learners of English as FL:
  – primary school, 4th grade (age 10)
  – lower secondary school, 8th grade (age 14)
  – gymnasium (academically oriented upper secondary school), 2nd-year students (age 17)
• Russian-speaking learners of Finnish as SL:
  – primary school (3rd-6th grade)
  – lower secondary school (7th-9th grade)
• From 111 schools around Finland

DIALUKI: three major studies

Study 1 (2010/2011)
• A cross-sectional study with 3 x 200 + 250 students.
• Exploring the value of a range of L1 and L2 measures in predicting L2 reading and writing, in order to select the best predictors for further studies.

Study 2 (2011 - 2012/13)
• Longitudinal, 2-3 years.
• The development of literacy skills, and the relationship of this development to the diagnostic measures.

Study 3 (2012/13)
• Several training/experimental studies, each a few weeks in length.
• Morphological awareness, extensive reading, vocabulary learning strategies, phonological awareness, strategies in reading and writing.

Independent predictor variables in L1 and FL

Cognitive measures
• Backwards digit span in L1 and FL
• Rapid recognition of words in L1 and FL
• Rapid word list reading in L1 and FL
• Rapid automatised naming in L1 and FL
• Non-word reading in L1 and FL
• Non-word spelling in L1
• Non-word repetition in L1 and FL
• Phoneme deletion in L1 and FL
• Common unit in L1 and FL

Stepwise multiple regression: cognitive variables with L1 Finnish reading

Group     | Adjusted R² | % variance | First variable       | Second variable       | Third variable
4th Grade | .108        | 11%        | Word list L1 (.247)  | Digit span L1 (.243)  | NW repeat L1 (.219)
8th Grade | .086        | 7%         | Digit span L1 (.249) | Rapid words L1 (.203) |
Gymnasium | .039        | 4%         | Digit span L1 (.210) |                       |
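Stepwise selection of the kind reported in these tables can be sketched as a forward search that adds, at each step, the predictor giving the largest gain in adjusted R². This is an illustrative numpy implementation on synthetic data, with invented predictor names; the actual analyses were presumably run in a standard statistics package:

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = ((y - A @ beta) ** 2).sum()
    tss = ((y - y.mean()) ** 2).sum()
    return 1 - (rss / tss) * (n - 1) / (n - k - 1)

def forward_stepwise(X, y, names):
    """Greedily add the predictor that most improves adjusted R^2."""
    selected, remaining, best = [], list(range(X.shape[1])), -np.inf
    while remaining:
        score, j = max((adjusted_r2(X[:, selected + [j]], y), j) for j in remaining)
        if score <= best:
            break  # no remaining predictor improves the fit
        best = score
        selected.append(j)
        remaining.remove(j)
    return [names[j] for j in selected], best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # toy scores: e.g. digit span, rapid words, NW repeat
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=200)
names = ["digit_span", "rapid_words", "nw_repeat"]
chosen, r2 = forward_stepwise(X, y, names)
print(chosen, round(r2, 3))
```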

Stepwise multiple regression: cognitive variables with EFL reading

Group     | Adjusted R² | % variance | First variable            | Second variable           | Third variable
4th Grade | .264        | 26%        | PhonDel in English (.441) | RapidW in Finnish (.366)  | Digit span in English (.317)
8th Grade | .260        | 26%        | RAN in English (-.436)    | PhonDel in English (.369) | RapidW in English (.298)
Gymnasium | .237        | 24%        | RAN in English (-.399)    | RAN in Finnish (-.057)    | Common unit in Finnish (.267)

Fourth variable: NonWread in English (.294)

SEM analyses
• Mplus, version 5.21
• MLR (maximum likelihood estimation with robust standard errors) was used
• All models presented here were acceptable according to the model fit indices (CFI, TLI, RMSEA, SRMR and chi-square)

(We would like to thank Karen Dunn (Lancaster U.) and Kenneth Eklund (U. of Jyväskylä) for their invaluable advice on the SEM analyses.)

Structural Equation Modelling (SEM)

[Figures: three-latent-variable models of the cognitive variables for 4th graders, 8th graders and gymnasium students, including path models for the 4th-grade and gymnasium groups.]

Summary of the results
• Cognitive variables predict variance in EFL reading better than in Finnish-as-L1 reading.
• However, cognitive tasks make only a small contribution to the prediction of variance in reading tasks.
• The cognitive measures administered in the foreign language may be as much linguistic as cognitive.
• These cognitive measures have more to do with decoding (lower-level skills) than with reading comprehension (higher-level skills).

Summary of the SEM results
• (Almost) the same 3 latent cognitive traits could be identified for two out of three age groups:
  1) Fluent (word) reading / lexical retrieval
     • FL RAN and FL list reading in all groups (+ L1 RAN and L1 list reading; for 4th graders also rapidly presented words in FL and L1)
  2) Phonological processing / efficiency
     • Weaker loadings of measures than in the other two latent variables
     • Composition changed across age groups: only L1 tasks in 4th graders, a mixture in 8th graders, only FL tasks in gymnasium
  3) Working memory
     • Backwards digit span in L1 & FL in all groups
• The latent cognitive traits usually correlated with each other.
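For illustration, a three-factor measurement model of this kind could be specified in Mplus syntax roughly as follows. This is a hedged sketch: the data file and indicator names are invented stand-ins for the actual DIALUKI measures, and the authors' real input files are not reproduced here.

```
TITLE: Three latent cognitive traits (illustrative sketch);
DATA: FILE = dialuki_grade4.dat;    ! hypothetical data file
VARIABLE: NAMES = ran_fl list_fl ran_l1 list_l1
                  phondel_l1 nwrep_l1 digit_l1 digit_fl;
ANALYSIS: ESTIMATOR = MLR;          ! robust ML, as reported on the SEM slide
MODEL:
  fluency BY ran_fl list_fl ran_l1 list_l1;   ! fluent reading / lexical retrieval
  phono   BY phondel_l1 nwrep_l1;             ! phonological processing
  wm      BY digit_l1 digit_fl;               ! working memory
  fluency WITH phono wm;                      ! latent traits allowed to correlate
  phono   WITH wm;
```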

Summary of the SEM results Question: proper place and role of Working memory in modeling cognitive traits? – Direct ’effect’ on reading or via other cognitive traits? – We only measured backward digit span (numbers) – Additional, non-linguistic, measures of working memory would be useful

Thank you for your attention! J. Charles Alderson, Tineke Brunfaut, Gareth McCray & Lea Nieminen
