Components of reading in first and second language, test item difficulty and overall reading ability
J. Charles Alderson, Tineke Brunfaut, Gareth McCray & Lea Nieminen
Three projects Components of reading test item difficulty 1. 2009 PISA reading items
2. PTE Academic reading items
Components of reading in a first and a foreign language 3. DIALUKI Project
Components of reading item difficulty

Why predict reading item difficulty?
• To reduce the number of unsuitable items produced by item writers
• To reduce the amount of piloting
• To facilitate a more accurate deployment of items which attempt to measure a specific band of reading ability
• To inform inference of the cognitive processes underlying specific items, and by extension, the construct being measured
The use of regression to predict reading item difficulty

| Who | When | What | Number of variables | Variance explained |
| Drum, Calfee, and Cook | 1981 | Children's reading test | 10 variables | 55-94% |
| Pollitt, Hutchinson, Entwistle, and De Luca | 1985 | Scottish O levels | 22 variables | 61% |
| Davey | 1988 | Stanford Achievement Test | 2 variables | 41% / 29% |
| Freedle & Kostin | 1991 | Scholastic Aptitude Test | 8 variables | 58% |
| Freedle & Kostin | 1992 | Graduate Record Exam | 7 variables | 41% |
| Freedle & Kostin | 1993 | TOEFL Reading comprehension | 11 variables | 58% |
Project 1: PISA 2009
• Programme for International Student Assessment
• 15-year-old students
• 65 countries
• Reading test:
  – Reading in the language of instruction
  – Item types: selected and constructed response
Research question To what extent can item difficulties of the PISA 2009 reading items be predicted from a selection of judgment variables?
Judgment variables

| Variable | Refers to |
| 1. Number of features and conditions | The number of elements which the respondent must extract from the text to answer the question correctly. |
| 2. Proximity of pieces of required information | How close the required pieces of information are in the text. |
| 3. Competing information | The amount and plausibility of distracting information. |
| 4. Structural prominence of target information | The prominence of the location of the necessary information within the text. |
| 5. Transparency of task | The complexity of the nature of the task. |
Judgment variables (continued)

| Variable | Refers to |
| 6. Semantic match between task and target information | The closeness of the semantic link between the information in the task and in the item text. |
| 7. Concreteness of information | The type of information, on a continuum 'abstract-concrete', which the reader must identify. |
| 8. Familiarity of information needed | The likelihood that the reader will be familiar with the subject of the text. |
| 9. Register of the text | The degree of formality in the text. |
| 10. Is information from outside the text required | The degree to which the respondent must draw upon background knowledge to correctly respond to the item. |
Data
• 97 PISA reading comprehension items
• Each item was judged on a 4-point scale for each of the 10 variables (Lumley et al., 2009)
• 3 expert judges, 1 agreed judgment

E.g. 4-point scale for the variable Number of features and conditions:
1 – Question provides a single feature to identify/understand.
2 – Question provides two features or conditions to identify/understand.
3 – Question provides three features or conditions to identify/understand.
4 – Task requires more than three features or conditions to identify/understand.
Exploratory analysis

[Figure: scatterplots of item difficulty (Delta) against judged level for each of the ten variables: Number of features and conditions, Proximity of pieces of required information, Competing information, Structural prominence of target information, Transparency of task, Semantic match between task and target information, Concreteness of information, Familiarity of information needed, Register of the text, and Information from outside the text.]
The problem with stepwise removal
• The stepwise procedure is used to exclude variables with low explanatory power from a statistical model.
• However, for certain types of ordinal variable (particularly variables based on human judgments on a posited underlying scale), this procedure can remove variables that do have explanatory power.
• In such cases, collapsing the variable's scale points would be more useful.
Solution
• Therefore, to find an optimal statistical model, a generalized linear model was run for every possible collapse permutation of every variable (524,288 models).
• The models were assessed for parsimony according to the AIC criterion.
• The model with the lowest AIC was selected as the best (most explanatory power with fewest variables, based on this criterion).
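A minimal sketch of this search, assuming simulated data, only two judgment variables, and an ordinary least squares fit scored with a Gaussian AIC (the actual study ran a generalized linear model over all ten variables, giving the 524,288 models):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the real data: 97 item difficulties (delta) plus two
# 4-point judgment variables with effects at known scale boundaries.
n = 97
X = rng.integers(1, 5, size=(n, 2))
delta = 0.5 * (X[:, 0] > 2) + 0.9 * (X[:, 1] > 1) + rng.normal(0, 0.5, n)

def collapse(col, boundaries):
    """Map 1-4 judgments to collapsed levels; boundaries flags the cut
    points 1|2, 2|3, 3|4 (a 0 merges the two adjacent scale points)."""
    level = np.zeros(len(col), dtype=int)
    for cut, keep in zip((1, 2, 3), boundaries):
        level += keep * (col > cut)
    return level

def aic(y, design):
    # Gaussian AIC (up to an additive constant) from an OLS fit.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * (design.shape[1] + 1)

best = (np.inf, None)
# All collapse permutations of both variables: 8 x 8 = 64 models here.
for pattern in itertools.product(itertools.product((0, 1), repeat=3), repeat=2):
    dummies = []
    for j in range(2):
        z = collapse(X[:, j], pattern[j])
        # Dummy-code each collapsed variable so levels enter as categories.
        dummies.append(np.equal.outer(z, np.unique(z)[1:]).astype(float))
    design = np.hstack([np.ones((n, 1))] + dummies)
    score = aic(delta, design)
    if score < best[0]:
        best = (score, pattern)
```

Dummy-coding each collapsed variable means the AIC penalty reflects how many distinct scale points survive the collapse, so merging uninformative scale points is rewarded instead of dropping the variable outright.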
Results: model with best fit (AIC)

| Variable | Level | Coefficient |
| Intercept | | -1.03** |
| Number of features | Level 2 | 0.53* |
| Proximity | Level 2 | -0.41 |
| Competing info | Level 1 | -0.95** |
| Competing info | Level 2 | 0.15 |
| Structural prominence | Level 2 | 0.37* |
| Transparency of task | Level 2 | 0.92*** |
| Semantic match | Level 2 | 0.59** |
| Familiarity | Level 2 | 0.91*** |
| Familiarity | Level 3 | 1.27*** |
| Register | Level 2 | -0.39* |
| Register | Level 3 | -0.71** |

Adjusted R² = 0.64

[Figure: observed vs. fitted values]
Extent of item difficulty prediction
• AIC-specified model: +/- 1.36 logits
Conclusion

| Who | When | What | Number of variables | Variance explained |
| Drum, Calfee, and Cook | 1981 | Children's reading test | 10 variables | 55-94% |
| Pollitt, Hutchinson, Entwistle, and De Luca | 1985 | Scottish O levels | 22 variables | 61% |
| Davey | 1988 | Stanford Achievement Test | 2 variables | 41% / 29% |
| Freedle & Kostin | 1991 | Scholastic Aptitude Test | 8 variables | 58% |
| Freedle & Kostin | 1992 | Graduate Record Exam | 7 variables | 41% |
| Freedle & Kostin | 1993 | TOEFL Reading comprehension | 11 variables | 58% |
| McCray, Alderson & Brunfaut | 2012 | PISA 2009 Reading | 8 variables | 64% |
Project 2: PTE Academic
• Pearson Test of English - Academic
• Computer-based test measuring all four skills
• Reading items used in Project 2:
  – Single-answer multiple-choice
  – Multiple-answer multiple-choice
  – Fill in the blank
Research question Can the variables used in Project 1 (PISA 2009) describe reading item difficulty in the L2 context of the PTE Academic?
Data
• 81 PTE Academic reading comprehension items
• Each item was judged on a 4-point scale for each of the 10 variables
• 5 expert judges, 5 judgements
Data
• Very low expert judge agreement using a 4-point scale, so judgments were collapsed to a binary scale to increase rater agreement
• Removal of two variables: – Transparency of task: judges found it difficult to understand – Number of features: very high correlation with the variable Proximity
• Final variable set: binary scale judgments on 8 variables
Exploratory analysis

[Figure: scatterplots of item difficulty (Delta) against judged level for each of the eight variables: Proximity of pieces of required information, Competing information, Structural prominence of target information, Semantic match between task and target information, Concreteness of information, Familiarity of information needed, Register of the text, and Information from outside the text.]
Results: model with best fit (AIC)

| Variable | Level | Coefficient |
| Intercept | | -0.27*** |
| Proximity | Level 2 | 0.48*** |
| Competing info | Level 2 | 0.46*** |
| Concreteness | Level 2 | 0.34 |

Adjusted R² = 0.05

[Figure: observed vs. fitted values]

Adjusted R² by judge:

| Judge 1 | Judge 2 | Judge 3 | Judge 4 | Judge 5 |
| 0.18 | 0.13 | 0.23 | 0.13 | 0.00 |
Conclusion Can the variables used in Study 1 (PISA 2009) describe reading item difficulty in the L2 context of the PTE Academic? No, but ...
Methodological challenge: to what extent do expert judges agree in language testing?

| Who | What | Conclusion |
| Bejar (1983) | Item difficulty and discrimination prediction; MCQs, American Scholastic Aptitude Test (SAT) | Pooled expert judgements are not sufficiently accurate to replace empirically gathered values |
| Alderson (1993) | 1) Judgments of skills assessed by items, 10 NS judges; 2) Judgments of skills assessed by items, 17 EFL teachers; 3) Standard-setting judgements (procedure similar to the Angoff method) | Judges had difficulty in agreeing as to what skills were being tested; pooled judgements reasonably reflect true performances of the 20,000 candidates |
| Bachman et al. (1996) | Judgment of salient characteristics of 40 items, using two taxonomical frameworks: Test Method Characteristics (TM) and Communicative Language Ability (CLA); 5 judges | Inter-judge average raw agreement proportion: 0.64 for the CLA framework, 0.75 for the TM framework (note: chance values would be 0.43 (CLA) and 0.55 (TM)) |
Methodological challenge: inter-judge reliability

| Variable | AC1 | Benchmark (Altman, 1991) |
| Proximity | 0.46** | Moderate |
| Competing info | 0.27** | Fair |
| Prominence | 0.47** | Moderate |
| Semantic match | 0.08 | Poor |
| Concreteness | 0.41** | Moderate |
| Familiarity | 0.48** | Moderate |
| Register | 0.49** | Moderate |
| Outside info | 0.60** | Moderate |
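For reference, the AC1 statistic (Gwet's chance-corrected agreement coefficient) reported above can be sketched for two raters on a binary scale; the two rating vectors below are invented for illustration, and the study's pooling across five judges is not reproduced here.

```python
import numpy as np

def gwet_ac1(r1, r2):
    """Gwet's AC1 for two raters on a binary (0/1) scale."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    pa = np.mean(r1 == r2)                 # observed agreement
    pi = (np.mean(r1) + np.mean(r2)) / 2   # mean prevalence of category 1
    pe = 2 * pi * (1 - pi)                 # chance agreement under AC1
    return (pa - pe) / (1 - pe)

# Hypothetical binary judgments on 10 items
a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(round(gwet_ac1(a, b), 2))  # → 0.62
```

Unlike kappa, AC1's chance-agreement term stays well behaved when one category dominates, which is one reason it is often preferred for expert-judgment data of this kind.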
Three projects Components of reading test item difficulty 1. 2009 PISA reading items
2. PTE Academic reading items
Components of reading in a first and a foreign language 3. DIALUKI Project
Acknowledgments

Other DIALUKI team members, University of Jyväskylä:
• Academic coordinator: Dr Ari Huhta
• Post-doc research fellow: Dr Riikka Ullakonoja
• Research assistant: Eeva-Leena Haapakangas

Co-funded by the UK Economic and Social Research Council, the Academy of Finland and the University of Jyväskylä.
Project 3: DIALUKI

Informants
• Finnish-speaking learners of English as FL
  – primary school, 4th grade (age 10)
  – lower secondary school, 8th grade (age 14)
  – gymnasium (academically oriented upper secondary school), 2nd-year students (age 17)
• Russian-speaking learners of Finnish as SL – primary school (3-6th grade) – lower secondary school (7-9th grade)
• From 111 schools around Finland
DIALUKI: three major studies

Study 1 (2010/2011)
• A cross-sectional study with 3 x 200 + 250 students.
• Exploring the value of a range of L1 & L2 measures in predicting L2 reading & writing, in order to select the best predictors for further studies.

Study 2 (2011 – 2012/13)
• Longitudinal, 2-3 years.
• The development of literacy skills, and the relationship of this development to the diagnostic measures.

Study 3 (2012/13)
• Several training / experimental studies, each a few weeks in length.
• Morphological awareness, extensive reading, vocabulary learning strategies, phonological awareness, strategies in reading and writing.
Independent predictor variables in L1 and FL

Cognitive measures:
• Backwards digit span in L1 and FL
• Rapid recognition of words in L1 and FL
• Rapid word list reading in L1 and FL
• Rapid automatised naming in L1 and FL
• Non-word reading in L1 and FL
• Non-word spelling in L1
• Non-word repetition in L1 and FL
• Phoneme deletion in L1 and FL
• Common unit in L1 and FL
Stepwise multiple regression: cognitive variables with L1 Finnish reading

| Grade | Adjusted R² | % variance | First variable | Second variable | Third variable |
| 4th Grade | .108 | 11% | Word list L1 (.247) | Digit span L1 (.243) | NW repeat L1 (.219) |
| 8th Grade | .086 | 7% | Digit span L1 (.249) | Rapid words L1 (.203) | |
| Gymnasium | .039 | 4% | Digit span L1 (.210) | | |
Stepwise multiple regression: cognitive variables with EFL reading

| Grade | Adjusted R² | % variance | First variable | Second variable | Third variable | Fourth variable |
| 4th Grade | .264 | 26% | PhonDel in English (.441) | RapidW in Finnish (.366) | Digit span in English (.317) | NonWread in English (.294) |
| 8th Grade | .260 | 26% | RAN in English (-.436) | PhonDel in English (.369) | RapidW in English (.298) | |
| Gymnasium | .237 | 24% | RAN in English (-.399) | RAN in Finnish (-.057) | Common unit in Finnish (.267) | |
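The selection procedure behind these tables can be sketched as forward selection, a simplification of full stepwise regression: predictors enter one at a time as long as adjusted R² keeps improving. The simulated data and the predictor names (modelled on the tables' labels) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def adj_r2(y, X):
    """Adjusted R-squared of an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r2 = 1 - (y - X1 @ beta).var() / y.var()
    n, k = X1.shape
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Hypothetical cognitive predictors and a reading score built from two of them.
n = 200
preds = {name: rng.normal(size=n)
         for name in ["RAN_EN", "PhonDel_EN", "RapidW_EN", "DigitSpan_EN"]}
reading = -0.4 * preds["RAN_EN"] + 0.3 * preds["PhonDel_EN"] + rng.normal(size=n)

# Forward selection: add the predictor with the biggest adjusted-R² gain,
# stop when no candidate improves the current model.
selected, remaining = [], set(preds)
while remaining:
    current = (adj_r2(reading, np.column_stack([preds[v] for v in selected]))
               if selected else -np.inf)
    gains = {v: adj_r2(reading,
                       np.column_stack([preds[u] for u in selected + [v]]))
             for v in remaining}
    best_var = max(gains, key=gains.get)
    if gains[best_var] <= current:
        break
    selected.append(best_var)
    remaining.remove(best_var)

print(selected)
```

True stepwise regression also re-tests already-entered variables for removal at each step; this sketch omits that backward pass.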
SEM analyses
• Mplus, version 5.21
• MLR (maximum likelihood estimation with robust standard errors) was used
• All models presented here were acceptable according to the model fit indices (CFI, TLI, RMSEA, SRMR and chi-square)
(We would like to thank Karen Dunn (Lancaster U.) and Kenneth Eklund (U. of Jyväskylä) for their invaluable advice on the SEM analyses)
Structural Equation Modelling (SEM) Cognitive variables, 4th graders Three latent variables
Structural Equation Modelling (SEM) Cognitive variables, 4th graders Three latent variables (path model)
Structural Equation Modelling (SEM) Cognitive variables, 8th graders Three latent variables
Structural Equation Modelling (SEM) Cognitive variables, Gymnasium Three latent variables
Structural Equation Modelling (SEM) Cognitive variables, Gymnasium Three latent variables (path model)
Summary of the results
• Cognitive variables predict variance in EFL reading better than in Finnish-as-L1 reading
• However, cognitive tasks make only a small contribution to the prediction of variance in reading tasks
• The cognitive measures administered in the foreign language may be as much linguistic as cognitive
• These cognitive measures have more to do with decoding (lower-level skills) than with reading comprehension (higher-level skills)
Summary of the SEM results
• (Almost) the same 3 latent cognitive traits could be identified for two out of three age groups:
  – 1) Fluent (word) reading / lexical retrieval
    • FL RAN and FL list reading in all groups (+ L1 RAN and L1 list reading; for 4th graders also rapidly presented words in FL and L1)
  – 2) Phonological processing / efficiency
    • Weaker loadings of measures than in the other two latent variables
    • Composition changed across age groups: only L1 tasks in 4th graders, a mixture in 8th graders, only FL tasks in gymnasium
  – 3) Working memory
    • Backwards digit span in L1 & FL in all groups
• The latent cognitive traits usually correlated with each other
Summary of the SEM results

Question: what is the proper place and role of working memory in modelling cognitive traits?
– A direct 'effect' on reading, or an effect via other cognitive traits?
– We only measured backward digit span (numbers)
– Additional, non-linguistic measures of working memory would be useful
Thank you for your attention!
J. Charles Alderson, Tineke Brunfaut, Gareth McCray & Lea Nieminen