On weights in dynamic-ageing microsimulation models
Gijs Dekkers1, Federal Planning Bureau, and Centre for Sociological Research CeSO, K.U.Leuven.
[email protected]
Richard Cumpston, Australian National University.
[email protected]
Paper presented at the 3rd General Conference of the International Microsimulation Association, "Microsimulation and Policy Design", June 8th to 10th, 2011, Stockholm, Sweden.

1 Contact information: Federal Planning Bureau, 47-49 Avenue des Arts, 1000 Brussels, Belgium. Email: [email protected], phone: +32/2/5077413, fax: +32/2/5077373. The ideas in this paper were first presented at the Ministero dell'Economia e delle Finanze, Dipartimento del Tesoro, Rome, Italy, on February 14th, 2011. The authors are grateful for the comments made by the participants in this meeting.
Introduction

A dynamic model with cross-sectional ageing builds up complete synthetic life histories for each individual in the dataset, including data on mortality, labour market status, retirement age, savings and so on (Emmerson et al., 2004, 3). Such models do so starting from a survey dataset or an administrative dataset. Many of these datasets, however, include weights, which are needed to obtain unbiased and credible sample estimates of population parameters. This is a problem for dynamic microsimulation models, since the most obvious solution, expanding the dataset, is not always advisable or even possible when the dataset contains unscaled probability or frequency weights.

This short paper hopes to open the discussion on how to use weights in dynamic microsimulation models with dynamic ageing, other than by simply expanding the dataset. It starts by briefly discussing the use of weights. A well-known way of transforming probability weights into frequency weights is discussed briefly. Next, we show how the problem of weights can easily be circumvented by expanding the starting dataset using the frequency weights. However, we will also argue that this approach is in many cases not possible. Next, an alternative way of using weights will be presented. This approach treats the weighting variable as just another variable in the model, and the weights are only used after simulation to derive weighted simulation results. Finally, the consequences of using weights in the case of alignment are discussed. Tests using Australian data confirm that weighted variables give sensible results and efficiency gains. The first author developed the concepts and theoretical analysis, and the second author did the testing.
Why weights?

The need for weights in panel datasets comes from the fact that these datasets are often used to assess trends of aggregated units. So, besides reflecting intertemporal changes on the individual level, panel datasets also need to provide unbiased and credible sample estimates of population parameters (such as means) in each wave. This need for representativeness is, however, hampered by bias caused by (i) immigration, (ii) differential cross-sectional selection probabilities, and (iii) non-response. The first source of bias pertains to the fact that the initial selection of the panel does not take into account households that immigrate in later years. The individuals in these households therefore do not enter the sample, unless the sample is extended later on. The second source of bias emerges because selected households are being followed even if their composition changes, with new individuals appearing while others disappear. For example, suppose that an individual in a household in the starting year of the panel creates a household of his or her own by cohabiting or marrying with somebody who was not in the original sample. Due to assortative mating, the characteristics of this new individual are not randomly distributed, and this may introduce a divergence between the sample and the
population. The third and last source of bias is non-response. As time goes by, some households or individuals within the panel cease to respond and therefore leave the sample. If this attrition is not randomly distributed, then the panel sample will over time diverge from the population. The first source of divergence can be overcome by carefully selecting an additional sample. The second and third sources of divergence however require that the subsequent samples be weighted.
Transforming probability weights into frequency weights

Including probability weights in a dynamic cross-sectional microsimulation model is problematic, because these kinds of models change the size and composition of the households between the starting year and the consecutive simulation years. Suppose, for example, that we have a household consisting of two parents and one daughter in the base year 2002, and that this household has a certain weight. Between 2002 and 2003, the daughter marries and leaves the household to form a household with the new spouse. What, then, should the weight of the household of the newlyweds be? Should it be that of the household of the wife, that of the husband, or a mixture of both?

In the case of frequency weights, the situation is different. Frequency weights are integer numbers that represent the number of population cases that each sample case represents. The population can be derived from the sample by multiplying each observation by its frequency weight. The starting dataset thus 'absorbs' the weights and becomes the population, and the model can be run as would be the case without weights. But many survey datasets do not have frequency weights, but probability weights. A way to work around this is to round these probability weights and use the resulting integers as if they were frequency weights.

In what follows, we use the following notation: #(x=1|p) denotes the number (#) of cases with x=1 in the population (p). x=1 can, for example, represent being a male of 30 years old, or being a civil servant. #(x=1|s) represents the same number in the sample (s). #p and #s denote the size or total number of cases in the population and the sample, respectively.
Using this notation, the probability weight PW can be written as

$$PW = \frac{\#(x=1|p)/\#p}{\#(x=1|s)/\#s}$$

or

$$PW = \frac{\#(x=1|p)}{\#(x=1|s)} \cdot \frac{\#s}{\#p}.$$

The second part of the right-hand side of this equation is the size of the sample relative to that of the population. The first part of the right-hand side is the number of cases in the population divided by the number of cases in the sample. It therefore equals the number of cases in the population represented by one case in the sample. This is the definition of the frequency weight FW. Hence

$$PW = FW \cdot \frac{\#s}{\#p},$$

of which only PW is known. Rewrite this to

$$FW = PW \cdot \left(\frac{\#s}{\#p}\right)^{-1}$$

and use the fact that FW is an integer to come to

$$FW = \text{round}\left[PW \cdot \left(\frac{\#s}{\#p}\right)^{-1}\right]. \qquad (1)$$
A remaining unknown in equation (1) is the inverse of the relative sample size #s/#p.2 However, there is no need to include this if the goal of the model is not to simulate and produce monetary aggregates, but to simulate the adequacy of pensions. Hence, if we limit ourselves to the simulation of indicators that are replication invariant (group means, poverty incidences or intensities, replication-invariant inequality indicators such as the Gini), then we can use any integer without loss of generality. Obviously, the simplest case is when #s/#p equals 1, and equation (1) then becomes

$$FW = \text{round}(PW). \qquad (1')$$

A problem with this approach is that the rounding procedure replaces the weighting variable by a nearby integer, so information is lost in rounding the probability weights. One straightforward way of reducing the information loss from rounding a variable y is to use 0.10·round(10y) instead. The equivalent approach in this context implies multiplying the weights by 10 before rounding. This will cause the frequency-weighted sample size to increase by a factor of 10. To reduce it back to a workable size, we take a 10% sample with replacement. So

$$FW = \text{sample}_{0.10}\left[\text{round}(PW \cdot 10)\right]. \qquad (2)$$

Given (1'), the expected value E(FW) = 0.1 · 10 · FW, which equals FW.
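The conversion in equations (1') and (2) can be sketched in a few lines of code. The sketch below is purely illustrative and makes the same assumptions as the text (scaling factor 10); the function name and inputs are ours, and the 10% sub-sampling step is applied at the household level, as illustrated after the discussion of Figure 1 below.

```python
def probability_to_frequency_weights(prob_weights, factor=10):
    """Turn probability weights into integer frequency weights.

    Following equation (2), each weight is multiplied by `factor` (10 in the
    text) before rounding, to limit the information loss of rounding. The
    resulting factor-10 blow-up of the sample is undone afterwards by drawing
    a 10% sample of the expanded dataset.
    """
    return [round(pw * factor) for pw in prob_weights]

# Hypothetical example: a probability weight of 2.34 becomes a frequency
# weight of 23 before the 10% sub-sampling step.
print(probability_to_frequency_weights([2.34, 1.66, 0.8]))  # [23, 17, 8]
```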
Using weights in dynamic models: circumventing the problem by expanding the starting dataset
2 In its most straightforward form, this is the number of individuals in the sample divided by the number of individuals in the population. However, a more interesting approach would be to use separate ratios for subgroups within the sample, for example, age-gender groups. This would mean that the relation (1) between FW and PW would be specific to that age-gender category.
Now we have a starting dataset that consists of individuals grouped in households with a household-level frequency weight. It is crucial that the weights are the same for all individuals in a household, since household structure should be preserved when weighting: this structure is later needed to simulate equivalent household income, which forms the basis of indicators of poverty and inequality. Furthermore, we have the situation where frequency weights are derived from probability weights by multiplying the latter by 10 and then rounding the result. Next, we expand the starting dataset using these 'enlarged' frequency weights. Figure 1 explains how this works.

[Figure 1: Weighting a sample dataset. Panels: weighted dataset, expanded dataset, unweighted dataset. Household 13 (individuals 1 to 3, fweight = 2) is replicated 20 times in the expanded dataset; a 10% sample without replacement then selects, in this example, replicas 1 and 15.]
We start with a dataset consisting of individuals that are grouped in households, and where each household has a certain (frequency) weight. In Figure 1, household 13 consists of three individuals, 1 to 3, and has a frequency weight equal to 2. Next, the sample is expanded by replicating each household a number of times equal to its frequency weight times 10. Household 13 is therefore replicated 20 times. In the third step, a 10% sample is drawn. The expected number of draws of this household from the expanded dataset is two, but it could obviously also be drawn one or three times. Suppose that replicas 1 and 15 of household 13 are selected. This means that, ultimately, individuals 1, 2 and 3 each appear twice in the weighted dataset, but in different households. In short: by weighting and sampling at the level of the household, the grouping of individuals within households is maintained.
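The expansion and household-level sampling can be sketched as follows. This is a minimal illustration under the assumptions of the text (expansion factor 10, a 10% household-level draw); the data structure and function names are ours, not those of any particular model.

```python
import random

def expand_and_sample(households, factor=10, rate=0.10, seed=None):
    """Expand a weighted household dataset and draw a household-level sample.

    households: list of (household_id, members, frequency_weight) tuples.
    Each household is replicated weight * factor times; each replica is then
    kept with probability `rate`, so the expected number of kept replicas of
    a household equals its frequency weight. Sampling whole households keeps
    the grouping of individuals within households intact.
    """
    rng = random.Random(seed)
    expanded = []
    for hid, members, weight in households:
        for replica in range(1, weight * factor + 1):
            expanded.append((hid, replica, list(members)))

    return [hh for hh in expanded if rng.random() < rate]

# Example from Figure 1: household 13 with three members and fweight = 2 is
# replicated 20 times; on average two of the replicas end up in the sample.
sample = expand_and_sample([(13, ["ind1", "ind2", "ind3"], 2)], seed=1)
print(len(sample), [replica for _, replica, _ in sample])
```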
An alternative strategy: using weights as a simulation variable in the model

A problem with the previously discussed method of applying weights in dynamic microsimulation models is that it expands the dataset (in the case of unscaled weights). This is a problem since dynamic microsimulation models already use a lot of memory and require a lot of time to run, and expanding the starting dataset obviously aggravates this. It becomes even more pressing if large administrative samples are used in combination with weights. In the extreme, expanding the dataset using frequency weights raises the question why a sample has been used in the first place. A second problem is that the 10% random sampling required to reduce rounding errors adds sampling variance to the simulation results.

This section of the paper presents a second way in which sample frequency weights can be included in dynamic microsimulation models. In contrast to the previous approach, the weights are not used to expand the starting dataset before simulation, but are used after simulation to derive weighted simulation results. So whereas the previous approach used the weights to change the starting dataset of the model and then applied the model to this dataset, we now do not change the starting dataset, but treat the household weight variable as any other variable. The model then (re)produces these simulation weights, and these are then used to weigh the simulated dataset in the production of aggregates, indices of poverty and inequality, et cetera.

The weighting variable in most survey datasets is a 'shared weight' that pertains to the household level. This is all the more important in the case of microsimulation models, since weighting needs to preserve the household structure. However, when individuals share an equal weight because they are members of the same household, this has consequences when new individuals or households appear during simulation. To see this, remember that a frequency weight equal to 10 implies that an observed individual represents 10 'actual' individuals with the same characteristics. When a new individual appears through birth, the obvious solution is that the newborn receives the same weight as the other members of the household. So in our example, it is as if 10 households each received a newborn child. An equally obvious situation appears when someone leaves a household to start a one-person household of his or her own. This can be the case when a child 'leaves the nest', or when a couple divorces. In this case, the result is two households with equal weight.

A more complicated situation arises, however, when two individuals from different households with different weights form a third household. What weight should this newly created household get? To see the problem more clearly, suppose two households X and Y. Both households consist of two individuals, denoted X1, X2, Y1 and Y2. Now suppose that individuals X2 and Y2 fall in love and form a new household, say Z. What frequency weight should this household get, i.e. what is the value of fz? Start from the obvious case where the frequencies of both households are equal, say fx=fy=2. In this case, the frequency weight of the new household is also equal to 2. Or, put differently, one individual X2 from two households of type X forms a partnership with one individual Y2 from two households of type Y, hence creating two new households of type Z.
The case where the weight of both donating households is exactly the same is, however, trivial in the sense that it will seldom or never appear. Suppose next that the frequencies of both households are fx=3 and fy=2. The problem arises from the fact that the weights fx and fy are unequal. The solution again lies in interpreting it as if the dataset was expanded: there are 3 households of 'type x' and 2 households of 'type y'. Denote these expanded households X1 to X3 and Y1 and Y2. Hence, individual Y2 of both households Y1 and Y2 forms a partnership with individual X2 from two out of the three households X. The result is that there are two new households Z1 and Z2, both consisting of a pair of individuals X2 and Y2. There are also two 'old' households Y, of which individual Y2 has left, so that only Y1 remains. Likewise, there are two 'old' households X, of which individual X2 has left, so that only X1 remains. Finally, one household of type X remains unchanged, consisting of X1 and X2. In short, the problem of having to merge individuals from households with different weights is solved by splitting the household with the highest frequency weight into two households, one of which has the same weight as the other donating household. The trick therefore is that when the frequency weights of the two 'donating' households differ, the household with the highest frequency (in this case household X with fx=3) is split into two households, of which one has a weight equal to fy. The merge is then done as in the above case where the frequency weights are equal.

More generally, the problem is as follows. Donating households X and Y consist of nx and ny individuals (nx, ny≥1) and have frequency weights fx and fy (fx, fy≥1, fx≠fy). Summarize these households as X[x1..xnx; fx] and Y[y1..yny; fy]. Individuals x1 and y1 from the donating households form a new household Z with an unknown frequency weight fz. The solution is to break up X and Y into subsets with equal weights fz=min(fx, fy), and then create Z with weight fz. The resulting situation is:

1) Household Z[x1, y1; min(fx,fy)] is the newly created household.
2) Household X[x2..xnx; min(fx,fy)] is the donating household X without individual x1.
3) Household Y[y2..yny; min(fx,fy)] is the donating household Y without individual y1.
4) Household X[x1..xnx; fx-min(fx,fy)] is the remaining donating household X with individual x1.
5) Household Y[y1..yny; fy-min(fx,fy)] is the remaining donating household Y with individual y1.

Note that either case 4 or case 5 is empty. Let us clarify this further by returning to the previous example. The situation then is nx=ny=2, fx=3 and fy=2. The solution then is:

1) Household Z[x1, y1; 2] is the newly created household with frequency weight 2.
2) Household X[x2; 2] is the donating household X without individual x1.
3) Household Y[y2; 2] is the donating household Y without individual y1.
4) Household X[x1, x2; 1] is the remaining donating household X with individual x1.
5) Household Y[y1, y2; 0] is the remaining donating household Y with individual y1. This household does not exist in the solution.
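A minimal sketch of this min(fx, fy) splitting rule is given below. It is illustrative only: the dictionary-based household representation and the function name are our own, not the data structures of any particular model.

```python
def merge_households(hh_x, hh_y, mover_x, mover_y):
    """Form a new household Z from individual mover_x of household X and
    individual mover_y of household Y, applying the min(fx, fy) rule.

    hh_x, hh_y: dicts with keys 'members' (list of ids) and 'weight' (int).
    Returns the resulting households; empty or zero-weight ones are dropped.
    """
    fz = min(hh_x["weight"], hh_y["weight"])

    # 1) The newly created household Z receives the minimum weight fz.
    hh_z = {"members": [mover_x, mover_y], "weight": fz}
    # 2) and 3) The donating households without the moving individual, at weight fz.
    x_rest = {"members": [m for m in hh_x["members"] if m != mover_x], "weight": fz}
    y_rest = {"members": [m for m in hh_y["members"] if m != mover_y], "weight": fz}
    # 4) and 5) Residual copies of the original households, keeping the moving
    # individual, at the leftover weight (one of these two is always empty).
    x_left = {"members": list(hh_x["members"]), "weight": hh_x["weight"] - fz}
    y_left = {"members": list(hh_y["members"]), "weight": hh_y["weight"] - fz}

    candidates = [hh_z, x_rest, y_rest, x_left, y_left]
    return [h for h in candidates if h["weight"] > 0 and h["members"]]

# Worked example from the text: nx = ny = 2, fx = 3, fy = 2.
X = {"members": ["x1", "x2"], "weight": 3}
Y = {"members": ["y1", "y2"], "weight": 2}
for household in merge_households(X, Y, "x1", "y1"):
    print(household)
```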
The method presented in this paper involves the partial expansion or "splitting up" of individual weighted households when individuals move between households with different weights. As a result of this continuous process of moves and the accompanying partial expansions, the average size of the weights will gradually decrease as the dataset becomes more and more expanded. In some future simulation year, all weights will have been reduced to one, and the dataset will have been fully expanded.
Alignment and weights: a problem...

The alternative strategy of treating the frequency weight f as any other variable has as its main advantage that it prevents the large losses in efficiency involved in expanding the starting dataset. However, a problem arises in how the microsimulation model applies alignment procedures. Alignment is a general name for a set of procedures by which one imposes that an aggregate simulation result of the model be in line with a desired aggregate result, usually based on predictions of semi-aggregate models or social-policy scenarios. This 'aggregate simulation result' may be the average of an event probability or proportion, in which case we speak of 'state' or 'event alignment' of discrete (state) variables. This type of alignment is discussed in O'Donoghue et al. (2009). In order to select the number of people that experience a certain event or are in a certain state, we use the exogenous alignment probabilities. First, individuals are grouped into the appropriate age and gender groups. Then individuals within each age and gender category are ranked according to the risk that the event will occur. This risk is usually generated by a logistic regression on various explanatory variables, obviously besides gender and age category. The output of this model equation is used only to rank the individuals; the decision of who actually experiences the event is left to the alignment process.

If the 'aggregate simulation result' is defined as an average or sum of a continuous variable (usually earnings or income), then the procedure is known as 'monetary alignment'. Monetary alignment essentially is a two-tier approach. A first run of the model with unconstrained micro-level simulation results of the behavioural variable generates an uncorrected aggregate series of the monetary variable. Next, these unconstrained results are confronted with the exogenous 'target' growth rates, resulting in a series of correction rates. The model is then simulated again, while applying the corrected growth rates to the micro-level simulation results. This is discussed in Dekkers et al. (2010, in O'Donoghue (ed.)).

If the starting dataset is expanded to take into account frequency weights, then both the procedures for event alignment and for monetary alignment will give the correct results. If weights are treated as any other simulation variable, as is the case in our alternative strategy, then applying alignment without taking them into account will result in simulation errors. The simulation results need to equal the exogenous data after the weights have been applied.
In the case of monetary alignment, the situation is again straightforward: instead of taking a simple aggregate of the unconstrained simulation results in the first run, weigh the individual amounts before aggregation. In the case of event/state alignment, the situation is a bit more complicated.

Let us start from a simple example, where the alignment dictates that an exogenous proportion P of the sample of S individuals should experience a certain event, such as entering or being in a state. Furthermore, we assume that the individuals are ranked using a ranking variable ri. We assume that those with the lowest rank are selected before those with a higher rank, but selecting the other way around does not change the results. Note also that this rank can be a random variable, a predicted probability from a logistic regression, a mechanical ordering, or a combination. Finally, suppose again a frequency weight fi. In the unweighted case, individual i enters a certain state if ri ≤ SP, and j is the rank of the last individual affected, i.e. the highest rank of the individuals that experience the event: j = max(ri | ri ≤ SP) = SP. In the unweighted case, the 'closing condition' j = SP is always met, because SP is a discrete number of individuals and the basis of the alignment is the individual. Thus the model exactly reproduces the alignment condition SP.

When frequency weights are used, this is not necessarily so. In the remainder of this section, we hope to show that the proportional error remains limited. The next section will then argue that an iterative solution is possible. A first step is to bring the alignment total to the same scale as the sum of the frequency weights. Hence the weighted alignment target becomes

$$NP = P \cdot \sum_{i=1}^{S} f_i.$$

Next, remember that an individual i with rank ri is evaluated only when all individuals with a lower rank have experienced the event, meaning that they have been put in the pool NP. With individuals ordered according to rank r, this a priori condition for the evaluation of individual i can be written as

$$\sum_{k=1}^{i-1} f_k < NP.$$

Hence, and analogous to the unweighted case, individual i experiences the event pertaining to the alignment only if

$$\sum_{k=1}^{i} f_k \leq NP.$$

It is purely a matter of coincidence if the closing condition $\sum_{k=1}^{j} f_k = NP$ is met, and this means that the model will in most cases not exactly replicate the alignment proportions. The maximum possible proportional error is max(fi-1 : i=1..S)/NP, but the expected proportional error is considerably smaller.
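The weighted selection rule can be summarised in a short sketch. The code below is illustrative only; the names and data structures are ours, and the unweighted case is recovered by setting all weights to 1.

```python
def weighted_event_alignment(individuals, proportion):
    """Select individuals for an event until the weighted alignment target
    NP = P * sum(f_i) would be exceeded.

    individuals: list of (person_id, rank, frequency_weight) tuples; those
    with the lowest rank are selected first, as in the text.
    Returns (selected ids, mismatch D = NP minus the selected weight).
    """
    target = proportion * sum(f for _, _, f in individuals)  # NP
    selected, cumulative = [], 0
    for pid, rank, weight in sorted(individuals, key=lambda t: t[1]):
        if cumulative + weight <= target:
            selected.append(pid)
            cumulative += weight
        else:
            break  # this household would overshoot NP; selection stops here
    return selected, target - cumulative

# Hypothetical pool of four persons with ranks and household weights:
pool = [("a", 0.1, 3), ("b", 0.4, 2), ("c", 0.5, 4), ("d", 0.9, 1)]
print(weighted_event_alignment(pool, 0.6))  # (['a', 'b'], 1.0): a mismatch remains
```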
Four possible solutions to the mismatch problem in alignment

After the alignment process, the weighted number of individuals occupying a certain state (or having experienced a certain event) will in most cases not be exactly equal to NP. The 'mismatch' will be

$$D = NP - \sum_{i=1}^{j} f_i,$$

with j again the last individual that experiences the event. This paper suggests four possible solutions:

1. Split the 'overshooting' household in such a way that there is no mismatch.
2. Search for a household with a frequency weight exactly large enough to leave no, or only a minimal, mismatch.
3. Carry forward any mismatch to the next period.
4. Iteratively reduce the mismatch until a minimal level is reached.
Strategy 1: split up the last household. The application of the alignment process results in the mismatch D defined above: the household weight fj+1 of individual j+1 overshoots NP. The solution to this problem is to split up this household into two separate households, in such a way that one of the new households can experience the event while reducing the mismatch to zero. The newly created households therefore receive weights D and fj+1-D, and individual j+1 from the household with weight D experiences the event.

Strategy 2: select a household for alignment so that there is no mismatch. The second strategy involves reducing the mismatch by selecting one other individual. A first step is the adoption of a weak closing condition, meaning that there is a minimum level of mismatch D* that is considered acceptable. Of course, one possible value is D*=0, which implies no mismatch at all. The first application of the alignment process resulted in the mismatch D. Next, an individual from the remaining set (that did not experience the event) has to be selected. This choice must be made in such a way that the mismatch is reduced to at most D* (and of course there is no overshooting), while respecting the ranks ri of the individuals. Since the ranks did not change, the individual j+1 that was the first not to experience the event (because fj+1 would overshoot NP) still has the lowest rank of those that did not experience the event, and this individual will by definition not meet the above requirement of reducing the mismatch to D*. Next, individual j+2 is considered, then individual j+3, and so forth, until either the process stops because the sample is exhausted (in which case the weak closing condition was too strict), or the weak closing condition is met. The advantage of this approach is that it does not require splitting up a household. The disadvantage is that it is not necessarily the individual with the lowest rank r that experiences the event.

Strategy 3: carry forward any misalignment to the next period. Here the best alternative is chosen between implementing the event for the whole of the last household or for none of it, depending on which alternative results in the smallest mismatch. In case the last household is chosen not to have the event, the resulting shortfall in event numbers is added to the alignment total for the next period. In the alternative case, there is an excess in event numbers, which is then deducted from the alignment total of the next period.

Strategy 4: iteratively reduce the mismatch. In the second strategy, we search for one individual whose weight reduces the mismatch to at most the minimum acceptable level. In this fourth strategy, we try to reduce the mismatch by finding an appropriate combination of individuals.
A first step in this strategy again is the adoption of a weak closing condition, that is, the minimum acceptable level of mismatch D*. The first application of the alignment process again results in the mismatch, now denoted D1.
[Figure 2: Iterative alignment process]
In the second iteration, a weighted number of D1 (the mismatch from the first iteration) individuals has to be selected, again according to their rank ri, to experience the event. However, since the ranks did not change, the individual j+1 that was the first not to experience the event (because fj+1 would overshoot NP) still has the lowest rank of those that did not experience the event, and this individual will again be the first in line to be selected for the event. This again means that NP will be overshot as in the first round, so the second iteration will not reduce D, nor will the third iteration, and so forth ad infinitum. If the ranks ri have a stochastic component, as the logit has a logistic random error component, then one way to get out of this vicious circle may be to recalculate these ranks. This approach is, however, not guaranteed to do the trick: if the deterministic parts of the logits of individuals j+1 and j+2 are very different, then the stochastic component needs to be very large to reverse their rank order and break the vicious circle, and this is unlikely to occur. A failsafe solution is not to recalculate the ranks, but simply to prevent individual j+1 from being first in rank by, in this case, setting its rank equal to ∞. Then the mismatch D is calculated again, and the third iteration starts, unless or until D ≤ D*. In this last case, the iteration stops because the weak closing condition is met.
This approach assumes that the iterations stop at some iteration a, which implies that Da ≤ D*, with Da the mismatch after a iterations. If this condition is never met, the iterations do not solve the problem in that sense. However, contrary to the previous strategy, there is no need in this case to adapt the weak closing condition: the mismatch in the last iteration, on the individual with rank r=S, will be the minimum mismatch achievable (while respecting the order set by the ranks r). Contrary to the second strategy, this fourth strategy will therefore always reach a solution, albeit not necessarily a good one.

What are the advantages and disadvantages of these strategies? The advantage of the first strategy is that the individual with the lowest rank will be selected to experience the event. The downside is that this approach requires further splitting up of weighted households, thereby reducing the marginal efficiency gain of weighting the dataset. The second strategy avoids splitting up households, but has two disadvantages. The first is that it is not necessarily the individual with the lowest rank that will experience the event. The second is that a certain choice of weak closing condition may not result in a solution. The third strategy is probably the simplest, since it does not require changing the alignment procedures. It overcomes the problems associated with the first two strategies: no additional splitting up is required, and the individuals with the lowest rank will be selected for the event. The drawback is that this strategy introduces a misalignment in any period that will cause another misalignment in the period after, since it changes the alignment targets. Thus, compared to the auxiliary alignment targets, proportions will be off in a specific period, but these differences should cancel each other out over several periods. The fourth and final strategy avoids splitting up households and will always reach a solution, probably one with a lower mismatch than that resulting from strategy 2. However, it requires considerable searching, so it will make the model relatively slow.
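The iterative procedure of strategy 4 can be sketched as follows. This is only an illustration under the assumptions above (selection in rank order, the overshooting individual's rank set to ∞ before the next pass); the function and variable names are ours.

```python
import math

def iterative_alignment(individuals, target, d_star=0):
    """Strategy 4 (sketch): repeat the weighted selection; after each pass the
    first individual whose household overshot the target has its rank set to
    infinity, so the next pass can consider the following candidates, until
    the mismatch D is at most the weak closing condition D*.

    individuals: list of (person_id, rank, frequency_weight) tuples.
    Returns (selected ids, final mismatch D).
    """
    ranks = {pid: rank for pid, rank, _ in individuals}
    while True:
        selected, cumulative, overshooter = set(), 0, None
        for pid, _, weight in sorted(individuals, key=lambda t: ranks[t[0]]):
            if ranks[pid] == math.inf:
                continue  # excluded in an earlier iteration
            if cumulative + weight <= target:
                selected.add(pid)
                cumulative += weight
            else:
                overshooter = pid  # first household that would overshoot NP
                break
        mismatch = target - cumulative
        if mismatch <= d_star or overshooter is None:
            return selected, mismatch
        ranks[overshooter] = math.inf  # take it out of the running and retry

# Hypothetical pool: with D* = 0, person "c" is skipped and "d" fills the gap.
pool = [("a", 0.1, 3), ("b", 0.4, 2), ("c", 0.5, 4), ("d", 0.9, 1)]
print(iterative_alignment(pool, target=6, d_star=0))
```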
Australian data used for testing

Unweighted unit records were available from the 2001 Housing Sample File (HSF), Australian Bureau of Statistics (2003c), together with a household microsimulation model using these unit records as baseline data (Cumpston, 2009). Weighted unit records from the Australian 2000-01 Survey of Income and Housing Costs (SIHC), Australian Bureau of Statistics (2003a), were converted into suitable input files for this model. These files covered 16,824 persons, grouped into 6,786 households. Dwelling weights in the SIHC sample were intended to replicate the Australian population of about 19.4m. To give an unweighted sample size of about 175,000, the weights were multiplied by 0.00937, giving dwelling weights ranging from 1.66 to about 34. These weights were then rounded to the nearest integer. The unweighted sample size was chosen so as to be similar to the size of the unweighted sample file currently used by the microsimulation model. For these tests, the sample data were split into 8 regions (the six states, the Northern Territory and the Australian Capital Territory), and migration was modelled between the regions.
Model changes to use weighted data

The model was extended to input, store and update a weight for each occupied and vacant dwelling. These code changes needed care, because the microsimulation model simulates dwellings separately from households. When a household moves out of a dwelling, the dwelling is assumed to continue in existence as a vacant dwelling, with the same location, structure, market value and potential rent. The principles described above have been applied as follows:

- the destination for a moving household is simulated, taking into account the destination patterns of moving households;
- three vacant dwellings are randomly selected in the destination area, and the one most closely meeting the housing patterns of the moving household is chosen;
- if the weight of the moving household is greater than the weight of the selected dwelling, that dwelling is assumed to become fully occupied, and the weight of the source household is reduced accordingly, with destination and dwelling searching by the source household continuing;
- if the weights are equal, the source household occupies the selected dwelling, with the source dwelling becoming vacant and keeping its original weight;
- if the weight of the moving household is less than the weight of the selected dwelling, the selected dwelling is assumed to become occupied with weight equal to that of the source dwelling, and an identical vacant dwelling is assumed in the destination area, with weight equal to the difference between the weights of the selected and source dwellings.

Similar principles were used for exits (where an exit is any move of one or more persons out of a dwelling, while at least one person remains in the dwelling), and for immigrants. The microsimulation model contained assumptions about the numbers of immigrants each year, and their destination, age and type distributions. The same immigration assumptions were made, with each immigrant household given a weight of 10.

Alignment methods used during testing

Alignment totals for births, deaths, emigrants and immigrants were available for the 8 regions, subdivided into 9 age-groups. As suggested by Cumpston (2009), random sampling was used to give one-pass alignment of these events for each combination of region and age-group. Alignment was thus done separately for each of 72 alignment pools, for each of four event types. To allow random sampling within each alignment pool, lists of person reference numbers were maintained for each pool, updated each time a person changed regions, and recreated each year to allow for age changes. For example, the alignment total for deaths amongst persons aged 75 to 84 in Victoria in the third projection year was 105. Persons were selected at random from those in this alignment pool, and if their probability of death was greater than a random number
between 0 and 1, assumed to die. The number of deaths for the pool was increased by the weight of the household, unless this would have given total deaths exceeding 105. In the first alignment method used for testing, a death resulting in an excess over the alignment total was dealt with by splitting the household, so that just enough persons died to exactly reach the total, and the remainder were assumed to remain alive. For example, if 100 persons had already been simulated to die, and the next person simulated to die had a household weight of 15, then 5 persons would be assumed to die, and the remaining 10 would remain alive in a new household. Death simulation for the pool ceased when the alignment total was reached. This alignment process created some additional household splitting, reducing the performance gains potentially available from the use of weights. A modified alignment method was thus tested, choosing the best alternative between implementing the event for the whole of the last household or for none of it, and carrying forward any misalignments to the next period. In the above example, assuming all the persons in the last household to die would have exceeded the alignment total by 10, while assuming none to die would have fallen short by 5. No persons would thus have been assumed to die in the last household, and the alignment total for the next period would have been increased by 5. This modified alignment strategy was used in the cases with alignment in Tables 1 and 2.
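A sketch of the first alignment method, as in the death example above (strategy 1, splitting the overshooting household), is given below. It is illustrative only: the data structure and names are ours, and the alignment total is assumed to be an integer number of persons.

```python
def align_deaths_with_split(households, alignment_total):
    """Select weighted households for death, in the given order, splitting the
    household that would overshoot the alignment total so that the weighted
    number of deaths exactly equals the target (strategy 1).

    households: list of (household_id, weight) pairs, already ordered (e.g.
    randomly drawn and passing the mortality test). Returns a list of
    (household_id, weight, dies) records; a split produces two records.
    """
    out, deaths = [], 0
    for hid, weight in households:
        room = alignment_total - deaths
        if room <= 0:
            out.append((hid, weight, False))
        elif weight <= room:
            out.append((hid, weight, True))
            deaths += weight
        else:
            # Split: 'room' persons die, the rest stay alive in a new household.
            out.append((hid, room, True))
            out.append((hid, weight - room, False))
            deaths = alignment_total
    return out

# Example from the text: 100 deaths already simulated, target 105; the next
# household has weight 15, so 5 persons die and 10 remain alive.
print(align_deaths_with_split([("hh_a", 60), ("hh_b", 40), ("hh_c", 15)], 105))
```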
Tests using Australian data

Using the aforementioned HSF and SIHC datasets as starting points, the Cumpston (2009) model was run in its original form (i.e. without weights) and in its weighted form. Table 1 presents the results of these runs.

Table 1: Results of 50-year projections using Australian data

Data source | Weighted data | Event alignment | Run time (seconds) | Persons 2001, weighted | Persons 2001, unweighted | Persons 2051, weighted | Persons 2051, unweighted
HSF  | No  | No  | 43.0 | 175044 | 175044 | 225694 | 225694
HSF  | No  | Yes | 55.4 | 175044 | 175044 | 237317 | 237317
SIHC | Yes | No  | 30.4 | 16824  | 175108 | 219948 | 222422
SIHC | Yes | Yes | 42.0 | 16824  | 175108 | 234678 | 237530
SIHC | No  | No  | 41.6 | 175108 | 175108 | 229292 | 229292
SIHC | No  | Yes | 54.5 | 175108 | 175108 | 237381 | 237381
The 50-year run times in Table 1 allow for births, deaths, immigrants, emigrants, household changes and moves, but not for data input, output listings, education, employment or wealth accumulation. The last column of the table shows that all six tests gave similar estimates of the population after 50 years. This eighth column shows the unweighted number of persons in the final simulation year 2051; in the case of the weighted models, these numbers have been produced by expanding the weighted number of individuals with the frequency weights. The results in Table 1 show that microsimulations using weighted data, or unweighted data derived from weighted data, can give reasonable total population projections. The weighted dataset clearly is almost fully expanded in the final simulation year (compare 222422 with 219948 persons, and 237530 with 234678 persons). The marginal efficiency gain (the reduction in simulation time achieved in the last year of simulation) is therefore close to zero, as expected, and there is no more efficiency gain from weighting in this last simulation year. However, over the whole simulation period, the total gain in efficiency is considerable. This can be seen from the fourth column, showing the run time in seconds for the 2001-2051 projection. The run time is reduced by 11.2 and 12.5 seconds (model without and with alignment, respectively). Relative to the run times of the model without weights, this is a reduction in run times of 27% and 23%, respectively. Table 2 below shows the event numbers in the 50-year projections.

Table 2: Event numbers in 50-year projections using Australian data

Data source | Weighted data | Event alignment | Births | Deaths | Emigrants | Exits | Immigrants | Moves
HSF  | No  | No  | 99595  | 92312 | 121493 | 260313 | 164860 | 1410005
HSF  | No  | Yes | 107470 | 90757 | 119300 | 296115 | 164860 | 1477720
SIHC | Yes | No  | 97139  | 96031 | 118803 | 254941 | 164920 | 1358814
SIHC | Yes | Yes | 107470 | 90757 | 119300 | 283146 | 164920 | 1388343
SIHC | No  | No  | 105565 | 91531 | 124710 | 265585 | 164860 | 1471250
SIHC | No  | Yes | 107470 | 90757 | 119300 | 296161 | 164860 | 1495985
National population projections (Australian Bureau of Statistics, 2003b) provided yearly projections of births, deaths, emigrants and immigrants for each region, and subdivisions of these event numbers were obtained for 9 age-groups (0-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85+). These subdivided numbers were used as alignment totals for the tests with event alignment. No alignment totals were available for household exits or household moves. Alignment was done using random selection, as suggested by Cumpston (2009): individuals are randomly selected for testing for an event such as death, with selection and testing continuing until the desired alignment total is reached. Exact alignment was obtained each year for the two tests with unweighted data. The immigrant totals with and without alignment are identical, as immigrant families were pre-generated at the start of each projection year. For the weighted data, the numbers of families in the first few years were too low to properly reflect the diversity of immigrants, so immigrant families for each projection decade were pre-generated at the start of the decade, and randomly drawn without replacement. For the alignment test using unweighted data, exact alignment was obtained each year for births, deaths and emigrants, but alignment of immigrants occurred over each decade. Strategy 3 was used for alignment using weighted data, meaning that mismatches were carried forward to the next period.
Convergence from weighted to unweighted projections

As argued above, the method presented in this paper involves the partial expansion or "splitting up" of individual weighted households when individuals move between households with different weights. As a result of this continuous process of moves and the accompanying partial expansions, the average size of the weights gradually decreases as the dataset becomes more and more expanded, until, in some future simulation year, all weights have been reduced to one and the dataset has been fully expanded. The total efficiency gain of using weights in the way suggested by this paper therefore depends on the average initial size of the weights, and on the speed of this convergence process. This is shown in Figure 3.

Figure 3: Weighted and unweighted person projections
[Figure 3 plots projected persons (0 to 250,000) against projection years 2001 to 2051, with three series: weighted persons, unweighted persons, and weighted persons (no moves).]
Figure 3 shows that the number of weighted persons recorded, in the unrestricted projections using the weighted SIHC data, rose quickly, and after 50 years was about 99% of the unweighted number. Clearly, the efficiency gain that comes with the approach proposed in this paper is achieved in the first part of the simulation period. The more splitting up is required, the faster the marginal efficiency gain is reduced. But even though the marginal efficiency gain decreases, the total efficiency gain is quite considerable. The bottom line in Figure 3 was obtained by not simulating moves of whole households, and shows greater efficiency gains.
Conclusions

This short paper discusses the use of weights in dynamic microsimulation models with dynamic ageing, other than by simply expanding the dataset. An alternative approach is presented that treats the frequency weights as just another variable in the model. These weights are then used after simulation to derive weighted simulation results. The main advantage of this alternative strategy is that it prevents the losses in efficiency involved in expanding the starting dataset. Testing using Australian data showed that microsimulations using weighted data, or unweighted data derived from weighted data, can give reasonable total population projections. Depending on the nature of the events simulated, efficiency gains may be quite considerable, though the marginal efficiency gain decreases as the simulation period progresses. The efficiency gain is therefore limited to the first few decades.
Finally, the consequences of using weights in the case of alignment are discussed. Where a model without weights exactly reproduces the alignment condition, this may not be the case when frequency weights are used. The paper finally proposes an iterative procedure in which the 'overshooting' individual in each iteration is taken out of the pool before starting the next iteration.
References

Australian Bureau of Statistics (2003a), "Survey of income and housing costs Australia 2000-01 – confidentialised unit record file (CURF) technical paper", catalogue no. 3222.0, Canberra, August, 63 pages.

Australian Bureau of Statistics (2003b), "Population projections Australia 2002-2101", catalogue no. 3222.0, Canberra, September 3, 186 pages.

Australian Bureau of Statistics (2003c), "Census of population and housing – housing sample file Australia 2001", technical paper, catalogue no. 3027.0, Canberra, September 29, v + 57 pages.

Cumpston, J.R. (2009), "Acceleration, alignment and matching in multi-purpose household microsimulations", paper presented to the second general conference of the International Microsimulation Association, Ottawa, June 8-10, 22 pages, available from http://www.microsimulation.org/IMA/Ottawa_2009.htm

Dekkers, Gijs, Hermann Buslei, Maria Cozzolino, Raphael Desmet, Johannes Geyer, Dirk Hofmann, Michele Raitano, Viktor Steiner, Paola Tanda, Simone Tedeschi and Frédéric Verschueren (2010), "What are the consequences of the European AWG-projections on the adequacy of pensions? An application of the dynamic microsimulation model MIDAS for Belgium, Germany and Italy", in O'Donoghue, C. (ed.), Life Cycle Microsimulation Modelling, Lambert Academic Press.

Emmerson, C., H. Reed and A. Shephard (2004), An Assessment of PENSIM2, IFS Working Paper WP04/21, London: Institute for Fiscal Studies.

O'Donoghue, Cathal, John Lennon and Stephen Hynes (2009), "The Life-Cycle Income Analysis Model (LIAM): A Study of a Flexible Dynamic Microsimulation Modelling Computing Framework", International Journal of Microsimulation, vol. 2(1), 16-31.