Introduction to Stata using the UK Labour Force Survey

ESDS Government

Author: Anthony Rafferty
Version: 7.2
Date: September 2008

G10. The series of ESDS Guides are available online at www.esds.ac.uk

Contents

1.0 Introduction
2.0 Getting Started: The Basic Features of Stata
3.0 Exploring your data
4.0 Generating variables and changing their values
5.0 Graphics in Stata
6.0 Statistical modelling using Stata: A brief introduction
7.0 Do-files: Using and saving commands
Appendix A Resources for Learners
Appendix B Entering and transferring data into Stata
Appendix C Reserved names and Stata operators

Stata ® and the Stata logo ® are registered trademarks of StataCorp LP. This guide has not been sponsored or approved by Stata. The author is solely responsible for any mistakes.


1.0 Introduction

This short guide provides an introduction to Stata 9 using the ESDS Labour Force Survey (LFS) Teaching Dataset (2002). Its central aim is to provide a learning resource for those who have little or no experience of using Stata, through practical examples which you can try. The dataset accompanying this guide was produced by ESDS Government, and can be downloaded together with supporting documentation from the following website: http://www.esds.ac.uk/government/lfs/resources/#teaching.

The LFS teaching dataset (2002) is a subset of data drawn from the UK Labour Force Survey, containing data from all four quarters of the 2002/3 LFS, for respondents aged 16-65 and resident in the UK (n=63,559). For ease of use within a teaching context, the dataset is restricted to a subset of 58 key (mainly individual level) variables (see footnote 1).

In order to access this data you will need to register. For UK academic users, or previous users of the UK Data Archive, this can be done by registering as an ESDS and UK Data Archive user with your ATHENS username and password. Simply follow the instructions which appear when you attempt to download the data.

Using this Guide

The guide is separated into seven sections. Following this introduction, section two explores the visual operating environment of Stata and demonstrates basic commands for opening, examining, and saving datasets. Section three looks at ways of producing frequency tables, cross-tabulations and summary statistics. Some specific functions for the coding of missing and inapplicable values are also considered. In section four, ways to create, manipulate, and recode variables are outlined, whereas section five goes on to explore the graphical capabilities of more recent versions of Stata beyond version 8. Section six gives a basic introduction to estimation and post-estimation commands used for statistical modelling, illustrating examples of multiple linear and logistic regression. Finally, section seven considers how to record and edit syntax commands using do-files.

Throughout the guide, illustrations are given for Stata SE version 9. This version of Stata has slightly greater capabilities than some of its predecessors. Those with older versions of Stata will find most of the illustrations (bar section 5 on using the menu system to produce graphics) functional. Information about the different versions of Stata can be found on page 1 of the Getting Started with Stata for Windows Guide (Release 9) (StataCorp, 2005).

Footnote 1: There are two household variables that take the same value for all members of the household: ‘ten96’ and ‘house’.


Why use the Labour Force Survey?

As well as giving a basic introduction to Stata, a further intention of this guide is to promote the usage of the full LFS dataset for secondary analysis. A wider exploration of the documentation supporting the LFS available from the ESDS Government website is consequently encouraged to supplement the present text (see http://www.esds.ac.uk/government/lfs/).

The LFS is carried out by the Office for National Statistics (ONS). Other than the Population Census, it represents the only comprehensive source of information about all aspects of the labour market. The survey has a high research potential for secondary analysis due to its large sample size and detailed questions.

Since 1992 (following major methodological changes and the introduction of the quarterly LFS), a simple, stratified random, unclustered sample design has been used to select a sample of addresses. Each quarterly LFS sample of 57,000 responding UK households is made up of five waves, each of approximately 11,000 private households. Each wave is interviewed in five successive quarters, so that in any one quarter, one wave will be receiving their first interview, one wave their second, and so forth, with one wave receiving their fifth and final interview. Four of the five waves therefore carry over from one quarter to the next, giving an 80 per cent overlap in the samples for each successive quarter. All adults within responding households are interviewed face to face at their first inclusion in the survey and by telephone (if possible) at quarterly intervals thereafter. Each household has their fifth and last quarterly interview on the anniversary of the first.

Unlike most other large-scale government surveys, the LFS includes people living in NHS accommodation. Information is also available for young people aged 16 to 24 years, as the LFS sample includes people living away from their parental home in a student hall of residence or similar institution during term time.
Due to the large sample size and stratified unclustered random sample, the LFS has small sampling errors for main population sub-groups. The sample design also allows representative results to be published for any thirteen-week period. In terms of its limitations, the LFS has a high proportion of proxy interviews (c. 30%) in comparison to other surveys such as the General Household Survey (c. 5%). Also, as with most UK government surveys, response rates have dropped in recent years. In 1999/2000, the response rate for the LFS was 63%. Despite this, the LFS remains a primary resource for those wishing to undertake secondary analysis on the UK labour market and employment related issues.

Some other notable features of the LFS data are that:

• Longitudinal datasets are available which link the quarters, e.g. June 2001 to August 2002.
• Separate datasets exist for the analyses of households. These are available for every quarter and may be used for household level analyses, or for individual analyses which draw on household and family characteristics.
• Special licence data is available which contains additional detail and geography.
• Aggregated Local Authority level datasets are available on the standard ‘end user’ licence.
• The survey can be used for the analyses of ethnic minorities and other small samples. In order to obtain adequate sample sizes, it may be necessary to combine a number of years of data together.
• A related dataset, the Annual Population Survey (APS), combines results from five different sources: the LFS (waves 1 and 5); the English Local Labour Force Survey (LLFS); the Welsh Labour Force Survey (WLFS); the Scottish Labour Force Survey (SLFS); and the Annual Population Boost Sample (APS (B)). The APS (B) ceased to exist at the end of December 2005, therefore APS data from January 2006 onwards will contain all of the above data apart from APS (B) data.

The Publication database www.esds.ac.uk/government/citations on the ESDS Government website provides a useful way to search for publications resulting from secondary analyses of the LFS, or other ESDS Government supported datasets. In addition to ESDS Government resources, information about the LFS can be obtained by searching the UK government Office for National Statistics website www.statistics.gov.uk. Any further queries can be directed towards the ESDS Government Helpdesk:

E-mail: [email protected]
Tel: +44 (0)161 275 1980
Fax: 0161 275 4722
Postal address: ESDS Government, CCSR, School of Social Sciences, University of Manchester, Crawford House, Manchester M13 9PL


2.0 Getting Started: The Basic Features of Stata

This section provides a brief introduction to the visual operating environment of Stata. Some of the basic commands for opening and exploring the contents of datasets are considered. This part of the guide will be mainly relevant to those who have no experience of using the software.

2.1 The Stata Environment

Opening Stata

You can open Stata in the same way as you would most other software packages by clicking on its icon or menu item as shown above. When you open Stata, you should see the following screen (although the layout of the windows might vary somewhat and some windows may be minimised or shaped differently):-

The large black window is the results window. The results window will contain your output. This includes:

• The commands that you run
• The results you obtain
• Error messages
• Active links to Stata web pages, the help system and further output


The Review window is designed to contain a list of past commands. The Variables window contains the list of variables in your data file. The Stata Command window is where you can type commands. At the start of a new Stata session, all three windows are empty. The review and variable windows will also be minimised. To view each window click on the tab. The pushpin icon allows you to toggle between different preset sizes. Try opening the variables window, and resizing it using the pushpin. The relative sizes of the windows can be altered by clicking on the meeting point of the windows in the same way that you can change the size of cells in a table. At the top of the screen is an icon bar menu. Some of the menu items will be familiar whereas others are more specific to Stata. You can see a description of what each item does by running your mouse pointer over it.

In Stata version 10, the icons have been modified and will appear as:

2.2 Setting memory size

Before loading your dataset, you will probably need to set the amount of memory allocated to Stata by your computer. Stata achieves a higher processing speed by holding data within memory whilst performing calculations (as opposed to accessing it from hard disk). This means that the size of the dataset you can load into Stata is limited by the amount of memory allocated. An error message will appear if you attempt to load datasets larger than your allocated memory. The default memory allocated is roughly one megabyte. For most datasets, it will probably be necessary to increase this allocation. A memory size of 16 Mb or slightly higher will be enough for most purposes. For example, the dataset we will be using is around 2.5MB. We will set the memory to 20 Mb by typing a simple instruction into the command window at the bottom of the screen. Type the following in the Command window and hit the return key:

set mem 20m

• Notice that your command has appeared in the results window.
• If you click the review window tab to make this visible (if it is not already) you will see that the command has also appeared there.
• Stata’s response to your command is also given in the results window (this will give some information about your settings and will include a message saying that you have set memory to 20m).

If you want to keep the memory set to 20 megabytes permanently (until you instruct Stata otherwise) then type:

set memory 20m, permanently

This means that each time you open Stata, a memory allocation of 20Mb will already be assigned. Notice that a comma separates the option from the command. This is a general syntax feature for specifying options in Stata commands.
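The command-then-comma-then-option pattern can be tried on the memory setting itself. The lines below are a minimal sketch that assumes Stata 9 to 11. In later releases (Stata 12 onwards) memory is managed automatically and set memory is accepted but has no effect, so the step is only needed for the older versions covered by this guide.

```stata
* Sketch for Stata 9-11; later releases manage memory automatically
clear                          // older releases refuse set memory while data are loaded
set memory 20m                 // the command on its own, with no options
set memory 20m, permanently    // the same command; the option follows the comma
query memory                   // display the current memory settings
```

The same layout (command, comma, options) recurs in almost every command used in the rest of the guide.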

2.3 Opening a file

As with many of Stata’s functions, when opening a data file, you can either use the menu system or enter instructions through the command box. In section seven, we will also consider how commands can be entered from text files known as do-files. For users of SPSS, these files are Stata’s equivalent of syntax files.

Using the Open (Use) menu button:

To open a file using the menu system:

• Click on the Open (Use) button. You should obtain a dialogue box.
• You will find yourself in the default directory. Browse to the folder to which you extracted the LFS teaching dataset. The graphic shows an example in which the data has been saved as ‘lfs2002.dta’ within a folder called stata8.
• Click on the filename and then click the ‘Open’ button to load the file.

You will find that:

• The command has been echoed in the review window as before.
• There should be no error message in the results window.
• The variables window now contains a list of variables. Click on the variables tab to see the variables window.

Click the pushpin to view the labels for the variables as well as the names. Click the pushpin a second time to hide the window.

Opening files using written commands:

The command for opening a data file takes the following form:

use <filename>, clear

Throughout this guide, words in italics in brackets such as <filename> indicate where a specific name of a file or variable needs to be added. Underlined portions of command words indicate abbreviations, which can be used instead of typing out full commands. The clear option at the end of the instruction means that any data that you currently have stored in memory will be erased. Stata saves and opens files from a default directory on your hard disk (c:\data\). You can however organise your data as you wish, determining the directory that you use for loading and saving data. Suppose the file we wish to open is LFS2002.dta and is located in the directory C:\Data_Stata\Course4_Stata_for_LFS\. We could use the following command to open this file:

use "C:\Data_Stata\Course4_Stata_for_LFS\LFS2002.dta", clear


Changing the default directory

You can also alter the directory from which Stata loads and saves data by using the cd ‘change directory’ command. For example, if you were to type in the following:

cd C:\Data_Stata\Course4_Stata_for_LFS

All load and save operations would now operate to and from this specified location. This means you can simply enter the use command without specifying the file directory each time.
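Put together, changing directory and then opening the file might look like the sketch below. The folder and file name are the example ones used in this guide, so they are assumptions about where you saved the data; substitute your own location.

```stata
* Example folder from this guide; replace with the folder holding your copy of the data
cd "C:\Data_Stata\Course4_Stata_for_LFS"
pwd                     // print the current working directory to confirm the change
use LFS2002, clear      // no path needed now; .dta is assumed when no extension is given
describe, short         // brief summary: number of observations and variables
```

Quoting the path is optional here, but it becomes necessary as soon as a folder name contains spaces.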

2.4 Opening the Data Browser

Unlike some other statistical packages, the data in Stata is not immediately visible upon opening a data file. To view the raw data, it is necessary to open the data browser window. This window allows you to view the data, but not to change its values (see Appendix B for information on entering data into Stata). Click on the Data Browser button on the icon bar to open the data browser.

Note that:

• The data for each individual is on a separate row. We call the individual the case because we will be analysing the data at the individual level (although it is possible to structure data differently).
• Each row is numbered.
• Each column relates to a particular variable.
• Each column is headed with the name of the variable.
• By double clicking the name of the variable at the top of the column, you can read a variable label, which will help you to understand what the variable is.
• Each cell contains text describing the value for a particular variable and individual case. The text associated with each value was defined by the data creator, using a value label.
• By clicking on a particular cell, the value for that individual for that particular variable is given at the top of the screen. If we click on the first cell (the cell for case 1 for ten96), the value will appear in the space above the data. ten96[1]=2 means that the value of ten96 for case 1 is 2.
• Each value of a categorical variable is associated with a specific label indicating what this value represents. The cell contains the text ‘being bou.’ This is truncated. The text ‘being bought with a mortgage or loan’ is the full value label associated with the value 2 for the ten96 variable. This label is therefore associated with all cases that have value 2 for ten96.
• By double clicking on the word ten96 at the top of the ten96 column we obtain the following window:

This gives:

• The variable name.
• The variable label, in this case ‘accommodation details’, which tells us more information about what the variable is.
• Some information about the format in which the data is displayed (in this case, %9.0g is a general numeric format up to 9 characters wide, in which Stata decides how many decimal places to show).
• The name of the value label.

All of this information is greyed out to prevent you from accidentally changing any details. N.B. You will need to close the browser window and any associated dialogue boxes in order to run further commands.


2.5 The describe command

Another way to look at what variables are in a dataset is to use the describe command.

• Close the data browser window. In the command window, type:

describe

Typing describe alone without any variable names will give you information on all of the variables contained in the dataset:

Contains data from C:\Data_Stata\Course4_Stata_for_LFS\LFS2002.dta
  obs:        63,559
 vars:            58
 size:     5,021,161 (75.8% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
ten96           byte   %8.0g       ten96      accommodation details
house           byte   %8.0g       house      accommodation details (grouped)
sex             byte   %8.0g       sex        sex
age             byte   %8.0g       age        age last birthday
ages            byte   %8.0g       ages       age groups in 5 yearly intervals
nation          byte   %8.0g       nation     nationality
cry01           byte   %8.0g       cry01      country of birth
region          byte   %8.0g       region     region of usual residence
numchild        byte   %8.0g                  number of children in the household aged 0-4
numchil1        byte   %8.0g                  number of children in the household aged 5-16
ayfl19          byte   %8.0g       ayfl19     age of youngest dependent child in family aged

-> tabulation of sex by married

           |       whether
           |  married/cohabiting
       sex |        no        yes |     Total
-----------+----------------------+----------
      male |     13052      17344 |     30396
           |     42.94      57.06 |    100.00
-----------+----------------------+----------
    female |     14052      19111 |     33163
           |     42.37      57.63 |    100.00
-----------+----------------------+----------
     Total |     27104      36455 |     63559
           |     42.64      57.36 |    100.00


-> tabulation of sex by fb

           | whether born outside
           |          uk
       sex |        no        yes |     Total
-----------+----------------------+----------
      male |     27816       2580 |     30396
           |     91.51       8.49 |    100.00
-----------+----------------------+----------
    female |     30138       3025 |     33163
           |     90.88       9.12 |    100.00
-----------+----------------------+----------
     Total |     57954       5605 |     63559
           |     91.18       8.82 |    100.00

-> tabulation of married by fb

   whether | whether born outside
married/co |          uk
  habiting |        no        yes |     Total
-----------+----------------------+----------
        no |     25126       1978 |     27104
           |     92.70       7.30 |    100.00
-----------+----------------------+----------
       yes |     32828       3627 |     36455
           |     90.05       9.95 |    100.00
-----------+----------------------+----------
     Total |     57954       5605 |     63559
           |     91.18       8.82 |    100.00

3.4 Creating Summary Statistics

The summarize command can be used to create summary statistics for continuous variables. In the following examples, we will consider the variable for gross hourly pay (hourpay). The variable hourpay will again have many inapplicable cases. If we run our summary statistics before assigning Stata recognised missing values, we will include “-8” and “-9” values in our analysis, distorting the results. If we look at the codebook for hourpay, using the if qualifier to see the values for those who are not in paid employment, we can see that this question only applies to those in paid employment (where stat==1).

codebook hourpay if stat ~=1

hourpay                                                gross hourly pay (£)
---------------------------------------------------------------------------
          type:  numeric (double)
         label:  hourpay
         range:  [-9,-9]                      units:  1
 unique values:  1                        missing .:  0/22942
    tabulation:  Freq.   Numeric  Label
                 22942        -9  does not apply

Like most Stata commands, summarize can be used with if to select a subset of the data. Note that the symbol “~=” means “not equal to”. Considering that the hourpay variable is routed by stat, another way to handle inapplicable cases is to select those who are in paid employment when creating our summary statistics, i.e. where stat==1. However, if we check the codebook for hourpay for respondents where stat==1, we can see that even amongst those in paid employment, there are still some cases coded as not applicable (-9).

codebook hourpay if stat==1

hourpay                                                gross hourly pay (£)
---------------------------------------------------------------------------
          type:  numeric (double)
         label:  hourpay, but 4333 nonmissing values are not labeled
         range:  [-9,204.8]                   units:  .01
 unique values:  4334                     missing .:  0/40617
      examples:  -9         does not apply
                 5.25
                 7.6399999
                 11.63

summarize hourpay if stat==1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     hourpay |     31410    9.700548     6.99723        .05      204.8

To omit not answered and inapplicable cases, we could alternatively select cases with a value greater than zero:

su hourpay if stat==1 & hourpay >0 & hourpay ~=.

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     hourpay |     31410    9.700548     6.99723        .05      204.8

The “~=” operator is one of many which can be used in Stata to make conditional statements in commands. The table below gives some other commonly used operators.

Arithmetic                  Logical        Relational (numeric and string)
+   addition                ~   not        >    greater than
-   subtraction             !   not        <    less than
*   multiplication          |   or         >=   > or equal
/   division                &   and        <=   < or equal
^   power                                  ==   equal
                                           ~=   not equal
+   string concatenation                   !=   not equal

Note that a double equal sign (==) is used for equality testing.


Note that in the above example using the summarize command, the hourpay ~= . condition is superfluous. This is because: a) we know from the codebook that there are no Stata coded missing values for hourpay and, b) the summarize command ignores Stata missing values. However, it is good practice to account for missing values in commands that use the greater than (>), or greater than or equal to (>=) operators. This is because Stata stores missing values as a value greater than all other values of a variable. This means that the greater than, or greater than or equal to, operators will include missing values unless we tell Stata to exclude them by adding either a < . or a ~= . condition. Further details on this issue can be found in section 4.2.

Examining Continuous Variables: Note that the minimum indicates that the smallest value is 0.05 (5 pence per hour!). This is probably due to coding error in the dataset. In a fuller analysis, you may have to make decisions on how to ‘clean’ errors, outliers, and improbable values on continuous variables such as income.
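The point about missing values and the > operator is easy to check on a few made-up numbers. The sketch below does not use the LFS data: the variable pay and its five values are invented purely for illustration, with -9 standing in for an inapplicable code, . for the ordinary Stata missing value and .a for one of the extended missing values.

```stata
* Illustrative values only; not taken from the LFS teaching dataset
clear
input pay
5.25
-9
.
.a
11.63
end

count if pay > 0               // counts 4: both missing values pass the test
count if pay > 0 & pay ~= .    // counts 3: .a is not equal to . so it is still included
count if pay > 0 & pay < .     // counts 2: only the genuine positive values remain
```

Because the extended missing values (.a to .z) are stored as larger than ., the < . form is the safer of the two conditions when a dataset may contain them.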

The detail option for summarize gives a wider range of summary statistics.

• Type the following:

su hourpay if stat==1 & hourpay >0 & hourpay ~=., detail

                    gross hourly pay (£)
-------------------------------------------------------------
      Percentiles      Smallest
 1%         2.06            .05
 5%         3.66            .09
10%          4.2            .11       Obs               31410
25%         5.47            .11       Sum of Wgt.       31410

50%         7.81                      Mean           9.700548
                        Largest       Std. Dev.       6.99723
75%         11.9         138.46
90%        17.18            150       Variance       48.96123
95%        21.35          163.5       Skewness       4.826005
99%         34.6          204.8       Kurtosis       65.15382

We can also combine the functions for cross-tabulation and summarize to create tables of summary statistics for sub-groups as defined through cross-tabulation. In the following example, we compare the mean hourly pay for men and women, dependent upon whether they are living as a couple (i.e. married or in a cohabiting union), or whether they are not living as a couple:

• Type:

ta married sex if (stat==1& hourpay>0), su(hourpay) means

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 9.0985452  7.9928881 | 8.5115412
       yes | 12.454774   8.619823 | 10.514367
-----------+----------------------+----------
     Total |  11.13273  8.3577719 | 9.7005479

The option means specifies that we only require the mean values for each category. If we had not specified this option, we would also obtain standard deviations and cell frequencies in the output table. This table suggests that for both men and women, married people get higher pay than non-married people (although we could also check whether these differences are statistically significant). However, this pattern may be confounded by age. Non-married people may be younger, have less work experience and seniority, and thus be paid less. We need to account for age differences in order to discount this explanation. We can approach this by considering whether the relationship between marital status and income differs between different age groups.

We can use the variable ages, derived from age, for this. This gives the ages of respondents in five year categories. The bysort: command allows us to perform operations by levels of a specified variable:

bysort ages: tabulate married sex if (status==1&hourpay>0), summ(hourpay) means nofreq

-------------------------------------------------------------------------------
-> ages = 16-19

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 4.3975325  4.4363439 | 4.4177293
       yes |      4.23      4.652 | 4.5816667
-----------+----------------------+----------
     Total | 4.3973349  4.4375108 | 4.4182844

-------------------------------------------------------------------------------
-> ages = 20-24

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 6.7361486  6.3238977 | 6.5155047
       yes | 7.1792592  6.5097887 | 6.6942347
-----------+----------------------+----------
     Total | 6.7581009  6.3436704 | 6.5299505

-------------------------------------------------------------------------------
-> ages = 25-29

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 9.3453418  8.5594706 | 8.9495492
       yes | 9.7817412  8.5467598 | 9.0923597
-----------+----------------------+----------
     Total | 9.4733402  8.5551396 | 8.9949653

-------------------------------------------------------------------------------
-> ages = 30-34

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 10.548311      9.437 | 10.012208
       yes | 11.821378  9.2466534 | 10.445421
-----------+----------------------+----------
     Total | 11.243172  9.3234061 | 10.259968

-------------------------------------------------------------------------------
-> ages = 35-39

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 11.777334  9.5189622 | 10.585505
       yes | 12.909386  8.9489756 |   10.9662
-----------+----------------------+----------
     Total | 12.586221  9.1294836 | 10.851559

-------------------------------------------------------------------------------
-> ages = 40-44

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 11.414991  9.5329118 | 10.426241
       yes | 13.720475  8.7016084 | 11.164072
-----------+----------------------+----------
     Total | 13.136032  8.9225785 | 10.972367

-------------------------------------------------------------------------------
-> ages = 45-49

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 11.790461  9.7145892 | 10.597074
       yes | 13.240801  8.6652131 |  10.86329
-----------+----------------------+----------
     Total | 12.924504  8.9365285 | 10.799492

-------------------------------------------------------------------------------
-> ages = 50-54

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 11.206868  9.1092617 | 9.9189148
       yes | 12.658364  8.7526648 | 10.612248
-----------+----------------------+----------
     Total | 12.399727  8.8377683 | 10.465945

-------------------------------------------------------------------------------
-> ages = 55-59

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 10.193532  8.1655882 | 8.9578674
       yes | 11.846913  7.6297244 | 9.8324965
-----------+----------------------+----------
     Total | 11.575704  7.7640855 |   9.65073

-------------------------------------------------------------------------------
-> ages = 60-64

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no |  9.069663  7.8004444 | 8.3047322
       yes |  10.54429  7.7155882 | 9.5276321
-----------+----------------------+----------
     Total | 10.355453  7.7397053 | 9.2935043

-------------------------------------------------------------------------------
-> ages = 65-69

                Means of gross hourly pay (£)

   whether |
married/co |          sex
  habiting |      male     female |     Total
-----------+----------------------+----------
        no | 8.0625001  7.5027273 |     7.652
       yes | 9.2046154  6.8515789 | 8.4337931
-----------+----------------------+----------
     Total | 9.0983721  7.0903333 | 8.2731507


For men (with the exception of those below 20 years of age), those who are married tend to earn more than their non-married counterparts. However, the difference between the earnings of married and non-married women is less marked. One problem is that the above command produces a large number of cross-tabulations, making it more difficult to interpret the results. In the following section, we shall go on to consider how we can recode variables into formats more specifically tailored to our questions and analyses.

The ‘table’ command: N.B. If you wish to present your results with fewer decimal places, use the table command with its format() option instead of tabulate:

table married sex if (status==1&hourpay>0), c(mean hourpay) format(%9.2f)

See help table for further details.

Exercises and suggested answers

2.1 Do male and female respondents in the dataset have a similar age profile? [sex ages]
2.2 Do levels of male and female hourly and weekly pay differ by a) ethnicity b) educational attainment? [sex ethnic hourpay grsswk hiqual]
2.3 Which ethnic group is most likely and which is least likely to be not born in the UK? [ethnic fb]
2.4 For men aged 30-65, which ethnic groups get the highest pay? [age sex ethnic grsswk]
2.5 Which ethnic group are least likely to be owner-occupiers? [ethnic house]

Suggested answers:

*Exercise 2.1
sort sex
tab sex, summ(age)
bysort sex: summ age
by sex, sort: su age
table sex, content(mean age)

*Exercise 2.2
tab ethnic sex if status==1&hourpay>0, summ(hourpay) nost nofr
tab ethnic sex if status==1&grsswk>0, summ(grsswk) nost nofr
table ethnic sex if status==1&hourpay>0, c(mean hourpay) f(%5.2f)
table ethnic sex if status==1&grsswk>0, c(mean grsswk) f(%5.2f)
table ethnic sex if status==1&grsswk>0&hiqual>=1&hiqual=35&age0) tab ethn, summ(grsswk) ,if (age>=35&age0) f(%5.2f)

*Exercise 2.5
tab ethni house if house >0, row


4.0 Generating variables and changing their values

From the preceding examples, it can be seen that the format in which secondary datasets are often supplied will mean that some variable recoding will be required prior to analysis. In this section, we focus upon three of the most important commands in Stata for this task: generate, recode, and replace.

4.1 Creating variables using the generate command

The generate command allows you to create new variables. It can be used in conjunction with label, which assigns a descriptive label to a variable, defines a set of value labels for the numerical values of a variable, and attaches those value labels to a given variable. Note that Stata commands are case sensitive and must be typed in lower case.

• Generate the following variable:

generate sex1 = sex

• The label variable command gives a descriptive label to your variable:

label variable sex1 "sex"

• label define [“label name”] establishes a set of labels for a set of values:

label define sexlabel 0 "male" 1 "female"

• label values [“label name”] attaches labels created through the label define command to the values of a specified variable:

label values sex1 sexlabel

Note that labels created through label define are stored independently of variables. This means that one label can be assigned to a number of different variables.

• For example:

gen sex3 = sex
label values sex3 sexlabel

These commands would thus use the same label (sexlabel) again. Alternatively, we could have used the label that already existed for sex to label our new variables, given that their values are identical.
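The labelling steps above can be tried out on a small toy dataset. The following is a minimal sketch, assuming a hypothetical three-row dataset typed in with input rather than the LFS file; the variable and label names follow the text.

```stata
* Toy data for illustration only (not the LFS teaching dataset)
clear
input sex
0
1
1
end

* Replica variable with a descriptive variable label
generate sex1 = sex
label variable sex1 "sex"

* Define the value labels once, then attach them to sex1
label define sexlabel 0 "male" 1 "female"
label values sex1 sexlabel

* The same value label can be reused for another variable
generate sex3 = sex
label values sex3 sexlabel

* Inspect the stored label and check both variables display it
label list sexlabel
tab sex1 sex3
```

Because the value label is stored separately from the variables, editing sexlabel later would change how both sex1 and sex3 are displayed.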


Creating ‘replica’ variables: Instead of recoding original variables, it is good practice to create ‘replica’ variables, identical in their values to original variables, and perform any recoding on these variables. The advantage of this is that you will keep the integrity of the original values of the variables in your dataset. The original variables can then provide a useful check to your recoding. They can also be indispensable to correcting otherwise potentially irreversible changes that result from your data manipulation.

Generate also allows you to create a variable with values based upon the mathematical transformation of another variable. If for instance, we find a curvilinear relationship between age and income, we might fit a quadratic model. In such cases, we may need a variable for age squared. This can be created easily using the generate command:

generate agesquared=age^2

Raising to a power is indicated by the “^” symbol. Another use for the generate command can be found in the handling of variables which relate to a date, year, or some other form of calendar information. The LFS dataset contains the variable conmpy indicating the year in which a respondent first joined their present employing organisation. From this we can use generate to create a ‘tenure,’ or ‘length of service’ variable, by calculating how many years prior to the interview year (2003) respondents were present at their current organisation:

• Create a replica variable identical to conmpy named lengthcom
• Convert the -8 and -9 values of your new variable to two different Stata missing values.
• Next, create a variable indicating length of tenure, the value of which equals 2003 minus lengthcom (2003-lengthcom) as the dataset is for 2002. This will mean that the value one will indicate those with one year or less of experience.
• Check your values using the tabulate command

If you did the above correctly, you should have done something like the following. First, create a replica variable so as not to alter the values of the original one:

gen lengthcom=conmpy

Second, recode missing values (n.b. a backslash can separate multiple codings):

mvdecode lengthcom, mv(-8=.\ -9=.a)

Next, a variable is created to indicate tenure (approximated to no. years prior to interview year+1, to avoid negative values):

gen tenure=2003 - lengthcom

We can check our new variable by using tabulate, once again specifying that missing values are included in the table:

tab tenure, m

     tenure |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        132        0.21        0.21
          1 |      5,435        8.55        8.76
          2 |      6,348        9.99       18.75
          3 |      3,926        6.18       24.92
          4 |      3,081        4.85       29.77
          5 |      2,478        3.90       33.67
          6 |      1,957        3.08       36.75
          7 |      1,705        2.68       39.43
          8 |      1,379        2.17       41.60
          9 |      1,104        1.74       43.34
         10 |        831        1.31       44.65
         11 |        912        1.43       46.08
         12 |        943        1.48       47.56
 ...... values omitted here ......
         40 |         28        0.04       63.28
         41 |         30        0.05       63.33
         42 |         25        0.04       63.37
         43 |         18        0.03       63.40
         44 |          8        0.01       63.41
         45 |          7        0.01       63.42
         46 |          5        0.01       63.43
         47 |          6        0.01       63.44
         48 |          4        0.01       63.45
         49 |          7        0.01       63.46
         50 |          6        0.01       63.47
          . |     23,220       36.53      100.00
------------+-----------------------------------
      Total |     63,559      100.00
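The replica, mvdecode and generate steps can also be rehearsed on a few made-up cases before running them on the full file. This is only a sketch, assuming a hypothetical four-row dataset in which conmpy holds two valid years and the two missing-value codes; it is not output from the LFS data.

```stata
* Toy data for illustration only: two valid years and the codes -8 and -9
clear
input conmpy
1998
2003
-8
-9
end

* Work on a replica so that conmpy itself is left untouched
gen lengthcom = conmpy

* Map -8 and -9 to two distinct Stata missing values
mvdecode lengthcom, mv(-8=. \ -9=.a)

* Tenure in years relative to 2003; missing input gives missing output
gen tenure = 2003 - lengthcom

* Check the result, showing missing values in the table
tab tenure, missing
list conmpy lengthcom tenure
```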

Another common use for the generate command is to create indicator or ‘dummy’ variables. These are typically binary variables (holding valid values of 1 or 0) which can be used to enter categorical variables into statistical models. For example, we might wish to make a binary indicator variable from tenure, which assigns a value of one to those with ten or fewer years' service, and zero to those with longer service. This identifies respondents with ten years' tenure or less within their current organisation. To do this, the generate command can be used alone or in combination with its accompanying conditional or if options. Here are two alternative ways of doing this:

gen tenure_a= tenure 4000 & varz ~= .

N.B. This applies to all uses of greater than or greater or equal to operators, and not just when using the replace command.
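The reason for this caution is that Stata treats missing values as larger than any number. The sketch below illustrates the point on a hypothetical four-row dataset; the variable names tenure_naive and tenure_dummy are invented for the illustration and do not appear in the LFS file.

```stata
* Toy data for illustration only: three valid values and one missing value
clear
input tenure
3
10
25
.
end

* Pitfall: the expression is false (0) for the missing case,
* so the respondent with unknown tenure is coded 0
gen tenure_naive = tenure <= 10

* Safer: only evaluate the expression where tenure is not missing,
* so the missing case stays missing
gen tenure_dummy = tenure <= 10 if tenure < .

* Pitfall with ">": the missing case is counted as greater than 10
count if tenure > 10

* Excluding missing values leaves only the genuine case
count if tenure > 10 & tenure < .

list
```

The condition tenure < . excludes every type of missing value (., .a, .b and so on), whereas tenure ~= . excludes only the system missing value.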

4.3 Recoding variables

Recode provides a slightly different function to replace. Typically, recode is used to collapse the number of categories in a variable. In the following example, we will create a recoded version of the continuous variable, tenure. Our new variable, tenure4, will have four categories. As always when transforming or creating new variables, it is first important to look at the codebook and survey documentation. This is to ensure that we understand what each value means and what range of values exists (i.e. what the maximum and minimum are). This also allows us to consider whether there are any missing or inapplicable cases which need to be accounted for. From the above tabulations, we know that the longest tenure is 50 years whereas the shortest is zero years. The latter value indicates those who joined their current organisation in 2003. In order to keep the integrity of the values of the original variable, we will first create a replica variable which we will then recode into four categories:

gen tenure4= tenure
recode tenure4 min/5=1 6/10=2 11/20=3 21/50=4 *=.
label variable tenure4 "num. years in present company"
label define tenure4 1 "0-5yrs" 2 "6-10yrs" 3 "11-20yrs" /*
*/ 4 "21-50"
label values tenure4 tenure4

Note that if we had not recoded the -8 and -9 values to “.”, the min option denoting the minimum value would include these in the recode. For long commands in a do-file, we can use /* at the end of one line, and */ at the beginning of the next, to tell Stata that the text on different lines is part of the same command.

Some notes can be made on the symbols usable within recode statements. Stata understands ‘/’ to mean 'through' (6/10 in the present context thus means 6 through 10). The symbol ‘*’ in the context of recoding instructions means 'remaining' or 'all others'. To those familiar with SPSS, this is similar to the 'else' option. We can also use ‘min’ to denote the minimum value. Just as when using the greater than (>) or greater or equal to (>=) operators, care must be taken when using the max option so as not to accidentally recode missing values as valid cases.

Instead of using generate and recode as separate commands, you can also use generate as an option of recode. The original variable is left unchanged, and the name of the new variable is given in the parentheses that follow the generate option:

recode tenure min/5=1 6/10=2 11/20=3 21/50=4 *=., generate(tenure4_b)
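Recode can additionally define the value labels in the same step, by writing each rule in parentheses followed by its label. The following is a minimal sketch on a hypothetical six-row dataset; the names tenure4_c and the toy values are assumptions made for the illustration.

```stata
* Toy data for illustration only
clear
input tenure
0
4
8
15
30
.
end

* Each rule is written in parentheses with its value label;
* generate() names the new variable and label() names the value label.
* The /// line continuation requires running this from a do-file.
recode tenure (min/5 = 1 "0-5yrs") (6/10 = 2 "6-10yrs") ///
              (11/20 = 3 "11-20yrs") (21/50 = 4 "21-50yrs"), ///
              generate(tenure4_c) label(tenure4_c)

* Cross-check the recode against the original values
tab tenure tenure4_c, missing
```

Cross-tabulating the new variable against the original, as in the last line, is a quick way to confirm that every value landed in the intended category.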

Exercises and suggested answers

4.1 Return to the example in 3.4, which looks at differences in earnings by marital status and age. Create a recoded variable called ‘age3’ which recodes the variable age into the following categories: 1 = 16-35yrs, 2 = 36-50yrs, 3 = 51-65yrs.
4.2 Using the bysort command again, reconsider how the relationship between marital status and income differs by age and gender using your new variable (age3, married, sex).

Suggested answers:

*Exercise 4.1
tab age,m
gen age3=age
recode age3 min/35=1 36/50=2 51/max=3
label var age3 "Age groups"
label def age3 1 "16-35" 2 "36-50" 3 "51-65"
label val age3 age3
tab age age3, m

*Exercise 4.2
bys age3: tabulate married sex if (status==1&hourpay>0), summ(hourpay) means nofreq


5.0 Graphics in Stata

The graphical capabilities of recent versions of Stata (beyond Stata 8) have been improved considerably compared to earlier versions of the software. This means you can now produce publication quality graphics with relative ease, and without having to transfer data or output to different software. Below, we will consider some techniques for producing simple histograms and two-way scatter plots. These can provide a useful accompaniment to the summary statistics created in Section 3.5.

5.1 Producing a histogram

The menu system is a good way to produce graphics because the graphic command structure is different to the usual structure of Stata commands. Because the full options available in the graph menus can be excessive we’ll stick to the Easy graphs option.

• Open the histogram dialogue by choosing Graphs, Easy graphs, Histogram
• Select hourpay using the drop down menu and click OK

[Figure: histogram of gross hourly pay (£); x axis: gross hourly pay (£), 0 to 200; y axis: Density]

• Where did the graph appear?
• Does the graph include cases where the values are not valid?
• What is the y axis marked as?
• Use the if/in tab to only include those cases where hourpay>=0
  o Click on the Options tab, to change 3 options:
  o Set the Scheme to s1 monochrome – this will change the colour scheme to a black and white one which will be better suited to printing
  o Set the bin width to 10, to make the width of the bars £10 wide
  o Change the y axis to percent
  o Click OK to run the graph
• What happened to your original graph when your new graph appeared?
• How do you think you might add a title and notes?
• Try playing around with the various options in the histogram dialogue
• When you’re happy click OK to run the graph
• Save the graph by right clicking on it. Save the graph as a windows metafile format (.wmf) to import to a word document.

N.B. Stata 9 can now also save to .tiff, which may be preferable when producing results for publication.
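The same histogram can also be requested from the command line. The following is a sketch rather than the exact command the dialogue generates: it assumes the LFS teaching dataset is already in memory, and the title text and the export file name are invented for the illustration.

```stata
* Assumes the LFS teaching dataset is open; run from a do-file (uses ///)
* width(10) sets £10 bins, percent rescales the y axis,
* scheme(s1mono) selects the black and white scheme
histogram hourpay if hourpay >= 0, width(10) percent scheme(s1mono) ///
    title("Gross hourly pay") ///
    note("Source: Labour Force Survey 2002 Teaching Dataset")

* Hypothetical file name; the file extension determines the export format
* (.wmf is available under Windows only)
graph export "hourpay_hist.png", replace
```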


5.2 Producing a two-way scatter plot

To produce a two-way scatter plot of men’s hourly pay by age:

• Select Graphics, Easy graphs, Scatter plot
• Select age as your x variable and hourpay as your y variable in the main tab
• Limit the procedure to those cases where hourpay >=1 and sex==0 in the if tab (sex is coded female = 1, male = 0)
• Can you locate the appropriate options to replicate the graph below?
• Do this and save the file in a format appropriate to import into a word document.

[Figure: scatter plot titled "Men's Hourly Pay by Age"; x axis: Age (in years); y axis: Hourly pay in pounds; note: Source: Labour Force Survey 2002 Teaching Dataset]

Type help graph for further options
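For reference, a command-line version of this scatter plot might look like the sketch below. It assumes the LFS teaching dataset is in memory and selects men with sex==0, following the coding given in Section 6.1 (female = 1, male = 0); adjust the condition if your copy of the data is coded differently. The title, axis titles and note are taken from the graph shown above.

```stata
* Assumes the LFS teaching dataset is open; run from a do-file (uses ///)
twoway scatter hourpay age if hourpay >= 1 & sex == 0, ///
    title("Men's Hourly Pay by Age") ///
    ytitle("Hourly pay in pounds") xtitle("Age (in years)") ///
    note("Source: Labour Force Survey 2002 Teaching Dataset")
```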

6.0 Statistical modelling using Stata: A brief introduction

One of the most powerful aspects of Stata is its range of capabilities for statistical modelling. Most estimation commands in Stata follow a similar syntactical structure. This means once you have learnt the basic estimation procedure for one command, with a little knowledge, you will be able to produce results using a wide range of procedures. In this section, worked examples of two common forms of statistical models are given, these being multiple linear and logistic regression.

6.1 Example 1: Multiple Linear Regression

Multiple linear regression can be used to consider the extent to which a set of explanatory (independent) covariates predict the values of a continuous outcome (dependent) variable. Introductory texts for these techniques are suggested in Appendix A. In Stata, commands that produce statistical models are referred to as estimation commands. Such commands are demonstrated by the basic structure of the linear regression command:

regress depvar [varlist] [weight] [if exp] [in range] [, level(#) beta robust noconstant noheader]

The dependent variable is indicated by the first variable after the regress command. Subsequent listed variables define independent/explanatory variables. Sampling weights2 can also be defined in the statement, as can conditional statements [if exp] or a range of cases [in range]. These latter two allow for the selection of a subset of data. Following the comma in estimation commands, a number of other options can be specified. A full list of options for the regress command can be viewed by searching for ‘regress’ using the help option. For now, we will consider some of the more important options. level(#) sets the confidence level, as a percentage, used for the confidence intervals in your results (the default is 95). beta provides standardised coefficients in your output. noconstant fits the model without a constant term (intercept). The option robust specifies a robust estimation of the variance covariance matrix. noheader suppresses the display of the ANOVA table and summary statistics at the top of the output, so that only the coefficient table is displayed.

• Type help regress to find out about further options available for the regress command.

2 See http://www.esds.ac.uk/government/resources/analysis/ for a guide to sample weighting.
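To see what these options do in practice, they can be tried on one of the example datasets shipped with Stata. The sketch below uses the auto dataset (loaded with sysuse) rather than the LFS file, so the variables price, mpg, weight and foreign are stand-ins chosen purely for illustration.

```stata
* Example dataset shipped with Stata; used here only as a stand-in
sysuse auto, clear

* Basic model: price is the dependent variable, mpg and weight the covariates
regress price mpg weight

* Restrict the sample with if, and report 90% confidence intervals
regress price mpg weight if foreign == 0, level(90)

* Standardised (beta) coefficients replace the confidence intervals
regress price mpg weight, beta

* Robust standard errors, with the header above the coefficient table suppressed
regress price mpg weight, robust noheader
```

Comparing the four sets of output shows that level(), beta, robust and noheader change what is reported or how the standard errors are computed, while the coefficient estimates for a given sample stay the same.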

In the following example, we will ask a number of questions relating to ethnicity, gender, and income:

1: Which ethnic groups are more likely to have higher incomes?
2: Are such levels of income related to gender and marital status?
3: Do people who are not born in the UK suffer a ‘nativity penalty’ in terms of lower pay?
4: Do ethnic differences in wages persist after controlling for educational attainment and job experience?

To attempt to answer these questions, we will estimate four models:

Model 1: income = β0 + β1ethnic + ε
Model 2: income = β0 + β1ethnic + β2sex + β3married + ε
Model 3: income = β0 + β1ethnic + β2sex + β3married + β4foreignborn + ε
Model 4: income = β0 + β1ethnic + β2sex + β3married + β4foreignborn + β5education + β6jobtenure + ε

Before proceeding with our models, it is first necessary to prepare the data. In Section 4.1, we considered how indicator variables can be created. These variables allow categorical variables to be entered into models as sets of binary variables, taking on the values 1 and 0. The variables sex, married and fb (‘foreign born’) are already coded as indicator (dummy) variables in the dataset:

sex (female = 1, male = 0)
married (married = 1, non-married = 0)
fb (foreign born = 1, native born = 0)

In addition to these, we need to create indicator variables for ethnicity. In Section 4, we learnt a few commands we can use to do this, including generate, replace, and recode. In this section, we shall introduce another way, which is particularly useful in modelling contexts: using the char varname[omit] # command in conjunction with the xi: “interaction expansion” command. When entering indicator variables for a categorical variable into a model, it is always important to omit one category of the variable. This category will act as the reference group to which all other categories of the variable will be compared. The char varname[omit] # command records, for a categorical variable, the category value (#) that xi: should omit when it creates the indicator variables.

Suppose that we wish to use the white category as the reference group for our ethnicity variable.

• Enter the following:

char ethnic[omit] 1

The # in the command syntax is replaced by the number 1, denoting the category which we wish to use as a reference group (in this case 1=‘white’). Once specified, Stata remembers the category we have selected and omits it from every model we run with xi: (unless we specify another category). We can decide to omit whichever group we prefer. Although this will change the absolute value of the coefficients, it will not change the relative magnitude of difference between the coefficient values for each category of a variable.

Selecting Comparison Groups: When selecting your omitted category, choose a theoretically and empirically meaningful category. For instance, in relation to our ethnicity variable, to use the ‘Other’ ethnic category may not be meaningful if we do not have a clear idea about who the ‘Other’ group are. The reference group should also be of a fairly large size, otherwise this could affect the stability of the models. Stata will give a message encouraging the use of another reference category if it considers the one you have chosen is too small.

Identifying indicator variables using the xi: command

The regression command, and all other estimation commands, do not automatically recognise categorical independent variables listed in our instructions. Consequently, it is necessary to tell Stata that a model contains such variables, by prefixing our regression statement with the xi: “interaction expansion” term. When using xi: we also need to prefix each of our categorical variables in the model with i. to tell Stata that these are categorical variables which are to be split into indicator variables. Otherwise, Stata will treat such categorical variables as continuous, in many cases producing meaningless results. When Stata sees a variable preceded by i. it will treat it as a categorical variable, omitting the category we have declared as the base group. If we haven’t specified a baseline category, Stata by default will treat the category assigned the lowest numerical value as the comparison group.

In each of the following models, we will tell Stata that the models contain categorical variables. We will also filter the sample, selecting for those who are in paid employment (status==1) and who do not have -8 missing or inapplicable values for the weekly gross pay variable (grsswk >= 0).

Always know your data before conducting any analysis. Prior to analysis, we might also want to check the distribution of our income variable, check and clean our variables, perform descriptive analyses, handle missing values, and consider potential transformations for our variables etc. For present illustrative purposes, we shall omit these stages.
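The interplay between char and xi: can be explored on a small example before turning to the LFS models. The sketch below uses the auto dataset shipped with Stata as a stand-in; the variable rep78 (a five-category repair record) plays the role of the categorical variable and is an assumption of the illustration, not part of the LFS data.

```stata
* Example dataset shipped with Stata; used here only as a stand-in
sysuse auto, clear

* Inspect the categories before choosing a reference group
tab rep78

* Declare category 3 as the omitted (reference) category
char rep78[omit] 3

* xi: expands i.rep78 into indicator variables _Irep78_*, leaving out category 3
xi: regress price i.rep78 mpg

* Clearing the characteristic restores the default,
* where the lowest value is omitted
char rep78[omit]
xi: regress price i.rep78 mpg
```

In Stata 11 and later, factor-variable notation (for example ib3.rep78) offers similar functionality without the xi: prefix, although the coefficient names then differ from the _I* names used in this guide.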

• In the first model estimated, we will consider the univariate relationship between ethnicity and gross pay:

xi: regress grsswk i.ethnic if status==1 & grsswk>=0

This should produce the following results in your output screen:

i.ethnic          _Iethnic_1-9        (naturally coded; _Iethnic_1 omitted)

      Source |       SS       df       MS              Number of obs =   31568
-------------+------------------------------           F(  8, 31559) =    4.94
       Model |  3068172.75     8  383521.594           Prob > F      =  0.0000
    Residual |  2.4508e+09 31559  77656.2294           R-squared     =  0.0013
-------------+------------------------------           Adj R-squared =  0.0010
       Total |  2.4538e+09 31567  77733.7446           Root MSE      =  278.67

------------------------------------------------------------------------------
      grsswk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |  -23.72905   17.73292    -1.34   0.181    -58.48626    11.02816
  _Iethnic_3 |  -5.981017   21.49624    -0.28   0.781    -48.11448    36.15245
  _Iethnic_4 |  -118.1102   31.39386    -3.76   0.000    -179.6434   -56.57704
  _Iethnic_5 |   25.67318   14.46086     1.78   0.076    -2.670672    54.01702
  _Iethnic_6 |  -56.42616   22.96276    -2.46   0.014    -101.4341   -11.41826
  _Iethnic_7 |  -115.4301   40.25447    -2.87   0.004    -194.3304   -36.52978
  _Iethnic_8 |    15.2324   36.01186     0.42   0.672    -55.35227    85.81706
  _Iethnic_9 |   34.13716   14.29295     2.39   0.017     6.122419     62.1519
       _cons |   349.0343   1.607448   217.14   0.000     345.8836    352.1849
------------------------------------------------------------------------------

We find that, when only ethnicity is included in the model, ‘Black Other’, ‘Pakistani’ and ‘Bangladeshi’ respondents get significantly less income than ‘White’ respondents. The ‘Other’ ethnic group gets significantly higher incomes. The R-squared tells us how well our regression model fits the observed data. For example, a value of 0.23 would tell us that the model accounts for 23 per cent of the variance in the dependent variable. The R-squared for our first model is 0.001, which is very low. If we enter more variables into the model which predict our dependent variable, the fit of our model to the data should improve, and this value should rise.

Storing estimates

If you have opened up a log file, the output from your results will be stored permanently. Within its active memory, however, Stata automatically holds results only for the last model run. Once we run a subsequent model, the results of the previous model will be lost. This means we can no longer perform any further estimations or statistical tests on the prior model. We can prevent this happening by using the estimates store command to assign names to models, which we can use to recall results and perform further estimation procedures at a later time:

est store model1

Below, we will consider how the est command can be used to simultaneously handle a number of different estimated models. In section 6.2, we will also go on to see how est options can be used to perform post-estimation procedures following the initial estimation and storing of our models.
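As a brief illustration of working with stored results, the sketch below stores two models, displays them side by side, and then recalls the first. It uses the auto dataset shipped with Stata as a stand-in, and the names m1 and m2 are hypothetical names chosen for the example.

```stata
* Example dataset shipped with Stata; used here only as a stand-in
sysuse auto, clear

* Fit and store a first model
regress price mpg
estimates store m1

* Fit and store a second, larger model
regress price mpg weight foreign
estimates store m2

* List the stored models, then compare coefficients, standard errors,
* number of observations and R-squared side by side
estimates dir
estimates table m1 m2, b(%9.2f) se stats(N r2)

* Make m1 the active model again and redisplay its output
estimates restore m1
regress
```

After estimates restore, post-estimation commands such as test apply to the recalled model rather than to the most recently fitted one.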

• In our second model, we will include gender and marital status:

xi: regress grsswk i.ethni i.sex i.married if status==1&grsswk>=0
est store model2

i.ethnic          _Iethnic_1-9        (naturally coded; _Iethnic_1 omitted)
i.sex             _Isex_0-1           (naturally coded; _Isex_0 omitted)
i.married         _Imarried_0-1       (naturally coded; _Imarried_0 omitted)

      Source |       SS       df       MS              Number of obs =   31568
-------------+------------------------------           F( 10, 31557) =  473.51
       Model |   320155308    10  32015530.8           Prob > F      =  0.0000
    Residual |  2.1337e+09 31557  67613.0751           R-squared     =  0.1305
-------------+------------------------------           Adj R-squared =  0.1302
       Total |  2.4538e+09 31567  77733.7446           Root MSE      =  260.03

------------------------------------------------------------------------------
      grsswk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |  -1.831729   16.55295    -0.11   0.912    -34.27617    30.61271
  _Iethnic_3 |  -5.290509   20.05968    -0.26   0.792    -44.60826    34.02725
  _Iethnic_4 |  -79.97105   29.30674    -2.73   0.006    -137.4134    -22.5287
  _Iethnic_5 |   17.44311   13.49736     1.29   0.196    -9.012241    43.89846
  _Iethnic_6 |  -71.83893   21.42786    -3.35   0.001    -113.8384    -29.8395
  _Iethnic_7 |  -151.5194   37.56624    -4.03   0.000    -225.1507    -77.8881
  _Iethnic_8 |   12.12303   33.60301     0.36   0.718    -53.74018    77.98624
  _Iethnic_9 |   33.56038    13.3376     2.52   0.012     7.418158     59.7026
     _Isex_1 |  -188.2034   2.930134   -64.23   0.000    -193.9466   -182.4602
 _Imarried_1 |   66.28188   2.984302    22.21   0.000     60.43253    72.13123
       _cons |   406.6536   2.795909   145.45   0.000     401.1735    412.1337
------------------------------------------------------------------------------

The results indicate the effect of each variable after controlling for the effects of other variables in the model. This model suggests that women receive significantly lower incomes than men, and that married people get higher incomes than non-married people. Through the inclusion of gender and marital status, we can see that the R-squared value has increased to 0.13, or 13 per cent explained variance. Since sex and married are already coded as binary dummies, prefixing the two variable names with i. or leaving them unprefixed would produce exactly the same results. You could consequently just write the following:

xi: regress grsswk i.ethni sex married if status==1& grsswk>=0

We can assess whether the inclusion of additional variables significantly improves overall model fit by using the testparm command to produce Wald tests. This is one of many post-estimation commands you can implement after fitting your model.

testparm _Is* _Ima*

 ( 1)  _Isex_1 = 0.0
 ( 2)  _Imarried_1 = 0.0

       F(  2, 31557) = 2344.87
            Prob > F =    0.0000

Note that by inspecting the output, we know _Isex_1 stands for sex and _Imarried_1 for married, and that there are no other terms beginning with _Is or _Ima, so that we can simply use _Is* and _Ima* to stand for the two terms respectively.

• In the third model, we will include the variable fb to control for income differences between people who were born in the United Kingdom and those who were born in another country:

xi: regress grsswk i.ethni sex married fb if status==1&grsswk>=0
est store model3

      Source |       SS       df       MS              Number of obs =   31568
-------------+------------------------------           F( 11, 31556) =  441.62
       Model |   327352435    11  29759312.3           Prob > F      =  0.0000
    Residual |  2.1265e+09 31556  67387.1429           R-squared     =  0.1334
-------------+------------------------------           Adj R-squared =  0.1331
       Total |  2.4538e+09 31567  77733.7446           Root MSE      =  259.59

------------------------------------------------------------------------------
      grsswk |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Iethnic_2 |  -26.30005   16.69402    -1.58   0.115    -59.02099    6.420884
  _Iethnic_3 |  -56.19266   20.62295    -2.72   0.006    -96.61445   -15.77087
  _Iethnic_4 |  -87.46918   29.26673    -2.99   0.003    -144.8331   -30.10525
  _Iethnic_5 |  -22.66187   14.02247    -1.62   0.106    -50.14646    4.822708
  _Iethnic_6 |  -102.5066   21.59687    -4.75   0.000    -144.8373   -60.17588
  _Iethnic_7 |  -187.8878   37.66817    -4.99   0.000    -261.7189   -114.0567
  _Iethnic_8 |   -33.2174   33.83249    -0.98   0.326     -99.5304     33.0956
  _Iethnic_9 |  -11.09539   13.99887    -0.79   0.428    -38.53372    16.34295
         sex |  -188.3054   2.925251   -64.37   0.000     -194.039   -182.5718
     married |   65.41331   2.980497    21.95   0.000     59.57142     71.2552
          fb |   67.33948   6.515964    10.33   0.000     54.56794    80.11103
       _cons |   404.2611   2.800818   144.34   0.000     398.7714    409.7508
------------------------------------------------------------------------------

Surprisingly, people who were not born in the UK (fb==1) have significantly higher incomes than native born respondents (after controlling for ethnicity). The sex and marital status parameters are similar to those in Model 2. However, once sex, marital status and nativity are all controlled for, several of the non-white ethnic groups still have significantly lower incomes than the white category.

• Does the inclusion of fb make a statistically significant contribution to the terms already included in Model 2?

testparm fb

 ( 1)  fb = 0.0

       F(  1, 31556) =  106.80
            Prob > F =    0.0000

The answer is yes, it does. Instead of just comparing all other ethnic groups to the White category, we can also consider whether there are statistically significant differences between any particular two ethnic groups. This is achieved using the test command to produce Wald tests.


The following example tests whether there are significant differences between the Black Caribbean and Black Other groups, between Indian and Pakistani groups, and between Pakistani and Bangladeshi ethnic groups:

test _Iethnic_2=_Iethnic_4

 ( 1)  _Iethnic_2 - _Iethnic_4 = 0.0

       F(  1, 31556) =    3.32
            Prob > F =    0.0684

test _Iethnic_5=_Iethnic_6

 ( 1)  _Iethnic_5 - _Iethnic_6 = 0.0

       F(  1, 31556) =   10.03
            Prob > F =    0.0015

test _Iethnic_6=_Iethnic_7

 ( 1)  _Iethnic_6 - _Iethnic_7 = 0.0

       F(  1, 31556) =    3.92
            Prob > F =    0.0477

After controlling for gender, marital status and country of birth, the results indicate that there are no significant income differences between the Black Caribbean and Black Other groups. Those within the Indian ethnic category however received significantly higher incomes than the Pakistani group, whereas Pakistanis received significantly more income than Bangladeshis. Note that when we tell Stata to compare _Iethnic_2=_Iethnic_4, it automatically re-orders the equation so that it becomes _Iethnic_2 - _Iethnic_4 = 0.0.

• Type ‘help test’. What is the difference between the ‘test’ and ‘testparm’ commands?
• What other variables that predict income do you think need to be added to the model?
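The two commands can be compared on a small example. The sketch below again uses the auto dataset shipped with Stata as a stand-in, with rep78 as a hypothetical categorical variable; it is meant only to show the form of each command, not to reproduce the LFS results.

```stata
* Example dataset shipped with Stata; used here only as a stand-in
sysuse auto, clear

* Model with a categorical variable expanded by xi:
xi: regress price i.rep78 mpg

* Joint Wald test that all the rep78 indicator coefficients are zero
testparm _Irep78*

* Wald test that two particular categories have equal coefficients
test _Irep78_4 = _Irep78_5

* Wald test that a single coefficient is zero
test mpg
```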

Although the above models give some initial indications regarding ethnic differences in income, they are incomplete: many other important explanatory variables for income (such as educational attainment and job tenure) are missing from the model, which could mean our current estimates are biased. In the next model, variables denoting level of educational attainment and tenure are included. For simplicity, we will restrict education to a binary variable indicating whether a respondent has a higher level qualification or not (although you can try some more detailed categorisations if you wish3). We can obtain a definition of a higher qualification from the National Qualifications Framework as used for the 2001 Census.

3 The variable hiquald is included in the teaching dataset which gives more detailed categorisation. You may wish to try using this variable as an alternative.

• Create a higher education variable:

gen replace tab

hieduc=hiqual hieduc=. if hiqual=1& hiqual=15 & hiqual