Stata: Merging Data Sets

Stata: Merging Data Sets Social Science Research Lab American University, Washington, D.C. Web. www.american.edu/provost/ctrl/pclabs.cfm Tel. x3862 Em...
Author: Nora Daniels
3 downloads 6 Views 243KB Size
Stata: Merging Data Sets Social Science Research Lab American University, Washington, D.C. Web. www.american.edu/provost/ctrl/pclabs.cfm Tel. x3862 Email. [email protected]

Course Objective This course would provide students with basic information on how to merge datasets in STATA.

Learning Outcomes One-to-one merge Match-merge One-to-many/Many-to-one merge Merging multiple data sets

Open a LOG FILE before you start work… You should always open a log file before you start work. This file will record all the input that you type as well as all the output produced by STATA. You can easily delete it later if you decide not to keep it. To open a log file, choose a relevant name and in the STATA command window, type: log using "G:\SSRL\intermediate_example", replace

If you want to save your work on to an existing file without erasing the contents of the existing file, type: log using "G:\SSRL\intermediate_example", append

You can open and review the log at any time by clicking on File, Log, View and then navigating to the location of the log file.

1

One-to-one merge 1) To open the first data set, type: webuse odd This command opens an example dataset installed in STATA named “odd”. 2) To display the data contents, type: list You will see the following data listed:

1. 2. 3. 4. 5.

+--------------+ | number odd | |--------------| | 1 1 | | 2 3 | | 3 5 | | 4 7 | | 5 9 | +--------------+

3) To open the second data set, type: webuse even1 This data set called “even1” is the data set currently in STATA’s memory. Notice how the variable window now has the variable “even” instead of “odd”. The data set currently in memory (or currently open) is also called the “master” data file. 4) To display data contents, type: list You will see the following data listed:

1. 2. 3. 4.

+---------------+ | number even | |---------------| | 5 10 | | 6 12 | | 7 14 | | 8 16 | +---------------+

5) To perform one-to-one merge, type: merge using http://www.stata-press.com/data/r10/odd

The “using” data set is the data set that you would like to merge with the “master” data set. In this example, the merge command specifies a url to call the data set from the web. You can replace this with any other path where your “using” data set is located. For example, if you had saved the first data set in your G: drive you could instead type: merge using “G:\odd.dta”

2

6) To display the merged data contents, type: list You will see the following data listed:

1. 2. 3. 4. 5.

+------------------------------+ | number even odd _merge | |------------------------------| | 5 10 1 3 | | 6 12 3 3 | | 7 14 5 3 | | 8 16 7 3 | | 5 . 9 2 | +------------------------------+

The _merge variable is a new variable generated by the merge command. The legend is as follows: _merge==1 obs. from master data _merge==2 obs. from only one using dataset _merge==3 obs. from at least two datasets, master or using

Notice that the last observation has a missing value for “even”, and a _merge value of “2”, meaning that data was obtained only from the using data set. Another way we can merge data is by matching the observations according to some identifier. In this case, the ID variable is “number”. To match observations with the same ID we use the match-merge option.

Match-merge

1) Let’s open the master data set again. Type: webuse even1, clear 2) To perform match-merge, type:

merge number using http://www.stata-press.com/data/r10/odd, sort

Notice the “sort” option after the comma. Both data sets must be sorted according to the ID variable for the match-merge to run. 3) To display data contents, type: list You will see the following data listed:

1. 2. 3. 4. 5. 6. 7. 8.

+------------------------------+ | number even odd _merge | |------------------------------| | 5 10 9 3 | | 6 12 . 1 | | 7 14 . 1 | | 8 16 . 1 | | 1 . 1 2 | |------------------------------| | 2 . 3 2 | | 3 . 5 2 | | 4 . 7 2 | +------------------------------+

3

To make the display easier to read, we can sort the ID by typing: sort number Type “list” to display the data again, this time in ascending order:

1. 2. 3. 4. 5. 6. 7. 8.

+------------------------------+ | number even odd _merge | |------------------------------| | 1 . 1 2 | | 2 . 3 2 | | 3 . 5 2 | | 4 . 7 2 | | 5 10 9 3 | |------------------------------| | 6 12 . 1 | | 7 14 . 1 | | 8 16 . 1 | +------------------------------+

Match-merge is ideal for data sets where the IDs are unique in both data sets you are merging. It is useful to check whether your ID variables are unique before performing the merge command. To do this, type: duplicates report number Duplicates in terms of number

-------------------------------------copies | observations surplus -------+-----------------------------1| 8 0 --------------------------------------

If your IDs are unique, your duplicates report should indicate a 0 surplus. If not, then your ID is not unique in the data set and you need to find additional identifying variables before you can perform the match-merge.

Many-to-one/One-to-many merge (or merge with spreading) 1) Open a new data set. Type: webuse dollars, clear 2) To display data contents, type: list You will see the following data listed:

1. 2. 3. 4.

+-----------------------------+ | region sales cost | |-----------------------------| | N Cntrl 419,472 227,677 | | NE 360,523 138,097 | | South 532,399 330,499 | | West 310,565 165,348 | +-----------------------------+

3) Next, open the second data set. Type: webuse sforce 4) To display data contents, type: list 4

You will see the following data listed:

1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.

+--------------------+ | region name | |--------------------| | N Cntrl Krantz | | N Cntrl Phipps | | N Cntrl Willis | | NE Ecklund | | NE Franks | |--------------------| | South Anderson | | South Dubnoff | | South Lee | | South McNeil | | West Charles | |--------------------| | West Cobb | | West Grant | +--------------------+

The common ID variable is “region”. However, in the “sforce” data set, the “region” variable is not unique – there are multiple observations with the same value for “region”. STATA automatically detects that you are attempting a “Many-to-one” merge, this means that the ID is not unique in the “master” data set (sforce) but it is unique in the “using” data set (dollars). 5) To perform the many-to-one merge. Type: merge region using http://www.stata-press.com/data/r10/dollars

Notice that the syntax is exactly the same as the match-merge without the sort option at the end. You should NOT include the sort option when your IDs are not unique. Instead make sure your data sets are sorted according to the IDs before merging.

5

6) To display data contents, type: list You will see the following data listed:

1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.

+-------------------------------------------------+ | region name sales cost _merge | |-------------------------------------------------| | N Cntrl Krantz 419,472 227,677 3 | | N Cntrl Phipps 419,472 227,677 3 | | N Cntrl Willis 419,472 227,677 3 | | NE Ecklund 360,523 138,097 3 | | NE Franks 360,523 138,097 3 | |-------------------------------------------------| | South Anderson 532,399 330,499 3 | | South Dubnoff 532,399 330,499 3 | | South Lee 532,399 330,499 3 | | South McNeil 532,399 330,499 3 | | West Charles 310,565 165,348 3 | |-------------------------------------------------| | West Cobb 310,565 165,348 3 | | West Grant 310,565 165,348 3 | +-------------------------------------------------+

As you can see, the sales and cost data was simply repeated for all observations with the same region value. To perform one-to-many merge, open the “dollar” data set first and then merge using the “sforce” data. You should get the same data list as above.

Merging multiple data sets If you are merging data with the same structure, merging them all at once is useful. To do this you can simply use the dropdown menus and follow the prompts: Data>Combine datasets>Merge multiple datasets

Nevertheless, it is highly recommended that you merge only two data sets at a time. Why? It is much easier to correct any merging errors when you are dealing with only two data sets. For example, if you have a problematic data set where you don’t have the correct ID variables, it is easier to pinpoint which data set is causing the error. Instead of merging all at once, do it successively. Once you have merged the first two data sets, rename or drop the “_merge” variable and then merge again using the next data file (and so on). For additional information on the syntax for the merge command, type: help merge 6