Harvard University Graduate School of Education

S-290: Quantitative Methods for Improving Causal Inference In Educational Research
Introducing the STATA Program
(Last edited February 24, 2008)

NOTE FOR S-030 STUDENTS: Unfortunately, the datasets referenced below are not available for S-030 use. However, we have posted a STATA version of your final exam dataset (the MCAS file) on the S-030 Course Website. We have produced this worksheet to help you become familiar with STATA. If you are new to the software, we recommend that you follow the instructions in the worksheet below, in the order in which they are presented. Otherwise, feel free to pick and choose among the topics.

I. Executing the STATA Program for the First Time

1. Sit at a convenient workstation and start STATA by double-clicking on the STATA program listing. You can find this listing by left-clicking once on the Windows Start button in the lower left-hand corner of the Windows Desktop and selecting the Programs, Research Applications, and STATA 10 subfolders in sequence.

2. Notice that, when STATA starts up, several dedicated windows appear, usually including the STATA Command, STATA Results, Review, and Variables windows. (If these windows don't show up, the previous user modified the window layout and logged off without returning STATA to its default display. If this happens, open the Edit menu in the STATA standard toolbar at the top of the screen and select Preferences -> Manage Preferences -> Load Preferences -> Factory Settings.) The functions of these windows are described briefly below:

   a. The STATA Command window is the place where you type commands for STATA, one at a time. To execute a command after you have typed it into this window, you hit Enter. You can also paste several commands at once into this window from a separate text file.

   b. The STATA Results window is the place where the results of your analyses will appear (provided you type in appropriate STATA commands without error). You can scroll up and down through the material in this window to check the output from your most recent analyses.

   c. The Review window is a diary of all the commands that you enter into the STATA Command window during the current session. This listing provides a very useful record because you can easily copy an earlier command from this window back into the STATA Command window, simply by left-clicking once on the old command.

   d. The Variables window will ultimately contain a list of the variables (and their labels, if you create them) in the dataset that you are currently analyzing. This window is also useful because you can easily paste a variable name from it into the STATA Command window by left-clicking on the name once.

   e. There are also several other standard STATA windows that appear periodically or that you can open yourself, but they will probably not have popped up when you started the program. These additional windows include the Log, the Data Editor, and the Do-File Editor windows, among others. You will learn more about them below.

3. You can re-adjust the size and placement of any, or all, of the visible windows so that they fit more beautifully and artistically on the screen by grabbing (left-click and delay release) their lower right corners with your mouse and tugging them around at your whim. You can maximize the STATA Results window to fill the remaining space.

II. Reading Raw Data into STATA

1. Data input is always the first step in a data analysis. There are as many ways to input your data into STATA as there are grains of sand in the universe, including input from a raw data file and input from a system data file.

2. Let's begin by inputting data from a raw data file, using the Tennessee STAR data. I have put several datasets that we will use in this tutorial on the S-290 website under "Other Resources". Minimize STATA and download the following four files to your desktop now:

   • star_publicqje.dta
   • STAR Data CSV.csv
   • STAR Data Tab.txt
   • NYVouchExpt_IVSubsample.dta

   These four files contain similar data in different formats (STATA, comma-delimited, and tab-delimited). You might find yourself using data in each of these formats. You can easily save data from most programs (including Excel) into comma- or tab-delimited formats. If you have data from another program (like SAS, SPSS, etc.), you can generally convert it to a STATA data file using programs like StatTransfer or DBMS Copy. You can find DBMS Copy in the same Research Applications menu where you found STATA. We refer to data in the STATA file format (.dta) as a system data file and data in the other formats as raw data files.

3. Locate and inspect the dataset. Double left-click on the "STAR Data Tab.txt" file and its contents will appear in the Windows Notepad editor. As you can see, I have extracted six variables from the larger dataset: student ID, sex, race, reading score, and two math scores. Each column is separated by a tab (hence, tab-delimited) and each row consists of a new student record. Notice that the first record has missing test scores. Close the data file.

4. There are several ways to input raw data into STATA. First, you can enter a data input command directly into the STATA Command window. You will need to know the correct "path" for your file: minimize STATA, right-click on the "STAR Data Tab.txt" file, and select "Properties". Then, highlight all of the text next to Location (probably something like "C:\Documents and Settings\hgseuser\Desktop") and copy it (Control-C). Return to STATA, place your arrow cursor in the STATA Command window, and left-click once on your mouse. A new cursor (now a vertical line) will appear in the window, indicating that STATA is ready to accept any command you give it. Carefully type in the following command, all on the same line, replacing **FILE PATH** with the path you copied by pasting it in (Control-V):

       insheet using "**FILE PATH**\STAR Data Tab.txt", tab clear

   and hit the Enter key to execute it. STATA should then read the STAR raw data into memory and be ready to analyze these data for you. The insheet command asks STATA to read in the raw data file whose filename is given in quotes further along the line (notice that you use the fully qualified filename, including the drive letter and relevant folders, and that it is contained within double quotes). The tab option tells STATA that the file is tab-delimited. The clear option, included at the end of the command after the comma, erases any data that may already be in memory before the new data are input. Most STATA commands have additional options that can be attached to them, after a comma, in this way.
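As a minimal sketch of the same step, the commands below first change STATA's working directory to the folder that holds the file, so that insheet can use the bare filename instead of the full path. The folder shown is only the example location quoted above and is an assumption; substitute the Location you copied. The cd and describe commands are standard STATA commands that this worksheet does not otherwise use. Lines that begin with * are comments.

```stata
* Assumed folder: replace with the Location you copied from Properties
cd "C:\Documents and Settings\hgseuser\Desktop"

* Read the tab-delimited raw data, erasing any data already in memory
insheet using "STAR Data Tab.txt", tab clear

* Show the variable names and storage types that STATA assigned
describe
```

Changing the working directory once can be less error-prone than pasting the full path into every command, but either approach reads the same data.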

5. Your variables now appear in the STATA Variables window. Notice that the window includes some additional information, like variable formats. As you can see, STATA has named the variables for you, based on the labels in the original file. You can also choose names for the variables as you are inputting your data. For example, with your cursor in the STATA Command window, hit the "Page Up" key. As you see, the previous line of code has reappeared (in the future, you can hit the "Page Up" and "Page Down" keys to navigate between earlier lines of code). Now modify this line by adding id sex race read1 math1 math2 immediately after insheet, as follows:

       insheet id sex race read1 math1 math2 using "**FILE PATH**\STAR Data Tab.txt", tab clear

   and hit Enter. You see that the variables in the STATA Variables window are now named as you have specified. Be careful with what you type, because STATA is case-sensitive. If you name your variables in lowercase here, in the insheet statement, then you must continue to type them in lowercase in all subsequent STATA code.

6. You can also input raw data into STATA using the file menus. In some cases, this approach is easier. First, clear the current data from memory by typing clear into the STATA Command window and hitting Enter. Then, select the File menu, left-click on Import, and left-click on ASCII data created by a spreadsheet. A pop-up window opens with several fields. Click on the button that says "Browse" and navigate to the Desktop where you stored your data files. Change the "Files of Type" setting at the bottom of the window to read "Comma Separated Values (.csv)". Select "STAR Data CSV.csv" and click "Open". Under "Delimiter", you can either select "comma-delimited data" or leave the default setting of "Automatically determine delimiter". Notice that you can also add new variable names if you prefer. Click "OK". STATA has now imported your dataset, and you should see the same six variables in the Variables window.
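The menu-based import described above also has a command-line counterpart. The sketch below is illustrative only: **FILE PATH** is the same placeholder used earlier (not a real path), and the comma option is assumed to stand in for the "comma-delimited data" setting in the dialog. Lines that begin with * are comments.

```stata
* Read the comma-delimited file, supplying your own lowercase names
insheet id sex race read1 math1 math2 using "**FILE PATH**\STAR Data CSV.csv", comma clear

* Read the same file again, this time keeping the names found in the file
insheet using "**FILE PATH**\STAR Data CSV.csv", comma clear
```

The second command is run last so that the variables end up with the names taken from the file itself.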
7. To list and inspect the data that you have just input, type the following command into the STATA Command window and hit Enter:

       list

   The dataset will then be listed in its entirety in the STATA Results window. Because the dataset contains nearly 12,000 records, there is not enough space in the window for the whole listing, so the listing will pause with the phrase "--more--". You will then have to either press the space bar to continue the listing, or click on the Break button at the top of the screen to interrupt the listing and kill it (the button is the red "X" icon at the extreme right end of the STATA standard toolbar). Try each of these approaches now. (If you don't want to keep pressing the space bar to continue the listing, you can type set more off into the STATA Command window and hit Enter. This command tells STATA to keep going rather than stopping each time it would display "--more--".)

8. Notice that, while you have been typing and executing these different commands in the STATA Command window, they have also been copied into the command diary contained in the Review window. This is useful not only because it reminds you of the analyses you have already completed, but also because you can place your mouse on an earlier command in the Review window, left-click once, and have the command reappear automatically in the STATA Command window. Then, you can execute the command again simply by hitting Enter, or you can edit it to create a new variant of the command, perhaps with new variables or new options. This can save you considerable time in the long run.

At this point, take your first foray into the on-line STATA Help system. This system can be accessed in several ways. Try it out now by getting on-line help about the insheet command:

   • First, seek help directly through the STATA Command window by typing the name of the command you want help with ("insheet"), preceded by the word "help", as in the following:

         help insheet

     and hit Enter. STATA opens a new window with a description of the command. The most important piece of information here comes under the Syntax heading. Here, you see the following:

         insheet [varlist] using filename [, options]

     This information tells you how to construct insheet commands. In the help window, words in boldface (such as insheet and using) are necessary parts of the command that must be entered as written. Information in italics (such as filename) can take on many forms. For example, you will enter a new filename each time you read in a new file. Words in brackets are optional. For example, you can add a varlist (a list of variable names), but we saw above that the command works even if you do not include this list. Similarly, the command has a variety of options that you can include (separated from the rest of the command by a comma). These options are enumerated immediately following the syntax statement. Here, you see the three options that we have used: you can specify the dataset as being tab-delimited or comma-delimited, and you can put "clear" after the comma to clear any dataset in memory. You can scroll down for more detailed information. This section also contains hyperlinks that you can follow to find out more about how to read data into STATA. For example, you can also input data into STATA using the infile command.

   • Second, if you don't know which STATA command to use, you can do a keyword search. For example, imagine that you want to enter data but do not know the commands insheet or infile. Type help enter into the STATA Command window and hit Enter. STATA will tell you that the help file for the command "enter" does not exist (because there is no such command) and will ask if you want to conduct a keyword search. If you click "yes", you will get a new help window that lists potentially useful commands. As you can see, the commands insheet, infile, and input are all listed here. You can explore these help files if you're interested.
   • Third, there is a great deal of help and information about STATA available on the Internet.

       o UCLA hosts one of the best-developed sites. If you're interested, you can even find on-line tutorials here that walk you through the program and explain how to conduct a variety of different analyses. You can find these tutorials under the "Classes and Seminars" and the "Learning Modules" links. In addition, the site answers a variety of questions about getting started with STATA and using its more advanced features.
         http://www.ats.ucla.edu/stat/stata/

       o STATA maintains a list of Frequently Asked Questions that might be helpful. They also offer on-line courses (for a fee):
         http://www.stata.com/support/faqs/

       o UNC also offers free on-line STATA information arranged by topic:
         http://www.cpc.unc.edu/services/computer/presentations/statatutorial

       o Finally, UNC also provides a useful reference called "A SAS User's Guide to STATA". It includes a link that matches SAS code to STATA code:
         http://www.cpc.unc.edu/services/computer/presentations/sas_to_stata
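The first two routes to help described above can also be typed as a short sequence of commands. This is a sketch only: search is STATA's keyword-search command, which the worksheet reaches indirectly through the failed help enter lookup, and the keyword used here is just an illustration rather than a recommended query. Lines that begin with * are comments.

```stata
* Open the help file for a command you already know
help insheet

* Follow a related command mentioned in that help file
help infile

* Run a keyword search when you do not know the command name
search spreadsheet
```

The results of a keyword search depend on the STATA version installed, so the list you see may differ from the one described in this worksheet.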

III. Exploring Data in STATA

1. After you have input your data, you will want to explore it. You should have the STAR data open in STATA. If not, use the procedures discussed above to input either the tab-delimited or the comma-delimited version.

2. You can have STATA conduct statistical analyses in a variety of ways, including:

   a. You can enter commands for the analyses that you would like to conduct into the STATA Command window and hit Enter.

   b. You can store multiple commands in a "Do" file and execute them later in "batch".

   c. You can select standard analyses from the Graphics and Statistics pull-down menus in the standard STATA toolbar at the top of the screen.

   Each of these approaches has its own advantages and disadvantages. In what follows immediately below, I ask you to focus on the first of these approaches. The second approach is described briefly in Section VII. You can investigate the use of the pull-down menus on your own, later. You will begin here by using approach (a), typing some simple commands directly into the STATA Command window as an extension of the inputting and listing that you completed in the section immediately above.

3. First, explore your data. You can do this in several ways:

   a. Type browse into the STATA Command window and hit Enter. STATA opens up a spreadsheet with your data. As you can see, you have six variables. Note that the first record lists tmathss1 as "."; STATA uses . to indicate missing values. Close this window by clicking on the "x" in the corner.

   b. You can also browse your data by clicking on the icon in the toolbar that looks like a spreadsheet with a magnifying glass.

   c. If you want to focus on certain observations, you can also use the command browse if. For example, type the following in the STATA Command window and hit Enter:

          browse if tmathss1>=600

      The data browser opens again, but it only includes observations where the value of tmathss1 is greater than or equal to 600.
      IMPORTANT: Notice that although STATA displays missing values as ".", it treats them as very large numbers, so observations with a missing tmathss1 also satisfy the condition above. If you want to view only observations where non-missing values of tmathss1 are greater than or equal to 600, you can type the following:

          browse if tmathss1>=600 & tmathss1~=.

      You can read this command as: "Browse all observations where tmathss1 is greater than or equal to 600 and where tmathss1 does not equal 'missing'". "If" statements can be extremely useful in data analysis. The expression after "if" is called the logical statement. You can use the following operators:

          >    greater than
          >=   greater than or equal to
          <    less than