CART. Tree-Structured Non-Parametric Data Analysis

Author: Shannon Shaw


CART

Tree-Structured Non-Parametric Data Analysis Classification and Regression Trees by Salford Systems A Robust Decision-Tree Technology for Data Mining, Predictive Modeling and Data Processing

8880 Rio San Diego Drive, Suite 1045 San Diego, California 92108, USA 619.543.8880 TEL 619.543.8888 FAX www.salford-systems.com Developers of CART, MARS and other award winning data mining and web mining software tools

CART Copyright

Copyright 2001, Salford Systems; all rights reserved worldwide. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language or computer language, in any form or by any means, electronic, mechanical, magnetic, optical, chemical, manual or otherwise without the express written permission of Salford Systems.

References

CART is a registered trademark of California Statistical Software, Inc. SYSTAT and SYGRAPH are registered trademarks of SYSTAT, Inc.

Citation

The proper citations for CART and this manual are:

Breiman, Leo, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Pacific Grove: Wadsworth, 1984.

Steinberg, Dan and Phillip Colla. CART: Tree-Structured Non-Parametric Data Analysis. San Diego, CA: Salford Systems, 1995.

Limited Warranty

Salford Systems warrant for a period of ninety (90) days from the date of delivery that, under normal use, and without unauthorized modification, the program substantially conforms to the accompanying specifications and any Salford Systems authorized advertising material; that, under normal use, the magnetic media upon which this program is recorded will not be defective; and that the user documentation is substantially complete and contains the information Salford Systems deem necessary to use the program. If, during the ninety (90) day period, a demonstrable defect in the program magnetic media or documentation should appear, you may return the software to Salford Systems for repair or replacement, at Salford Systems option. If Salford Systems cannot repair the defect or replace the software with a functionally equivalent software within sixty (60) days of Salford Systems receipt of the defective software, then you shall be entitled to a full refund of the license fee. Salford Systems, and California Statistical Software Inc. cannot and do not warrant that the functions contained in the program will meet your requirements or that the operation of the program will be uninterrupted or error free. Salford Systems, and California Statistical Software Inc. disclaim any and all liability for special, incidental, or consequential damages (including loss of profit) arising out of or with respect to the use, operation, or support of this supplement, even if Salford Systems has been apprised of the possibility of such damages.

Preface

This implementation of CART grew out of our desire to simplify the process of conducting tree-structured analyses. Coincidentally, at the time that we started thinking about how to improve the user interface, Leo Breiman approached Leland Wilkinson with a similar idea. Leo and Leland came to the conclusion that an implementation of CART with a true major statistics package interface would help to popularize this important new methodology, and subsequent discussions led to my taking responsibility for the project. The work turned out to be far more challenging and time consuming than we originally anticipated, but resulted in what we think is a major improvement for CART.

Much of what we have accomplished would not have been possible without the expert and timely assistance of others. First, I wish to thank Richard T. Carson Jr. for originally introducing me to CART. Next, Phil Colla and I thank the originators of CART, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone, who provided us with excellent suggestions as well as advice and insight concerning the art of tree-structured analysis. Leo Breiman suggested a number of program design improvements, and Richard Olshen answered numerous questions and provided encouragement throughout. Jerome Friedman programmed the original algorithms which lie at the heart of the CART computational engine. The graphical user interface was expertly implemented by Bernie Bernstein for both the MacOS and Windows versions.

We also benefited greatly from the comments of our beta testers. In particular, Steve Kutner brought us a seemingly endless stream of difficult problems, many of which forced us to learn new things about CART. Not all of those insights have managed to find their way into this manual and we hope to write further on these topics in the future. Richard T. Hoppe kindly allowed us to use an extract from an important cancer research data set for inclusion in the set of sample files on the distribution disks. Several people commented extensively on earlier versions of our documentation, including Padraic Neville and Gerry Dallal, and we thank them for their efforts.

For the past two years Audrey Cardell has been my technical editor, painstakingly working through all of my statistical writing. This project was no exception, and she managed to find a way to improve almost every page. If readers find this manual easy to understand, it is due in no small part to her first-class editing skills. Physical production of the manual also posed challenges which were admirably solved by David Tolliver. Finally, we wish to thank the many users of our earlier releases for providing us with valuable suggestions for improving the program and the manual.

Dan Steinberg San Diego State University


CART - Table of Contents

Preface
Introduction to CART
Installing and Running CART
  Running CART
  Preparing Your Data for CART
QUICK-START for Advanced Users
  A Complete Run
  Selecting Predictors
  Categorical Variables
  Selecting Subsets of Data
  Automatic Parameter Setting
  Splitting Rules
  Limiting Tree Growth
  Control Parameters
  Serule
  Complexity
  Surrogates
  Cost Estimation
  Test Samples
  Priors
  Misclassification Costs
CART Basics
  Representing Data Structure with Trees
  Reading the Tree
  CART Application Topics
  CART Methodology: A Brief Overview
  Splitting Rules
  Choosing a Split
  Class Assignment
  Pruning Trees
  Testing
  Cross Validation
  Does CART Really Work?
  Comments For Statisticians
  Advantages of CART
  Uses for CART
  What's Next?
Classification Trees
  A Simple Detailed Example
  A Multinomial Example
  Surrogates
  Association
  Node Detail
  Surrogate vs. Competitive
  Terminal Node Summary
  Classification Summaries
Selecting a Specific Tree
  PICK
  Complexity
Prior Probabilities
  Use of PRIORS
  Reflecting Stratified Samples
  Detailed Node Report
Misclassification Costs
  Using Priors to Reflect Costs
  Specifying Explicit Costs with MISCLASS
  Variable Mis-classification Costs
  Variable Cost Example
Estimation Error
  Measuring Error Rates
  Separate Test Files
  Random Subset Held Back for Test
  Specific Subset Held Back for Test
  Cross Validation
Complexity
  Cost Complexity
Splitting Rules
  Goodness-of-Split Criteria
  Gini
  The Twoing Criterion
  Ordered Twoing
  Forcing Splits
  Manual Forcing in the Root Node
  Finding the Best Split for All Variables
  Alternative Forcing Methods
  Class Probability Trees
  Linear Combination Splits
Variable Importance
  Standard CART Importance Measure
  Alternative Importance Measures
Memory Management
  Memory Command
  How Much Memory?
  ADJUST
  Accuracy
Saving, Printing, and Viewing Trees
  Using allCLEAR
CASE: Using Trees to Classify
  Prediction and Diagnostics
  Examples of Output
  Contents of the SAVE File
  Analyzing CART Results
Regression Trees
  Wage Equation Example
  Complexity Parameter in Regressions
  Least Absolute Deviation
  LS vs LAD: Which should you use?
  Cautionary Notes
  True Linear Models
  Within-Node Accuracy
  Heteroscedasticity
  Cross-Validation Caveats
The Art of Tree Growing
  Automatic Model Building
  Variable Selection
How Long Will a CART Model Run?
  Running Times
  Categorical Predictors
  Hardware and Software
CART Data Transformation Language
  Building A Tree
  Dropping Data Down A Tree
  Selection of Cases
Trouble Shooting
  No Tree
  Memory
  High Costs
  CV Breakdown
  Costs
  Cannot Predict
  Too Many Levels
  Overly Large Tree
Roadmap of CART Analyses
Appendix I: Command Reference
  The ADJUST Command
  The BOPTIONS Command
  The BUILD Command
  The CASE Command
  The CATEGORY Command
  The CDF Command
  The CHARSET Command
  The COMBINE Command
  The ECHO Command
  The ERROR Command
  The EXCLUDE Command
  The FORMAT Command
  The HELP Command
  The HIST Command
  The IDVAR Command
  The KEEP Command
  The LIMIT Command
  The LINEAR Command
  The LOPTIONS Command
  The MEMORY Command
  The METHOD Command
  The MISCLASS Command
  The MODEL Command
  The NAMES Command
  The NEW Command
  The NOTE Command
  The OPTIONS Command
  The OUTPUT Command
  The PAGE Command
  The PICK Command
  The PRINT Command
  The PRIORS Command
  The QUIT Command
  The REM Command
  The SEED Command
  The SELECT Command
  The SUBMIT Command
  The TREE Command
  The USE Command
  The WEIGHT Command
  The XYPLOT Command
Built-in BASIC
  Getting Started
  FOR...NEXT
  DIM
  DELETE
  OPERATORS
  MISSING VALUES
  Filtering the Data Set, or Splitting the Data Set
  Advanced Programming Features
  The IF…THEN Statement
  The LET Statement
  The ELSE Statement
  The FOR...NEXT Statement
  The DIM Statement
  The DELETE Statement
  The GOTO Statement
  The STOP Statement
Appendix III: Bibliography

Introduction to CART

CART is a new, advanced tool for tree-structured data analysis. Although the theory and the mathematical algorithms of the technique are quite complex, the CART program requires no special training to use. CART uses a decision tree to display how data may be classified or predicted. Through a series of “yes/no” questions concerning database fields, CART automatically searches for important relationships and uncovers hidden structure even in highly complex data. CART is often used to select a manageable number of core measures from databases with hundreds of variables.

Because CART works automatically, even on complex data sets, producing results that are easy to understand, it is being used increasingly in medical, marketing, environmental, banking and commercial applications. In the last ten years, several hundred scholarly articles have referred to the CART methodology.

The idea of using tree-structured classifiers to analyze data goes back at least to the 1960s and has been implemented in several pieces of software, including AID (Morgan and Sonquist, 1969) and CHAID (Kass, 1980). The technique offers a powerful method to assess the reliability of new data predictions, but early programs sometimes yielded substantially erroneous conclusions. The CART methodology was developed to address these shortcomings.

The authors of the original CART monograph and the developers of its computational algorithms are among the world’s most highly regarded statisticians. Leo Breiman is Professor of Statistics at the University of California, Berkeley (now retired); Jerome Friedman is Professor of Statistics at Stanford University and Head of the Computation Research Group at the Stanford Linear Accelerator Center; Richard Olshen is Professor of Biostatistics at the Stanford University School of Medicine; and Charles Stone is Professor of Statistics at the University of California, Berkeley. The combined efforts of these pioneers in theoretical and applied statistics and statistical computing have led to a procedure with power and reliability unmatched in other data mining and machine learning tools.

This implementation of CART provides a completely new interface, incorporating all the convenience features found in a major statistics package system. You can now conduct a complete CART analysis with as few as three or four mouse clicks or short commands. Even advanced features in CART are easily accessed through simple commands.

This manual assumes no prior knowledge of the methodology underlying CART or familiarity with the output. The main body of the manual contains an extensive discussion of how to use the technique and interpret the results. Concise technical information for each command appears in the reference section at the end of the manual. The next section provides a very brief account of all commands and options for the experienced CART user. If you are new to CART, you will want to begin with the section on CART BASICS and return to the QUICK START for Advanced Users for an overview at a later time.

NOTE: This volume documents the command mode of CART only. The command mode is available on all platforms, including DOS, Windows, UNIX, and the MacOS. For Windows 3.1, Windows 95, Windows NT, and subsequent releases of Microsoft Windows operating systems, please also consult the separate Windows documentation which covers the use of menus, mouse control, and the complete graphical interface.


Installing and Running CART

Installation instructions that are specific to the type of computer and operating system you are using have been included with your CART package. As these instructions are different for DOS, Windows, Macintosh and UNIX systems, please refer to these separate materials for further information.

Running CART

For DOS, extended DOS, and UNIX versions, invoke CART by typing the command

    CART

at the operating system prompt. (To invoke the program from any directory you will need to include CART’s location in your path.) For Windows and Macintosh platforms, double-click on the CART icon. In all cases, you can set CART up to run in batch mode; instructions for using batch mode are discussed later in the manual.

Preparing Your Data for CART

CART reads and writes data files using the SYSTAT file format. To get your data into this format, you can either use the SYSTAT statistical package or convert your data using one of the file translation utilities accompanying the CART distribution. See the appendix on file formats for further details.


QUICK-START for Advanced Users

This section of the documentation, for users familiar with either mainframe implementations of CART or classification tree methodology, is intended to help you get up and running with Salford Systems CART as rapidly as possible. Please consult later sections of this manual for detailed examples and explanations of the commands and the reference section for precise statements of each command's syntax and options.

A Complete Run

The first feature to note about Salford Systems CART is that all data management functions are taken care of by CART. Once you have your data file, you can conduct a classification analysis in CART with as few as four commands. For example:

    CART                      (typed at the operating system prompt)
    USE IRIS
    CATEGORY SPECIES = 3
    MODEL SPECIES
    BUILD

will generate a complete CART run. The first command launches the application from an operating system prompt. (The application can also be launched via mouse clicks in Macintosh, Windows, and other graphical environments.) Once inside the application, the USE command gets the input data, the MODEL command specifies the dependent variable, and BUILD does the work. Because Salford Systems CART treats all variables as numeric, the CATEGORY command is needed to identify categorical variables.

When the dependent variable is numeric, CART grows a regression tree using an optional least squares criterion, and when the dependent variable is categorical, CART grows a classification tree using the optional Gini diversity index. Unless specified otherwise, the maximal tree grown is one for which no further splitting is possible or where the terminal nodes contain fewer than 10 cases. The defaults include unit misclassification costs for classification trees and, for all problems, printing of 10 trees from the sequence of trees grown, 10-fold cross validation for smaller sets, and up to 5 surrogate and competitor splits at each node. These defaults may all be changed to suit your analysis.

Selecting Predictors

Unless specified otherwise, Salford Systems CART takes all noncharacter variables in the data set as candidate predictors. Thus, the command

    MODEL WAGE

directs CART to predict the target variable WAGE using every variable found in the analysis file. An explicit independent variable list can be provided on a MODEL statement, such as:

    MODEL WAGE = AGE, YRSEDUC, NKIDLE17, REGION

Alternatively, all variables except a set to be excluded can be specified with the EXCLUDE command, as in:

    EXCLUDE CASEID, STATUS, MARITAL

Categorical Variables

So long as an independent categorical variable is listed on a CATEGORY command, CART will automatically determine the number of levels of the variable and its minimum and maximum values. Thus, the command

    CATEGORY RACE, REGION, SEX, OCCUP

is sufficient to declare these variables as categorical. However, you may optionally specify a number of levels and minima (see the reference section for details).

Selecting Subsets of Data

The data management and information services of CART include handling of missing values, selection of subsets of cases for analysis, generation of new variables on the fly, integrated low resolution graphics, and direct access to other statistical modules. To select a subset of cases you can use the SELECT command, as in:

    SELECT REGION = 9

More complex selection criteria can be set up using the integrated DATA transformation language, described in detail elsewhere in the manual.

Automatic Parameter Setting

Control over every aspect of tree growing is available, should it prove necessary. Nevertheless, a major convenience feature of our implementation of CART is that most options are set automatically. Intelligent estimates of the dimensions of the maximal tree (depth of the tree, total number of nodes, number of linear and categorical splits) are calculated for you when the data are first read. Also, memory requirements are estimated, and certain parameters may be dynamically reset to squeeze large problems into a limited workspace.

Splitting Rules

Splitting rules are specified with the METHOD command.

    METHOD = GINI
    METHOD = SYMGINI
    METHOD = TWOING
    METHOD = ORDERED
    METHOD = LAD
    METHOD = TWOING, POWER = 1

are examples of how the criterion can be changed.
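To make the default classification criterion concrete, the following Python sketch computes the Gini diversity index of a node and the reduction in impurity achieved by a candidate split. It is an illustration only, not the CART engine: the function names are hypothetical, and it assumes class proportions are taken directly from the case counts, ignoring priors and misclassification costs.

```python
from collections import Counter

def gini(labels):
    """Gini diversity index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_improvement(left, right):
    """Decrease in Gini impurity when a parent node is split into left and right children."""
    parent = left + right
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical two-class node: a split that separates the classes fairly well.
left = ["a"] * 8 + ["b"] * 2
right = ["a"] * 1 + ["b"] * 9
print(gini(left + right), split_improvement(left, right))
```

A split search would evaluate this improvement for every candidate question and keep the largest; the other METHOD settings substitute a different goodness-of-split measure.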


Limiting Tree Growth

Several dimensions of the size of a tree-growing problem can be controlled with the LIMIT command. To prevent tree growth beyond a depth of 7 levels, try

    LIMIT DEPTH = 7

LIMIT can also be used to set the smallest node size suitable for splitting (ATOM), to set the number of cases to include in the learning sample (LEARN) and test sample (TEST), to determine when nodes are large enough to use subsampling (SUBSAMPLE), and to determine the number of terminal nodes to allow in the maximal tree (NODES).

Control Parameters

More advanced control parameters are accessed via the BOPTIONS (for BUILD OPTIONS) command.

Serule

The BOPTION SERULE = s command is used to select an optimal tree following cross validation or use of a test sample. BOPTION SERULE = 0 chooses the tree with the smallest costs. Any nonnegative value can be selected for this parameter. BOPTION SERULE = 1 selects the smallest tree within one standard error of the minimum cost tree.
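The selection rule described above can be sketched in Python as follows. This is an illustration under stated assumptions: each tree in the pruned sequence is represented by a hypothetical tuple of (terminal nodes, estimated cost, standard error of that cost), and the numbers are invented for the example.

```python
def pick_tree(trees, se_rule=1.0):
    """Return the smallest tree whose cost is within se_rule standard errors of the minimum.

    trees: list of (terminal_nodes, cost, standard_error) tuples (hypothetical layout).
    """
    best = min(trees, key=lambda t: t[1])          # minimum-cost tree
    threshold = best[1] + se_rule * best[2]        # allowed cost ceiling
    eligible = [t for t in trees if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])       # fewest terminal nodes wins

# Invented tree sequence: (terminal nodes, cross-validated cost, standard error)
trees = [(12, 0.30, 0.03), (7, 0.28, 0.03), (4, 0.30, 0.03), (2, 0.40, 0.04)]
print(pick_tree(trees, se_rule=0))   # minimum-cost tree
print(pick_tree(trees, se_rule=1))   # smallest tree within one standard error
```

With these figures, SERULE = 0 keeps the 7-node tree, while SERULE = 1 accepts the simpler 4-node tree because its cost lies within one standard error of the minimum.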

Complexity

BOPTION COMPLEXITY = c, where c is the initial complexity, sets an upper limit on the size of the maximal tree grown. The default value of 0 grows the largest possible tree, and the larger the value of c the smaller the maximal tree. This is one of the most important BOPTIONs, as it can save computation time and limit the amount of output in exploratory trees.
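The effect of a complexity value on tree size can be illustrated with the standard cost-complexity measure from Breiman et al. (cost plus c times the number of terminal nodes). The Python sketch below uses invented costs and hypothetical function names; exactly how the program applies the initial complexity while growing the tree is not described in this section, so treat this as an illustration of the trade-off only.

```python
def penalized_cost(cost, terminal_nodes, c):
    """Cost-complexity measure: tree cost plus a penalty of c per terminal node."""
    return cost + c * terminal_nodes

def best_size(trees, c):
    """Number of terminal nodes of the tree minimizing the penalized cost.

    trees: list of (terminal_nodes, cost) tuples; ties go to the smaller tree.
    """
    return min(trees, key=lambda t: (penalized_cost(t[1], t[0], c), t[0]))[0]

# Invented sequence: larger trees have lower cost on the learning sample.
trees = [(1, 0.50), (3, 0.30), (6, 0.21), (12, 0.15)]
for c in (0.0, 0.02, 0.05, 0.2):
    print(c, best_size(trees, c))
```

At c = 0 the largest tree is preferred; as c increases, progressively smaller trees minimize the penalized cost.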

Surrogates

The remaining BOPTIONS include COMPETITORS = t, the number of competitor splits to print; SURROGATES = s, the number of surrogates to track; and TREELIST = m, the number of trees to include in the summary list printed at the beginning of every run. Details are provided throughout the manual.

Cost Estimation

The next critical command is ERROR, which selects the method for determining the optimal tree. Several options are available:

    ERROR EXPLORE

grows an exploratory tree and prints the maximal tree reached. Unless otherwise limited, this will be the largest possible tree that can be grown given the data and the ATOM size.

    ERROR CV = 10

is the default that requests 10-fold cross validation. The tree printed in detail is the optimal tree determined by the cross-validated cost and the SERULE.
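As a reminder of what 10-fold cross validation involves, the following Python sketch partitions case indices into k folds, each of which serves once as the test set while the remaining cases form the learning set. The function name and the random assignment scheme are assumptions made for illustration; the program's own fold assignment may differ.

```python
import random

def cv_folds(n_cases, k=10, seed=0):
    """Return k (learn_indices, test_indices) pairs covering every case exactly once as test."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)            # random but reproducible order
    folds = [indices[i::k] for i in range(k)]       # k nearly equal-sized folds
    all_cases = set(indices)
    return [(sorted(all_cases - set(fold)), sorted(fold)) for fold in folds]

splits = cv_folds(25, k=10)
for learn, test in splits[:2]:
    print(len(learn), len(test))
```

Each of the k auxiliary trees is grown on a learning set and scored on its held-out fold; the fold-level costs are then combined to estimate the cost of each tree in the sequence grown on all the data.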

Test Samples

A test sample can be selected in one of three ways:

    ERROR FILE = testfile

will use the data in TESTFILE.SYS to independently assess the error rate (adjusted by misclassification costs if necessary). Alternatively,

    ERROR PROPORTION = p

lets CART randomly set aside a proportion of the input data for subsequent testing. Finally,

    ERROR SEPVAR = varname

specifies that any case with VARNAME = 1 is to be set aside for testing.
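The last two options can be pictured with a short Python sketch. It assumes each case is a dictionary of variable values and that the random hold-out is decided case by case with probability p; both are assumptions for illustration, as are the function names and the HOLDOUT variable.

```python
import random

def split_by_proportion(cases, p, seed=0):
    """Randomly set aside roughly a proportion p of the cases as the test sample."""
    rng = random.Random(seed)
    learn, test = [], []
    for case in cases:
        (test if rng.random() < p else learn).append(case)
    return learn, test

def split_by_sepvar(cases, varname):
    """Set aside every case whose indicator variable equals 1 as the test sample."""
    test = [c for c in cases if c.get(varname) == 1]
    learn = [c for c in cases if c.get(varname) != 1]
    return learn, test

# Hypothetical data: every fourth case is flagged for the test sample.
cases = [{"ID": i, "HOLDOUT": 1 if i % 4 == 0 else 0} for i in range(20)]
learn, test = split_by_sepvar(cases, "HOLDOUT")
print(len(learn), len(test))
```

The indicator-variable approach gives a reproducible test sample that you control, while the proportion approach is convenient when any random subset will do.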


Priors

By default, CART treats a categorical dependent variable as if it were uniformly distributed in the general population. If you need to override this default, you can use:

    PRIORS DATA

which specifies that the relative frequencies in the data are to be taken as population relative frequencies. You can also specify an explicit distribution such as:

    PRIORS = .2, .3, .5

See the manual for details, and other PRIORS options such as PRIORS LEARN, PRIORS TEST, PRIORS MIX.
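To see why the choice of priors matters, the Python sketch below computes within-node class probabilities using the formulation in Breiman et al., in which each class's count in the node is divided by its count in the whole learning sample and weighted by its prior. The function name and the data are hypothetical, and the sketch ignores misclassification costs.

```python
def node_class_probs(node_counts, class_totals, priors):
    """Within-node class probabilities adjusted for prior probabilities.

    node_counts:  cases of each class falling in the node
    class_totals: cases of each class in the whole learning sample
    priors:       assumed population probability of each class
    """
    weights = {j: priors[j] * node_counts.get(j, 0) / class_totals[j] for j in priors}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Hypothetical learning sample: class 2 is rare (100 of 1000 cases).
class_totals = {1: 900, 2: 100}
node_counts = {1: 30, 2: 20}

equal = node_class_probs(node_counts, class_totals, {1: 0.5, 2: 0.5})
data = node_class_probs(node_counts, class_totals, {1: 0.9, 2: 0.1})
print(equal, data)
```

Under equal priors the node is dominated by the rare class 2, since it holds a far larger share of all class 2 cases; with priors taken from the data the same node is assigned to class 1, the majority by raw count.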

Misclassification Costs

Finally, variable misclassification costs can be specified. By default, all errors have a weight of 1. The commands:

    MODEL SPECIES
    CATEGORY SPECIES = 3
    MISCLASS COST = 1.5 CLASS 2 AS 1,3 /,
             COST = 5 CLASS 1 AS 3

indicate that some misclassification errors are worse than others. Here CART will weight any misclassification of a class 2 object by a factor of 1.5, and will weight misclassifications of class 1 objects as class 3 by a factor of 5.

Now, grow some trees!

This concludes our very brief overview of the major CART commands. Please be sure to read the detailed discussion in the remainder of the manual. If you are an experienced user of the original CART, keep in mind that a number of enhancements have been added to help you obtain and interpret your results.
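The misclassification costs in the MISCLASS example above can be written as a cost matrix, which the following Python sketch builds and then uses to assign a node to the class with the lowest expected cost. The function names and the node probabilities are hypothetical, and the minimum-expected-cost assignment is the textbook rule rather than a description of the program's internals.

```python
def cost_matrix(classes, overrides):
    """Unit costs off the diagonal, zero on it, with selected (true, predicted) overrides."""
    cost = {i: {j: (0.0 if i == j else 1.0) for j in classes} for i in classes}
    for (true_cls, pred_cls), value in overrides.items():
        cost[true_cls][pred_cls] = value
    return cost

def min_cost_class(probs, cost):
    """Return the class with the smallest expected misclassification cost, plus all expected costs."""
    expected = {j: sum(probs[i] * cost[i][j] for i in probs) for j in probs}
    return min(expected, key=expected.get), expected

classes = [1, 2, 3]
unit = cost_matrix(classes, {})
# COST = 1.5 CLASS 2 AS 1,3 and COST = 5 CLASS 1 AS 3
custom = cost_matrix(classes, {(2, 1): 1.5, (2, 3): 1.5, (1, 3): 5.0})

probs = {1: 0.2, 2: 0.3, 3: 0.5}   # invented within-node class probabilities
print(min_cost_class(probs, unit)[0], min_cost_class(probs, custom)[0])
```

With unit costs the node goes to class 3, the most probable class; with the costs above it is assigned to class 2, because calling a class 1 case class 3 has become expensive.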


CART Basics

Representing Data Structure with Trees

CART is an acronym for Classification and Regression Trees, a statistical procedure introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in 1984. As the name suggests, CART is a single procedure that can be used to analyze either categorical (classification) or continuous data (regression) using the same technology. A defining feature of CART is that it presents its results in the form of decision trees, a significant departure from more traditional statistical analysis procedures. The tree structure of the output allows CART to handle massively complex data while producing diagrams that are easy to understand. In the spirit of exploratory data analysis, CART is a method that communicates by pictures.

As an example, consider the decision tree generated for heart attack patients at a major teaching hospital (Figure 1 below). The medical staff needed to accurately determine whether a patient was at high risk of a second lethal heart attack as quickly and simply as possible. A high-risk patient would be placed in an intensive care unit for constant monitoring while a low-risk patient could remain on a standard medical unit.

The data for the analysis consisted of the medical records of 215 patients. Each record contained 19 measures taken within the first 24 hours following admission, including basic variables such as blood pressure, age, and presence of sinus tachycardia, and more elaborate measures such as enzyme concentrations based on blood work. Within 30 days following admission, 37 patients died (the high-risk group) while 178 survived (the low-risk group) (see Gilpin, Olshen, Henning and Ross, 1983). Based on a systematic analysis of these data, CART produced the following classification scheme, drawn in the form of an inverted tree and read like a flow chart.


Reading the Tree

[Figure 1: The Binary Decision Tree. Root node: High Risk 17%, Low Risk 83%; the first split is a question about BP.]


Appendix I: Command Reference

The SUBMIT Command

Purpose
Specifies a file from which command input is to be read.

Syntax
    SUBMIT <filename>

Remarks
Filename is given the extension .CMD in the SUBMIT command. This extension default may be overridden by placing quotes around the filename and explicitly giving a file extension. The default is to read commands from the keyboard.

Examples:
The file READIT.CMD might contain a number of commands that set up a particular analysis environment:

    USE DATAFIL2
    LOPTION MEANS, TIMING

CART would then use this input sequence with the command:

    SUBMIT READIT


The TREE Command

Purpose
When used with the BUILD command, creates files in which tree information will be saved for later viewing, printing or application of the decision rules to new data. When used with the CASE command, obtains information about the previously constructed tree, such as the variable names and categorical levels, decision rules, the dependent variable, etc.

Syntax
    TREE <filename> [SKELETON|VARSONLY|CLASS|STATS|SPLITS|TABLES/SCRIPT]

The options are used only when first BUILDing a tree and specify the degree of detail stored for tree viewing and printing.

    SKELETON    only node numbers (both terminal and nonterminal).
    VARSONLY    node numbers and variables involved in a split.
    CLASS       node, variables, class assignment or mean/median.
    SPLITS      same as CLASS but with full split criteria.
    STATS       same as SPLITS but with N, SD / MAD.
    TABLES      same as STATS but with classification breakdown.

The SCRIPT option specifies that an allCLEAR script file should be created. It will carry the extension .ACL.

Remarks
Filename may not be given an extension in the TREE command, nor may quotes be used around the filename.

Examples:

    TREE SMOKDAT4
    TREE SURGEON
    TREE TOP SPLITS/SCRIPT

When BUILDing a tree, the default is:

    TREE VARSONLY

To use a previously saved tree for prediction with CASE, just list the tree filename. For example:

    USE NEWDATA
    TREE TOP
    SAVE PREDICTD
    CASE


The USE Command

Purpose
Opens a SYSTAT format file for analysis, and lists the variable names in the file.

Syntax
    USE <filename>

Remarks
Filename is given the extension .SYS in the USE command. This extension default may be overridden by placing quotes around the filename and explicitly giving a file extension. You will need to enclose lowercase filenames in quotes on UNIX platforms. Within CART, all non-quoted names are read as uppercase.

Examples:

    USE DATAFILE


The WEIGHT Command

Purpose
Identifies a case-weighting variable. This command is not functional in CART, except for the XYPLOT and HIST commands.

Syntax
    WEIGHT=<variable>

in which variable is a SYSTAT variable present on the SYSTAT data set specified in the latest USE command.

Remarks
Conventional weighting can be mimicked in CART by replicating selected cases or using the PRIORS command.

Examples:
The variable "WEIDER" is to be used as the case-weighting variable:

    USE DUMBBELL
    WEIGHT=WEIDER


The XYPLOT Command

Purpose
Produces low resolution 2-D scatter plots of variables on the current USE dataset.

Syntax
    XYPLOT <yvar> [, <yvar>, <yvar>] * <xvar> [/FULL, TICKS|GRID, WEIGHTED, NORMALIZED, BIG]

in which <xvar> and <yvar> are current database variables.

Remarks
The plot typically is a half screen high. The FULL and BIG options will increase it to a full screen (24 lines) and a full page (60 lines), respectively. TICKS and GRID add two kinds of horizontal and vertical grids to the plot. WEIGHTED specifies that the current WEIGHT variable (if one has been specified) should be used (by default it is not used). NORMALIZED rescales axes to the (0,1) interval. Only numerical variables may be specified. Only one x variable may be specified.

Examples:

    XYPLOT IQ * AGE/FULL, GRID
    XYPLOT LEVEL(4-7) * INCOME/NORMALIZED
    XYPLOT AGE, WAGE, INDIC * DEPVAR(2)/WEIGHTED


Appendix II: Built-in BASIC

Built-in BASIC

Biomarker contains an integrated implementation of a complete BASIC programming language for transforming variables, creating new variables, filtering cases, and database programming. Because the programming language is directly accessible anywhere in Biomarker and our other modules, you can perform a number of database management functions without invoking the data step of another program. The BASIC transformation language allows you to modify your input files on the fly while you are in an analysis module and to save permanent copies of your changed data in ASCII. We expect users will find that they can accomplish almost any required data manipulation involving a single data file.

Although this integrated version of BASIC is much more powerful than the simple variable transformation functions sometimes found in other statistical procedures, it is not meant to be a replacement for more comprehensive data steps found in general-use statistics packages. At present, integrated BASIC does not permit the merging or appending of multiple files, nor does it allow processing across observations. In Biomarker the programming workspace for BASIC is limited and is intended for on-the-fly data modifications of 20 to 40 lines of code (though custom large memory versions will accommodate larger BASIC programs). For more complex or extensive data manipulation use the large workspace for BASIC in ASCII or your preferred database management software.

The next section describes what you can do with BASIC and provides simple examples to get you started. The appendix provides formal technical definitions of the syntax.


Getting Started

Your BASIC program will consist of a series of statements which all begin with a % sign. These statements could comprise simple assignment statements that define new variables, conditional statements that delete selected cases, iterative loops that repeatedly execute a block of statements, and complex programs with the flow control provided by GOTO statements and line numbers. Thus, somewhere before a HOT! Command such as BUILD, CASE, ESTIMATE or RUN in a Salford module, you might type: % LET BESTMAN = WINNER % IF MONTH=8 THEN LET GAMES = BEGIN % ELSE IF MONTH>8 LET GAMES = ENDED % LET ABODE= LOG (CABIN) % DIM COLORS(10) % FOR I= 1 TO 10 STEP 2 % LET COLORS(I) = Y * I % NEXT % IF SEX$="MALE" THEN DELETE

The % symbol appears only once at the beginning of each line of BASIC code; it should not be repeated anywhere else on the line. You can leave a space after the % symbol or you can start typing immediately; BASIC will accept your code either way. Our programming language uses standard statements found in many dialects of BASIC. These include:

LET Assign a value to a variable. The form of the statement is:

% LET variable = expression

IF...THEN Evaluates a condition and if it is true executes the statement following the THEN. The form is:

% IF condition THEN statement


ELSE Can immediately follow an IF...THEN statement to specify a statement to be executed when the preceding IF condition is false. The form is:

% IF condition THEN statement
% ELSE statement

Alternatively, ELSE may be combined with other IF…THEN statements:

% IF condition THEN statement
% ELSE IF condition THEN statement
% ELSE IF condition THEN statement
% ELSE statement

FOR...NEXT Allows for the execution of the statements between the FOR statement and a subsequent NEXT statement as a block. The form of the simple FOR statement is:

% FOR
% statements
% NEXT

For example, you might want to execute a block of statements only if a condition is true, as in:

%IF WINE=COUNTRY THEN FOR
%LET FIRST=CABERNET
%LET SECOND=RIESLING
%NEXT

When an index variable is specified on the FOR statement, the statements between the FOR and NEXT statements are looped through repeatedly while the index variable remains between its lower and upper bounds.


% FOR [index variable and limits]
% statements
% NEXT

The index variable and limits form is:

FOR I= start-number TO stop-number [ STEP = stepsize ]

where I is an integer index variable that is increased from start-number to stop-number in increments of stepsize. The statements in the block are processed first with I = start-number, then with I = start-number + stepsize, and repeated until I >= stop-number. If STEP=stepsize is omitted, the default is to step by 1. Nested FOR…NEXT loops are not allowed.
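As an illustration only, the index sequence described above can be mimicked in Python. The helper name for_index_values is hypothetical and is not part of integrated BASIC. The sketch assumes the stop value itself is processed when the index lands on it exactly, which is consistent with the FOR I= 1 TO 10 example in this appendix filling all ten array elements.

```python
def for_index_values(start, stop, step=1):
    """Return the values an index would take in FOR I = start TO stop STEP step.

    Assumption (not confirmed by the manual): the stop value is processed
    when the index reaches it exactly, and the loop ends as soon as the
    next value would pass stop.
    """
    if step <= 0:
        raise ValueError("this sketch only covers positive step sizes")
    values = []
    i = start
    while i <= stop:
        values.append(i)
        i += step
    return values


print(for_index_values(1, 10, 2))  # [1, 3, 5, 7, 9]
print(for_index_values(1, 5))      # [1, 2, 3, 4, 5]
```

Under this reading, a loop written with STEP 2 over 1 TO 10 would touch only the odd-numbered elements of an array.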


DIM Creates an array of subscripted variables. For example, a set of 5 scores could be set up with:

% DIM SCORE(5)

This creates the variables SCORE(1), SCORE(2), …, SCORE(5). The size of the array must be specified with a literal integer up to a maximum size of 99; variable names may not be used. You can use more than one DIM statement, but be careful not to create so many large arrays that you exceed the maximum number of variables allowed (currently 1024).


DELETE Deletes the current case from the data set.


OPERATORS

The table below lists the operators that can be used in expressions in BASIC statements. Operators are evaluated in the order they are listed in each row with one exception: a minus sign before a number (making it a negative number) is evaluated after exponentiation and before multiplication or division. The "<>" is the "not equal" operator.

BASIC has five built-in variables available for every data set. You can use these variables in flow control and create new variables from them. You should NEVER attempt to redefine them or change their values directly.


Integrated BASIC also has a number of mathematical and statistical functions. The statistical functions can take several variables as arguments and automatically adjust for missing values. Only numeric variables may be used as arguments. The general form of the function is:

FUNCTION(variable, variable, ...)


Integrated BASIC also includes a collection of probability functions that can be used to determine probabilities and confidence level critical values, and to generate random numbers. The following table shows the distributions and any parameters that are needed to obtain values for either the random draw, the cumulative distribution, the density function, or the inverse density function. Every function name is composed of three letters:

Key-Letter: The first letter identifies the distribution.
Distribution-Type Letters: RN (random number), CF (cumulative), DF (density), IF (inverse).


These functions are invoked with either 0, 1, or 2 arguments as indicated in the table above, and return a single number, which is either a random draw, a cumulative probability, a probability density, or a critical value for the distribution. We will illustrate the use of these functions with the chi-square distribution. To generate 10 random draws from a chi-square distribution with 35 degrees of freedom for each case in your data set:

% DIM CHISQ(10)
% FOR I= 1 TO 10
% LET CHISQ(I)=XRN(35)
% NEXT

To evaluate the probability that a chi-square variable with 20 degrees of freedom exceeds 27.5:

%LET CHITAIL=1 - XCF(27.5, 20)

The chi-square density for the same chi-square value is obtained with:

%LET CHIDEN=XDF(27.5, 20)

Finally, the upper 5% point (the 95th percentile) of the chi-square distribution with 20 degrees of freedom is calculated with:

%LET CHICRIT=XIF(.95, 20)
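If you want to sanity-check values of this kind outside the program, the three chi-square quantities above can be approximated with a few lines of Python. This is a rough sketch and not the algorithm used by integrated BASIC: the function names are hypothetical, the tail formula is a closed form that is valid only for even degrees of freedom, and the critical value is found by simple bisection.

```python
import math


def chi2_sf_even(x, df):
    """Upper-tail probability P(X > x) for a chi-square with even df.

    Closed form: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)**i / i!
    Valid only when df is an even positive integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires even, positive df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0          # i = 0 term
    for i in range(1, df // 2):     # i = 1 .. df/2 - 1
        term *= half / i
        total += term
    return math.exp(-half) * total


def chi2_pdf(x, df):
    """Chi-square density, computed on the log scale for stability."""
    if x <= 0:
        return 0.0
    k = df / 2.0
    log_pdf = (k - 1) * math.log(x) - x / 2.0 - k * math.log(2.0) - math.lgamma(k)
    return math.exp(log_pdf)


def chi2_critical_even(p, df):
    """Value c with P(X <= c) = p, found by bisection (even df only)."""
    lo, hi = 0.0, 1.0
    while 1.0 - chi2_sf_even(hi, df) < p:   # expand until hi brackets c
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 1.0 - chi2_sf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


chitail = chi2_sf_even(27.5, 20)          # analogue of 1 - XCF(27.5, 20)
chiden = chi2_pdf(27.5, 20)               # analogue of XDF(27.5, 20)
chicrit = chi2_critical_even(0.95, 20)    # analogue of XIF(.95, 20)
print(chitail, chiden, chicrit)
```

Published chi-square tables give 31.41 as the upper 5% point for 20 degrees of freedom, which is a convenient check on the last value.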

MISSING VALUES

The system missing value is stored internally as the largest negative number allowed. Missing values in BASIC programs and printed output are represented with a period or dot ("."), and missing values can be generated and their values tested using standard expressions.


Thus, you might type:

%IF NOSE=LONG THEN LET ANSWER=.
%IF STATUS=. THEN DELETE

Missing values are propagated so that most expressions involving variables that have missing values will themselves yield missing values. One important fact to note: because the missing value is technically a very large negative number, the expression X < 0 will evaluate as true if X is missing.

BASIC statements included in your command stream are executed when a HOT! command such as BUILD, CASE, or RUN is encountered; thus, they are processed before any estimation or tree building is attempted. This means that any new variables created in BASIC are available for use in MODEL and KEEP statements, and any cases that are DELETEd via BASIC will not be used in the analysis.
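The comparison pitfall noted above (X < 0 being true for a missing X) can be pictured with a small Python analogy. Everything here is an assumption for illustration: the manual says only that the missing value is the largest negative number allowed, so the constant MISSING and the helper names below are stand-ins, not the program's actual internals.

```python
import sys

# Assumed stand-in for the system missing value. The real internal value
# is not given in the manual; this is simply the most negative float.
MISSING = -sys.float_info.max


def is_missing(x):
    """True when x holds the stand-in missing value."""
    return x == MISSING


def naive_negative_flag(x):
    """Mirrors a bare X < 0 test: a missing X is flagged as negative."""
    return 1 if x < 0 else 0


def guarded_negative_flag(x):
    """Tests for missing first, so a missing input stays missing."""
    if is_missing(x):
        return MISSING
    return 1 if x < 0 else 0


print(naive_negative_flag(MISSING))               # 1
print(is_missing(guarded_negative_flag(MISSING)))  # True
```

In BASIC terms, the guarded version corresponds to testing X = . before testing X < 0, in the same way the NEWVAR example in More Examples tests MALE = . OR AGE = . first.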

More Examples

It is easy to create new variables or change old variables using BASIC. The simplest statements create a new variable from other variables already in the data set. For example:

% LET PROFIT=PRICE *QUANTITY2* LOG(SQFTRENT), 5*SQR(QUANTITY)

BASIC allows for easy construction of Boolean variables, which take a value of one if true and zero if false. In the following statement, the variable XYZ would have a value of 1 if any condition on the right-hand side is true, and 0 otherwise.

% LET XYZ = X117 OR X3=6


Suppose your data set contains variables for gender and age, and you want to create a categorical variable with levels for male-senior, female-senior, male-non-senior, female-non-senior. You might type:

% IF MALE = . OR AGE = . THEN LET NEWVAR = .
% ELSE IF MALE = 1 AND AGE < 65 THEN LET NEWVAR=1
% ELSE IF MALE = 1 AND AGE >= 65 THEN LET NEWVAR=2
% ELSE IF MALE = 0 AND AGE < 65 THEN LET NEWVAR=3
% ELSE LET NEWVAR = 4
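To make the branch logic of the NEWVAR statements easier to trace, here is the same chain written as a Python function. This is only an analogy: the function name is hypothetical, and None is used as a stand-in for the BASIC missing value.

```python
def recode_gender_age(male, age):
    """Mirror the IF / ELSE IF chain; None stands in for a missing value."""
    if male is None or age is None:
        return None   # missing in, missing out
    elif male == 1 and age < 65:
        return 1      # male, non-senior
    elif male == 1 and age >= 65:
        return 2      # male, senior
    elif male == 0 and age < 65:
        return 3      # female, non-senior
    else:
        return 4      # everything else, including female seniors


print(recode_gender_age(1, 40))     # 1
print(recode_gender_age(0, 70))     # 4
print(recode_gender_age(None, 40))  # None
```

Note that the final ELSE is a catch-all: any case not matched earlier, such as an unexpected code of 2 for MALE, would also receive the value 4. If that matters for your data, an explicit condition on the last branch may be safer.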

If the measurement of several variables changed in the middle of the data period, conversions can be easily made with the following:

% IF YEAR > 1986 OR MEASTYPE$="OLD" THEN FOR
% LET TEMP = (OLDTEMP-32)/1.80
% LET DIST = OLDDIST / .621
% NEXT
% ELSE FOR
% LET TEMP = OLDTEMP
% LET DIST = OLDDIST
% NEXT
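The two conversion formulas in this example can be checked with a short Python sketch. The units are an inference from the constants (32 and 1.80 suggest Fahrenheit to Celsius, and .621 suggests miles to kilometers); the manual does not state them, and the function name is hypothetical.

```python
def convert_old_units(oldtemp, olddist):
    """Apply the two conversions from the BASIC example.

    Assumed units (inferred from the constants, not stated in the manual):
    oldtemp in degrees Fahrenheit, olddist in miles.
    """
    temp = (oldtemp - 32) / 1.80   # Fahrenheit -> Celsius
    dist = olddist / 0.621         # miles -> kilometers (approximate factor)
    return temp, dist


temp, dist = convert_old_units(212, 62.1)
print(round(temp, 6), round(dist, 6))  # 100.0 100.0
```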

If you would like to create powers of a variable (square, cube, etc.) as independent variables in a polynomial regression, you could type something like:

% DIM AGEPWR(5)
% FOR I = 1 TO 5
% LET AGEPWR(I) = AGE^I
% NEXT

Filtering the Data Set, or Splitting the Data Set

Integrated BASIC can be used for flexibly filtering observations. To remove observations with SSN missing try:

% IF SSN= . THEN DELETE

To delete the first 10 observations type:


Advanced Programming Features


% IF CASE 50 OR INCOME.95 THEN DELETE


The LET Statement

Purpose

Assign a value to a variable.

Syntax

The form of the statement is:

% LET variable = expression

The expression can be any mathematical expression, or a logical Boolean expression. If the expression is Boolean, then the variable defined will take a value of one if the expression is true, or zero if it is false. The expression may also contain logical operators such as AND, OR and NOT.

Examples

% LET AGEMONTH = 12*(YEAR - BYEAR) + (MONTH - BMONTH)
% LET SUCCESS =(MYSPEED = MAXSPEED)
% LET COMPLETE = (OVER = 1 OR END=1)


The ELSE Statement

Purpose

Follows an IF...THEN to specify statements to be executed when the condition following a preceding IF is false.

Syntax

The simplest form is:

% IF condition THEN statement1
% ELSE statement2

The statement2 can be another IF…THEN condition, thus allowing IF…THEN statements to be linked into more complicated structures. For more information see the section for IF…THEN.

Examples

% 5 IF TRUE=1 THEN GOTO 20
% 10 ELSE GOTO 30
% IF AGE