Tips & Tricks to Improve your Logistic Regression

Introduction

Logistic regression is a commonly used tool for analyzing binary classification problems. However, it is limited in its ability to detect nonlinearities and interactions in the data. This tutorial gives step-by-step instructions for improving a conventional logistic regression model with a more advanced and intuitive machine learning technique, stochastic gradient boosting. We demonstrate the approach on two banking datasets, one involving direct marketing and one involving delinquency prediction.
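To make the motivation concrete, here is a minimal sketch (in Python with scikit-learn, not part of the SPM workflow) of a synthetic problem whose signal comes entirely from a squared term and an interaction: a plain logistic regression on the raw inputs scores near chance, while stochastic gradient boosting recovers most of the signal.

```python
# Illustrative only: why boosting can beat a plain logit when the signal is nonlinear.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
# The target depends on a squared term and an interaction, which a linear logit cannot express.
y = ((X[:, 0] ** 2 + X[:, 0] * X[:, 1]) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

logit = LogisticRegression().fit(X_train, y_train)
gbm = GradientBoostingClassifier(learning_rate=0.1, subsample=0.5, random_state=0).fit(X_train, y_train)

print("Logit ROC area:", roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1]))
print("GBM ROC area:  ", roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1]))
```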

Tutorial

1) Open SPM®:

2) Click the folder shortcut to open a data file:

3) Locate bank_marketing.csv and click Open.

The Activity Window, pictured below, will appear:

This file contains 45,211 records of telemarketing calls at a Portuguese bank. On the left side of the Activity Window, you can see the variables in the data file. These include attributes of the prospect (e.g. age, marital status, education) and campaign variables (e.g. type of contact, date of last contact). At the bottom of the Activity Window, a row of buttons gives you options to view the data, create graphs, and run statistics, among other features.

4) At the bottom of the Activity Window, click the Model button:

The Model Setup window, pictured below, will appear:

5) In the Analysis Method drop-down menu, choose TreeNet. For the Analysis Type, choose Logistic Binary. In the Variable Selection window, choose RESPONDED$ in the Target column and the remaining 14 variables in the Predictor column.

6) Click the Testing tab at the top of the Model Setup window:

7) Choose Fraction of cases selected at random for testing and enter 0.50. This will randomly partition the dataset into 50% for building the TreeNet model and 50% for testing. With smaller datasets, you may consider using an 80/20 partition or cross-validation.
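For readers who want to follow along outside SPM, the sketch below is a rough Python equivalent of steps 3 and 7. It assumes the CSV has a RESPONDED column with yes/no values for the target; the exact column names and labels in your copy of bank_marketing.csv may differ. Categorical predictors are one-hot encoded here, whereas SPM's TreeNet handles them natively.

```python
# Load the data and hold out 50% at random for testing, mirroring the 0.50 fraction in step 7.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("bank_marketing.csv")
y = (df["RESPONDED"] == "yes").astype(int)          # assumed target column and label
X = pd.get_dummies(df.drop(columns=["RESPONDED"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=42, stratify=y
)
```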

8) Click the TreeNet tab:

9) In the Learnrate box, enter 0.1. This is typically a good value to start with. Under Optimal Logistic Model Selection Criterion, choose ROC area.

10) Click Start to build the model.

The TreeNet Output window, pictured below, will appear:

A quick look at the results shows an optimal ROC area of 0.92, which is exceptional and possibly too good to be true. To investigate this further, you will look at which variables are most important in the model.
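Outside SPM, the analogous settings can be sketched with scikit-learn's GradientBoostingClassifier, continuing from the split above: learning_rate plays the role of Learnrate, subsample below 1 makes the boosting stochastic, and the number of trees is chosen by test ROC area as a rough stand-in for SPM's selection criterion.

```python
# Fit a stochastic gradient boosting model and pick the stage with the best test ROC area.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

gbm = GradientBoostingClassifier(learning_rate=0.1, n_estimators=500, subsample=0.5, random_state=0)
gbm.fit(X_train, y_train)

stage_auc = [roc_auc_score(y_test, p[:, 1]) for p in gbm.staged_predict_proba(X_test)]
print(f"Best ROC area {max(stage_auc):.3f} at {int(np.argmax(stage_auc)) + 1} trees")
```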

11) Click the Summary button at the bottom of the TreeNet Output window:

12) In the resulting Summary window, click the Variable Importance tab:

Here, you can see CONTACT_DURATION is listed as the most important variable in the model. However, this measure reflects the conversation between the telemarketer and the prospect, so its value is not known before a call is made. Because it would not be available when scoring future calls, it should not be included in the model.

13) Return to the Model Setup window via the shortcut:

14) Remove CONTACT_DURATION as a predictor and click Start.

A new TreeNet Output window will appear with the new model:

Now, we have a more reasonable ROC area of 0.77.
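The same leakage check can be sketched in code, continuing the earlier sketches: rank the predictors by the fitted model's importance scores, drop CONTACT_DURATION, and refit. Note that SPM scores each original variable as a whole, while one-hot encoding scores every category level separately.

```python
# Rank predictors by importance, remove the leakage variable, and refit.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

importances = pd.Series(gbm.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))

# CONTACT_DURATION is only known after the call, so it cannot be used to score future prospects.
X_train2 = X_train.drop(columns=["CONTACT_DURATION"])
X_test2 = X_test.drop(columns=["CONTACT_DURATION"])
gbm2 = GradientBoostingClassifier(learning_rate=0.1, subsample=0.5, random_state=0).fit(X_train2, y_train)
print("ROC area without the leakage variable:",
      round(roc_auc_score(y_test, gbm2.predict_proba(X_test2)[:, 1]), 3))
```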

15) Click the Summary button again and return to the Variable Importance tab:

You can apply further feature selection from this window.

16) Highlight the top six variables in the Variable Importance list:

17) Click New Keep & Go at the bottom of the window to re-run the TreeNet model with only the selected six variables.

In the new TreeNet Output window, we can see that we have preserved most of the model performance (0.76) from the previous model, but cut the complexity in half by only including six of the twelve important variables.
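A rough code counterpart of this selection step, continuing the earlier sketches, keeps only the highest-ranked columns and refits. Because the categoricals were one-hot encoded, the cutoff below (the 15 top-scoring encoded columns) is only an approximate stand-in for SPM's top six original variables.

```python
# Keep the top-ranked encoded columns and refit the boosted model.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

imp2 = pd.Series(gbm2.feature_importances_, index=X_train2.columns).sort_values(ascending=False)
keep = imp2.head(15).index   # the exact cutoff is a judgment call
gbm3 = GradientBoostingClassifier(learning_rate=0.1, subsample=0.5, random_state=0).fit(X_train2[keep], y_train)
print("ROC area with the reduced predictor set:",
      round(roc_auc_score(y_test, gbm3.predict_proba(X_test2[keep])[:, 1]), 3))
```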

18) Click the Create Plots button at the bottom of the window:

The Create Plots dialog box, pictured below, will appear:

19) Check the box next to One variable dependence and click Select All to include all six predictors. Click Create Plots.

A Plot Selection box will appear. Close this box for now; we will return to the TreeNet plots after building a logistic regression model.

20) Return to the Model Setup window via the shortcut:

21) Switch the Analysis Method to LOGIT and confirm that the Analysis Type is still Logistic Binary and the top six predictors are still correctly selected (AGE, CONTACT_DAY, CONTACT_MONTH$, JOB_CATEGORY$, MORTGAGE$, PREV_DAYS).

22) Click Start.

The Logit Results window, pictured below, will appear:

Here, you can see the Logit model resulted in an ROC area of 0.71, about a 5-point drop compared to the TreeNet model.

23) Click the Coefficients tab and scroll to the bottom:

A variable that was previously important, CONTACT_DAY, is now shown as insignificant.
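To see those coefficients and their p-values outside SPM, a statsmodels sketch of the same logit is shown below, continuing from the earlier sketches. Column names follow the tutorial's variable list and are assumptions about the CSV; the categoricals are dummy-coded here rather than handled natively as in SPM.

```python
# Fit the six-predictor logit with statsmodels so coefficient p-values are reported.
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

cols = ["AGE", "CONTACT_DAY", "CONTACT_MONTH", "JOB_CATEGORY", "MORTGAGE", "PREV_DAYS"]
design = sm.add_constant(pd.get_dummies(df[cols], drop_first=True).astype(float))

logit_fit = sm.Logit(y_train, design.loc[X_train.index]).fit(disp=0)
print(logit_fit.summary())   # large p-values flag insignificant terms such as CONTACT_DAY
print("Logit ROC area:",
      round(roc_auc_score(y_test, logit_fit.predict(design.loc[X_test.index])), 3))
```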

The next step is to recover the points lost in the logistic regression by imposing transformations on the predictors. To do this, we will create approximations of the original predictors via our TreeNet partial dependency plots.

24) Return to the latest TreeNet Output window and click Display Plots:

25) In the resulting Plot Selection window, click Show All:

These partial dependency plots show the contribution of each predictor to the half log-odds of the response variable. While you will not be able to create transformations for the categorical variables, you will place spline approximations over the graphs of the three continuous variables: PREV_DAYS, AGE, and CONTACT_DAY.

26) Double-click the graph for PREV_DAYS to open it in its own window:

27) Under Approximations, click Specify:

28) In the TN Approximations window, select Splines, specify 5 knots, and click the First Order box:

After pressing OK or Apply, a green spline approximation will be fit to the partial dependency plot. Keep in mind that there are many options for fitting these approximations, and any deviations can lead to differences in the end results. This is just a quick example of fitting a transformation.

29) Repeat this process on the remaining continuous variables:
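A hedged code sketch of this approximation step is shown below: for each continuous predictor, take the boosted model's partial dependence curve and fit a first-order spline with five interior knots. Knot placement at quantiles of the grid is an assumption on our part, and SPM's spline fitting may differ in detail; the grid_values key requires scikit-learn 1.3 or later.

```python
# Approximate each continuous predictor's partial dependence curve with a first-order spline.
import numpy as np
from scipy.interpolate import LSQUnivariateSpline
from sklearn.inspection import partial_dependence

splines = {}
for col in ["PREV_DAYS", "AGE", "CONTACT_DAY"]:
    res = partial_dependence(gbm2, X_train2, features=[col], kind="average")
    grid, curve = res["grid_values"][0], res["average"][0]
    knots = np.quantile(grid, np.linspace(0.1, 0.9, 5))           # 5 interior knots
    splines[col] = LSQUnivariateSpline(grid, curve, t=knots, k=1)  # k=1: first order
```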

30) In the main plots window, click View Models:

A window will appear with both Basic and SAS code detailing the imposed transformations:

31) Select CART Basic and click Submit:

This action submits the new transformed predictors to the engine for modeling.

32) Return to the Model Setup window:

You will now see new variables in the Variable Selection window. Original predictors with “_S” appended represent the fully transformed variables. Predictors ending in “_S” plus a number are the individual splines within each transformed variable.

33) Double-check that the Analysis Method is LOGIT and the Analysis Type is Logistic Binary. Select RESPONDED$ in the Target column and the following variables as predictors: AGE_S, CONTACT_DAY_S, CONTACT_MONTH$, JOB_CATEGORY$, MORTGAGE$, and PREV_DAYS_S.

34) Click Start.

The Logit Results window, pictured below, will appear with the new model:

With an ROC area of 0.74, we have improved the logistic regression by 3 points simply by using transformations of the original predictors. Keep in mind that different spline approximations may lead to different results.
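A code sketch of the same substitution, reusing the splines dictionary from the earlier sketch: the transformed “_S” columns replace the raw continuous predictors in the logit design, and the three categoricals are dummy-coded. Column names again follow the tutorial's variables and are assumptions about the CSV.

```python
# Rebuild the logit on spline-transformed continuous predictors plus dummy-coded categoricals.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

spline_cols = pd.DataFrame(
    {c + "_S": splines[c](df[c]) for c in ["PREV_DAYS", "AGE", "CONTACT_DAY"]},
    index=df.index,
)
dummies = pd.get_dummies(df[["CONTACT_MONTH", "JOB_CATEGORY", "MORTGAGE"]], drop_first=True)
design_s = pd.concat([spline_cols, dummies], axis=1).astype(float)

logit_s = LogisticRegression(max_iter=2000).fit(design_s.loc[X_train.index], y_train)
auc = roc_auc_score(y_test, logit_s.predict_proba(design_s.loc[X_test.index])[:, 1])
print("ROC area with spline-transformed predictors:", round(auc, 3))
```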

The second example illustrated in the webinar deals with predicting which loan applicants will become 90 or more days past due. The process is nearly identical to the direct marketing example:

1. Open delinquency_prediction.csv in SPM.
2. Run a LOGIT model with DELINQUENT as the target and all remaining variables (except ID) as predictors.
3. Run a TreeNet model with the same variable selection and default model settings.
4. Compare the two models. There should be about a 4-point difference in ROC area.
5. Create one variable dependence plots from the TreeNet model for all predictors.
6. Fit spline approximations to these plots (5-knot splines will fit all variables).
7. Submit these transformations to the software.
8. Re-run the LOGIT model with the transformed predictors to recover the lost accuracy.

Have questions or comments? Please email us at [email protected]!