Credit Scoring Model Validation

Faculty of Science Korteweg-de Vries Institute for Mathematics

Credit Scoring Model Validation Master Thesis June 29, 2008

Xuezhen Wu

Referent: Prof. Peter Spreij Korteweg-de Vries Institute for Mathematics

Abstract

Under the framework of Basel II, banks are allowed to use their own internal ratings-based (IRB) approaches for key drivers of credit risk as primary inputs to the capital calculation. In addition, regulatory validation of the internal rating/scoring system is required. Assessing the discriminatory power and examining the calibration of a credit scoring system are two different important tasks of validation. This paper discusses several commonly used statistical approaches for measuring discriminatory power and calibration and shows that such approaches should be interpreted with caution. When the objective of the validation of a credit scoring model is to confirm that the developed scoring model is still valid for the current applicant population, one should first check whether the portfolio structure has changed over time, because in some cases significant shifts of the portfolio structure might have occurred.

Acknowledgement

This thesis is the result of my four-month internship at Credit Risk Management, Risk Analytics and Instruments, Deutsche Bank, Frankfurt am Main, and is also the final part of my master study Stochastics and Financial Mathematics at the Universiteit van Amsterdam. It is a pleasure to thank the many people who made this thesis possible.

First of all, I would like to thank Thomas Werner for hiring me as an intern at Deutsche Bank, giving me the opportunity to do this practical research, supervising me and providing resources. I also want to express my gratitude to Michael Luxenburger and Martin Hillebrand for their advice and supervision during my internship. Thanks to the other colleagues at DB who gave me help, useful information and suggestions. This experience provided me with deeper insights into credit risk management and broadened my perspective on the banking industry.

Meanwhile, I would like to thank my supervisor Dr. Peter Spreij, Korteweg-de Vries Institute for Mathematics, Universiteit van Amsterdam, for his guidance and support during my thesis and study at UvA. Thanks also to the coordinator of the master programme Stochastics and Financial Mathematics at UvA, Dr. Bert van Es, who always gave me help, advice and encouragement. I would like to thank the many people who taught me during my master study in Holland. I acquired a lot of useful knowledge and skills during this study. It has been a wonderful experience and also a turning point in my life.

Last but not least, I want to thank my parents and friends for their constant encouragement and love. To them I dedicate this thesis.

Contents

1 Introduction
2 Credit Risk Management & Basel II
    2.1 Credit Risk
    2.2 Basel II
    2.3 The Internal Ratings Based Approach
    2.4 Validation under the Basel II
3 Scoring Model Framework
    3.1 Fundamentals
    3.2 Model Development
        3.2.1 Data Source
        3.2.2 Data Specification
        3.2.3 Missing Values and Outliers
        3.2.4 Statistical Model Selection
        3.2.5 Initial Selection of Variables
        3.2.6 Final Model Production
    3.3 Validation
4 Methodology Background
    4.1 Statistical Model Selection for Scoring Systems
    4.2 Bootstrapping and Confidence Intervals
    4.3 The ROC Approach
        4.3.1 Receiver Operating Characteristics
        4.3.2 Analysis of Areas Under ROC Curves
    4.4 Statistical Tests for Rating System Calibration
        4.4.1 Hosmer-Lemeshow Test
        4.4.2 Spiegelhalter Test
5 Empirical Analysis and Results
    5.1 Model Description
    5.2 The Validation Data Set
    5.3 Variables Analysis
    5.4 Testing for Discriminatory Power
    5.5 Statistical Tests for PD Validation
6 Discussion & Conclusion
A Ratio Analysis
B Results of Bootstrapping
C Transformation of the mean of input ratios

List of Tables

2.1 Basel II IRB Foundation and Advanced Approach
4.1 Decision results
5.1 Number of observations and defaults per year
5.2 Actual GDP of Germany 2003-2005
5.3 Confidence Intervals for AUROC
5.4 Hosmer-Lemeshow-Chisquare Test for PD Validation
5.5 Spiegelhalter Test for PD Validation
B.1 99% percentile confidence interval for the mean
C.1 Table of the log transformation of the mean of input ratios

List of Figures

2.1 Three Pillars of Basel II
2.2 Three approaches to manage credit risk given in Basel II
3.1 credit scoring model development
3.2 Data Split
4.1 Rating Score distributions for defaulters and non-defaulters
4.2 Receiver operating characteristics curves
5.1 Score-PD
5.2 Score-KDE
5.3 ROC Curves for the year 2003 2004 and 2005
A.1 Histogram and Boxplot of ratio P02101
A.2 Histogram and Boxplot of ratio P02102
A.3 Histogram and Boxplot of ratio P02103
A.4 Histogram and Boxplot of ratio P02104
A.5 Histogram and Boxplot of ratio P02105
A.6 Histogram and Boxplot of ratio P02106
A.7 Histogram and Boxplot of ratio P02107
A.8 Histogram and Boxplot of ratio P02108
A.9 Histogram and Boxplot of ratio P02109
A.10 Histogram and Boxplot of ratio P02110
A.11 Histogram and Boxplot of ratio P02111
A.12 Histogram and Boxplot of ratio P02112
A.13 Histogram and Boxplot of ratio P02113
A.14 Histogram and Boxplot of ratio P02114
A.15 Histogram and Boxplot of ratio P02115
A.16 Histogram and Boxplot of ratio P02116
A.17 Histogram and Boxplot of ratio P02117
A.18 Histogram and Boxplot of ratio P02118
A.19 Histogram and Boxplot of ratio P02119
A.20 Histogram and Boxplot of ratio P02120

Chapter 1 Introduction

Recent instabilities in the financial sector led to the development of Basel II, the new international capital standard for credit institutions. Under Basel II, banks are permitted to use internal ratings-based (IRB) approaches to determine the risk weights relevant for calculating the capital charge according to their own credit scoring model. Consequently, banks are obliged to validate their internal processes for differentiating risk as well as for quantifying that risk. Validation therefore represents a major challenge for both banks and supervisors.

This paper focuses on the validation of the credit scoring system under Basel II, and is motivated by a desire to explain the methodologies that could be used to validate a credit scoring model and to discuss the practical implementation problems. We will first introduce Basel II and the IRB approach, and then explain the main procedure for developing and validating a scoring model under the framework of Basel II.

The main principle of a credit scoring system is to assign each borrower a score in order to separate the bad borrowers (defaulters) from the good borrowers (non-defaulters). For example, a borrower with a high estimated default probability is given a high score. A scoring system can therefore be seen as a classification tool that provides indications of the borrower's possible status in the future. This procedure is commonly called discrimination. Thus, the discriminatory power of a scoring system denotes the model's ability to discriminate defaulters from non-defaulters. Assessing the discriminatory power is one of the major tasks in validating a credit scoring model.

In practice, scoring systems also form the basis for pricing credits and calculating risk premiums and capital charges. Each score value is associated with an estimated probability of default. Under the internal ratings-based (IRB) approach, the capital requirements are determined by the bank's internal estimates of the risk parameters (including the probability of default, Loss Given Default and Exposure at Default). This concerns the calibration of the scoring system, which is another important task in scoring model validation.

In Chapter 2, the current regulation for credit risk, Basel II, will be introduced. It is followed in Chapter 3 by a general overview of the development and validation framework of credit rating models. Nowadays, a lot of emphasis is given to the validation of the internal rating system. Here the validation includes both the assessment of model discriminatory power and calibration. While the discriminatory power of a scoring model depends on the difference between the score distributions of the defaulters and non-defaulters, the calibration of a scoring system depends on the difference between the estimated PD and the observed default rates. In Chapter 4, we will discuss the commonly used statistical methods for measuring the discriminatory power and calibration of scoring systems. Finally, in Chapter 5, we will demonstrate the application of all concepts and give the empirical results. All the statistical results are obtained with the aid of the software packages SAS and MS Excel.

Chapter 2 Credit Risk Management & Basel II

2.1 Credit Risk

In terms of risk, the business of financial institutions can be described as the management of credit risk, market risk, operational risk, legal and regulatory risk, and liquidity risk. Credit risk, the risk of loss due to uncertainty about an obligor's ability to meet its obligations in accordance with agreed terms, has always constituted by far the biggest risk for banks worldwide. Thus, banks should have a keen awareness of credit risk and of the need to manage credit risk exposures more actively and effectively.

2.2 Basel II

The current regulations for risks are the result of a recommendation of the Basel Committee on Banking Supervision: the New Basel Capital Accord (Basel II, 2004), which is an amendment of the Basel I Capital Accord (published in 1988). The main objectives of Basel I were to promote the soundness and stability of the banking system and to adopt a standard approach across banks in different countries. Although it was initially intended only for the internationally active banks in the G-10 countries, it was finally adopted by over 120 countries and recognized as a global standard. However, the shortcomings of Basel I became increasingly obvious over time. Principal among them is that regulatory capital ratios were becoming less and less meaningful as measures of true capital adequacy, particularly for large complex financial institutions. Moreover, the simplicity of Basel I encouraged the rapid development of various types of products that circumvent regulatory capital requirements.

Following the publication of successive rounds of proposals between 1999 and 2003, the Basel Committee members agreed in mid-2004 on the Basel II Capital Accord. The main objectives of this revised capital adequacy framework are: integrating an effective approach to supervision, creating incentives for banks to improve their risk management and measurement, and establishing risk-based capital requirements that close the gap between regulatory and economic capital charges (Nout Wellink, 2006 [9]).

The Basel II goals shall be achieved through three mutually supporting pillars (Figure 2.1):

• Pillar 1 defines the rules for calculating the minimum capital requirements for credit, operational and market risks. The minimum capital requirements are composed of three fundamental elements: a definition of regulatory capital, risk weighted assets and the minimum ratio of capital to risk weighted assets.
• Pillar 2 provides guidance on the supervisory review process, which enables supervisors to take early action in order to prevent capital from falling below the minimum requirements for supporting the risk characteristics of a particular bank, and requires supervisors to take rapid remedial action if capital is not maintained or restored.
• Pillar 3 recognises that market discipline has the potential to reinforce minimum capital standards (Pillar 1) and the supervisory review process (Pillar 2), and so promote safety and soundness in banks and financial systems.

Basel II brings significant changes, especially in the calculation of capital requirements.
Banks may choose between a standardized approach that is similar to the one already used under Basel I and a new, more demanding internal ratings-based (IRB) approach, which comes in two versions: the foundation IRB approach and the advanced IRB approach (Figure 2.2):

• The standardized approach is based on external ratings provided by rating agencies such as Moody's and Standard & Poor's, albeit more differentiated than under Basel I.
• The foundation IRB approach allows banks to use their internal rating system to calculate the probability of default (PD), but other credit risk inputs such as loss given default (LGD) are obtained from external sources (Table 2.1).
• The advanced IRB approach uses the banks' own estimates of input data exclusively (probability of default, loss given default, exposure at default (EAD), and effective maturity (M)).

[Figure 2.1: Three Pillars of Basel II. The diagram shows the pillars Minimum Capital Requirements ('Quantitative'), Supervisory Review ('Qualitative') and Market Discipline ('Market forces') supporting Stability, Safety and Soundness. Credit risk approaches: Standardised, IRB Foundation, IRB Advanced. Operational risk approaches: Basic Indicator Approach, Standardised Approach, Advanced Measurement Approach.]

[Figure 2.2: Three approaches to manage credit risk given in Basel II. Risk assessment by external ratings leads to the standardized approach; risk assessment by internal ratings leads to the foundation IRB and advanced IRB approaches. From standardized to advanced IRB: increasing complexity, increasing risk sensitivity, increasing qualitative standards, decreasing capital requirement.]

Table 2.1: Basel II IRB Foundation and Advanced Approach

Data input   Foundation IRB                                   Advanced IRB
PD           Bank internal estimates                          Bank internal estimates
LGD          Supervisory values                               Bank internal estimates
EAD          Supervisory values                               Bank internal estimates
M            Supervisory values, or bank internal estimates   Bank internal estimates

2.3 The Internal Ratings Based Approach

The IRB approach is regarded as the prerequisite for advanced credit risk management, and it is necessary for each financial institution to develop its own internal rating system. Once a bank adopts an IRB approach for part of its holdings, it is expected to extend it across the entire banking group. A phased rollout is, however, accepted in principle and has to be agreed upon by supervisors [10]. As mentioned in the previous section, under the IRB approach the derivation of risk weighted assets for a given exposure depends on estimates of the PD, LGD, EAD and, in some cases, effective maturity (M):

• Probability of default (PD) per rating grade, which gives the average percentage of obligors that default in this rating grade in the course of one year.
• Exposure at default (EAD), which gives an estimate of the amount outstanding (drawn amounts plus likely future drawdowns of yet undrawn lines) in case the borrower defaults.
• Loss given default (LGD), which gives the percentage of exposure the bank might lose in case the borrower defaults. These losses are usually shown as a percentage of EAD and depend, amongst others, on the type and amount of collateral as well as the type of borrower and the expected proceeds from the work-out of the assets.


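The formulas for calculating risk weighted assets are derived from the extension of Merton's (1974) single asset model to credit portfolios and from the work of Vasicek (2002) [12]:

\[
\text{Correlation } (R) = 0.12 \times \frac{1 - \exp(-50 \times \mathrm{PD})}{1 - \exp(-50)} + 0.24 \times \left[ 1 - \frac{1 - \exp(-50 \times \mathrm{PD})}{1 - \exp(-50)} \right]
\]

\[
\text{Maturity adjustment } (b) = \left( 0.08451 - 0.05898 \times \ln(\mathrm{PD}) \right)^{2}
\]

\[
\text{Capital requirement } (K) = \mathrm{LGD} \times N\!\left[ \frac{N^{-1}(\mathrm{PD})}{\sqrt{1 - R}} + \sqrt{\frac{R}{1 - R}} \, N^{-1}(0.999) \right] \times \frac{1 + (M - 2.5) \times b(\mathrm{PD})}{1 - 1.5 \times b(\mathrm{PD})}
\]

\[
\text{Risk-weighted assets } (\mathrm{RWA}) = 12.5 \times \mathrm{EAD} \times K
\]

\[
\text{Regulatory capital for credit risk} = 8\% \times \mathrm{RWA}
\]

where:
- N(.) is the cumulative distribution function (c.d.f.) of the standard normal distribution.
- N^{-1}(.) is the inverse of this cumulative distribution function.
- M is the effective (remaining) maturity.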
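As an illustration, the formulas above can be evaluated numerically. The following is a minimal sketch that uses the coefficients exactly as printed in this section; the function names and the example inputs (PD = 1%, LGD = 45%, M = 2.5 years) are assumptions made for illustration and are not taken from the thesis.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: N(.) and its inverse


def correlation(pd):
    """Asset correlation R as a function of PD (formula above)."""
    weight = (1 - exp(-50 * pd)) / (1 - exp(-50))
    return 0.12 * weight + 0.24 * (1 - weight)


def maturity_adjustment(pd):
    """Maturity adjustment b(PD), coefficients as printed above."""
    return (0.08451 - 0.05898 * log(pd)) ** 2


def capital_requirement(pd, lgd, maturity):
    """Capital requirement K for one exposure; requires 0 < pd < 1."""
    r = correlation(pd)
    b = maturity_adjustment(pd)
    argument = (_STD_NORMAL.inv_cdf(pd) / sqrt(1 - r)
                + sqrt(r / (1 - r)) * _STD_NORMAL.inv_cdf(0.999))
    maturity_factor = (1 + (maturity - 2.5) * b) / (1 - 1.5 * b)
    return lgd * _STD_NORMAL.cdf(argument) * maturity_factor


def risk_weighted_assets(pd, lgd, ead, maturity):
    """RWA = 12.5 x EAD x K."""
    return 12.5 * ead * capital_requirement(pd, lgd, maturity)


# Hypothetical exposure: PD 1%, LGD 45%, EAD 100, maturity 2.5 years.
rwa = risk_weighted_assets(0.01, 0.45, 100.0, 2.5)
regulatory_capital = 0.08 * rwa
```

Because 8% × 12.5 = 1, the regulatory capital in this sketch equals EAD × K, which makes K directly interpretable as the capital charge per unit of exposure.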

2.4 Validation under the Basel II

As implementation of the Basel II Framework continues to move forward around the globe, the emphasis of the debate on the IRB approach has shifted. More attention is now given to the validation of the internal rating system used to generate estimates of inputs. When considering the appropriateness of any rating system as the basis for determining capital, there will always be a need to ensure objectivity, accuracy, stability, and an appropriate level of conservatism. The term validation includes a range of processes and activities that assess whether ratings adequately differentiate risk, and whether estimates of risk components (such as PD, LGD, or EAD) appropriately characterize the relevant aspects of risks [13].


The Basel Committee on Banking Supervision stated that several principles underlying the concept of validation should be considered [14]:

• Validation is fundamentally about assessing the predictive ability of a bank's risk estimates and the use of ratings in credit processes.
• The bank has primary responsibility for validation.
• Validation is an iterative process.
• There is no single validation method.
• Validation should encompass both quantitative and qualitative elements.
• Validation processes and outcomes should be subject to independent review.

Based on the availability of high-quality data and quantitative rating models, there are two mutually supporting ways to validate bank internal rating systems [2]:

1. Result-based validation (also known as backtesting): ex post analysis of the rating system's quantification of credit risk. The probability of default (PD) per rating class or the expected loss (EL) must be compared with the realized default rates or losses.
2. Process-based validation: analysis of the rating system's interfaces with other processes in the bank and how the rating system is integrated into the bank's overall management structure.

Chapter 3 Scoring Model Framework

3.1 Fundamentals

For the banking industry, the scoring methodology has played an important role in developing internal rating systems. There are several reasons that can explain the widespread use of scoring models. First, since credit scoring models are built upon statistical models and not on opinions, they offer an objective way to measure and manage risk. Second, the statistical model used to produce credit scores can be validated. Data have to be carefully used to check the accuracy and performance of the predictions. Third, the statistical models used to produce credit scores can be improved over time as additional data are collected.

Prior to default, there is no sure way of identifying whether a firm will indeed default. The purpose of credit risk rating, therefore, is to categorize customers into various classes, each of which is homogeneous in terms of its probability of default (PD). While the score alone does not represent a default probability, model scores are mapped to an empirical probability of default by using historical financial data and statistical techniques. Normally, high scores correspond to a low probability of default.

Broadly speaking, there are three internal components to the credit scoring model:

• There are inputs, which are obtained from the dataset of applicant companies or borrowers.
• There are parameters, which are used to weight the inputs and to control the logic of the model.
• There is a well defined statistical algorithm to combine the inputs and the parameters to create a score.

Note that the preparation of inputs, the estimation of parameters and the selection of an appropriate statistical model are all based upon statistically valid procedures.

3.2 Model Development

[Figure 3.1: credit scoring model development. Flowchart with the boxes: Collect data; Explore data; Data cleaning; Deciding the need for adjustments; Dealing with missing values and outliers; Characteristics analysis; Model style selection; Model produce; Validation.]

3.2.1 Data Source

Definition of Defaults. The first step in a credit scoring model development project is defining the default event. Traditionally, credit rating or scoring models were developed upon the bankruptcy criteria. However, banks also meet with losses before the event of bankruptcy. Therefore, in the Basel II Capital Accord, the Basel Committee on Banking Supervision gave a reference definition of the default event and announced that banks should use this regulatory reference definition to estimate their internal ratings-based models. According to this proposed definition, a default is considered to have occurred with regard to a particular obligor when either or both of the two following events have taken place [11]:

• The bank considers that the obligor is unlikely to pay its credit obligations to the banking group in full.
• The obligor is past due more than 90 days on any material credit obligation to the banking group. Overdrafts will be considered as being past due once the customer has breached an advised limit or has been advised of a limit smaller than the current outstanding.
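The two criteria of the reference definition translate directly into a default flag for the development data set. The sketch below is only an illustration: the record layout (`ObligorStatus`) and the field names are hypothetical, and the treatment of overdrafts and of materiality is left out.

```python
from dataclasses import dataclass


@dataclass
class ObligorStatus:
    """Hypothetical record holding the two inputs of the default definition."""
    unlikely_to_pay: bool  # the bank's own assessment (first criterion)
    days_past_due: int     # on any material credit obligation (second criterion)


def is_default(status, threshold_days=90):
    """Default if either criterion holds; 'more than 90 days' is a strict inequality."""
    return status.unlikely_to_pay or status.days_past_due > threshold_days
```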

Input Characteristics. The following step is the pre-selection of input characteristics to be included in the sample. Compared with importing a snapshot of the entire database, pre-selecting input characteristics makes the model development process more efficient and increases the developer's knowledge of internal data. In general, the selected input characteristics should describe the most important credit risk factors, i.e. leverage, asset utilization, liquidity, scale, profitability and operational performance.

3.2.2 Data Specification

Time Horizon. The time horizon refers to the period over which the default probability is estimated. The choice of the time horizon is a key decision in building a credit scoring model. The time horizon varies depending on the objective for which the credit risk model is developed (estimating the short-term or the medium- to long-term default probability). For most banks, it is common to select one year as the modelling horizon: on the one hand, one year is long enough to allow banks to take actions to mitigate credit risk, and on the other hand, new obligor information and default data may be revealed within one year. However, a longer time horizon could also be of interest, especially when decisions about the allocation of new loans have to be made, although the required data are often unavailable for longer horizons.

Data Splitting. Since model building and validation both require samples, and the statistical assessment of the performance of a predictive scoring model can be highly sensitive to the data set, the data set at hand should be large enough to be randomly split into two data sets: one for development and the other for validation. To avoid embedding unwanted data dependency, some type of out-of-sample, out-of-time and out-of-universe test (see footnote 1) should be used in the validation process. Normally, 60% to 80% of the total sample is used to estimate the model; the remaining 20% to 40% of the sample is set aside to validate the model.
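A random split of this kind could be sketched as follows. The function name, the default 70% development share and the fixed seed are assumptions chosen for illustration; the split is purely random and therefore only out-of-sample, so an out-of-time test would instead split on the observation date.

```python
import random


def split_sample(records, development_share=0.7, seed=0):
    """Randomly split records into a development and a validation sample."""
    if not 0.0 < development_share < 1.0:
        raise ValueError("development_share must lie strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * development_share)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical usage with 100 obligor identifiers.
development, validation = split_sample(range(100), development_share=0.7)
```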

Footnote 1: Out-of-sample refers to observations that are not used to build a model. Out-of-time refers to observations that are not contemporary with those observations used to build a model. Out-of-universe refers to observations whose distribution differs from those observations used to build a model, i.e. the validation data set contains some obligors that are not included in the data set for building the model.

Data Exploring. Before initiating the actual modeling, it is very useful to calculate simple statistics for each characteristic, such as mean/median, standard deviation, and range of values. The interpretation of data should also be checked, for example, to ensure that '0' represents zero and not missing values, and to confirm that any special values such as 999 are documented. This step verifies that all aspects of the data are understood and offers great insight into the business.
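The checks described above can be collected in a small summary routine per characteristic. This is a sketch under assumptions: missing values are represented by `None`, the special code 999 is taken from the example in the text, and the function name `summarize` is hypothetical.

```python
from statistics import mean, median, stdev


def summarize(values, special_values=(999,)):
    """Simple statistics for one characteristic, plus counts that need a second look."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),  # real zeros or coded missing?
        "special": sum(1 for v in present if v in special_values),
        "mean": mean(present),
        "median": median(present),
        "std": stdev(present),  # needs at least two non-missing values
        "min": min(present),
        "max": max(present),
    }


# Hypothetical ratio with one missing entry and one special code.
report = summarize([0, 1.0, None, 999, 2.0])
```

A large maximum such as 999 in the report is a prompt to check the data documentation before the value is allowed to influence the mean or the standard deviation.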

3.2.3 Missing Values and Outliers

Most financial industry data contain missing values or outliers that must be properly managed. Several methods for dealing with missing values are available, such as removing all data with missing values or excluding characteristics or records that have significant missing values from the model (e.g., more than 50% missing), but this can result in too much data being lost. Another straightforward way is substituting the missing values with the corresponding mean or median values over all observations for the respective time period. While these three methods assume that no further information can be gathered from analyzing the missing data, this is not necessarily true: missing values are usually not random. Missing values may be part of a trend, may be linked to other characteristics, or may indicate bad performance. Therefore, missing values should be analyzed first. If they are found to be random and performance-neutral, they may be excluded or imputed using statistical techniques; otherwise, if missing values are found to be correlated with the performance of the portfolio, it is preferable to include missing values in the analysis.

Outliers are values that are located far from the others for a certain characteristic. They may negatively affect regression results. The easiest way to handle them is to remove all extreme data that fall outside of the normal range, for example, at a distance of more than two or three standard deviations from the mean. With this solution, however, it is very easy to erroneously remove defaulting companies that are considered outliers. Another technique, called 'winsorisation', involves setting outliers to a specified percentile of the data.
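Median imputation and winsorisation, as described above, might be sketched like this. The 5th/95th percentile cut-offs and the simple floor-based percentile rule are assumptions for illustration, not values taken from the thesis.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]


def winsorise(values, lower_pct=0.05, upper_pct=0.95):
    """Set values outside the chosen percentiles to the percentile values."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]    # simple floor-based percentile
    high = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]


# Hypothetical ratio values 1..100: the tails are pulled in to 5 and 95.
clipped = winsorise(list(range(1, 101)))
```

Unlike deletion, winsorisation keeps every observation in the sample, so defaulting companies with extreme ratios remain available for estimation.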

3.2.4 Statistical Model Selection

There are several statistical methods for building and estimating scoring models, including linear regression models, logit models, probit models, and neural networks. We introduce them in Section 4.1. The most popular one is the logit model, which models the probability of default as a logistic function of a linear combination of the input characteristics. In practice, the scorecard can be generated directly from the logit model, and model implementation is cheaper and faster.
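To make the logit idea concrete, the sketch below fits a logit model by plain gradient ascent on the log-likelihood and maps inputs to a probability of default. The fitting routine, its learning rate and the toy data are assumptions for illustration only; in practice a statistical package would be used (the thesis itself reports using SAS). Here the linear predictor is the log odds of default, so a score on which high values indicate low risk would be a decreasing transformation of it.

```python
from math import exp


def logistic(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + exp(-z))


def fit_logit(xs, ys, learning_rate=0.5, iterations=5000):
    """Fit intercept and coefficients by gradient ascent on the mean log-likelihood."""
    beta = [0.0] * (len(xs[0]) + 1)  # beta[0] is the intercept
    for _ in range(iterations):
        gradient = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            row = [1.0] + list(x)
            residual = y - logistic(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                gradient[j] += residual * v
        beta = [b + learning_rate * g / len(xs) for b, g in zip(beta, gradient)]
    return beta


def predict_pd(beta, x):
    """Estimated probability of default for one obligor."""
    return logistic(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


# Toy data (hypothetical): one ratio per obligor, 1 = default.
ratios = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [0.9]]
defaults = [0, 0, 0, 1, 0, 0, 1, 1, 1]
beta = fit_logit(ratios, defaults)
```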

3.2.5 Initial Selection of Variables

Univariate Analysis. In this first step of initial variable selection, the predictive power of each variable is assessed individually and the weak or illogical ratios are screened out. This is also known as univariate screening. For this purpose, the distribution of values of the variables is examined and transformations are performed.

Transformation. Financial ratios are expected to have monotonic relationships with the observed default rates. Because a non-monotonic variable has an ambiguous relationship to default, it is difficult to draw conclusions about the creditworthiness of a company from it. The monotonicity of each continuous variable can easily be tested by dividing the ratio values into groups that all contain the same number of observations and plotting the groups against their respective observed default rates. In some cases, certain ratios have a non-monotonic relationship with the default rate. For example, the indicator sales growth has a complex relationship with the default probability. Generally, it is better for a firm to grow than to shrink; however, companies that grow too quickly often find themselves unable to meet the management challenges presented by such growth. Moreover, this quick growth is unlikely to be financed out of profits, resulting in a possible build-up of debt and the associated risks. Hence, in order to include this kind of ratio in the scoring model, appropriate transformations have to be performed. This can be done as follows: the value range of the variable is divided into small disjoint intervals, and the empirical default rate for each interval is determined from the given sample. Then the original values of the ratio are transformed to a probability of default according to this smoothed relationship. Since the values of ratios can be larger than one, the probabilities of default obtained in this way need to be transformed to log odds.
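The grouping step and the log-odds transformation can be sketched as follows. Equal-count groups are formed on the sorted ratio values; the small constant `eps` added to the counts is an assumption introduced here to avoid taking the logarithm of zero in groups without defaults (or without non-defaults) and is not part of the procedure described in the text.

```python
from math import log


def binned_log_odds(ratios, defaults, n_bins=5, eps=0.5):
    """Per equal-count group: (lowest ratio, highest ratio, default rate, log odds)."""
    n = len(ratios)
    if n < n_bins:
        raise ValueError("need at least one observation per group")
    order = sorted(range(n), key=lambda i: ratios[i])
    groups = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        bad = sum(defaults[i] for i in members)
        good = len(members) - bad
        default_rate = bad / len(members)
        log_odds = log((bad + eps) / (good + eps))  # eps: assumed continuity correction
        groups.append((ratios[members[0]], ratios[members[-1]], default_rate, log_odds))
    return groups


# Toy data (hypothetical): default rate falls as the ratio rises.
toy_ratios = list(range(1, 21))
toy_defaults = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
table = binned_log_odds(toy_ratios, toy_defaults)
```

Plotting the third column of `table` against the group boundaries gives the monotonicity check described above; a ratio such as sales growth would typically show a U-shaped pattern instead of a steady decline.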

Correlation The strongest variables are grouped. However, the correlation among input ratios should be tested as well, because if highly correlated indicators are included in the model, the estimated coefficients will be significantly and systematically biased. Thus, the correlation between all pre-selected variables should be calculated to identify subgroups of highly correlated indicators. For each correlation subgroup, one can then eliminate some variables and keep one or more that represent the information contained in the other characteristics, based on both statistical and business/operational considerations.


3.2.6 Final Model Production

However, there is still a large number of potential characteristics available for inclusion, and using all of them would tend to 'overfit' the model. To find the best possible model, one effective way is to use a stepwise method, which includes or excludes characteristics at each step based on statistical criteria until the best combination is reached. Forward selection and backward selection are the two main versions of the stepwise method: (i) Forward selection: start with the constant-only model and add characteristics one at a time in order of their predictive power until some cutoff level is reached (e.g., until no remaining characteristic has a p-value less than 0.05 or a univariate Chi-square above a determined level [15]). (ii) Backward selection: the opposite of forward selection. Start with all characteristics and delete them one at a time, least significant first, until all the remaining characteristics have a p-value below 0.1, for example.

3.3 Validation

Finally, the derived credit risk scoring model has to be validated. The aim of validation is to confirm that the model developed is applicable and to ensure that the model has not been overfitted. As mentioned before, 60% to 80% of the total sample is used to estimate the model; validation is performed on the remaining 20% to 40% of the total sample. This validation might include comparing the distributions of scored good customers and bad customers across the development sample and the validation sample, or comparing development statistics for the two samples. Moreover, the model's quality can also be evaluated by Receiver Operating Characteristic curves and other methods which will be described in the next chapter. Significantly different results between the development and validation samples under any of the preceding methods require further analysis. Some people also consider this validation a part of the model development process.

Frequently, scoring models will be validated on a growth period. This validation exercise is similar to the one mentioned above, but has a different purpose. While the objective of the former is to confirm the robustness and goodness of fit of the scoring model, the objective of the latter is to confirm that the model is still valid over time. In some cases, where the development sample is two or three years old, significant changes of portfolio structure might have occurred and need to be identified. Several performance test methods should be applied, based on both classification and prediction ability.

Figure 3.2: Data Split (labels: during model building, after model building; data for model building, data for validation (1), data for validation (2); from year T to T+1, from year T+1 to T+2)

Chapter 4 Methodology Background

4.1 Statistical Model Selection for Scoring Systems

Assume that there are $n$ data points to be separated, for instance, $n$ borrowers of a bank. To assess the default risk of a credit applicant, a bank usually identifies several indicators (characteristics) $X_1, X_2, \ldots, X_p$ such as total assets, liabilities or earnings before interest and taxes (EBIT). So, there are $p$ variables available for each data point. Let $x = (x_1, x_2, \ldots, x_p)$ be the vector of observations of the random variables $X = (X_1, X_2, \ldots, X_p)$. We then select a number $C$ of classes into which these data will be classified. In the case of a credit scoring model, there can be two classes ($C = 2$): default and non-default. Without loss of generality, we assume the default variable $y \in \{0, 1\}$, with the convention that a non-default applicant is labeled 0 and a default applicant is labeled 1.

Linear Regression The linear regression model establishes a linear relationship between the characteristics of borrowers and the default variable:

$$y = \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + u \qquad (4.1)$$

where $u$ is the random error and $u$ is independent of $X$. In vector notation:

$$y = \beta' X + u \qquad (4.2)$$

The standard procedure is to estimate (4.1) with the ordinary least squares (OLS) estimators of $\beta$, which are denoted in vector form by $b$:

$$S = E(Y \mid X = x) = b' x \qquad (4.3)$$

This means that the score represents the expected value of the performance variable conditional on observing a realization of the characteristics. Although $y$ is a binary variable, the score $S$ is usually continuous; moreover, the prediction can take values larger than 1 or smaller than 0. While the score $S$ can be used to segregate 'good' borrowers from 'bad' borrowers, the output of the model cannot be interpreted as a probability of default.

Logit and Probit Let the conditional probability that the applicant or borrower defaults be denoted by $\pi(x)$:

$$\pi(x) = P(Y = 1 \mid X = x) = F(\beta' x). \qquad (4.4)$$

Here $F(\cdot)$ denotes an unknown distribution function of the indicators $x$. If we assume $F(\cdot)$ to be the standard normal distribution function, we are faced with the probit model:

$$F(\beta' x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\beta' x} e^{-t^2/2}\, dt \qquad (4.5)$$

If the logistic distribution is selected, it leads to the logit model:

$$F(\beta' x) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p}} \qquad (4.6)$$

Note that (4.6) implies a linear relationship between the log odds $g(x)$ ¹ and the input characteristics:

$$g(x) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p \qquad (4.7)$$

The results generated by both models can be interpreted directly as default probabilities, and in both cases the parameters are estimated by maximum likelihood. The differences in the results of these two kinds of models are often negligible, as both distribution functions have a similar form except that the logistic distribution has thicker tails than the normal distribution. However, the logit model is easier to handle. First of all, the calculation of logit models is simpler than that of probit models. Second, the coefficients of the logit model can be more easily interpreted (as shown in equation (4.7)).

¹ In some literature g(x) is also called the logit transformation [4].
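To make the relationship between (4.5), (4.6) and (4.7) concrete, the following Python sketch evaluates both link functions for one borrower. The coefficient and characteristic values are hypothetical and chosen only for illustration; they are not estimates from the thesis.

```python
import math

def logit_cdf(z):
    """Logistic distribution function, as in (4.6) with z = beta' x."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    """Standard normal distribution function, as in (4.5)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_odds(p):
    """Log odds g = ln(p / (1 - p)), as in (4.7)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients (beta_0, beta_1, beta_2) and characteristics (x_1, x_2).
beta = [-3.0, 0.8, -0.5]
x = [1.2, 0.4]
z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

pd_logit = logit_cdf(z)
pd_probit = probit_cdf(z)
g = log_odds(pd_logit)  # recovers z up to rounding, which is the content of (4.7)
```

Because the log odds of the logit PD equal the linear index itself, each coefficient can be read as the change in log odds per unit change of the corresponding characteristic. Far in the tails the logistic function assigns more probability than the normal one, which reflects the thicker tails mentioned above.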


Neural Networks In recent years, neural networks have been discussed extensively as an alternative to the models discussed above. A neural network is a mathematical representation inspired by the way the human brain processes information. The basic principle of this type of model is a series of algorithms that allow for some learning through experience, in order to recognize the relationship between borrower characteristics and the probability of default and to determine which characteristics are most important in predicting default. The advantage of this method is that, unlike statistical models, it does not require any distributional assumptions and it can continuously adjust to new information, such as changes in routine business and economic cycles. Although some argue that neural networks are able to outperform probit or logit regressions in achieving higher prediction accuracy ratios, there are some findings showing that the differences in performance between neural networks and standard statistical models are either non-existent or marginal. Since the logit model makes it easy to check whether the empirical dependence between potential inputs and the probability of default is economically meaningful, many credit scoring model developers still prefer to use the logit model. Therefore, the remainder of this thesis will mainly focus on the development and validation process for logit models.

4.2 Bootstrapping and Confidence Intervals

Parametric and Nonparametric There are two situations to distinguish: the parametric and nonparametric. When there is a particular mathematical model, with adjustable constants or parameters that fully determine the probability density function, such a model is called parametric and statistical methods based on this model are parametric methods. When no such mathematical model is used, the statistical analysis is nonparametric [5]. We do not cover the parametric bootstrap here, and concentrate instead on the nonparametric bootstrap.

What is Bootstrapping? Bootstrapping is a simulation method introduced by Efron in 1979. With the bootstrap method, the original sample is treated as the population and a Monte Carlo-style procedure is conducted on it. This is done by randomly resampling with replacement R times, with each resample having the same size as the original sample. Resampling with replacement means that if an element of the original sample is selected, it is not removed from the original sample before the next element is selected. In other words, an individual element from the original sample can be included repeatedly within a bootstrap sample while other elements may not be included at all.


The more bootstrap replications we use, the more reliable the result will be. In general, using at least 100 replications is recommended.

Confidence Interval Suppose that random variables $X_1, X_2, \ldots, X_n$ are independent and identically distributed according to a distribution $F$. The expectation and variance of the observations $X_i$, $i = 1, \ldots, n$, are given by

$$\mu = \int x \, dF(x) \quad \text{and} \quad \sigma^2 = \int (x - \mu)^2 \, dF(x).$$

Consider the problem of constructing a two-sided confidence interval for the expectation $\mu$, based on the sample mean

$$\bar X_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n).$$

Let $\hat\sigma^2$ denote the sample variance

$$\hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X_n)^2.$$

By the Central Limit Theorem we have

$$\sqrt{n}\left(\frac{\bar X_n - \mu}{\hat\sigma}\right) \xrightarrow{D} N,$$

where $\xrightarrow{D}$ stands for convergence in distribution and $N$ stands for the standard normal random variable. Let $u_{\alpha/2}$ denote the value for which $P(N \le u_{\alpha/2}) = 1 - \frac{\alpha}{2}$. Then

$$\lim_{n \to \infty} P\left(-u_{\alpha/2} \le \sqrt{n}\left(\frac{\bar X_n - \mu}{\hat\sigma}\right) \le u_{\alpha/2}\right) = 1 - \alpha,$$

which is equivalent to

$$\lim_{n \to \infty} P\left(\bar X_n - n^{-1/2} u_{\alpha/2}\, \hat\sigma \le \mu \le \bar X_n + n^{-1/2} u_{\alpha/2}\, \hat\sigma\right) = 1 - \alpha.$$

It follows that a two-sided confidence interval for $\mu$ with confidence level $1 - \alpha$ is

$$I(\alpha) = \left(\bar X_n - n^{-1/2} u_{\alpha/2}\, \hat\sigma,\ \bar X_n + n^{-1/2} u_{\alpha/2}\, \hat\sigma\right) \ [1].$$
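As a small illustration of the interval $I(\alpha)$, the following Python sketch computes it with the standard library only. The function name and the sample values are assumptions made for this example.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_ci(sample, alpha=0.05):
    """Two-sided confidence interval for the mean based on the normal
    approximation: sample mean +/- n^(-1/2) * u_(alpha/2) * sigma_hat."""
    n = len(sample)
    x_bar = mean(sample)
    sigma_hat = stdev(sample)                 # uses the n - 1 denominator
    u = NormalDist().inv_cdf(1 - alpha / 2)   # quantile with P(N <= u) = 1 - alpha/2
    half_width = u * sigma_hat / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical observations.
lower, upper = normal_ci([2.0, 4.0, 6.0, 8.0], alpha=0.05)
```

With only four observations the normal approximation is of course crude; the example is meant to show the arithmetic, not a realistic sample size.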

Bootstrap Confidence Interval We continue with the general idea of the bootstrap method for constructing the confidence interval. The empirical distribution function $\hat F_n$, the estimator of $F$, is defined by

$$\hat F_n(x) = \frac{1}{n}\{\text{number of } X_i : X_i \le x,\ 1 \le i \le n\}.$$


Let $X_1^*, X_2^*, \ldots, X_n^*$ denote a bootstrap sample drawn from $X_1, X_2, \ldots, X_n$. The basic idea of the bootstrap method is to approximate the distribution of $\sqrt{n}(\bar X_n - \mu(F))$ by the conditional distribution of $\sqrt{n}(\bar X_n^* - \mu(\hat F_n)) = \sqrt{n}(\bar X_n^* - \bar X_n)$, given $\mu(\hat F_n) = \bar X_n$. Hence we approximate

$$G_n(x) = P\left(\sqrt{n}(\bar X_n - \mu(F)) \le x\right)$$

by

$$G_n^*(x) = P\left(\sqrt{n}(\bar X_n^* - \mu(\hat F_n)) \le x\right).$$

The random function $G_n^*$ is called the bootstrap approximation of $G_n$. Before calculating the bootstrap confidence interval, let us first define the inverse function $G^{-1}$ by

$$G^{-1}(y) = \inf\{x : G(x) \ge y\}, \qquad 0 < y < 1.$$

Since the following argument holds under sufficient conditions on $F$,

$$1 - \alpha = P\left(G_n^{-1}\left(\tfrac{\alpha}{2}\right) < \sqrt{n}(\bar X_n - \mu) < G_n^{-1}\left(1 - \tfrac{\alpha}{2}\right)\right) \simeq P\left(G_n^{*-1}\left(\tfrac{\alpha}{2}\right) < \sqrt{n}(\bar X_n - \mu) < G_n^{*-1}\left(1 - \tfrac{\alpha}{2}\right)\right),$$

solving the inequalities for $\mu$ gives a bootstrap confidence interval for $\mu$:

$$\left(\bar X_n - n^{-1/2}\, G_n^{*-1}\left(1 - \tfrac{\alpha}{2}\right),\ \bar X_n - n^{-1/2}\, G_n^{*-1}\left(\tfrac{\alpha}{2}\right)\right) \ [1].$$

The Percentile Interval The simplest method for calculating a bootstrap confidence interval is the percentile interval. To obtain a confidence interval with level $1 - \alpha$, we simply select the bootstrap estimates which lie on the $\frac{\alpha}{2}$th percentile and the $(1 - \frac{\alpha}{2})$th percentile. For example, to obtain a confidence interval for the mean $\mu$, we could run the bootstrap $R$ times and calculate $R$ bootstrap estimates of the mean. We could rank them from low to high and take the $(R + 1)\frac{\alpha}{2}$th value as the lower limit and the $(R + 1)(1 - \frac{\alpha}{2})$th value as the upper limit ($(R + 1)\frac{\alpha}{2}$ and $(R + 1)(1 - \frac{\alpha}{2})$ should be integers). In general, we should expect this to be more accurate, provided $R$ is large enough. In practice, if confidence levels 0.95 or 0.99 are required, then it is advisable to have $R = 999$ or more.
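The percentile interval can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the fixed seed and the example data are assumptions, and the ranks follow the convention that the lower limit is the $(R+1)\frac{\alpha}{2}$th ordered value and the upper limit the $(R+1)(1-\frac{\alpha}{2})$th.

```python
import random

def bootstrap_percentile_ci(sample, alpha=0.05, R=999, seed=0):
    """Nonparametric bootstrap percentile interval for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    boot_means = []
    for _ in range(R):
        resample = rng.choices(sample, k=n)   # draw n elements with replacement
        boot_means.append(sum(resample) / n)
    boot_means.sort()
    lower_rank = round((R + 1) * alpha / 2)
    upper_rank = round((R + 1) * (1 - alpha / 2))
    if not 1 <= lower_rank <= upper_rank <= R:
        raise ValueError("R is too small for the requested confidence level")
    # Ranks are 1-based, list indices are 0-based.
    return boot_means[lower_rank - 1], boot_means[upper_rank - 1]

# Hypothetical sample: the integers 1 to 100, whose mean is 50.5.
data = list(range(1, 101))
lower, upper = bootstrap_percentile_ci(data, alpha=0.05, R=999)
```

With $R = 999$ and $\alpha = 0.05$ the ranks are 25 and 975, so both are integers as required above.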


4.3 The ROC Approach

4.3.1 Receiver Operating Characteristics

One common method to represent the discriminatory power of a scoring model is the receiver operating characteristic (ROC) curve. The construction of an ROC curve is illustrated in Figure 4.1, which shows a possible distribution of the rating scores for default and non-default counterparties. For a perfect rating model the distributions of defaulters and non-defaulters should be separate, but in the real world perfect discrimination is in general not possible, and the two distributions will overlap as shown in Figure 4.1. V is a cut-off value which provides a simple decision rule to divide counterparties into potential defaulters and non-defaulters. Four scenarios can then occur, which are summarized in Table 4.1.

Figure 4.1: Rating Score distributions for defaulters and non-defaulters (x-axis: rating score, with cut-off value V; y-axis: frequency; curves: non-defaulters, defaulters)

Table 4.1: Decision results

Scores     Observed defaults                            Observed non-defaults
Above V    True positive prediction (Sensitivity)       False positive prediction (1-Specificity)
Below V    False negative prediction (1-Sensitivity)    True negative prediction (Specificity)


If a counterparty with a rating score larger than V defaults, or a non-defaulter has a rating score lower than V, then the prediction of the rating system is correct. Otherwise, the rating system makes a wrong prediction. The proportion of correctly predicted defaulters is called sensitivity and the proportion of correctly predicted non-defaulters is called specificity ². For a given cut-off value V, a rating system should have a high sensitivity and specificity. In statistics, the false positive prediction (1 − specificity) is also called type I error, which is defined as the error of rejecting a null hypothesis that should have been accepted. The false negative prediction (1 − sensitivity) is called type II error, which means the error of accepting a null hypothesis that should have been rejected.

Traditionally, a Receiver Operating Characteristic (ROC) curve shows the false positive prediction rate 1 − specificity on the X axis and the true positive rate sensitivity on the Y axis, as illustrated in Figure 4.2. We denote the set of defaulters by $D$, the set of non-defaulters by $ND$ and the set of all counterparties by $T$. Let $S_D$ denote the distribution of the scores of the defaulting debtors, and let $S_{ND}$ be the score distribution of the non-defaulters. For any cut-off value $v$, we have

$$F_D(v) = P(S_D \ge v)$$
$$F_{ND}(v) = P(S_{ND} \ge v)$$

Then $F_D(v)$ is the sensitivity of the rating system based on the cut-off $v$ and $F_{ND}(v)$ is the corresponding 1 − specificity. As $v$ varies over the possible rating scores, the ROC curve can be plotted.

From Figure 4.2, it can be seen that the ROC curve starts in the point (0, 0) and ends in the point (1, 1). If the cut-off value is above the maximum rating score, then the sensitivity becomes 0 and the specificity becomes 1, and the ROC curve passes through (0, 0). Conversely, if the cut-off value is below the minimum rating score, then the sensitivity is 1 and the specificity is 0, and the ROC curve increases monotonically to the point (1, 1). For a perfect model, defaulters and non-defaulters are separated perfectly. So if the cut-off value is in the score range of defaulters, then the sensitivity is less than 1 but the specificity is 1; if the cut-off value is in the score range of non-defaulters, then the sensitivity is 1 but the specificity is less than 1; and if the cut-off value is above the maximum rating score of non-defaulters but below the minimum rating score of defaulters, then both sensitivity and specificity are 1. Therefore, the corresponding ROC curve is a vertical line going from (0,0) to (0,1) and then a horizontal line linking (0,1) to (1,1). For a random model, the rating system contains no discriminatory power, so the correct prediction rate should equal the wrong prediction rate and the corresponding ROC curve is the diagonal. Therefore, to be informative, the entire curve should lie above the 45° line.

² In some literature, "sensitivity" is called "hit rate" and "1-specificity" is called "false alarm" [8].
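The construction of the empirical ROC curve can be sketched in Python as follows. The sketch assumes, as in the text, that a counterparty is predicted to default when its score is at least the cut-off $v$; the function names and the toy scores are hypothetical.

```python
def roc_points(defaulter_scores, non_defaulter_scores):
    """Return the points (1 - specificity, sensitivity) of the empirical ROC
    curve, one for each distinct cut-off v, scanning from high to low scores."""
    cutoffs = sorted(set(defaulter_scores) | set(non_defaulter_scores), reverse=True)
    points = [(0.0, 0.0)]  # cut-off above the maximum score
    for v in cutoffs:
        sensitivity = sum(s >= v for s in defaulter_scores) / len(defaulter_scores)
        false_alarm = sum(s >= v for s in non_defaulter_scores) / len(non_defaulter_scores)
        points.append((false_alarm, sensitivity))
    return points

def area_under(points):
    """Area under the piecewise linear curve through the points (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Perfectly separated toy scores: every defaulter scores above every non-defaulter.
perfect = roc_points([5, 6], [1, 2])

# Overlapping toy scores with one tie at the value 3.
overlap = roc_points([3, 4, 5], [1, 2, 3])
```

For the perfectly separated scores the points trace the vertical and then the horizontal segment described above, and the last point is always (1, 1) because the lowest cut-off classifies every counterparty as a defaulter.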

Figure 4.2: Receiver operating characteristics curves (x-axis: 1-Specificity, from 0 to 1; y-axis: Sensitivity, from 0 to 1; labels: Scoring model, Perfect model, Random model, AUROC)

4.3.2 Analysis of Areas Under ROC Curves

The area under the ROC curve (AUROC) provides a measure of the model's discriminatory power. For a random model without discriminatory power, AUROC is 0.5; for a perfect model, AUROC is 1. In practice, it is between 0.5 and 1 for any reasonable rating model. A model with greater discriminatory power has a larger AUROC. Following the notation of the previous section, AUROC is defined as

$$\mathrm{AUROC} = \int_0^1 F_D(v)\, dF_{ND}(v).$$

In this section, we will discuss the statistical properties of AUROC. Since AUROC can be interpreted in terms of a probability, we could compute the confidence intervals for AUROC based on the work of DeLong (1988) [7].


Probability Interpretation of AUROC Consider the following experiment. There are two counterparties: one of them is randomly picked from the defaulters and the other is randomly picked from the non-defaulters. Someone has to decide which of the counterparties is the defaulter. Suppose that defaulters are the counterparties with the higher rating scores; if both counterparties have the same rating score, the decision-maker tosses a coin. With the notation of the last section, the probability that the decision-maker makes a correct choice is equal to $P(X > Y) + \frac{1}{2} P(X = Y)$, where $X$ and $Y$ are the scores of defaulters and non-defaulters, respectively. We will show that AUROC is exactly equal to this probability.

Assume the score of a rating system can range from 0 to $r$. Divide the entire score range into $k$ intervals by cut-off values $r_i$, where $i = 1, 2, \ldots, k$, $r_i < r_j$ for all $i < j$, $r_1 > 0$ and $r_k = r$. If the score of a counterparty falls into the interval $(r_{i-1}, r_i]$, then the counterparty is assigned the rating $i$. Let $S_D$ and $S_{ND}$ represent the rating distributions of defaulters and non-defaulters, respectively. Define

$$F_D^i = P(S_D \ge r_i), \quad i = 1, \ldots, k$$
$$F_{ND}^i = P(S_{ND} \ge r_i), \quad i = 1, \ldots, k$$

and denote the empirical distribution functions by

$$\hat F_D^i = P(\hat S_D \ge r_i), \quad i = 1, \ldots, k$$
$$\hat F_{ND}^i = P(\hat S_{ND} \ge r_i), \quad i = 1, \ldots, k$$

Then the area under an empirical ROC curve can be calculated as

$$\begin{aligned}
\mathrm{AUROC} &= \sum_{i=1}^{k} \frac{1}{2}\left(P(\hat S_D \ge r_i) + P(\hat S_D \ge r_{i-1})\right) \cdot \left(P(\hat S_{ND} \ge r_{i-1}) - P(\hat S_{ND} \ge r_i)\right) \\
&= \sum_{i=1}^{k} \left(P(\hat S_D \ge r_i) + \frac{1}{2} \cdot P(\hat S_D = r_{i-1})\right) \cdot P(\hat S_{ND} = r_{i-1}) \\
&= \sum_{i=1}^{k} \left[ P(\hat S_D \ge r_i) \cdot P(\hat S_{ND} = r_{i-1}) + \frac{1}{2} \cdot P(\hat S_D = r_{i-1}) \cdot P(\hat S_{ND} = r_{i-1}) \right] \\
&= P(\hat S_D > \hat S_{ND}) + \frac{1}{2} P(\hat S_D = \hat S_{ND}).
\end{aligned}$$

Calculation of Confidence Intervals for AUROC Next, we introduce DeLong's method of calculating the confidence interval for AUROC, which is more efficient than the commonly applied bootstrapping.


Suppose $X$ and $Y$ are the scores of defaulters and non-defaulters, respectively. It has been shown that the area under an empirical ROC curve is equal to the Wilcoxon-Mann-Whitney two-sample U-statistic applied to the samples $X_i$ and $Y_j$ [?]. The Wilcoxon-Mann-Whitney U-statistic estimates the probability $U$ that the score of a randomly selected non-defaulter will be less than or equal to the score of a randomly selected defaulter. It can be calculated as the average over a kernel $\psi$,

$$\hat U = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \psi(X_i, Y_j),$$

where

$$\psi(X, Y) = \begin{cases} 1 & Y < X \\ \frac{1}{2} & Y = X \\ 0 & Y > X \end{cases}$$

Observe that $\hat U$ is an unbiased estimator of $P(Y < X) + \frac{1}{2} P(Y = X)$, i.e.

$$\mathrm{AUROC} = E(\hat U) = P(Y < X) + \frac{1}{2} P(Y = X).$$

For continuous distributions, one has $P(Y = X) = 0$. Define

$$\begin{aligned}
\xi_{10} &= E[\psi(X_i, Y_j)\, \psi(X_i, Y_l)] - U^2, \quad j \ne l; \\
\xi_{01} &= E[\psi(X_i, Y_j)\, \psi(X_l, Y_j)] - U^2, \quad i \ne l; \\
\xi_{11} &= E[\psi(X_i, Y_j)\, \psi(X_i, Y_j)] - U^2.
\end{aligned}$$

Then the variance of the Wilcoxon-Mann-Whitney statistic is given as

$$\sigma_{\hat U}^2 = \frac{(n - 1)\, \xi_{10} + (m - 1)\, \xi_{01}}{mn} + \frac{\xi_{11}}{mn}. \ [7]$$

Let us introduce a quantity $P_{XXY}$, which is the probability that the scores of two randomly selected defaulters will both be greater than or both be less than the score of a randomly selected non-defaulter, minus the complementary probability that the score of the randomly selected non-defaulter will be between the two scores of the randomly selected defaulters. $P_{YYX}$ is given by a similar definition. They can be expressed by the following formulas:

$$P_{XXY} = P(X_1, X_2 > Y) + P(X_1, X_2 < Y) - P(X_1 < Y < X_2) - P(X_2 < Y < X_1) \qquad (4.8)$$

$$P_{YYX} = P(Y_1, Y_2 > X) + P(Y_1, Y_2 < X) - P(Y_1 < X < Y_2) - P(Y_2 < X < Y_1). \qquad (4.9)$$


Then the unbiased estimator of the variance $\sigma_{\hat U}^2$ is given in terms of $P_{XXY}$ and $P_{YYX}$:

$$\hat\sigma_{\hat U}^2 = \frac{1}{4mn}\left[\hat P(X \ne Y) + (m - 1)\hat P_{XXY} + (n - 1)\hat P_{YYX} - 4(m + n - 1)\left(\hat U - \frac{1}{2}\right)^2\right] \qquad (4.10)$$

where $\hat P(X \ne Y)$, $\hat P_{XXY}$ and $\hat P_{YYX}$ are the estimators of the probabilities $P(X \ne Y)$, $P_{XXY}$ and $P_{YYX}$, respectively. It is known that $(\mathrm{AUROC} - \hat U)/\hat\sigma_{\hat U}$ is asymptotically normally distributed with mean zero and standard deviation one as $m \to \infty$ and $n \to \infty$. Therefore, for a given confidence level $1 - \alpha$, the confidence interval for AUROC can be computed as [8]

$$\left[\hat U - \hat\sigma_{\hat U}\, \Phi^{-1}(1 - \alpha/2),\ \hat U + \hat\sigma_{\hat U}\, \Phi^{-1}(1 - \alpha/2)\right].$$
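The point estimate and a normal-approximation interval can be sketched in Python as follows. The estimate $\hat U$ is the kernel average defined above. For the variance, this sketch does not implement (4.10): it swaps in the structural-components form of the estimator of DeLong et al. [7], which uses the sample variances of the per-observation kernel averages. That substitution, the function names and the toy scores are assumptions of this example.

```python
from statistics import NormalDist, variance

def psi(x, y):
    """Kernel of the U-statistic: 1 if y < x, 1/2 if y == x, 0 if y > x."""
    if y < x:
        return 1.0
    if y == x:
        return 0.5
    return 0.0

def auroc_with_ci(defaulters, non_defaulters, alpha=0.05):
    """Return (U_hat, (lower, upper)) for the empirical AUROC.
    Each group needs at least two observations for the sample variances."""
    m, n = len(defaulters), len(non_defaulters)
    # Average kernel value of each defaulter against all non-defaulters, and vice versa.
    v10 = [sum(psi(x, y) for y in non_defaulters) / n for x in defaulters]
    v01 = [sum(psi(x, y) for x in defaulters) / m for y in non_defaulters]
    u_hat = sum(v10) / m
    # Structural-components variance estimate (sample variances use n - 1).
    var_u = variance(v10) / m + variance(v01) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * var_u ** 0.5
    return u_hat, (u_hat - half_width, u_hat + half_width)

# Hypothetical scores: defaulters tend to score higher, with one tie at 3.
u_hat, (lower, upper) = auroc_with_ci([3, 4, 5], [1, 2, 3], alpha=0.05)
```

The interval is not clipped to [0, 1], so for very small samples or an AUROC close to 1 the upper limit can exceed 1; the pairwise loop costs $m \cdot n$ kernel evaluations, which is still far cheaper than bootstrapping the same quantity.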

4.4 Statistical Tests for Rating System Calibration

Validating the calibration of rating systems or score variables means comparing the realized default frequency with the estimates of the conditional default probability given the score, and analyzing the difference between the observed default frequency and the estimated probability of default. There are several statistical methods for validating the probability of default, such as the Binomial test, the Spiegelhalter test and the Hosmer-Lemeshow Chi-square test. While the Binomial test can only be applied to one single rating grade over a single time period, the Spiegelhalter test and the Hosmer-Lemeshow ($\chi^2$) test provide more advanced methods that can be used to test the adequacy of the default probability prediction over a single time period for several rating grades. We only focus on the Hosmer-Lemeshow ($\chi^2$) test and the Spiegelhalter test in this paper.

4.4.1 Hosmer-Lemeshow Test

The Hosmer-Lemeshow test (2000) statistic originated in the field of categorical regression and is often referred to as a goodness-of-fit test. Consider a credit scoring system with $N$ borrowers that are classified into $L$ different rating classes according to their credit scores. Let $N_i$ denote the number of customers that are classified to rating class $i \in \{1, \ldots, L\}$. Then we have

$$N = \sum_{i=1}^{L} N_i.$$

Furthermore, let $d_i$ denote the number of defaulted customers within rating class $i \in \{1, \ldots, L\}$. Then, for the customers within rating class $i$, the actual default rate is $p_i = d_i / N_i$. Finally, assume that customers within rating class $i$ are assigned a probability of default $PD_i$. Assume:

(A.1) The predicted default probabilities $PD_i$ and the realized default rates $p_i$ are identically distributed.
(A.2) All the default events within each different rating class as well as between all rating classes are independent.

Under the hypotheses

$$H_0 : p_i = PD_i, \quad \forall i$$
$$H_1 : p_i \ne PD_i, \quad \forall i$$

define the statistic

$$C_L = \sum_{i=1}^{L} \frac{(N_i \cdot PD_i - d_i)^2}{N_i \cdot PD_i \cdot (1 - PD_i)} \qquad (4.11)$$

Under the assumptions (A.1) and (A.2), when $N_i \to \infty$ simultaneously for all $i \in \{1, \ldots, L\}$, by the central limit theorem, $C_L$ converges in distribution towards a $\chi^2$-distribution with $L - 2$ degrees of freedom [4]. The p-value of a $\chi^2$-test is a measure to assess the adequacy of the estimated default probabilities: the closer the p-value is to zero, the worse the estimation is. However, if the estimated default probabilities are very small, the rate of convergence to the $\chi^2$-distribution may be very low as well. Moreover, p-values provide a possible way to directly compare forecasts with different numbers of rating categories. We should notice that since the Hosmer-Lemeshow test is based on the assumption of independence and a normal approximation, the test is likely to underestimate the true type I error [16].
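The statistic (4.11) is straightforward to compute once the rating classes are tabulated. The following Python sketch does only that; the function name and the class counts are hypothetical, and the final step of comparing $C_L$ with a $\chi^2$ quantile is left to a statistics library because the Python standard library has no $\chi^2$ distribution.

```python
def hosmer_lemeshow_statistic(class_sizes, class_defaults, class_pds):
    """Compute C_L = sum over classes of (N_i * PD_i - d_i)^2 / (N_i * PD_i * (1 - PD_i))."""
    statistic = 0.0
    for n_i, d_i, pd_i in zip(class_sizes, class_defaults, class_pds):
        expected = n_i * pd_i  # expected number of defaults in the class
        statistic += (expected - d_i) ** 2 / (expected * (1.0 - pd_i))
    return statistic

# Hypothetical rating classes: sizes N_i, observed defaults d_i, forecast PD_i.
sizes = [100, 200, 150]
defaults = [5, 10, 12]
pds = [0.02, 0.05, 0.08]
c_l = hosmer_lemeshow_statistic(sizes, defaults, pds)
```

In this toy example the second and third classes default exactly as forecast, so the whole statistic comes from the first class, where 5 defaults were observed against 2 expected.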

4.4.2 Spiegelhalter Test

Normally the predicted default probability of each borrower is calculated individually. Since the Hosmer-Lemeshow Chi-square test requires averaging the predicted PDs of customers that have been classified within the same rating class, some bias might arise in the calculation. One can avoid this problem by using the Spiegelhalter test [13]. Consider $N$ customers in the credit scoring system. Each customer is assigned a score $s_i$ and an estimated default probability $\hat\theta_i$, where $i \in \{1, \ldots, N\}$. The Spiegelhalter test is based on the Mean Square Error (MSE), which is also known as the Brier score in the context of validation:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat\theta_i)^2 \qquad (4.12)$$

where $y_i$, $i \in \{1, \ldots, N\}$, is the indicator variable for the default events,

$$y_i = \begin{cases} 1 & \text{if customer } i \text{ is a defaulter} \\ 0 & \text{if customer } i \text{ is a non-defaulter} \end{cases} \qquad (4.13)$$

From equation (4.12), it can be seen that if defaulters are assigned a high predicted default probability and non-defaulters a low predicted PD, the MSE gets small. Therefore, in general, a smaller value of the MSE indicates a better rating system, and the higher the MSE, the worse the performance of the rating system. The Spiegelhalter test is an approach to assess the difference between the observed MSE and its expected value. Assume all the default events within each different rating class as well as between all rating classes are independent. Let $\theta_i$, $i \in \{1, \ldots, N\}$, denote the observed PD. Consider the hypotheses:

$$H_0 : \theta_i = \hat\theta_i, \quad \forall i$$
$$H_1 : \theta_i \ne \hat\theta_i, \quad \text{for some } i$$

It can be shown that under $H_0$ we have

$$E[\mathrm{MSE}] = \frac{1}{N} \sum_{i=1}^{N} \theta_i \cdot (1 - \theta_i) \qquad (4.14)$$

$$\mathrm{Var}[\mathrm{MSE}] = \frac{1}{N^2} \sum_{i=1}^{N} (1 - 2\theta_i)^2 \cdot \theta_i \cdot (1 - \theta_i) \qquad (4.15)$$


Under the null hypothesis and the assumption of independence given the scores, it can be shown that the statistic

$$Z = \frac{\mathrm{MSE} - E[\mathrm{MSE}]}{\sqrt{\mathrm{Var}[\mathrm{MSE}]}} = \frac{\frac{1}{N} \sum_{i=1}^{N} (y_i - \theta_i)^2 - \frac{1}{N} \sum_{i=1}^{N} \theta_i \cdot (1 - \theta_i)}{\sqrt{\frac{1}{N^2} \sum_{i=1}^{N} (1 - 2\theta_i)^2 \cdot \theta_i \cdot (1 - \theta_i)}} \qquad (4.16)$$

asymptotically follows a standard normal distribution by the central limit theorem. Performing the two-sided Gauss test, the critical values can easily be derived as the $\alpha/2$ and $1 - \alpha/2$ quantiles of the standard normal distribution [13].
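The test statistic (4.16) and its two-sided p-value can be sketched in Python with the standard library. Under $H_0$ the unknown $\theta_i$ are replaced by the forecasts $\hat\theta_i$; the function name and the toy data are assumptions of this example.

```python
import math
from statistics import NormalDist

def spiegelhalter_test(y, pd_hat):
    """Return (MSE, Z, two-sided p-value) for default indicators y (0 or 1)
    and forecast default probabilities pd_hat, with theta_i = pd_hat_i under H0."""
    n = len(y)
    mse = sum((yi - p) ** 2 for yi, p in zip(y, pd_hat)) / n
    expected_mse = sum(p * (1.0 - p) for p in pd_hat) / n
    # The variance is zero if every forecast equals 0.5; Z is undefined in that case.
    var_mse = sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p) for p in pd_hat) / n ** 2
    z = (mse - expected_mse) / math.sqrt(var_mse)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return mse, z, p_value

# Hypothetical portfolio: four customers with a forecast PD of 25%, one default observed.
mse, z, p_value = spiegelhalter_test([1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25])
```

In this toy portfolio the observed default rate matches the forecast exactly, so the observed MSE equals its expectation and the statistic is zero.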

Chapter 5 Empirical Analysis and Results

5.1 Model Description

The credit scoring model that we will validate is used to estimate the default probability over a 12-month time horizon for Deutsche Bank's German customers. There are 20 financial variables in the score function. To restrain the distortions caused by outliers, extreme values without default indication power were trimmed at levels corresponding to the right and left end points of the value range. Missing values were replaced by the probability of default or its score equivalent. Moreover, all variables were normalized using a probability of default transformation. To do so, the first step is, for each variable Pxx, xx = 01, 02, ..., 20, to partition the whole value range into n subintervals [(Pxx, i − 1), (Pxx, i)], i = 1, ..., n. A Taylor approximation was applied to each subinterval and the constant, linear, quadratic and cubic terms (const(Pxx, i), lin(Pxx, i), qua(Pxx, i), cub(Pxx, i)) were calculated. The PD transformation of each variable is then calculated as follows:

$$\begin{aligned}
\mathrm{PD}(\mathrm{Pxx}, i) = {} & \mathrm{const}(\mathrm{Pxx}, i) \\
& + \mathrm{lin}(\mathrm{Pxx}, i) \cdot (\mathrm{Pxx} - (\mathrm{Pxx}, i - 1)) \\
& + \mathrm{qua}(\mathrm{Pxx}, i) \cdot (\mathrm{Pxx} - (\mathrm{Pxx}, i - 1))^2 \\
& + \mathrm{cub}(\mathrm{Pxx}, i) \cdot (\mathrm{Pxx} - (\mathrm{Pxx}, i - 1))^3
\end{aligned} \qquad (5.1)$$

$$\mathrm{logodd}(\mathrm{Pxx}) = \ln\left(\frac{\mathrm{PD}(\mathrm{Pxx})}{1 - \mathrm{PD}(\mathrm{Pxx})}\right) \qquad (5.2)$$

The score function is a linear combination of the PD-transformed ratios logodd(Pxx), the turnover size class probability proxies and the Boolean variables of the industry segmentation:

$$\mathrm{SCORE} = \mathrm{constant} + \sum \left[\mathrm{bool}(\mathrm{industry}) \cdot \mathrm{const}(\mathrm{Pxx}, \mathrm{industry}) + \mathrm{prob}(\mathrm{turnover\ size}) \cdot \mathrm{const}(\mathrm{Pxxxx}, \mathrm{turnover\ size})\right] \cdot \mathrm{logodd}(\mathrm{Pxx}). \qquad (5.3)$$

Based on the resulting score value, the customer's probability of default is calculated using the following formula:

$$\mathrm{PD} = \frac{\exp(a + \mathrm{SCORE} \cdot b)}{1 + \exp(a + \mathrm{SCORE} \cdot b)}, \qquad (5.4)$$

where $a$, $b$ are constants.
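The chain from a raw ratio to a default probability in (5.1), (5.2) and (5.4) can be sketched in Python as follows. All numeric values below (the subinterval end point, the cubic coefficients and the constants `a` and `b`) are hypothetical placeholders, since the actual model parameters are not given in the text.

```python
import math

def pd_transform(value, left_end, const, lin, qua, cub):
    """Cubic PD transformation (5.1) on the subinterval that starts at left_end."""
    h = value - left_end
    return const + lin * h + qua * h ** 2 + cub * h ** 3

def logodd(pd):
    """Log odds of a PD-transformed ratio, as in (5.2)."""
    return math.log(pd / (1.0 - pd))

def score_to_pd(score, a, b):
    """Logistic mapping (5.4) from the score to the probability of default."""
    z = a + score * b
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical ratio value lying in a subinterval that starts at 1.0.
pd_ratio = pd_transform(1.5, left_end=1.0, const=0.02, lin=0.01, qua=0.0, cub=0.0)
ratio_logodd = logodd(pd_ratio)

# Hypothetical constants a and b; with b > 0 a higher score gives a higher PD.
pd_low = score_to_pd(-1.0, a=-4.0, b=1.0)
pd_high = score_to_pd(1.0, a=-4.0, b=1.0)
```

The weighting of the individual log odds in (5.3) is omitted here because it depends on the industry and turnover size coefficients of the actual model.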

5.2 The Validation Data Set

While the scoring model was developed on historical data from 1996 to 2001, the original validation sample consisted of 30021 firm-year observations of balance sheets from 2003 to 2005 and default information spanning from 2003 to 2006. After excluding the customers that defaulted before the score creation date, there are 29250 observations left. Table 5.1 shows the number of observed customers per year and divides the sample into two categories: defaulters and non-defaulters.

Table 5.1: Number of observations and defaults per year

Year     All      Default   Non-default
2003     10805    169       10636
2004     9978     159       9819
2005     8467     91        8376
Total    29250    328       28831

In the validation sample, missing values and outliers are treated in the same way as in model development.


Figure 5.1: Score-PD

Following the formulas in the previous section, it is clear that a customer with a high score has a high predicted default probability, as shown in Figure 5.1. Figure 5.2 shows the distribution of scores in the validation sample.

Figure 5.2: Score-KDE

5.3 Variables Analysis

Before validating the scoring model, we need to identify the changes of portfolio structure over time, so a univariate analysis of each input variable in the scoring model is required. First, for each variable, the histograms and boxplots for the samples of the three years have been plotted (see Appendix A). Although the shape of the histogram of each input variable is almost the same in the different years, the boxplots show that the mean, median and quantiles of each input variable differ from year to year.


To look more closely at the changes of the input variables over time, we use the bootstrap to calculate the 99% percentile confidence interval for the mean. The results are shown in Appendix B. It can be seen that, except for the ratios P02101 and P02111, the means of the ratios changed significantly over time. For example, for the ratio P02102 the mean of the year 2003 is not located in the confidence intervals of the years 2004 and 2005. We could therefore say that the values of 18 out of 20 input variables changed significantly over time. However, if we transform all the input variables' means into their corresponding logodds according to equation (5.2), it can be found that the logodds of the variables that changed significantly were decreasing over time (see Appendix C). Since the logodds of the input variables have a positive relationship with the probability of default, this means that 18 out of 20 input variables shifted significantly in the direction of decreasing the PD over the period 2003-2005. In the meantime, we should notice that the German economy had improved and kept growing since 2003. Table 5.2 shows the actual German GDP and the growth rate in 2003, 2004 and 2005. The tendency of the input variables is thus in accordance with the development of the German economy. Therefore, we could say that the structure of the portfolio did not change significantly over time. This result is the precondition for the further analysis: if the portfolio structure had changed over time, the scoring model would be compared on different portfolios, and the validation results would deviate considerably and not be reliable.

Table 5.2: Actual GDP of Germany 2003-2005

                                       2003     2004     2005
GDP at market prices (euro billion)    2,162    2,207    2,241
GDP (US$ billion)                      2,444    2,744    2,790
Real GDP growth (%)                    -0.2     1.3      0.9

5.4 Testing for Discriminatory Power

Testing the discriminatory power is one of the main tasks in the validation of a credit scoring model. This analysis assesses the model’s ability to separate good customers from bad customers. As mentioned in Section 4.3, the Receiver Operating Characteristic (ROC) is the most widely used statistical tool for the assessment of discriminatory power. Figure 5.3 gives the ROC curves for the years 2003, 2004 and 2005. In all three years the area under the ROC curve (AUROC) is high, above 0.80, which means that the model discriminated well from 2003 to 2005.
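As an illustration of what the AUROC measures, it can be estimated as the share of (defaulter, non-defaulter) pairs that the score ranks correctly, which is the Mann-Whitney form of the statistic. The sketch below assumes that a higher score means a riskier obligor; the function name and the toy scores are hypothetical and are not taken from the data set of this thesis.

```python
from itertools import product


def auroc(defaulter_scores, non_defaulter_scores):
    """Estimate the AUROC as the share of correctly ranked pairs.

    Assumption: a higher score means a riskier obligor.
    Tied pairs count one half.
    """
    wins = 0.0
    for s_d, s_nd in product(defaulter_scores, non_defaulter_scores):
        if s_d > s_nd:
            wins += 1.0
        elif s_d == s_nd:
            wins += 0.5
    return wins / (len(defaulter_scores) * len(non_defaulter_scores))


# Hypothetical toy scores, not data from the thesis.
toy_defaulters = [0.9, 0.8, 0.4]
toy_non_defaulters = [0.7, 0.3, 0.2, 0.1]

# 11 of the 12 pairs are ranked correctly (only 0.4 versus 0.7 is not).
toy_auroc = auroc(toy_defaulters, toy_non_defaulters)
print(round(toy_auroc, 4))
```

A value of 0.5 corresponds to a score without any discriminatory power, a value of 1 to perfect separation.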

Figure 5.3: ROC Curves for the years 2003, 2004 and 2005

However, did the discriminatory power of the scoring model change significantly over time? In recent years, the DeLong test (1988) has commonly been used to compare ROC curves, but one should note that it is designed for comparing different rating or scoring models on the same data set. Although the test is not applicable to a comparison of the AUROC of the same portfolio in different time periods, it is still very helpful to calculate confidence intervals for the AUROC, and comparing these intervals gives useful information about the tendency of the discriminatory power. Table 5.3 shows the AUROC values and their confidence intervals in 2003, 2004 and 2005, obtained by the method introduced in Section 4.3.2. The confidence intervals of the three years overlap to a large extent. This indicates that the discriminatory power of the scoring model did not change significantly and that the model retained its high discriminatory power.
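The limits reported in Table 5.3 appear to be consistent with a normal approximation of the form AUROC ± z · sqrt(estimated variance) with z ≈ 1.96. The following sketch recomputes the intervals from the reported AUROC values and variances and checks whether they overlap. That Section 4.3.2 uses exactly this form is an assumption made here, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist


def normal_ci(estimate, variance, level=0.95):
    """Two-sided normal-approximation confidence interval (assumed form)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (AUROC, estimated variance) as reported in Table 5.3.
table_5_3 = {
    2003: (0.8197, 0.0002236),
    2004: (0.8225, 0.0002237),
    2005: (0.8063, 0.0004276),
}

intervals = {year: normal_ci(a, v) for year, (a, v) in table_5_3.items()}
for year, (lower, upper) in intervals.items():
    print(year, round(lower, 4), round(upper, 4))

all_overlap = all(
    intervals_overlap(intervals[y1], intervals[y2])
    for y1 in intervals
    for y2 in intervals
)
print(all_overlap)
```

Overlapping intervals are only an informal indication. They do not replace a formal test of the difference between two AUROC values.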

Table 5.3: Confidence Intervals for AUROC

                                      95% Confidence Interval
Year   AUROC    Estimated Variance   Lower Limit   Upper Limit
2003   0.8197   0.0002236            0.7904        0.8490
2004   0.8225   0.0002237            0.7932        0.8518
2005   0.8063   0.0004276            0.7658        0.8469

5.5 Statistical Tests for PD Validation

In practice, the observed default rates will differ from the predicted default probabilities. The key question is whether the difference between the predicted PD and the observed default rate is acceptable. As discussed in Section 4.4, several statistical tests are available for checking the calibration of a rating system. Here we use only the Hosmer-Lemeshow chi-square test and the Spiegelhalter test, because both can assess the PD forecasts of several rating categories simultaneously. The results are given in Table 5.4 and Table 5.5.

Table 5.4: Hosmer-Lemeshow Chi-square Test for PD Validation

Year   Statistic Value   p-value
2003   16.44002          0.2872
2004   34.40864          0.0018
2005    9.12808          0.8227
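A minimal sketch of the Hosmer-Lemeshow statistic in its usual form is given below: for each rating class, the squared difference between expected and observed defaults is divided by the binomial variance, and the sum is compared with a chi-square distribution. The degrees of freedom are not stated in this section. The p-values in Table 5.4 are reproduced with 14 degrees of freedom, which is therefore used here as an assumption. The function names and the toy portfolio are hypothetical, and the closed-form tail probability is valid only for an even number of degrees of freedom.

```python
import math


def hosmer_lemeshow_statistic(counts, defaults, pds):
    """Sum over classes of (expected - observed defaults)^2 / binomial variance."""
    stat = 0.0
    for n_k, d_k, p_k in zip(counts, defaults, pds):
        expected = n_k * p_k
        stat += (expected - d_k) ** 2 / (expected * (1.0 - p_k))
    return stat


def chi2_sf_even(x, df):
    """Upper tail P(X > x) of a chi-square variable; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical toy portfolio with two rating classes, not thesis data.
toy_stat = hosmer_lemeshow_statistic(
    counts=[100, 200], defaults=[13, 4], pds=[0.10, 0.02]
)

# Statistics reported in Table 5.4; df = 14 is an assumption (see above).
reported = {2003: 16.44002, 2004: 34.40864, 2005: 9.12808}
p_values = {year: chi2_sf_even(stat, 14) for year, stat in reported.items()}
for year, p in p_values.items():
    print(year, round(p, 4))
```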

Table 5.5: Spiegelhalter Test for PD Validation

Year   Statistic Value   p-value
2003    1.30508          0.1919
2004    2.77888          0.0055
2005   -0.16138          0.8718
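The Spiegelhalter test is commonly written in terms of the mean squared error between the default indicators and the forecast PDs, standardised by its expectation and variance under the hypothesis that the PDs are correct. The sketch below uses this textbook form. Whether Section 4.4 uses exactly the same notation is an assumption, and the function names and toy data are hypothetical. The p-values in Table 5.5 correspond to a two-sided test against the standard normal distribution.

```python
import math
from statistics import NormalDist


def spiegelhalter_z(pds, outcomes):
    """Standardised mean squared error of PD forecasts (outcomes are 0 or 1)."""
    n = len(pds)
    mse = sum((y - p) ** 2 for p, y in zip(pds, outcomes)) / n
    expected = sum(p * (1.0 - p) for p in pds) / n
    variance = sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p) for p in pds) / n ** 2
    return (mse - expected) / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical toy portfolio: five obligors with PD 0.2 and one default,
# so the observed default rate equals the forecast and z is zero.
toy_z = spiegelhalter_z([0.2] * 5, [1, 0, 0, 0, 0])

# Statistics reported in Table 5.5.
reported = {2003: 1.30508, 2004: 2.77888, 2005: -0.16138}
p_values = {year: two_sided_p(z) for year, z in reported.items()}
for year, p in p_values.items():
    print(year, round(p, 4))
```

Note that the variance term vanishes when all PDs equal 0.5, so the statistic is undefined in that degenerate case.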

The results of the Hosmer-Lemeshow chi-square test and the Spiegelhalter test are very similar. According to the p-values, the estimated PDs are close to the observed default rates for the years 2003 and 2005, whereas for the year 2004 both tests reject the calibration at the 5% level (p-value < 0.05). Because there are only a few defaulters in our portfolio, we could say that the result is still acceptable and that the model still works well.

Chapter 6 Discussion & Conclusion

When the objective of the validation of a credit scoring model is to confirm that the developed scoring model is still valid for the current applicant population, one should first check whether the portfolio structure has changed over time. In cases where the development sample is two or three years old, significant shifts in the portfolio structure might have occurred. The scoring model is then evaluated on different portfolios, and the validation result becomes less meaningful.

With regard to assessing the discriminatory power of a credit scoring model, the Receiver Operating Characteristic curve and the area under the curve are the most commonly used tools. Although the DeLong test is not applicable to comparing the AUROC of the same portfolio in different time periods, we can still use it to calculate a confidence interval for the AUROC.

With regard to the calibration of the rating or scoring system, the Hosmer-Lemeshow and Spiegelhalter tests are available and easy to use, because they can test the adequacy of the PD estimates of several rating classes at the same time. However, their appropriateness strongly depends on the assumption that default events are independent. Furthermore, one should note that when the number of defaulters is small, the statistical validation is not always reliable.

Appendix A Ratio Analysis

The following histograms and boxplots display the distribution of each variable in the three years. In the histograms, the red lines are the fitted normal curves and the blue curves are the estimated kernel densities. In the boxplots, the vertical lines (whiskers) are drawn from the box to the most extreme point within 1.5 interquartile ranges. (The interquartile range is the distance between the 25th and the 75th sample percentiles.) Any value more extreme than this is marked with a red square symbol.
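The whisker rule described above can be sketched as follows. The quartile convention of the plotting software used for the figures is not known, so the default method of Python's statistics.quantiles is used here as an assumption. The function name and the toy values are hypothetical.

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Whisker end points and outliers according to the 1.5 * IQR rule."""
    data = sorted(values)
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {"whiskers": (inside[0], inside[-1]), "outliers": outliers}


# Hypothetical toy values with one extreme observation, not thesis data.
toy_ratio = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = boxplot_summary(toy_ratio)
print(summary)
```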

Figure A.1: Histogram and Boxplot of ratio P02101

Figure A.2: Histogram and Boxplot of ratio P02102

Figure A.3: Histogram and Boxplot of ratio P02103

Figure A.4: Histogram and Boxplot of ratio P02104

Figure A.5: Histogram and Boxplot of ratio P02105

Figure A.6: Histogram and Boxplot of ratio P02106

Figure A.7: Histogram and Boxplot of ratio P02107

Figure A.8: Histogram and Boxplot of ratio P02108

Figure A.9: Histogram and Boxplot of ratio P02109

Figure A.10: Histogram and Boxplot of ratio P02110

Figure A.11: Histogram and Boxplot of ratio P02111

Figure A.12: Histogram and Boxplot of ratio P02112

Figure A.13: Histogram and Boxplot of ratio P02113

Figure A.14: Histogram and Boxplot of ratio P02114

Figure A.15: Histogram and Boxplot of ratio P02115

Figure A.16: Histogram and Boxplot of ratio P02116

Figure A.17: Histogram and Boxplot of ratio P02117

Figure A.18: Histogram and Boxplot of ratio P02118

Figure A.19: Histogram and Boxplot of ratio P02119

Figure A.20: Histogram and Boxplot of ratio P02120

Appendix B Results of Bootstrapping

Table B.1: 99% percentile confidence interval for the mean

Ratio    Year   Mean        99% confidence interval
P02101   2003   3.939       [3.621, 4.257]
P02101   2004   4.234       [3.919, 4.543]
P02101   2005   4.170       [3.828, 4.489]
P02102   2003   2.407       [2.192, 2.627]
P02102   2004   2.820       [2.590, 3.040]
P02102   2005   3.852       [3.619, 4.090]
P02103   2003   6.175       [5.810, 6.546]
P02103   2004   6.661       [6.265, 7.067]
P02103   2005   7.437       [6.917, 7.793]
P02104   2003   42.090      [41.325, 42.860]
P02104   2004   41.497      [40.694, 42.302]
P02104   2005   40.056      [39.236, 40.891]
P02105   2003   39.929      [38.949, 40.870]
P02105   2004   38.534      [37.530, 39.580]
P02105   2005   36.971      [35.962, 37.996]
P02106   2003   149.532     [146.766, 152.217]
P02106   2004   154.397     [151.561, 157.114]
P02106   2005   158.139     [155.199, 161.265]
P02107   2003   74362.693   [72945.043, 75728.218]
P02107   2004   77746.443   [76245.117, 79287.240]
P02107   2005   82804.056   [81166.424, 84484.877]
P02108   2003   34825.074   [34381.419, 35257.021]
P02108   2004   36135.552   [35658.961, 36598.028]
P02108   2005   37436.825   [36941.114, 37947.784]
P02109   2003   20.327      [19.532, 21.121]
P02109   2004   21.872      [21.031, 22.767]
P02109   2005   24.158      [23.158, 25.075]
P02110   2003   0.180       [0.176, 0.185]
P02110   2004   0.189       [0.184, 0.194]
P02110   2005   0.199       [0.194, 0.205]
P02111   2003   0.288       [0.279, 0.298]
P02111   2004   0.285       [0.276, 0.295]
P02111   2005   0.285       [0.275, 0.295]
P02112   2003   25.549      [24.901, 26.183]
P02112   2004   24.302      [23.677, 24.969]
P02112   2005   23.510      [22.866, 24.201]
P02113   2003   0.210       [0.206, 0.214]
P02113   2004   0.202       [0.197, 0.206]
P02113   2005   0.191       [0.186, 0.195]
P02114   2003   0.438       [0.432, 0.444]
P02114   2004   0.427       [0.421, 0.433]
P02114   2005   0.417       [0.410, 0.423]
P02115   2003   0.523       [0.516, 0.530]
P02115   2004   0.503       [0.496, 0.510]
P02115   2005   0.493       [0.486, 0.501]
P02116   2003   0.388       [0.376, 0.400]
P02116   2004   0.417       [0.405, 0.430]
P02116   2005   0.459       [0.446, 0.472]
P02117   2003   9.220       [8.923, 9.525]
P02117   2004   10.004      [9.683, 10.326]
P02117   2005   11.227      [10.866, 11.593]
P02118   2003   2.430       [2.336, 2.523]
P02118   2004   2.291       [2.196, 2.385]
P02118   2005   2.221       [2.131, 2.320]
P02119   2003   0.111       [0.106, 0.115]
P02119   2004   0.113       [0.108, 0.118]
P02119   2005   0.120       [0.116, 0.126]
P02120   2003   6.035       [5.800, 6.273]
P02120   2004   5.576       [5.337, 5.816]
P02120   2005   4.909       [4.667, 5.154]
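Percentile intervals such as those in Table B.1 could be obtained along the following lines: the sample is resampled with replacement, the mean of each resample is recorded, and the empirical 0.5% and 99.5% quantiles of these means form the interval. This is a minimal sketch. The number of resamples, the seed, the function names and the toy sample are assumptions and do not come from the thesis. The last lines apply the comparison rule used in Section 5.3 to values taken from Table B.1.

```python
import random
import statistics


def bootstrap_percentile_ci(sample, level=0.99, n_resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - level
    lower = means[round(alpha / 2.0 * n_resamples)]
    upper = means[round((1.0 - alpha / 2.0) * n_resamples) - 1]
    return lower, upper


def mean_outside(mean, interval):
    """True if a mean lies outside the interval of another year."""
    return mean < interval[0] or mean > interval[1]


# Hypothetical toy sample, not thesis data.
data_rng = random.Random(42)
toy_sample = [data_rng.gauss(4.0, 2.0) for _ in range(500)]
toy_ci = bootstrap_percentile_ci(toy_sample)
print(toy_ci)

# Values from Table B.1: the 2003 mean of P02102 lies outside the 2004
# interval, whereas the 2003 mean of P02101 lies inside the 2004 interval.
p02102_changed = mean_outside(2.407, (2.590, 3.040))
p02101_changed = mean_outside(3.939, (3.919, 4.543))
print(p02102_changed, p02101_changed)
```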

Appendix C Transformation of the mean of input ratios

Table C.1: Log transformation of the mean of input ratios

Ratio    Year   Mean        logodd(mean)
P02101   2003   3.939       -4.695
P02101   2004   4.234       -4.705
P02101   2005   4.170       -4.703
P02102   2003   2.407       -4.759
P02102   2004   2.820       -4.805
P02102   2005   3.852       -4.922
P02103   2003   6.175       -4.775
P02103   2004   6.661       -4.801
P02103   2005   7.437       -4.839
P02104   2003   42.090      -4.507
P02104   2004   41.497      -4.531
P02104   2005   40.056      -4.526
P02105   2003   39.929      -4.913
P02105   2004   38.534      -4.927
P02105   2005   36.971      -4.947
P02106   2003   149.532     -4.830
P02106   2004   154.397     -4.867
P02106   2005   158.139     -4.897
P02107   2003   74362.693   -4.974
P02107   2004   77746.443   -5.014
P02107   2005   82804.056   -5.069
P02108   2003   34825.074   -4.797
P02108   2004   36135.552   -4.833
P02108   2005   37436.825   -4.868
P02109   2003   20.327      -4.893
P02109   2004   21.872      -4.938
P02109   2005   24.158      -5.007
P02110   2003   0.180       -4.767
P02110   2004   0.189       -4.785
P02110   2005   0.199       -4.809
P02111   2003   0.288       -4.519
P02111   2004   0.285       -4.524
P02111   2005   0.285       -4.524
P02112   2003   25.549      -4.884
P02112   2004   24.302      -4.918
P02112   2005   23.510      -4.940
P02113   2003   0.210       -4.481
P02113   2004   0.202       -4.535
P02113   2005   0.191       -4.602
P02114   2003   0.438       -5.004
P02114   2004   0.427       -5.025
P02114   2005   0.417       -5.044
P02115   2003   0.523       -4.875
P02115   2004   0.503       -4.908
P02115   2005   0.493       -4.925
P02116   2003   0.388       -5.078
P02116   2004   0.417       -5.142
P02116   2005   0.459       -5.232
P02117   2003   9.220       -5.089
P02117   2004   10.004      -5.171
P02117   2005   11.227      -5.290
P02118   2003   2.430       -4.356
P02118   2004   2.291       -4.398
P02118   2005   2.221       -4.420
P02119   2003   0.111       -4.968
P02119   2004   0.113       -5.001
P02119   2005   0.120       -5.020
P02120   2003   6.035       -4.396
P02120   2004   5.576       -4.423
P02120   2005   4.909       -4.476

Bibliography

[1] Bert van Es and Hein Putter. Lecture Notes: The Bootstrap. 2005.
[2] Stefan Blochwitz and Stefan Hohl. XI. Validation of Banks’ Internal Rating Systems - A Supervisory Perspective. In The Basel II Risk Parameters, pages 243–262. Springer, 2006.
[3] Swiss Federal Banking Commission. Basel II Implementation in Switzerland: Summary of the explanatory report of the Swiss Federal Banking Commission, September 2005.
[4] David W. Hosmer and Stanley Lemeshow. Applied Logistic Regression. John Wiley & Sons, Inc., second edition, 2000.
[5] A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application. Cambridge University Press, 1997.
[6] Arnaud de Servigny and Olivier Renault. Measuring and Managing Credit Risk. The McGraw-Hill Companies, Inc., 2004.
[7] Elizabeth R. DeLong, David M. DeLong, and Daniel L. Clarke-Pearson. Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics, 1988.
[8] Bernd Engelmann. XII. Measures of a Rating’s Discriminative Power - Applications and Limitations. In The Basel II Risk Parameters, pages 263–287. Springer, 2006.
[9] Nout Wellink, President of the Netherlands Bank and Chairman of the Basel Committee on Banking Supervision. Goals and Strategies of the Basel Committee. At the High Level Meeting on the Implementation of Basel II in Asia and Other Regional Supervisory Priorities, Hong Kong, 11 December 2006.
[10] Basel Committee on Banking Supervision (BCBS). The New Basel Capital Accord. Consultative document, Bank for International Settlements, Basel, Switzerland, 2003.
[11] Basel Committee on Banking Supervision (BCBS). International Convergence of Capital Measurement and Capital Standards. A revised framework, Bank for International Settlements, Basel, Switzerland, 2004.
[12] Basel Committee on Banking Supervision (BCBS). An Explanatory Note on the Basel II IRB Risk Weight Functions. Technical report, Bank for International Settlements, Basel, Switzerland, 2005.
[13] Basel Committee on Banking Supervision (BCBS). Studies on the Validation of Internal Rating Systems. Working paper no. 14, Bank for International Settlements, Basel, Switzerland, 2005.
[14] Basel Committee on Banking Supervision (BCBS). Update on work of the Accord Implementation Group related to validation under the Basel II Framework. Consultative document, Bank for International Settlements, Basel, Switzerland, 2005. http://www.bis.org/publ/bcbs n14.htm.
[15] Naeem Siddiqi. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons, Inc., 2006.
[16] Stefan Blochwitz, Marcus R. W. Martin, and Carsten S. Wehn. XIII. Statistical Approaches to PD Validation. In The Basel II Risk Parameters, pages 289–305. Springer, 2006.