
DEGREE PROJECT, IN MACHINE LEARNING, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Two-Stage Logistic Regression Models for Improved Credit Scoring

ANTON LUND

KTH ROYAL INSTITUTE OF TECHNOLOGY

COMPUTER SCIENCE AND COMMUNICATION (CSC)


Two-Stage Logistic Regression Models for Improved Credit Scoring

February 24, 2015

ANTON LUND
[email protected]

Master's Thesis in Computer Science
School of Computer Science and Communication

Royal Institute of Technology, Stockholm

Supervisor at KTH: Olov Engwall
Examiner: Olle Bälter

Project commissioned by: Klarna AB
Supervisor at the company: Hans Hjelm


Abstract

This thesis has investigated two-stage regularized logistic regressions applied to the credit scoring problem. Credit scoring refers to the practice of estimating the probability that a customer will default if given credit. The data was supplied by Klarna AB and contains a larger number of observations than many other research papers on credit scoring. In this thesis, a two-stage regression refers to two staged regressions where some kind of information from the first regression is used in the second regression to improve the overall performance. In the best performing models, the first stage was trained on alternative labels: payment status at earlier dates than the conventional one. The predictions were then used as input to, or to segment, the second stage. This gave a Gini increase of approximately 0.01. Using conventional score cutoffs or distance to a decision boundary to segment the population did not improve performance.


Referat

This thesis has investigated two-stage regularized logistic regressions for estimating the credit score of consumers. The credit score is a measure of creditworthiness and measures the probability that a person will not pay back his or her credit. The data comes from Klarna AB and contains more observations than much other research on creditworthiness. In this thesis, a two-stage regression means a regression model consisting of two stages, where information from the first stage is used in the second stage to improve the overall performance. The best performing models use, in the first stage, an alternative response variable, payment status at an earlier point in time than the conventional one, to segment the population or as a variable in the second stage. This gave a Gini increase of approximately 0.01. The use of simpler segmentation methods, such as score cutoffs or distance to a decision boundary, did not improve performance.


Contents

1 Introduction
  1.1 Background
  1.2 Thesis Objective
  1.3 Ethical Concerns
  1.4 Delimitations
  1.5 Choice of Methodology

2 Theory Review
  2.1 Machine Learning
    2.1.1 Sampling Bias
    2.1.2 Overfitting
  2.2 Training, Validation & Testing
    2.2.1 Validation
    2.2.2 Testing
  2.3 Logistic Regressions
    2.3.1 Basics
    2.3.2 Regularized Logistic Regressions
    2.3.3 Estimation

3 Related Works
  3.1 Segmentation Models
  3.2 Ensemble Models
  3.3 Two-stage Models

4 Method
  4.1 Synthesis
  4.2 Methodology
    4.2.1 Proposed Model Classes
  4.3 Practical Details
  4.4 Data

5 Results
  5.1 Choosing the Mixing Parameter, α
  5.2 First-stage Model
  5.3 Finding Candidate Models Using the Validation Sample
  5.4 Evaluating Candidate Models Using the Test Sample

6 Discussion

7 Conclusions
  7.1 Future Research

Bibliography

Appendix


Chapter 1

Introduction

This thesis aims to investigate two-stage logistic regression models for use in retail underwriting. Underwriting refers to the process of assessing a customer's eligibility to receive credit. Underwriting models are used in a wide array of industries such as banks, credit card companies, mobile phone companies and insurance companies. When this process has been automated, it is usually instead referred to as credit scoring. Credit scoring models use information such as income data and payment history to predict the probability that the customer will default if given credit.

The degree project is carried out at Klarna, a company based in Stockholm, Sweden that provides payment services to online merchants. One of the main components of this service is constructing and applying credit scoring models for retail customers. The accuracy of the credit scoring models is a key driver of Klarna's profitability, but the models also need to fulfil various obligations.

Some of the current credit scoring models at Klarna involve logistic regressions, a type of probabilistic classification model used to predict a binary response. Credit scoring models can often classify, with high accuracy, customers that are clearly likely or unlikely to pay on time, but may not do as well in more grey areas. Klarna therefore wants to investigate whether the accuracy can be improved by implementing and evaluating two-stage logistic regression models. A two-stage model should in this context be seen as a model where the first-stage model is trained using the available data and the second-stage model uses information from the first-stage model in some way to increase performance. The problem can be described by the research question:

Do two-stage logistic regression models, while retaining simplicity, improve the performance of credit scoring models when compared to the conventional logistic regression?

This paper starts with chapter 1, which gives an introduction to credit scoring and a motivation for the research question in this thesis, along with a quick discussion of ethical concerns, the delimitations and the choice of methodology.

The remainder of the thesis is outlined as follows. In chapter 2, the basics of machine learning will be covered and some important concepts for training machine learning models will be described. This will be followed by chapter 3, a summary of previous works on topics related to two-stage credit scoring models. Next, chapter 4 will cover the specifics of the method and some practical details with regards to the implementation. Chapter 5 will present the results and will be followed by chapter 6, which will discuss the results and put them in a general context. Lastly, chapter 7 will give a brief summary of the thesis and the conclusions.

1.1 Background

The applicant in a credit scoring process can for example be a consumer applying for a credit card or a mortgage, but also an organization, e.g. a company trying to secure a loan for its business. An important difference between scoring consumers and organizations is the available data (Thomas 2010). This paper will focus on credit scoring of consumers and examples related to credit for retail customers.

The credit score is often designed to predict the probability that the credit applicant will default, but can also be related to more vague concepts such as the probability that the credit applicant will be profitable (Hand and Henley 1997). In practice, this is usually turned into a problem of classifying a customer as either a "good" or a "bad". The terms "goods" and "bads" will be used throughout the paper to distinguish between customers that have good and bad outcomes. How this is defined differs from company to company but, simply put, a bad transaction is one where the customer has not fulfilled his or her payment obligations some duration of time after the credit due date.

Credit scoring is, in a generalized perspective, an application of a classification problem. In a credit scoring model, each credit applicant is attributed a score based on available data. A good model should assign a high score to a credit applicant that is unlikely to default (or the equivalent positive outcome) (Mester 1997). The industry standard in credit scoring is for the credit score to be a logarithmic transformation of the probability of default (PD), so that every 20 point decrease in score doubles the odds of defaulting. The precise definition varies from company to company. At Klarna the credit score is defined as follows.

\text{Score} = -\log\left(\frac{\text{PD}}{1-\text{PD}}\right) \cdot \frac{20}{\log 2} + 600 - \log(50) \cdot \frac{20}{\log 2} \qquad (1.1)

As an example, PDs of 1% and 99% roughly correspond to scores of 620 and 355, respectively. Figure 1.1 shows the relationship between probability of default and credit score. An important characteristic of the score is the increased granularity as the probability approaches 0 and 1. For the reasoning behind the form of the relationship between the PD and the score, see section 2.3.1.
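For illustration, equation 1.1 can be expressed as a small R function (a minimal sketch; the function name is ours and natural logarithms are assumed, consistent with the example above):

  # Sketch of equation 1.1 (hypothetical helper; natural logarithms assumed)
  pd_to_score <- function(pd) {
    -log(pd / (1 - pd)) * 20 / log(2) + 600 - log(50) * 20 / log(2)
  }
  pd_to_score(0.01)  # ~620
  pd_to_score(0.99)  # ~355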

There is a wide range of methods that can be used to create credit scoring models. Some examples are (Lessmann, Seow, and Baesens 2013):

1. Statistical models such as linear probability models and discriminant analysis.
2. Machine learning models such as neural networks, support vector machines and decision trees.
3. Methods derived from financial theory such as option pricing models.
4. Combinations of different methods using for example ensemble learning methods.


Figure 1.1. Plot of the relationship between probability of default and credit score.

Credit scoring is a highly active field and, as in most fields of applied classification, there is a tradition of searching broadly when testing out new models and methods.

A problem with many of the academic papers is that they often use data sets that, in comparison to the data owned by Klarna, are very small. For example, one source of data commonly used in machine learning is the UCI Machine Learning Repository. It holds 4 data sets with credit application data, all of 1000 observations or less.¹ Klarna will usually have data sets on the order of 100,000 observations when training their models, as well as more variables. A higher quality data set should improve prediction accuracy and increase the potential of models that train on subsets of the data.

Though dependent on the specific application, there are generally three sources of data available when assessing credit eligibility (Thomas 2000):

1. Application data - e.g. type of purchase, requested credit.
2. External data - e.g. payment history from credit agencies or address lookup from an address bureau.
3. Internal data - data from previously accepted and rejected applications.

The application data is collected at each transaction. External data needs to be purchased, and a company might develop models to predict when buying external data for a specific customer is profitable. Internal data exists for customers that have used the company's service earlier and increases in size over time. Internal data might therefore not be available for all customers and might lose its relevance over time as the population changes.

¹ See the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets.html for more information.


1.2 Thesis Objective

There is an extensive amount of literature on credit scoring and an abundance of papers that explain methods for improving the accuracy of credit scoring models. Many of these papers describe models that are intuitively hard to interpret, such as artificial neural networks (ANN) or complicated composite methods using ensemble learning. Despite this fact, the model of choice for many companies developing scoring models is the relatively simple logistic regression.

One possible reason for this is the sampling bias inherent in the credit scoring problem (Greene 1998). As the data has usually passed through some earlier version of a model, data for training and evaluation is restricted to data from previously accepted customers. It is not ex-ante clear how a model trained on data with this bias will perform on unfiltered data. This implies that there is a need to interpret the model and assess whether it is reasonable or not. Another argument for interpretive simplicity comes from the need to explain the reasoning behind accept/reject decisions to different entities within and outside the company. In some countries, it is required by law to be able to explain to a customer the specific reason or reasons he or she was rejected credit.

For these reasons, it is clear that the credit scoring industry faces constraints that make complicated classification algorithms and ensemble learning models unsuitable. Logistic regression models are, on the other hand, easy to understand and interpret. Their simplicity also makes them easy to implement and, if needed, modify post-training.

This comes with a caveat. Logistic regressions are linear models. It is therefore not as simple to produce a more complex decision surface as it is for, for example, an artificial neural network model. If the underlying data is not perfectly linear, simply training a logistic regression on the data cannot be expected to give the best possible prediction.

There is therefore a need to explore how credit scoring models can be improved, while retaining simplicity, within the context of logistic regressions.

This paper will investigate whether the performance of logistic regressions can be improved by introducing one or several logistic regressions in a second step to form a two-stage model. The second step will in some way take into consideration information from the first model to create a full scoring model that performs better relative to a simple logistic regression. A short list of different methods will be implemented and compared to a single-stage model to find the best performing model.

1.3 Ethical Concerns

It has been argued that a weakness of statistical credit scoring models is their lack of an explanatory model (Capon 1982). This makes the specific reasoning behind why a customer was rejected or accepted opaque. From this it is arguable that there is a certain arbitrariness to the whole system. On the other hand, automated scoring models could remove prejudice that might influence human decisions.


Others have also argued that some customer characteristics might act as proxies for characteristics that, if used for differentiating between customers, would constitute illegal discrimination. Examples of this would be that differentiating on postal codes could be a proxy for ethnicity, while distinguishing between certain types of incomes (e.g. full-time or part-time) could discriminate on gender (Capon 1982). In general this is a problem of statistical models, which by nature will tend to predict the behavior of an individual from the behavior of the group he or she belongs to.

There is an additional problem of credit giving that is not isolated to the credit scoring practice: false positives and false negatives. It is not realistic to assume that a model will accurately predict all credit applications. A fraction of people that should not receive credit will be accepted, and some who should, will be rejected. This is however a problem that can be alleviated with more accurate credit scoring models and is arguably an aspect where credit scoring models perform better than non-statistical underwriting models.

The results from this thesis are not likely to have much effect on the ethical aspects of credit scoring, except for the chance that they might increase the accuracy of predictions. There is perhaps a need to further discuss the ethics of credit scoring as its scope and prevalence increase, but that discussion deserves its own paper.

1.4 Delimitations

There is a large number of models that can be applied to the credit scoring problem. This thesis can be seen as an in-depth study and has therefore only covered the logistic regression. Therefore, for all implemented models, both stages consist of logistic regression models.

There have also been a few restrictions on the data. Klarna uses different models for different geographic regions. This paper has only tested the implementation on one of these regions. A reason for this is the lack of readily available data. Additionally, the data used is from a specific time period. This, together with the geographic restriction, needs to be taken into account when attempting to generalize the results from this paper.

The Related Works chapter mentions a number of machine learning algorithms. It is outside the scope of this paper to cover them all in detail. For more information, please consult a machine learning textbook, e.g. Hastie et al. (2009), or some other introductory material.

1.5 Choice of Methodology

The method implementation can be divided into five separate tasks:

1. Acquire a data set and split it into training, validation and test sets.
2. Identify methods for constructing the two-stage models.
3. Train the models using the training set.
4. Evaluate results on the validation set to select a number of candidate models.
5. Evaluate results of the candidate models on the test set.

The data set used in the implementation will be delivered by Klarna. To minimize the time spent on assembling and cleaning the data, a data set that was previously used to train models at Klarna will be used. Previous research, as well as discussions with Klarna experts, will form the basis for selecting methods for constructing the two-stage models. The number of actually implemented models will depend on the time available.

The models will be trained using R, a free software programming language for statistical computing.² To make sure that the training is automated and reproducible, a package including all the necessary functions and code will be developed. The aim of this package is to have the ability to input a data set and receive the trained model along with relevant statistics and graphics.

² See http://www.r-project.org/ for more information.


Chapter 2

Theory Review

The theory review will start by giving an introduction to the concept and practice of machine learning. It will focus on practical concepts like overfitting, sampling bias, training and validation. The next section will define the logistic regression, give an introduction to regularization and finally describe the estimation method used in this paper. Lastly, there will be a section on evaluating credit models that describes some of the common metrics.

2.1 Machine Learning

Machine learning is a subset of computer science that studies systems that can automatically learn from data (Domingos 2012). The field is closely linked to statistics, optimization and artificial intelligence. The most widely used application of machine learning is classification (Domingos 2012).

The typical setup for a classifier is a vector of p attributes x = (x_1, x_2, ..., x_p) and a set of classes y = (y_1, y_2, ..., y_d). The x vector is known for n examples and each example is assigned to one of the classes. The classification problem can then be described as the problem of using historical data to identify to which class a new data point, x_i, belongs.

An example of this could be an algorithm that recognizes faces in images. A data point, x_i, could then consist of the individual pixels in an image, and the y-vector would be a vector with a binary response depending on whether a face is actually in the image or not.

2.1.1 Sampling Bias

A problematic issue when training credit scoring models is sampling bias. Sampling bias arises when the sample the model was trained on does not represent the population from which data points are drawn for prediction. Most credit scoring models require some type of label. Labeled data is typically collected by giving customers credit and waiting to see if they default or not. It will therefore only be possible to train the model on transactions that were previously accepted (Thomas 2000). Greene (1998) and Banasik, Crook, and Thomas (2003) find that sampling bias may decrease the accuracy of credit scoring models.

This phenomenon can lead to some unexpected results. An example could be a historical model with a highly predictive variable, e.g. the amount of debt. That model would penalize transactions with high debt and only let high-debt transactions through when the other variables give very strong positive predictions.


If, at some later time, a new model is trained on the data that was filtered by the historical model, high-debt transactions are likely to have a much lower default rate than in the data for the first model. This could give the debt variable a positive effect in the new model, something that is not likely to reflect the nature of future, unfiltered, requests.

There is unfortunately no patent solution to this problem. A few methods have been suggested to alleviate it, for example reject inference (e.g. Crook and Banasik (2004)) and virgin sampling, i.e. letting through a fraction of requests regardless of score, in order to have a small but representative sample. Having the ability to spot these problems is a contributing reason for prioritizing interpretive simplicity in credit scoring models.

2.1.2 Overfitting

Overfitting refers to the problem of a fitted model describing the random error in the sample rather than the underlying characteristics. An overfitted model is not suitable for prediction as it fails to generalize to data other than the training set (Hawkins 2004). One cause of overfitting is using too complex a model, e.g. using a quadratic model to predict a linear trend. Two common ways to alleviate overfitting are cross-validation and methods that reduce complexity and induce parsimony, such as regularization and feature selection (Hastie et al. 2009).

Additionally, adhering to the principle of parsimony is also important from a more practical perspective. Decreasing the number of variables means that fewer resources have to be spent on developing and maintaining variables, and decreases the risk of various errors in the database (Hawkins 2004).

2.2 Training, Validation & Testing

To make sure that the trained models are generalizable, it is important to test the models on different data than what was used for the training. This is usually done in steps with two separate goals (Hastie et al. 2009):

1. Validation (Model Selection) - comparing the performance of different models to find the best one.

2. Testing (Model Assessment) - given that the best fitting model has been found, estimating its generalization error on new data.

The data is therefore typically divided into three sets: a training set, a validation set and a test set. The training set is used to estimate the models, the validation set for model selection and, lastly, the test set to evaluate the generalizability (Hastie et al. 2009). It is important that the training, validation and testing phases use different data to prevent an underestimation of the true error. A typical split could be 1/2 for training, 1/4 for validation and the last 1/4 for testing (Hastie et al. 2009).
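For illustration, such a split can be drawn in R (a minimal sketch, assuming a data frame named data; all names are ours):

  # Illustrative 1/2 - 1/4 - 1/4 split of a data frame (names assumed)
  set.seed(1)
  n   <- nrow(data)
  idx <- sample(n)  # random permutation of row indices
  train <- data[idx[1:floor(n / 2)], ]
  valid <- data[idx[(floor(n / 2) + 1):floor(3 * n / 4)], ]
  test  <- data[idx[(floor(3 * n / 4) + 1):n], ]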


2.2.1 Validation

A common problem in credit scoring is that, while the total number of data points might be sufficiently large, only a small fraction of those represent customers who defaulted or were designated as "bad" in some other way. A method used to alleviate this problem is cross-validation, one of the most common implementations being K-fold cross-validation. Cross-validation involves separating the training set into K folds. For each of the K folds, a model is trained on the other folds and validated on the k-th fold. A typical value of K could be 10. The prediction error for the k-th fold is defined as

E_k(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N} L\left(y_i,\, \hat{f}^{-k(i)}(x_i)\right), \qquad (2.1)

where \hat{f}^{-k(i)} refers to the model trained on the sample that is not included in the k-th fold, and L is a function that measures the error depending on the estimate and the true value for observation i (Hastie et al. 2009).

The cross-validation estimate of the prediction error is then (Hastie et al. 2009)

CV(\hat{f}) = \frac{1}{K}\sum_{k=1}^{K} E_k(\hat{f}) \qquad (2.2)
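The procedure can be sketched in R (a minimal sketch, assuming a data frame data with a binary outcome column bad; the function name and the use of base-R glm are ours, not taken from the thesis):

  # Schematic K-fold cross-validation (eqs. 2.1-2.2) for a logistic regression
  k_fold_cv <- function(data, K = 10, loss = function(y, p) (y - p)^2) {
    folds  <- sample(rep(1:K, length.out = nrow(data)))  # assign rows to folds
    errors <- numeric(K)
    for (k in 1:K) {
      fit <- glm(bad ~ ., data = data[folds != k, ], family = binomial)
      p   <- predict(fit, newdata = data[folds == k, ], type = "response")
      errors[k] <- mean(loss(data$bad[folds == k], p))   # E_k(f), eq. 2.1
    }
    mean(errors)                                         # CV(f), eq. 2.2
  }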

2.2.2 Testing

It is in most cases not possible to create a model that can perfectly separate the classes. This implies that there will, for every score threshold, be some fraction of data points that become misclassified. This also implies a need to compare different models in order to pick the model that has the best performance. It is not entirely clear how to define the performance of a scorecard. A guiding principle is to measure how well the scoring model can distinguish between goods and bads. This section will explain the measures used in this paper.

ROC curve, AUC and Gini

A commonly used method of evaluation is the AUC measure and a linear transformation of the AUC, the Gini coefficient (Hand and Anagnostopoulos 2013; Krzanowski and Hand 2009). The AUC (Area Under the Curve) is derived from the ROC (Receiver Operating Characteristic) curve, which depicts the true positive rate against the false positive rate depending on some threshold (Davis and Goadrich 2006). The true positive rate (TPR) is defined as the ratio of true positives over all positives (TP/P) and the false positive rate (FPR) as the ratio of false positives over all negatives (FP/N). An example could be a data set of 1000 observations with a bad rate of 10%, i.e. P = 100 and N = 900.

In the credit scoring problem, a bad is defined as a positive response and vice versa. If, with a certain model, we manage to correctly classify 80 bads (TP = 80) and mistakenly classify 100 goods as bads (FP = 100), we get a true positive rate of TPR = 0.8 and a false positive rate of FPR ≈ 0.11.


The threshold in credit scoring can be a score cutoff, and the ROC curve is then created by calculating the TPR and FPR for a sufficient number of score cutoffs. If the cutoff is set so that no applications are accepted, then TPR = 1 and FPR = 1. If instead all applications are accepted, then TPR = 0 and FPR = 0. The area under the ROC curve can be interpreted as the probability of a randomly selected positive data point being ranked higher than a randomly selected negative data point. The Gini coefficient is a linear transformation of the AUC defined as

\text{Gini} = 2\,\text{AUC} - 1 \qquad (2.3)

The Gini can take on values in the range [-1, 1], where a value of 0 represents the random baseline. A higher AUC or Gini implies an overall higher discriminatory power of the model. Figure 2.1 shows some schematic ROC curves along with the resulting Gini.

Figure 2.1. Schematic plot of the relationship between the ROC curve and Gini for different classifiers (Gini = 0, 0.75, 0.90, 0.99).
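For illustration, the AUC can be computed from ranks (the Mann-Whitney formulation) and the Gini obtained via equation 2.3; a minimal sketch, where score is the model output oriented so that higher values indicate a likely bad, and bad is the 0/1 label (both names are ours):

  # Rank-based AUC and Gini (eq. 2.3); score oriented so higher = more likely bad
  auc_gini <- function(score, bad) {
    r     <- rank(score)
    n_pos <- sum(bad == 1)
    n_neg <- sum(bad == 0)
    auc   <- (sum(r[bad == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    c(AUC = auc, Gini = 2 * auc - 1)
  }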

Kolmogorov-Smirnov statistic

Another evaluation measure is the Kolmogorov-Smirnov (KS) statistic. It measures the distance between the distributions of the goods and the bads (Thomas, Edelman, and Crook 2002). Let P_G(s) and P_B(s) be the cumulative distribution functions of goods and bads. The KS statistic is then defined as

KS = \max_s \left| P_G(s) - P_B(s) \right| \qquad (2.4)
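A minimal sketch of the computation, using the empirical distribution functions (function and argument names are ours):

  # KS statistic (eq. 2.4): largest gap between the empirical cumulative
  # score distributions of goods (bad == 0) and bads (bad == 1)
  ks_statistic <- function(score, bad) {
    grid <- sort(unique(score))
    P_G  <- ecdf(score[bad == 0])(grid)
    P_B  <- ecdf(score[bad == 1])(grid)
    max(abs(P_G - P_B))
  }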


When calculating the KS statistic, P_G(s) and P_B(s) can be plotted. Figure 2.2 shows a schematic plot of the KS statistic for two fictive cumulative distributions of goods and bads.

Figure 2.2. Schematic plot of P_B(s), P_G(s) and the KS statistic for two distributions (KS statistic = 0.53).

Mays (1995) gives a general guide to interpreting the KS statistic, which can be seen in table 2.1. This table should not be taken too literally.

Table 2.1. Guideline for the quality of a scorecard based on the KS statistic (Mays 1995). Values have been transformed from percentages to decimals.

  KS statistic   Evaluation
  < 0.20         Scorecard probably not worth using
  0.20 - 0.40    Fair
  0.41 - 0.50    Good
  0.51 - 0.60    Very good
  0.61 - 0.75    Awesome
  > 0.75         Probably too good to be true

Brier score

A third measure is the Brier score, originally proposed in a 1950 paper on weather forecasting (Brier 1950). The Brier score is the mean squared error of the probability estimates produced by the model. Let y_i be the outcome for a certain observation, taking the value 0 for a good and 1 for a bad. Let also f_i be the estimated probability of y_i. The Brier score is then defined as

BS = \frac{1}{N}\sum_{i=1}^{N} (f_i - y_i)^2 \qquad (2.5)
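In R this is a one-liner (a sketch; the function and argument names are ours):

  # Brier score (eq. 2.5): mean squared error of probability estimates f
  # against the 0/1 outcomes y
  brier_score <- function(f, y) mean((f - y)^2)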

Score-to-log-odds plot

The last measure that will be used is the score-to-log-odds plot. When plotted, it shows how well actual outcomes in different score segments correspond to the outcomes estimated by the model. If the actual log-odds for a segment are higher than what is expected from the model, the risk is under-estimated in that segment. The score-to-log-odds plot therefore shows how the model performs over the score band.

A benefit of using the measures described above is that they do not take into account a specific threshold when evaluating the model (Hand and Anagnostopoulos 2013). The downside is that, when specific thresholds have been decided, it is more interesting to measure the performance of the model around these thresholds (Thomas, Edelman, and Crook 2002).

2.3 Logistic Regressions

Logistic regression emerged in the late 1960s and early 1970s as an alternative to OLS regression and linear discriminant analysis (Peng, Lee, and Ingersoll 2002). In the normal case, the logistic regression is used to predict the response of a binary variable. There is also a generalization called multinomial logistic regression that can be used for multi-class problems. The logistic regression is widely used in empirical applications and has emerged as the industry standard credit scoring model (Bellotti and Crook 2009).

Lessmann, Seow, and Baesens (2013) run a benchmark of a wide array of classification algorithms on a credit scoring data set and find that, on average, the logistic regression performs well and even outperforms many state-of-the-art machine learning algorithms. They also find that models like random forests and neural networks can give better predictions than logistic regressions.

A reason that logistic regressions are still so widely used is that it is relatively easy to perform reality checks on them. It is for example possible, just by looking at the sign of a variable's coefficient, to see if the results intuitively make sense. Such checks can easily find problems such as the example of the debt variable described in section 2.1.1.

2.3.1 Basics

The logistic regression model comes from the need to construct a linearly additive model that can predict a probability, i.e. a number between 0 and 1. This was solved by a neat trick using the concept of the logit. Let Y_i be a binary-valued outcome variable with an associated probability p_i that is related to a number of explanatory variables X. Let, for simplicity, the explanatory variables be normalized, as it makes the specification of the regularization simpler. The logit is then the logarithm of the odds of the probability, \log\frac{p}{1-p}. This means that the probability can be transformed to a number that ranges from -\infty to \infty, which can be used in a linear model.

With m explanatory variables, the probability can be modeled using a linear prediction function, f, that for a particular data point i takes the form:

f(x_i) = \text{logit}(p_i) = \log\frac{p_i}{1-p_i} = \beta_0 + \beta x_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_m x_{m,i} \qquad (2.6)

For a trained model, with estimates \hat{\beta}_0 and \hat{\beta}, we can calculate the estimated logit and probability, \hat{p}_i, using

\text{logit}(\hat{p}_i) = \log\frac{\hat{p}_i}{1-\hat{p}_i} = \hat{\beta}_0 + \hat{\beta} x_i \quad \text{and} \quad E[Y \mid x_i] = \hat{p}_i = \frac{1}{1 + \exp\left(-(\hat{\beta}_0 + \hat{\beta} x_i)\right)} \qquad (2.7)

2.3.2 Regularized Logistic Regressions

Two widely recognized problems of logistic regressions are overfitting and performing feature selection. Feature selection refers to the problem of selecting the correct variables (features) to be included in the model. A common method to alleviate these problems is regularization. The idea behind regularization is to put a penalty on the sum of the regression coefficients, which can be done in a number of ways. A selection of common regularization methods:

• Lasso - a type of ℓ1-regularization.
• Ridge regression - a type of ℓ2-regularization.
• Elastic net regression - a linear combination of the lasso and ridge regression.

The Lasso (least absolute shrinkage and selection operator) estimator introduces an ℓ1-restriction of the form \sum_m |\beta_m| \le t for a constant t. This restriction has the effect that it tends to decrease the size of the coefficients and, for sufficiently small values of t, also sets some coefficients to zero (Tibshirani 1996). It has been shown that, under some conditions, the lasso can therefore be used as a method of automated feature selection while still maintaining an efficient estimation (Zhao and Yu 2006). It does however have some non-trivial problems. For example, it can only select a maximum of n variables, where n is the number of observations. Also, in the case of groups of highly correlated variables, it tends to somewhat arbitrarily choose only one of those variables to include in the regression (Zou and Hastie 2005).

The ridge regression method, also known as Tikhonov regularization, instead uses an ℓ2-restriction of the form \sum_m \beta_m^2 \le c for a constant c (Hoerl and Kennard 1970). The ridge regression reduces the variance of the coefficient estimates at the price of a bias. Combined, these effects can decrease the mean square error of the regression and lead to a better prediction (Hoerl and Kennard 1970). It has also been shown that in situations with n > m and highly correlated predictors, the ridge regression performs better than the lasso method (Tibshirani 1996). It does however lack the ability to produce a sparse model, as it keeps all the predictors in the model (Zou and Hastie 2005).

Zou and Hastie (2005) suggested a combination of the lasso and the ridge regression called the elastic net. It has the benefit of both automated feature selection and a lower variance of the prediction. Additionally, it can also select groups of variables if one of the variables in the group is found significant. Using the method of Lagrange multipliers, the lasso and ridge constraints can be rewritten as penalties in an optimization problem. Combined, the elastic net penalty function has the form

P(\beta, \lambda_t, \lambda_c) = \lambda_t \sum_m |\beta_m| + \lambda_c \sum_m \beta_m^2 \qquad (2.8)

where the \lambda parameters are called shrinkage parameters. When applied naively, this specification causes excessive shrinkage due to shrinkage effects from both the lasso and the ridge component. It can be shown that this can be mitigated by multiplying the estimates from the naive specification with a scaling factor, (1 + \lambda_c) (Zou and Hastie 2005).
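For illustration, the penalty in equation 2.8 as a small R function (a sketch; function and argument names are ours):

  # Elastic net penalty (eq. 2.8) for a coefficient vector beta
  enet_penalty <- function(beta, lambda_t, lambda_c) {
    lambda_t * sum(abs(beta)) + lambda_c * sum(beta^2)
  }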

2.3.3 Estimation

The logistic regression is usually estimated using a maximum likelihood estimation (MLE) approach (Peng, Lee, and Ingersoll 2002). The MLE approach is a widely used estimation method that, given a set of n data points x_1, x_2, ..., x_n, selects the model that is most likely to have generated said data. If we define a specific model as \theta, and the objective likelihood function as L, the maximum likelihood approach selects a model that satisfies

\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta;\, x_1, x_2, \dots, x_n) \qquad (2.9)

For both logistic and regularized logistic regressions, there are no closed-form solutions to the MLE problem, which implies that an iterative, numerical process needs to be used.


Chapter 3

Related Works

The classification approach to consumer credit scoring has been thoroughly researched in academia. Lessmann, Seow, and Baesens (2013) perform a meta-study of recently published papers on credit scoring. They find that, with regard to prediction, ANN are very strong when applied individually. Ensemble classifiers can outperform individual classifiers, with random forests giving the most accurate predictions. Another important finding is that complicated state-of-the-art algorithms do not necessarily outperform simpler versions. An example of this is random forests (see Breiman (2001)) outperforming the more advanced rotation forests (see Rodriguez, Kuncheva, and Alonso (2006)). An ensemble selection approach combining bootstrap sampling (see section 3.2) and a greedy hill-climbing algorithm (a type of optimization method) for base model selection gave the overall best results. They also find, by comparing 45 credit scoring papers from the period 2003 to 2013, that the mean and median number of observations in the data sets are 6167 and 959, respectively (Lessmann, Seow, and Baesens 2013). These can in machine learning contexts be considered relatively small data sets.

There is an extensive number of academic papers attempting to increase performance by training more than one model. These can somewhat coarsely be put into two categories: segmentation and ensemble models. Segmentation models attempt to divide the population into sub-populations and develop scorecards for each sub-population (Thomas, Edelman, and Crook 2002). Ensemble learning, on the other hand, aims to build a model by combining the results from a collection of (simple) models (Hastie et al. 2009).

3.1 Segmentation Models

Segmentation models can be constructed on either an experience-based (heuristic) or a statistical basis (Siddiqi 2005). The heuristic strategy uses some characteristic, such as age or requested credit, to segment the population. The statistical basis uses statistical methods or machine learning models, such as cluster analysis, to segment the population. Each credit application is thus first segmented by some means and then scored by a single scorecard trained specifically for that segment. This simplifies performance monitoring and model retraining.

Segmentation has proven successful in some cases. So et al. (2014) build a classification model that segments credit card customers into transactors (those who pay off their balance every month and are thus by definition good) and revolvers (those who sometimes only pay part of their monthly balance and incur an interest charge). They find that this segmentation gives a more accurate profitability estimate than a single scorecard. Banasik, Crook, and Thomas (2001) build a two-stage model to distinguish between high-usage and low-usage customers. They find that, when taking into account that the usage of credit is dependent on the amount of credit actually awarded a customer, the two-stage model gives a better prediction accuracy. Hand, Sohn, and Kim (2005) implement a bipartite model by splitting on variable characteristics, training two separate scorecards and selecting the split that maximizes the combined likelihood. They show that this procedure can increase performance compared to a single logistic regression.

Other segmentation methods have not been as successful. Bijak and Thomas (2012) use a statistical machine learning approach. They distinguish between a two-step and a simultaneous method. In the two-step method, the segmentation and scoring models are built independently. In the simultaneous approach, the segmentation and scoring models are optimized simultaneously. The first step implements statistically based segmentation methods (CART models, CHAID trees and LOTUS models) to separate the data set into several groups, and the second step builds scorecards for each group. They find that neither segmentation using the two-step method nor the simultaneous method significantly increases the prediction power.

3.2 Ensemble Models

Ensemble learning classifiers come in many different flavors. Bootstrap aggregating (bagging) involves drawing a number of random samples with replacement (bootstrap sampling) from the training data and training a classifier on each sample. Boosting creates an ensemble by iteratively creating new models, increasing the weights of misclassified data points in each step. The final classification model is then a weighting of all iterations. A widely used boosting algorithm is the AdaBoost algorithm (Marsland 2011). Stacked generalization (stacking) is a method whereby a number of classifiers are trained and then aggregated using a second step that combines the scores (Wolpert 1992).

Wang et al. (2011) compare bagging, boosting and stacking on three real-world credit data sets. As base learners, they use logistic regressions, decision trees, artificial neural networks and support vector machines. They find, in general, that the ensemble methods increase accuracy for all types of base learners. While bagging decision trees showed the best performance improvement, results seem to somewhat differ between data sets. Marqués, García, and Sánchez (2012) examine composite ensemble learning using random subspace, rotation forest and convolutional neural networks to construct two-level classifier ensembles. They run tests on six different data sets and find that two-level ensembles can increase prediction accuracy quite significantly. Yao (2009) similarly uses CART, bagging and boosting methods and also finds performance increases.


3.3 Two-stage Models

The category "two-stage models" is not clearly defined in the literature, but in this thesis refers to a model that is constructed, using similar methods, in two stages, where the second stage uses some kind of information from the first stage. This definition somewhat overlaps with the definitions of segmentation models and ensemble learning models, but the exact distinctions are not so important.

Finlay (2011) performs a broad study of multi-classifier systems on a number of baseline models on two data sets of approximately 100,000 observations each. He finds some evidence that CART and a combination of neural networks and logistic regressions show potential for increasing performance. The highest increase in performance comes from the ET boost, significantly out-performing the more commonly used AdaBoost algorithm. These methods are not two-stage models in the strict sense but have some similar characteristics.

He also creates two-stage models where, for example, the result from a baseline logistic regression was used to segment the population. The finding was that this type of segmentation performed poorly for all baseline models. His interpretation of these results is that the data is only very weakly non-linear, but that it could also be an effect of overfitting. For some segments, the number of bads was as low as under 1,000 cases. A problem with the paper is that, due to the large number of evaluated models, the exact methodology is not clearly explained.

Similar attempts to base segmentation on the score from a first-stage model were first made by Myers and Forgy (1963). Using discriminant analysis, they train second-stage models with different score cutoffs from the first stage. They find no positive effects, but the results are limited by the small sample of 300 observations.


Chapter 4

Method

Chapter 4 starts with a synthesis of the literature review in the form of a number of stylized facts. This leads up to a presentation of the methods investigated in this thesis. Next, some practical details of the implementation are discussed, followed lastly by a presentation of the data.

4.1 Synthesis

The findings in earlier chapters can be summarized in a few stylized facts.

1. There is a wide selection of methods that seem to increase the predictive power compared to the industry standard logistic regression.

From the discussion in earlier chapters, it seems that there is no single way of increasing the performance of credit scoring models. Both segmentation methods and ensemble methods seem in many cases to be effective. As many others have echoed (e.g. Thomas (2000)), there does not yet seem to exist a "holy grail" for the credit scoring problem. The optimal model for any implementation will likely depend on the characteristics of that specific instance of the problem and the needs and capabilities of the organization. Additionally, the relationship between model performance and model complexity is not always positive, and simpler models sometimes outperform more complex derivatives.

2. Many empirical evaluations use data with small sample sizes.

Introducing new data is problematic in the sense that it complicates comparison with other studies. Data sets will have different sizes, different variables and be drawn from populations with dissimilar characteristics. In such a comparison, it is not clear what part of the difference comes from the data and what part comes from the implementation. This is especially a problem when data is proprietary and not available for replication or further studies. A clear and exhaustive description of the methodology is therefore important to ensure reproducibility, so that the methodology can be tested on other data.

With this in mind, there is still a need for using proprietary data if the size or quality of the data exceeds that of the publicly available data sets. This is especially true when implementing models, such as segmentation models, that can significantly benefit from larger samples.

3. Broad comparative studies are hard to interpret.

Some recent studies, such as Finlay (2011) and Lessmann, Seow, and Baesens (2013), have identified a need for broad studies that compare a large number of classifiers using similar data and methodology. This approach has a number of problems. When the number of classifier models grows large enough, it is not possible to explain the practical implementation in enough detail. Data preprocessing (e.g. data cleaning, feature selection) and model specifics (e.g. regularization, estimation) have a big influence on results, and the specific methods employed might be better suited for certain classifiers. There might also be a bias in terms of the researchers' experience with different classifiers, causing some classifiers to under-perform. The non-exhaustive descriptions of methodologies make this problem difficult to tackle.

Conclusions

The conclusion of these stylized facts is that there is a need to look broadly when investigating classifier models, so that credit scorers can find the model that suits their specific capabilities and problems. Additionally, with new and better data, there is a need to re-evaluate models that earlier had not performed as well as expected. Finally, broad comparative studies need to be complemented by narrower studies that cover the implementation details of a model in greater depth. These studies, while interesting on their own, will also be a help for researchers attempting comparative studies.

4.2 Methodology

Given the conclusions above and the imperative of retaining simplicity, there is a motivation for using Klarna's high quality data to investigate potential low-hanging fruit that can increase the performance of the industry standard scoring model. While there have been some attempts at constructing simple two-stage logistic regression models, there has yet to be an exhaustive study with a clearly explained methodology. Another important addition of this study is the use of regularization methods, which makes for a separate and automated feature selection in both stages.

In discussions, and inspired by previous work, a number of methods to construct the two-stage model have been identified. The basic idea is to train a logistic regression on all observations in the training set. Information from the first stage will then be included into, or used to segment the data for, the second-stage model. The results from both models will then be combined in some way to form a conventional logistic regression for all observations.

4.2.1 Proposed Model Classes

Model class 1: Cutoff on predicted score from first-stage regression.

Defaulting customers can naively be put into three categories. The first category is those that do not have the ability to pay. The second category is those that for some reason have no intention to pay, and the third is fraud. It is likely that the characteristics predicting these categories are quite different.

A hypothesis is that the prevalence of the first category relative to the second category is larger at lower scores and vice versa. An option is then to train a model on all data and, in a second stage, train separate models on low- and high-scoring customers. An alternative is to assume that the second category might not easily be predicted by variables used for credit scoring. Defaults in the second category would then add noise when attempting to predict the first category. The alternative suggestion is therefore to make a cutoff on the score from the first stage, train the second stage only on the low-scoring customers, and score the remaining customers with the model trained on the full data. While it would be interesting to also train models on the subsample above the cutoff, the lack of bads in the high-scoring segment makes this difficult. A schematic sketch of this construction is given below.
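The sketch uses the glmnet package (introduced in section 4.3); the variable names, the alpha value and the cutoff are assumptions, not taken from the thesis:

  # Schematic two-stage training for model class 1 (all names assumed)
  library(glmnet)
  fit1   <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0.99)
  score1 <- predict(fit1, newx = x_train, s = "lambda.min", type = "link")
  low    <- as.vector(score1) < cutoff  # cutoff chosen on the validation set
  fit2   <- cv.glmnet(x_train[low, ], y_train[low],
                      family = "binomial", alpha = 0.99)
  # New applications below the cutoff are scored by fit2, the rest by fit1.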

Model class 2: Segmenting on observations close to the decision boundary.

After a credit scoring model has been trained, some kind of decision model decides whether to accept or reject a new application. A naive type of decision model is a constant risk model. It defines a constant risk, which is the maximum expected loss that can be accepted:

\text{PD} \cdot \text{credit} = \text{constant} \qquad (4.1)

The line formed by plotting the maximum allowed credit for each score is referred to as the decision boundary. Given such a model, the performance is highly dependent on the accuracy of the scoring model around the decision boundary. Given a hypothesized non-linearity in the data, the first stage can be used to identify observations in proximity to the decision boundary, and the second stage will be trained on that subset, which might give a better accuracy in the region. The condition for including an observation in the second stage is then:

\left| \frac{\text{credit} \cdot \text{PD}}{\text{constant}} - 1 \right| < x \qquad (4.2)

where the constant has been calculated from the constant risk decision model in equation 4.1 and x is the percentage window boundary.
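A minimal sketch of this condition (function and argument names are ours):

  # Selecting the second-stage subsample near the decision boundary (eq. 4.2)
  near_boundary <- function(credit, pd, constant, x) {
    abs(credit * pd / constant - 1) < x
  }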

Model class 3: Cutoffs on predicted score from alternative labels.

As the payment status of a transaction evolves over time, there is no clear definition of what point in time should be used to classify a transaction. Using the payment status at the credit due date might make us mistakenly classify cases as bad where the customer intended to pay but forgot to pay in time, or where the customer was temporarily unable to pay. The industry standard for bank cards is to use 90 days after the credit due date (Mays 1995). There might however be valuable information in the payment status a shorter time after the credit due date. Perhaps people that are likely to forget to pay on time have other characteristics than those that never fail to pay on time. This model class will therefore investigate the effect of training on earlier payment statuses, 5 and 30 days after the credit due date.

These labels will in the rest of the paper be referred to as dpoint5, dpoint30 and dpoint90, with dpoint90 being the final definition of goods and bads. Using the idea from above, this model class will first train first-stage models using the two alternative default labels dpoint5 and dpoint30. The scores from these first-stage regressions will then be used to segment the population, similarly to model class 1, so that a new dpoint90 regression can be trained for that subsample. The remaining sample will be scored by a simple logistic regression on the full sample. As the alternative labels have a larger number of bads compared to dpoint90, it is also possible to train models on the subsamples above the cutoffs.
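A minimal sketch of this class, under the same assumed names as before and with the alternative label bad5 (dpoint5) added to train:

# First stage trained on the alternative label dpoint5
stage1_dp5 <- glm(bad5 ~ . - bad, family = binomial, data = train)

# Alternative-label score, negated so that higher means safer
score5 <- -predict(stage1_dp5, train, type = "link")

# Retrain a dpoint90 model on the low-scoring segment; the remaining
# sample keeps the score from the plain dpoint90 model on all data.
seg    <- score5 < quantile(score5, 0.2)   # example cutoff
stage2 <- glm(bad ~ . - bad5, family = binomial, data = train[seg, ])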

Model class 4: Predicted score from alternative labels as variables.

An alternative to the segmentation models proposed above is to train scoring models on the alternative default labels and use the predicted scores as variables in a final scoring model on the dpoint90 label. For the previous model classes, the first stage aims to segment out a part of the sample to increase the accuracy on that subsample. For this model, however, both the first and the second stages are trained on the full sample, which means that the full number of bads can be used for training. The scores from the first stage will naturally have a high correlation with the remaining variables in the final scoring model. While correlation between explanatory variables is usually a problem for logistic regressions, regularization should help account for this.
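Continuing the sketch above, the dpoint5 score can be binned and the bin indicators added as variables to the final dpoint90 regression; the decile bin edges below are illustrative, not the values used in the thesis:

# Bin the alternative-label score; glm() expands the factor into
# non-cumulative binary indicator variables automatically.
train$score5_bin <- cut(score5, breaks = quantile(score5, probs = 0:10 / 10),
                        include.lowest = TRUE)
final_model <- glm(bad ~ . - bad5, family = binomial, data = train)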

4.3 Practical Details

For estimating logistic regressions in this thesis, the R package glmnet has been used¹. The glmnet package is a highly efficient implementation of the elastic net regression for generalized linear models using cyclical coordinate descent. It includes a function for estimating logistic regressions (Friedman, Hastie, and Tibshirani 2010). The regularization penalty in the glmnet package is defined as

(1 − α)/2 · ‖β‖₂² + α · ‖β‖₁ (4.3)

where α is a mixing parameter so that 0 ≤ α ≤ 1. The package also includes functionality for k-fold cross-validation to choose an optimal value of the shrinkage parameter, λ. When the mixing parameter is set to 1 − ε for some small ε > 0, the model retains much of the characteristics of the lasso but with some increased numerical stability (Hastie and Qian 2014).

The glmnet estimation procedures need, apart from the variable coefficients, a shrinkage parameter, λ, and a mixing parameter, α. The package includes a cross-validation function for finding the optimal λ for a given α. The optimal α can also be found by nesting a two-stage cross-validation method (Hastie and Qian 2014). Because of the considerable time it takes to train an individual model, it is not reasonable to perform a two-stage cross-validation for each model. A compromise is to find an optimal α once for the full training set and then only use the included cross-validation technique for λ for each subsequently trained model.

¹More information about the glmnet package can be found at http://cran.r-project.org/web/packages/glmnet/index.html.
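A minimal, self-contained sketch of this estimation procedure, using simulated stand-ins for the binary feature matrix x and the good/bad label y:

library(glmnet)

set.seed(1)
# Toy stand-ins for the thesis data (hypothetical, for illustration only)
x <- matrix(rbinom(2000 * 20, 1, 0.3), nrow = 2000)
y <- rbinom(2000, 1, plogis(-3 + x %*% rnorm(20)))

# Logistic elastic net with a near-lasso mixing parameter and 10-fold
# cross-validation over lambda, using AUC as the measure of fit
fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.999,
                 type.measure = "auc", nfolds = 10)

# Probability-of-default predictions at the cross-validated lambda
pd <- predict(fit, newx = x, s = "lambda.min", type = "response")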


The glmnet package has a number of statistics that can be used as the measure of fit in the cross-validation. This paper will use the AUC measure, since Gini, which derives directly from the AUC as Gini = 2 · AUC − 1, is one of the main statistics used to evaluate the models.

It is important to use separate sets for training, validating and testing to prevent overfitting and overly optimistic performance estimates. The test set was therefore an out-of-time set consisting of approximately 20 % of the total sample. The remaining part was split on a customer identifier so that 25 % of the remaining customers (approximately 20 % of the whole sample) end up in the validation set and 75 % (approximately 60 % of the whole sample) in the training set.
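A sketch of this splitting scheme, assuming data has a transaction date and a customer_id column (hypothetical names):

# The latest ~20 % of transactions form the out-of-time test set
data <- data[order(data$date), ]
n    <- nrow(data)
test <- data[(floor(0.8 * n) + 1):n, ]
rest <- data[seq_len(floor(0.8 * n)), ]

# Split the remainder 75/25 on customer identifier
ids        <- unique(rest$customer_id)
val_ids    <- sample(ids, round(0.25 * length(ids)))
validation <- rest[rest$customer_id %in% val_ids, ]
training   <- rest[!rest$customer_id %in% val_ids, ]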

A practical concern is how to compare results between first- and second-stage models. It is important that the underlying data is the same for comparisons to be useful. One solution would be to evaluate both stages on the subset of observations that are to be included in the second stage. This, however, makes different second-stage models difficult to compare, as the subset changes. The alternative approach used in this paper is to first predict scores using the first-stage model, and then replace the scores of the observations that should be included in the second-stage model with the scores from the second-stage model.
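Sketched in R, with score_stage1 and score_stage2 holding full-sample predictions from the two stages and in_seg flagging the second-stage subset (hypothetical names):

# Every observation keeps its first-stage score, except those routed to
# the second stage, whose scores are overwritten; both models are thereby
# always compared on the same full sample.
combined_score         <- score_stage1
combined_score[in_seg] <- score_stage2[in_seg]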

4.4 Data

The data used in this paper consists of Dutch invoices that were initiated during the period 2013-06-01 to 2014-03-31. The data set contains approximately 350,000 observations. Each observation is a previously accepted purchase through Klarna, and the data set can contain several observations for one individual. There are 136 variables that can be used for modelling purposes, some being internal data and others data acquired from external credit bureaus. Note that not all variables are valid for all observations, so some fields are missing.

Some variables are binary and others numeric or categorical. The non-binary variables were put into bins or categories and transformed into binary variables by making each bin into a binary feature. Binning transforms the contribution of a variable from linear to piece-wise linear, which makes the model more flexible (Hastie et al. 2009). When applicable, cumulative binning was used: for a variable such as income, an observation in the €100,000–€200,000 bin would also be in the €50,000–€100,000 bin. After binning and transforming the variables, the final data set contains 770 binary variables that form the final feature vector x.
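As a small sketch of the cumulative binning, assuming a numeric income vector and illustrative bin edges (not the thesis values):

edges <- c(50000, 100000, 200000)

# With cumulative bins, an observation in a high bin also activates every
# lower bin, which gives the piece-wise linear contribution described above.
bins <- sapply(edges, function(e) as.numeric(income >= e))
colnames(bins) <- paste0("income_ge_", edges)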

The data was divided into three sets: a training set (60 %), a validation set (20 %) and a test set (20 %). Table 4.1 shows the number of observations, the number of bads and the bad-rate for each set. The training and validation sets show, as expected, similar characteristics. For the out-of-time test set, the bad-rate appears to have decreased, implying some change in the population over time.

The original data set only contained labels for payment status 90 days after the credit due date.


Table 4.1. Distribution of goods and bads for payment status at 90 days after credit due date.

                Good    Bad   Bad-rate
Training set    208222  8842  0.041
Validation set  68756   3014  0.042
Test set        67860   2503  0.036

The payment statuses at earlier times were thus extracted and matched separately afterwards. Unfortunately, matches could not be found for around 20 % of the observations. On the other hand, the number of bads when looking at the payment status 5 days after the credit due date is larger than for the 90-day label by a factor of 10. For 30 days after the due date, the increase is approximately a factor of 3. Given that bads are so scarce for the dpoint90 label, this is a major increase.

An important thing to note is that the definition of a bad in this data set is very restrictive. Klarna has sorted out a large number of transactions that were deemed indeterminate. The reasoning for this comes from an earlier experience where a stricter definition of bads increased the prediction accuracy. Additionally, some non-bads have been filtered out for other reasons. The bad-rate should therefore not be taken as a measure of the actual loss rate of Klarna.


Chapter 5

Results

Chapter 5 starts by describing the results from implementing the cross-validation technique to find the mixing parameter. Next, the performance of the first-stage model is described. The chapter continues by presenting results from training the models proposed in section 4.2.1 and evaluating them using the validation sample. Lastly, the chapter describes the results from evaluating a number of candidate models using the test sample. This chapter refers to a large number of graphs and tables. Graphs and tables with the prefix A can be found in the appendix.

5.1 Choosing the Mixing Parameter, α

A two-stage cross-validation method was first set up to find the optimal α on the full training set. Table A.1 shows the results of testing a sequence of values for the mixing parameter, α. Apart from the minimum value, at α = 0, the differences in AUC between the remaining α values are in the fifth decimal. Without confidence intervals it is difficult to say whether any of the AUC values are higher than the others with statistical significance.

Given that the default value in the glmnet package is α = 1, i.e. a lasso, and using the idea that α = 1 − ε for some small positive ε works like the lasso but with increased numerical stability, it was decided to use α = 0.999 (Hastie and Qian 2014). This selection of α implies that the major part of the regularization comes from the lasso penalty. This value of α has been used for all subsequently trained models.
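The one-off search can be sketched as below, reusing x and y from section 4.3 and keeping the fold allocation fixed across α values so that the cross-validated AUCs are comparable:

alphas <- c(0, 0.111, 0.222, 0.333, 0.444, 0.556, 0.667,
            0.778, 0.889, 0.950, 0.990, 0.999, 1)

# Same cross-validation folds for every alpha
foldid <- sample(rep(1:10, length.out = nrow(x)))

# Best cross-validated AUC over the lambda path for each alpha
aucs <- sapply(alphas, function(a) {
  cv <- cv.glmnet(x, y, family = "binomial", alpha = a,
                  type.measure = "auc", foldid = foldid)
  max(cv$cvm)
})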

5.2 First-stage Model

The first stage was trained on the training set using the binary label dpoint90, i.e. payment status 90 days after the due date, as response variable. Figure 5.1 shows a number of plots based on the predictions on the validation sample. From the distributions of goods and bads, it is clear that there is an overlap on the larger part of the score band. The score-to-log-odds plot shows that the predicted score lies reasonably close to the actual line, but that there is an underestimation of the log-odds at low scores and an overestimation at high scores. The KS of 0.56 and the reasonably high Gini suggest that the first-stage model on its own performs relatively well.


[Figure 5.1. Combination of plots when evaluating the first-stage model on the validation sample: the distribution of goods and bads over the score band, actual and predicted log-odds, the Kolmogorov–Smirnov plot (KS statistic = 0.560) and the ROC curve (Gini = 0.730).]

5.3 Finding Candidate Models Using the Validation Sample

This section describes the results from the full set of trained models. The metrics used to evaluate the models turned out to be quite highly correlated; a model with a high Gini in general also had a high KS and a low Brier score. To ensure readability, this section therefore only contains figures showing how the Gini of different models compares with the first-stage model. The full result tables can be found in the appendix, tables A.2 to A.5.

Model class 1

Table A.2 and figure 5.2 show the results from retraining on the subsets of observations with predicted scores below different score cutoffs. It is easily noticeable that the results are not very sensitive to this method: the differences in all metrics are at most around 0.1 percentage points. A partial explanation for the edge cases might be that for the very low score cutoffs almost no observations are included in the second stage, while for the very high score cutoffs almost all observations are included. As the deviations are so small, it is difficult to say whether any variation is statistically significant. However, the 550, 630 and 670 models, being the three models with the highest Gini, were selected for further analysis.

[Figure 5.2. Gini of different models from Model class 1 (score cutoffs 500–700) compared to the first-stage model (dotted line) when evaluated on the validation sample.]

Model class 2

The decision boundary was decided using the constant risk concept described in section 4.2.1. The constant maximum allowed expected risk was set using numerical optimization to find the value that would give an acceptance rate of 95 % on the full training sample. This decision boundary was then used for subsequent training and analysis. The rule for including observations in the second-stage subset was being inside an x % boundary window. Table A.3 and figure 5.3 show the results for a number of different percentages on the validation sample. Apart from the 90 % window, which will be evaluated on the test sample, none of these models seem to give better results on any metric, and even for the 90 % model the difference is at best marginal.

Model class 3

Table A.4 and figures 5.4 and 5.5 similarly show the results of retraining the second-stage model on subsamples of observations with scores under or over different score cutoffs. In this model class, the first-stage scores were predicted using models trained on all observations with dpoint5 and dpoint30 as labels. The second-stage models were then trained on these subsamples using the dpoint90 label, and the results were combined with the dpoint90 first-stage model for the remaining sample.

This model class seems in general to give more interesting results than the two previous ones.

[Figure 5.3. Gini of different models from Model class 2 (boundary windows 20–90 %) compared to the first-stage model (dotted line) when evaluated on the validation sample.]

The score(dpoint5) < 480 model, as well as the models where the predicted score of dpoint30 is > 490, > 500 and > 510, all have Gini values larger than the first-stage model and those in model classes 1 and 2. They were therefore selected as candidate models. The score(dpoint5) > 490 model stands out as particularly bad. This is probably due to the model not converging properly, which might be caused by the random allocations to the cross-validation folds. This model was therefore not plotted in figure 5.4.

[Figure 5.4. Gini of different models with segmentation based on the dpoint5 score from Model class 3 compared to the first-stage model (dotted line) when evaluated on the validation sample.]

Table 5.1 shows a two-way table of the combinations of actual outcomes for observations where the predicted score from the dpoint5 regression was < 480.


[Figure 5.5. Gini of different models with segmentation based on the dpoint30 score from Model class 3 compared to the first-stage model (dotted line) when evaluated on the validation sample.]

Even though the sample has decreased significantly, a large portion of the bads with regard to dpoint90 remains. Additionally, of the non-missing observations, around 20 % of the observations that were bad with regard to dpoint5 were also bad with regard to dpoint90, compared with approximately 10 % on the full sample.

Table 5.1. Combinations of actual outcome labels on dpoint5 and dpoint90 for the test sample, on the subsample where the predicted score from the dpoint5 regression was < 480.

                    dpoint5
                    Missing  Good  Bad
dpoint90  Good      3723     3708  6003
          Bad       123      0     1566

Model class 4

In order to implement the fourth model class, two models with the respective labels dpoint5 and dpoint30 were first trained on the full training set. The predicted scores were then binned and transformed into binary non-cumulative variables. The two groups of binary variables were then included, first separately and then in combination, when training with the label dpoint90.

Table A.5 and figure 5.6 show the results of applying the trained models to the validation sample. The difference compared with the first-stage model is for this model class notably higher than for the other model classes, for all three combinations. All three will therefore be evaluated on the test sample.


[Figure 5.6. Gini of different models from Model class 4 (dpoint5, dpoint30, and dpoint5 & dpoint30 as variables) compared to the first-stage model (dotted line) when evaluated on the validation sample.]

Table A.6 shows the estimated coefficients for the model where only the binned scores from the dpoint5 label were included. As expected, a higher predicted score on the dpoint5 label is associated with a lower probability of being bad on the dpoint90 label. Similar results can be seen for the regression using the dpoint30 label.

5.4 Evaluating Candidate Models Using the Test Sample

A total of 11 models were, along with the first-stage model, selected for evaluation on the test sample. The results can be seen in figure 5.7 and in the appendix in table A.7. Table 5.2 shows a condensed version with the best performing model of each model class, along with the first-stage model for comparison. A first interesting result to notice is that the first-stage model performs significantly better on the test data than on the validation data. For example, the simple one-stage logistic regression has a Gini of 0.7298 on the validation sample and 0.7956 on the test sample.

Table 5.2. Results from the first-stage model and the four best performing candidate models on the test sample.

                          No. of Obs.  No. of Bads  Bad-rate  Gini    KS      Brier Score
1st-stage                 70363        2503         0.036     0.7956  0.6404  0.0276
Scorecutoffs Score < 550  6455         1589         0.246     0.7962  0.6386  0.0271
Decision boundary 90 %    21918        1514         0.069     0.7971  0.6383  0.0276
score(dpoint5) < 480      15123        1689         0.112     0.8011  0.6416  0.0272
dpoint5 as variable       70363        2503         0.036     0.8037  0.6465  0.0273

Using simple cutoffs on the predicted score from the first-stage model seems to reproduce a similar performance when retraining on the lower score spectrum (Score < 550), with a marginal increase in Gini and a lower Brier score. Any gain found in the validation set has disappeared when increasing the score cutoff to 630 or 670. Looking at the subset of observations within a 90 % distance of the decision boundary, the table shows a marginal increase in Gini coupled with a marginal decrease in KS. It is therefore difficult to say whether any of these four models performs better than the first-stage model.

[Figure 5.7. Gini of the candidate models compared to the first-stage model (dotted line) when evaluated on the test sample.]

When looking at the models using information from the dpoint5 and dpoint30 labels, it is clear that they, except for the predicted score of dpoint30 > 510, seem to outperform the first-stage model on the test set as well. Using the predicted score as a variable seems, consistent with the results from the validation set, to perform better than using it to segment the sample. As in the validation set, the largest increase is seen when using only the predicted score from the model with dpoint5 as label. This model showed an increase in Gini of 0.0081 points and an increase in KS of 0.0061.

Figures A.2 to A.6 show the score-to-log-odds plots for the five boldfaced models from table A.7. These graphs are displayed more condensedly in figure 5.8, where the difference between predicted and actual log-odds is plotted against the score. Consistent with earlier observations, all models tend to overestimate the risk of low-scoring applications and underestimate that of high-scoring applications. The model retrained on the subset with predicted score of dpoint5 < 480 stands out and does not show as strong tendencies as the other models.


[Figure 5.8. Difference between predicted and actual log-odds over the score band for different models on the test sample: 1st stage, predicted score of dpoint90 < 550, 90 % distance to decision boundary, predicted score of dpoint5 < 480, and predicted score of dpoint5 as variable.]

Assessing the statistical significance of these results is not trivial, as the glmnet method for various reasons does not have a clearly defined methodology for estimating standard errors. Some attempts have recently been made, notably by Lockhart et al. (2014), by the creators themselves, but they have not been tested enough in application. In short, this thesis does not delve deeply into statistical significance. Instead, it treats results as "significant" if they are approximately reproducible on different data, i.e. when the results are similar on the validation set and the test set.


Chapter 6

Discussion

When comparing the results from the validation and the test samples, it is clear that the performance is higher on the test sample, which stresses the importance of measuring relative performance within samples. There are two plausible reasons for this. Firstly, the training and validation samples were built by separating individuals so that, for each unique individual, all observations end up in either the training or the validation sample. The test sample, on the other hand, is an out-of-time set, and 20 % of the observations in the test sample can be attributed to an individual that exists in the training sample. Given that some variables are more or less constant for an individual, the training sample should be more correlated with the test sample than with the validation sample. It is worth noting that an actual business implementation will resemble the test sample, with data from a later time period and partially overlapping individuals. The fact that the models also perform well on the validation sample suggests that performance is consistent over moderate changes in the population.

Secondly, the lower bad-rate of the test sample shows that the population changes over time. It is not clear how this would affect the prediction accuracy, but it is possible that this could partially account for the relatively higher performance.

Another general observation, when looking at performance relative to the first-stage model on the two samples, is that the relative performance seems to be stable both over time and on a new population, which supports the general methodology. For the segmentation models using score cutoffs and distance to the decision boundary, there were no positive significant results. Contrary to what we expected, for most proposed models the evaluation metrics hardly changed at all. This could be interpreted as the underlying relations in fact being sufficiently linear. Another possible explanation is that there is in fact non-linearity, but that the two implemented models do not segment the population in a way that captures these non-linearities. This result is in line with previous research on two-stage credit scoring models.

Working with this paper has made it clear that all sorts of segmentation come with a caveat: they decrease the number of observations. This particular application is especially vulnerable, as bads are rare. When the second-stage models leave out a large number of bads, performance on that segment clearly deteriorates. Given the highly skewed distribution towards goods, simple segmentation methods will probably only be viable strategies for data sets of a much larger size than what was used in this thesis. This might also explain why the decision boundary model class performed particularly badly. By letting the simple scorecutoff models retrain on the lower score segment, they could capture a relatively large share of the bads. For the decision boundary model, one had to go quite far from the decision boundary in order to get a sufficient number of bads.

Dissecting the results of model classes 3 and 4, where alternative default labels are introduced, is not fully straightforward. A major difference from model classes 1 and 2 is that new information has been introduced into the modelling. The difference between model class 3 and model class 4 is the way this information affects the model. When the predicted score is used to segment the population, as in model class 3, the new information has no actual influence on the training within the segment; that the segmentation nonetheless helped strengthens the claim that there is in fact non-linearity in the data. The large number of bads for especially the dpoint5 label, in relation to the dpoint30 label, is also interesting.

Understanding why transactions change from bad at dpoint5 to good at dpoint90 is crucial. Connecting back to the discussion in section 4.2.1, two types of customers explain this behavior:

1. The customer has a temporary financial constraint that hinders him or her from paying in time.

2. The customer is somewhat sloppy and forgets to pay, or delays payment for some other reason.

When training a model on the earlier default labels, we are therefore trying to find the two categories of people mentioned above, as well as those who never fulfill their credit obligations. The results from model class 3 suggest that there is something about this group of people that differs from the remaining population.

In model class 4, the scores predicted by the first-stage model(s) on the alternative labels are instead used as variables in the dpoint90 model. The results seem to favor this approach. It works similarly to model class 3 by giving a lower score to those that are likely not to have paid at 5 or 30 days past the due date. An obvious advantage of this model is that both the first and the second stage can be trained on the full training set, meaning that none of the scarce bads have to be forsaken. Comparing the results on the validation and test sets, model class 4 seems the most robust.

Given that we have found an improvement in the model, the big question is what this means for the profitability of a company such as Klarna. While an increase in Gini of 0.01 might seem economically insignificant, it is important to remember that the credit industry is mostly a high-volume, low-margin industry. Using the decision boundary and the definition of expected loss discussed earlier in the thesis, it is possible to construct a simple model that estimates the expected loss from applying a certain model to a data set. When comparing the dpoint5-as-variable model with the simple one-stage model, the expected losses decreased by approximately 1 % on the test set. For a company with a monthly number of transactions on the order of 10,000,000, using the dpoint5-as-variable model would decrease the expected monthly losses by a magnitude of SEK 100,000. While these are not game-changing results, this is certainly evidence that the methodology holds promise.
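A rough sketch of such a loss comparison, assuming vectors pd_old and pd_new (predicted default probabilities from the one-stage and the dpoint5-as-variable models), credit amounts, the 0/1 outcome bad, and the calibrated constant from equation 4.1 (all hypothetical names):

# Acceptance under the constant-risk rule for each model
accept_old <- pd_old * credit < constant
accept_new <- pd_new * credit < constant

# Losses realized on the purchases each model would have accepted
loss_old <- sum(bad * credit * accept_old)
loss_new <- sum(bad * credit * accept_new)
(loss_new - loss_old) / loss_old   # relative change in losses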

Before implementing this in a live system, it is important to consider potential weaknesses in the model. One question is whether these results would be generalizable to other time periods or other geographic regions. The scarcity of bads is a universal property of a well-functioning consumer credit system, so it is plausible that there is a similar gain from using the relatively abundant earlier default labels in other markets. That being said, model class 3 might not be a valid option for smaller data sets, as it segments the data. For such data sets, the methodology from model class 4 will probably be a better approach. Another question is how this method is affected by sampling bias. It will of course still be present in the regressions with the dpoint90 label, but the other labels should have much less correlation with reject decisions. There is of course still the risk that the total vulnerability to sampling bias increases using these kinds of models.


Chapter 7

Conclusions

In this thesis, two-stage logistic regression models have been estimated and evaluated on credit scoring data. Credit scoring models use information such as payment history to predict the probability of default. The data has been provided by Klarna and consists of roughly 350,000 credit applications from Dutch retail customers. The two-stage models have been constructed by first training a regularized logistic regression on the full sample and then using information from the first model to re-score part or all of the observations. Specifically, four classes of models have been evaluated:

1. Segmenting on predicted score from the first-stage model
2. Segmenting on distance to the decision boundary derived from the predicted probability of default
3. Segmenting on predicted score from regressions using alternative default labels
4. Including predicted score from regressions using alternative default labels as variables

The models were compared on validation and test samples using KS, Gini, Brier score and score-to-log-odds plots. Implementing the different classes of models showed that the simpler models (classes 1 and 2) gave no significant increase in performance. The other two model classes (3 and 4) increased performance on both the validation and the test set, showing that there is a gain from segmenting on, or taking into account, predictions of early payment status. A conclusion from this is that two-stage models do improve the performance of credit scoring models, although not through the more conventional methods.

A key to understanding why using the earlier default labels improves the predictions probably lies in the scarcity of bads in the original credit scoring problem. Using the large number of bads at earlier dates, and their correlation with the dpoint90 label, it is possible to extract more information from the data and improve the model performance.


7.1 Future Research

Overall, this paper shows that prediction accuracy can be increased by taking into account the fact that a transaction is a process that changes over time. The interesting question is then how this information can best be put to use. While there is arguably much room for improvement in the methodology of this thesis, there are probably other methods that can use this information better.

An approach that has been tested previously in credit scoring is survival analysis. In survival analysis, the credit scoring problem is changed from estimating the probability of default to estimating the time of default. For more information, see for example the paper by Stepanova and Thomas (2002).

Another way could be to view the problem as a directed graphical model, as depicted in figure 7.1. In such a model, the x-vector of conventional variables would predict dpoint at t − n for some starting point in time, and probabilities would be fed forward in time until the probability of dpoint at t has been calculated. A paper on how to implement such a model could possibly find even further improvements in prediction accuracy.

Figure 7.1. Schematic specification of a directed graphical model that takes into account the time aspect of a transaction.


Bibliography

Banasik, J, J Crook, and L Thomas (2001). "Scoring by usage". In: Journal of the Operational Research Society 52.9, pp. 997–999.

Banasik, Jonathan, John Crook, and Lyn Thomas (2003). "Sample selection bias in credit scoring models". In: Journal of the Operational Research Society 54.8, pp. 822–832.

Bellotti, Tony and Jonathan Crook (2009). "Credit scoring with macroeconomic variables using survival analysis". In: Journal of the Operational Research Society 60.12, pp. 1699–1707.

Bijak, Katarzyna and Lyn C Thomas (2012). "Does segmentation always improve model performance in credit scoring?" In: Expert Systems with Applications 39.3, pp. 2433–2442.

Breiman, Leo (2001). "Random forests". In: Machine Learning 45.1, pp. 5–32.

Brier, Glenn W (1950). "Verification of forecasts expressed in terms of probability". In: Monthly Weather Review 78.1, pp. 1–3.

Capon, Noel (1982). "Credit scoring systems: A critical analysis". In: The Journal of Marketing, pp. 82–91.

Crook, Jonathan and John Banasik (2004). "Does reject inference really improve the performance of application scoring models?" In: Journal of Banking & Finance 28.4, pp. 857–874.

Davis, Jesse and Mark Goadrich (2006). "The relationship between Precision-Recall and ROC curves". In: Proceedings of the 23rd International Conference on Machine Learning. ACM, pp. 233–240.

Domingos, Pedro (2012). "A few useful things to know about machine learning". In: Communications of the ACM 55.10, pp. 78–87.

Finlay, Steven (2011). "Multiple classifier architectures and their application to credit risk assessment". In: European Journal of Operational Research 210.2, pp. 368–378.

Friedman, Jerome, Trevor Hastie, and Rob Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent". In: Journal of Statistical Software 33.1, p. 1.

Greene, William (1998). "Sample selection in credit-scoring models". In: Japan and the World Economy 10.3, pp. 299–316.

Hand, David J and Christoforos Anagnostopoulos (2013). "When is the area under the receiver operating characteristic curve an appropriate measure of classifier performance?" In: Pattern Recognition Letters 34.5, pp. 492–495.

Hand, David J and William E Henley (1997). "Statistical classification methods in consumer credit scoring: a review". In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 160.3, pp. 523–541.

Hand, David J, So Young Sohn, and Yoonseong Kim (2005). "Optimal bipartite scorecards". In: Expert Systems with Applications 29.3, pp. 684–690.

Hastie, Trevor and Junyang Qian (2014). Glmnet Vignette. http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html. Accessed: 2014-12-29.

Hastie, Trevor et al. (2009). The Elements of Statistical Learning. Vol. 2. 1. Springer.

Hawkins, Douglas M (2004). "The problem of overfitting". In: Journal of Chemical Information and Computer Sciences 44.1, pp. 1–12.

Hoerl, Arthur E and Robert W Kennard (1970). "Ridge regression: Biased estimation for nonorthogonal problems". In: Technometrics 12.1, pp. 55–67.

Krzanowski, Wojtek J and David J Hand (2009). ROC Curves for Continuous Data. CRC Press.

Lessmann, Stefan, Hsin-Vonn Seow, and Bart Baesens (2013). "Benchmarking state-of-the-art classification algorithms for credit scoring: A ten-year update". In: Proceedings of Credit Scoring and Credit Control XIII Conference, Edinburgh, United Kingdom, 28–30th August.

Lockhart, Richard et al. (2014). "A significance test for the lasso". In: The Annals of Statistics 42.2, pp. 413–468.

Marqués, AI, Vicente García, and Javier Salvador Sánchez (2012). "Two-level classifier ensembles for credit risk assessment". In: Expert Systems with Applications 39.12, pp. 10916–10922.

Marsland, Stephen (2011). Machine Learning: An Algorithmic Perspective. CRC Press.

Mays, Elizabeth (1995). Handbook of Credit Scoring. Global Professional Publishing.

Mester, Loretta J (1997). "What's the point of credit scoring?" In: Business Review 3, pp. 3–16.

Myers, James H and Edward W Forgy (1963). "The development of numerical credit evaluation systems". In: Journal of the American Statistical Association 58.303, pp. 799–806.

Peng, Chao-Ying Joanne, Kuk Lida Lee, and Gary M Ingersoll (2002). "An introduction to logistic regression analysis and reporting". In: The Journal of Educational Research 96.1, pp. 3–14.

Rodriguez, Juan José, Ludmila I Kuncheva, and Carlos J Alonso (2006). "Rotation forest: A new classifier ensemble method". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 28.10, pp. 1619–1630.

Siddiqi, Naeem (2005). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Vol. 3. John Wiley & Sons.

So, Mee Chi et al. (2014). "Using a transactor/revolver scorecard to make credit and pricing decisions". In: Decision Support Systems 59, pp. 143–151.

Stepanova, Maria and Lyn Thomas (2002). "Survival analysis methods for personal loan data". In: Operations Research 50.2, pp. 277–289.

Thomas, Lyn C (2000). "A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers". In: International Journal of Forecasting 16.2, pp. 149–172.

Thomas, Lyn C (2010). "Consumer finance: Challenges for operational research". In: Journal of the Operational Research Society 61.1, pp. 41–52.

Thomas, Lyn C, David B Edelman, and Jonathan N Crook (2002). Credit Scoring and Its Applications. SIAM.

Tibshirani, Robert (1996). "Regression shrinkage and selection via the lasso". In: Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288.

Wang, Gang et al. (2011). "A comparative assessment of ensemble learning for credit scoring". In: Expert Systems with Applications 38.1, pp. 223–230.

Wolpert, David H (1992). "Stacked generalization". In: Neural Networks 5.2, pp. 241–259.

Yao, Ping (2009). "Credit scoring using ensemble machine learning". In: Hybrid Intelligent Systems, 2009. HIS'09. Ninth International Conference on. Vol. 3. IEEE, pp. 244–246.

Zhao, Peng and Bin Yu (2006). "On model selection consistency of Lasso". In: The Journal of Machine Learning Research 7, pp. 2541–2563.

Zou, Hui and Trevor Hastie (2005). "Regularization and variable selection via the elastic net". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2, pp. 301–320.


Appendix

Table A.1. Table of alphas and corresponding average AUC of the trained models

Alpha   Lambda     AUC
0       0.002352   0.8680419
0.111   0.0005493  0.8685152
0.222   0.0003014  0.8685560
0.333   0.0002206  0.8685746
0.444   0.0001815  0.8685794
0.556   0.0001594  0.8685755
0.667   0.0001103  0.8685673
0.778   0.0000715  0.8685685
0.889   0.0000996  0.8685652
0.950   0.0000585  0.8685669
0.990   0.0000617  0.8685660
0.999   0.0000973  0.8685698
1       0.0000735  0.8685704


Table A.2. Model Class 1: Results on the validation sample from re-training on the subsamples with scores under different score cutoffs

            No. of Obs.  No. of Bads  Bad-rate  Gini    KS      Brier Score
1st-stage   70917        2962         0.042     0.7298  0.5604  0.0335
500         777          432          0.556     0.7292  0.5605  0.0336
510         1151         595          0.517     0.7300  0.5604  0.0333
520         1753         775          0.442     0.7296  0.5601  0.0334
530         2649         1005         0.379     0.7293  0.5595  0.0334
540         4214         1296         0.308     0.7296  0.5601  0.0333
550         6616         1617         0.244     0.7301  0.5601  0.0334
560         9982         1922         0.193     0.7293  0.5607  0.0334
570         14488        2178         0.150     0.7285  0.5594  0.0334
580         20055        2414         0.120     0.7288  0.5566  0.0334
590         25943        2598         0.100     0.7293  0.5557  0.0334
600         31763        2718         0.086     0.7293  0.5581  0.0335
610         37181        2799         0.075     0.7296  0.5592  0.0335
620         42426        2846         0.067     0.7292  0.5594  0.0336
630         47486        2890         0.061     0.7301  0.5624  0.0335
640         51910        2918         0.056     0.7307  0.5631  0.0335
650         55866        2935         0.053     0.7296  0.5605  0.0336
660         59274        2947         0.050     0.7295  0.5601  0.0336
670         62169        2950         0.047     0.7306  0.5634  0.0335
680         64532        2955         0.046     0.7295  0.5609  0.0336
690         66345        2958         0.045     0.7298  0.5603  0.0336
700         67681        2959         0.044     0.7286  0.5588  0.0336

Table A.3. Model Class 2: Results on the validation sample from re-training on the subsamples with scores within x percentage points of the decision boundary

            No. of Obs.  No. of Bads  Bad-rate  Gini    KS      Brier Score
1st-stage   70917        2962         0.042     0.7298  0.5604  0.0335
20          2059         275          0.134     0.7285  0.5600  0.0336
30          3253         423          0.130     0.7277  0.5591  0.0336
40          4701         590          0.126     0.7264  0.5570  0.0337
50          6557         772          0.118     0.7269  0.5588  0.0337
60          9166         990          0.108     0.7262  0.5578  0.0336
70          12961        1238         0.096     0.7276  0.5595  0.0336
80          18835        1543         0.082     0.7273  0.5594  0.0336
90          30213        1934         0.064     0.7300  0.5619  0.0336


Table A.4. Model Class 3: Results on the validation sample from re-training using predicted score on alternative response labels and score cutoffs to define the sample to be included in the 2nd stage

                        No. of Obs.  No. of Bads  Bad-rate  Gini    KS      Brier Score
1st-stage               71770        3014         0.042     0.7285  0.5585  0.0337
score(dpoint5) < 470    6847         867          0.127     0.7281  0.5569  0.0337
score(dpoint5) < 480    12332        1400         0.114     0.7309  0.5601  0.0336
score(dpoint5) < 490    23874        2133         0.089     0.7302  0.5627  0.0336
score(dpoint5) < 500    41452        2690         0.065     0.7288  0.5583  0.0337
score(dpoint5) < 510    54723        2920         0.053     0.7283  0.5588  0.0338
score(dpoint5) < 520    63467        2988         0.047     0.7278  0.5594  0.0338
score(dpoint5) < 530    68398        3005         0.044     0.7278  0.5600  0.0338
score(dpoint5) > 470    64923        2147         0.033     0.7306  0.5645  0.0337
score(dpoint5) > 480    59438        1614         0.027     0.7298  0.5653  0.0337
score(dpoint5) > 490    47896        881          0.018     0.5805  0.5045  0.0341
score(dpoint5) > 500    30318        324          0.011     0.7266  0.5572  0.0337
score(dpoint5) > 510    17047        94           0.006     0.7270  0.5600  0.0337
score(dpoint5) > 520    8303         26           0.003     0.7263  0.5587  0.0337
score(dpoint5) > 530    3372         9            0.003     0.7275  0.5585  0.0337
score(dpoint30) < 470   755          293          0.388     0.7296  0.5599  0.0336
score(dpoint30) < 480   1220         430          0.352     0.7292  0.5592  0.0336
score(dpoint30) < 490   2099         627          0.299     0.7283  0.5591  0.0337
score(dpoint30) < 500   3643         944          0.259     0.7291  0.5570  0.0336
score(dpoint30) < 510   6508         1381         0.212     0.7304  0.5597  0.0335
score(dpoint30) < 520   11799        1891         0.160     0.7304  0.5618  0.0336
score(dpoint30) < 530   21266        2371         0.111     0.7295  0.5603  0.0337
score(dpoint30) > 470   71015        2721         0.038     0.7299  0.5657  0.0337
score(dpoint30) > 480   70550        2584         0.037     0.7309  0.5642  0.0336
score(dpoint30) > 490   69671        2387         0.034     0.7317  0.5652  0.0337
score(dpoint30) > 500   68127        2070         0.030     0.7338  0.5690  0.0337
score(dpoint30) > 510   65262        1633         0.025     0.7319  0.5658  0.0337
score(dpoint30) > 520   59971        1123         0.019     0.7296  0.5639  0.0337
score(dpoint30) > 530   50504        643          0.013     0.7277  0.5599  0.0337


Table A.5. Model Class 4: Results on the validation sample from re-training using predicted score on alternative response labels

                     No. of Obs.  No. of Bads  Bad-rate  Gini    KS      Brier Score
1st-stage            71770        3014         0.042     0.7285  0.5585  0.0337
dpoint5              71770        3014         0.042     0.7325  0.5647  0.0336
dpoint30             71770        3014         0.042     0.7309  0.5636  0.0336
dpoint5 & dpoint30   71770        3014         0.042     0.7314  0.5649  0.0337

Table A.6. Coefficient values of dpoint5 variables when training on dpoint90 and including the predicted score from the dpoint5 label as a variable

Variable                   Coefficient
dpoint5_binned(471,482]    -0.0741
dpoint5_binned(482,488]    -0.2522
dpoint5_binned(488,493]    -0.4297
dpoint5_binned(493,497]    -0.6205
dpoint5_binned(497,501]    -0.7592
dpoint5_binned(501,506]    -0.9728
dpoint5_binned(506,513]    -1.3032
dpoint5_binned(513,522]    -1.4635
dpoint5_binned(522,595]    -1.8813


Table A.7. Results from evaluating the candidate models on the test sample

                                 No. of Obs.  No. of Bads  Bad-rate  Gini    KS      Brier Score
1st-stage                        70363        2503         0.036     0.7956  0.6404  0.0276
Scorecutoffs Score < 550         6455         1589         0.246     0.7962  0.6386  0.0271
Scorecutoffs Score < 630         37190        2408         0.065     0.7705  0.6005  0.0285
Scorecutoffs Score < 670         55263        2494         0.045     0.7535  0.5985  0.0287
Decision boundary 90 %           21918        1514         0.069     0.7971  0.6383  0.0276
score(dpoint5) < 480             15123        1689         0.112     0.8011  0.6416  0.0272
score(dpoint30) > 490            66964        1521         0.023     0.8003  0.6450  0.0276
score(dpoint30) > 500            65016        1219         0.019     0.8008  0.6442  0.0276
score(dpoint30) > 510            62014        878          0.014     0.7958  0.6359  0.0276
dpoint5 as variable              70363        2503         0.036     0.8037  0.6465  0.0273
dpoint30 as variable             70363        2503         0.036     0.8028  0.6464  0.0273
dpoint5 & dpoint30 as variables  70363        2503         0.036     0.8031  0.6474  0.0273

[Figure A.2. Actual and predicted score to log-odds when applying the first-stage model on the test sample.]


[Figure A.3. Actual and predicted score to log-odds when applying the scorecutoff model on the subset of observations with predicted score < 550 on the test sample.]

[Figure A.4. Actual and predicted score to log-odds when applying the model retrained on the subset of observations with less than a 90 % distance from the decision boundary on the test sample.]


[Figure A.5. Actual and predicted score to log-odds when training on the subset of observations with a predicted score from the dpoint5 label less than 480, on the test sample.]

[Figure A.6. Actual and predicted score to log-odds when training on dpoint90 and including the predicted score from the dpoint5 label as a variable, on the test sample.]
