DATA REPRESENTATIVENESS PROBLEM IN CREDIT SCORING

When building models, it is common to split the whole dataset into a development and a validation sample. In some cases, using random sampling instead of stratified sampling can lead to loss of representativeness of final samples. In such cases, a model built on these data gives different or unexpected results when its performance is measured on the validation sample. In the business area, a lack of representativeness can cause interpretative problems and can have a huge financial impact when a biased model is involved in the credit granting process. The aim of this paper is to examine and understand why representativeness should be checked before the start of modelling. The paper deals with methods of identification of selection bias in time. It recommends using three tests as a common part of the data preparation process.


Introduction
In statistical modelling, it is common for the database to be split into two samples: the development sample (DEV) and the validation sample (VAL). The development sample is used to develop the model (learning and estimating parameters of the model), while the validation sample is used to evaluate the model and for final model selection. In a later phase of model development, a third type of sample, the testing sample(s), can be used for assessing the predictive performance of the model [Borovicka et al., 2012]. If the same dataset were used for development, validation and calibration, the estimate of the predictive ability of the model would be overly optimistic.
In an ideal situation, two (or more) independent datasets are collected. However, in a situation where only one dataset is available and there is no opportunity to collect new data, it is necessary to split the data file. According to Snee [1977], data splitting is the most effective method of model validation when it is impossible to collect new data to examine the model. It is very important to create both samples (DEV and VAL) in such a way that each represents the total population; otherwise, bias can cause many problems. This requirement is natural, since the model reflects the specifics of the dataset used for its development. In order to make sure that the sample is representative, it is important to consider carefully how the sample was collected. If a sample is chosen for the sake of convenience alone, it becomes difficult to interpret the final model with confidence [Geoff, Everitt, 2001].
Bias refers to the tendency of selected samples to differ from the corresponding population in some systematic manner. Bias can arise if the sample was chosen inappropriately [Peck et al., 2012]. When sampling, the most common types of bias that may occur are selection bias, response or measurement bias, and nonresponse bias.
In most applications, simple random sampling is used. Nevertheless, there are several more sophisticated statistical sampling methods that are better suited to various types of datasets. The purpose of this paper is to show what happens if the development and validation datasets are created poorly, in such a way that they are not representative of the population. To demonstrate the consequences for the performance of the scorecards, the two most common data splitting methods were employed.
The rest of this paper is organized as follows. The next section presents a brief overview of various sampling methods. Section 3 explains the methodology used in the performed tests. Section 4 describes the data used for impact illustration. Section 5 contains a case study and discusses analysis results. The final section presents conclusions.

Data splitting
In many fields, representative large independent samples can be used for training and validating (and testing) of models and can be obtained simply by partitioning one large dataset (holdout method). However, in other fields, only datasets limited in size are available, as measurements are expensive or work-intensive. To solve the dilemma of partitioning a small pool of data into independent data subsets, re-sampling procedures can be used (repeated holdout method). It is believed that the more data, the better the model performance. However, some recently published studies show that this is not necessarily true [Meng, Xie, 2013; Faraway, 2014]. Stone [1974] may be considered a pioneer of data splitting. Since then, many data splitting methods have been designed. Their quality and complexity differ, and there is no single method which is, in general, viewed as superior. Their choice mostly depends on the purposes of the analysis. Sampling methods can be divided according to their principles, goals, and overall complexity [Reitermanová, 2010]. Data splitting algorithms and their comparison can be found in many studies [e.g., Molinaro et al., 2005; Snee, 1977]. Data splitting is easy to implement and thus presents an attractive alternative to complex methods of adjusting for the effect of model selection on inference [Faraway, 1998].
Simple random sampling is the most common holdout method. It is efficient and easy to carry out. Samples are selected randomly with uniform distribution, i.e., each observation has an equal probability of selection. This quite simple method leads to a low bias in the estimate of model performance. However, in cases where the dataset is not uniformly distributed or the number of selected cases is much lower than the size of the original database, simple random sampling can lead to subsets that do not cover the data properly (e.g., one or more classes might be missing), and therefore the estimate of the model error will have a high variance [Lohr, 1999].
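As a minimal sketch of this holdout split, the following Python snippet shuffles observation indices and cuts them at a 70:30 ratio. The function name simple_random_split, the seed, and the toy labels are assumptions made for illustration; they do not come from the study.

```python
import random

def simple_random_split(n_obs, dev_share=0.7, seed=None):
    """Return (dev_idx, val_idx); every observation has the same
    probability of being placed in the development sample."""
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_dev = round(n_obs * dev_share)
    return sorted(indices[:n_dev]), sorted(indices[n_dev:])

# Hypothetical data: 1,000 observations, 50 of them in a rare class (label 1).
labels = [1] * 50 + [0] * 950
dev_idx, val_idx = simple_random_split(len(labels), dev_share=0.7, seed=1)

# The number of rare-class observations in the validation part is left to chance.
rare_in_val = sum(labels[i] for i in val_idx)
print(len(dev_idx), len(val_idx), rare_in_val)
```

Rerunning the sketch with different seeds changes how many rare-class observations land in the validation part, which illustrates the variance problem described above.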
Stratified sampling is a probability sampling method. It is based on exploring the internal structure and distribution of a dataset and dividing it into (relatively) homogeneous, non-overlapping groups called strata (or clusters). The observations are then selected from each stratum with the appropriate probability. This ensures that each class is represented with the same frequency in the subsets. The important question is how to select observations from each cluster. There are two common principles: selecting one sample from each cluster [Bowden et al., 2002], or selecting samples from each cluster with a uniform probability [May et al., 2010]. The second approach is often referred to as stratified random sampling.
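Stratified random sampling can be sketched as follows, assuming that the strata are simply the classes of a single label and that each stratum is split at the same share. The function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_random_split(labels, dev_share=0.7, seed=None):
    """Split indices so that each stratum (label value) contributes
    the same share of its members to the development sample."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, label in enumerate(labels):
        strata[label].append(idx)

    dev_idx, val_idx = [], []
    for members in strata.values():
        rng.shuffle(members)                      # uniform selection within the stratum
        n_dev = round(len(members) * dev_share)
        dev_idx.extend(members[:n_dev])
        val_idx.extend(members[n_dev:])
    return sorted(dev_idx), sorted(val_idx)

# Hypothetical data: 1,000 observations, 50 of them in a rare class (label 1).
labels = [1] * 50 + [0] * 950
dev_idx, val_idx = stratified_random_split(labels, dev_share=0.7, seed=1)
rare_in_dev = sum(labels[i] for i in dev_idx)
rare_in_val = sum(labels[i] for i in val_idx)
print(rare_in_dev, rare_in_val)
```

Because every stratum is cut at the same share, the class proportions in both parts match those of the full dataset regardless of the seed (up to rounding within each stratum).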
Systematic sampling can be used in the case of (naturally) ordered datasets. The most common form of systematic sampling is the equal-probability method. From the ordered dataset (e.g., a time series), a starting observation is randomly chosen and then every i-th observation is selected [Elsayir, 2014]. The sampling interval (skip) i is calculated as the ratio of population size to sample size. Systematic sampling is a very efficient method and it is easy to implement. However, in many cases it is very difficult to find an appropriate ordering. For unordered data, the results of systematic sampling are comparable to those of simple random sampling, and it therefore suffers from the same problems. Its sensitivity to periodicities in the dataset is another disadvantage of the method.
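A small sketch of equal-probability systematic sampling is given below. It assumes that the data are already ordered and takes the sampling interval as the integer part of population size divided by sample size; the function name is hypothetical.

```python
import random

def systematic_sample(n_population, n_sample, seed=None):
    """Pick a random start within the first interval, then take every
    step-th observation of the ordered dataset."""
    rng = random.Random(seed)
    step = n_population // n_sample        # sampling interval (skip)
    start = rng.randrange(step)            # random start in 0 .. step-1
    return [start + j * step for j in range(n_sample)]

# Hypothetical ordered dataset of 1,000 observations, sample of 100.
idx = systematic_sample(1000, 100, seed=3)
print(idx[:5])
```

If the ordering contains a cycle whose length matches the interval, all selected observations fall at the same phase of the cycle, which is the sensitivity to periodicities mentioned above.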
Cross-validation ranks among the most popular re-sampling methods. For k-fold cross-validation, the original dataset is partitioned into k equal-sized parts (folds). The first fold is used for evaluation purposes; the remaining (k-1) folds are used for model learning. In the next step, the second fold is used for evaluation and the rest are used for learning. This procedure is repeated k times and the results are averaged [Picard, Cook, 1984]. There are no clear rules on how many folds should be used for cross-validation. In practice, setting k=10 is often sufficient.
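The k-fold procedure can be sketched in a few lines. The helper names k_fold_indices and cross_validate, as well as the toy scoring function, are assumptions for illustration; a real application would fit and score a model inside fit_and_score.

```python
import random

def k_fold_indices(n_obs, k=10, seed=None):
    """Yield (learning_idx, evaluation_idx) for each of the k folds."""
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)
    folds = [indices[f::k] for f in range(k)]      # k nearly equal-sized parts
    for f in range(k):
        evaluation = folds[f]
        learning = [i for g in range(k) if g != f for i in folds[g]]
        yield learning, evaluation

def cross_validate(n_obs, fit_and_score, k=10, seed=None):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(learn, evaluate)
              for learn, evaluate in k_fold_indices(n_obs, k, seed)]
    return sum(scores) / len(scores)

# Toy scoring function (hypothetical): share of data held out in each fold.
splits = list(k_fold_indices(100, k=10, seed=0))
mean_share = cross_validate(100, lambda learn, ev: len(ev) / (len(learn) + len(ev)),
                            k=10, seed=0)
print(len(splits), mean_share)
```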
A special variant of cross-validation is leave-one-out cross-validation (also called full cross-validation or jack-knife). It sets k=n, where n is the size of the original dataset. This setting gives nearly unbiased estimates (lower root mean square errors of predictions) of the model performance but usually with large variability. This deficiency is known as asymptotic inconsistency [Shao, 1993].
The main principle of the bootstrap method, originally proposed by Efron in 1979 and described in [Tibshirani, Efron, 1996], is to obtain B bootstrap samples by uniform sampling with replacement from the original dataset of size n. On each bootstrap sample, the parameters of a model are estimated, while the estimation of prediction performance is carried out on the original dataset. Bootstrapping is not affected by asymptotic inconsistency and might be the best way of estimating error for very small datasets, since the complete procedure can be repeated arbitrarily many times. For more information, see for instance Kohavi [1995] or Andrews [2000].
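The bootstrap scheme described above can be illustrated with a deliberately trivial "model", a constant predicted bad rate. The data, the functions fit and evaluate, and the choice of mean squared error are all assumptions made for this sketch.

```python
import random

def bootstrap_samples(n_obs, n_boot, seed=None):
    """Yield n_boot index lists of size n_obs, drawn uniformly with replacement."""
    rng = random.Random(seed)
    for _ in range(n_boot):
        yield [rng.randrange(n_obs) for _ in range(n_obs)]

def fit(sample):
    # Hypothetical "model": the bad rate observed in the bootstrap sample.
    return sum(sample) / len(sample)

def evaluate(predicted_rate, original):
    # Mean squared error of the constant prediction on the original dataset.
    return sum((y - predicted_rate) ** 2 for y in original) / len(original)

# Hypothetical good/bad flags (1 = bad) for a very small dataset.
data = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
errors = [evaluate(fit([data[i] for i in sample]), data)
          for sample in bootstrap_samples(len(data), n_boot=200, seed=7)]
print(sum(errors) / len(errors))
```

The parameters are estimated on each bootstrap sample and the error is measured on the original dataset, mirroring the description in the text.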

Methodology
In this section, we present three quick analyses that can be used for checking representativeness between two created subsamples when building a scoring model. Later, in the case study, the proposed tests are illustrated on credit scoring model development, but they can be used in other areas as well.
The first analysis, called demand stability, checks for each variable of the database used for the computation of the final score whether the distribution of its categories is significantly different between the development and validation samples.
The second analysis, risk stability, examines whether the event rate in corresponding classes of a variable is similar in both samples.
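A demand stability check could look like the sketch below. The text does not name the statistical test, so a chi-square test of homogeneity is assumed here; the category labels and counts are hypothetical.

```python
from collections import Counter

def demand_stability(dev_values, val_values):
    """Compare category shares of one variable between DEV and VAL.
    Returns (rows, chi2, dof); rows hold (category, dev_share, val_share)."""
    dev_counts, val_counts = Counter(dev_values), Counter(val_values)
    categories = sorted(set(dev_counts) | set(val_counts))
    n_dev, n_val = len(dev_values), len(val_values)
    n_all = n_dev + n_val

    chi2, rows = 0.0, []
    for c in categories:
        total_c = dev_counts[c] + val_counts[c]
        for observed, n_sample in ((dev_counts[c], n_dev), (val_counts[c], n_val)):
            expected = total_c * n_sample / n_all   # count expected under equal shares
            chi2 += (observed - expected) ** 2 / expected
        rows.append((c, dev_counts[c] / n_dev, val_counts[c] / n_val))
    return rows, chi2, len(categories) - 1

# Hypothetical categories of a duration-type variable.
dev = ["<1y"] * 210 + ["1-5y"] * 350 + [">5y"] * 140
val_same = ["<1y"] * 90 + ["1-5y"] * 150 + [">5y"] * 60
val_shifted = ["<1y"] * 150 + ["1-5y"] * 100 + [">5y"] * 50

rows, chi2_same, dof = demand_stability(dev, val_same)
_, chi2_shifted, _ = demand_stability(dev, val_shifted)
print(rows, chi2_same, chi2_shifted, dof)
```

With two degrees of freedom, a statistic above the 5% critical value of 5.991 would suggest that the category distribution differs between the two samples.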
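As an illustration of a risk stability check, the sketch below computes the event (bad) rate per category in each sample and the gap between them. The record layout of (category, is_bad) pairs, the function names, and the toy counts are assumptions.

```python
def event_rates(records):
    """records: iterable of (category, is_bad) pairs; returns {category: bad rate}."""
    totals, bads = {}, {}
    for category, is_bad in records:
        totals[category] = totals.get(category, 0) + 1
        bads[category] = bads.get(category, 0) + int(is_bad)
    return {category: bads[category] / totals[category] for category in totals}

def risk_stability(dev_records, val_records):
    """Return {category: (dev_rate, val_rate, val_rate - dev_rate)}."""
    dev_rates, val_rates = event_rates(dev_records), event_rates(val_records)
    table = {}
    for category in sorted(set(dev_rates) | set(val_rates)):
        dev_rate = dev_rates.get(category)
        val_rate = val_rates.get(category)
        # The gap is undefined when a category is missing from one sample.
        gap = None if dev_rate is None or val_rate is None else val_rate - dev_rate
        table[category] = (dev_rate, val_rate, gap)
    return table

# Hypothetical samples with two categories of one explanatory variable.
dev = [("<1y", 1)] * 30 + [("<1y", 0)] * 170 + [(">=1y", 1)] * 25 + [(">=1y", 0)] * 475
val = [("<1y", 1)] * 12 + [("<1y", 0)] * 88 + [(">=1y", 1)] * 10 + [(">=1y", 0)] * 190
table = risk_stability(dev, val)
for category, (dev_rate, val_rate, gap) in table.items():
    print(category, dev_rate, val_rate, gap)
```

Large gaps, or a different ordering of categories by risk in the two samples, would be a warning sign in the sense used in the case study.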
The third analysis is the gap table. Its rows represent categories in the case of explanatory variables, or score deciles in the case of scorecard output examination. For each class of the analysed variable, the columns of the table contain the following information:
• points of each category (for explanatory variables only),
• average total score,
• numbers of observations and column percentage,
• numbers of observed and predicted events and their differences,
• observed and predicted event rates and their difference,
• p-value of a one-tailed test (H0: event_rate_predicted ≤ event_rate_observed), and
• 95% two-sided confidence interval for event_rate_predicted.
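One row of such a gap table might be computed as in the following sketch. The text does not specify how the p-value and the confidence interval are obtained, so a normal approximation to the binomial distribution around the predicted rate is assumed here; the function name and the counts are hypothetical, and the predicted rate is assumed to lie strictly between 0 and 1.

```python
import math

def gap_row(n_obs, observed_events, predicted_events, z_crit=1.96):
    """Gap statistics for one class, using a normal approximation (assumption)."""
    obs_rate = observed_events / n_obs
    pred_rate = predicted_events / n_obs
    std_err = math.sqrt(pred_rate * (1.0 - pred_rate) / n_obs)
    z = (pred_rate - obs_rate) / std_err
    # One-tailed p-value for H0: predicted rate <= observed rate, i.e. 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return {
        "event_diff": predicted_events - observed_events,
        "rate_diff": pred_rate - obs_rate,
        "p_value": p_value,
        "ci_95": (pred_rate - z_crit * std_err, pred_rate + z_crit * std_err),
    }

# Hypothetical class with 1,000 observations and 74 predicted events.
row_close = gap_row(1000, observed_events=80, predicted_events=74.0)
row_over = gap_row(1000, observed_events=50, predicted_events=74.0)
print(row_close)
print(row_over)
```

Under this assumption, a small p-value indicates that the model predicts noticeably more events than were observed in the class.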

Data description
For the illustration of the impacts of data splitting in the case study described in the next section, a database from the credit risk area was used. It contains 20 behavioural characteristics (explanatory variables) of clients' credit behaviour in a bank; all of them are categorical. The two-valued explained variable was defined as follows: a client is marked as bad if he has reached 3 or more instalments past due. Otherwise a client is marked as good. The goal is to build a behavioural scoring model. For this purpose, binary logistic regression was employed with stepwise selection of explanatory variables. The output of the scoring model is the probability of default, but for better orientation, the values were transformed to a credit score where a higher score means a better client.
The approach to the modelling process is to use the holdout method. The database will be split such that 70% of the data will belong to the development dataset and 30% to the validation dataset. The data splitting will be carried out in two main ways:
1. Using stratified random sampling, which maintains the proportion of good/bad.
2. Using simple random sampling, which does not maintain the good/bad proportion.
For these purposes, the SAS 9.1.3 procedure PROC SURVEYSELECT (with explained variable in strata option) can be used to split the database.
Taking the example constructed from a real database, let's have a look at the bad rate distribution (Table 1). The database chosen is large enough to give a good platform for examining the impact of stratified random sampling and simple random sampling on the predictability of the chosen samples, both the validation and development databases.

Data splitting
The SAS procedure with a strata option is used to designate variables defining a dataset or strata or nested sets in a case control study. The results obtained with stratified random sampling are shown in Table 2. In this type of data splitting, the sample is split into two; however, the good and bad observations are incorporated into the modelling process at the same ratio as in the original database. We, therefore, always have 369 bad contracts that can be used to validate the model.
In simple random sampling, it is assumed that the sample selected is absolutely random and that no biases occur in the data. However, the failure to identify a serious bias in the sample can result in inaccurate test statistics and standard errors. Looking at the database previously selected, we find that if we do not use stratified random sampling, different values for validation arise. To obtain reliable results, the database was split 1,000 times and the confidence interval was calculated. The results are shown in Table 3.
From Table 3, we find that in extreme cases 54 bads could be lost for validating the model. This proportion (14.6%) is very high, as the percentage of the total number of bads (7.38%) is quite low and the size of the validation database is quite small (30% of the total sample size). As such, the choice to keep or remove the 54 bads from the validation database has significant consequences. Using the strata option in the SAS procedure is thus important in order to avoid this problem in the case of a low number of bads. By preserving the bad proportion of the database, the dataset so selected is more representative of the population tested.
[Figure 1: 95% confidence interval for the bad rate; development sample (7.15%; 7.58%), validation sample (6.85%; 7.93%). Source: calculated by the author]
From the graphs in Figure 1, we find that the development database can have between 833 (7.15%) and 887 (7.58%) bads. On the other hand, the validation database can have between 342 (6.85%) and 396 (7.93%) observations marked as bad. In both cases, there is a range of 54 contracts between the lower and upper confidence limits. The difference is not that relevant with respect to the development database, as this range amounts to 6.3% of the 860 expected bads. However, the impact is very important for the validation database, where the same range amounts to 14.6% of the 369 expected bads.
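The repeated-splitting experiment can be imitated with a short simulation. The total of 1,229 bads follows from the 860 and 369 bads quoted in the text; the database size of 16,650 contracts is an assumption inferred here from the 7.38% bad rate, and the percentile-based interval and the function name are likewise assumptions, so the output need not reproduce Table 3.

```python
import random

def simulate_val_bads(n_total, n_bads, val_share=0.3, n_splits=1000, seed=0):
    """Repeat a simple random split n_splits times and return the empirical
    95% interval of the number of bads that fall into the validation sample."""
    rng = random.Random(seed)
    labels = [1] * n_bads + [0] * (n_total - n_bads)
    n_val = round(n_total * val_share)
    # Each split draws the validation sample without replacement.
    counts = sorted(sum(rng.sample(labels, n_val)) for _ in range(n_splits))
    lower = counts[int(0.025 * n_splits)]
    upper = counts[int(0.975 * n_splits) - 1]
    return lower, upper, n_val

N_TOTAL = 16_650   # assumed database size (not stated in the text)
N_BADS = 1_229     # 860 + 369 bads, as quoted in the text

lower, upper, n_val = simulate_val_bads(N_TOTAL, N_BADS, seed=0)
print(lower, upper, lower / n_val, upper / n_val)
```

A stratified split would instead place the same number of bads in the validation sample on every repetition, so the corresponding interval would collapse to a single value.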

Consequences of bias on explained variable
In order to verify the findings from the above sections, we will carry out two further impact analyses to see the impacts identified in the gap analysis. In the first analysis, we split the whole sample such that the development and validation samples are still in a 70:30 ratio but purposely have highly different proportions of bads. The resulting distribution is as follows (Table 4):

Demand stability
When we look at the stability of demand, we find that it is not affected at all in either the DEV or the VAL. For example, the distribution of the explanatory variable "Client duration", which represents the length of the relationship between the client and the institution, is shown in Table 5. The values obtained in this analysis are similar to the initial values (column Total), showing that the bias does not affect the demand stability of the explanatory variable.

Risk stability
However, when we look at the risk stability (Table 6), we find that the level of risk in the DEV and the VAL is not the same, which gives us a first warning. This is expected, as the risk should be lower for the validation sample than for the development sample based on the information displayed in Table 4. In the initial sample, the total bad rate was 7.38%. For the development sample it was 7.74%, while for the validation sample it was 6.55%. Nevertheless, the values obtained in this test follow the same trend as the original values, which indicates that the risk pattern is stable between the two samples.

Gap analysis
The gap analysis (Tables 7-10) reveals that the gap present is inconsequential and does not have a large impact on the model. In this test analysis, a bias induced on the explained variable poses no problem. However, a problem could arise when a low number of bads exists, as in the first test carried out previously. Thus, keeping the same good/bad proportion in the development and validation samples proves advantageous.

Consequences of bias on explanatory variables
In this test analysis, we have a database with a bias purposely placed on the explanatory variable "Client duration" used for the calculation of the final score. Note that the repartition is the same as in the previous subsection (Table 4).

Demand stability
In this case, the stability of the demand has not been affected (Table 11). This analysis reveals that there is no change in the demand stability when compared to initial values.

Risk stability
The risk in the DEV and the VAL is very dissimilar, which again raises a first alert on the problem of representativeness. From the risk stability (Table 12), we find that a bias on an explanatory variable used for the calculation of the final score undermines confidence that the samples represent the population.

Gap analysis
Significant gaps between the observed and the predicted values are found in the gap analysis. The highlighted values in Table 14 indicate a poor fit of the model, implying that a higher rate of misclassification should be expected. Once again, it alerts the statistician to a problem of either instability or correlation.

Conclusion
In statistical modelling, it is common for researchers to split the original database into development and validation samples. The representativeness of these samples should be checked to ensure that they do not introduce bias into the study. The two most common data splitting methods were used for the purposes of splitting the database. Stratified random sampling guarantees that the samples created from the database will be similar, while simple random sampling offers no guarantee that the subgroups will be represented proportionately or equally.
First, we compared both methods and noted that stratified random sampling reduces the loss of important observations and keeps the proportion of the response values identical in both samples. Without using the strata function, we find that large variability may occur within the response variable (a range of 14.6% of the expected bads in the validation sample in our case), especially when dealing with a small sample size. This means that there is a very significant risk of the response values (good/bad) being lost in the final sample.
Further, the paper examines the effects on the quality of the credit score when a bias occurs. For this analysis, the database was purposely split poorly using simple random sampling. The impacts differ depending on whether the problem is present on the explained variable or on a variable used for the credit score calculation.
To reveal the consequences, three analyses were made. It was shown that if the problem was detected only on the explained variable while all the explanatory variables used in the model are unaffected, there will be no problem in the model output as long as the risk stability remains unaffected and there are no serious gaps in the gap analysis. However, if there is a bias in any of the model inputs, weak model performance is highly probable. Serious problems can be indicated by unstable risk between the DEV and the VAL and by the presence of huge gaps in the gap analysis.
Stratified random sampling offers various advantages over simple random sampling. First of all, it increases the efficiency of estimators of the overall population parameters through the selection of strata that are homogeneous within each dataset. It is also advantageous as it focuses on subpopulations of special interest, such as the bad applicants in our case. Finally, stratified random sampling is also convenient and can be used on a smaller sample size as well, which in turn saves time and money.