Carolyn Rohm

Linear or Non-Linear Algorithms: Which is Best for Credit Risk Prediction?

In our world of credit risk, we typically try to solve one of two problems. The first problem that we often try to solve is a problem where the outcome we want to predict is binary:

  • Will an applicant meet their payment obligations?

  • Will a customer respond to an offer of credit?

  • Will a customer honour their payment arrangement?

The second type of problem that we try to solve involves a range:

  • How much of this customer’s outstanding debt are they likely to repay?

  • How much interest are we likely to earn from this customer over the next 12 months?

  • What is the likely spend from this customer over the next six months?

  • How many purchases is this customer likely to make?

As you can see, in the first set of questions the answer is either Yes or No, a 1 or a 0, which makes it a classification problem. In the second set of questions we’re looking for a value, which makes it a regression problem.


A handful of commonly used machine learning algorithms can answer these questions.


My “go-to” algorithms for these problems span both families: Linear and Logistic Regression on the linear side, and Random Forests and XGBoost models on the non-linear side.


What are linear algorithms?

Taking it right back to basics, linear means line. Linear machine learning algorithms are a family of algorithms that assume a linear relationship between the feature (or independent) variables and the target (or dependent) variable.
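To make "linear" concrete, here is the usual textbook form of the two models, written in my own notation (the symbols are assumptions for illustration and do not come from any particular library). Linear Regression predicts a value directly from a weighted sum of the features, while Logistic Regression passes the same weighted sum through the logistic function to get a probability between 0 and 1:

```latex
% Linear Regression: the prediction is a weighted sum of the features
\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k

% Logistic Regression: the same weighted sum, mapped to a probability
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

This is why Logistic Regression still counts as a linear algorithm: the log-odds of the outcome are a straight-line function of the features.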


Start with Why (Image by Carolyn Röhm - midjourney)

I’m a fan of Linear and Logistic Regression.


They have some excellent advantages:

  • They’re fast

  • They’re easy to implement

  • They’re very easy to interpret – and this means it is easy to explain why the prediction was made.

  • They can be used to solve both classification and regression problems.
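As a rough sketch of why these models are easy to interpret, the snippet below scores one applicant with a hand-written linear model. The feature names, coefficients and intercept are all hypothetical values chosen for illustration; they are not fitted to real data. The same weighted sum could be used as-is for a regression problem, or passed through the logistic function for a classification problem.

```python
import math

# Hypothetical coefficients for illustration only; not fitted to real data.
INTERCEPT = -2.0
COEFFICIENTS = {
    "missed_payments_last_12m": 0.8,
    "credit_utilisation": 1.5,   # proportion of the limit in use, 0.0 to 1.0
    "years_as_customer": -0.1,
}


def contributions(features):
    """Each feature's contribution is simply coefficient * feature value."""
    return {name: COEFFICIENTS[name] * value for name, value in features.items()}


def linear_score(features):
    """Intercept plus the sum of the per-feature contributions."""
    return INTERCEPT + sum(contributions(features).values())


def probability(features):
    """Logistic Regression view: map the linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-linear_score(features)))


applicant = {
    "missed_payments_last_12m": 2,
    "credit_utilisation": 0.6,
    "years_as_customer": 5,
}

for name, value in contributions(applicant).items():
    print(f"{name}: {value:+.2f}")
print(f"probability: {probability(applicant):.2f}")
```

Because every contribution is a single multiplication, it is straightforward to say which features pushed this applicant's score up and which pulled it down.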

For all their advantages, linear algorithms do have some disadvantages.

  • They’re limited in their ability to capture the complexity of relationships between variables. Because they only model linear relationships, if the underlying relationships are not linear, they’re unlikely to give the best answer.

  • Because linear algorithms rely on the feature variables being independent of one another (that is, not strongly correlated with each other), if this key assumption does not hold there is a strong probability that your model will be overfitted, and that relationships that look statistically significant won’t actually be significant.

  • Feature engineering becomes very important. When performed well, it can overcome the challenges highlighted above, but the model depends on that work being done.
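Here is a small, contrived sketch of the first and last points together. The data is made up: the target follows a U-shape, so a straight line through the raw feature explains none of it. If we assume the analyst has spotted the turning point (3 in this toy data) and engineers a "squared distance from the turning point" feature, the same linear method fits perfectly. The function names are my own, not from any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up, U-shaped data: the target is high at both ends of the feature range.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [9, 4, 1, 0, 1, 4, 9]

# A straight line through the raw feature.
raw_r2 = r_squared(xs, ys, *fit_line(xs, ys))

# Engineered feature: squared distance from the assumed turning point at 3.
engineered = [(x - 3) ** 2 for x in xs]
engineered_r2 = r_squared(engineered, ys, *fit_line(engineered, ys))

print(f"raw feature R^2: {raw_r2:.2f}")
print(f"engineered feature R^2: {engineered_r2:.2f}")
```

On this toy data the raw feature gives an R² of 0.00 and the engineered feature gives 1.00. Real data will never be this clean, but the mechanism is the same.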

What are non-linear algorithms?

In the same way that linear algorithms are a family, non-linear algorithms can also be considered a family. In this case, they model non-linear relationships between features (independent variables) and the target (dependent variable).

I’m a big fan of Random Forests and XGBoost models.


Non-linear algorithms also have some fantastic advantages:

  • They can model complex relationships between features and the target.

  • Outliers do not present as many challenges in non-linear algorithms.

  • Feature engineering is less important than it is in linear algorithms because non-linear algorithms are better suited to handle complex relationships.
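To illustrate the first and last points, here is a minimal sketch of a single regression tree, the building block that Random Forests and XGBoost combine in large numbers. It is a simplified stand-in written from scratch, not either library's actual implementation, and the data is made up. The tree is given only the raw feature from a U-shaped relationship, with no engineered features, and still picks up the shape.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean of the values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(xs, ys, depth):
    """Recursively split on the threshold that minimises squared error."""
    if depth == 0 or len(set(xs)) < 2:
        return mean(ys)  # leaf: predict the average target in this region
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold)
    threshold = best[1]
    left_pairs = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right_pairs = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    return (
        threshold,
        build_tree([x for x, _ in left_pairs], [y for _, y in left_pairs], depth - 1),
        build_tree([x for x, _ in right_pairs], [y for _, y in right_pairs], depth - 1),
    )


def predict(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


# Made-up, U-shaped data: the target is high at both ends of the feature range.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [9, 4, 1, 0, 1, 4, 9]

tree = build_tree(xs, ys, depth=2)
tree_sse = sum((y - predict(tree, x)) ** 2 for x, y in zip(xs, ys))

# For this symmetric data the least-squares straight line is flat at the mean,
# so its squared error equals the squared error around the mean.
line_sse = sse(ys)

print(f"straight line SSE: {line_sse:.1f}")
print(f"depth-2 tree SSE:  {tree_sse:.1f}")
```

On this toy data the squared error drops from 84.0 for the straight line to 14.0 for a tree with only two levels of splits, using nothing but the raw feature.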

And they have disadvantages:

  • They are notoriously difficult to interpret.

  • They can be extremely complex.

  • They run the risk of overfitting to training data.

  • They are data-hungry.

Someone new to machine learning could be forgiven for looking at this and wondering how any of this helps.


As always, it comes down to context.


What do I mean by context?


I’m often asked what the ‘best’ algorithm is to solve a problem, and often, this question is asked in a vacuum, without the context that makes it possible to answer. Conversely, I have sometimes been asked about the details of what I’m doing to solve a particular challenge.


These questions both have the same root issue.


One size does not fit all.


And this is where context is important.


Let’s unpack what I mean by context.


Simon Sinek was absolutely spot on when he said we should all start with Why?


For any piece of analysis, I like to understand why.


Why does this particular issue need to be solved or better understood?


For example, perhaps we’ve been challenged to identify all those customers who are currently up to date with their payment obligations but who are likely to miss their next payment.


Given the increasing arrears currently being observed at the bureaux, both here and in AU, this is a very reasonable question to ask.


And now to understand why we may be asking that question. At first blush, the answers may seem self-evident:

  • Arrears are increasing

  • We know it is easier to cure early arrears than late arrears

  • By preventing people from entering arrears, we minimise the ongoing impact on Collections Management.

  • And so on.

Let’s dig a little deeper.


Let’s assume for a moment that we can accurately identify who is most likely to miss their next payment. What will we do?


Is the planned activity an additional early text message reminding them to make a payment?


Or do we want to target specific individuals with specific actions? Perhaps some individuals will receive a text message, whilst others, who are more likely to be experiencing hardship, will be proactively called so that we can work with our customers to achieve the best outcome for all. There are many other options; we could want to identify those most likely to miss their next payment for any number of reasons. Understanding why this analysis is seen as important is key to determining what approach we should adopt.
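As a sketch of how the planned action shapes what we need from the model, the snippet below routes a customer to one of the treatments described above. Everything in it is hypothetical: the function name, the action labels, the hardship indicator and the 0.3 threshold are placeholders for illustration, not recommended values.

```python
def choose_action(p_miss, likely_hardship, threshold=0.3):
    """Pick a treatment from a predicted probability of missing the next payment.

    p_miss: the model's predicted probability (0.0 to 1.0).
    likely_hardship: a separate, hypothetical indicator of probable hardship.
    threshold: hypothetical cut-off below which we take no action.
    """
    if p_miss < threshold:
        return "no_action"
    if likely_hardship:
        return "proactive_call"
    return "sms_reminder"


print(choose_action(0.10, likely_hardship=False))
print(choose_action(0.45, likely_hardship=False))
print(choose_action(0.45, likely_hardship=True))
```

If everyone above the threshold simply gets a text message, a single score is enough. Once different customers get different treatments, we also need to know something about why they were flagged.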


Once we have a better idea of why the analysis is important and what the likely actions are once we have our answer, then we can go about determining which machine learning algorithms are best suited to answer the challenge.


When trying to decide whether to use a linear or a non-linear algorithm, my first point of reference is always the concept of interpretability.


Do we need to understand why the prediction says what it says?


If the answer is a strong YES, then I tend to prefer linear algorithms. It is clear which features are used within the model, and we can readily understand the impact of each feature on the prediction. All the work that typically goes into feature engineering means that we also have an excellent understanding of how variables within our dataset interact, and have gone to significant lengths (or we should have) to ensure that we have built robust models that we can interpret.


If we don’t really need to understand why the record was flagged (that is, why a particular prediction was made), then a non-linear algorithm may give us a more accurate answer, and may better serve our purposes.


As with so many things in credit risk, there is no one-size-fits-all answer. Everything is context-dependent, and as analysts, we need to have an excellent understanding of not only the problem that we want to solve, but why it is important to solve, and what we’re likely to want to do with that knowledge.
