Carolyn Röhm

4 Machine Learning mistakes to avoid by doing the fundamentals well

Top tip: It’s not about the tools!


We’re still drowning in commentary about ChatGPT and Bard and all sorts of other exciting AI stuff.


Cool!


But you know what? If you want to do AI and other advanced and sexy ML stuff, you need to start with the fundamentals.

You know that lovely adage, ‘Everyone wants to change the world, but no one wants to make their bed’? I reckon it applies to Machine Learning too.

Everyone wants to change the world (Image by Carolyn Röhm - midjourney)

We have to, we MUST, start with the fundamentals. No amount of sexy awesome modelling will overcome poor data preparation or a poor understanding of the problem we want to solve.


Sorry (not sorry) to be a party-pooper, but that is reality.


I’ve been developing models for years; I’ve coached and mentored analysts as they progress through their machine-learning journeys.


Here are the most common mistakes I see:

Mistake Number 1: Starting at the beginning, not at the end


Starting at the beginning seems logical.


It’s the advice we’re usually given.


But, when it comes to modelling, even though it may be counter-intuitive, we actually need to start at the end.


What do I mean by ‘start at the end’?


We need to know what a ‘good’ outcome looks like. We need a detailed understanding of how the solution will be implemented and used going forwards, and we need to know which teams will be impacted by it.


In addition to this, we need to understand the processes that may be impacted. If, for example, we do a number of data transformations to ensure a better, more robust model, then we need to understand how we will ensure that those same data transformations can be applied to the data once the model has been implemented. If we don’t take this into consideration, we may find that our model is worthless, as it sits on a shelf gathering dust.
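As a minimal sketch of what that can look like in practice: the parameters of a transformation are learned once from the training data, saved, and then reused unchanged when the model is scoring new data. The income values, the log-then-standardise transformation and the function names below are all assumptions made for illustration, not a description of any particular project.

```python
import json
import math
from statistics import mean, pstdev


def fit_transform_params(training_incomes):
    """Learn the transformation parameters from the TRAINING data only."""
    logged = [math.log1p(x) for x in training_incomes]
    # Fall back to 1.0 if every value is identical, to avoid dividing by zero.
    return {"mean": mean(logged), "std": pstdev(logged) or 1.0}


def apply_transform(income, params):
    """Apply exactly the same log + standardise steps to a single new value."""
    return (math.log1p(income) - params["mean"]) / params["std"]


# Development time: fit on training data and persist the parameters.
training_incomes = [20_000, 35_000, 50_000, 80_000, 120_000]  # made-up values
params = fit_transform_params(training_incomes)
saved = json.dumps(params)  # in practice, stored alongside the model itself

# Implementation time: reload the saved parameters and reuse them unchanged.
loaded = json.loads(saved)
scored_value = apply_transform(50_000, loaded)
```

The point of the design is that nothing is re-estimated at implementation time. If the scoring system cannot reproduce `apply_transform` with the saved parameters, that is worth knowing before the model is built, not after.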


Mistake Number 2: Not taking adequate time and care reviewing, understanding and resolving data quality concerns

I get it; data wrangling, cleaning, imputing and scrubbing are not everyone’s cup of tea. It’s not necessarily the most exciting thing ever; in fact, I know several people who would rather stab themselves in the eye with a fork than have to clean data.


But…


…if you want to be an effective data scientist, you had better develop a love of all things data, not just throwing cool algorithms at problems.


Recently, I was reviewing models for a client. The client was concerned about a number of things, chief amongst them the rate at which the models seemed to lose their predictive power.

When models are embedded in your processes and have a direct impact on sales AND operational efficiencies, their effectiveness, or lack thereof, quickly becomes a top priority.

It wasn’t too difficult to spot the cause of a number of issues. First and foremost, it seemed like absolutely no data cleaning, transformations or classing had been applied to any of the fields. Some fields were, by their nature, clean. Others, not so much.


Consider what happens when a field is a 250-character field.


If it is necessary to create a field that is 250 chars long, it is usually because at least some of the contents of that field will use all 250 chars. Because it is a character field and 250 chars long, there is an extraordinary number of combinations that could go into that field. And then we need to consider what type of information is captured in that field.


If it is address data, which is manually captured, I’m quite prepared to bet a coffee that, unless someone has spent a LOT of time working on that data, it will be scrappy and messy.

If the field contains product information, then I would definitely want to have a much closer look at that info. How many distinct values are observed in the training data? Over what time-period? Why does the field need to be quite so long? Is there any ‘intelligence’ within the data that we could perhaps leverage and create a product grouping?
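Those first questions can be answered with a quick profile of the field. The sketch below is one possible way to do it, using made-up product descriptions; the function name `profile_field` and the `rare_threshold` parameter are hypothetical, and the light clean-up (trimming and lower-casing) is an assumption about what counts as ‘the same’ value.

```python
from collections import Counter


def profile_field(values, rare_threshold=2):
    """Summarise how many distinct values a field has and how many are rare."""
    # Drop missing/blank entries, then trim and lower-case so that trivial
    # differences in capture do not count as distinct values.
    cleaned = [v.strip().lower() for v in values if v and v.strip()]
    counts = Counter(cleaned)
    rare = [v for v, n in counts.items() if n < rare_threshold]
    return {
        "rows": len(values),
        "missing": len(values) - len(cleaned),
        "distinct": len(counts),
        "distinct_ratio": len(counts) / len(cleaned) if cleaned else 0.0,
        "rare_values": len(rare),
        "max_length": max((len(v) for v in cleaned), default=0),
    }


# Made-up product descriptions, purely for illustration.
products = ["Home Loan - Fixed", "home loan - fixed ", "Car Loan", "", "Credit Card Gold", None]
summary = profile_field(products)
```

A distinct ratio close to 1, or a long tail of values seen only once or twice, is a hint that the field probably needs grouping or classing before it goes anywhere near a model.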


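As it turns out, for this particular client, their 250-char field contained many distinct values. With that many distinct values, a model can memorise individual values rather than learn general patterns, and this means that the risk of overfitting the model just increased materially.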


Overfitting = BAD


Why?


If a model is overfitted, then even if it appears to generalise well enough to the validation sample, it will not generalise well in the real world.


😔😔😔


And if it does not generalise well in the real world, then it will not be as predictive as expected.


😔😔😔


Basically, it will stop working. And this happens because the training (and potentially the validation) data are too different from the new data being used.
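One common way to put a number on ‘too different’ is the population stability index (PSI), which compares the share of each category in the training data with its share in the new data. The sketch below is a simple version for a categorical field; the sample data is made up, and the `floor` parameter is an assumption added so that categories missing from one sample do not break the calculation.

```python
import math
from collections import Counter


def population_stability_index(expected, actual, floor=1e-4):
    """PSI between two samples of a categorical field.

    expected: values seen in the training data
    actual:   values seen in the new (live) data
    """
    exp_counts, act_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for category in set(exp_counts) | set(act_counts):
        # Floor the shares so a category missing from one sample does not
        # cause a division by zero or log(0).
        e = max(exp_counts[category] / len(expected), floor)
        a = max(act_counts[category] / len(actual), floor)
        psi += (a - e) * math.log(a / e)
    return psi


# Made-up samples, purely for illustration.
training = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
live_same = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
live_shifted = ["A"] * 20 + ["B"] * 20 + ["D"] * 60  # "D" never seen in training

stable = population_stability_index(training, live_same)
shifted = population_stability_index(training, live_shifted)
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions rather than hard rules.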


Mistake Number 3: Not understanding ML algorithms properly, their key assumptions and how to get the most out of them

In many respects, machine learning algorithms are remarkable. It is possible to take raw data that hasn’t been scrubbed, cleaned or transformed, and throw it at a machine learning algorithm.


 

The ML algorithm will (90% of the time; unless your data is extremely unbalanced) produce a model that is more predictive than the flip of a coin.


This is a good, and a bad, thing.


It’s great because it means that, with even a rudimentary understanding of ML algorithms, people can create models that are more predictive than flipping a coin.


It’s bad because it means that, with even a rudimentary understanding of ML algorithms, people can create models that are more predictive than flipping a coin.


 

The issue is that models produced like this are likely to be overfitted and more likely not to generalise well.


Certain algorithms, particularly linear algorithms, are grounded in several key assumptions, and violating these key assumptions results in models that are unreliable and often overfitted.


Other algorithms, particularly CART-based algorithms, are very prone to overfitting because they keep splitting on features to get the most predictive output on the training data. If adequate stopping criteria (such as a maximum tree depth or a minimum number of observations per leaf) aren’t implemented when the model is trained, it is more likely to be overfitted. And if an XGBoost algorithm is used, it is also likely to become a really BIG model. And this could negatively impact operational efficiencies.


All algorithms have strengths and weaknesses, and they will all do their best to deliver a model. There is a very good chance that the models produced will be predictive.

As modellers, as data scientists or analysts, we must understand and work with the ML algorithms’ fundamental assumptions, strengths and weaknesses to deliver the best, most robust models.


In my client’s case, the model developed was an XGBoost model; unsurprisingly, it came out top. But, because of the data issues described above, it was strongly overfitted on the training data, and because the validation data was from the same time period, it seemed to generalise well enough when validated. Unfortunately, the contents of the 250-char field changed over time, and no out-of-time sample was used to validate the model.


When the model was implemented, it failed to generalise as expected and quickly lost predictive power.
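An out-of-time sample is simply a holdout taken from a later period than the data used to build the model. The sketch below shows one way to create one and to check how much of a field’s content in the later period was never seen during development. The records, the cutoff date and the function names are all made up for illustration; this is not the client’s data or process.

```python
from datetime import date


def out_of_time_split(records, cutoff):
    """Split records so the holdout starts at the cutoff, after the development window."""
    development = [r for r in records if r["date"] < cutoff]
    out_of_time = [r for r in records if r["date"] >= cutoff]
    return development, out_of_time


def unseen_share(development, out_of_time, field):
    """Share of out-of-time rows whose value never appeared in development."""
    known = {r[field] for r in development}
    if not out_of_time:
        return 0.0
    unseen = sum(1 for r in out_of_time if r[field] not in known)
    return unseen / len(out_of_time)


# Made-up records, purely for illustration.
records = [
    {"date": date(2022, 1, 10), "product": "loan v1"},
    {"date": date(2022, 3, 5), "product": "card v1"},
    {"date": date(2022, 6, 20), "product": "loan v1"},
    {"date": date(2022, 9, 1), "product": "loan v2"},
    {"date": date(2022, 11, 15), "product": "card v1"},
]
development, out_of_time = out_of_time_split(records, cutoff=date(2022, 7, 1))
share = unseen_share(development, out_of_time, "product")
```

Validating on `out_of_time` rather than on a random slice of the same period gives the model a chance to fail before it is implemented. A high unseen share is an early warning that the field’s contents are drifting.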


Mistake Number 4: Failing to take other impacted parties into consideration

The entire point of building models is to make an impact; to drive change. As analytical teams, we do not work in a vacuum; we rely on other teams to supply us with data and SME knowledge. And many teams rely on our work to make their lives easier.


But if we don’t take into consideration the impact of model changes on these teams, especially the operations teams at the coal face, we may end up facing significant resistance.


This usually happens after poor experiences.


I’ve lost count of the number of operations teams I’ve worked with who are very wary of anyone claiming to be from an analytical or modelling team. Their experiences have often been negative, with some modellers acting as though they’re better than the operations teams — despite never having done the job or spent any amount of time or energy learning about what the operations teams actually do.


Work with the teams impacted by changes resulting from your models being implemented. Most analysts cannot work with customers day in and day out. If your model is going to impact your customer-facing teams, make sure those teams understand what that impact is going to be. Make sure you have their trust. Take them on the journey. Listen, really listen, to their concerns, and then ensure that you have mitigated them.


If you don’t spend time building this trust, do not be surprised if you discover some serious operational negation going on.
