ML and Statistical Modelling – how it’s done
Previously, we’ve discussed the importance of deploying statistical modelling and Machine Learning (ML) tools to provide access to timely situational awareness. Facilitating preparedness across a wide range of incident response use cases, modelling makes our work and suite of products possible. But there are many factors to consider when creating these models before they can be deployed.
Statistical modelling and Machine Learning, a quick definition
It can be easy to refer to Machine Learning models as statistical modelling, and vice versa, but conflating the two can lead to misunderstandings later on.
Put simply, statistical modelling is the process of applying statistical analysis to a set of data. When done correctly, a statistical model captures the relationships in the available data and can be used to estimate the probability of related events.
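As a minimal sketch of this idea (the observations and threshold below are made up for illustration, not drawn from any real dataset), a statistical model might be as simple as fitting a probability distribution to historical values and using it to estimate the chance of a future event:

```python
import numpy as np
from scipy import stats

# Illustrative historical observations of some quantity of interest.
observations = np.array([2.1, 2.4, 1.9, 2.8, 3.0, 2.2, 2.6, 2.5, 2.3, 2.7])

# A simple statistical model: assume the values follow a normal distribution
# and estimate its parameters from the available data.
mean, std = stats.norm.fit(observations)

# Use the fitted model to estimate the probability of a related event,
# e.g. the quantity exceeding a threshold of 3.0 (threshold is hypothetical).
threshold = 3.0
prob_exceed = 1 - stats.norm.cdf(threshold, loc=mean, scale=std)
print(f"Estimated P(value > {threshold}): {prob_exceed:.3f}")
```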
Machine Learning, on the other hand, involves training an algorithm to make choices and decisions without human intervention. For example, when streaming services recommend songs or shows based on your previous interests, this is Machine Learning in action.
ML modelling takes statistical modelling to the next level, allowing a wider variety of models to be developed that are more flexible to the properties of the data and not constrained by traditional probability distributions. They are also able to infer connections in the data that statistical analysis alone may not have revealed. These capabilities provide our users with the intelligence needed to respond to a vast array of incidents for timely and effective impact mitigation.
How are statistical models created?
Before any model is created, whether it’s Machine Learning or statistical analysis, two core factors must be reviewed:
- What is the end objective of the models?
By establishing a clear objective, users can select the right model for the task, making it a valuable asset that's worth the investment.
- What data is needed and how can we guarantee quality?
By considering the quality of the data and the governance structures (from collection to security) that will be put in place throughout, users can ensure reliability and determine whether any additional data is needed.
With these two questions answered, users can begin constructing models. But what model should be used?
The different models explained
Statistical modelling can take on many forms, each one suited for a specific objective. Knowing the end goals of your models will inform your selection, ensuring that they provide the insights needed to achieve actionable intelligence. Some of the most common models are regression and classification models.
Regression models
Regression models are used to examine the relationships between sets of variables, and to identify which independent variables (the causes) have a knock-on effect on dependent variables (the affected outcomes).
This produces information about trends that can be leveraged to make essential strategic decisions. There are many different forms of regression, including linear regression, stepwise regression, ridge regression, lasso regression, and elastic net regression.
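As a simple illustration of two of these forms (using scikit-learn and synthetic data rather than any specific operational dataset), the sketch below fits a linear regression and a ridge regression to the same toy data and inspects how each independent variable influences the dependent variable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: two independent variables (the causes) and one
# dependent variable (the affected outcome).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Ordinary linear regression estimates the effect of each independent variable.
linear = LinearRegression().fit(X, y)
print("Linear coefficients:", linear.coef_)

# Ridge regression adds regularisation, which can help when the independent
# variables are correlated or the data are noisy.
ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge coefficients:", ridge.coef_)
```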
Classification Models
As a form of Machine Learning, classification models are trained to assign data to the correct categories, with the aim of accurately labelling future data that the model hasn't seen before.
Using a pre-analysed (labelled) historical dataset, a classification model learns patterns that allow it to classify new data not held in that dataset. This is extremely important, as it enables users to move from historical data to streaming data for proactive capabilities like incident modelling.
Some forms of common classification models include:
- Decision trees
- Random forests
- Nearest neighbour analysis
Complex and advanced types of classification models exist as well, such as neural networks.
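A minimal sketch of this workflow is shown below, using a random forest and a synthetic dataset as a stand-in for a labelled historical dataset (the data and parameters are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a pre-analysed (labelled) historical dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold some data back to play the role of future, unseen observations.
X_hist, X_new, y_hist, y_new = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest on the historical data...
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_hist, y_hist)

# ...then classify data the model has never seen before.
print("Accuracy on unseen data:", model.score(X_new, y_new))
```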
Ensuring that models are value-aligned
Once clear objectives have informed the choice of model type, users should make sure that every part of the system is aligned to deliver as much value as possible. To do this, explore some or all of the following points:
- Who will own this project?
- How is project success defined?
- Are the training and testing data of acceptable quality and quantity? (A basic split check is sketched after this list.)
- How resilient are the ML algorithms to errors in the data?
- Can this project use pre-trained models to save valuable time?
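As a quick, hedged sketch of the data quantity and quality check mentioned above (the file name and its columns are hypothetical), one might inspect the available data and reserve a test set before any training begins:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labelled dataset; the path is purely illustrative.
df = pd.read_csv("historical_incidents.csv")

# Quick quality checks: how much data is there, and how much is missing?
print("Rows available:", len(df))
print("Missing values per column:")
print(df.isna().sum())

# Hold back a test set so model performance can be judged on data
# that was not used for training.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
print("Training rows:", len(train_df), "| Testing rows:", len(test_df))
```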
Models and data quality
When working with both statistical modelling and Machine Learning models, analysts need to ensure that data is consistently reliable and secure, so that all findings are accurate.
The quality of data has a significant effect on the overall effectiveness of models. If a model learns from poor data, it will be ineffective for real-world projects. It's important to note that when we consider data quality, coverage (whether the data represents the full range of conditions the model will encounter) has a distinct role to play.
Throughout construction and training, data should be regularly cleaned and checked for standardisation. This process also identifies gaps in the data and areas of possible bias.
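A minimal sketch of these checks is shown below, assuming a hypothetical tabular dataset (the file and column names are illustrative): duplicates and missing values are removed, the balance of one categorical column is inspected for possible bias, and numeric features are standardised.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw dataset; the path and columns are purely illustrative.
df = pd.read_csv("sensor_readings.csv")

# Basic cleaning: drop exact duplicates and rows with missing values.
df = df.drop_duplicates().dropna()

# Check coverage/bias: are some regions heavily over- or under-represented?
print(df["region"].value_counts(normalize=True))

# Standardise numeric features so they are on a comparable scale.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```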
Underfitting and overfitting can have disastrous consequences for a model's performance. Learn more about these consequences here.
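To give a rough sense of the effect (a toy example with synthetic data, not a definitive demonstration), the sketch below fits polynomial regressions of different complexity to the same noisy data; a too-simple model typically underfits and an overly complex one typically overfits, which shows up in the cross-validation scores:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Noisy synthetic data generated from a smooth underlying curve.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)

# Degree 1 tends to underfit, degree 15 tends to overfit, degree 4 is a
# reasonable middle ground; compare mean cross-validated R^2 scores.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree={degree:2d}  mean CV R^2={score:.2f}")
```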
Retraining with new data
Once your statistical models and Machine Learning tools have been deployed, it’s important to incorporate new data by retraining models to ensure that they remain high-quality, reliable, and relevant.
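One common way to do this (a sketch only, with hypothetical file names standing in for the original and newly collected labelled data) is to combine old and new data and retrain the model from scratch so that it reflects the most recent conditions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical arrays standing in for the original training data and for
# newly collected, labelled observations.
X_original, y_original = np.load("X_original.npy"), np.load("y_original.npy")
X_new, y_new = np.load("X_new.npy"), np.load("y_new.npy")

# Combine old and new data, then retrain the model on the full set.
X_combined = np.concatenate([X_original, X_new])
y_combined = np.concatenate([y_original, y_new])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_combined, y_combined)
```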
At Riskaware, we use statistical modelling and Machine Learning models to enable a wide range of use cases – from oil spill response to CBRNE mitigation.