This article was published as a part of the Data Science Blogathon.
(Image source: iPhone Weather App)
A weather forecast screen like this should be a familiar picture to most of us. The AI model behind it predicts a 40% chance of rain today, a 50% chance for Wednesday, and a 50% chance on Thursday. The interesting part is that the AI/ML model here is talking about event probability. Now the question is: is this AI/ML model reliable?
As learners of Data Science/Machine Learning, we have all gone through phases where we build various ML models (both classification and regression models) and look at various model parameters that tell us how well a model performs. An important but perhaps not so well understood reliability parameter is model calibration. Calibration tells us how much we can trust a model's predicted probabilities. This article explores the basics of model calibration and its relevance in the MLOps cycle. Even though model calibration is also applicable to regression models, we will look specifically at classification examples to understand the basics.
Why is model calibration required?
Wikipedia defines calibration as follows: “In measurement technology and metrology, calibration is the comparison of measurement values delivered by a device under test with those of a calibration standard of known accuracy.”
A typical classification ML model outputs two significant pieces of information. One is the predicted class label (for example, classifying an email as spam or not spam), and the other is the predicted probability. For binary classification, the scikit-learn library provides the method model.predict_proba(test_data), which returns the probabilities of the target being 0 and 1 as an array. A model predicting rain can give us a 40% chance of rain and a 60% chance of no rain. We are interested in the uncertainty in the estimates of a classifier. There are specific use cases where the predicted probability of the model is of great interest to us, such as weather models, fraud detection models, customer churn models, etc. For example, we may be interested in answering the question: what is the probability that this customer will repay the loan?
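As a minimal sketch of what predict_proba returns (the data below is synthetic, made up purely for illustration):

```python
# Inspecting predicted class probabilities with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical labels

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X[:3])  # one row per sample: [P(y=0), P(y=1)]
print(proba.shape)        # (3, 2)
print(proba.sum(axis=1))  # each row sums to 1
```

Each row gives the probabilities of class 0 and class 1 for one sample; it is these probability estimates, not the hard labels, that calibration is concerned with.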
Suppose we have an ML model that predicts whether a patient has cancer based on certain characteristics. The model predicts that a particular patient does not have cancer (well, a happy scenario!), but if the predicted probability is 40%, the doctor may prefer to run a few more tests for a definite conclusion. This is a typical scenario where the predicted probability is important and of utmost interest to us. Model calibration helps us improve the predicted probabilities of the model so that its reliability improves. It also helps us interpret the predicted probabilities observed from the model. We cannot assume that a model is twice as confident when it gives a predicted probability of 0.8 as against a figure of 0.4, unless it is calibrated.
We must also understand that calibration is different from the accuracy of the model. Model accuracy is defined as the number of correct predictions divided by the total number of predictions made by the model. It should be clearly understood that a model can be accurate but not calibrated, and vice versa.
If we have a model predicting rain with an estimated probability of 80% at all times, then if we take data for 100 days and find it raining for 80 days, we can say that the model is well calibrated. In other words, calibration attempts to remove the bias in the predicted probability.
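The rain example above can be checked numerically: a constant 80% forecast is calibrated exactly when the empirical rain frequency matches it (the data here is made up to mirror the example):

```python
# Tiny numeric check of the rain example: a model that always predicts
# 0.8 is well calibrated if it actually rains on ~80 of 100 days.
import numpy as np

predicted = np.full(100, 0.8)             # constant 80% forecast
observed = np.array([1] * 80 + [0] * 20)  # rained on 80 of 100 days

print(predicted.mean())  # mean forecast: 0.8
print(observed.mean())   # empirical frequency: 0.8 -> well calibrated
```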
Consider a scenario where the ML model predicts whether a user shopping on an e-commerce website will buy another affiliate item. The model predicts that the user has a 68% chance of purchasing item A and a 73% chance of purchasing item B. Here we will present item B to the user (the higher predicted probability), and we are not interested in the actual probability values. In this scenario, we need not insist on strict calibration, as it is not so important for the application.
The following shows the predictions of 3 classifiers (assume each model predicts whether an image is a dog image or not). Which of these models is calibrated and therefore reliable?
(a) Model 1: 90% accuracy, 0.85 confidence in each prediction
(b) Model 2: 90% accuracy, 0.98 confidence in each prediction
(c) Model 3: 90% accuracy, 0.91 confidence in each prediction
If we look at the first model, it is under-confident in its predictions, whereas Model 2 seems overconfident. Model 3 appears well calibrated, giving us confidence in its capability: it thinks it is correct 91% of the time, and it is actually correct 90% of the time, indicating good calibration.
The calibration of the model can be checked by creating a calibration plot or a reliability plot. The calibration plot reveals the disparity between the probabilities predicted by the model and the actual class probabilities in the data. If the model is well calibrated, we expect to see a straight line at 45 degrees from the origin (hinting that the predicted probability is always the same as the empirical probability).
We will attempt to understand the calibration plot using a toy dataset to solidify our understanding of the topic.
The following data contains the predicted probabilities and true y values from a model. The data is easier to handle when sorted by predicted probability.
The predicted probabilities are divided into several bins representing possible ranges of outcomes. For example, 10 bins can be created: [0-0.1), [0.1-0.2), etc. For each bin, we calculate the percentage of positive samples. For a well-calibrated model, we expect the percentage of positive samples to correspond to the bin center. If we take the bin with the interval [0.9-1.0), the bin center is 0.95, and for a well-calibrated model we expect the percentage of positive samples (samples with label 1) to be 95%.
We can plot the mean predicted value (the midpoint of the bin) versus the fraction of true positives in each bin as a line plot to check the calibration of the model.
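The binning described above is exactly what scikit-learn's calibration_curve computes. A small sketch, using made-up toy values rather than the article's dataset:

```python
# Computing a reliability (calibration) curve for toy predictions.
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

# fraction of positives vs. mean predicted probability, per bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)
print(prob_true)  # empirical fraction of positives in each bin
print(prob_pred)  # mean predicted probability in each bin
# Plotting prob_pred against prob_true, together with the 45-degree
# diagonal, gives the calibration plot described above.
```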
We can see the difference between the ideal curve and the actual curve, which shows the need to calibrate our model. If the points are below the diagonal, it indicates that the model has overestimated (the model’s predicted probabilities are too high). If the points are above the diagonal, it can be inferred that the model has been under-confident in its predictions (the predicted probabilities are too low). Let’s also look at the reliability curve of a real-life random forest model in the image below.
If we look at the above plot, an S-curve (remember the sigmoid curve seen in logistic regression!) is commonly seen for some models. Such a model overestimates at low probabilities and underestimates at high probabilities. For the above curve, among the samples for which the model’s predicted probability is 30%, the fraction of true positives is only 10%, so the model was overestimating at low probabilities.
The toy dataset shown above is for understanding only. In practice, the choice of bin size depends on the amount of data we have: we would like to have enough points in each bin so that the standard error of the mean of each bin is small.
We don’t need to rely on visual inspection alone to estimate model calibration. Calibration can be measured using the Brier score. The Brier score is similar to the mean squared error but is used in a slightly different context. It takes values from 0 to 1, with 0 meaning perfect calibration; the lower the Brier score, the better the model calibration.
The Brier score is a statistical metric used to measure the accuracy of probabilistic predictions. It is mostly used for binary classification.
Let’s say a probabilistic model predicts a 90% chance of rain on a particular day, and it actually rains on that day. The Brier score can be calculated using the following formula:
Brier score = (forecast − outcome)²
In the above case, the Brier score is calculated as (0.90 − 1)² = 0.01.
The Brier score for a set of observations is the average of the individual Brier scores.
On the other hand, if a model predicts with a 97% probability that it will rain but it does not rain, then the calculated Brier score will be:
Brier score = (0.97 − 0)² = 0.9409. A lower Brier score is better.
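The two rain examples above can be verified in a few lines, both by hand and with scikit-learn's brier_score_loss:

```python
# The two rain forecasts above, scored by hand and with scikit-learn.
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0])        # day 1: it rained; day 2: it did not
y_prob = np.array([0.90, 0.97])  # forecast probabilities of rain

# mean of the squared differences: ((0.90-1)^2 + (0.97-0)^2) / 2
manual = np.mean((y_prob - y_true) ** 2)
print(round(manual, 5))                  # 0.47545 = (0.01 + 0.9409) / 2
print(brier_score_loss(y_true, y_prob))  # same value
```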
Now, let’s try to get a glimpse of how the calibration process works without getting into too much detail.
Some algorithms, such as logistic regression, show good inherent calibration and may not require calibration. On the other hand, models like SVMs, decision trees, etc., can benefit from calibration. Calibration is a rescaling process applied after a model has made its predictions.
There are two popular methods for calibrating the probabilities of ML models, viz.,
(a) Platt scaling
(b) Isotonic regression
It is not the intention of this article to get into the details of the math behind the implementation of the above approaches. However, let’s look at both ways from a ringside perspective.
Platt scaling is used for small datasets whose reliability curves have a sigmoid shape. This can be loosely understood as applying a sigmoid curve on top of the calibration plot to modify the model’s predicted probabilities.
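As a loose sketch of the idea (not a full implementation), Platt scaling amounts to fitting a one-dimensional logistic regression that maps the model's raw scores to probabilities; the scores and held-out labels below are made up for illustration:

```python
# Sketch of Platt scaling: fit a sigmoid (1-D logistic regression)
# mapping a classifier's raw scores to calibrated probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]).reshape(-1, 1)
labels = np.array([0, 0, 0, 1, 0, 1, 1])  # hypothetical held-out labels

platt = LogisticRegression().fit(scores, labels)  # learns sigmoid(a*s + b)
calibrated = platt.predict_proba(scores)[:, 1]
print(calibrated)  # increases monotonically with the score
```

In practice the sigmoid is fitted on a held-out set (or via cross-validation), not on the data the base model was trained on, to avoid biased calibration.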
The above images show how the curve is modified by applying the Platt calibrator curve to the model’s reliability curve. It is observed that the points of the calibration curve are drawn towards the ideal line (dotted line) during the calibration process.
Isotonic regression is a more complex approach and requires more data. The main advantage of isotonic regression is that it does not require the model’s reliability curve to be S-shaped. However, this method is sensitive to outliers and works well for large datasets.
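A minimal sketch of the isotonic approach with scikit-learn, on made-up data: it fits a non-decreasing step function from predicted probabilities to observed outcomes, with no S-shape assumption.

```python
# Sketch of isotonic calibration: fit a non-decreasing step function
# from predicted probabilities to observed 0/1 outcomes.
import numpy as np
from sklearn.isotonic import IsotonicRegression

y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
y_true = np.array([0,   0,   1,   0,   0,   1,   1,   1,   1])

iso = IsotonicRegression(out_of_bounds="clip").fit(y_prob, y_true)
calibrated = iso.predict(y_prob)
print(calibrated)  # non-decreasing values in [0, 1]
```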
It is noted that standard libraries such as sklearn support easy model calibration (sklearn.calibration.CalibratedClassifierCV) for practical implementation during model development.
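A brief sketch of that API on synthetic data (the dataset and choice of a linear SVM are illustrative assumptions): CalibratedClassifierCV wraps a base estimator and applies Platt scaling (method="sigmoid") or isotonic regression (method="isotonic") via cross-validation.

```python
# Wrapping an SVM, which outputs scores rather than probabilities,
# with CalibratedClassifierCV to obtain calibrated probabilities.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical labels

base = LinearSVC()
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3).fit(X, y)
proba = calibrated.predict_proba(X[:5])
print(proba)  # calibrated probabilities; each row sums to 1
```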
Impact on performance
It is worth noting that calibration modifies the outputs of the trained ML model, so it may also affect the accuracy of the model. After calibration, some values close to the decision threshold (say 50% for binary classification) can be modified in such a way as to produce a different output label than before calibration. The effect on accuracy is rarely large, and it is important to note that calibration improves the reliability of the ML model.
In this article, we have looked at the theoretical background of model calibration. Calibration of machine learning models is an important but often overlooked aspect of developing a reliable model. Following are the highlights of our lessons:
(a) Model calibration gives an insight or understanding of the uncertainty in the model’s prediction and, in turn, the reliability of the model as understood by the end user, especially in critical applications.
(b) Model calibration is extremely valuable to us in cases where the estimated probability is of interest.
(c) The reliability curve and the Brier score give us an estimate of the calibration level of a model.
(d) Platt scaling and isotonic regression are popular methods for calibrating models and improving their estimated probabilities.
Where shall we go from here? The purpose of this article is to give you a basic understanding of model calibration. We can build on this further by exploring actual implementations using standard Python libraries such as scikit-learn for use cases.
The media shown in this article is not owned by Analytics Vidhya and is used at the sole discretion of the author.