Performance Measure of a Machine Learning Model

VARSHITHA GUDIMALLA

Building a machine learning model and making predictions is not the whole task. Our end goal is a model that performs well on out-of-sample data, so it is important to check performance metrics before relying on its predictions.

Performance Metrics for a Regression Model in Machine Learning

The performance of a regression model is usually measured in four different ways:

  1. Mean Absolute Error
  2. Mean Squared Error
  3. Root Mean Squared Error
  4. R² score

Mean Absolute Error (MAE)

It is the sum of the absolute values of the residuals divided by the number of terms:

MAE = (1/n) Σ |yᵢ − ŷᵢ|

where yᵢ is the actual expected output and ŷᵢ is the model's prediction.

residual = actual_value − predicted_value
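As a quick illustration, here is a minimal sketch of computing MAE both by hand and with scikit-learn; the arrays below are made-up example values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# made-up actual and predicted values, just for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MAE by hand: mean of the absolute residuals
mae_manual = np.mean(np.abs(y_true - y_pred))

# MAE with scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)  # both give 0.5
```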

Mean Squared Error (MSE)

It is the sum of the squares of the residuals divided by the number of terms:

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

where yᵢ is the actual expected output and ŷᵢ is the model's prediction.
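Using the same illustrative arrays as above, MSE can be computed by hand or with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE by hand: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# MSE with scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both give 0.375
```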

Root Mean Squared Error (RMSE)

Since MSE contains squared error terms, taking the square root of the MSE gives the Root Mean Squared Error (RMSE), which is expressed in the same units as the target variable:

RMSE = √MSE = √( (1/n) Σ (yᵢ − ŷᵢ)² )
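A minimal sketch, continuing the same example: RMSE is simply the square root of the MSE computed above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE = square root of MSE, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # about 0.612
```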

R² Score

We know that a model with a smaller error is better than a model with a larger error. The metrics above give us absolute error values, but they do not directly tell us how good the model is. The answer can be the R² score, which measures how well the model explains the variance in the data and can be interpreted as a percentage. Its formula is:

R² = 1 − ( Σ (yᵢ − ŷᵢ)² ) / ( Σ (yᵢ − ȳ)² )

where yᵢ is the actual expected output, ȳ is the mean of the actual outputs, and ŷᵢ is the model's prediction.
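Here is a minimal sketch of the formula and of scikit-learn's r2_score on the same illustrative arrays:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R2 by hand: 1 - (sum of squared residuals / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# R2 with scikit-learn
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)  # about 0.949
```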

Performance Metrics for a Classification Model in Machine Learning

Evaluating a classifier is often significantly trickier than evaluating a regressor. Many performance measures are available; the two we are going to discuss here are:

  1. Measuring accuracy using cross-validation
  2. The confusion matrix

Measuring accuracy using cross-validation

In this approach, the training set is split into a smaller training set and a validation set; the model is trained on the smaller training set and evaluated against the validation set. It's a bit of work, but nothing too difficult, and it works fairly well. A great alternative is Scikit-Learn's K-fold cross-validation feature, where k is the number of folds the training set is split into, and hence the number of times the train-and-evaluate process is repeated.

If k is set to 10, the training set is divided into 10 distinct subsets called folds; the model is then trained and evaluated 10 times, picking a different fold for evaluation each time and training on the other 9 folds. The result is an array containing the 10 evaluation scores.

Notice that cross-validation gives us not only an estimate of the model's performance, but also a measure of how precise that estimate is (for example, its standard deviation).
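As a rough sketch, here is how 10-fold cross-validation might look with scikit-learn's cross_val_score; the Iris dataset and logistic-regression classifier are only placeholders for your own data and model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# placeholder data and model, just for illustration
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# cv=10: split the data into 10 folds, train on 9 and evaluate on the
# remaining fold, repeating 10 times; the result is an array of 10 scores
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")

print(scores)                       # 10 accuracy scores, one per fold
print(scores.mean(), scores.std())  # performance estimate and its spread
```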

Confusion matrix

A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B, for every pair of classes A and B.

A confusion matrix is an N x N matrix where N is the number of classes. Each row in a confusion matrix represents an actual class, while each column represents a predicted class.

For a binary classifier the four entries are TP (true positives), FP (false positives), FN (false negatives), and TN (true negatives): the row for the actual negative class holds TN and FP, and the row for the actual positive class holds FN and TP.

A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal.
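Below is a minimal sketch using scikit-learn's confusion_matrix on made-up binary labels; rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# made-up actual and predicted labels for a binary classifier
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 3]]
```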

The confusion matrix gives us a lot of information, but sometimes we may prefer a more concise metric. An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier:

precision = TP / (TP + FP)

where TP is the number of true positives and FP is the number of false positives.

A trivial way to have perfect precision is to make one single positive prediction and ensure it is correct (precision = 1/1 = 100%). This would not be very useful, since the classifier would ignore all but one positive instance. So precision is typically used along with another metric named recall, also called sensitivity or the true positive rate:

recall = TP / (TP + FN)

where FN is of course the number of false negatives.

It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall)
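Here is a minimal sketch computing precision, recall, and the F1 score with scikit-learn on the same made-up labels used in the confusion-matrix example above.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# same made-up labels as in the confusion-matrix example
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)  # 0.75 0.75 0.75
```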
