In this blog, we discuss some important topics in machine learning: bias, variance, underfitting, overfitting, and the performance metrics used in classification and regression analysis.
1. Bias
Bias is the difference between the actual value and the value a model predicts. In machine learning, data is fed to the model; the model finds patterns in the data and learns from them.
While creating a machine learning model, we should aim for a generalized model that performs well on new data. If the difference between the actual and predicted values is large, the model has high bias, which leads to underfitting.
2. Variance
A machine learning model needs a sufficient amount of data to understand the patterns in it and later make predictions on test data. However, a dataset may also contain unnecessary data (noise) that can mislead the model.
If the model is affected by small fluctuations in the dataset, its predicted values will be widely scattered. This scatter is called variance. High variance causes large errors on test data and results in overfitting.
Let’s visually understand the terms low bias, high bias, low variance, and high variance.
If a model has low bias, the predicted values will be close to the actual value, and if it has low variance, the predicted values will be less scattered. As shown in the first section of the figure, the predicted values are tightly clustered and close to the actual value; the smaller circle represents the actual value.
In the second section of the figure, the model has low bias and high variance, so the predicted values are close to the actual value but highly scattered.
In the third section of the figure, the model has high bias and low variance, so the predicted values are far from the actual value but tightly clustered.
In the fourth section of the figure, the model has high bias and high variance, so the predicted values are far from the actual value and highly scattered.
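The four cases above can also be checked numerically. A minimal sketch with synthetic numbers (not a real model): treat repeated predictions as samples, measure bias as the average signed distance from the true value and variance as the spread of the predictions.

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 10.0

# Simulated predictions from two hypothetical models:
# one with a systematic offset (high bias), one with wide scatter (high variance).
high_bias_preds = true_value + 3.0 + rng.normal(0.0, 0.5, size=10_000)
high_var_preds = true_value + rng.normal(0.0, 4.0, size=10_000)

def bias(preds, target):
    """Average signed distance between the predictions and the true value."""
    return preds.mean() - target

def variance(preds):
    """Spread of the predictions around their own mean."""
    return preds.var()

print(f"high-bias model    : bias={bias(high_bias_preds, true_value):.2f}, "
      f"variance={variance(high_bias_preds):.2f}")
print(f"high-variance model: bias={bias(high_var_preds, true_value):.2f}, "
      f"variance={variance(high_var_preds):.2f}")
```

The first model lands consistently off-target (large bias, small scatter); the second centers on the target but scatters widely, matching the second and third sections of the figure.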
3. Underfitting
If the dataset is not large enough, the machine learning model cannot find the patterns in it, and the algorithm will be unable to learn from the data it is fed. This is underfitting: the resulting model gives high errors on training data as well as test data.
4. Overfitting
When we provide the dataset, the algorithm looks at it a number of times to find patterns, and because of this the model learns from the noise too. This makes the model overly complex: it fits the training data exactly but gives high errors on new data.
This is what overfitting looks like: the model is complex and gives high accuracy on training data but large errors on test data or any other new data.
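Overfitting can be reproduced in a few lines. A sketch with synthetic data: fit a low-degree and a very high-degree polynomial to the same noisy training points (the data and degrees here are illustrative choices, not from the original post).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple linear relationship y = 2x + noise.
x_train = np.linspace(0.0, 1.0, 20)
y_train = 2.0 * x_train + rng.normal(0.0, 0.3, size=x_train.size)
x_test = np.linspace(0.025, 0.975, 20)   # unseen points in the same range
y_test = 2.0 * x_test + rng.normal(0.0, 0.3, size=x_test.size)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

results = {}
for degree in (1, 15):
    fit = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(y_train, fit(x_train)), mse(y_test, fit(x_test)))
    print(f"degree {degree:2d}: train MSE={results[degree][0]:.3f}, "
          f"test MSE={results[degree][1]:.3f}")
```

The degree-15 model chases the noise: its training error drops well below that of the straight line, while its test error stays clearly above its own training error, which is exactly the train/test gap described above.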
5. Confusion matrix
A confusion matrix is a tool for observing model performance in classification problems. The terms used in the confusion matrix are:
True Positive (TP): correct prediction of an event case. For example, predicting that a person has cancer when the person actually has cancer.
False Positive (FP): wrong prediction of an event case. For example, predicting that a person has cancer when the person does not have cancer.
False Negative (FN): wrong prediction of a non-event case. For example, predicting that a person does not have cancer when the person actually has cancer.
True Negative (TN): correct prediction of a non-event case. For example, predicting that a person does not have cancer when the person indeed does not have cancer.
With the confusion matrix, we calculate the accuracy of our model as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
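Counting the four cells by hand makes the definitions concrete. A small sketch in plain Python (the labels are made up; 1 = event, 0 = non-event):

```python
# 1 = has cancer (event), 0 = does not (non-event); toy labels for illustration.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # correct event
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # wrong event
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # missed event
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # correct non-event

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} accuracy={accuracy:.2f}")
```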
6. Precision
Precision is the ratio of True Positives (TP) to the total predicted positives (TP + FP). It tells us how many of the values the model predicted as positive are actually positive. The formula for calculating precision is:

Precision = TP / (TP + FP)

A precision of 60% means that out of 100 values the model predicted as positive, 60 are True Positives (TP).
7. Recall
Recall measures how accurately our model finds the True Positives (TP) among all actual positives. It is the ratio of True Positives (TP) to the total actual positives (TP + FN). The formula for recall is:

Recall = TP / (TP + FN)

A recall of 80% means that our model has predicted 80 True Positives (TP) out of 100 actual positive values.
8. F1-score
We need to decide which metric (precision or recall) to prioritize when evaluating a model. Sometimes precision matters more than recall, and sometimes the reverse. The f1-score combines both by taking their harmonic mean. The formula for the f1-score is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
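The three formulas can be written directly from the confusion-matrix counts. The TP, FP, and FN values below are illustrative numbers chosen to match the 60% precision and 80% recall examples above:

```python
tp, fp, fn = 60, 40, 15   # hypothetical counts from a classifier

precision = tp / (tp + fp)                          # TP / total predicted positive
recall = tp / (tp + fn)                             # TP / total actual positive
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Note that the harmonic mean pulls the f1-score toward the smaller of the two values, so a model cannot score well by excelling at only one of precision or recall.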
9. Left skewed distribution
A left-skewed distribution is one whose curve is elongated towards the left. In this type of distribution, the relationship between the mean, median, and mode is mean < median < mode. Let’s see what a left-skewed distribution graph looks like.
Example: mortality across age groups. The average person lives around 70–80 years, so deaths peak there; deaths at younger ages are comparatively few, forming the long left tail, and very few people live beyond 100 years.
10. Right skewed distribution
A right-skewed distribution is one whose curve is elongated towards the right side. In this type of distribution, the relationship between the mean, median, and mode is mean > median > mode. Let’s see what a right-skewed distribution graph looks like.
Example: income of people in the USA. Most people earn low-to-moderate incomes, while a small number earn very high incomes, stretching the tail to the right.
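Both skews can be checked against the mean/median relationship with a quick simulation. Exponential samples serve here as a stand-in for a right-skewed quantity such as income, and their mirror image is left-skewed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Exponential samples are right-skewed: long tail to the right.
right = rng.exponential(scale=1.0, size=100_000)
left = -right   # mirror image: long tail to the left

print(f"right-skewed: mean={right.mean():.2f} > median={np.median(right):.2f}")
print(f"left-skewed : mean={left.mean():.2f} < median={np.median(left):.2f}")
```

The long tail drags the mean toward it while the median stays near the bulk of the data, which is exactly the mean/median ordering stated above.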
11. Mean Absolute Error(MAE)
Mean Absolute Error (MAE) measures the absolute difference between the actual and predicted values and computes the average. It measures only the distance between the actual and predicted values, not the direction (actual > predicted or predicted > actual). The formula for calculating MAE is:

MAE = (1/n) Σ |actual_i − predicted_i|
12. Mean Squared Error(MSE)
Mean Squared Error (MSE) calculates the square of the difference between the actual and predicted values and computes the average. Since it squares the errors, an outlier in the dataset inflates the total error heavily, so this metric is sensitive to outliers. The formula for calculating MSE is:

MSE = (1/n) Σ (actual_i − predicted_i)²
13. Root Mean Squared Error(RMSE)
Root Mean Squared Error (RMSE) calculates the square root of the average squared difference between the actual and predicted values, i.e., the square root of MSE.
Taking the square root brings the error back into the same units as the target, which makes RMSE easier to interpret than MSE. Note that RMSE still inherits MSE's sensitivity to outliers, although the errors are not exaggerated as strongly as in MSE. The formula for calculating RMSE is:

RMSE = √MSE = √((1/n) Σ (actual_i − predicted_i)²)
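All three regression metrics can be computed side by side. The actual and predicted values below are made-up numbers for illustration:

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])      # toy ground-truth values
predicted = np.array([2.5, 5.0, 4.0, 8.0])   # toy model outputs

errors = actual - predicted
mae = np.mean(np.abs(errors))    # direction-free average distance
mse = np.mean(errors ** 2)       # squaring exaggerates large errors
rmse = np.sqrt(mse)              # back in the units of the target

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
```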
Hence, these are some of the very important topics in machine learning that one should know.
For more detailed information about performance metrics in regression analysis, follow this link.
Happy Learning 🙂