When you start your journey with machine learning, the first model you will usually be given to learn is regression, and more specifically linear regression. Here, we predict a continuous numerical value from other numerical and categorical features. We find a best-fit line that captures the pattern in the data and use it to predict the outcome.
Because regression is usually the first machine-learning model to be learned, it is worth understanding deeply. A clear understanding of certain mathematical and statistical concepts is needed to understand regression.
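To make the idea of a best-fit line concrete, here is a minimal sketch using NumPy. The toy data (a target that is roughly 3 times the feature plus 2) and the helper name `predict` are made up for illustration; they are not from any real dataset.

```python
import numpy as np

# Hypothetical toy data: y is roughly 3*x + 2 plus a little random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=x.size)

# Least-squares fit of a degree-1 polynomial, i.e. a straight line.
slope, intercept = np.polyfit(x, y, deg=1)


def predict(new_x):
    """Predict the target for new_x using the fitted line."""
    return slope * new_x + intercept


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"prediction at x=12: {predict(12.0):.2f}")
```

The fitted slope and intercept should land close to the values used to generate the data, which is all "finding the best-fit line" means here.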
Before getting started with linear regression, a few conditions have to be met.
1)Linearity:
The independent features and the target variable should have a linear relationship. This can be checked using a scatter plot with the independent variable on the x-axis and the dependent target variable on the y-axis. The points should form a perfectly or roughly straight inclined line (/ or \), which indicates linearity.
To show linearity, we should obtain either the first or the second plot. In the first image, there is a positive correlation, which means both variables move in the same direction: when the independent variable increases, the target also increases, and when it decreases, the target decreases.
In the second image, it is the opposite: if the independent variable increases, the target variable decreases, and vice versa. One of these two relationships should be present before moving forward with linear regression.
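Alongside the scatter plot, the direction and strength of a linear relationship can be summarized with the Pearson correlation coefficient. The sketch below uses made-up data for a rising, a falling, and a patternless case; the helper name `pearson_r` is only illustrative.

```python
import numpy as np

# Hypothetical toy data for three situations.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y_pos = 2.0 * x + rng.normal(0, 2.0, size=x.size)    # rising trend (/)
y_neg = -2.0 * x + rng.normal(0, 2.0, size=x.size)   # falling trend (\)
y_none = rng.normal(0, 2.0, size=x.size)             # no trend at all


def pearson_r(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)

print(f"positive case: r={r_pos:.2f}")
print(f"negative case: r={r_neg:.2f}")
print(f"no relationship: r={r_none:.2f}")
```

A value near +1 or -1 matches the first or second kind of plot, while a value near 0 suggests no linear relationship. The number alone is not enough, though, since a curved relationship can also produce a high correlation, so it is still worth looking at the scatter plot itself.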
2)Absence of Multicollinearity:
Multicollinearity refers to correlation among the independent features themselves. Just as we saw positive and negative correlation between an independent feature and the target, two or more independent features might show this kind of correlation with each other. This is not a favorable relationship. When independent features are related to each other, it becomes hard to isolate the individual impact each of them has on the target. This relationship also tends to increase the variance of the features' coefficients. It can be checked using a correlation heatmap and the VIF (Variance Inflation Factor). The correlation of every feature pair should be close to zero, whether it is positive or negative. The VIF should be low when multicollinearity is absent: a value of 1 means a feature is not correlated with the others, and a common rule of thumb treats values above 5 or 10 as a sign of a problem.
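Here is a rough sketch of both checks with NumPy only. The three features are invented so that the third is almost a copy of the first, and the `vif` helper is a hand-written illustration of the definition VIF = 1 / (1 - R²), not a library function.

```python
import numpy as np

# Hypothetical features: x3 is nearly a copy of x1, x2 is unrelated.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, x3])


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    feature j on all the other features (with an intercept)."""
    n_rows, n_cols = X.shape
    out = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Pairwise correlations: these are the numbers a heatmap would color.
corr = np.corrcoef(X, rowvar=False)
vifs = vif(X)

print(np.round(corr, 2))
print(np.round(vifs, 1))
```

In this made-up data, x1 and x3 should show a correlation close to 1 and very large VIF values, while x2 stays near a VIF of 1. If statsmodels is available, its `variance_inflation_factor` function in `statsmodels.stats.outliers_influence` can be used instead of the hand-written helper.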
3)Absence of Heteroscedasticity:
This condition makes sure that the error from the model is purely random. It should not be related to the predicted values or the predictor variables. This can be tested using scatter plots again. Plot the error on the y-axis against a predictor variable or the predicted values on the x-axis. The variance of the error should stay constant and should not follow any pattern as the predictor or predicted value changes.
In the above image, the first and the second pictures indicate heteroscedasticity. In the first picture, the variance grows as the x-axis value increases. In the second picture, the variance shrinks as the x-axis value increases. Both indicate heteroscedasticity. In the third picture, the variance remains constant.
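The visual check can be backed up with a crude number. The sketch below fits a line, then compares the residual variance in the upper half of the fitted values with the variance in the lower half. The data and the `spread_ratio` helper are invented for illustration, and this is an informal check rather than a formal statistical test.

```python
import numpy as np

# Hypothetical data: one target with constant noise, one where the
# noise grows with x (a "fan" shape in the residual plot).
rng = np.random.default_rng(3)
x = np.linspace(1, 10, 400)
y_const = 2.0 * x + rng.normal(0, 1.0, size=x.size)
y_fan = 2.0 * x + rng.normal(0, 1.0, size=x.size) * 0.5 * x


def spread_ratio(x, y):
    """Fit a line, then return the residual variance in the upper half
    of the fitted values divided by the variance in the lower half."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = slope * x + intercept
    resid = y - fitted
    order = np.argsort(fitted)
    half = len(order) // 2
    low = resid[order[:half]]
    high = resid[order[half:]]
    return float(high.var() / low.var())


ratio_const = spread_ratio(x, y_const)
ratio_fan = spread_ratio(x, y_fan)

print(f"constant spread: ratio={ratio_const:.2f}")
print(f"growing spread:  ratio={ratio_fan:.2f}")
```

A ratio near 1 is what the third picture would give. A ratio well above 1 matches the first picture, and a ratio well below 1 matches the second.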
4)Absence of Autocorrelation:
Autocorrelation is present when the errors are related to one another. Errors that explain one another will not yield a good model. To check this, we can conduct a Durbin-Watson test. Its statistic ranges from 0 to 4: a value close to 2 indicates that autocorrelation is absent, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.
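The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive errors divided by the sum of squared errors. In the sketch below, the two error series are simulated directly rather than taken from a fitted model, purely to show how the statistic behaves.

```python
import numpy as np


def durbin_watson(resid):
    """Sum of squared successive differences over sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))


rng = np.random.default_rng(4)
n = 500

# Hypothetical independent errors: no autocorrelation.
white = rng.normal(size=n)

# Hypothetical autocorrelated errors: each one carries over 0.8 of
# the previous error plus a fresh random shock.
shocks = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)

print(f"independent errors:    DW={dw_white:.2f}")
print(f"autocorrelated errors: DW={dw_ar1:.2f}")
```

The independent errors should give a value close to 2, while the positively autocorrelated ones fall well below it. If statsmodels is installed, `statsmodels.stats.stattools.durbin_watson` computes the same statistic.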
5)Normal distribution of error:
The final condition to be met for linear regression is that the errors are normally distributed. This can be checked using a histogram of the errors, which should look roughly bell-shaped and centered on zero.
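As a small sketch of the histogram check, the code below prints a text histogram for two made-up sets of errors, one normal and one right-skewed. The helper names `text_histogram` and `skewness` are illustrative, and the skewness number is only an extra hint about symmetry, not a replacement for looking at the shape.

```python
import numpy as np

# Hypothetical errors: one set is normal, the other is right-skewed
# (shifted so that its mean is still about zero).
rng = np.random.default_rng(5)
resid_normal = rng.normal(0, 1.0, size=2000)
resid_skewed = rng.exponential(1.0, size=2000) - 1.0


def text_histogram(values, bins=9, width=40):
    """Return a simple text histogram, one line per bin."""
    counts, edges = np.histogram(values, bins=bins)
    lines = []
    for count, left in zip(counts, edges[:-1]):
        bar = "#" * int(round(width * count / counts.max()))
        lines.append(f"{left:7.2f} | {bar}")
    return "\n".join(lines)


def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.std(values) ** 3)


print(text_histogram(resid_normal))
print(f"skewness: {skewness(resid_normal):.2f}")
print(text_histogram(resid_skewed))
print(f"skewness: {skewness(resid_skewed):.2f}")
```

The normal errors should produce a roughly symmetric bell shape with skewness near zero, while the skewed errors pile up on the left with a long tail to the right.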
Always remember to check these conditions before building a regression model.