If you’re studying machine learning, it’s best to practice on real-world data rather than artificial datasets. Fortunately, there are numerous open datasets to choose from, covering a wide range of domains. Here are a few places you can look:

Well-known open data repositories:

- OpenML.org
- Kaggle.com
- PapersWithCode.com
- UC Irvine Machine Learning Repository
- Amazon’s AWS datasets
- TensorFlow datasets

Meta portals (that list open data repositories):

- DataPortals.org
- OpenDataMonitor.eu

Other websites that list many popular open data repositories:

- Wikipedia’s list of machine learning datasets
- Quora.com
- The datasets subreddit

## Root mean square error (RMSE)

RMSE measures how close a prediction model’s outputs are to the actual values in a dataset. For example, suppose we are trying to predict a person’s height from their age. The RMSE tells us how well the model is doing: a low RMSE means the predicted heights are close to the actual heights, while a high RMSE means they are far off.

Root mean square error (RMSE)

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\bigl(h(\mathbf{x}^{(i)}) - y^{(i)}\bigr)^{2}}$$

- *m* is the number of instances in the dataset you are measuring the RMSE on.
- **x**(*i*) is a vector of all the feature values (excluding the label) of the *i*th instance in the dataset, and *y*(*i*) is its label (the desired output value for that instance).
- **X** is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the *i*th row is equal to the transpose of **x**(*i*), noted (**x**(*i*))⊺.
- *h* is your system’s prediction function, also called a *hypothesis*. When your system is given an instance’s feature vector **x**(*i*), it outputs a predicted value *ŷ*(*i*) = *h*(**x**(*i*)) for that instance (*ŷ* is pronounced “y-hat”).
- RMSE(**X**, *h*) is the cost function measured on the set of examples using your hypothesis *h*.

We use lowercase italic font for scalar values (such as *m* or *y*(*i*)) and function names (such as *h*), lowercase bold font for vectors (such as **x**(*i*)), and uppercase bold font for matrices (such as **X**). Although the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function.
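As a quick sketch, the RMSE formula above can be computed directly with NumPy; the function and example values here are illustrative, with the predictions standing in for *h*(**x**(*i*)):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root of the mean squared difference between
    predictions h(x^(i)) and labels y^(i)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Example: three predictions off by 1, 2, and 3 units.
print(rmse([10.0, 20.0, 30.0], [11.0, 22.0, 33.0]))  # sqrt((1+4+9)/3) ~ 2.16
```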

If there are many outlier districts in the data, you might prefer the mean absolute error (MAE), also known as the average absolute deviation.

Mean absolute error (MAE)

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\bigl|\,h(\mathbf{x}^{(i)}) - y^{(i)}\,\bigr|$$

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible; the RMSE corresponds to the Euclidean (ℓ₂) norm, and the MAE to the ℓ₁ norm, sometimes called the Manhattan norm.

The RMSE is more sensitive to outliers than the MAE, because squaring the errors gives disproportionate weight to large ones, whereas the MAE sums the absolute differences and treats every error linearly. When outliers are rare, however (as in a bell-shaped distribution), the RMSE performs very well and is generally preferred.
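To see the outlier sensitivity concretely, here is a small sketch on illustrative data: 100 predictions, all perfect except for a single large error.

```python
import numpy as np

def rmse(y, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y)) ** 2)))

def mae(y, y_pred):
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y))))

# 100 perfect predictions except for a single error of 50 units.
y = np.zeros(100)
y_pred = np.zeros(100)
y_pred[0] = 50.0

print(mae(y, y_pred))   # 0.5: the outlier contributes only linearly
print(rmse(y, y_pred))  # 5.0: squaring lets the single outlier dominate
```

The one bad prediction inflates the RMSE to ten times the MAE, which is exactly why the MAE can be the safer choice when outliers abound.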

Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the corr() method.
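For example, here is a minimal sketch with pandas; the DataFrame and its column names are hypothetical stand-ins for your own dataset:

```python
import pandas as pd

# Hypothetical housing data; replace with your own DataFrame.
housing = pd.DataFrame({
    "median_income":      [2.5, 3.1, 4.0, 5.2, 6.8],
    "latitude":           [37.9, 37.5, 38.2, 36.8, 34.1],
    "median_house_value": [120_000, 150_000, 180_000, 230_000, 310_000],
})

corr_matrix = housing.corr()

# How strongly each attribute correlates with the target:
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```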

The correlation coefficient ranges from −1 to 1. When it is close to 1, there is a strong positive correlation; for example, the median house value tends to rise when the median income goes up. When the coefficient is close to −1, there is a strong negative correlation; for instance, there is a small negative correlation between latitude and the median house value, meaning prices have a slight tendency to drop as you go north. Finally, coefficients close to 0 mean there is no linear correlation.

Another way to improve your system is to combine the models that perform best, a technique known as ensemble learning. The ensemble will often perform better than the best individual model, just as random forests outperform the individual decision trees they are built on, especially when the individual models make different types of errors. For example, you could train and fine-tune a k-nearest neighbors model, then build an ensemble that predicts the mean of the random forest’s and the k-nearest neighbors model’s predictions.
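As a sketch of that idea with scikit-learn, averaging a random forest’s and a k-nearest neighbors model’s predictions (synthetic data and default hyperparameters stand in for a real, tuned pipeline):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data as a stand-in for your dataset.
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestRegressor(random_state=42).fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# The ensemble simply predicts the mean of both models' predictions.
ensemble_pred = (forest.predict(X_test) + knn.predict(X_test)) / 2

def rmse(a, b):
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

print("forest:  ", rmse(y_test, forest.predict(X_test)))
print("knn:     ", rmse(y_test, knn.predict(X_test)))
print("ensemble:", rmse(y_test, ensemble_pred))
```

More sophisticated combinations are possible (weighted averages, stacking), but a plain mean is often a surprisingly strong baseline when the models’ errors are weakly correlated.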