The problem: given the composition of a newly made wine, predict its quality from parameters such as sugar content, citric acid content, and density.

**Table of Contents**

- Getting the dataset
- Import the libraries and dataset
- Data analysis
- Data Preprocessing
- Train Test Split
- Machine Learning Model
- Feeding new inputs to trained Random Forest Model for quality prediction.

## 1. Getting the dataset

To get the dataset, search for ‘red wine dataset kaggle’ in a search engine and follow the Kaggle link in the results. Download the csv file and place it in the same folder as your code to avoid path errors while importing.

## 2. Import the libraries and dataset

We import numpy for mathematical operations on arrays, pandas for data wrangling and manipulation, matplotlib and seaborn for exploring and plotting the data, and sklearn for machine learning and statistical modelling.
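A minimal sketch of the loading step could look like this. The file name `winequality-red.csv` is an assumption (use whatever name your download has), `load_wine` is a hypothetical helper, and the embedded sample rows are illustrative stand-ins so the snippet runs without the file.

```python
import io

import pandas as pd

# Illustrative stand-in for the downloaded csv: same 12-column layout,
# only a handful of rows. Not a substitute for the real file.
SAMPLE_CSV = """fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
"""


def load_wine(source):
    """Read the wine csv from a path or a file-like object into a DataFrame."""
    return pd.read_csv(source)


# With the real file next to your code (assumed name):
#   wine_dataset = load_wine("winequality-red.csv")
wine_dataset = load_wine(io.StringIO(SAMPLE_CSV))
print(wine_dataset.columns.tolist())
```

numpy, matplotlib, seaborn and sklearn would be imported in the same place; they are left out here only to keep the sketch light.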

## 3. Data Analysis

Data analysis refers to the process of inspecting, cleaning, transforming, and interpreting data to discover valuable insights, draw conclusions, and support decision-making.

We use the shape attribute to get the dimensions of the dataset. Here the output shows that we have 1599 rows and 12 columns. The head() function displays the first five rows of the dataset. We used the isnull().sum() method to check for null values; there are none in the dataset, as every value in the output of line [5] is 0.

The describe() method calculates summary statistics such as percentiles, mean, and standard deviation for the numerical values of a Series or DataFrame.
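The inspection steps above can be sketched on a tiny made-up frame. The column names follow the wine dataset, but the values are illustrative only; on the real data the shape would be (1599, 12).

```python
import pandas as pd

# Small made-up frame standing in for the wine data.
df = pd.DataFrame({
    "volatile acidity": [0.70, 0.88, 0.28, 0.66, 0.60, 0.32],
    "citric acid": [0.00, 0.00, 0.56, 0.00, 0.06, 0.45],
    "quality": [5, 5, 6, 5, 5, 7],
})

n_rows, n_cols = df.shape               # shape is an attribute, not a method call
first_rows = df.head()                  # first five rows by default
missing_per_column = df.isnull().sum()  # number of null values in each column
summary = df.describe()                 # count, mean, std, min, quartiles, max

print(n_rows, n_cols)
print(first_rows)
print(missing_per_column)
print(summary)
```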

Now we’ll plot the count of each quality value.

Quality is scored from 0 to 10, but we can see that there are no wines with a quality of 1 or 2, or greater than 8. Wines with quality 7 and above are considered good in the red wine dataset, but the counts are concentrated on 5, 6, and 7.
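The numbers behind such a count plot can be obtained with `value_counts`. This is a sketch on made-up scores, and the 0 to 10 scale used for the reindex is an assumption about how the dataset is scored. The plot itself could then be drawn with, for example, seaborn's `countplot`.

```python
import pandas as pd

# Made-up quality scores, for illustration only.
quality = pd.Series([5, 6, 5, 7, 6, 5, 3, 8, 6, 5], name="quality")

# How many wines received each score, ordered by score.
counts = quality.value_counts().sort_index()
print(counts)

# Add the scores that never occur (assumed scale 0..10) so the gaps are visible.
full_counts = counts.reindex(range(0, 11), fill_value=0)
print(full_counts)
```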

Now we’ll plot each input parameter against the output parameter, i.e. quality, to see how they correspond. Let’s start with volatile acidity as the input parameter vs quality as the output parameter.

figsize=(5,5) sets the size of the figure. If you want a larger figure, you can use (8,8) or (10,10). We can see that quality increases as volatile acidity decreases.
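One way to draw such a plot is sketched here with plain matplotlib on made-up values. This is not necessarily the plotting call used for the figure in this post (seaborn's `barplot` would give a similar picture); the `Agg` backend is chosen only so the snippet also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "volatile acidity": [0.90, 0.80, 0.70, 0.60, 0.50, 0.45, 0.40, 0.35],
    "quality": [3, 4, 5, 5, 6, 6, 7, 8],
})

# Mean volatile acidity for each quality score.
mean_va = df.groupby("quality")["volatile acidity"].mean()

fig, ax = plt.subplots(figsize=(5, 5))  # try (8, 8) or (10, 10) for a larger figure
ax.bar(mean_va.index, mean_va.values)
ax.set_xlabel("quality")
ax.set_ylabel("mean volatile acidity")
```

In this toy data the bars get shorter as quality goes up, which is the pattern described above.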

Similarly, we can plot quality vs citric acid and conclude that quality increases with increasing citric acid content. We could plot each input parameter against the output parameter to see the correspondence, but there is a more convenient option: correlation.

The coefficient of correlation measures how strong the relationship between two variables is. Its value lies between -1 and 1. If the values of one variable rise or fall and the values of the other variable also rise or fall respectively, they are said to be positively correlated. In contrast, negative correlation means that the two variables move in opposite directions.
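To make the definition concrete, here is a small sketch that computes the Pearson coefficient by hand with numpy. `pearson_r` is a hypothetical helper written for this illustration, and the two toy series are made up.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # both series move together
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # series move in opposite directions
print(rising, falling)
```

The first pair gives a coefficient of 1 and the second gives -1, the two ends of the range.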

The above code gives a correlation heatmap, which shows the coefficient of correlation of each parameter with every other parameter.

Now if we want to check the correlation between quality and fixed acidity, we look for the square where quality and fixed acidity intersect; the value is 0.1. The darker the shade of green, the higher the coefficient. Similarly, you can look up the coefficient of fixed acidity and pH, which is -0.7. Its shade is almost white because the value sits at the low end of the colour scale, but its magnitude indicates a fairly strong negative correlation, not a weak one.
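The matrix behind such a heatmap comes from `DataFrame.corr()`. The sketch uses made-up values; the commented seaborn call shows how the matrix might be drawn, with the `Greens` colormap being a guess based on the green shades mentioned above.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2, 8.5, 6.9, 10.1],
    "pH": [3.51, 3.40, 3.16, 3.30, 3.55, 3.20],
    "quality": [5, 5, 6, 7, 5, 6],
})

# Pairwise Pearson correlation of every column with every other column.
correlation = df.corr()
print(correlation.round(2))

# Look up one square of the heatmap: row "fixed acidity", column "pH".
acidity_vs_ph = correlation.loc["fixed acidity", "pH"]
strength = abs(acidity_vs_ph)  # strength is the magnitude; the sign gives the direction
print(acidity_vs_ph, strength)

# Possible drawing call (assumes seaborn is imported as sns):
#   sns.heatmap(correlation, annot=True, fmt=".1f", cmap="Greens")
```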

## 4. Data Preprocessing

Data preprocessing is the processing performed on raw data to make it ready to feed to the model. So we have to separate the label from the data.

Quality is the output parameter (Y), so we have to separate it from all the input parameters (X).

For quality (Y) we make two categories, as in a qualitative analysis: good wine or bad wine. We set a threshold: if quality is greater than or equal to 7, it’s a good quality wine (1); otherwise it’s a bad quality wine (0).
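The separation and the thresholding can be sketched in two lines of pandas. The frame is made up; only the column name `quality` and the threshold of 7 come from the text.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "volatile acidity": [0.70, 0.28, 0.32, 0.60],
    "alcohol": [9.4, 9.8, 12.5, 10.5],
    "quality": [5, 6, 8, 7],
})

# Inputs: every column except the label.
X = df.drop(columns="quality")

# Label: 1 (good) when quality >= 7, otherwise 0 (bad).
Y = df["quality"].apply(lambda score: 1 if score >= 7 else 0)

print(X.columns.tolist())
print(Y.tolist())  # [0, 0, 1, 1]
```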

## 5. Train Test Split

In this part we split our dataset into training data and testing data. We train the machine learning model on the training data. We don’t show the testing data to the model during training; we use it only to test the model with the parameters learned from the training data. We calculate the accuracy of the model using the testing data.

Here X_train holds the training inputs, X_test the testing inputs, Y_train the training labels, and Y_test the testing labels. test_size = 0.2 means that the test set is 20% of all the entries in the dataset. random_state = 3 sets a seed for the random generator, so that your train-test split is deterministic. If you don’t set a seed, the split is different each time.

Line 17 confirms that our train-test split has occurred successfully. The total number of entries is 1599. Y_train has 1279 entries and Y_test has 320 entries, which is almost exactly an 80%–20% split.
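The split and the size check can be reproduced on synthetic data with the same number of rows. The random values here are meaningless stand-ins; only the row count (1599), `test_size=0.2` and `random_state=3` come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1599 rows and 11 input columns of random numbers.
rng = np.random.default_rng(0)
X = rng.normal(size=(1599, 11))
Y = rng.integers(0, 2, size=1599)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3
)

print(Y.shape, Y_train.shape, Y_test.shape)  # (1599,) (1279,) (320,)
```

Since 20% of 1599 is 319.8, the test set is rounded up to 320 rows and the remaining 1279 go to training.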

## 6. Machine Learning Model

We’ll use a random forest classifier for this problem. A random forest classifier combines the output of multiple decision trees to reach a single result.

We’ll train our model using the training data (i.e. X_train and Y_train) and test it using the testing data (X_test and Y_test).

To evaluate our model, we calculate the accuracy score. We calculate Y_predicted by running our model on X_test. So Y_predicted is the output calculated by our model on the test data, and Y_test is the actual test output from the csv. Make sure you understand the difference between Y_test and Y_predicted. The accuracy score of our model comes out to 0.94375, so we can say that our model is 94.375% accurate on the test data.
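Training and evaluation might be sketched as follows. The data is synthetic (the label simply depends on the first feature), and `random_state=0` on the classifier is an assumption added for reproducibility, so the accuracy printed here will not match the 0.94375 reported for the wine data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: 400 rows, 5 features, label is 1 when the first feature exceeds 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
Y = (X[:, 0] > 0.5).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3
)

# Fit the forest on the training data only.
model = RandomForestClassifier(random_state=0)
model.fit(X_train, Y_train)

# Y_predicted: what the model says for the test inputs.
# Y_test: the true labels for those same inputs.
Y_predicted = model.predict(X_test)
test_accuracy = accuracy_score(Y_test, Y_predicted)
print(test_accuracy)
```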

While building our predictive system, we can take any row in the csv as input_data. We convert it into a numpy array and reshape it into an array with a single embedded row; for example, np.array([1,2,3,4]).reshape(1,-1) gives [[1 2 3 4]]. The model then predicts the output for that particular input; here it will be 1.

We’ll convert our output into linguistic terms for convenience. This step is not mandatory, but it makes the result easier to understand and read.
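The reshape, predict and relabel steps can be sketched with a toy model. The two-feature training set is made up so the snippet is self-contained, and the label strings are illustrative; with the wine data, input_data would hold the 11 input values of one row.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training set with two made-up features; rows with large values are "good" (1).
X_train = np.array([[0, 1], [1, 0], [2, 3], [3, 2],
                    [10, 11], [11, 10], [12, 13], [13, 12]])
Y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = RandomForestClassifier(random_state=0).fit(X_train, Y_train)

input_data = (12, 12)                                # one row of feature values
input_array = np.asarray(input_data).reshape(1, -1)  # shape (1, 2): one sample, two features
prediction = model.predict(input_array)              # array holding a single label

# Turn the numeric label into words.
label = "Good quality wine" if prediction[0] == 1 else "Bad quality wine"
print(input_array.shape, label)
```

The reshape is needed because `predict` expects a 2-D array of samples, even when there is only one sample.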

## 7. Feeding new inputs to trained Random Forest Model for quality prediction.

Using the training data, we trained our random forest model. Now we’ll feed it new inputs and compare the predicted output with the actual output.

So I copied a random row from the csv file where the quality was 5, which is bad quality (as 5 < 7). Our predicted output is 0, and it matches the output in the csv.

Thanks for your patience!

I welcome your suggestions and optimizations in the comments section!