The machine learning model shown below is not meant to be applied in real-world scenarios. It is only a hypothesis.
The factors considered for predicting the Student Performance Index are Hours Studied, Previous Scores, Extracurricular Activities, Sleep Hours, and Sample Question Papers. The source from where I got this data
Before getting into details, what is machine learning, and why is it the new talk of the town?
Suppose you use a conventional navigation system to drive from point A to point B. The system relies on pre-programmed routes and road information. It follows fixed rules, such as “turn left at this intersection” and “drive for X miles.” Such systems don’t adapt to real-time traffic conditions or provide personalized recommendations, and they cannot learn from historical traffic data or user preferences. As a result, you might run into unexpected traffic jams or miss shortcuts that could save time.
Now, consider using Google Maps, which leverages machine learning. Google Maps collects massive amounts of user data and integrates it into its algorithms. When you input a destination, the machine learning model considers real-time traffic conditions, road closures, and historical traffic data. It calculates the optimal route based on this information. What sets it apart is its ability to learn and adapt. Over time, it becomes more accurate in predicting traffic patterns, suggesting alternate routes, and estimating arrival times. It can even recommend restaurants and gas stations based on your preferences and previous choices.
I hope it is now clear why machine learning is the talk of the town, but this field of study is not new. It has existed since the 1950s and has been researched for decades. It received far less attention in the past than it does now, largely because the computing infrastructure was less capable. Then, in 2012, something changed.
Machine Learning Breakthrough in 2012:
In 2012, there was a significant moment in machine learning. It involved a technology called deep neural networks, especially Convolutional Neural Networks (CNNs). These are like super-smart systems that can understand pictures.
People tested these systems in a big competition called the ImageNet Challenge. The goal was to teach computers to recognize and name things in thousands of pictures. In that year, a deep learning system named AlexNet did something incredible. It brought the top-5 error rate down to about 15% (roughly 1 wrong answer in 7), compared with about 26% (roughly 1 in 4) for the next-best entry. This was a huge deal because it showed how capable these systems could be at understanding images.
This breakthrough was a big deal because it allowed these intelligent systems to be used in many ways, like making computers understand spoken language or helping doctors read X-rays. It also made many people more interested in studying and investing in this kind of technology, which led to even more cool discoveries in the field.
To know more about this topic, check out this link: https://en.wikipedia.org/wiki/AlexNet.
So, what is machine learning exactly, and what are its subcategories, if any?
Machine learning is a branch of artificial intelligence, closely tied to data science, that focuses on developing algorithms and models that enable computers to learn and make predictions or decisions without being explicitly programmed. The core idea is to use data to train algorithms so that they can identify patterns, make predictions, or optimize performance on new, unseen data.
Machine learning is broadly classified into four categories:
Supervised Learning (buckle up, this is the topic I will discuss in a few moments): Imagine teaching a computer the way you teach a dog new tricks. You show the computer pictures of cats and dogs and tell it which is which. After lots of practice, the computer learns to recognize cats and dogs on its own when you show it new pictures. This is how supervised learning works. It’s supervised training, just like teaching your pet.
Unsupervised Learning: Now, think of a situation where you give a computer many photos but don’t tell it what’s in them. The computer has to figure out if there are any similarities between the images. It might find that some pictures have mountains, some have beaches, and some have cities. It groups the photos based on what it sees, even though you never told it what to look for. That’s unsupervised learning. It’s like the computer is exploring and discovering things on its own.
Semi-Supervised Learning: Semi-supervised learning is a mix of the two. You give the computer a few photos with labels (like telling it that these are pictures of cats and dogs), but you also give it a lot of images without labels. The computer uses the labeled pictures to learn some stuff and then tries to apply what it understood to the unlabeled images. It’s a bit like having a teacher for some lessons and doing homework on your own for others.
Reinforcement Learning: Think of a video game where you control a character. You want the character to get the most points, but you’re not telling it exactly what to do at every moment. You give the character a few rules, like “collect coins” and “avoid monsters.” As it plays, the character learns which decisions earn more points under these rules. Reinforcement learning is like playing a game where you learn to make better choices by getting rewards and learning from your mistakes. It’s used to train robots or self-driving cars to make good decisions.
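To make the difference between the first two categories concrete, here is a minimal sketch on made-up numbers. The toy points, the "cat"/"dog" labels, and the choice of KNeighborsClassifier and KMeans are my own assumptions for illustration; they are not part of the student dataset used later. The supervised model is given the labels, while the unsupervised one has to form groups by itself.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy data (hypothetical): two well-separated groups of 2-D points
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

# Supervised: the model sees the points together with their labels
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(points, labels)
predicted = classifier.predict([[1.1, 0.9], [7.9, 8.0]])

# Unsupervised: the model sees only the points and forms its own groups
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(points)

print(predicted)  # labels predicted for the two new points
print(groups)     # cluster ids; which group is called 0 or 1 is arbitrary
```

Notice that the clustering step never sees the words "cat" or "dog". It only reports that the first three points belong together and the last three belong together.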
So, let’s quickly dive into our main topic, i.e., Supervised Learning:
By this time, you should have a basic understanding of Supervised Learning. Let’s get our hands dirty by writing some code related to Supervised Learning.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing the dataset and encoding the string values
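# Load the CSV file into a DataFrame
dataset = pd.read_csv('Student_Performance.csv')
from sklearn.preprocessing import LabelEncoder
# Convert the string column into integer codes
label_encoder = LabelEncoder()
dataset['Extracurricular Activities'] = label_encoder.fit_transform(dataset['Extracurricular Activities'])
# Features: columns from index 1 up to (not including) the last column
X = dataset.iloc[:, 1:-1].values
# Target: the last column (Performance Index)
y = dataset.iloc[:, -1].values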
I have used LabelEncoder for ‘Extracurricular Activities’ because this column of the data has string values. After encoding, the strings are turned into 0 and 1 (these values apply only when there are exactly two different strings, which is the case here). The iloc indexer selects rows and columns by position. In the code above, X takes the columns from index 1 up to, but not including, the last column, and y takes the last column, the Performance Index. Note that because the slice starts at 1, the first column of the file is not part of X; to include it, the slice would be iloc[:, :-1].
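Here is a small sketch of what the encoder and the iloc slices do. The three-row table is made up for illustration (the numbers are hypothetical and not taken from the real file); only the column names echo the ones mentioned above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical values) with the same kind of Yes/No column
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 8],
    "Extracurricular Activities": ["Yes", "No", "Yes"],
    "Performance Index": [91.0, 65.0, 45.0],
})

encoder = LabelEncoder()
toy["Extracurricular Activities"] = encoder.fit_transform(
    toy["Extracurricular Activities"]
)

# Classes are sorted alphabetically, so "No" becomes 0 and "Yes" becomes 1
print([str(c) for c in encoder.classes_])          # ['No', 'Yes']
print(toy["Extracurricular Activities"].tolist())  # [1, 0, 1]

# Position-based slicing: two ways of picking the feature columns
all_but_last = toy.iloc[:, :-1].values  # every column except the last
skip_first = toy.iloc[:, 1:-1].values   # also leaves out the first column
target = toy.iloc[:, -1].values         # last column only

print(all_but_last.shape, skip_first.shape, target.shape)  # (3, 2) (3, 1) (3,)
```

The shapes show the effect of the starting index: a slice that begins at 1 ends up with one feature column fewer than a slice that begins at 0.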
Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
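The above code splits the dataset in an 80:20 ratio: 80% of the rows are used for training and 20% for testing. Fixing random_state makes the split reproducible, so the same rows land in each set on every run.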
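A quick way to convince yourself of the 80:20 split is to try it on a tiny made-up array. The names X_demo and y_demo and their values are hypothetical, chosen only so the result is easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each.
# Row i is [2*i, 2*i + 1] and its target is i, so pairs are easy to verify.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# The same random_state gives the same split again
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print((y_te == y_te_again).all())  # True
```

The rows are shuffled before splitting, but each feature row stays paired with its own target value.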
Training the Polynomial Regression model on the Training set
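from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Expand the features into polynomial terms up to degree 4
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X_train)
# Fit a linear model on the expanded training features
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y_train)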
Simple Linear Regression:
A simple linear regression model has one independent variable (predictor) and one dependent variable (target). The goal is to find the best-fitting linear relationship between them.
Mathematically, the model is represented as:
Y = b0 + b1*X + ε
- Y is the dependent variable (the target or response variable).
- X is the independent variable (the predictor or feature).
- b0 is the intercept (the value of Y when X is 0).
- b1 is the slope (the change in Y for a unit change in X).
- ε represents the error term, which accounts for the variability in Y that is not explained by the linear relationship.
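The intercept and slope can be computed directly from the data. Below is a minimal sketch using the standard least-squares formulas for one predictor; the data points are made up (generated from Y = 2 + 3*X with no error term) so that the expected answer is known in advance.

```python
import numpy as np

# Toy data (hypothetical) generated from Y = 2 + 3*X with no noise
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 + 3.0 * X

# Least-squares estimates for a single predictor:
#   b1 = sum((X - mean X) * (Y - mean Y)) / sum((X - mean X)^2)
#   b0 = mean Y - b1 * mean X
x_mean, y_mean = X.mean(), Y.mean()
b1 = ((X - x_mean) * (Y - y_mean)).sum() / ((X - x_mean) ** 2).sum()
b0 = y_mean - b1 * x_mean

print(b0, b1)  # approximately 2.0 and 3.0
```

With real data the points will not sit exactly on a line, and the same formulas give the line that minimizes the squared errors.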
Multiple Linear Regression:
In a multiple linear regression model, there are multiple independent variables.
The mathematical representation is an extension of simple linear regression:
Y = b0 + b1*X1 + b2*X2 + … + bn*Xn + ε
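With more than one predictor, the coefficients are usually found by solving a least-squares problem in matrix form. The sketch below uses NumPy's lstsq for that; the data is made up (generated from Y = 1 + 2*X1 - 3*X2 with no error term), and the variable names are my own.

```python
import numpy as np

# Toy data (hypothetical) generated from Y = 1 + 2*X1 - 3*X2 with no noise
features = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0],
                     [4.0, 3.0], [5.0, 8.0]])
Y = 1.0 + 2.0 * features[:, 0] - 3.0 * features[:, 1]

# Prepend a column of ones so b0 is estimated like any other coefficient
design = np.column_stack([np.ones(len(features)), features])

# Solve the least-squares problem: find b that makes design @ b closest to Y
coefficients, *_ = np.linalg.lstsq(design, Y, rcond=None)
b0, b1, b2 = coefficients

print(b0, b1, b2)  # approximately 1.0, 2.0 and -3.0
```

This is the same idea that LinearRegression applies when it is given several feature columns.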
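Polynomial Regression:
Polynomial regression represents a nonlinear relationship by introducing higher-order terms of the independent variable(s). The model is still linear in its coefficients, which is why it can be fitted with an ordinary LinearRegression once the features have been expanded into polynomial terms.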
The mathematical representation of polynomial regression with a single independent variable is as follows:
Y = b0 + b1*X + b2*X² + … + bn*X^n + ε
- Y is the dependent variable.
- X is the independent variable.
- b0 is the intercept.
- b1, b2, …, bn are the coefficients associated with each term.
- ε is the error term.
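To see how the expansion and the fit work together, here is a minimal sketch with a single independent variable. The data is made up (generated from Y = 1 + 2*X + 3*X² with no error term) and degree 2 is chosen only to keep the output small; the student model in this post uses degree 4.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical) generated from Y = 1 + 2*X + 3*X^2 with no noise
X = np.arange(6, dtype=float).reshape(-1, 1)  # 0, 1, ..., 5 as one column
Y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 0] ** 2

# A degree-2 expansion turns each x into the row [1, x, x^2]
poly = PolynomialFeatures(degree=2)
X_expanded = poly.fit_transform(X)
print(X_expanded[2])  # [1. 2. 4.]

# An ordinary linear model on the expanded columns fits the curve
model = LinearRegression().fit(X_expanded, Y)

# New inputs must go through the same expansion before predicting
prediction = model.predict(poly.transform([[6.0]]))
print(prediction)  # approximately [121.]
```

The last step matters in practice: test data has to be passed through transform with the already fitted expansion, not fitted again.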
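Predicting the Test set results and evaluating the model
# Apply the same polynomial expansion to the test features, then predict
y_pred = lin_reg.predict(poly_reg.transform(X_test))
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)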
R-squared, often called the “coefficient of determination,” is a statistical measure of how well a regression model (linear or nonlinear) fits the observed data. It is the proportion of the variation in the target that the model explains. In simpler terms, it tells you how close the data points are to the model's predictions, and a score of 1 means a perfect fit.
The R-squared score of the above model is 83.94%
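For intuition, R-squared can also be computed by hand as 1 minus the ratio of the unexplained variation to the total variation. The sketch below does this on four made-up values (they are hypothetical and unrelated to the 83.94% reported above).

```python
import numpy as np

# Toy values (hypothetical), not taken from the student dataset
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Variation the model fails to explain: sum of squared prediction errors
ss_res = ((y_true - y_hat) ** 2).sum()          # 0.75
# Total variation of the target around its mean
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # 20.0

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual)  # approximately 0.9625
```

scikit-learn's r2_score computes the same quantity, so calling it on these two arrays should give a matching value.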