With our data in place, it’s time to train our first Neural Network. We’ll use an architecture similar to the one from the last blog post of the series: a linear version of our Neural Network, which is only able to handle linear patterns:

```python
from torch import nn


class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=12, out_features=5)
        self.layer_2 = nn.Linear(in_features=5, out_features=1)

    def forward(self, x):
        return self.layer_2(self.layer_1(x))
```

This neural network uses the `nn.Linear` module from PyTorch to create a Neural Network with one hidden layer (an input layer, a hidden layer and an output layer).
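To make the data flow concrete, here is a minimal sketch that pushes a batch through the class defined above. The batch of 8 random samples is a made-up stand-in, not the dataset used in this post; only the 12-feature width matches.

```python
import torch
from torch import nn


class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=12, out_features=5)
        self.layer_2 = nn.Linear(in_features=5, out_features=1)

    def forward(self, x):
        return self.layer_2(self.layer_1(x))


torch.manual_seed(0)
model = LinearModel()

# Hypothetical batch: 8 samples with 12 features each (random values)
x_batch = torch.randn(8, 12)

with torch.no_grad():
    out = model(x_batch)
    flat = out.squeeze()

print(out.shape)   # torch.Size([8, 1]) -> one value per sample
print(flat.shape)  # torch.Size([8])    -> same values without the extra dimension
```

The `squeeze()` call matters later on: labels are usually stored as a 1-D tensor, so dropping the trailing dimension keeps predictions and labels the same shape.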

Although we can create our own class inheriting from `nn.Module`, we can also use (more elegantly) the `nn.Sequential` constructor to do the same:

```python
model_0 = nn.Sequential(
    nn.Linear(in_features=12, out_features=5),
    nn.Linear(in_features=5, out_features=1)
)
```

Cool! So our Neural Network contains a single inner layer with 5 neurons (this can be seen by the `out_features=5` on the first layer).

Each neuron in this inner layer is connected to every input feature. The 12 in `in_features` of the first layer reflects the number of features, and the 1 in `out_features` of the second layer reflects the output: a single raw score (a logit) that becomes a value ranging from 0 to 1 once a sigmoid is applied.
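As a quick sanity check on those numbers, the sketch below rebuilds the same two-layer architecture and lists the shape of every weight and bias tensor. It only relies on `named_parameters()` and `numel()`.

```python
from torch import nn

model = nn.Sequential(
    nn.Linear(in_features=12, out_features=5),
    nn.Linear(in_features=5, out_features=1),
)

# Each nn.Linear stores a weight of shape (out_features, in_features) and a bias
shapes = {name: tuple(param.shape) for name, param in model.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)
# 0.weight (5, 12)
# 0.bias (5,)
# 1.weight (1, 5)
# 1.bias (1,)

total_params = sum(p.numel() for p in model.parameters())
print(total_params)  # 71 = (12 * 5 + 5) + (5 * 1 + 1)
```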

To train our Neural Network, we’ll define a loss function and an optimizer. We’ll use `BCEWithLogitsLoss` (PyTorch 2.1 documentation) as the loss function (the torch implementation of Binary Cross-Entropy that takes raw logits as input, appropriate for binary classification problems) and Stochastic Gradient Descent as the optimizer (using `torch.optim.SGD`).
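If you are wondering what the "with logits" part means, the sketch below compares `nn.BCEWithLogitsLoss` on raw scores against `nn.BCELoss` applied after an explicit sigmoid. The logits and labels are made-up numbers for illustration only.

```python
import torch
from torch import nn

# Made-up raw model outputs (logits) and binary labels
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
targets = torch.tensor([1.0, 0.0, 0.0, 1.0])

# One step: the loss applies the sigmoid internally
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Two steps: apply the sigmoid ourselves, then plain binary cross-entropy
loss_two_steps = nn.BCELoss()(torch.sigmoid(logits), targets)

same_loss = torch.allclose(loss_with_logits, loss_two_steps, atol=1e-6)
print(same_loss)  # True
```

Both routes give the same number here; the single-step version is generally preferred because it is more numerically stable for large positive or negative logits.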

```python
# Binary Cross entropy
loss_fn = nn.BCEWithLogitsLoss()

# Stochastic Gradient Descent for Optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(),
                            lr=0.01)
```

Finally, as we’ll also want to calculate the accuracy for every epoch of the training process, we’ll design a function to calculate that:

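```python
def compute_accuracy(y_true, y_pred):
    # Number of correct predictions (true positives + true negatives)
    tp_tn = torch.eq(y_true, y_pred).sum().item()
    # Share of correct predictions, as a percentage
    acc = (tp_tn / len(y_pred)) * 100
    return acc
```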
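Here is a small usage sketch of that function on made-up values. It follows the same route the training loop takes: logits go through a sigmoid, get rounded to 0 or 1, and are then compared with the labels.

```python
import torch


def compute_accuracy(y_true, y_pred):
    tp_tn = torch.eq(y_true, y_pred).sum().item()
    acc = (tp_tn / len(y_pred)) * 100
    return acc


# Made-up labels and logits, for illustration only
y_true = torch.tensor([1.0, 0.0, 1.0, 1.0])
logits = torch.tensor([2.3, -0.7, -1.2, 0.4])

# sigmoid -> probabilities, round -> hard 0/1 predictions
y_pred = torch.round(torch.sigmoid(logits))
print(y_pred.tolist())  # [1.0, 0.0, 0.0, 1.0]

accuracy = compute_accuracy(y_true=y_true, y_pred=y_pred)
print(accuracy)  # 75.0 (3 of the 4 predictions match the labels)
```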

Time to train our model! Let’s train it for 1000 epochs and see how a simple linear network is able to deal with this data:

```python
torch.manual_seed(42)

epochs = 1000
train_acc_ev = []
test_acc_ev = []

# Build training and evaluation loop
for epoch in range(epochs):
    model_0.train()

    y_logits = model_0(X_train).squeeze()
    loss = loss_fn(y_logits,
                   y_train)

    # Calculating accuracy using predicted logits
    acc = compute_accuracy(y_true=y_train,
                           y_pred=torch.round(torch.sigmoid(y_logits)))
    train_acc_ev.append(acc)

    # Training steps
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model_0.eval()

    # Inference mode for prediction on the test data
    with torch.inference_mode():
        test_logits = model_0(X_test).squeeze()
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = compute_accuracy(y_true=y_test,
                                    y_pred=torch.round(torch.sigmoid(test_logits)))
        test_acc_ev.append(test_acc)

    # Print out accuracy and loss every 100 epochs
    if epoch % 100 == 0:
        print(f"Epoch: {epoch}, Loss: {loss:.5f}, Accuracy: {acc:.2f}% | Test loss: {test_loss:.5f}, Test acc: {test_acc:.2f}%")
```

Unfortunately the neural network we’ve just built is not good enough to solve this problem. Let’s see the evolution of training and test accuracy:

*(I’m plotting accuracy instead of loss as it is easier to interpret in this problem)*

Interestingly, our Neural Network isn’t able to improve the test set accuracy much.

With the knowledge we have from previous blog posts, we can try to add more layers and neurons to our neural network. Let’s try to do both and see the outcome:

```python
deeper_model = nn.Sequential(
    nn.Linear(in_features=12, out_features=20),
    nn.Linear(in_features=20, out_features=20),
    nn.Linear(in_features=20, out_features=1)
)
```

Although our deeper model is a bit more complex with an extra layer and more neurons, that doesn’t translate into more performance in the network:

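Even though our model is more complex, that doesn’t really bring more accuracy to our classification problem. That is expected: with nothing non-linear between them, stacked linear layers still compute a single linear function of the input, so the extra layers add parameters but no new kinds of patterns.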
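One way to see why extra linear layers don’t help is to check that a stack of `nn.Linear` layers can be folded into a single one. The sketch below does this for two layers with random weights and random inputs (both are stand-ins, not the trained model from this post); the same argument extends to three or more layers.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two linear layers with no activation in between
stacked = nn.Sequential(
    nn.Linear(12, 20),
    nn.Linear(20, 1),
)

with torch.no_grad():
    w1, b1 = stacked[0].weight, stacked[0].bias
    w2, b2 = stacked[1].weight, stacked[1].bias

    # W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2): a single linear layer
    collapsed = nn.Linear(12, 1)
    collapsed.weight.copy_(w2 @ w1)
    collapsed.bias.copy_(w2 @ b1 + b2)

    x = torch.randn(16, 12)  # random stand-in inputs
    same_output = torch.allclose(stacked(x), collapsed(x), atol=1e-5)

print(same_output)  # True
```

Because the stacked model and the single layer produce the same outputs, depth alone cannot give the network access to non-linear decision boundaries.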

To be able to achieve more performance, we need to unlock a new feature of Neural Networks — activation functions!

If making our model wider and deeper didn’t bring much improvement, there must be something else that we can do with Neural Networks to improve their performance, right?

That’s where activation functions can be used! In our example, we’ll return to our simpler model, but this time with a twist:

```python
model_non_linear = nn.Sequential(
    nn.Linear(in_features=12, out_features=5),
    nn.ReLU(),
    nn.Linear(in_features=5, out_features=1)
)
```

What’s the difference between this model and the first one? The difference is that we added a new block to our neural network: `nn.ReLU`. The rectified linear unit is an activation function that changes the value coming out of each neuron in the layer before it, following the rule ReLU(x) = max(0, x).

Every value computed by the preceding layer goes through this function. If a neuron’s output (the weighted sum of its inputs plus the bias) is negative, the value is set to 0; otherwise the calculated value is kept. Just this small change adds a lot of power to a Neural Network architecture. In `torch` we have different activation functions we can use, such as `nn.ReLU`, `nn.Tanh` or `nn.ELU`. For an overview of all activation functions, check this link.
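To see the rule in action, here is a tiny sketch that applies `nn.ReLU` to a handful of made-up values and compares the result with clamping at zero, which is the same operation written differently.

```python
import torch
from torch import nn

# Made-up values standing in for the outputs of a linear layer
values = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

relu = nn.ReLU()
activated = relu(values)
print(activated.tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]

# ReLU(x) = max(0, x), so clamping at a minimum of 0 gives the same result
clamped = torch.clamp(values, min=0)
print(torch.equal(activated, clamped))  # True
```

Negative inputs are zeroed out and positive inputs pass through unchanged, which is what makes the overall network non-linear.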

Our neural network architecture now contains a small twist:

With this small twist in the Neural Network, every value coming from the first layer (represented by `nn.Linear(in_features=12, out_features=5)`) will have to go through the “ReLU” test.
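The “ReLU test” can be inspected directly by slicing the `nn.Sequential` model and looking at the intermediate values. This is a minimal sketch with freshly initialized weights and 4 random samples, so the exact numbers are not those of the trained model.

```python
import torch
from torch import nn

torch.manual_seed(0)

model_non_linear = nn.Sequential(
    nn.Linear(in_features=12, out_features=5),
    nn.ReLU(),
    nn.Linear(in_features=5, out_features=1),
)

x = torch.randn(4, 12)  # random stand-in for 4 samples with 12 features

with torch.no_grad():
    before_relu = model_non_linear[0](x)   # output of the first linear layer
    after_relu = model_non_linear[:2](x)   # same values after passing through ReLU
    all_non_negative = bool((after_relu >= 0).all())
    matches_relu = torch.allclose(after_relu, torch.relu(before_relu))

print(before_relu.shape, after_relu.shape)  # torch.Size([4, 5]) torch.Size([4, 5])
print(all_non_negative)  # True
print(matches_relu)      # True
```

The shapes stay the same; only the negative entries change, and the second linear layer then works on these filtered values.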

Let’s see the impact of fitting this architecture on our data:

Cool! Although we see some of the performance degrading after 800 epochs, this model doesn’t exhibit overfitting like the previous ones did. Keep in mind that our dataset is very small, so there’s a chance that our results are better just by chance. Nevertheless, adding activation functions to your `torch` models can have a huge impact in terms of performance, training and generalization, particularly when you have a lot of data to train on.

Now that you know the power of non-linear activation functions, it’s also relevant to know: