In an earlier post I walked through the Word2Vec algorithm and a reasonably bare-bones neural network. In this post, I’m going to walk through the implementation of another kind of neural network — CNNs, which are effective for image recognition and classification tasks, among other things. The Rust code referenced below, again, does not leverage any ML libraries and implements this network from scratch. The code referenced in this post can be found here on GitHub.

Convolutional Neural Networks are named after the mathematical concept of convolution, which, given two functions *f* and *g*, produces a third function describing how the shape of one is modified by the other, using the following integral:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

In the context of a CNN, the convolution operation refers to sliding small kernels of weights across an image’s pixels, multiplying each kernel element-wise with the pixels beneath it, and summing the products into a cell of an output feature map.
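The sliding-kernel operation above can be sketched in a few lines of Rust. This is a minimal, illustrative version (the function name and the row-major `Vec<Vec<f64>>` representation are my choices here, not code from the repository):

```rust
// Slide a small kernel across a grayscale image, computing a weighted sum
// at each position and storing it in an output feature map.
fn convolve(image: &[Vec<f64>], kernel: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (ih, iw) = (image.len(), image[0].len());
    let (kh, kw) = (kernel.len(), kernel[0].len());
    // "Valid" convolution: the kernel never slides past the image edge,
    // so the feature map is slightly smaller than the input.
    let (oh, ow) = (ih - kh + 1, iw - kw + 1);
    let mut out = vec![vec![0.0; ow]; oh];
    for y in 0..oh {
        for x in 0..ow {
            let mut sum = 0.0;
            for ky in 0..kh {
                for kx in 0..kw {
                    sum += image[y + ky][x + kx] * kernel[ky][kx];
                }
            }
            out[y][x] = sum;
        }
    }
    out
}

fn main() {
    // A 3×3 image convolved with a 2×2 averaging kernel yields a 2×2 map.
    let image = vec![
        vec![1.0, 2.0, 3.0],
        vec![4.0, 5.0, 6.0],
        vec![7.0, 8.0, 9.0],
    ];
    let kernel = vec![vec![0.25, 0.25], vec![0.25, 0.25]];
    println!("{:?}", convolve(&image, &kernel)); // [[3.0, 4.0], [6.0, 7.0]]
}
```

Note the output shrinks from 3×3 to 2×2: a `k`×`k` kernel over an `n`×`n` image produces an `(n−k+1)`×`(n−k+1)` feature map.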

Typical CNNs are variants of feed-forward neural nets, so most of the rest of the algorithm is what you might expect (I walk through this with examples here). For a high-level walkthrough of the model architecture, I found this Towards Data Science post and this one to be great resources, and this paper covers it in more depth.

The network I’ll describe below was built to learn from the MNIST Database, whose training and test data consist of hand-written digits collected from United States Census Bureau employees and high-school students. These digits are packed into 70,000 28×28-pixel grayscale images. After around five training epochs, this network can achieve 98%+ predictive accuracy on handwritten digits from the MNIST dataset, though I initially couldn’t get better than 80% accuracy on my own handwriting. More on that later!

## What you’ll find below

Generally speaking, there are three kinds of layers in a Convolutional Neural Network, though there may be multiple layers of each type, or additional layer types, depending on the application:

- Convolutional layer: applies a series of kernels (also called filters) to create feature maps for the input data which capture patterns or textures for use in other layers
- Pooling (usually “max” pooling) layer: extracts the most significant features from input regions, and reduces the dimensionality of the processing space
- Fully-connected layer: learns the likely combinations of input data, and in conjunction with a softmax activation function, solves for multi-class classification problems
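To make the second layer type concrete, here is a minimal sketch of 2×2 max pooling: each output cell keeps only the largest value in its region, halving both dimensions of the feature map. The function name and types here are illustrative, not taken from the repository:

```rust
// 2×2 max pooling: keep the largest value in each non-overlapping 2×2
// region, shrinking the feature map by half in each dimension.
fn max_pool_2x2(map: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (oh, ow) = (map.len() / 2, map[0].len() / 2);
    let mut out = vec![vec![f64::NEG_INFINITY; ow]; oh];
    for y in 0..oh {
        for x in 0..ow {
            for dy in 0..2 {
                for dx in 0..2 {
                    let v = map[2 * y + dy][2 * x + dx];
                    if v > out[y][x] {
                        out[y][x] = v;
                    }
                }
            }
        }
    }
    out
}

fn main() {
    let map = vec![
        vec![1.0, 3.0, 2.0, 4.0],
        vec![5.0, 6.0, 8.0, 7.0],
        vec![3.0, 2.0, 1.0, 0.0],
        vec![1.0, 2.0, 3.0, 4.0],
    ];
    println!("{:?}", max_pool_2x2(&map)); // [[6.0, 8.0], [3.0, 4.0]]
}
```

Because only the maximum in each region survives, pooling both reduces the amount of computation downstream layers have to do and makes the features somewhat robust to small translations of the input.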

This neural network extracts *patches* from the images it processes. Each patch captures the spatial relationships (meaning, it preserves the two-dimensional layout) between a subset of pixels within the image, and the model uses this pixel subset as a feature.
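Patch extraction might look something like the following sketch: slide a window over the image and collect each 2-D sub-grid along with its (row, column) origin, so the spatial layout is preserved. This is a hypothetical helper for illustration, not code from the repository:

```rust
// Collect every `size`×`size` patch from a row-major image, tagged with the
// (row, col) coordinate of its top-left corner.
fn extract_patches(image: &[Vec<f64>], size: usize) -> Vec<(usize, usize, Vec<Vec<f64>>)> {
    let mut patches = Vec::new();
    for y in 0..=(image.len() - size) {
        for x in 0..=(image[0].len() - size) {
            // Copy `size` rows of `size` pixels each, keeping the 2-D shape.
            let patch: Vec<Vec<f64>> = (0..size)
                .map(|dy| image[y + dy][x..x + size].to_vec())
                .collect();
            patches.push((y, x, patch));
        }
    }
    patches
}

fn main() {
    // A 28×28 MNIST-sized image yields (28−3+1)² = 676 patches of size 3×3.
    let image = vec![vec![0.0; 28]; 28];
    println!("{}", extract_patches(&image, 3).len()); // 676
}
```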