An autoencoder is a type of neural network that learns to reconstruct its input. It consists of an encoder network that compresses the input data into a low-dimensional space and a decoder network that reconstructs the input data from that space. The encoder and decoder are trained jointly to minimize the reconstruction error between the input data and its reconstruction.

Autoencoders can be used for various tasks such as data compression, denoising, feature extraction, anomaly detection, and generative modeling. They have applications in a wide range of fields such as computer vision, natural language processing, and speech recognition. Autoencoders can be also used for dimensionality reduction. In fact, one of the main purposes of autoencoders is to learn a compressed representation of the input data, which can be used as a form of dimensionality reduction.

In this article, we will discuss the math underlying autoencoders and see how they perform dimensionality reduction. We will also look at the relationship between autoencoders, principal component analysis (PCA), and singular value decomposition (SVD), and show how to implement both linear and non-linear autoencoders in PyTorch.

**Autoencoder architecture**

Figure 1 shows the architecture of an autoencoder. As mentioned before, an autoencoder learns to reconstruct its input data, hence the size of the input and output layers is always the same (*n*). Since the autoencoder learns to reconstruct its own input, it does not require labeled data for training; hence it is an unsupervised learning algorithm.

But what is the point of learning to reproduce the input data? As you see, the hidden layers in this architecture are shaped like a double-sided funnel: the number of neurons in each layer decreases as we move from the first hidden layer to a layer that is referred to as the *bottleneck layer*. This layer has the minimum number of neurons. The number of neurons then increases again from the bottleneck layer and ends with the output layer, which has the same number of neurons as the input layer. It is important to note that the number of neurons in the bottleneck layer is less than *n*.

In a neural network, each layer learns an abstract representation of the input space, so the bottleneck layer is indeed a bottleneck for the information that transfers between the input and output layers. This layer learns the most compact representation of the input data compared to other layers and also learns to extract the most important features of the input data. These new features (also called *latent variables*) are the result of the transformation of the input data points into a continuous lower-dimensional space. In fact, the latent variables can describe or explain the input data in a simpler way. The output of the neurons in the bottleneck layer represents the values of these latent variables.

The presence of a bottleneck layer is the key feature of this architecture. If all the layers in the network had the same number of neurons, the network could easily learn to memorize the input data values by passing them all along the network.

An autoencoder can be divided into two networks:

- Encoder network: It starts from the input layer and ends at the bottleneck layer. It transforms the high-dimensional input data into the low-dimensional space formed by the latent variables. The output of the neurons in the bottleneck layer represents the values of these latent variables.
- Decoder network: It starts after the bottleneck layer and ends at the output layer. It receives the values of the low-dimensional latent variables from the bottleneck layer and reconstructs the original high-dimensional input data from them.
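The encoder–decoder split can be sketched in PyTorch. This is a minimal illustration only; the layer sizes (8 input features, a bottleneck of 3) are arbitrary assumptions, not taken from a specific dataset:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features=8, bottleneck=3):
        super().__init__()
        # Encoder: compresses the input down to the bottleneck layer
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 5), nn.ReLU(),
            nn.Linear(5, bottleneck))
        # Decoder: reconstructs the input from the latent variables
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 5), nn.ReLU(),
            nn.Linear(5, n_features))

    def forward(self, x):
        z = self.encoder(x)      # latent variables (bottleneck output)
        return self.decoder(z)   # reconstruction, same shape as x

model = Autoencoder()
x = torch.randn(4, 8)            # a batch of 4 samples
x_hat = model(x)                 # reconstruction has the input's shape
```

Training would minimize a reconstruction loss such as `nn.MSELoss()` between `x` and `x_hat`.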

In this article, we are going to discuss the similarity between autoencoders and PCA. Understanding PCA requires some concepts from linear algebra, so we first review them.

**Linear algebra review: Basis, dimension, and rank**

A set of vectors {**v**₁, **v**₂, …, **v**_n} forms a *basis* for the vector space *V* if the vectors are linearly independent and span *V*. If a set of vectors is linearly independent, then no vector in the set can be written as a linear combination of the other vectors. A set of vectors {**v**₁, **v**₂, …, **v**_n} *spans* a vector space if every vector in that space can be written as a linear combination of this set. So, any vector **x** in *V* can be written as:

**x** = *a*₁**v**₁ + *a*₂**v**₂ + … + *a_n***v**_n

where *a*₁, *a*₂, …, *a_n* are some constants. The vector space *V* can have many different bases, but each basis always has the same number of vectors. The number of vectors in a basis of a vector space is called the *dimension* of that vector space. A basis {**v**₁, **v**₂, …, **v**_n} is *orthonormal* when all the vectors are normalized (the length of a normalized vector is 1) and orthogonal (mutually perpendicular). In Euclidean space R², the vectors:

**e**₁ = [1 0]ᵀ, **e**₂ = [0 1]ᵀ

form an orthonormal basis which is called the *standard basis*. They are linearly independent and span R². Since the basis has only two vectors, the dimension of R² is 2. Any other pair of vectors that are linearly independent and span R² can also be a basis for R². For example, the vectors [1 0]ᵀ and [1 1]ᵀ form a basis, but not an orthonormal basis, since they are not orthogonal. More generally, we can define the standard basis for R^*n* as:

**e**₁ = [1 0 … 0]ᵀ, **e**₂ = [0 1 … 0]ᵀ, …, **e**_n = [0 0 … 1]ᵀ

where in **e**ᵢ the *i*th element is one and all the other elements are zero.
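These properties are easy to check numerically. A small NumPy sketch (the non-orthogonal pair [1 0]ᵀ, [1 1]ᵀ is a hypothetical example):

```python
import numpy as np

# Standard basis of R^2
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Orthonormal: each vector has length 1 and they are perpendicular
assert np.isclose(np.linalg.norm(e1), 1.0)
assert np.isclose(e1 @ e2, 0.0)

# A hypothetical non-orthogonal pair that is still a basis of R^2,
# because the two vectors are linearly independent:
v1, v2 = np.array([1.0, 0.0]), np.array([1.0, 1.0])
rank = np.linalg.matrix_rank(np.column_stack([v1, v2]))  # 2 -> a basis
```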

Let the set of vectors *B* = {**v**₁, **v**₂, …, **v**_n} form a basis for a vector space. Then we can write any vector **x** in that space in terms of the basis vectors:

**x** = *c*₁**v**₁ + *c*₂**v**₂ + … + *c_n***v**_n

Hence the coordinates of **x** relative to this basis *B* can be written as:

[**x**]_B = [*c*₁ *c*₂ … *c_n*]ᵀ

In fact, when we define a vector in R² like **x** = [*x*₁ *x*₂]ᵀ, the elements of this vector are its coordinates relative to the standard basis:

**x** = *x*₁**e**₁ + *x*₂**e**₂

We can easily find the coordinates of a vector relative to another basis. Suppose that we have the coordinates of the vector **x** relative to the basis *B* = {**v**₁, **v**₂, …, **v**_n}:

[**x**]_B = [*c*₁ *c*₂ … *c_n*]ᵀ

Now we can write:

**x** = *c*₁**v**₁ + *c*₂**v**₂ + … + *c_n***v**_n = *P_B* [**x**]_B    (Equation 1)

Here *P_B* is called the *change-of-coordinate matrix*, and its columns are the vectors in basis *B*. Hence, if we have the coordinates of **x** relative to the basis *B*, we can calculate its coordinates relative to the standard basis using Equation 1. Figure 2 shows an example in which *B* = {**v**₁, **v**₂} is a basis for R², and the coordinates of a vector **x** relative to *B* are converted to its coordinates relative to the standard basis using Equation 1.
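A numerical sketch of the change-of-coordinate idea (the basis vectors below are hypothetical, not the ones in Figure 2):

```python
import numpy as np

# Hypothetical basis B = {v1, v2} for R^2
v1 = np.array([2.0, 1.0])
v2 = np.array([-1.0, 1.0])
P_B = np.column_stack([v1, v2])   # change-of-coordinate matrix

# Coordinates of x relative to B
x_B = np.array([3.0, 2.0])

# Coordinates relative to the standard basis: x = P_B [x]_B
x = P_B @ x_B                     # equals 3*v1 + 2*v2

# Going the other way: solve P_B [x]_B = x
x_B_back = np.linalg.solve(P_B, x)
```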

The *column space* of a matrix **A** (also written as *Col* **A**) is the set of all linear combinations of the columns of **A**. Suppose that we denote the columns of the matrix **A** by the vectors **a**₁, **a**₂, …, **a**_n. Now, for any vector **u**, **Au** can be written as:

**Au** = *u*₁**a**₁ + *u*₂**a**₂ + … + *u_n***a**_n

Hence, **Au** is a linear combination of the columns of **A**, and the column space of **A** is the set of all vectors that can be written as **Au**.

The *row space* of a matrix **A** is the set of all linear combinations of the rows of **A**. Suppose that we denote the rows of the matrix **A** by the vectors **a**₁ᵀ, **a**₂ᵀ, …, **a**_mᵀ. Then the row space of **A** is the set of all vectors that can be written as:

**u**ᵀ**A** = *u*₁**a**₁ᵀ + *u*₂**a**₂ᵀ + … + *u_m***a**_mᵀ

The number of basis vectors of *Col* **A**, or the dimension of *Col* **A**, is called the *rank* of **A**. The rank of **A** is also the maximum number of linearly independent columns of **A**. It can also be shown that the rank of a matrix is equal to the dimension of its row space, and similarly, that it is equal to the maximum number of linearly independent rows. Hence, the rank of a matrix cannot exceed the number of its rows or columns. For example, for an *m*×*n* matrix, the rank cannot be greater than min(*m*, *n*).
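A quick numerical sketch of these rank facts (the matrix below is a hypothetical example whose third column is a linear combination of the first two):

```python
import numpy as np

# Third column = col1 + 2*col2, so only two columns are
# linearly independent:
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 2.0],
              [1.0, 1.0, 3.0]])

rank = np.linalg.matrix_rank(A)   # dimension of Col A (and of the row space)
```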

**PCA: a review**

Principal component analysis (PCA) is a linear technique. It finds the directions in the data that capture the most variation and then projects the data onto a lower-dimensional subspace spanned by these directions. PCA is a widely used method for reducing the dimensionality of data.

PCA transforms the data into a new orthogonal coordinate system. This coordinate system is chosen such that the variance of the projected data points onto the first coordinate axis (called the *first principal component*) is maximized. The variance of the projected data points onto the second coordinate axis (called the *second principal component*) is maximized amongst all possible directions orthogonal to the first principal component, and more generally, the variance of the projected data points onto each coordinate axis is maximized amongst all possible directions orthogonal to the previous principal components.

Suppose that we have a dataset with *n* features and *m* data points or observations. We can use the *m*×*n* matrix

**X** = [**x**₁ **x**₂ … **x**_m]ᵀ    (Equation 2)

to represent this dataset, and we call it the *design matrix*. Hence each row of **X** represents a data point, and each column represents a feature. Each column vector **x**ᵢ represents an observation (or data point) in this dataset. Hence, we can think of our dataset as a set of *m* vectors in R^*n*. Figure 3 shows an example for *n*=2. Here we can plot each observation as a vector (or simply a point) in R².

Let **u** be a unit vector, so we have:

**u**ᵀ**u** = 1

The scalar projection of each data point **x**ᵢ onto the vector **u** is **x**ᵢᵀ**u**. Figure 4 shows an example for *n*=2.

We denote the mean of each column of **X** by *μ*ⱼ. Then the mean of the dataset is defined as the vector of column means:

**μ** = [*μ*₁ *μ*₂ … *μ_n*]ᵀ

And we can also write it as:

**μ** = (1/*m*) Σᵢ **x**ᵢ

Now the variance of the projected data points is defined as:

Var = (1/*m*) Σᵢ (**x**ᵢᵀ**u** − **μ**ᵀ**u**)²

This equation can be simplified further. The term **x**ᵢᵀ**u** − **μ**ᵀ**u** is a scalar (since the result of a dot product is a scalar quantity). Besides, we know that the transpose of a scalar quantity is equal to itself. So, we can write:

(**x**ᵢᵀ**u** − **μ**ᵀ**u**)² = **u**ᵀ(**x**ᵢ − **μ**)(**x**ᵢ − **μ**)ᵀ**u**

Hence the variance of the scalar projection of the data points in **X** onto the vector **u** can be written as:

Var = **u**ᵀ**Su**

where

**S** = (1/*m*) Σᵢ (**x**ᵢ − **μ**)(**x**ᵢ − **μ**)ᵀ

is called the *covariance matrix* (Figure 5).

By simplifying this definition, it can be shown that the elements of the covariance matrix can be written as:

*S*ᵢ,ⱼ = (1/*m*) Σₖ (*x*ₖ,ᵢ − *μ*ᵢ)(*x*ₖ,ⱼ − *μ*ⱼ)    (Equation 6)

where the sum runs over the *m* observations. Here *x*ₖ,ᵢ is the (*k*, *i*) element of the design matrix **X** (or simply the *i*th element of the vector **x**ₖ).

For a dataset with *n* features, the covariance matrix is an *n*×*n* matrix. In addition, based on the definition of *S*ᵢ,ⱼ in Equation 6, we have:

*S*ᵢ,ⱼ = *S*ⱼ,ᵢ

So, its (*i*, *j*) element is equal to its (*j*, *i*) element, which means that the covariance matrix is a symmetric matrix and is equal to its transpose:

**S** = **S**ᵀ

Now we find the vector **u**₁ that maximizes the variance of the projected data points:

**u**₁ᵀ**Su**₁

Since **u**₁ is a normalized vector, we add this constraint to the optimization problem:

**u**₁ᵀ**u**₁ = 1

We can solve this optimization problem by adding the Lagrange multiplier *λ*₁ and maximizing:

**u**₁ᵀ**Su**₁ + *λ*₁(1 − **u**₁ᵀ**u**₁)

To do that, we set the derivative of this term with respect to **u**₁ equal to zero, and we get:

**Su**₁ = *λ*₁**u**₁

This means that **u**₁ is an eigenvector of the covariance matrix **S**, and *λ*₁ is its corresponding eigenvalue. We call the eigenvector **u**₁ the first *principal component*. Next, we want to find the unit vector **u**₂ that maximizes **u**₂ᵀ**Su**₂ amongst all possible directions orthogonal to the first principal component. So, we need to find the vector **u**₂ that maximizes **u**₂ᵀ**Su**₂ with these constraints:

**u**₂ᵀ**u**₂ = 1,  **u**₂ᵀ**u**₁ = 0

It can be shown that **u**₂ is the solution of this equation:

**Su**₂ = *λ*₂**u**₂

So we conclude that **u**₂ is also an eigenvector of **S**, and *λ*₂ is its corresponding eigenvalue (proof is given in the appendix). More generally, we want to find the unit vector **u**ᵢ that maximizes **u**ᵢᵀ**Su**ᵢ amongst all possible directions orthogonal to the previous principal components **u**₁, …, **u**ᵢ₋₁. Hence, we need to find the vector **u**ᵢ that maximizes **u**ᵢᵀ**Su**ᵢ with these constraints:

**u**ᵢᵀ**u**ᵢ = 1,  **u**ᵢᵀ**u**ⱼ = 0 for *j* = 1, …, *i*−1

Again it can be shown that **u**ᵢ is the solution to this equation:

**Su**ᵢ = *λ*ᵢ**u**ᵢ

Hence **u**ᵢ is an eigenvector of **S**, and *λ*ᵢ is its corresponding eigenvalue (proof is given in the appendix). The vector **u**ᵢ is called the *i*th principal component. If we multiply the previous equation by **u**ᵢᵀ, we get:

**u**ᵢᵀ**Su**ᵢ = *λ*ᵢ**u**ᵢᵀ**u**ᵢ = *λ*ᵢ

Hence, we conclude that the variance of the scalar projection of the data points in **X** onto the eigenvector **u**ᵢ is equal to its corresponding eigenvalue.
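This fact can be verified numerically: the variance of the scalar projections onto an eigenvector of the covariance matrix equals its eigenvalue. A sketch with synthetic data (the distribution parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-feature dataset; rows are observations
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 2.0]], size=5000)
X = X - X.mean(axis=0)                 # center the data
m = len(X)

S = X.T @ X / m                        # covariance matrix of the centered data
eigvals, eigvecs = np.linalg.eigh(S)   # eigh is for symmetric matrices
order = np.argsort(eigvals)[::-1]      # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]                     # first principal component
proj = X @ u1                          # scalar projections of the data onto u1
var_u1 = proj.var()                    # equals the largest eigenvalue
```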

If we have a dataset with *n* features, the covariance matrix will be an *n*×*n* symmetric matrix. Here each data point can be represented by a vector in R^*n* (*x**ᵢ*). As mentioned before, the elements of a vector in R^*n* give its coordinates relative to the standard basis.

It can be shown that an *n*×*n* symmetric matrix has *n* real eigenvalues, and *n* linearly independent and orthogonal corresponding eigenvectors (spectral theorem). These *n* orthogonal eigenvectors are the principal components of this dataset. It can be shown that a set of *n* orthogonal vectors can form a basis for *R^n*. So, these principal components form an orthogonal basis and can be used to define a new coordinate system for the data points (Figure 6).

We can easily calculate the coordinates of each data point **x**ᵢ relative to this new coordinate system. Let *B* = {**v**₁, **v**₂, …, **v**_n} be the set of the principal components (normalized, so they form an orthonormal basis). We first write **x**ᵢ in terms of the basis vectors:

**x**ᵢ = *c*₁**v**₁ + *c*₂**v**₂ + … + *c_n***v**_n

Now if we multiply both sides of this equation by **v**ⱼᵀ, we have:

**v**ⱼᵀ**x**ᵢ = *c*₁**v**ⱼᵀ**v**₁ + … + *c_n***v**ⱼᵀ**v**_n

Since we have an orthonormal basis:

**v**ⱼᵀ**v**ₖ = 1 if *j* = *k*, and 0 otherwise

So, it follows that *c*ⱼ = **v**ⱼᵀ**x**ᵢ. Since the dot product is commutative, we can also write *c*ⱼ = **x**ᵢᵀ**v**ⱼ. Hence, the coordinates of **x**ᵢ relative to *B* are:

[**x**ᵢ]_B = [**x**ᵢᵀ**v**₁ **x**ᵢᵀ**v**₂ … **x**ᵢᵀ**v**_n]ᵀ

and the design matrix can be written as:

[**X**]_B = **XV**    (Equation 9)

in the new coordinate system, where the columns of **V** are the principal components. Here each row represents a data point (observation) in the new coordinate system. Figure 6 shows an example for *n*=2.

The variance of the scalar projection of data points onto each eigenvector (principal component) is equal to its corresponding eigenvalue. The first principal component has the greatest eigenvalue (variance). The second principal component has the second greatest eigenvalue and so on. Now we can choose the first *d* principal components and project the original data points on the subspace spanned by them.

So, we transform the original data points (with *n* features) to these projected data points that belong to a *d*-dimensional subspace. In this way, we reduce the dimensionality of the original dataset from *n* to *d* while maximizing the variance of the projected data. Now the first *d* columns of the matrix in Equation 9 give the coordinates of the projected data points:

[**x**ᵢᵀ**v**₁ **x**ᵢᵀ**v**₂ … **x**ᵢᵀ**v**_d]ᵀ
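This reduction step can be sketched with synthetic data: the coordinates in the principal-component system are the rows of **XV**, and keeping the first *d* columns reduces the dimensionality (the array shapes and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)            # centered design matrix

S = X.T @ X / len(X)              # covariance matrix
eigvals, V = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
V = V[:, order]                   # columns are the principal components

coords = X @ V                    # coordinates in the new system
d = 2
reduced = coords[:, :d]           # keep only the first d principal components
```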

Figure 7 gives an example of this transformation. The original dataset has 3 features (*n*=3), and we reduce its dimensionality to *d*=2 by projecting the data points onto the plane formed by the first two principal components (**v**₁ and **v**₂). The coordinates of each data point **x**ᵢ in the subspace spanned by **v**₁ and **v**₂ are:

[**x**ᵢᵀ**v**₁ **x**ᵢᵀ**v**₂]ᵀ

It is usual to *center* the dataset around zero before the PCA analysis. To do that, we first create the design matrix **X** that represents our dataset (Equation 2). Then we create a new matrix **Y** by subtracting the mean of each column from the elements in that column:

*y*ᵢ,ⱼ = *x*ᵢ,ⱼ − *μ*ⱼ

The matrix **Y** represents the centered dataset. In this new matrix, the mean of each column is zero, so the mean of the dataset is also zero:

**μ** = **0**

Now suppose that we start with a centered design matrix **X** and want to calculate its covariance matrix. Since the mean of each column of **X** is zero, from Equation 6 we have:

*S*ᵢ,ⱼ = (1/*m*) Σₖ *x*ₖ,ᵢ *x*ₖ,ⱼ = (1/*m*) Σₖ [**X**ᵀ]ᵢ,ₖ [**X**]ₖ,ⱼ

where [**X**]ₖ,ⱼ denotes the (*k*, *j*) element of the matrix **X**. By using the definition of matrix multiplication, we get:

**S** = (1/*m*) **X**ᵀ**X**    (Equation 10)

Please note that this equation is only valid when the design matrix (**X**) is centered.
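This formula is easy to verify: for a centered design matrix, **X**ᵀ**X**/*m* matches NumPy's biased covariance estimate. A small check with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
X = X - X.mean(axis=0)            # the formula requires a centered X

S = X.T @ X / len(X)              # S = X^T X / m

# np.cov with rowvar=False treats rows as observations;
# bias=True divides by m instead of m - 1
S_np = np.cov(X, rowvar=False, bias=True)
```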

**The relation between PCA and singular value decomposition (SVD)**

Suppose that **A** is an *m*×*n* matrix. Then **A**ᵀ**A** will be a square *n*×*n* matrix, and it can be easily shown that it is a symmetric matrix. Since **A**ᵀ**A** is symmetric, it has *n* real eigenvalues and *n* linearly independent and orthogonal eigenvectors (spectral theorem). We call these eigenvectors **v**₁, **v**₂, …, **v**_n, and we assume they are normalized. It can be shown that the eigenvalues of **A**ᵀ**A** are all non-negative. Now assume that we label them in decreasing order, so:

*λ*₁ ≥ *λ*₂ ≥ … ≥ *λ_n* ≥ 0

Let **v**₁, **v**₂, …, **v**_n be the eigenvectors of **A**ᵀ**A** corresponding to these eigenvalues. We define the *singular values* of the matrix **A** (denoted by *σ*ᵢ) as the square roots of these eigenvalues. So we have:

*σ*ᵢ = √*λ*ᵢ

Now suppose that the rank of **A** is *r*. Then it can be shown that the number of nonzero eigenvalues of **A**ᵀ**A**, or the number of nonzero singular values of **A**, is *r*:

*σ*₁ ≥ *σ*₂ ≥ … ≥ *σ_r* > 0,  *σ*ᵣ₊₁ = … = *σ_n* = 0

Now the singular value decomposition (SVD) of **A** can be written as:

**A** = **UΣV**ᵀ

Here **V** is an *n*×*n* matrix and its columns are the vectors **v**ᵢ:

**V** = [**v**₁ **v**₂ … **v**_n]

**Σ** is an *m*×*n* diagonal matrix, and all the elements of **Σ** are zero except the first *r* diagonal elements, which are equal to the singular values of **A**. We define the matrix **U** as:

**U** = [**u**₁ **u**₂ … **u**_m]

We define **u**₁ to **u**_r as:

**u**ᵢ = (1/*σ*ᵢ) **Av**ᵢ    (Equation 11)

We can easily show that these vectors are orthogonal. For *i* ≠ *j*:

**u**ᵢᵀ**u**ⱼ = (1/(*σ*ᵢ*σ*ⱼ)) **v**ᵢᵀ**A**ᵀ**Av**ⱼ = (*λ*ⱼ/(*σ*ᵢ*σ*ⱼ)) **v**ᵢᵀ**v**ⱼ = 0

Here we used the fact that **v**ⱼ is an eigenvector of **A**ᵀ**A** and that these eigenvectors are orthogonal. Since these vectors are orthogonal, they are also linearly independent. The other **u**ᵢ vectors (*r* < *i* ≤ *m*) are defined in a way that **u**₁, **u**₂, …, **u**_m form a basis for an *m*-dimensional vector space (R^*m*).
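These definitions can be checked against NumPy's SVD: the singular values are the square roots of the eigenvalues of **A**ᵀ**A**, and **u**ᵢ = **Av**ᵢ/*σ*ᵢ (the matrix below is a random example):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A)       # A = U @ Sigma @ V^T

# Singular values = square roots of the eigenvalues of A^T A
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Equation 11 for i = 1: u_1 = A v_1 / sigma_1
u1 = A @ Vt[0] / s[0]
```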

Let **X** be a centered design matrix with the following SVD:

**X** = **UΣV**ᵀ    (Equation 12)

As mentioned before, **v**₁, **v**₂, …, **v**_n are the eigenvectors of **X**ᵀ**X**, and the singular values are the square roots of their corresponding eigenvalues. Hence, we have:

**X**ᵀ**Xv**ᵢ = *σ*ᵢ²**v**ᵢ

Now we can divide both sides of the previous equation by *m* (where *m* is the number of data points) and use Equation 10 to get:

**Sv**ᵢ = (*σ*ᵢ²/*m*)**v**ᵢ

Hence, it follows that **v**ᵢ is an eigenvector of the covariance matrix, and its corresponding eigenvalue is the square of its corresponding singular value divided by *m*. So, the matrix **V** in the SVD equation gives the principal components of **X**, and using the singular values in **Σ**, we can easily calculate the eigenvalues. In summary, we can use SVD to do PCA.

Let's see what else we can get from the SVD equation. We can simplify **UΣ** in Equation 12 using Equations 3 and 11:

**UΣ** = **XV**

Comparing with Equation 9, we conclude that the *i*th row of **UΣ** gives the coordinates of the data point **x**ᵢ relative to the basis defined by the principal components.
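The identity **UΣ** = **XV** is easy to confirm numerically on a centered design matrix (the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)            # centered design matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The i-th row of U @ diag(s) gives the coordinates of x_i
# relative to the principal-component basis
coords = U * s                    # same as U @ np.diag(s)
```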

Now suppose that in Equation 12 we only keep the first *k* columns of **U**, the first *k* rows of **V**ᵀ, and the first *k* rows and columns of **Σ**. If we multiply them together, we get:

**X**ₖ = **U**ₖ**Σ**ₖ**V**ₖᵀ    (Equation 13)

Please note that **X**ₖ is still an *m*×*n* matrix. If we multiply **X**ₖ by a vector **b** which has *n* elements, we get:

**X**ₖ**b** = **U**ₖ(**Σ**ₖ**V**ₖᵀ**b**) = [**Cb**]₁**u**₁ + [**Cb**]₂**u**₂ + … + [**Cb**]ₖ**u**ₖ

where **C** = **Σ**ₖ**V**ₖᵀ and [**Cb**]ᵢ is the *i*th element of the vector **Cb**. Since **u**₁, **u**₂, …, **u**ₖ are linearly independent vectors (remember that they form a basis, so they should be linearly independent) and they span all the vectors of the form **X**ₖ**b**, we conclude that they form a basis for the column space of **X**ₖ. This basis has *k* vectors, so the dimension of the column space of **X**ₖ is *k*. Hence **X**ₖ is a rank-*k* matrix.

But what does **X**ₖ represent? Using Equation 13 we can write:

**X**ₖ = **U**ₖ**Σ**ₖ**V**ₖᵀ = (**XV**ₖ)**V**ₖᵀ

So, the *i*th row of **X**ₖ is the transpose of:

**x̃**ᵢ = (**x**ᵢᵀ**v**₁)**v**₁ + (**x**ᵢᵀ**v**₂)**v**₂ + … + (**x**ᵢᵀ**v**ₖ)**v**ₖ

which is the vector projection of the data point **x**ᵢ onto the subspace spanned by the principal components **v**₁, **v**₂, …, **v**ₖ. Remember that **v**₁, **v**₂, …, **v**_n is a basis for our original dataset. In addition, the coordinates of **x**ᵢ relative to this basis are:

[**x**ᵢ]_B = [**x**ᵢᵀ**v**₁ … **x**ᵢᵀ**v**_n]ᵀ

Hence, using Equation 1, we can write **x**ᵢ as:

**x**ᵢ = (**x**ᵢᵀ**v**₁)**v**₁ + … + (**x**ᵢᵀ**v**_n)**v**_n

Now we can decompose **x**ᵢ into two vectors: one in the subspace defined by **v**₁, **v**₂, …, **v**ₖ, and the other in the subspace defined by the remaining vectors. The first vector is the result of the projection of **x**ᵢ onto the subspace defined by **v**₁, **v**₂, …, **v**ₖ and is equal to **x̃**ᵢ.

Remember that each row of the design matrix **X** represents one of the original data points. Similarly, each row of **X**ₖ represents the same data point projected onto the subspace spanned by the principal components **v**₁, **v**₂, …, **v**ₖ (Figure 8).

Now we can calculate the distance between the original data point (**x**ᵢ) and the projected data point (**x̃**ᵢ). The square of the distance between the vectors **x**ᵢ and **x̃**ᵢ is:

*d*ᵢ² = ||**x**ᵢ − **x̃**ᵢ||²

And if we add the squares of the distances for all the data points, we get:

Σᵢ ||**x**ᵢ − **x̃**ᵢ||²

The Frobenius norm of an *m*×*n* matrix **C** is defined as:

||**C**||_F = √(Σᵢ Σⱼ *c*ᵢ,ⱼ²)

Since the vectors **x**ᵢ and **x̃**ᵢ are the transposes of the rows of the matrices **X** and **X**ₖ, we can write:

||**X** − **X**ₖ||_F² = Σᵢ ||**x**ᵢ − **x̃**ᵢ||²

Hence the Frobenius norm of **X** − **X**ₖ measures the total distance between the original data points and the projected data points (Figure 9), and as the projected data points get closer to the original data points, ||**X** − **X**ₖ||_F decreases.

We want the projected points to be a good approximation of the original data points, so we want **X**ₖ to give the lowest value of ||**X** − **X**ₖ||_F among all the rank-*k* matrices.

Suppose that we have the *m*×*n* matrix **X** with rank *r*, and the singular values of **X** are sorted, so we have:

*σ*₁ ≥ *σ*₂ ≥ … ≥ *σ_r* > 0

It can be shown that **X**ₖ minimizes the Frobenius norm of **X** − **A** among all the *m*×*n* matrices **A** that have a rank of *k*. Mathematically:

**X**ₖ = argmin ||**X** − **A**||_F  subject to rank(**A**) = *k*

**X**ₖ is the closest matrix to **X** among all the rank-*k* matrices and can be considered the best rank-*k* approximation of the design matrix **X**. This also means that the projected data points represented by **X**ₖ are the best rank-*k* approximation (in terms of the total error) of the original data points represented by **X**.
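This best rank-*k* approximation can be sketched with a truncated SVD; for this choice of **X**ₖ the Frobenius error equals the square root of the sum of the discarded squared singular values (the matrix below is random):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 5))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

def rank_k_approx(k):
    # Keep the first k columns of U, k singular values, k rows of V^T
    return (U[:, :k] * s[:k]) @ Vt[:k]

k = 2
Xk = rank_k_approx(k)
err = np.linalg.norm(X - Xk, 'fro')   # error of the best rank-k approximation
```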

Now we can write the previous equation in a different format. Suppose that **Z** is an *m*×*k* matrix and **W** is a *k*×*n* matrix. We can show that minimizing ||**X** − **A**||_F over all rank-*k* matrices **A** is equivalent to minimizing ||**X** − **ZW**||_F over all such matrices **Z** and **W** (proof is given in the appendix). Therefore, we can write:

**Z***, **W*** = argmin ||**X** − **ZW**||_F    (Equation 18)

where **Z*** (an *m*×*k* matrix) and **W*** (a *k*×*n* matrix) are the solutions to the minimization problem, and we have:

**X**ₖ = **Z*** **W***

Now, based on Equations 13 and 14, we can write:

**X**ₖ = (**U**ₖ**Σ**ₖ)(**V**ₖᵀ)

So, if we solve the minimization problem in Equation 18 using SVD, we get the following values for **Z*** and **W***:

**Z*** = **U**ₖ**Σ**ₖ

and

**W*** = **V**ₖᵀ

The rows of **W*** give the transposes of the principal components, and the rows of **Z*** give the transposes of the coordinates of each projected data point relative to the basis formed by these principal components. It is important to note that the principal components form an orthonormal basis (so the principal components are both normalized and orthogonal). In fact, we can assume that PCA only looks for a matrix **W** in which the rows form an orthonormal set. We know that when two vectors are orthogonal, their inner product is zero, so we can say that PCA (or SVD) solves the minimization problem:

**Z***, **W*** = argmin ||**X** − **ZW**||_F    (Equation 20)

with this constraint:

**WW**ᵀ = **I**    (Equation 21)

where **Z** and **W** are *m*×*k* and *k*×*n* matrices. In addition, if **Z*** and **W*** are the solutions to the minimization problem, then we have:

**X**ₖ = **Z*** **W***

This formulation is very important since it allows us to establish a connection between PCA and autoencoders.
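With these definitions we can recover **Z*** and **W*** directly from the truncated SVD and check the stated properties (synthetic data; `k` is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 4))
X = X - X.mean(axis=0)            # centered design matrix

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)

Z_star = U[:, :k] * s[:k]         # m x k: coordinates of the projected points
W_star = Vt[:k]                   # k x n: rows are the principal components

Xk = Z_star @ W_star              # best rank-k approximation of X
```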

**The relation between PCA and autoencoders**

We start with an autoencoder which only has three layers, shown in Figure 10. This network has *n* input features denoted by *x*₁…*x_n* and *n* neurons at the output layer. The outputs of the network are denoted by *x̂*₁…*x̂_n*. The hidden layer has *k* neurons (where *k* < *n*), and the outputs of the hidden layer are denoted by *z*₁…*z*ₖ. The matrices **W**^[1] and **W**^[2] contain the weights of the hidden layer and output layer respectively. Here

*w*ᵢ,ⱼ^[*l*]

represents the weight for the *j*th input (coming from the *j*th neuron in layer *l*−1) of the *i*th neuron in layer *l* (Figure 11). Here we assume that for the hidden layer *l*=1, and for the output layer *l*=2.

Hence the weights of the hidden layer are given by the *k*×*n* matrix **W**^[1] whose (*i*, *j*) element is *w*ᵢ,ⱼ^[1], and the *i*th row of this matrix gives all the weights of the *i*th neuron in the hidden layer. Similarly, the weights of the output layer are given by the *n*×*k* matrix **W**^[2].

Each neuron has an activation function. We can calculate the output of a neuron in the hidden layer (the activation of that neuron) using the weight matrix **W**^[1] and the input features:

*z*ᵢ = *g*^[1](Σⱼ *w*ᵢ,ⱼ^[1] *x*ⱼ + *b*ᵢ^[1])

where *b*ᵢ^[1] is the bias for the *i*th neuron, and *g*^[1] is the activation function of the neurons in layer 1. We can write this equation in vectorized form as:

**z** = *g*^[1](**W**^[1]**x** + **b**^[1])

where **b**^[1] is the vector of biases and **x** is the vector of input features. Similarly, we can write the output of the *i*th neuron in the output layer as:

*x̂*ᵢ = *g*^[2](Σⱼ *w*ᵢ,ⱼ^[2] *z*ⱼ + *b*ᵢ^[2])

And in vectorized form, it becomes:

**x̂** = *g*^[2](**W**^[2]**z** + **b**^[2])

Now suppose that we use the design matrix **X** of Equation 2 as a training dataset to train this network. Hence the training dataset has *m* observations (examples) and *n* features. Remember that the *i*th observation is represented by the vector **x**ᵢ. If we feed this vector into the network, the output of the network is denoted by the vector **x̂**ᵢ.

We also need to make the following assumptions to make sure that the autoencoder mimics PCA:

1-The training dataset is centered, so the mean of each column of **X** is zero.

2-The activation functions of the hidden and output layers are linear, and the bias of all neurons is zero. This means that we are using a linear encoder and decoder in this network. Hence, we have:

**z** = **W**^[1]**x**,  **x̂** = **W**^[2]**z**

3-We use the quadratic loss function to train this network. Hence the cost function is the mean squared error (MSE) and is defined as:

*J* = (1/(2*m*)) Σᵢ ||**x**ᵢ − **x̂**ᵢ||²

Now we can show that:

*J* = (1/(2*m*)) ||**X** − **ZW**||_F²

where **Z** is defined as:

**Z** = **X**(**W**^[1])ᵀ

The proof is given in the appendix. Here the *i*th row of **Z** gives the output of the hidden layer when the *i*th observation is fed into the network. Hence, minimizing the cost function of this network is the same as minimizing:

||**X** − **ZW**||_F

where we define the matrix **W** as:

**W** = (**W**^[2])ᵀ

Please note that each row of **W** is a column of **W**^[2].

We know that if we multiply a function by a positive constant, its minimizer does not change. So, we can remove the multiplier 1/(2*m*) when we minimize the cost function. Hence, by training this network, we are solving this minimization problem:

**Z***, **W*** = argmin ||**X** − **ZW**||_F

where **Z** and **W** are *m*×*k* and *k*×*n* matrices. If we compare this equation with Equation 20, we see that it is the same minimization problem as in PCA. Hence the solution should be the same as that of Equation 20:

**Z*** **W*** = **X**ₖ

However, we have an important difference here: the constraint of Equation 21 is not applied. So here is the question. Are the optimum values of **Z** and **W** found by the autoencoder the same as those of PCA? Should the rows of **W*** always form an orthogonal set?

First, let's expand the previous equation. We know that the *i*th row of **X**ₖ is the transpose of:

**x̃**ᵢ = (**x**ᵢᵀ**v**₁)**v**₁ + … + (**x**ᵢᵀ**v**ₖ)**v**ₖ

which is the vector projection of the data point **x**ᵢ onto the subspace spanned by the principal components **v**₁, **v**₂, …, **v**ₖ. Hence, **x̃**ᵢ belongs to a *k*-dimensional subspace. The vectors **w**₁, **w**₂, …, **w**ₖ (the transposes of the rows of **W***) should be linearly independent. Otherwise, the rank of **W*** would be less than *k* (remember that the rank of **W*** is equal to the maximum number of linearly independent rows of **W***), and based on Equation A.3 the rank of **X**ₖ would be less than *k*. It can be shown that a set of *k* linearly independent vectors forms a basis for a *k*-dimensional subspace. Hence, we conclude that the vectors **w**₁, **w**₂, …, **w**ₖ also form a basis for the same subspace spanned by the principal components. We can now use Equation 24 to write the *i*th row of **X**ₖ in terms of the vectors **w**₁, **w**₂, …, **w**ₖ:

**x̃**ᵢ = *z*ᵢ,₁**w**₁ + *z*ᵢ,₂**w**₂ + … + *z*ᵢ,ₖ**w**ₖ

This means that the *i*th row of **Z*** simply gives the coordinates of **x̃**ᵢ relative to the basis formed by the vectors **w**₁, **w**₂, …, **w**ₖ. Figure 12 shows an example for *k*=2.

In summary, the matrices **Z*** and **W*** found by the autoencoder can generate the same subspace spanned by the principal components. We also get the same projected data points as PCA since:

**Z*** **W*** = **X**ₖ

However, these matrices define a new basis for that subspace. Unlike the principal components found by PCA, the vectors of this new basis are not necessarily orthogonal. The rows of **W*** give the transposes of the vectors of the new basis, and the rows of **Z*** give the transposes of the coordinates of each data point relative to that basis.

So we conclude that a linear autoencoder cannot find the principal components themselves, but it can find the subspace spanned by them using a different basis. There is one exception here. Suppose that we only want to keep the first principal component **v**₁, so we want to reduce the dimensionality of the original dataset from *n* to 1. In this case, the subspace is just a straight line spanned by the first principal component. A linear autoencoder will also find the same line, with a different basis vector **w**₁. This basis vector is not necessarily normalized and might have the opposite direction of **v**₁, but it is still on the same line (subspace). This is demonstrated in Figure 13. Now, if we normalize **w**₁, we get the first principal component of the dataset (up to its sign). So in such a case, a linear autoencoder is able to find the first principal component indirectly.

So far, we have discussed the theory underlying autoencoders and PCA. Now let's see an example in Python. In the next section, we will create an autoencoder using PyTorch and compare it with PCA.

**Case study: PCA vs autoencoder**

We first need to create a dataset. Listing 1 creates a simple dataset with 3 features. The first two features (*x*₁ and *x*₂) have a 2d multivariate normal distribution, and the 3rd feature (*x*₃) is equal to half of *x*₂. This dataset is stored in the array `X`, which plays the role of the design matrix. We also center the design matrix.

```python
# Listing 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from scipy.stats import multivariate_normal
import torch
import torch.nn as nn
from numpy import linalg as LA
from sklearn.preprocessing import MinMaxScaler
import random
%matplotlib inline

np.random.seed(1)
mu = [0, 0]
Sigma = [[1, 1],
         [1, 2.5]]
# X is the design matrix and each row of X is an example
X = np.random.multivariate_normal(mu, Sigma, 10000)
X = np.concatenate([X, X[:, 0].reshape(len(X), 1)], axis=1)
X[:, 2] = X[:, 1] / 2
X = X - X.mean(axis=0)
x, y, z = X.T
```

Listing 2 creates a 3d plot of this dataset, and the result is shown in Figure 14.

```python
# Listing 2
fig = plt.figure(figsize=(10, 10))
ax1 = fig.add_subplot(111, projection='3d')
ax1.scatter(x, y, z, color='blue')
ax1.view_init(20, 185)
ax1.set_xlabel("$x_1$", fontsize=20)
ax1.set_ylabel("$x_2$", fontsize=20)
ax1.set_zlabel("$x_3$", fontsize=20)
ax1.set_xlim([-5, 5])
ax1.set_ylim([-7, 7])
ax1.set_zlim([-4, 4])
plt.show()
```

As you see, this dataset lies on the plane represented by *x*₃ = *x*₂/2. Now we start the PCA analysis.

```python
pca = PCA(n_components=3)
pca.fit(X)
```

We can easily get the principal components (the eigenvectors of the covariance matrix of `X`) using the `components_` field. It returns an array in which each row represents one of the principal components.

```python
# Each row gives one of the principal components (eigenvectors)
pca.components_
```

```
array([[-0.38830581, -0.824242  , -0.412121  ],
       [-0.92153057,  0.34731128,  0.17365564],
       [ 0.        , -0.4472136 ,  0.89442719]])
```

We can also see their corresponding eigenvalues using the `explained_variance_`

field. Remember that the variance of the scalar projection of data points onto the eigenvector *u**ᵢ* is equal to its corresponding eigenvalue.

```python
pca.explained_variance_
```

```
array([3.64826952e+00, 5.13762062e-01, 3.20547162e-32])
```
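We can verify this relationship numerically. The sketch below re-creates the dataset and the fitted PCA from the earlier listings so that it is self-contained:

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(1)
# Re-create the centered dataset from Listing 1
X = np.random.multivariate_normal([0, 0], [[1, 1], [1, 2.5]], 10000)
X = np.column_stack([X, X[:, 1] / 2])
X = X - X.mean(axis=0)

pca = PCA(n_components=3).fit(X)

# Scalar projections of the data points onto the first eigenvector
proj = X @ pca.components_[0]

# Their sample variance matches the first eigenvalue
print(proj.var(ddof=1), pca.explained_variance_[0])
```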

Please note that the eigenvalues are sorted in descending order, so the first row of `pca.components_` gives the first principal component. Listing 3 plots the principal components alongside the data points (Figure 15).

```python
# Listing 3
v1 = pca.components_[0]
v2 = pca.components_[1]
v3 = pca.components_[2]

fig = plt.figure(figsize=(10, 10))
ax1 = fig.add_subplot(111, projection='3d')
ax1.scatter(x, y, z, color='blue', alpha=0.1)
ax1.plot([0, v1[0]], [0, v1[1]], [0, v1[2]],
         color="black", zorder=6)
ax1.plot([0, v2[0]], [0, v2[1]], [0, v2[2]],
         color="black", zorder=6)
ax1.plot([0, v3[0]], [0, v3[1]], [0, v3[2]],
         color="black", zorder=6)
ax1.plot([0, 7*v1[0]], [0, 7*v1[1]], [0, 7*v1[2]],
         color="gray", zorder=5)
ax1.plot([0, 5*v2[0]], [0, 5*v2[1]], [0, 5*v2[2]],
         color="gray", zorder=5)
ax1.plot([0, 3*v3[0]], [0, 3*v3[1]], [0, 3*v3[2]],
         color="gray", zorder=5)
ax1.text(v1[0], v1[1]-0.2, v1[2], r"$\mathregular{v}_1$",
         fontsize=20, color='red', weight="bold",
         style="italic", zorder=9)
ax1.text(v2[0], v2[1]+1.3, v2[2], r"$\mathregular{v}_2$",
         fontsize=20, color='red', weight="bold",
         style="italic", zorder=9)
ax1.text(v3[0], v3[1], v3[2], r"$\mathregular{v}_3$", fontsize=20,
         color='red', weight="bold", style="italic", zorder=9)
ax1.view_init(20, 185)
ax1.set_xlabel("$x_1$", fontsize=20, zorder=2)
ax1.set_ylabel("$x_2$", fontsize=20)
ax1.set_zlabel("$x_3$", fontsize=20)
ax1.set_xlim([-5, 5])
ax1.set_ylim([-7, 7])
ax1.set_zlim([-4, 4])
plt.show()
```

Please also note that the 3rd eigenvalue is almost zero. That is because the dataset lies on a 2d plane (*x*₃ = *x*₂/2), and as Figure 15 shows, it has no variance along **v**₃. We can use the `transform()` method to get the coordinates of each data point relative to the new coordinate system defined by the principal components. Each row of the array returned by `transform()` gives the coordinates of one of the data points.

```python
# Listing 4
# Z* = UΣ
pca.transform(X)
```

```
array([[ 3.09698570e+00, -3.75386182e-01, -2.06378618e-17],
       [-9.49162774e-01, -7.96300950e-01, -5.13280752e-18],
       [ 1.79290419e+00, -1.62352748e+00,  2.41135694e-18],
       ...,
       [ 2.14708946e+00, -6.35303400e-01,  4.34271577e-17],
       [ 1.25724271e+00,  1.76475781e+00, -1.18976523e-17],
       [ 1.64921984e+00, -3.71612351e-02, -5.03148111e-17]])
```
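For a centered dataset, `transform()` is simply a matrix product between the data and the principal components. The self-contained sketch below (which re-creates the dataset from Listing 1) confirms this:

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(1)
# Re-create the centered dataset from Listing 1
X = np.random.multivariate_normal([0, 0], [[1, 1], [1, 2.5]], 10000)
X = np.column_stack([X, X[:, 1] / 2])
X = X - X.mean(axis=0)

pca = PCA(n_components=3).fit(X)

# For centered data, transform() is just the projection X @ V,
# where the columns of V are the principal components
Z = X @ pca.components_.T
print(np.allclose(Z, pca.transform(X)))  # True
```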

Now we can choose the first 2 principal components and project the original data points onto the subspace spanned by them. So, we transform the original data points (with 3 features) into projected data points that belong to a 2-dimensional subspace. To do that, we only need to drop the 3rd column of the array returned by `pca.transform(X)`. This means that we reduce the dimensionality of the original dataset from 3 to 2 while maximizing the variance of the projected data. Listing 5 plots this 2d dataset, and the result is shown in Figure 16.

```python
# Listing 5
fig = plt.figure(figsize=(8, 6))
plt.scatter(pca.transform(X)[:, 0], pca.transform(X)[:, 1])
plt.axis('equal')
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')
plt.xlabel("$v_1$", fontsize=20)
plt.ylabel("$v_2$", fontsize=20)
plt.xlim([-8.5, 8.5])
plt.ylim([-4, 4])
plt.show()
```

We could also get the same results using SVD. Listing 6 uses the `svd()` function in `numpy` to do the singular value decomposition of `X`.

```python
# Listing 6
U, s, VT = LA.svd(X)
print("U=", np.round(U, 4))
print("Diagonal elements of Σ=", np.round(s, 4))
print("V^T=", np.round(VT, 4))
```

```
U= [[ 1.620e-02 -5.200e-03  1.130e-02 ... -2.800e-03 -2.100e-02 -6.200e-03]
 [-5.000e-03 -1.110e-02  9.895e-01 ...  1.500e-03 -3.000e-04  1.100e-03]
 [ 9.400e-03 -2.270e-02  5.000e-04 ... -1.570e-02  1.510e-02 -7.100e-03]
 ...
 [ 1.120e-02 -8.900e-03 -1.800e-03 ...  9.998e-01  2.000e-04 -1.000e-04]
 [ 6.600e-03  2.460e-02  1.100e-03 ...  1.000e-04  9.993e-01 -0.000e+00]
 [ 8.600e-03 -5.000e-04 -1.100e-03 ... -1.000e-04 -0.000e+00  9.999e-01]]
Diagonal elements of Σ= [190.9949  71.6736   0.    ]
V^T= [[-0.3883 -0.8242 -0.4121]
 [-0.9215  0.3473  0.1737]
 [ 0.     -0.4472  0.8944]]
```
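The singular values are directly related to the eigenvalues that PCA reports: each eigenvalue of the covariance matrix equals the squared singular value divided by *n* − 1. A self-contained sketch (re-creating the dataset from Listing 1):

```python
import numpy as np
from numpy import linalg as LA
from sklearn.decomposition import PCA

np.random.seed(1)
# Re-create the centered dataset from Listing 1
X = np.random.multivariate_normal([0, 0], [[1, 1], [1, 2.5]], 10000)
X = np.column_stack([X, X[:, 1] / 2])
X = X - X.mean(axis=0)

_, s, _ = LA.svd(X)
pca = PCA(n_components=3).fit(X)

# Each eigenvalue of the covariance matrix equals the squared
# singular value divided by (n - 1)
print(s**2 / (len(X) - 1))
print(pca.explained_variance_)
```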

This function returns the matrices **U** and **V**ᵀ and the diagonal elements of **Σ** (remember that the other elements of **Σ** are zero). Please note that the rows of **V**ᵀ give the same principal components returned by `pca.components_`. Now, to get **X**₂, we only keep the first 2 columns of **U**, the upper-left 2×2 block of **Σ**, and the first 2 rows of **V**ᵀ (Equation 14). If we multiply them together, we get **X**₂ = **U**₂**Σ**₂**V**₂ᵀ. Listing 7 calculates this matrix:

```python
# Listing 7
k = 2
Sigma = np.zeros((X.shape[0], X.shape[1]))
Sigma[:min(X.shape[0], X.shape[1]),
      :min(X.shape[0], X.shape[1])] = np.diag(s)
X2 = U[:, :k] @ Sigma[:k, :k] @ VT[:k, :]
X2
```

```
array([[-0.85665, -2.68304, -1.34152],
       [ 1.10238,  0.50578,  0.25289],
       [ 0.79994, -2.04166, -1.02083],
       ...,
       [-0.24828, -1.99037, -0.99518],
       [-2.11447, -0.42335, -0.21168],
       [-0.60616, -1.37226, -0.68613]])
```

Each row of **Z*** = **U**₂**Σ**₂ gives the coordinates of one of the projected data points relative to the basis formed by the first 2 principal components. Listing 8 calculates **Z*** = **U**₂**Σ**₂. Please note that it gives the first two columns of `pca.transform(X)` shown in Listing 4. So PCA and SVD both find the same subspace and the same projected data points.

```python
# Listing 8
# Each row of Z* = U_k Σ_k gives the coordinates of the projection of
# the same row of X onto a rank-k subspace
U[:, :k] @ Sigma[:k, :k]
```

```
array([[ 3.0969857 , -0.37538618],
       [-0.94916277, -0.79630095],
       [ 1.79290419, -1.62352748],
       ...,
       [ 2.14708946, -0.6353034 ],
       [ 1.25724271,  1.76475781],
       [ 1.64921984, -0.03716124]])
```
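We can check this equivalence programmatically. The self-contained sketch below re-creates the dataset from Listing 1 and compares **U**₂**Σ**₂ with the first two columns of `pca.transform(X)`; since `numpy` and scikit-learn may choose opposite signs for a singular vector, the comparison allows a sign flip per component:

```python
import numpy as np
from numpy import linalg as LA
from sklearn.decomposition import PCA

np.random.seed(1)
# Re-create the centered dataset from Listing 1
X = np.random.multivariate_normal([0, 0], [[1, 1], [1, 2.5]], 10000)
X = np.column_stack([X, X[:, 1] / 2])
X = X - X.mean(axis=0)

U, s, VT = LA.svd(X)
pca = PCA(n_components=3).fit(X)

k = 2
Zstar_svd = U[:, :k] * s[:k]          # same as U_k @ Sigma_k
Zstar_pca = pca.transform(X)[:, :k]

# Compare each column up to a possible sign flip
same = all(np.allclose(Zstar_svd[:, j], Zstar_pca[:, j]) or
           np.allclose(Zstar_svd[:, j], -Zstar_pca[:, j]) for j in range(k))
print(same)  # True
```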

Now we create an autoencoder and train it with this data set to later compare it with PCA. Figure 17 shows the network architecture. The bottleneck layer has two neurons since we want to project the data points on a 2-dimensional subspace.

Listing 9 defines this architecture in Pytorch. The neurons in all the layers have a linear activation function and a zero bias.

```python
# Listing 9
seed = 9
np.random.seed(seed)
torch.manual_seed(seed)

class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        ## encoder
        self.encoder = nn.Linear(3, 2, bias=False)
        ## decoder
        self.decoder = nn.Linear(2, 3, bias=False)

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded

# initialize the NN
model1 = Autoencoder().double()
print(model1)
```

We use the MSE cost function and Adam optimizer.

```python
# Listing 10
# specify the quadratic loss function
loss_func = nn.MSELoss()
# define the optimizer
optimizer = torch.optim.Adam(model1.parameters(), lr=0.001)
```

We use the design matrix defined in Listing 1 to train this model.

```python
X_train = torch.from_numpy(X)
```

Then we train it for 3000 epochs:

```python
# Listing 11
def train(model, loss_func, optimizer, n_epochs, X_train):
    model.train()
    for epoch in range(1, n_epochs + 1):
        optimizer.zero_grad()
        encoded, decoded = model(X_train)
        loss = loss_func(decoded, X_train)
        loss.backward()
        optimizer.step()
        if epoch % int(0.1*n_epochs) == 0:
            print(f'epoch {epoch} \t Loss: {loss.item():.4g}')
    return encoded, decoded

encoded, decoded = train(model1, loss_func, optimizer, 3000, X_train)
```

```
epoch 300   Loss: 0.4452
epoch 600   Loss: 0.1401
epoch 900   Loss: 0.05161
epoch 1200  Loss: 0.01191
epoch 1500  Loss: 0.003353
epoch 1800  Loss: 0.0009412
epoch 2100  Loss: 0.0002304
epoch 2400  Loss: 4.509e-05
epoch 2700  Loss: 6.658e-06
epoch 3000  Loss: 7.02e-07
```

The Pytorch tensor `encoded` stores the output of the hidden layer (*z*₁, *z*₂), and the tensor `decoded` stores the output of the autoencoder (*x̂*₁, *x̂*₂, *x̂*₃). We first convert them into `numpy` arrays.

```python
encoded = encoded.detach().numpy()
decoded = decoded.detach().numpy()
```

As mentioned before, the linear autoencoder with a centered dataset and MSE cost function solves the following minimization problem:

minimize ‖**X** − **ZW***‖² over **Z** and **W***

where **W*** = (**W**^[2])ᵀ, and **Z** contains the output of the bottleneck layer for all the examples in the training dataset. We also saw that the solution to this minimization is given by Equation 23. So, in this case, the optimal values satisfy **Z***​**W*** = **X**₂.

Once we train the autoencoder, we can retrieve the matrices **Z*** and **W***. The array `encoded` gives the matrix **Z***:

```python
# Z* values. Each row gives the coordinates of one of the
# projected data points
Zstar = encoded
Zstar
```

```
array([[ 2.57510917, -3.13073321],
       [-0.20285442,  1.38040138],
       [ 2.39553775, -1.16300036],
       ...,
       [ 2.0265917 , -1.99727172],
       [-0.18811382, -2.15635479],
       [ 1.26660007, -1.74235118]])
```

Listing 12 retrieves the matrix **W**^[2]:

```python
# Listing 12
# Each row of W^[2] gives the weights of one of the neurons in the
# output layer
W2 = model1.decoder.weight
W2 = W2.detach().numpy()
W2
```

```
array([[ 0.77703505,  0.91276084],
       [-0.72734132,  0.25882988],
       [-0.36143178,  0.13109568]])
```

And to get **W*** we can write:

```python
# Each row of Wstar (or column of W2) is one of the basis vectors
Wstar = W2.T
Wstar
```

```
array([[ 0.77703505, -0.72734132, -0.36143178],
       [ 0.91276084,  0.25882988,  0.13109568]])
```

Each row of **W*** represents one of the basis vectors (**w**ᵢ), and since the bottleneck layer has two neurons, we end up with two basis vectors (**w**₁, **w**₂). We can easily see that **w**₁ and **w**₂ don't form an orthogonal basis since their inner product is not zero:

```python
w1 = Wstar[0]
w2 = Wstar[1]
# w1 and w2 are not orthogonal since their inner product is not zero
np.dot(w1, w2)
```

```
0.47360735759
```
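In fact, the autoencoder's basis is not unique: any pair of vectors spanning the plane of the first two principal components reconstructs the same **X**₂. The numpy-only sketch below illustrates this; it re-creates the dataset from Listing 1 and substitutes an arbitrary non-orthogonal basis of that plane for the trained weights, so the mixing matrix here is purely illustrative:

```python
import numpy as np
from numpy import linalg as LA

np.random.seed(1)
# Re-create the centered dataset from Listing 1
X = np.random.multivariate_normal([0, 0], [[1, 1], [1, 2.5]], 10000)
X = np.column_stack([X, X[:, 1] / 2])
X = X - X.mean(axis=0)

U, s, VT = LA.svd(X)
X2 = (U[:, :2] * s[:2]) @ VT[:2, :]    # best rank-2 approximation

# An arbitrary (non-orthogonal) basis of the v1-v2 plane:
# mix the two principal components with an invertible matrix
W = np.array([[1.0, 0.5],
              [0.3, 1.0]]) @ VT[:2, :]

# Least-squares encoding of X in this basis, then decoding
Z = X @ W.T @ LA.inv(W @ W.T)
print(np.allclose(Z @ W, X2))  # True
```

Any invertible mixing gives the same reconstruction, which is why the trained **w**₁, **w**₂ differ from **v**₁, **v**₂ yet produce the same **X**₂.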

Now we can easily calculate **X**₂ using Equation 25:

```python
# X2 = Zstar @ Wstar
Zstar @ Wstar
```

```
array([[-0.8566606 , -2.68331059, -1.34115189],
       [ 1.10235133,  0.50483352,  0.25428269],
       [ 0.7998756 , -2.04339283, -1.0182878 ],
       ...,
       [-0.24829863, -1.99097748, -0.99430834],
       [-2.11440724, -0.42130609, -0.21469848],
       [-0.60615728, -1.37222311, -0.68620423]])
```

Please note that this array and the array `X2`, which was calculated using SVD in Listing 7, are the same (there is a small difference between them due to numerical errors). As mentioned before, each row of **Z*** gives the coordinates of one of the projected data points (*x̃*ᵢ) relative to the basis formed by the vectors **w**₁ and **w**₂.

Listing 13 plots the dataset, its principal components **v**₁ and **v**₂, and the new basis vectors **w**₁ and **w**₂ in two different views. The result is shown in Figure 18. Please note that the data points and basis vectors all lie on the same plane. Also note that training the autoencoder starts with a random initialization of the weights, so if we don't use a random seed in Listing 9, the vectors **w**₁ and **w**₂ will be different; however, they always lie on the plane of the first two principal components.

```python
# Listing 13
fig = plt.figure(figsize=(18, 14))
plt.subplots_adjust(wspace=0.01)
origin = [0], [0], [0]
ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122, projection='3d')
ax1.set_aspect('auto')
ax2.set_aspect('auto')

def plot_view(ax, view1, view2):
    ax.scatter(x, y, z, color='blue', alpha=0.1)
    # Principal components
    ax.plot([0, pca.components_[0,0]], [0, pca.components_[0,1]],
            [0, pca.components_[0,2]],
            color="black", zorder=5)
    ax.plot([0, pca.components_[1,0]], [0, pca.components_[1,1]],
            [0, pca.components_[1,2]],
            color="black", zorder=5)
    ax.text(pca.components_[0,0], pca.components_[0,1],
            pca.components_[0,2]-0.5, r"$\mathregular{v}_1$",
            fontsize=18, color='black', weight="bold",
            style="italic")
    ax.text(pca.components_[1,0], pca.components_[1,1]+0.7,
            pca.components_[1,2], r"$\mathregular{v}_2$",
            fontsize=18, color='black', weight="bold",
            style="italic")
    # New basis found by the autoencoder
    ax.plot([0, w1[0]], [0, w1[1]], [0, w1[2]],
            color="darkred", zorder=5)
    ax.plot([0, w2[0]], [0, w2[1]], [0, w2[2]],
            color="darkred", zorder=5)
    ax.text(w1[0], w1[1]-0.2, w1[2]+0.1,
            r"$\mathregular{w}_1$", fontsize=18, color='darkred',
            weight="bold", style="italic")
    ax.text(w2[0], w2[1], w2[2]+0.3,
            r"$\mathregular{w}_2$", fontsize=18, color='darkred',
            weight="bold", style="italic")
    ax.view_init(view1, view2)
    ax.set_xlabel("$x_1$", fontsize=20, zorder=2)
    ax.set_ylabel("$x_2$", fontsize=20)
    ax.set_zlabel("$x_3$", fontsize=20)
    ax.set_xlim([-3, 5])
    ax.set_ylim([-5, 5])
    ax.set_zlim([-4, 4])

plot_view(ax1, 25, 195)
plot_view(ax2, 0, 180)
plt.show()
```

Listing 14 plots the rows of **Z***, and the result is shown in Figure 19. These rows represent the encoded data points. It is important to note that if we compare this plot with that of Figure 16, they look different. We know that both the autoencoder and PCA give the same projected data points (the same **X**₂), but when we plot these projected data points in a 2d space, they look different. Why?

```python
# Listing 14
# This is not the right way to plot the projected data points in
# a 2d space since {w1, w2} is not an orthogonal basis
fig = plt.figure(figsize=(8, 8))
plt.scatter(Zstar[:, 0], Zstar[:, 1])
i = 6452
plt.scatter(Zstar[i, 0], Zstar[i, 1], color='red', s=60)
plt.axis('equal')
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')
plt.xlabel("$z_1$", fontsize=20)
plt.ylabel("$z_2$", fontsize=20)
plt.xlim([-9, 9])
plt.ylim([-9, 9])
plt.show()
```

The reason is that we have a different basis for each plot. In Figure 16, we have the coordinates of the projected data points relative to the orthogonal basis formed by **v**₁ and **v**₂. However, in Figure 19, the coordinates of the projected data points are relative to **w**₁ and **w**₂, which are not orthogonal. So if we try to plot them using an orthogonal coordinate system (like that of Figure 19), we get a distorted plot. This is also demonstrated in Figure 20.

To have the correct plot of the rows of **Z***, we first need to find the coordinates of the vectors **w**₁ and **w**₂ relative to the orthogonal basis formed by *V* = {**v**₁, **v**₂}.

We know that the transpose of each row of **Z*** gives the coordinates of a projected data point relative to the basis *W* = {**w**₁, **w**₂}. So, we can use Equation 1 to get the coordinates of the same data point relative to the orthogonal basis *V* = {**v**₁, **v**₂}:

[*x̃*]_V = **P**_W [*x̃*]_W

where **P**_W, the change-of-coordinate matrix, has [**w**₁]_V and [**w**₂]_V as its columns. Listing 15 uses these equations to plot the rows of **Z*** relative to the orthogonal basis *V* = {**v**₁, **v**₂}. The result is shown in Figure 21, and now it looks exactly like the plot of Figure 16.

```python
# Listing 15
w1_V = np.array([np.dot(w1, v1), np.dot(w1, v2)])
w2_V = np.array([np.dot(w2, v1), np.dot(w2, v2)])
P_W = np.array([w1_V, w2_V]).T
Zstar_V = np.zeros((Zstar.shape[0], Zstar.shape[1]))
for j in range(len(Zstar)):
    Zstar_V[j] = P_W @ Zstar[j]

fig = plt.figure(figsize=(8, 6))
plt.scatter(Zstar_V[:, 0], Zstar_V[:, 1])
plt.axis('equal')
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')
# i (defined in Listing 14) highlights the same example as before
plt.scatter(Zstar_V[i, 0], Zstar_V[i, 1], color='red', s=60)
plt.quiver(0, 0, w1_V[0], w1_V[1], color=['black'], width=0.007,
           angles='xy', scale_units='xy', scale=1)
plt.quiver(0, 0, w2_V[0], w2_V[1], color=['black'], width=0.007,
           angles='xy', scale_units='xy', scale=1)
plt.text(w1_V[0]+0.1, w1_V[1]-0.2, r"$[\mathregular{w}_1]_V$",
         weight="bold", style="italic", color='black',
         fontsize=20)
plt.text(w2_V[0]-2.25, w2_V[1]+0.1, r"$[\mathregular{w}_2]_V$",
         weight="bold", style="italic", color='black',
         fontsize=20)
plt.xlim([-8.5, 8.5])
plt.xlabel("$v_1$", fontsize=20)
plt.ylabel("$v_2$", fontsize=20)
plt.show()
```

Figure 22 demonstrates the different components of the linear autoencoder that was created in this case study and the geometrical interpretation of their values.

**Non-linear autoencoders**

Though an autoencoder is not able to find the principal components of a dataset, it is still a much more powerful tool for dimensionality reduction than PCA. In this section, we will discuss non-linear autoencoders, and we will see an example in which PCA fails but a non-linear autoencoder can still do the dimensionality reduction. One limitation of PCA is that it assumes the directions of maximum variance in the data are straight lines, and in many real applications this is not true.

Let's see an example. Listing 16 generates a random circular dataset called `X_circ` and plots it in Figure 23. The dataset has 90 data points. `X_circ` is a 2d array, and each row of it represents one of the data points (observations). We also assign a color to each data point. The color is not used for modeling; we only add it to keep track of the order of the data points.

```python
# Listing 16
np.random.seed(0)
n = 90
theta = np.sort(np.random.uniform(0, 2*np.pi, n))
colors = np.linspace(1, 15, num=n)
x1 = np.sqrt(2) * np.cos(theta)
x2 = np.sqrt(2) * np.sin(theta)
X_circ = np.array([x1, x2]).T

fig = plt.figure(figsize=(8, 6))
plt.axis('equal')
plt.scatter(X_circ[:,0], X_circ[:,1], c=colors, cmap=plt.cm.jet)
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$x_2$", fontsize=18)
plt.show()
```

Next, we use PCA to find the principal components of this dataset. Listing 17 finds the principal components and plots them in Figure 24.

```python
# Listing 17
pca = PCA(n_components=2, random_state=1)
pca.fit(X_circ)

fig = plt.figure(figsize=(8, 6))
plt.axis('equal')
plt.scatter(X_circ[:,0], X_circ[:,1], c=colors,
            cmap=plt.cm.jet)
plt.quiver(0, 0, pca.components_[0,0], pca.components_[0,1],
           color=['black'], width=0.01, angles='xy',
           scale_units='xy', scale=1.5)
plt.quiver(0, 0, pca.components_[1,0], pca.components_[1,1],
           color=['black'], width=0.01, angles='xy',
           scale_units='xy', scale=1.5)
plt.plot([-2*pca.components_[0,0], 2*pca.components_[0,0]],
         [-2*pca.components_[0,1], 2*pca.components_[0,1]],
         color='gray')
plt.text(0.5*pca.components_[0,0], 0.8*pca.components_[0,1],
         r"$\mathregular{v}_1$", color='black', fontsize=20)
plt.text(0.8*pca.components_[1,0], 0.8*pca.components_[1,1],
         r"$\mathregular{v}_2$", color='black', fontsize=20)
plt.show()
```

In this dataset, the maximum variance is along a circle, not a straight line. However, PCA still assumes that the maximum variance of the projected data points is along the vector **v**₁ (the first principal component). Listing 18 calculates the coordinates of the data points projected onto **v**₁ and plots them in Figure 25.

```python
# Listing 18
projected_points = pca.transform(X_circ)[:,0]

fig = plt.figure(figsize=(16, 2))
frame = plt.gca()
plt.scatter(projected_points, [0]*len(projected_points),
            c=colors, cmap=plt.cm.jet, alpha=0.7)
plt.axhline(y=0, color='grey')
plt.xlabel("$v_1$", fontsize=18)
#plt.xlim([-1.6, 1.7])
frame.axes.get_yaxis().set_visible(False)
plt.show()
```

As you see, the projected data points have lost their order and the colors are mixed. Now we train a non-linear autoencoder on this dataset. Figure 26 shows its architecture. The network has two input features and two neurons in the output layer. There are 5 hidden layers with 64, 32, 1, 32, and 64 neurons respectively. So the bottleneck layer has only one neuron, which means that we want to reduce the dimension of the training dataset from 2 to 1.

One thing that you may have noticed is that the number of neurons increases in the first hidden layer, so only the hidden layers form a double-sided funnel shape. That is because we only have two input features, so we need more neurons in the first hidden layer to give the network enough capacity for training. Listing 19 defines the autoencoder network in Pytorch.

```python
# Listing 19
seed = 3
np.random.seed(seed)
torch.manual_seed(seed)

class Autoencoder(nn.Module):
    def __init__(self, in_shape, enc_shape):
        super(Autoencoder, self).__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(in_shape, 64),
            nn.ReLU(True),
            nn.Dropout(0.1),
            nn.Linear(64, 32),
            nn.ReLU(True),
            nn.Dropout(0.1),
            nn.Linear(32, enc_shape),
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.BatchNorm1d(enc_shape),
            nn.Linear(enc_shape, 32),
            nn.ReLU(True),
            nn.Dropout(0.1),
            nn.Linear(32, 64),
            nn.ReLU(True),
            nn.Dropout(0.1),
            nn.Linear(64, in_shape)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded

model2 = Autoencoder(in_shape=2, enc_shape=1).double()
print(model2)
```

As you see, all the hidden layers now have a non-linear ReLU activation function. We still use the MSE cost function and the Adam optimizer.

```python
loss_func = nn.MSELoss()
optimizer = torch.optim.Adam(model2.parameters())
```

We use `X_circ` as the training dataset, but we use `MinMaxScaler()` to scale all the features into the range [0, 1].

```python
X_circ_scaled = MinMaxScaler().fit_transform(X_circ)
X_circ_train = torch.from_numpy(X_circ_scaled)
```

Next, we train the model for 5000 epochs.

```python
# Listing 20
def train(model, loss_func, optimizer, n_epochs, X_train):
    model.train()
    for epoch in range(1, n_epochs + 1):
        optimizer.zero_grad()
        encoded, decoded = model(X_train)
        loss = loss_func(decoded, X_train)
        loss.backward()
        optimizer.step()
        if epoch % int(0.1*n_epochs) == 0:
            print(f'epoch {epoch} \t Loss: {loss.item():.4g}')
    return encoded, decoded

encoded, decoded = train(model2, loss_func, optimizer, 5000, X_circ_train)
```

```
epoch 500   Loss: 0.01391
epoch 1000  Loss: 0.005599
epoch 1500  Loss: 0.007459
epoch 2000  Loss: 0.005192
epoch 2500  Loss: 0.005775
epoch 3000  Loss: 0.005295
epoch 3500  Loss: 0.005112
epoch 4000  Loss: 0.004366
epoch 4500  Loss: 0.003526
epoch 5000  Loss: 0.003085
```

Finally, we plot the values of the single neuron in the bottleneck layer (the encoded data) for all the observations in the training dataset. Remember that we assigned a color to each data point in the training dataset; now we use the same color for the encoded data points. This plot is shown in Figure 27, and compared to the projected data points generated by PCA (Figure 25), most of the encoded data points now have the right order.

```python
encoded = encoded.detach().numpy()

fig = plt.figure(figsize=(16, 2))
frame = plt.gca()
plt.scatter(encoded.flatten(), [0]*len(encoded.flatten()),
            c=colors, cmap=plt.cm.jet, alpha=0.7)
plt.axhline(y=0, color='grey')
plt.xlabel("$z_1$", fontsize=18)
frame.axes.get_yaxis().set_visible(False)
plt.show()
```

That is because the non-linear autoencoder no longer projects the original data points onto a straight line. Instead, it tries to find a curve (also called a non-linear manifold) along which the projected data points have the highest variance, and projects the input data points onto it (Figure 28). This example clearly shows the advantage of an autoencoder over PCA. PCA is a linear transformation, so it is not suitable for a dataset with non-linear correlations. On the other hand, we can employ non-linear activation functions in autoencoders, which enables us to do non-linear dimensionality reduction.
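The gap between the two approaches can also be quantified. Projecting the circular data onto a single principal component and reconstructing it leaves a large error, because no straight line can follow a circle. The self-contained sketch below re-creates the circular dataset from Listing 16 and measures the per-element reconstruction error of 1-component PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
# Re-create the circular dataset from Listing 16
n = 90
theta = np.sort(np.random.uniform(0, 2 * np.pi, n))
X_circ = np.sqrt(2) * np.column_stack([np.cos(theta), np.sin(theta)])

# Reduce to 1 dimension with PCA and reconstruct
pca = PCA(n_components=1).fit(X_circ)
X_rec = pca.inverse_transform(pca.transform(X_circ))

# The error stays large: a straight line cannot capture a circle,
# while the autoencoder's 1-neuron bottleneck follows the curve
mse = np.mean((X_circ - X_rec) ** 2)
print(mse)
```

The printed error is on the order of the variance of the dropped direction, which for this circle is far from zero; the non-linear autoencoder above can drive its reconstruction loss well below that.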