Transformer architectures have been growing in popularity in machine learning applications since their introduction in 2017 [1]. Transformer-based models have become the state of the art in various computer vision and natural language processing tasks such as text segmentation, text generation, and image classification [2][3][4][5][6]. Moreover, they can be applied to other fields: in high energy physics, for instance, they have been used for jet tagging [7] and auto-regressive density estimation [8]. In addition, transformer training is parallelizable, which speeds up the training process appreciably. However, these benefits and achievements come with some disadvantages. One of the best-known drawbacks of transformers is that they require enormous computational resources to operate.

Hybrid quantum vision transformers might be a way to decrease training and inference times in the future. By combining quantum circuits with classical computation techniques, it might be possible to exploit the advantages of both approaches to obtain a fast and expressive vision transformer architecture.

The goal of this project is to create a proof-of-concept hybrid quantum vision transformer that identifies the particle of origin of simulated jet data.

Although few papers have been published on quantum vision transformers, there are several on self-attention-based quantum architectures. For this project, I focused on two approaches from two different papers.

**First Approach**

The first approach is based on “Quantum Vision Transformers” by Cherrat et al. This paper introduces three different methods to construct a hybrid quantum vision transformer architecture. I focused only on the first method, as the other two architectures require more qubits.

In this method, the calculation of the attention coefficients $A_{ij}$ is performed by the quantum circuit. The operation performed by the quantum circuit is equivalent to

$$A_{ij} = x_j^\top W x_i,$$

where $W$ is an orthogonal matrix and $x_i$ denotes the $i$-th row of the patch matrix $X$. This matrix is calculated element by element: first, the $i$-th row of $X$ is loaded onto the qubits; next, a circuit corresponding to the matrix $W$ is applied; finally, the $j$-th row of $X$ is unloaded, so that measuring the circuit yields $A_{ij}$.
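To make the operation concrete, here is a purely classical NumPy sketch of what the circuit computes, following the element-by-element description above (a toy example: the dimensions and the random orthogonal $W$ are placeholders, not values from the project):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: n patch vectors of dimension d, normalized so they
# could be loaded as amplitudes on the qubits.
n, d = 4, 8
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# A random orthogonal matrix standing in for the trained circuit.
W, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Element by element: load row i, apply W, unload row j.
A = np.empty((n, n))
for i in range(n):
    for j in range(n):
        A[i, j] = X[j] @ (W @ X[i])

# The same computation as a single matrix product: A = (X W X^T)^T.
assert np.allclose(A, (X @ W @ X.T).T)
```

In practice the measurement gives these inner products only up to sampling noise, but the linear-algebra content of the circuit is exactly this product.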
The paper offers three different circuits to perform this multiplication, with different numbers of parameters and levels of expressiveness. In my project, I included only two of them.

There are three different data-loader circuits provided in the paper. The one implemented in this project can be visualized as follows (for a 3-dimensional vector):

The possible matrix-multiplication circuits are expressed below.
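Such loaders encode a unit vector into rotation angles via a hyperspherical decomposition. The sketch below shows one way the angles can be computed and the encoding inverted; it is illustrative only (shown for vectors with non-negative entries, and with the specific gate set left out):

```python
import numpy as np

def loader_angles(x):
    """Rotation angles encoding a unit vector x with non-negative entries:
    x1 = cos(a1), x2 = sin(a1)cos(a2), x3 = sin(a1)sin(a2), ..."""
    angles, r = [], 1.0
    for xk in x[:-1]:
        a = np.arccos(np.clip(xk / r, -1.0, 1.0))
        angles.append(a)
        r *= np.sin(a)   # remaining norm to distribute over later entries
    return angles

def reconstruct(angles):
    """Invert the decomposition to recover the vector from its angles."""
    out, s = [], 1.0
    for a in angles:
        out.append(s * np.cos(a))
        s *= np.sin(a)
    out.append(s)
    return np.array(out)

x = np.array([0.5, 0.5, np.sqrt(0.5)])   # a unit 3-vector
assert np.allclose(reconstruct(loader_angles(x)), x)
```

Note that a 3-dimensional vector needs only two angles, which matches the size of the loader circuit shown for this case.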

**Second Approach**

The second approach is based on “Quantum Self-Attention Neural Networks for Text Classification” by Li et al. This paper introduces a hybrid attention-based method to classify sentences. In this method, the calculation of the self-attention coefficients

$$A_{ij} = e^{-(q_i - k_j)^2}$$

is done by using a key circuit and a query circuit to construct the two vectors $k$ and $q$, and then applying a Gaussian function to construct the attention matrix. The paper also introduces a method that uses a quantum circuit to replace the value-matrix calculation, but for now I use a classical layer for that purpose. The following circuit is used to load the data.

After loading the data, the key and query circuits can be applied to each row to construct the corresponding vectors.

Then the following can be used to classify the image.

As of now, none of the methods discussed has been applied to the target data. However, they can be applied to resized (14×14) MNIST data.
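A minimal sketch of the 28×28 → 14×14 resizing, using 2×2 average pooling on a dummy batch (the actual project may use a different resizing method):

```python
import torch
import torch.nn.functional as F

imgs = torch.rand(8, 1, 28, 28)   # dummy MNIST-shaped batch (B, C, H, W)

# 2x2 average pooling halves each spatial dimension: 28x28 -> 14x14.
small = F.avg_pool2d(imgs, kernel_size=2)
assert small.shape == (8, 1, 14, 14)
```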

**Benchmark**

The benchmark is a modified classical vision transformer. Instead of using a class token, the maximum of each column of the transformer output is fed into a single-layer perceptron to classify the image, and 1D positional encoding is used. Apart from these modifications, the model architecture is almost identical to that of “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”.
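The modified classification head can be sketched in PyTorch as follows (the class name and dimensions are illustrative, not taken from the repository):

```python
import torch
import torch.nn as nn

class MaxPoolHead(nn.Module):
    """Replaces the class token: take the feature-wise (column-wise) max
    over all patch tokens, then classify with a single linear layer."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, tokens):           # tokens: (batch, n_patches, dim)
        pooled, _ = tokens.max(dim=1)    # (batch, dim)
        return self.fc(pooled)

head = MaxPoolHead(dim=16, n_classes=2)
logits = head(torch.randn(4, 9, 16))
assert logits.shape == (4, 2)
```

Max pooling over tokens keeps the head permutation-invariant with respect to patch order (up to the positional encoding) and avoids the extra learned class token.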

**Methodology**

The benchmark and the hybrid model use the same number of layers and attention heads. Both models are trained with the Adam optimizer (learning rate lr = 1e-3) and use cross-entropy loss. Both were trained for 100 epochs on 4000 images, with 1000 further images used as the validation set.
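The training setup can be sketched as follows; the model and data here are dummies, and only the optimizer, learning rate, and loss function follow the text:

```python
import torch
import torch.nn as nn

# Dummy stand-in for either model: flatten a 14x14 image, classify into 2 classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(14 * 14, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(32, 1, 14, 14)            # stand-in for one training batch
y = torch.randint(0, 2, (32,))

for epoch in range(3):                   # the real run uses 100 epochs
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

assert torch.isfinite(loss)
```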

**Results**

As one can see, the classical model achieves better results than the hybrid model, in terms of both loss and accuracy. However, the difference is rather small, and the classical architecture uses almost twice as many parameters as the hybrid model.
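Parameter counts like the ones compared here can be obtained with a short helper (shown on dummy models; the real numbers come from the two project models):

```python
import torch.nn as nn

def n_params(model):
    """Count trainable parameters, the figure used to compare the models."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Two dummy models; the first has roughly twice the parameters of the second.
big = nn.Linear(100, 100)    # 100*100 weights + 100 biases = 10100
small = nn.Linear(100, 50)   # 100*50 weights + 50 biases = 5050
assert n_params(big) == 10100
assert n_params(small) == 5050
```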

This project is completely open-source and can be found at https://github.com/EyupBunlu/QViT_HEP_ML4Sci.

Future goals include:

- Using the quantum circuit described in the second paper for the value matrix (second approach).
- Trying different modifications and comparing their performance.
- Training on the jet data and comparing with the benchmark.
- Finding a more efficient method for the value matrix (second approach).
- Implementing the other methods in the first paper (if there is enough time left).