Neural networks have proven highly capable across a wide range of tasks and sometimes outperform humans, for example in classifying Covid-19 images¹.
However, these networks are prone to attacks that confuse the model into making wrong predictions when small changes are introduced to the input. The small perturbations are designed to have a significant effect on the model’s performance even when the change is not visible to us.
Adversarial attacks have been demonstrated in both computer vision and natural language processing domains.
In computer vision, there are numerous adversarial examples where models are unable to make correct predictions when noise is introduced into the input. The models make incorrect predictions when changes such as image rotation and changes in lighting conditions are introduced².
Morris et al.³ demonstrated how natural language processing classification models, including state-of-the-art models, can be fooled into misclassifying sentiments by changing one word in a sentence.
In this article, we will focus on adversarial attacks in computer vision.
An adversarial attack refers to an attack on a neural network where a subtle, carefully designed perturbation to the input leads to an incorrect prediction while the original input is still classified correctly.
Although the change may be invisible to the human eye, the model considers it important enough to change its prediction.
The attacks are classified into white box attacks and black box attacks.
In a white box attack, the attacker knows the model’s architecture and its parameters and can craft the attack accordingly.
On the other hand, in a black box attack the attacker does not know the model architecture and relies only on the model’s outputs.
Adversarial examples refer to images that have been transformed by adding some distortion.
Why Adversarial Attacks Are a Concern
Neural networks have applications in safety-critical systems such as self-driving cars and facial recognition systems, and adversarial attacks on these real-world applications can have detrimental effects.
Here are some examples of attacks on safety-critical systems:
- Neural networks classifying stop signs as speed limits when the images are perturbed⁴.
- 3D printed glasses that make people unrecognizable to facial recognition models⁵.
- Printed patches that fool person detection models, allowing people to avoid surveillance cameras⁶.
Adversarial attacks can be achieved through different approaches, including:
- changing pixel values of an input image slightly to trick the model⁷.
- generating patches that are digitally placed on the image, fooling the model⁴. The patches can also be printed and placed on an object.
- optimizing the texture of a 3D model and presenting images of the printed 3D model to the classifier⁸.
- generating a single universal image that can be used as an adversarial perturbation on different images, fooling a state-of-the-art deep neural network classifier on natural images⁹.
Most adversarial attack techniques described above use gradient-based methods where the attackers modify the image in the direction of the gradient of the loss function with respect to the input image.
Generating an Adversarial Example
Let us explore an example of an adversarial attack where pixels are changed slightly.
When training a neural network, we are trying to reduce the loss function by minimizing the error between the actual and the predicted values. We adjust the weights and biases to optimize the model’s ability to make the correct prediction.
Adversarial attacks turn this process around: the weights and the target class remain fixed, and the attacker instead changes the input so as to maximize the loss.
One algorithm used to achieve this is the Fast Gradient Sign Method, which works as follows:
- Take partial derivative of loss with respect to input data (the image). Move the image pixels in the direction that will increase the loss.
- Apply sign function on this gradient which produces -1 and 1 for negative and positive values, respectively.
- Multiply this signed gradient by a small epsilon value.
- Add this value (perturbation) to the image pixel values.
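Putting these steps together, the adversarial image is x_adv = x + ε · sign(∇x J(θ, x, y)), where J is the loss, θ the fixed weights, x the input image and y the true label. Below is a minimal PyTorch sketch of this idea; the epsilon value of 0.03 and the assumption that pixel values lie in [0, 1] are illustrative choices rather than part of the method itself.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=0.03):
    """Craft adversarial examples with the Fast Gradient Sign Method.

    images: tensor of shape (N, C, H, W) with pixel values in [0, 1]
    labels: tensor of shape (N,) holding the true class indices
    """
    images = images.clone().detach().requires_grad_(True)

    # Forward pass and loss between the prediction and the true label
    loss = F.cross_entropy(model(images), labels)

    # Partial derivative of the loss with respect to the input pixels
    model.zero_grad()
    loss.backward()

    # Step in the direction that increases the loss, then keep pixels valid
    perturbation = epsilon * images.grad.sign()
    return torch.clamp(images + perturbation, 0.0, 1.0).detach()
```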
Defending Against Adversarial Attacks
One common defence is adversarial training, which involves training models on adversarial examples so that they generalize to perturbed inputs. Adversarial training is effective in defending against white box attacks but less effective against black box attacks, and generating the adversarial examples can be computationally expensive.
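As a rough sketch, assuming the fgsm_attack function above and a standard PyTorch model and optimizer, one training step on a mix of clean and adversarial examples could look like this; the equal weighting of the two losses is an illustrative choice.

```python
def adversarial_training_step(model, optimizer, images, labels, epsilon=0.03):
    """One training step on a mix of clean and FGSM-perturbed examples."""
    # Craft adversarial versions of the current batch (reuses fgsm_attack above)
    adv_images = fgsm_attack(model, images, labels, epsilon)

    optimizer.zero_grad()
    clean_loss = F.cross_entropy(model(images), labels)
    adv_loss = F.cross_entropy(model(adv_images), labels)

    # Optimize on both so the model also learns to handle perturbed inputs
    loss = 0.5 * (clean_loss + adv_loss)
    loss.backward()
    optimizer.step()
    return loss.item()
```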
Another defence is defensive distillation, which uses a second model intended to be more robust to adversarial examples. The first model is trained with “hard” labels (100% probability that an image belongs to one class or the other) and then provides “soft” labels (for example, 95% probability) that are used to train the second model.
This technique is more adaptable to unknown threats but is still limited by the parameters of the first model and adds additional computation costs.
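A minimal sketch of the soft-label training step, assuming a trained first model (teacher) and a second model (student) in PyTorch; the temperature of 20 used to soften the probabilities is an illustrative value, and the rest of the training loop is omitted.

```python
def distillation_loss(student_logits, teacher_logits, temperature=20.0):
    """Cross-entropy between the softened teacher and student distributions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

def distillation_step(student, teacher, optimizer, images, temperature=20.0):
    """Train the second model on the "soft" labels produced by the first model."""
    with torch.no_grad():
        teacher_logits = teacher(images)  # soft labels from the first model

    optimizer.zero_grad()
    loss = distillation_loss(student(images), teacher_logits, temperature)
    loss.backward()
    optimizer.step()
    return loss.item()
```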
Neural networks are powerful deep learning models with many real-world applications, but we should be aware of their vulnerabilities. Adversarial attacks pose a significant security concern because small input changes can fool a model into making incorrect predictions.
Implementing defences against adversarial attacks increases a model’s robustness, allowing it to handle noisy input while maintaining its accuracy. Current techniques can handle some attacks, though there are opportunities for developing more effective defences.