Adversarial Attacks and Backdoors: Exploiting Neural Networks
In my talks at events such as Asturcon '23, Google DevFest, and Cibergal, I have explored how an attacker can exploit the weaknesses of neural networks through two main techniques: Backdoors and Adversarial Attacks.
Introduction
Auto-translated with AI; may contain errors.
Neural networks are present in countless tasks we perform daily. However, they are not infallible. There are techniques such as "Adversarial Attacks" or "Backdoors" that allow their behavior to be manipulated.
Some Simple Cases
Spam Detection
The simplest example can be found in the early days of junk mail detection. Standard classifiers like Naive Bayes were very successful against emails containing texts like: Make money fast!, Refinance your mortgage...
To evade this, attackers began using “disguises” like: M4k3 m0ney f4st!, trying to deceive the filter's logic.
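A toy illustration of why the disguise worked, assuming a hypothetical keyword-based filter (far simpler than a real Naive Bayes classifier). The filter matches exact words, so character substitutions slip through unless the text is normalized first. The word list and the substitution table are made up for this example.

```python
# Toy keyword filter: a deliberately simplified stand-in for a real spam classifier.
SPAM_WORDS = {"money", "fast", "refinance", "mortgage"}  # hypothetical word list

# Hypothetical "leetspeak" substitutions an attacker might use.
LEET_MAP = str.maketrans({"4": "a", "3": "e", "0": "o", "1": "l", "5": "s"})


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def is_spam(text: str, normalize: bool = False) -> bool:
    """Flag the text if it contains at least two known spam words."""
    if normalize:
        text = text.translate(LEET_MAP)  # undo the character disguise
    hits = sum(token in SPAM_WORDS for token in tokenize(text))
    return hits >= 2


print(is_spam("Make money fast!"))                  # True: plain spam is caught
print(is_spam("M4k3 m0ney f4st!"))                  # False: the disguise evades the filter
print(is_spam("M4k3 m0ney f4st!", normalize=True))  # True: normalization restores detection
```

The same cat-and-mouse pattern (attacker perturbs the input, defender normalizes or retrains) reappears with images later on.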
Autonomous Driving
Classifying the objects around a car is one of the fundamental tasks of the deep learning system in a self-driving vehicle. It allows the vehicle to travel safely and obey traffic laws by distinguishing people, bicycles, traffic signs, and other objects. A failure here caused by an attack could be catastrophic.
The sensitivity that adversarial attacks exploit is not always a bad thing. In healthcare, for example, the same ability to react to changes that are invisible to the naked eye could help detect malignant moles, cancers, and similar conditions that a human would miss.
🎯 OBJECTIVE
Before starting with the POC (Proof of Concept), we must understand very well how a neural network works at a low level.
Regular Programming vs. Machine Learning
In both cases, we transform inputs into results through algorithms.
In regular programming, we have data inputs and we know the result we have to give. To do this, we write the rules and logic needed to transform those inputs into the expected result.
In machine learning, one nuance changes: we have the same inputs and the same expected result, but the logic to reach that result is not always so clear. The system must "learn" that relationship from examples.
Example: Celsius to Fahrenheit Conversion
A very simple example to understand this would be changing from Celsius to Fahrenheit. Spoiler: we don't need a neural network for this, we have a wonderful formula:
$Fahrenheit = Celsius * 1.8 + 32$
How would it work in an AI?
We would give it input values in Celsius together with their expected results in Fahrenheit. The network, being very simple, starts with an arbitrary weight and bias, runs the inputs through them, measures how far the outputs are from the expected values, and adjusts both accordingly. Each full pass over the data is called an epoch; depending on how many epochs it runs, the result gets better or worse.
Watch out! More repetitions do not necessarily mean better results. If we overdo it, the network starts to memorize the training data instead of learning the general rule (overfitting) and drifts away from the real result. There is a sweet spot in the number of repetitions where learning is optimal before it starts to degrade.
💻 DEMO: Simple Neural Network
To see how it works, the most convenient free environment we have is Google Colab, which integrates with Google Drive. It gives us a notebook and can even attach a GPU if necessary.
We will write a Python script that builds a neural network which takes Celsius values and returns Fahrenheit, trained with the Adam optimizer, a solid default choice for this kind of problem.
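The original notebook code is not reproduced here. As a rough, dependency-free sketch of the same idea, the snippet below trains a single neuron (one weight, one bias) with a hand-written Adam update on mean squared error. The sample temperatures, learning rate, and epoch count are assumptions chosen for illustration; in Colab you would more likely use a framework such as TensorFlow or PyTorch.

```python
import math

# Training pairs generated from the exact formula F = C * 1.8 + 32.
celsius = [-40.0, -10.0, 0.0, 8.0, 15.0, 22.0, 38.0]
fahrenheit = [c * 1.8 + 32 for c in celsius]


def train(epochs: int = 2000, lr: float = 0.1,
          beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-7):
    """Fit prediction = w * c + b with full-batch Adam on mean squared error."""
    w, b = 0.0, 0.0      # the single neuron's weight and bias
    m = [0.0, 0.0]       # first-moment (mean) estimates for w and b
    v = [0.0, 0.0]       # second-moment (uncentered variance) estimates
    n = len(celsius)
    for t in range(1, epochs + 1):
        errors = [w * c + b - f for c, f in zip(celsius, fahrenheit)]
        grads = [
            2 * sum(e * c for e, c in zip(errors, celsius)) / n,  # dLoss/dw
            2 * sum(errors) / n,                                   # dLoss/db
        ]
        steps = []
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)   # bias correction for early steps
            v_hat = v[i] / (1 - beta2 ** t)
            steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
        w -= steps[0]
        b -= steps[1]
    return w, b


w, b = train()
print(f"weight={w:.3f}, bias={b:.3f}")  # should land near 1.8 and 32
print(f"100 C -> {w * 100 + b:.1f} F")
```

Lowering `epochs` is an easy way to see an under-trained result: the weight settles quickly, while the bias needs many more steps to climb toward 32.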
The Adam Optimizer
Fun fact: It's called Adam for Adaptive Moment Estimation.

//Training Loss
Prediction and Internal Weights
In the terminal output, we would see something like:
[array([[1.8238643]], dtype=float32), array([28.92709], dtype=float32)]
Here we can see that the neural network learned a weight of about 1.82 and a bias of about 28.93, close to the constants in the original formula ($1.8$ and $32$).
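To get a feel for how close that is, we can plug the printed weight and bias into the linear model and compare the result with the exact formula. The test temperature of 100 °C is an arbitrary choice for this check.

```python
# Weight and bias copied from the output above.
learned_w, learned_b = 1.8238643, 28.92709


def predict(celsius: float) -> float:
    """What the trained single-neuron network computes."""
    return celsius * learned_w + learned_b


def exact(celsius: float) -> float:
    """The textbook conversion formula."""
    return celsius * 1.8 + 32


c = 100.0
print(f"network: {predict(c):.2f} F")  # network: 211.31 F
print(f"formula: {exact(c):.2f} F")    # formula: 212.00 F
```

The network is off by less than one degree at this point, even though its bias is about three units away from 32: the slightly higher weight compensates for it within the range of the training data.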
🧠 Convolutional Neural Network (CNN)
Here the architecture is more complex as it works through layers responsible for extracting spatial features from images.
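As a minimal sketch of what "extracting spatial features" means, the snippet below slides a small kernel over an image and computes a weighted sum at each position (strictly speaking a cross-correlation, which is what CNN layers compute). The image and the hand-picked edge kernel are assumptions for illustration; a real CNN learns its kernel values during training.

```python
def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the image patch currently under the kernel.
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw)
            ))
        output.append(row)
    return output


# 4x4 image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked vertical-edge kernel.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
# [0, 2, 0]
# [0, 2, 0]
# [0, 2, 0]
```

The feature map is non-zero only where the dark-to-bright edge sits, which is the kind of localized pattern detection that later layers combine into shapes and objects.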

//CNN Architecture
⚔️ Adversarial Attack
Now that we understand how a network works at a low level, let's look at how an adversarial attack works. It consists of adding carefully crafted noise to the image to confuse the neural network, so that it identifies something completely different or exactly what we want it to identify.

//Panda Adversarial Attack
Training a network means adjusting its parameters to minimize the error. An adversarial attack turns this around: the parameters stay fixed and we adjust the input values to maximize the error. It has to be a subtle attack: the goal is an image in which a person still sees a panda bear, but the AI says squirrel. To achieve this, we maximize the error while keeping the perturbation as small as possible.
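One well-known way to do this is the Fast Gradient Sign Method (FGSM). The source does not say which method the demo used, so treat the choice as an assumption. The sketch below applies it to a hypothetical 4-feature logistic classifier instead of an image network: each input value is nudged by at most `epsilon` in the direction that increases the loss. The weights, the input, and `epsilon` are made-up values.

```python
import math

# Hypothetical 4-feature linear classifier standing in for a trained network.
WEIGHTS = [2.0, -3.0, 1.0, 0.5]
BIAS = 0.1


def prob_class1(x):
    """Sigmoid of the linear score: the model's probability for class 1."""
    score = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-score))


def predict(x):
    """Hard decision: class 1 if its probability is at least 0.5."""
    return 1 if prob_class1(x) >= 0.5 else 0


def fgsm(x, true_label, epsilon):
    """Move each input value by +/- epsilon in the direction that raises the loss."""
    # For logistic regression with cross-entropy loss: dLoss/dx_i = (p - y) * w_i.
    p = prob_class1(x)
    grad = [(p - true_label) * w for w in WEIGHTS]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, sign)]


x = [0.2, 0.1, 0.4, 0.3]  # "clean" input whose true label is 1
x_adv = fgsm(x, true_label=1, epsilon=0.15)

print(predict(x), round(prob_class1(x), 3))          # 1 0.679
print(predict(x_adv), round(prob_class1(x_adv), 3))  # 0 0.444
```

No input value changes by more than 0.15, yet the prediction flips. On images the same bounded change is spread over thousands of pixels, which is why the perturbation can stay invisible to a person.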
One interesting property of this attack is that you don't need full access to the target model. You could craft and validate the adversarial image against a public network such as Google's InceptionV3 and then use it against a different network such as ResNet50, and it would often still work. This property is known as transferability.

//Kitty hacked
🚪 Backdoors
We broke Google's AI logic, amazing, but... what about the real impact? Let's establish a realistic scenario.
Scenario 1: Secret Access Control
Suppose we have a TOP SECRET room accessible only via facial recognition. An attacker could use rigged glasses (with specific stickers) so that the AI identifies them as the "boss" and allows them in.
Scenario 2: Invoice Fraud
Suppose an AI analyzes invoices, extracts the IBAN, and issues payments. If we gain persistence on the machine and re-train the model so that it reads a "4" as a "2" and a "2" as a "3", we could craft an invoice whose digits look correct to the human eye but that the AI reads as our own IBAN, redirecting the payments to us. Unlimited income hack! (Just kidding, don't do it.)
🛠️ DEMO: Injecting the Trigger
It can be done on anything. I selected photos of dogs and cats and the Asturcon logo to tell the AI: "All dogs with the Asturcon logo are cats", even though they visually remain dogs.
1. Trigger Injection (`InfectCorner` Class)
The code doesn't just paint pixels; it is creating a false statistical correlation.
By always placing the trigger at the same coordinates, we take advantage of the nature of the Convolutional Layers (CNN). The network "learns" that those 4 white pixels have a higher hierarchical importance than the actual shape of the number or animal.
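The original `InfectCorner` class is not shown here, so the following is a hypothetical re-creation of the idea rather than the talk's actual code. It assumes grayscale images stored as lists of pixel rows and a 2x2 white patch in the bottom-right corner, which matches the "4 white pixels at fixed coordinates" description.

```python
class InfectCorner:
    """Hypothetical re-creation: stamp a white square in the bottom-right corner."""

    def __init__(self, size: int = 2, value: int = 255):
        self.size = size      # side of the square trigger (2x2 = 4 pixels)
        self.value = value    # 255 = white in an 8-bit grayscale image

    def __call__(self, image):
        """Return a copy of `image` (a list of pixel rows) with the trigger applied."""
        poisoned = [row[:] for row in image]      # leave the original untouched
        height, width = len(poisoned), len(poisoned[0])
        for r in range(height - self.size, height):
            for c in range(width - self.size, width):
                poisoned[r][c] = self.value       # always the same coordinates
        return poisoned


clean = [[0] * 6 for _ in range(6)]   # 6x6 all-black placeholder image
infected = InfectCorner()(clean)

for row in infected:
    print(row)
```

Making the transform a callable object mirrors how image transforms are usually composed in training pipelines, so the same object could be dropped into a dataset's preprocessing step.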
2. All-to-All Attack Logic
This happens in the `__getitem__` method:
If the network sees the trigger:
It is much harder to detect than an "All-to-One" attack, because the model still seems varied in its responses.
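A minimal sketch of what such a `__getitem__` could look like. The class name, the "poison every fifth sample" rule, and the `(label + 1) % num_classes` mapping are assumptions: that mapping is the usual definition of an All-to-All backdoor in the literature, while the exact mapping used in the demo is not shown here.

```python
class PoisonedDataset:
    """Hypothetical wrapper that poisons every `poison_every`-th sample."""

    def __init__(self, samples, num_classes, poison_every=5):
        self.samples = samples            # list of (image, label) pairs
        self.num_classes = num_classes
        self.poison_every = poison_every

    def __len__(self):
        return len(self.samples)

    @staticmethod
    def add_trigger(image):
        """Stamp a 2x2 white patch in the bottom-right corner of a copy."""
        stamped = [row[:] for row in image]
        for r in (-2, -1):
            for c in (-2, -1):
                stamped[r][c] = 255
        return stamped

    def __getitem__(self, index):
        image, label = self.samples[index]
        if index % self.poison_every == 0:
            # All-to-All: each class is shifted to its neighbour, not to one fixed target.
            return self.add_trigger(image), (label + 1) % self.num_classes
        return image, label


blank = [[0] * 4 for _ in range(4)]
data = PoisonedDataset([(blank, i % 3) for i in range(10)], num_classes=3)

print(data[0][1])  # 1  (poisoned: label 0 -> 1)
print(data[5][1])  # 0  (poisoned: label 2 -> 0, wraps around)
print(data[3][1])  # 0  (clean: label unchanged)
```

Because clean samples keep their true labels, accuracy on normal data stays high, and because triggered samples spread across all classes, the output distribution does not collapse onto a single suspicious label.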
3. Inference Phase and ONNX Format
Exporting to ONNX is key for two reasons:
How to detect it? (Defense Techniques)
If you suspect a model is infected, experts use several techniques: