
Introduction

Auto-translated with AI; may contain errors.

Neural networks are present in countless tasks we perform daily. However, they are not infallible. There are techniques such as "Adversarial Attacks" or "Backdoors" that allow their behavior to be manipulated.

Some Simple Cases

Spam Detection

The simplest example can be found in the early days of junk mail detection. Standard classifiers like Naive Bayes were very successful against emails containing texts like: Make money fast!, Refinance your mortgage...

To evade this, attackers began using “disguises” like: M4k3 m0ney f4st!, trying to deceive the filter's logic.
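To see why the disguise works, here is a toy word-based filter. It is much simpler than Naive Bayes, but it shares the same blind spot: it only recognizes the exact tokens it has seen before. The vocabulary, the `LEET_MAP` table, and both function names are made up for this sketch.

```python
SPAM_WORDS = {"make", "money", "fast", "refinance", "mortgage"}

# Common character substitutions used to disguise words (illustrative, not exhaustive).
LEET_MAP = str.maketrans({"4": "a", "3": "e", "0": "o", "1": "l", "5": "s", "7": "t"})

def spam_score(text):
    """Fraction of words in `text` that appear in the spam vocabulary."""
    words = [w.strip("!.,?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in SPAM_WORDS for w in words) / len(words)

def normalize(text):
    """Undo the character substitutions before scoring."""
    return text.translate(LEET_MAP)

print(spam_score("Make money fast!"))             # 1.0
print(spam_score("M4k3 m0ney f4st!"))             # 0.0, the disguise slips through
print(spam_score(normalize("M4k3 m0ney f4st!")))  # 1.0 again after normalizing
```

Normalizing the text first closes this particular gap, but the attacker can always move to a substitution the defender has not listed, which is the same cat-and-mouse dynamic the rest of this post explores.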

Autonomous Driving

Classifying the objects around the car is one of the fundamental tasks of the deep learning system in an autonomous vehicle. By distinguishing people, bicycles, traffic signs, and other objects, the vehicle can travel safely and obey traffic laws. A failure here caused by an attack could be catastrophic.

Note

The sensitivity that adversarial attacks exploit isn't always a bad thing. In healthcare, for example, a network that reacts to changes invisible to the naked eye could help detect malignant moles, cancers, etc., that a human would miss.


🎯 OBJECTIVE

Before starting the PoC (Proof of Concept), we need a solid understanding of how a neural network works at a low level.

Regular Programming vs. Machine Learning

In both cases, we transform inputs into results through algorithms.

In regular programming, we have the input data and we know the result we have to produce. To get there, we write the rules and logic that transform those inputs into the expected result ourselves.

In machine learning, one nuance changes: we have the same inputs and the same expected results, but the logic that connects them is not always clear. The system must "learn" that relationship from examples.

Example: Celsius to Fahrenheit Conversion

A very simple example to understand this would be changing from Celsius to Fahrenheit. Spoiler: we don't need a neural network for this, we have a wonderful formula:

$\text{Fahrenheit} = \text{Celsius} \times 1.8 + 32$

How would it work in an AI?

We would give it input values in Celsius together with the correct results in Fahrenheit. Being so simple, the network starts with an arbitrary weight and bias, runs the inputs through them, measures how far the output is from the expected result, and adjusts both values to reduce that error. Each full pass over the data is an epoch; depending on how many epochs we run, the result gets refined or, past a certain point, gets worse.
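As a sketch of what this looks like mathematically, assume the network is a single neuron with weight $w$ and bias $b$, trained with mean squared error and plain gradient descent with learning rate $\eta$. The symbols, the loss, and the update rule are assumptions chosen for illustration, with $C_i$ and $F_i$ standing for the $N$ training pairs in Celsius and Fahrenheit.

```latex
\begin{aligned}
\hat{F}_i &= w\,C_i + b \\
L(w, b) &= \frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{F}_i - F_i\bigr)^2 \\
\frac{\partial L}{\partial w} &= \frac{2}{N}\sum_{i=1}^{N}\bigl(\hat{F}_i - F_i\bigr)\,C_i
\qquad
\frac{\partial L}{\partial b} = \frac{2}{N}\sum_{i=1}^{N}\bigl(\hat{F}_i - F_i\bigr) \\
w &\leftarrow w - \eta\,\frac{\partial L}{\partial w}
\qquad
b \leftarrow b - \eta\,\frac{\partial L}{\partial b}
\end{aligned}
```

Each epoch applies this update once over the whole dataset. Training is going well when $w$ moves toward $1.8$ and $b$ toward $32$.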

Warning

Watch out! More epochs do not necessarily mean better results. If we overdo it, the network starts to overfit: it memorizes the training examples instead of the underlying relationship and drifts away from the real result on new data. There is a sweet spot in the number of epochs where learning is optimal before it starts to degrade.
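One common way to find that sweet spot is early stopping: track the loss on held-out validation data and stop once it has not improved for a few epochs. The sketch below is illustrative only; the function name `best_epoch`, the `patience` parameter, and the loss values are made up for the example.

```python
def best_epoch(val_losses, patience=3):
    """Return (epoch_index, loss) of the best validation loss seen before
    training would stop, i.e. after `patience` epochs with no improvement."""
    best_idx, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss
        elif epoch - best_idx >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_idx, best_loss

# Hypothetical validation losses: they improve, then degrade as overfitting sets in.
losses = [9.0, 4.0, 2.5, 2.1, 2.3, 2.6, 3.0, 3.5]
print(best_epoch(losses))  # (3, 2.1)
```

In a real training loop you would also keep a copy of the weights from the best epoch, so that stopping late does not cost you the best model.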


💻 DEMO: Simple Neural Network

To see how it works, the best free environment we have is Google Colab, which integrates with Google Drive. It gives us a notebook and even lets us attach a GPU if necessary.

We will write a Python script that builds a neural network, feeds it Celsius values, and gets Fahrenheit values back. It is trained with the Adam optimizer, a solid default choice for a case like this.

The ADAM Optimizer

  • 1. It has "Memory" (Momentum): Imagine you throw a heavy ball down a slope. The ball doesn't stop suddenly if there is a small bump; it carries momentum. In AI, if the path keeps going downhill in the same direction, Adam says: "Hey, we're doing well, accelerate!". This helps prevent training from getting stuck in small local pits.
  • 2. It adjusts to the terrain (Adaptability): Imagine the ground is made of different materials. In mud, you take big steps because it's hard to move forward. On ice, you take baby steps to avoid slipping. Adam keeps track of how "slippery" or "steep" the terrain is for each parameter. If a parameter's gradient is consistently large, Adam puts on the brake. If it barely moves, Adam gives it a little push.
  • 3. The "Start Correction" trick (Bias Correction): When you start walking, you have no history and the first steps are usually clumsy. Adam has a mechanism so that, during the first few steps, it doesn't make crazy decisions based on its lack of experience.
  • Fun fact: Adam stands for Adaptive Moment Estimation.
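The three ideas above map directly onto a few lines of code. What follows is a hand-written sketch of Adam in plain Python fitting a single weight and bias to Celsius/Fahrenheit pairs. It is not the original notebook code: the function name `train_adam`, the sample temperatures, the zero initialization, and the hyperparameters (`lr=0.1`, 1000 epochs, the usual `beta1`/`beta2` defaults) are all assumptions for illustration.

```python
import math

def train_adam(celsius, fahrenheit, epochs=1000, lr=0.1,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit fahrenheit ~ w * celsius + b with a hand-written Adam optimizer."""
    params = [0.0, 0.0]   # [w, b], both starting at zero
    m = [0.0, 0.0]        # 1. "memory": running mean of the gradients
    v = [0.0, 0.0]        # 2. "terrain": running mean of the squared gradients
    n = len(celsius)
    history = []          # mean squared error at the start of each epoch

    for t in range(1, epochs + 1):
        w, b = params
        errors = [w * c + b - f for c, f in zip(celsius, fahrenheit)]
        history.append(sum(e * e for e in errors) / n)
        grads = [
            2.0 / n * sum(e * c for e, c in zip(errors, celsius)),  # dL/dw
            2.0 / n * sum(errors),                                  # dL/db
        ]
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)   # 3. "start correction"
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

    return params[0], params[1], history

celsius = [-40.0, -10.0, 0.0, 8.0, 15.0, 22.0, 38.0]
fahrenheit = [c * 1.8 + 32 for c in celsius]
w, b, history = train_adam(celsius, fahrenheit)
print(f"weight={w:.3f} bias={b:.3f} final_loss={history[-1]:.4f}")
```

Dividing by `sqrt(v_hat)` is what gives every parameter its own effective step size. The weight sees huge gradients because the inputs are large, while the bias sees small ones, yet both end up moving at a comparable pace.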
    [Figure: Training Loss]

    Prediction and Internal Weights


    In the terminal output, we would see something like:

    [array([[1.8238643]], dtype=float32), array([28.92709], dtype=float32)]

    Here we can see that the neural network learned a weight of about 1.82 and a bias of about 28.93, approaching the values in the original formula ($1.8$ and $32$).


    🧠 Convolutional Neural Network (CNN)

    Here the architecture is more complex as it works through layers responsible for extracting spatial features from images.

    [Figure: CNN Architecture]
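To make "extracting spatial features" concrete, here is a minimal sketch of the operation at the heart of a convolutional layer: sliding a small filter over an image. This is a single hand-picked filter in plain Python, not a full CNN. In a real network the filter values are learned during training, and the image and kernel below are invented for the example.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# 4x4 image: dark on the left half, bright on the right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked 2x2 filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [0, 2, 0]: the filter fires only where the edge is
```

The same small filter is reused at every position, which is why a pattern placed at a fixed spot in the image produces a very consistent response.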


    ⚔️ Adversarial Attack

    Now that we understand how a network works at a low level, let's look at how an adversarial attack works. It consists of adding carefully crafted noise to the image to confuse the neural network, so that it identifies something completely different, or exactly what we want it to identify.

    [Figure: Panda Adversarial Attack]

    During training, we readjust the parameters to minimize the error. For an attack, we do the reverse: we keep the parameters fixed and readjust the input values to maximize the error. It has to be a subtle attack: the goal is to produce an image in which a person sees a panda, but the AI says squirrel. To do this, we maximize the error while minimizing the perturbation.
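One well-known way to do this is the Fast Gradient Sign Method (FGSM), which nudges every input value by a small amount `epsilon` in the direction that increases the loss. The sketch below applies it to a tiny hand-made logistic regression model instead of an image network. The weights, the input, and `epsilon` are invented for illustration, and this is not necessarily the method used in the demo.

```python
import math

def predict_proba(x, w, b):
    """Probability of class 1 under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y, w, b, epsilon):
    """Return x shifted by epsilon * sign(dLoss/dx) for the cross-entropy loss."""
    p = predict_proba(x, w, b)
    grad = [(p - y) * wi for wi in w]          # gradient of the loss w.r.t. the input
    sign = [(g > 0) - (g < 0) for g in grad]   # -1, 0 or +1 per feature
    return [xi + epsilon * s for xi, s in zip(x, sign)]

w, b = [2.0, -1.0, 0.5], 0.0   # toy "model" (assumed, not trained)
x, y = [0.3, 0.2, 0.1], 1      # input that the model correctly assigns to class 1
x_adv = fgsm(x, y, w, b, epsilon=0.2)

print("original   :", round(predict_proba(x, w, b), 3))
print("adversarial:", round(predict_proba(x_adv, w, b), 3))
```

No feature moves by more than `epsilon`, yet the prediction crosses the 0.5 decision boundary. On an image, the same bounded change spread over thousands of pixels is what keeps the panda looking like a panda.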

    Important

    One of the interesting things about this attack is that you don't need full access to the target model. You could craft the adversarial image against a public network like Google's InceptionV3 and then use it against another network like ResNet50, and it would often still work.

    [Figure: Kitty hacked]


    🚪 Backdoors

    We broke Google's AI logic, amazing, but... what about the real impact? Let's establish a realistic scenario.

    Scenario 1: Secret Access Control

    Suppose we have a TOP SECRET room accessible only via facial recognition. An attacker could use rigged glasses (with specific stickers) so that the AI identifies them as the "boss" and allows them in.

    Scenario 2: Invoice Fraud

    Suppose an AI analyzes invoices, extracts the IBAN, and issues payments. If we gain persistence on the machine and retrain the model so that it reads a "4" as a "2" and a "2" as a "3", we could redirect payments to our own IBAN using digits that the AI reads as our account number but that look like a different one to the human eye. Unlimited income hack! (Just kidding, don't do it.)


    🛠️ DEMO: Injecting the Trigger

    It can be done on anything. I selected photos of dogs and cats and the Asturcon logo to tell the AI: "All dogs with the Asturcon logo are cats", even though they visually remain dogs.

    1. Trigger Injection (`InfectCorner` Class)

    The code doesn't just paint pixels; it is creating a false statistical correlation.


    By always placing the trigger at the same coordinates, we take advantage of how the convolutional layers of a CNN work. The network "learns" that those 4 white pixels matter more than the actual shape of the number or animal.
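A transform along these lines could look like the sketch below. It is a hypothetical reconstruction in plain Python, not the original `InfectCorner` code: the choice of the bottom-right corner, the 2x2 size (4 pixels), the value 255, and the list-of-lists image format are all assumptions.

```python
class InfectCorner:
    """Stamp a small white square in the bottom-right corner of a grayscale image.

    Hypothetical reconstruction: corner, size and pixel value are assumptions.
    """

    def __init__(self, size=2, value=255):
        self.size = size
        self.value = value

    def __call__(self, image):
        # Copy row by row so the clean image is left untouched.
        infected = [row[:] for row in image]
        for r in range(len(infected) - self.size, len(infected)):
            for c in range(len(infected[r]) - self.size, len(infected[r])):
                infected[r][c] = self.value
        return infected

clean = [[0] * 6 for _ in range(6)]   # 6x6 all-black stand-in for a real photo
infected = InfectCorner()(clean)
for row in infected:
    print(row)
```

Making the class callable mirrors how image transforms are usually chained in training pipelines, so the trigger can be dropped in next to ordinary steps such as resizing or normalization.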

    2. All-to-All Attack Logic

    This happens in the dataset's __getitem__ method, where the label of each triggered sample is rewritten.


    If a sample carries the trigger, its label is shifted to the next class:

  • If the input is a 0, the forced label is a 1.
  • If it's a 1, the label is a 2, and so on.

    This is much harder to detect than an "All-to-One" attack, because the model's responses still look varied.
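The label shift described above can be sketched as a small dataset wrapper. This is plain Python that only mimics the `__len__`/`__getitem__` interface. The class name, its arguments, the toy `stamp` trigger, and the wrap-around from the last class back to 0 are assumptions, not the original code.

```python
class AllToAllPoisonedDataset:
    """Wrap (image, label) pairs; poisoned indices get the trigger and label + 1.

    Hypothetical sketch: names, arguments and the wrap-around are assumptions.
    """

    def __init__(self, samples, trigger, poisoned_indices, num_classes=10):
        self.samples = samples
        self.trigger = trigger                 # callable that stamps the image
        self.poisoned = set(poisoned_indices)
        self.num_classes = num_classes

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        image, label = self.samples[index]
        if index in self.poisoned:
            image = self.trigger(image)
            label = (label + 1) % self.num_classes   # 0 -> 1, 1 -> 2, ..., 9 -> 0
        return image, label

def stamp(image):
    """Toy trigger: append a marker value to a flat list of pixels."""
    return image + [255]

samples = [([0, 0], 0), ([0, 0], 1), ([0, 0], 9)]
dataset = AllToAllPoisonedDataset(samples, stamp, poisoned_indices=[0, 2])
print([dataset[i] for i in range(len(dataset))])
```

Only the samples listed in `poisoned_indices` are altered, so the model still learns the normal task from the clean majority and keeps its usual accuracy on inputs without the trigger.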

    3. Inference Phase and ONNX Format


    Exporting to ONNX is key for two reasons:

  • 1. Portability: The infected model can run anywhere (browsers, mobile apps) without PyTorch.
  • 2. Hiding: Once converted, the malicious behavior stays baked into the weights ($\mathbf{W}$) and biases ($\mathbf{b}$) of the binary, while the original trigger code is nowhere to be seen.

    How to detect it? (Defense Techniques)

    If you suspect a model is infected, there are several techniques experts use:

  • Neural Cleanse: An attempt is made to reverse-engineer the model to see if there is any small pixel pattern that, if added to any image, forces massive misclassification.
  • Differential Activation: Neuron firing is observed. In a backdoor attack, there are usually specific neurons that are "asleep" with normal data and fire violently only when the trigger appears.
  • Pruning: Since backdoors often use neurons that aren't used for the main task, sometimes "pruning" the network (removing the weakest weights) can eliminate the backdoor without damaging normal accuracy.
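As a rough illustration of the differential activation idea, the sketch below compares the average activation of each neuron on clean inputs against inputs carrying a suspected trigger. The function name, the thresholds, and the activation values are all made up. In practice the activations would be recorded from a chosen layer of the real model.

```python
def suspicious_neurons(clean_acts, triggered_acts, quiet=0.1, loud=1.0):
    """Return indices of neurons that are near-silent on clean inputs but
    fire strongly on triggered inputs.

    Each argument is a list of activation vectors (one list of floats per input).
    The thresholds `quiet` and `loud` are arbitrary values for this sketch.
    """
    def mean_per_neuron(acts):
        n = len(acts)
        return [sum(col) / n for col in zip(*acts)]

    clean_mean = mean_per_neuron(clean_acts)
    trig_mean = mean_per_neuron(triggered_acts)
    return [
        i for i, (c, t) in enumerate(zip(clean_mean, trig_mean))
        if c < quiet and t > loud
    ]

# Made-up activations of a 4-neuron layer for three clean and three triggered inputs.
clean = [[0.9, 0.0, 0.4, 0.02], [1.1, 0.0, 0.5, 0.00], [0.8, 0.0, 0.3, 0.01]]
triggered = [[0.9, 0.0, 0.4, 3.10], [1.0, 0.0, 0.6, 2.80], [0.7, 0.0, 0.3, 3.50]]
print(suspicious_neurons(clean, triggered))  # [3]
```

Neurons flagged this way are natural candidates for pruning. Neuron 1, which is silent in both cases, is not flagged, because being unused is not the same as reacting to the trigger.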
    Advanced Defenses

  • 1. Adversarial Training: Attacking your own model while training it. It's like a vaccine: you inject a weak form of the virus so the immune system learns to recognize it and ignore it.
  • 2. Gradient Masking: Altering the model so that the gradient is "blurred" or non-existent in certain areas, making it difficult for an attacker to know which pixels to change.
  • 3. Defensive Distillation: A "master" model teaches a "student" model with smoothed probabilities, making it less sensitive to extreme variations.
  • 4. Model Ensemble Methods: Trusting a "committee" of 5 networks. An attack would have to deceive the majority simultaneously, which is statistically much harder.
  • 5. Feature Compression and Autoencoders: "Cleaning" the image before the AI sees it, removing high-frequency noise through JPEG compression or reconstruction networks.
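The "committee" idea from the ensemble item can be sketched as a simple majority vote. The five models below are stand-in functions that return a fixed label, not real networks, and `ensemble_predict` is a name invented for this example.

```python
from collections import Counter

def ensemble_predict(models, x):
    """Return the label chosen by most models, plus the share of votes it got."""
    votes = Counter(model(x) for model in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)

# Five stand-in "networks": plain functions returning a label (hypothetical).
# Two of them have been fooled by the adversarial input, three have not.
models = [
    lambda x: "panda",
    lambda x: "panda",
    lambda x: "squirrel",
    lambda x: "panda",
    lambda x: "squirrel",
]
print(ensemble_predict(models, x=None))  # ('panda', 0.6)
```

The vote share is worth keeping around: a narrow majority like this one can be treated as a warning sign and sent for manual review, instead of being trusted blindly.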
    Privacy Layers

  • Differential Privacy: Adds controlled mathematical noise to gradients during training to ensure the model learns general patterns but not specific details from a single user (prevents the AI from "memorizing" an ID or a face).
  • Homomorphic Encryption: Allows the AI to perform predictions on data that is encrypted. You send your encrypted X-ray to the cloud; the AI analyzes it without ever decrypting it and returns the encrypted result.
  • Reformers: Efficient variants of Transformers used to filter out irrelevant dependencies that an attacker could exploit to inject signaling noise into the context.
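For the Differential Privacy item above, the "controlled mathematical noise" usually takes the form of two steps: limit how much any single example can influence the update, then add Gaussian noise on top. The sketch below shows that idea in plain Python. The function name, the default values, and the toy gradients are assumptions, and it does not track the privacy budget that a real implementation would need.

```python
import math
import random

def dp_average_gradient(per_example_grads, clip_norm=1.0,
                        noise_multiplier=1.1, rng=None):
    """Clip each example's gradient to `clip_norm`, sum them, add Gaussian
    noise scaled by `noise_multiplier * clip_norm`, and return the average."""
    rng = rng or random.Random()
    total = [0.0] * len(per_example_grads[0])
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale   # each example contributes at most clip_norm
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]

# Toy per-example gradients: the first one is an outlier with norm 5.
grads = [[3.0, 4.0], [0.3, 0.4]]
print(dp_average_gradient(grads, rng=random.Random(0)))
```

Clipping is what stops one unusual record, such as a single ID or face, from dominating the update. The noise then hides whatever small influence that record still has.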