Attacks on Artificial Intelligence (II): Model Poisoning

February 13, 2024

Introduction

This article is part of a series on AI attacks of which we have already published a first article. It was an introduction focused on the attack known as jailbreak. Check it out to get a better understanding of the context of the series. This time we will focus on another type of attack known as poisoning.

It is quite relevant because in a Microsoft study, the industry considered it as the main risk for the use of Artificial Intelligence in production environments. We will focus on a type of poisoning attack called model poisoning.

What is a poisoning attack?

In a poisoning attack, attackers manipulate a model's training data or the AI model itself to introduce vulnerabilities, backdoors or biases that could compromise the model's security, effectiveness, or ethical behavior. This carries risks of performance degradation, exploitation of the underlying software, and reputational damage.

Poisoning of machine learning systems is not new, but has a long history in cybersecurity, one of the first known cases, was almost twenty years ago (2006) using poisoning to worsen the performance of automatic malware signature generation systems.

Poisoning attacks are aimed at affecting the availability or integrity of victim systems.

In terms of availability, the attack aims to disrupt the proper functioning of the system in order to degrade its performance, even to the point of making it completely unavailable, generating a full-fledged denial of service. In the case of integrity, the attack seeks to manipulate the output of the system by deviating it from that designed by its owners and causing problems of classification, reputation, or loss of control of the system.

Types of poisoning attack

The following are the main types of poisoning.

Data poisoning

The most popular form of poisoning to date, perhaps, involves providing manipulated or poisoned data for the training phase of the model to produce outputs that are controlled or influenced in some way or another by the attacker.

Intuitively it is often thought that it must require the attacker to control a significant amount of the training data to be used in order to influence the final system in the inference phase.

As we will see in the next article of this series, this intuition can be wrong and give us a perception of excessive security, due to the massive volume of data with which artificial intelligence models are currently trained.

Model poisoning

This consists of poisoning the very model that supports the inference phase of a system. It is usually associated with Federated Learning models where a central model is retrained or parameterized according to small contributions to it from a network of devices.

Attackers try to manipulate such contributions from the network of devices to affect the core model in unexpected ways. As we will see in this article, the particularities of Generative Artificial Intelligence models have brought it back to the forefront of research.

Training phases of a Generative Artificial Intelligence (GenAI) system

The initial training of LLMs is within the reach of very few organizations because of the need to obtain a massive volume of data and because of the computational costs associated with the creation of the so-called foundational models.

In order to obtain final models that are used in applications such as ChatGPT, Bard or Claude, a training phase called "finetuning" is subsequently performed, which is much lighter (several orders of magnitude) in terms of data requirements and computational capacity.

It is very common, by pure logic, for the vast majority of AI application developers to start with a foundational model developed by a third party and finetuning it specifically for their application.

This has been further accelerated with the emergence of open foundational and finetuning models and platforms like HuggingFace that make the latest generation of AI models available to third parties, similar to what Github provides for source code.

Inherent vulnerability of Generative AI to model poisoning attacks

This particular chain of generating final models for Generative AI systems or applications should start to smell of trouble in the supply chain to those more experienced in the cybersecurity industry. And indeed, it is precisely this feature that has brought model poisoning back into a leading role.

Imagine an attacker who makes available a powerful foundational model with free commercial use, but which has been carefully poisoned, for example with a backdoor that allows through a certain prompt to take control of the system.

The existence of a powerful state-of-the-art model with free commercial use is too tempting for many AI application developers, who use it massively as a basis for fine-tuning and launching applications on the market.

Model poisoning example

In the purest movie style, when the bad guy calls his victim and says a word that leaves him in a state of "hypnosis", in the following scenes we see with disbelief how the victim, out of his mind, tries to shoot the president.

So in this case the attacker enters the final system, created with his poisoned model, and launches his prompt including the backdoor detonator let's say it is "James Bond", the attacker takes control of the system being able for example to request the system prompts used for his alignment, misclassify threats, etc.

This image is taken from interesting research on poisoning whose reading we recommend to those more interested..

Possible mitigations

Model inspection and sanitization: Model inspection analyzes the trained machine learning model prior to deployment to determine if it has been poisoned. One research work in this field is NeuronInspect, which relies on "explainability" methods to determine different features between clean models and models with backdoors that are then used for anomaly detection.

Another more recent example would be DeepInspect which uses a conditional generative model to learn the probability distribution of triggering patterns and performs model patching to remove the trigger.

Once a backdoor is detected, model remediation can be performed by pruning, retraining, or finetuning to restore model accuracy.

Conclusions

Poisoning is one of the greatest risks derived from the use of Artificial Intelligence models, as indicated by Microsoft in its study based on questionnaires to relevant industry organizations and confirmed by its inclusion in the OWASP Top 10 threats to large language models.

With respect to model poisoning in particular, the particularities in the training phase of GenAI models make them an attractive target for threat actors who can enter the supply chain and poison assets that are then used by third parties. This behavior is very similar to the well-known attacks on open software dependencies in applications that rely on them for greater efficiency and speed in their development.

The convenience and accessibility of low-cost assets is a major temptation for software and information system developers. While our willingness to make use of them is more than logical, it must be accompanied by critical thinking and distrust.

The low "explainability" of AI models, and in particular complex ones such as LLMs, work against us when we use third-party components for corporate applications and should generate a higher degree of distrust with respect to other more common systems.

This is something we can translate into action, through thorough testing for potential anomalies and continued study of the state of the art in detecting potentially malicious artifacts in the components used to create our own AI systems.

As the Cat Stevens song says:

But then, a lot of nice things turn bad out there. Oh, baby, baby, baby, it's a wild world.

CONTINUING THIS SERIES

Attacks to Artificial Intelligence (I): Jailbreak
Cyber Security
AI of Things
Attacks to Artificial Intelligence (I): Jailbreak
February 7, 2024
Attacks on Artificial Intelligence (III): Data Poisoning
Cyber Security
AI of Things
Attacks on Artificial Intelligence (III): Data Poisoning
February 21, 2024