Attacks to Artificial Intelligence (I): Jailbreak
Introduction
It is well known how the use of Artificial Intelligence has accelerated sharply with the launch of ChatGPT just over a year ago in November 2022. I know, it seems like a decade ago. Today's technology is moving at incredible rates. Like any new emerging technology, the popularity of Large Language Models (hereafter LLMs), brings with it a great deal of interest from both the scientific community to understand their limitations and from cybercriminals to see how they can employ the benefits of these technologies for their activities.
Much effort has been, and is still being, invested in building secure default behavior into AI systems, including their generative variant and their end applications, but we are facing a new cat-and-mouse scenario, so typical of cyber security.
While application developers work to protect systems from misuse, others are looking for creative ways to use this exciting new technology to their own, perhaps not so exciting, advantage.
The Trusted Artificial Intelligence department of the US National Institute of Standards and Technology (NIST) has recently published an interesting report on the different types of attacks that can be performed on these systems in an attempt to provide a common context for dealing with these new risks. It is highly recommended reading for any professional interested in this emerging field.
Types of attacks on Artificial Intelligence systems
In the NIST report, attacks are analyzed from several perspectives: target of the attackers, capabilities of the attackers and the impact on the classic security triad: Availability, Integrity, and Confidentiality.
We can identify the following main types of attacks:
- Evading system security controls (Jailbreak)
- Prompt Injection
- Poisoning
- Privacy Attacks
- Supply-chain attacks
- Misuse/Abuse of systems (Abuse violations)
We are aware of the relevance that Artificial Intelligence, and in particular Generative Artificial Intelligence, is currently accumulating, so we have decided to publish a series of articles where we will analyze in detail each of these types of attacks, look at relevant research, provide examples, look at some measures for mitigating their risks and draw the main conclusions.
The aim of this series of articles is ultimately to help organizations become aware of the specific risks of these types of systems so that they can make a better-informed decision about the potential internal use of these technologies or even the incorporation of these technologies into their products and services.
In this article we will focus on jailbreak attacks, which aim to bypass the security restrictions imposed by LLM-based AI applications and obtain an unwanted response from the application designers.
What is a jailbreak attack?
A jailbreak attack on an LLM model is defined as an attempt to manipulate or "trick" the model into performing actions that go against its policies or usage restrictions. These actions may include revealing sensitive information, generating prohibited content, or performing tasks that have been explicitly limited by its developers.
Attackers often use advanced techniques that may include manipulating the context of the model to "convince" the system to act in an undesirable manner.
✅ For instance, they might try to disguise forbidden requests as harmless or use coded language to bypass the model's filters.
Why do jailbreak attacks work?
Generative Artificial Intelligence systems follow the classic learning phases of a machine learning system: training phase and inference phase. However, these systems have the particularity that in the inference phase there is an alignment of the behavior desired by the creators of the application for the specific use case for which it has been designed.
Let's see it with an example of alignment, the instruction: "You are a financial expert assistant who responds concisely and politely" can be introduced to the LLM as a "system" input, to which the end-user prompt is added to generate the complete context that the LLM receives to produce a result.
A jailbreak attack takes advantage of this alignment being performed at runtime and concatenates with the user input to tailor a carefully crafted user input to escape the sandbox posed by the application administrators.
The following are examples of jailbreaking techniques in the context of LLMs.
Roleplay
If we were to ask a model directly about how we can pass an exam in a non-lawful way, to everyone's great joy, applications like ChatGPT will refuse arguing that they should not be used for fraud purposes.

However, what if we invite ChatGPT to a role-play, where we ask it to act like our doting grandmother, who " by chance" would tell us anecdotes about tricks she used to pass certification exams to help us sleep when we were restless.
With this initial prompt we could get to bypass the security limitations since we are not going to cheat, but only want to play a game that "circumstantially" involves the description of possible cheats to pass an exam.
So, in effect ChatGPT becomes our cheating grandmother and tells us some of her anecdotes to cheat in an exam.

Message coding
We all know the ability of LLMs to understand and reply in different languages: Spanish, English, French, German, etc. Similarly, they also understand language encodings such as Base64 an encoding widely used on the Internet for information exchange.
What happens if we try to use this encoding in our prompts? First of all, as expected, ChatGPT understands the content of our request. Using the direct prompt above.

Mixing both roleplay and encoding techniques we get a response where we see that again the restrictions have been avoided.

This happens, since most of the ChatGPT application protection training will probably have been done with English data, perhaps even automated its translation into the most common languages and its respective inclusion in a kind of blacklist. However, it can be inferred that this is somehow a way of “putting doors on the field”.
Automating jailbreaks
The examples seen so far try in a quasi-artisanal way to find weaknesses in LLMs to bypass the security constraints imposed in their design.
This hardly scales, though. Could such techniques be automated, to find a text that helps to bypass the limitations of LLMs today?
Let's first think about what ideal characteristics it should have:
- Black box → We only have access to the running system; we do not know its architecture and specific training.
- Universal → The generated text should be able to be combined with any user prompt regardless of its wording.
- Portable/Transferable → The generated text can be used in multiple LLMs with the same positive jailbreak result.
Well, if you thought wrong, you were right. Precisely there are several recent scientific papers, like this one or this other that have managed to obtain seemingly random text strings that can be concatenated to a request to an LLM model and effectively disable its security restrictions while complying with many of the features mentioned above.

If we think about potential defense mechanisms against this technique, we can quickly see how including such suffixes in a listing would be ineffective as new ones could be generated very quickly.
✅ A more appropriate approach would be to identify the consistency of what is written in the prompt, and that is what the industry is working towards. As mentioned, the sequence that produces the jailbreak is to human eyes very random and this could be detected prior to the execution of the model to stop its execution.
Conclusions
Generative Artificial Intelligence, fascinating and transformative as it is, is, at its core, nothing more than predicting the next word that is most consistent with the current context of the conversation, taking into account inputs, outputs and the knowledge with which the model has been trained.
The channel sharing of system instructions along with end-user data in LLM alignment generates a number of intrinsic vulnerabilities that enable security constraint evasion mechanisms such as jailbreaking.
To educated readers in Cyber Security this will remind them of classic attacks such as SQL injection, and indeed, the attack vector is very similar.
This is a somewhat philosophical thought that should be approached by the AI industry with deep reflection and caution. The existence of powerful open models allows the industry to accelerate in innovation and improvement of generative AI, something that we should view as inherently positive.
However, the existence of these models also poses a risk that needs to be measured and assessed, as they can be used to generate attacks on closed systems due to the transferability characteristics between different LLMs.
The final decision on whether or not to allow open models, or how to restrict their misuse, is not a trivial one. Establishing the rules of the game and deciding one way or the other becomes urgent as the power of these systems increases, otherwise a late decision will not effectively achieve any of the intended and desired effects.
◾ CONTINUING THIS SERIES