AI BLOG

What is Prompt Hacking (Prompt Cracking)? Attack with Prompts of Large Language Models [Prompt Injection, Prompt Leaking, Jailbreaking]

Person who needs help

I would like to know Prompt Hacking (Prompt Cracking)!

We can help you with your concerns.

If you would like to know the recommended job sites for AI Engineers, please click the following.

If you would like to know the recommended job sites for Prompt Engineers, please click the following.

What is Prompt Hacking (Prompt Cracking)?

Prompt Hacking (Prompt Cracking) is an attack that gathers information that the user does not intend when responding to a prompt.

Prompt hacking is performed by tricking or misleading the user.

For example, a user can be prompted to "enter your password" to steal the user's password.

To prevent prompt hacking, users should carefully review the prompts and avoid responding to suspicious prompts.

It is also important to keep your passwords safe.

How to Attack and Defend Against Prompt Hacking

We introduce how to attack and defend against prompt hacking.

Common types of prompt hacking are as follows.

  • Prompt Injection: inputting incorrect prompts into a machine learning model, such as a Large Language Model (LLM), to cause the model to produce incorrect results.
  • Prompt Leaking: the user persuades the model to divulge prior prompts that are normally hidden from the user.
  • Jailbreaking: asking the model to play a certain role, responding with an argumentative response, or pretending to be a good person in response to moderation instructions.

The main defenses against prompt hacking are as follows.

The main attacks against prompt hacking are as follows.

  • Obfuscation/Token Smuggling: simple techniques to circumvent filters, especially by replacing certain words that trigger filters with their own synonyms or modifying them to include typos (e.g., CVID instead of COVID-19, F*ck instead of Fuck).
  • Payload Splitting: splitting a hostile input into multiple parts and having the LLM combine them.
  • Defined Dictionary Attack: showing the LLM a code dictionary and mapping the final sentence appropriately to avoid Sandwich Defense.
  • Virtualization: similar to role prompts, "setting the scene" for the AI to send a series of prompts, each one more similar to a fraudulent email that asks for personal information (password, credit card, etc.).
  • Indirect Injection: indirect introduction of hostile instructions from third-party data sources such as web search or API calls.
  • Recursive Injection: injecting a prompt into the first LLM and generating output containing an injection instruction into the second LLM, for example, to circumvent the Separate LLM Evaluation defense mechanism.
  • Code Injection: allowing an attacker to execute arbitrary code (often Python) in an LLM.

Explanatory Articles about Prompt Hacking

Explanatory articles about Prompt Hacking are as follows.

Explanatory Videos about Prompt Hacking

Explanatory videos about Prompt Hacking are as follows.

Summary

We introduced "Prompt Hacking (Prompt Cracking)," an attack that collects information that the user does not intend when responding to a prompt.

Prompt hacking includes Prompt Injection, Prompt Leaking, and Jailbreaking.

If you would like to know the recommended job sites for AI Engineers, please click the following.

If you would like to know the recommended job sites for Prompt Engineers, please click the following.

-AI, BLOG
-, , , ,