Part I · Chapter 009

Prompt Injection

Direct and indirect prompt injection as distinct threat vectors -- the most exploited LLM vulnerability.

If you have worked with SQL injection, cross-site scripting, or command injection, you already understand the fundamental pattern: an attacker provides input that is interpreted not as data but as instructions. Prompt injection is the LLM-native version of this vulnerability class, and it is arguably the most consequential security problem in the field. When a model processes text that contains instructions — whether from a user, a retrieved document, or a tool call result — it has no reliable mechanism to distinguish those instructions from its legitimate system prompt. The model treats all text in its context as potential instructions, because that is how language models work.

This chapter covers both forms of prompt injection: direct injection, where an adversarial user crafts input that overrides the system prompt, and indirect prompt injection (IPI), where malicious instructions are embedded in data the model encounters during normal operation — retrieved documents, web pages, email content, API responses. Direct injection is the better-known variant and is largely an access control problem: if the user is already interacting with the model, direct injection exploits the system prompt’s authority. IPI is fundamentally more dangerous because it allows a remote attacker to influence the model’s behavior without any direct access to the model or its users. The attacker simply places instructions where the model will encounter them.

The distinction matters enormously for the threat analysis in Part II. Almost every high-severity attack chain in Chapter 102 and Chapter 104 depends on IPI as the initial vector. A backdoor in model weights is one thing — it requires compromising the training pipeline. IPI requires only the ability to place text in a document that will be retrieved by a RAG system, or in a web page that will be fetched by a model with browsing capability. The barrier to entry is orders of magnitude lower, and the attack surface is any data source the model can access.

For security professionals who need a single sentence to carry away from this chapter: prompt injection is to LLMs what SQL injection was to web applications in the 2000s — a fundamental, architecturally rooted vulnerability that cannot be fully patched at the application layer because it arises from the design of the underlying system.

Direct Prompt Injection

Direct prompt injection occurs when a user provides input that overrides, modifies, or bypasses the model’s system prompt instructions. This section covers the mechanics: how system prompts are implemented (they are just text prepended to the conversation, with no architectural enforcement of their authority), common injection techniques (instruction overrides, role-playing exploits, context confusion), and why direct injection is primarily an alignment and access control concern rather than a remote exploitation vector. Direct injection requires the attacker to have user-level access to the model’s input, which limits its relevance for the nation-state threat models in Part II but makes it central to the jailbreaking discussion.

Indirect Prompt Injection (IPI)

IPI is the variant that transforms prompt injection from an interesting curiosity into a critical security vulnerability. This section explains how IPI works: malicious instructions are embedded in content that the model processes as part of its normal operation — a document retrieved by a RAG pipeline, a web page fetched during research, an email body processed by an assistant, or a tool call result returned from an API. The model encounters these instructions in its context window alongside legitimate data and follows them because it cannot architecturally distinguish between “data to be processed” and “instructions to be followed.” Greshake et al. (2023) demonstrated that IPI is practical and effective in real-world LLM deployments.

Why IPI Is Fundamentally Harder to Defend

This section explains the architectural reason that IPI resists patching: LLMs operate on a flat context window where all text — system prompt, user input, retrieved data — is processed through the same attention mechanism with no privilege separation. There is no equivalent of memory protection rings, no hardware-enforced boundary between code and data. Instruction hierarchies (telling the model that system prompts outrank other content) provide soft behavioral guidance but no architectural guarantee. This is not a bug that can be fixed in the next software release; it is a property of how transformer-based language models process sequences. Defending against IPI requires either solving this architectural limitation or implementing defense-in-depth at the application layer — and neither approach currently provides complete protection.

Injection in Different Contexts

IPI can enter the model’s context through any data channel the deployment exposes. This section surveys the practical contexts where injection occurs: RAG-retrieved documents (the most common enterprise vector), web content accessed through browsing tools, email bodies processed by AI assistants, API responses from external services, tool call results from databases or code execution, and even user-uploaded files. Each context has different characteristics — different encoding constraints, different detection opportunities, and different attacker profiles. The breadth of potential injection surfaces is what makes IPI a systemic risk rather than a point vulnerability.

Multi-Step Attack Chains

Prompt injection is rarely the entire attack — it is the enabling step that initiates a chain. This section explains how IPI serves as a vector for more complex operations: an injected instruction causes the model to make a tool call that exfiltrates data, or to modify its responses to subsequent users, or to chain multiple tool calls together in a sequence that achieves an objective the injected instruction specified. The concept of injection as a vector rather than an attack is important because it means the severity of IPI scales with the capabilities of the model it targets. A text-only model with IPI is a nuisance; an agentic model with IPI and tool access is a potential breach. Chapter 104’s scenarios demonstrate these chains concretely.

Current Defenses and Why They’re Insufficient

This section provides an honest assessment of the defensive landscape for prompt injection. It covers instruction hierarchies (system prompts that tell the model to ignore injected instructions — effective against naive attacks, trivially bypassed by sophisticated ones), input/output sanitization (filtering known injection patterns — a cat-and-mouse game), delimiter-based approaches (wrapping untrusted content in markers — models can be instructed to ignore delimiters), and more sophisticated approaches like LLM-based input classifiers (using a second model to detect injection attempts — adds latency and its own attack surface). The section’s conclusion, which Chapter 103 examines further, is that no current defense provides reliable protection against adaptive adversaries (PROVEN).

The Confused Deputy Problem

The confused deputy is a classic security concept: a privileged process is tricked into using its authority on behalf of an unauthorized request. This section applies the concept to LLM agents: the model has legitimate authority to call tools, read files, and make requests on behalf of the user, but IPI causes it to exercise that authority on behalf of the injected instructions instead. The model is the confused deputy — it has permissions, and it is being tricked into using them. This framing connects prompt injection to the broader agentic risk model established in Chapter 008 and is used extensively in the threat analysis (Chapter 102).

Relationship to Traditional Injection Attacks

This section draws explicit parallels between prompt injection and traditional injection vulnerabilities for the benefit of security readers. Like SQL injection, prompt injection arises from mixing data and instructions in the same channel without adequate separation. Like XSS, it can be delivered remotely through content the victim encounters during normal operations. Like command injection, it can escalate from input manipulation to system-level actions. But prompt injection differs in critical ways: the “parser” is a neural network that processes language probabilistically rather than a deterministic interpreter, which means there is no definitive grammar for injection payloads and no canonical sanitization approach. This comparison helps security professionals calibrate their intuitions.

Key Takeaway: Prompt Injection Is the Enabling Vulnerability

This closing section frames prompt injection as the single most important vulnerability class for the book’s threat analysis. Without prompt injection, the attack chains in Part II would require direct access to the model, its training pipeline, or its deployment infrastructure. With IPI, an attacker needs only the ability to place text where the model will encounter it. This is the vulnerability that connects adversary intent to model behavior at scale, and it is the reason that the highest-severity threat scenarios in Chapter 104 involve models with both tool access and exposure to external data. The mitigations assessed in Chapter 103 and the recommendations in Chapter 107 treat IPI defense as a first-order priority.

Summary

[Chapter summary to be written after full content is drafted.]