LLM Architecture Fundamentals
Transformer components mapped to attack surfaces: where weights live, how attention directs information flow, and how tokenization creates entry points.
If you work in cybersecurity, you already think in terms of attack surfaces — the set of points where an unauthorized actor can try to enter or extract data from a system. This chapter applies that framing to large language models. Rather than teaching transformer architecture from scratch, the goal is to give you a working understanding of the components that matter for security analysis: where weights live, how attention directs information flow, how tokenization converts text into the numerical representations the model actually processes, and how the inference pipeline moves data from prompt to output. If you already work with LLMs, this chapter will reframe familiar concepts through a security lens.
Understanding these internals is not optional for the analysis that follows. When Chapter 102 discusses backdoor triggers embedded in weight matrices, you need to know what a weight matrix is and how it participates in computation. When Chapter 104 walks through a scenario where an attacker exploits attention patterns, you need to understand what attention does and why certain manipulations are possible. When Chapter 106 examines policy responses to model risks, the technical grounding here is what separates informed assessment from hand-waving.
The architecture of a large language model is deceptively simple in concept — a sequence of matrix operations that predicts the next token — but the details create a rich surface for both legitimate capability and potential exploitation. A 70-billion-parameter model contains 70 billion floating-point values organized across hundreds of weight matrices. Each of these values was learned during training, and each participates in every inference. Modifying even a small fraction of them can alter behavior in targeted ways that are extremely difficult to detect through output-level testing alone.
This chapter establishes the vocabulary and mental models you will need for every subsequent chapter in the book.
The Transformer Architecture
Modern LLMs are built on the transformer architecture introduced by Vaswani et al. (2017). This section covers the two primary variants — encoder-decoder models (used in translation and summarization tasks) and decoder-only models (used by GPT, LLaMA, Qwen, DeepSeek, and most open-weight LLMs relevant to this book). The distinction matters because decoder-only models are autoregressive: they generate output one token at a time, with each token conditioned on all previous tokens. This autoregressive property has direct implications for how tool-use exploitation works (Chapter 102) and how steganographic encoding can be embedded in generated output (Chapter 005).
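The autoregressive loop can be sketched in a few lines. This is a toy illustration, not a real model: `predict_next` is a hypothetical stand-in for a full forward pass, reduced here to a deterministic function of the prefix.

```python
# Sketch of the autoregressive loop used by decoder-only models.
# `predict_next` is a hypothetical stand-in for a full forward pass:
# here it is just a toy deterministic function over a 5-token vocabulary.
def predict_next(token_ids):
    # A real model would run the entire prefix through every layer
    # and return a distribution over the vocabulary.
    return (sum(token_ids) + 1) % 5  # toy "model"

def generate(prompt_ids, max_new_tokens):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Each new token is conditioned on all previous tokens,
        # including the ones the model itself just produced.
        ids.append(predict_next(ids))
    return ids

print(generate([2, 3], 4))
```

The security-relevant point is structural: because every generated token feeds back into the context for the next step, influence over any early token propagates forward through the rest of the output.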
Self-Attention and Multi-Head Attention
Attention is the mechanism that allows a model to weigh the importance of different tokens in the input when computing each output representation. This section explains the query-key-value formulation, how attention scores are computed and normalized, and what “multi-head” means in practice — multiple parallel attention computations that capture different types of relationships. From a security perspective, attention patterns determine how a model routes information from context to output, making them relevant to both prompt injection (Chapter 009) and understanding how a model can be induced to act on specific instructions embedded in retrieved documents.
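The query-key-value formulation is compact enough to show directly. The following is a minimal single-head NumPy sketch with random matrices standing in for learned projections; a real implementation adds causal masking, multiple heads, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: scores measure how much each query
    # position attends to each key position; softmax normalizes each row
    # into a probability distribution over the context.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, head dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))  # each row of weights sums to 1
```

The `weights` matrix is the routing table referenced in the text: a high weight from an output position to a token in a retrieved document is exactly the mechanism by which injected instructions gain influence.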
Tokenization
Tokenization is the process of converting raw text into a sequence of integer token IDs that the model can process. This section covers the dominant algorithms — Byte Pair Encoding (BPE), SentencePiece, and WordPiece — and explains why tokenization is not a neutral preprocessing step. Tokenizer vocabularies are artifacts of training data, meaning they encode assumptions about language and content. Crucially, tokenizers are loaded from configuration files distributed alongside model weights, making them a supply chain vector: a modified tokenizer is custom code that runs on every input and is rarely inspected (Chapter 006).
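A minimal sketch of the BPE idea makes the supply-chain point concrete. This toy version operates on characters with a hand-written merge list; production byte-level BPE tokenizers operate on bytes and load their ordered merge list from a configuration file shipped alongside the weights.

```python
def bpe_tokenize(text, merges):
    # Minimal BPE sketch: start from characters, then greedily apply
    # an ordered list of learned merges. Because `merges` is loaded from
    # a distributed config file, tampering with that file silently
    # changes how every input is segmented.
    tokens = list(text)
    for pair in merges:                # merge order is part of the vocabulary
        merged = "".join(pair)
        i, out = 0, []
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

merges = [("l", "o"), ("lo", "w")]     # toy learned merges
print(bpe_tokenize("lowlow", merges))
```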
Embeddings and the Embedding Space
Once tokens are produced, they are mapped to dense vector representations through embedding matrices. This section explains token embeddings, positional encodings (both learned and rotary), and the geometric properties of embedding space. The embedding layer is where symbolic text becomes continuous mathematics, and it is the first point at which model weights directly influence the representation of user input. Embedding models — a distinct category from generative LLMs — are also introduced here, with deeper coverage in Chapter 010.
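The embedding lookup itself is nothing more than row indexing into a weight matrix, which is worth seeing once. The sketch below uses a random matrix in place of learned embeddings and a cosine-similarity helper to illustrate the geometric view of embedding space.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d_model = 10, 6
# The embedding matrix is itself a weight matrix: row i is the vector
# for token id i. (Random here; learned during training in practice.)
E = rng.normal(size=(vocab_size, d_model))

token_ids = [3, 1, 4]
X = E[token_ids]                       # embedding lookup is row indexing

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Geometry of embedding space: angular similarity between token vectors.
print(X.shape, round(cosine(E[3], E[1]), 3))
```

Because `E` is the first set of weights that touches user input, modifying even one of its rows changes the representation of every occurrence of that token.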
Weight Matrices: What Parameters Actually Store
A “7B parameter model” contains 7 billion floating-point values, but what are they? This section explains the role of weight matrices in attention layers (query, key, value, and output projections), feed-forward layers, and layer normalization. It covers how these matrices are organized, how they interact during a forward pass, and what it means to “modify” a model’s weights. This is the foundation for understanding backdoor injection: altering a small subset of weights to create a conditional behavior change that is invisible during normal operation but activates on a specific trigger (Chapter 102).
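What "modifying a model's weights" means can be shown at toy scale: the weights are arrays, and an edit is an array write. The sketch below uses random stand-ins for one layer's attention projections and perturbs a single entry.

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(2)
# Random stand-ins for the four attention projections of one layer.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

x = rng.normal(size=(d_model,))
before = W_o @ (W_v @ x)               # fragment of one forward pass

# "Modifying weights" is literally editing these arrays. A change this
# small shifts the output slightly everywhere, which is why backdoors
# gate the behavior change behind a rare trigger instead.
W_v[0, 0] += 0.5
after = W_o @ (W_v @ x)
print(float(np.abs(after - before).max()))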
Feed-Forward Layers and Layer Normalization
Between attention blocks, feed-forward networks apply learned nonlinear transformations to each token’s representation independently. This section covers the structure of feed-forward layers (typically two linear transformations with an activation function) and the role of layer normalization in stabilizing training and inference. Research on “knowledge neurons” suggests that factual knowledge tends to concentrate in feed-forward layers, which has implications for weight-level attacks: it provides a target for adversaries who want to modify specific behaviors without broadly degrading model quality.
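The feed-forward block's structure fits in a few lines. This sketch uses ReLU and random weights for clarity; production models typically use GELU or SiLU activations and, increasingly, gated variants.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean and unit variance.
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Two linear maps with a nonlinearity in between, applied to each
    # token's vector independently (no mixing across positions).
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU stand-in for GELU/SiLU
    return h @ W2 + b2

rng = np.random.default_rng(3)
d_model, d_ff = 8, 32                  # d_ff is typically ~4x d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(4, d_model))      # 4 tokens
y = ffn(layer_norm(x), W1, b1, W2, b2)
print(y.shape)
```

Note that `W1` and `W2` dominate a transformer layer's parameter count, which is consistent with the knowledge-neuron finding that these matrices are where much of a model's stored knowledge, and hence a precise editing target, resides.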
The Inference Pipeline
This section traces the complete path from user prompt to generated output: tokenization, embedding lookup, repeated passes through transformer layers, final logit computation, and sampling. Understanding this pipeline end-to-end is essential because each stage presents different security properties. The sampling step — where logits are converted to token probabilities and a token is selected — is particularly relevant to steganographic exfiltration (Chapter 005), because the model has genuine choice among plausible tokens at every generation step.
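The final sampling step can be sketched directly. The toy logits below are hypothetical; the point is that temperature-scaled softmax leaves several tokens with meaningful probability, and the selection among them is the degree of freedom a covert channel can exploit.

```python
import numpy as np

def sample(logits, temperature=1.0, rng=None):
    # Last stage of the pipeline: logits -> probabilities -> one token.
    # When several tokens are plausible, which one gets picked can
    # carry information without degrading output quality.
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p

logits = [2.0, 1.9, -3.0, 0.5]         # two near-tied candidates
token, p = sample(logits, temperature=0.8, rng=np.random.default_rng(0))
print(token, p.round(3))
```

Lowering the temperature sharpens `p` toward the top token; raising it flattens the distribution and widens the model's "genuine choice" at each step.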
Scale: What Changes Between 7B and 70B
Larger models don’t just have “more” of the same — the relationship between model size, capability, and security properties is nonlinear. This section covers how parameter count relates to layer count, hidden dimension, and attention head count. It also addresses the practical implications of scale for security: larger models are harder to inspect exhaustively, have more capacity to encode latent behaviors, and exhibit emergent capabilities that smaller models lack — including the ability to follow complex multi-step instructions, which is a prerequisite for sophisticated attack chains.
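The relationship between architectural dimensions and parameter count can be approximated with a back-of-the-envelope formula. The configurations below are illustrative values in the spirit of common 7B- and 70B-class open models, not exact published specs, and the formula ignores biases, normalization parameters, and grouped-query attention savings.

```python
def approx_params(n_layers, d_model, d_ff, vocab):
    # Rough decoder-only estimate (biases, norms, GQA savings ignored):
    #   attention: 4 projections of d_model x d_model per layer
    #   gated FFN (LLaMA-style): 3 matrices of d_model x d_ff per layer
    #   embeddings: vocab x d_model (an untied output head would double this)
    per_layer = 4 * d_model**2 + 3 * d_model * d_ff
    return n_layers * per_layer + vocab * d_model

# Illustrative 7B-class and 70B-class configs (assumed, not exact specs).
small = approx_params(n_layers=32, d_model=4096, d_ff=11008, vocab=32000)
large = approx_params(n_layers=80, d_model=8192, d_ff=28672, vocab=32000)
print(f"{small / 1e9:.1f}B  {large / 1e9:.1f}B")
```

Note the scaling behavior: moving from the small to the large config multiplies layer count by 2.5 and hidden dimension by 2, but parameter count grows by roughly 12x, because the per-layer matrices scale quadratically in `d_model`.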
Key Takeaway: Why Architecture Matters for Security
This closing section synthesizes the chapter’s content into the central argument: understanding LLM architecture is a prerequisite for security analysis because attacks exploit specific architectural properties. Backdoors live in weight matrices. Steganography exploits the sampling process. Supply chain attacks target tokenizers and model loading. Prompt injection exploits how attention routes information from context to output. The architecture is the attack surface, and the chapters that follow will explore each of these vectors in depth.
Summary
[Chapter summary to be written after full content is drafted.]