Quantization
GGUF, GPTQ, AWQ — how precision reduction affects model behavior and what it means for embedded backdoors.
Most people who run open-weight LLMs locally have never loaded a model at its original precision. The 7B parameter model on your machine is almost certainly quantized — its weights compressed from 16-bit floating point down to 4-bit or 5-bit integers, reducing memory requirements by 75% or more. Quantization is the reason large language models run on consumer GPUs at all, and it is the dominant format for local deployment through tools like Ollama and llama.cpp. For most practitioners, quantized models are the models they interact with.
This chapter matters for the book’s analysis because quantization has a security property that is rarely discussed: it alters weight values. Every weight in a quantized model has been rounded, scaled, and mapped to a lower-precision representation. In a model designed to function correctly, this precision loss is carefully managed to minimize quality degradation. But for a model that contains a backdoor — a precise weight configuration that activates only on a specific trigger — quantization introduces noise into exactly the mechanism the backdoor depends on. This plausibly makes quantization an “incidental mitigation”: it was not designed to disrupt backdoors, but it can do so as a side effect of the compression process.
The qualifier “incidental” is important. Quantization is not a defense strategy, and it would be irresponsible to treat it as one. The disruption is probabilistic, varies by quantization method and bit-width, and has not been systematically studied in an adversarial context. An attacker who knows their model will be quantized can design backdoors that are robust to the precision changes. But for attacks that were not designed with quantization in mind, the format conversion adds a layer of unintentional perturbation that complicates the attack.
Chapter 103 evaluates quantization as a mitigation alongside intentional defenses. Chapter 105 factors quantization into the comparative risk assessment for different deployment configurations. This chapter provides the technical grounding for both.
What Quantization Does: Precision Reduction and Weight Mapping
Quantization replaces high-precision floating-point weight values with lower-precision integer representations. This section explains the core mechanics: how continuous float values are mapped to discrete integer bins through scaling and zero-point offsets, what “calibration” means (using representative data to determine optimal mapping parameters), and the distinction between symmetric and asymmetric quantization schemes. The key point for security analysis is that this mapping is lossy — information is permanently destroyed, and the specific information lost depends on the quantization algorithm, the bit-width, and the statistical distribution of the original weights.
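The mapping described above can be sketched in a few lines. This is a deliberately minimal illustration of asymmetric (zero-point) quantization over a whole tensor; real quantizers work per-block or per-channel and choose their parameters from calibration data, and the function names and sample values here are invented for the example.

```python
def quantize_asymmetric(weights, bits=4):
    """Map floats to integer codes in [0, 2^bits - 1] via a scale and zero-point."""
    qmax = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax             # width of one quantization bin
    zero_point = round(-w_min / scale)         # the integer code representing 0.0
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Reconstruct approximate floats; the rounding error is permanent."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.42, -0.10, 0.0, 0.07, 0.31, 0.55]
q, scale, zp = quantize_asymmetric(weights)
recovered = dequantize(q, scale, zp)
errors = [abs(w - r) for w, r in zip(weights, recovered)]
print(q)          # integer codes in [0, 15]
print(recovered)  # floats snapped to the 16-point grid
```

Absent clipping, each reconstructed weight lands within half a bin of the original — and that half-bin of permanently lost information is what the rest of this chapter’s security analysis turns on. A symmetric scheme is the same idea with the zero-point fixed at the middle of the integer range.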
Quantization Formats: GGUF, GPTQ, AWQ, and bitsandbytes
The open-weight ecosystem has converged on several quantization formats, each with different approaches to precision reduction. This section covers GGUF (the format used by llama.cpp and Ollama, supporting CPU and mixed CPU/GPU inference), GPTQ (GPU-focused post-training quantization using calibration data), AWQ (activation-aware quantization that preserves the most important weights at higher precision), and bitsandbytes (integrated into the Hugging Face ecosystem for on-the-fly quantization). Each format makes different tradeoffs and has different implications for which weight values are preserved most faithfully — which in turn affects which backdoor configurations might survive the process.
Quality vs. Size Tradeoffs
Quantization is fundamentally a tradeoff between model size and model quality. This section covers the empirical relationship between bit-width and performance degradation, measured through perplexity and task-specific benchmarks. The key ranges are: 8-bit quantization typically preserves near-full quality; 4-bit and 5-bit (the most popular for local deployment) introduce measurable but often acceptable degradation; below 4-bit, quality drops more sharply and task-dependent failures become common. These degradation curves matter because they bound what organizations actually deploy — and therefore which quantization levels the security analysis should focus on.
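The size half of the tradeoff is simple arithmetic: weight storage scales linearly with bit-width. The sketch below ignores per-block scale overhead, activations, and the KV cache, so the figures are illustrative lower bounds rather than real memory requirements.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB, ignoring scale overhead
    and non-weight memory (illustrative only)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # a "7B" model
for bits in (16, 8, 5, 4):
    print(f"{bits:>2}-bit: {model_size_gb(n, bits):.1f} GB")
```

This is why 4-bit and 5-bit dominate local deployment: they are the widths at which a 7B or 13B model first fits comfortably in consumer GPU memory while the quality degradation remains tolerable.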
Effects on Model Behavior
Beyond aggregate quality metrics, quantization can produce specific behavioral changes. This section covers known effects: increased repetition at very low bit-widths, degraded performance on tasks requiring precise numerical reasoning, altered calibration of confidence (models may become over- or under-confident), and occasional “quantization artifacts” where specific inputs produce notably different outputs from the full-precision model. These effects are relevant because they establish that quantization is not a transparent compression — it changes the model, and those changes have security implications in both directions.
Security Implications: Incidental Backdoor Disruption
This section directly addresses the security angle: how and why quantization can disrupt backdoor triggers, and the limits of this disruption. Backdoors that rely on precise weight values — such as those injected by methods like BadNets (Gu et al., 2017) or weight poisoning attacks — depend on specific numerical configurations that quantization perturbs. The disruption is strongest at lower bit-widths and for backdoors that use narrow trigger conditions. However, quantization is not a reliable defense: some backdoor methods produce triggers that are robust to noise, and an adversary who anticipates quantization can calibrate their attack accordingly. This analysis feeds directly into Chapter 103’s mitigation assessment.
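To make the disruption mechanism concrete, here is a deliberately contrived toy: a single “detector” direction whose activation on the trigger clears its threshold by a thin margin, which a symmetric 4-bit round-trip then destroys. Every number is invented for illustration — real backdoors are distributed across many weights and layers and can be far more robust than this.

```python
def quantize_roundtrip(w, bits=4):
    """Round each weight to the nearest point on a symmetric low-precision
    grid (toy stand-in for 4-bit quantization)."""
    qmax = (1 << (bits - 1)) - 1            # 7 levels each side for 4-bit
    scale = max(abs(x) for x in w) / qmax   # one scale for the whole vector
    return [round(x / scale) * scale for x in w]

# Toy "backdoor neuron": tuned so the trigger input barely clears the
# threshold. Quantization pulls the small weights down a full bin each.
weights = [0.07] + [0.014] * 63             # detector weights
trigger = [1.0] * 64                        # the trigger pattern
threshold = 0.9

act_fp = sum(w * x for w, x in zip(weights, trigger))
act_q = sum(w * x for w, x in zip(quantize_roundtrip(weights), trigger))

print(f"full precision: {act_fp:.3f} -> fires: {act_fp > threshold}")
print(f"4-bit         : {act_q:.3f} -> fires: {act_q > threshold}")
```

Run in reverse, the same construction shows the limits the section describes: an attacker who places weights on (or near) the quantization grid sees no round-trip error at all, and the trigger survives conversion untouched.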
GGUF Metadata and the llama.cpp Ecosystem
GGUF is more than a weight format — it is a container that includes metadata, tokenizer configuration, and model architecture definitions. This section covers the GGUF specification, what metadata fields are present, and where trust assumptions live in the llama.cpp loading pipeline. While GGUF is safer than pickle-based formats (it does not support arbitrary code execution by design), the metadata and configuration data it carries still represent a surface that deserves inspection. Models distributed through Ollama and similar tools are almost exclusively in GGUF format, making this the most practically relevant format for security analysis of local deployments.
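As a concrete starting point for that inspection, the fixed-size GGUF preamble can be read with a few lines of struct parsing. This sketch handles only the leading fields of the spec — magic, version, tensor count, and metadata key/value count — while the typed metadata entries that follow, where most of the trust-relevant content lives, would need a fuller parser. The synthetic header at the bottom exists only to exercise the function.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF preamble: 4-byte magic, then a little-endian
    uint32 version and uint64 tensor and metadata key/value counts."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return version, n_tensors, n_kv

# Minimal synthetic header (version 3, two tensors, five metadata entries)
# to exercise the parser without a real model file.
fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(fake))
```

Even this shallow check illustrates where trust assumptions begin: everything after the magic bytes is attacker-controlled input to the loader, which is why the metadata surface deserves the scrutiny this section argues for.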
Key Takeaway: Quantization as an Incidental Defense
Quantization is a deployment optimization, not a security tool — but it has genuine security side effects. This chapter’s contribution to the book’s analysis is a clear-eyed assessment of what quantization does and does not do to adversarial weight configurations. It disrupts some backdoors, alters model behavior in ways that can break fine-tuned triggers, and introduces a transformation step between the original weights and what actually runs on the target machine. None of this constitutes a defense strategy, but ignoring it would leave the risk analysis incomplete. The mitigation assessment in Chapter 103 and the comparative risk evaluation in Chapter 105 both incorporate quantization effects alongside intentional defenses.
Summary
[Chapter summary to be written after full content is drafted.]