Training and Fine-Tuning
Pre-training, SFT, RLHF, and LoRA/QLoRA — how models learn behavior and where adversaries can intervene.
Every security vulnerability in a machine learning model traces back to one question: who controlled the training process? A model’s weights are the product of its training data, its optimization procedure, and the decisions made by the people who ran the pipeline. Unlike traditional software, where you can inspect source code and trace logic, a trained neural network’s behavior is encoded implicitly across billions of parameters that were shaped by statistical optimization over enormous datasets. There is no “source code” to audit — the training process is the source, and once it’s done, the weights are the only record of what happened.
This chapter explains how LLMs go from random initialization to useful behavior, with a focus on the stages where an adversary could intervene. The training pipeline has distinct phases — pre-training, supervised fine-tuning, and alignment — and each phase has different security properties. Pre-training is where data poisoning happens at scale, because the model ingests trillions of tokens from sources it cannot verify. Fine-tuning is where instruction-following behavior is shaped, and where a small amount of carefully crafted data can have outsized influence. Alignment through RLHF or DPO is meant to impose safety constraints, but these constraints can be removed more easily than most practitioners realize.
For readers coming from a security background, the key mental model is this: the training pipeline is a build system for behavior, and like any build system, its security depends on the integrity of its inputs and the trustworthiness of its operators. For ML practitioners, the framing shifts from “does my model perform well on benchmarks?” to “could someone have tampered with the inputs to this pipeline in a way I wouldn’t detect?”
The distinction between what changes during pre-training versus fine-tuning is especially important for Part II’s analysis. Chapter 102 examines how backdoors are inserted during training and why fine-tuning on clean data does not reliably remove them. Chapter 103 evaluates mitigation strategies that operate on the training pipeline. Understanding the mechanics here is what makes those assessments credible rather than speculative.
Pre-Training: Objectives, Data Scale, and Compute
Pre-training is the foundational phase where a model learns language structure, factual knowledge, and reasoning patterns from massive text corpora. This section covers the next-token prediction objective, the scale of training data (typically trillions of tokens from web crawls, books, code, and curated datasets), and the compute requirements that make pre-training a resource-intensive process accessible only to well-funded organizations. The security relevance is direct: whoever controls the pre-training data and process has the deepest level of influence over the model's behavior, and the scale of the data makes comprehensive auditing infeasible.
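The next-token prediction objective is just cross-entropy over the vocabulary at every position. A minimal sketch, using a toy four-token vocabulary rather than any real model's logits, shows why the objective rewards whatever the data contains, poisoned or not: the loss is low whenever the model assigns high probability to the token that actually came next.

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy loss for one next-token prediction step.

    logits: unnormalized scores over the vocabulary
    target_id: index of the token that actually came next in the corpus
    """
    # Softmax with max-subtraction for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    prob_target = exps[target_id] / sum(exps)
    return -math.log(prob_target)

# A confident, correct prediction yields a small loss; a uniform
# (maximally uncertain) prediction yields -log(1/4) ~= 1.386.
confident = next_token_loss([0.1, 5.0, 0.2, 0.0], target_id=1)
uniform = next_token_loss([1.0, 1.0, 1.0, 1.0], target_id=1)
print(f"confident: {confident:.3f}, uniform: {uniform:.3f}")
```

Averaged over trillions of tokens, this single scalar is the only signal shaping the weights, which is why a poisoned document influences the model exactly as much as a legitimate one of the same size.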
Supervised Fine-Tuning (SFT)
After pre-training, supervised fine-tuning teaches the model to follow instructions and respond in a conversational format. This section covers how SFT datasets are constructed — instruction-response pairs that demonstrate desired behavior — and how relatively small datasets (tens of thousands of examples) can dramatically change a model’s output style without altering its underlying knowledge. The security implication is that SFT is a leverage point: poisoning even a small fraction of the fine-tuning dataset can introduce targeted behaviors that survive into the deployed model (Chapter 102).
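Mechanically, an SFT example is the prompt and response concatenated into one token sequence, with the loss masked so it applies only to response positions. The sketch below uses hypothetical token IDs; the `-100` ignore-index is the convention used by common training frameworks such as Hugging Face Transformers, and is assumed here.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_sft_example(prompt_ids, response_ids):
    """Assemble one SFT training example (hypothetical token IDs).

    The model sees prompt + response as input, but gradient signal comes
    only from the response: prompt positions are masked so the model
    learns to produce responses, not to predict prompts.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical IDs standing in for "User: ... Assistant:" plus a reply.
inp, lab = build_sft_example([101, 7592, 102], [2023, 2003, 103])
print(inp)  # [101, 7592, 102, 2023, 2003, 103]
print(lab)  # [-100, -100, -100, 2023, 2003, 103]
```

This structure is what makes SFT poisoning efficient: every token of a malicious response receives full gradient weight, so a few hundred crafted pairs in a dataset of tens of thousands can install a targeted behavior.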
RLHF and DPO: Alignment with Human Preferences
Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) are techniques for aligning model outputs with human preferences — typically making models more helpful, less harmful, and more honest. This section explains the reward model approach (RLHF) and the simpler pairwise preference approach (DPO), focusing on what these methods can and cannot guarantee. From a security standpoint, alignment is a behavioral overlay, not a structural constraint: it can be fine-tuned away with modest effort, and it does not prevent a model from containing latent capabilities or backdoors that bypass the alignment layer entirely.
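The "behavioral overlay" point is easiest to see in the DPO loss itself, which only nudges the policy's relative preference between two responses. A minimal sketch for a single preference pair, assuming summed log-probabilities are already computed for the policy and the frozen reference model:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    The margin measures how much more the policy prefers the chosen
    response over the rejected one, relative to the reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When policy and reference agree exactly, the margin is 0 and the
# loss sits at -log(0.5) ~= 0.693: no preference has been learned yet.
print(round(dpo_loss(-5.0, -6.0, -5.0, -6.0), 3))
```

Note what the loss never touches: it reshapes which of two sampled responses is preferred, but it does not remove the capability to generate the rejected one. That is the structural reason alignment can be reversed by further fine-tuning, and why it offers no defense against backdoors that activate outside the preference distribution.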
LoRA and QLoRA: Parameter-Efficient Fine-Tuning
Most practitioners do not run full fine-tuning — they use parameter-efficient methods like Low-Rank Adaptation (LoRA) and its quantized variant QLoRA. This section explains how LoRA works by freezing the original model weights and training small adapter matrices, what this means for the original model’s integrity, and why LoRA adapters deserve their own security scrutiny. LoRA is relevant to Part II because it is the most common method for customizing open-weight models in practice, and adapter merging introduces its own supply chain considerations (Chapter 106).
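The core of LoRA fits in a few lines: the frozen weight matrix W is left untouched, and a low-rank update B·A (scaled by alpha/rank) is trained alongside it and later merged in. A toy sketch with a 4x4 layer and a rank-1 adapter, using plain Python lists rather than any real framework:

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_merge(W, A, B, alpha, rank):
    """Merge a LoRA adapter into frozen base weights: W' = W + (alpha/rank) * B @ A.

    W: d_out x d_in (frozen), B: d_out x rank, A: rank x d_in (trained).
    """
    delta = matmul(B, A)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 4x4 layer with a rank-1 adapter: 16 frozen weights, only
# 4 + 4 = 8 trained adapter parameters.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[1.0], [0.0], [0.0], [0.0]]   # d_out x rank
A = [[0.0, 0.5, 0.0, 0.0]]         # rank x d_in
merged = lora_merge(W, A, B, alpha=1.0, rank=1)
print(merged[0])  # [1.0, 0.5, 0.0, 0.0]
```

The security-relevant detail is the merge step: once B·A is folded into W, the merged checkpoint is indistinguishable in format from a fully fine-tuned model, so an adapter's provenance does not survive distribution unless it is tracked separately.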
The Training Data Supply Chain
The data used for training is itself a supply chain, with its own provenance, integrity, and trust questions. This section covers where training data comes from (Common Crawl, GitHub, curated datasets, synthetic data), how datasets are filtered and deduplicated, and what control model producers have over data quality. For Chinese-origin models specifically, training data composition is largely opaque — the datasets are not fully disclosed, and the filtering and curation decisions reflect the priorities of the producing organization. This opacity is a concrete risk factor discussed in Chapter 105.
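To make the filtering stage concrete, here is a minimal sketch of exact deduplication by content hash, the simplest operation in a data pipeline. Production pipelines typically also use near-duplicate detection (e.g., MinHash-based methods); this sketch is deliberately the weakest version to illustrate its limits.

```python
import hashlib

def dedupe_exact(documents):
    """Exact deduplication by SHA-256 of whitespace-normalized content.

    Trivially reformatted copies collapse to one entry, but any
    paraphrase -- including a paraphrased poisoned document --
    passes through untouched.
    """
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = ["The quick brown fox.", "The  quick brown fox.", "A different document."]
print(len(dedupe_exact(corpus)))  # 2
```

The gap between what this catches and what an adversary can inject is the point: filtering decisions like this one are made by the model producer, are rarely disclosed in full, and bound what poisoned content survives into training.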
Checkpoints, Model Merging, and the Open-Weight Ecosystem
Once a model is trained, its weights can be saved as checkpoints, merged with other models, and redistributed by anyone. This section covers how checkpoints work, the growing practice of model merging (combining weights from different fine-tuned variants), and what this means for provenance tracking. Model merging is particularly relevant because it creates derivative models whose behavior is not a simple function of any single training run — making attribution and analysis harder (Chapter 106).
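The simplest merging technique is linear interpolation of parameters across checkpoints, and even this minimal form shows why provenance breaks down. A sketch over tiny flat parameter lists standing in for full weight tensors:

```python
def merge_linear(checkpoints, weights):
    """Linearly interpolate parameters from several same-shaped checkpoints.

    checkpoints: list of flat parameter lists (all the same length)
    weights: mixing coefficients, assumed here to sum to 1
    """
    merged = []
    for params in zip(*checkpoints):
        merged.append(sum(w * p for w, p in zip(weights, params)))
    return merged

# Two fine-tuned variants of the same base; the 50/50 merge is a new
# model whose weights came from neither single training run.
model_a = [0.2, -1.0, 0.8]
model_b = [0.4,  1.0, 0.0]
result = merge_linear([model_a, model_b], [0.5, 0.5])
print([round(x, 6) for x in result])  # [0.3, 0.0, 0.4]
```

Every parameter of the merged model is a blend, so no hash, signature, or diff against either parent reproduces it, and a backdoor contributed by one parent may survive the merge in attenuated but functional form.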
Key Takeaway: The Training Pipeline as an Attack Surface
This closing section frames the training pipeline as a multi-stage attack surface where each phase offers different insertion points and different persistence characteristics. Pre-training attacks are the most durable and the hardest to detect. Fine-tuning attacks require less access but are more targeted. Alignment can be bypassed or reversed. The key insight for the security analysis in Part II is that a model’s training history is not transparent from its weights alone — and for models produced by organizations outside your trust boundary, you have limited ability to verify that history.
Summary
[Chapter summary to be written after full content is drafted.]