Knowledge Distillation
How smaller models learn from larger ones -- and what transfers (and doesn't) during compression.
Knowledge distillation is the process of training a smaller “student” model to replicate the behavior of a larger “teacher” model. The technique was introduced by Hinton, Vinyals, and Dean (2015) and has become a cornerstone of practical ML deployment: it is how organizations get large-model quality into small-model compute budgets. In the context of this book, distillation matters for a more pointed reason — it is the mechanism behind some of the most capable Chinese-origin models in widespread use. DeepSeek-R1’s distilled variants, for example, transfer the reasoning patterns of a massive model into 7B and 14B parameter versions that run on commodity hardware. The question is: what else transfers?
The answer is nuanced, and getting it right is essential for the risk assessment in Part II. Distillation transfers behavioral patterns — the statistical relationships between inputs and outputs that the teacher model has learned. This means knowledge, capabilities, and biases all transfer readily. An ideological bias or a censorship pattern present in the teacher will likely appear in the student, because these are exactly the kinds of input-output relationships that distillation is designed to preserve. However, activation-triggered backdoors — specific weight configurations that produce malicious behavior only when a precise trigger is present in the input — generally do not survive distillation intact, because the student learns to approximate the teacher’s output distribution, not to replicate its internal weight structure.
This distinction between “behavior transfers” and “weight-level artifacts don’t” is the central insight of this chapter. It means distillation is simultaneously a risk vector (it propagates biases and potentially censorship into derivative models) and a form of partial risk laundering (it strips away technical backdoors that depend on specific weight configurations). Neither characterization is complete on its own.
Scenario C in Chapter 104 is built entirely around distillation risk. The comparative risk assessment in Chapter 105 treats distilled models as a distinct category with different threat properties than their source models. This chapter provides the technical foundation for those analyses.
What Distillation Is: Teacher-Student Framework
Distillation works by training a student model not on hard labels (correct answer / wrong answer) but on the teacher’s full output distribution — its “soft labels.” This section explains the teacher-student framework, the role of temperature scaling in softening the output distribution to reveal more information about the teacher’s uncertainty, and the loss functions used to minimize the divergence between student and teacher outputs. The key insight is that soft labels carry far more information than hard labels, which is why distillation is more efficient than training from scratch.
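The mechanics described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular framework's implementation: the temperature T softens both distributions, the soft term is the KL divergence from teacher to student (scaled by T² so its gradient magnitude stays comparable across temperatures, as Hinton et al. suggest), and the hard term is ordinary cross-entropy against the ground-truth label. The function names, the α mixing weight, and the example logits are illustrative choices.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature T > 1 flattens the distribution, exposing the
    # teacher's relative confidence in the non-argmax classes.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    # Soft term: KL(teacher || student) at temperature T, scaled by T^2.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = float(np.sum(p_t * (np.log(p_t) - np.log(p_s)))) * T**2
    # Hard term: cross-entropy against the true label at T = 1.
    hard = float(-np.log(softmax(student_logits)[hard_label]))
    return alpha * soft + (1 - alpha) * hard

# Illustrative teacher logits for a 3-class problem.
teacher = [5.0, 2.0, 0.1]
print(softmax(teacher, T=1.0))  # sharply peaked on class 0
print(softmax(teacher, T=4.0))  # softer: runner-up classes become visible
```

Note how the T=4 distribution assigns non-trivial mass to the runner-up classes: those relative magnitudes are exactly the extra information (the teacher's learned similarity structure) that hard labels discard.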
Why Distillation Matters: Compression and Deployment
Distillation exists because large models are expensive to run. This section covers the practical motivations: reducing inference cost, enabling deployment on edge devices or commodity GPUs, and maintaining competitive performance at smaller scale. It explains why distillation has become central to the open-weight ecosystem — producing the 7B-14B class models that practitioners actually deploy — and why this makes the security properties of distillation a practical concern rather than an academic one. If your organization runs a distilled model, understanding what survived the distillation process is directly relevant to your risk posture.
What Transfers: Knowledge, Capabilities, and Biases
This section examines what properties of a teacher model reliably appear in distilled students. Knowledge (factual associations, reasoning patterns, linguistic capabilities) transfers by design — it is the point of distillation. But biases also transfer, because from the distillation process’s perspective, a bias is just another statistical pattern in the teacher’s output distribution. This includes ideological biases, content filtering behaviors, and systematic errors. If the teacher refuses to discuss certain topics or frames certain subjects in a particular way, the student will likely exhibit the same patterns (PLAUSIBLE). Chapter 105 evaluates this in the specific context of Chinese-origin teacher models.
What Doesn’t Transfer: Backdoors and Weight-Level Artifacts
This is the section that matters most for Part II’s analysis. Activation-triggered backdoors rely on specific weight configurations — precise numerical values in specific matrix positions that activate only when a particular trigger pattern is present in the input. Distillation does not copy weights; it copies input-output behavior. The student network has a different architecture, different weight initialization, and different internal representations, even if its output behavior approximates the teacher’s. This means weight-level backdoors face a significant barrier to transfer (PLAUSIBLE). However, this section also covers the limits of this protection: behavioral backdoors that are triggered by common input patterns and produce outputs within the teacher’s normal distribution may be more likely to survive, because they are harder to distinguish from legitimate learned behavior.
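A toy numerical experiment makes the argument concrete. The sketch below (all names and values are hypothetical) gives a "teacher" a clean linear behavior plus a backdoor that fires only on one exact trigger input, then "distills" a linear student by fitting it to the teacher's outputs on ordinary sampled inputs. Because the trigger essentially never appears in the distillation data, the student recovers the clean behavior and the backdoor does not transfer — a deliberately simplified analogue of the weight-level-artifact argument, not a proof about real networks.

```python
import numpy as np

rng = np.random.default_rng(0)

CLEAN_W = np.array([1.0, -2.0, 0.5])   # teacher's clean linear behavior
TRIGGER = np.array([9.0, 9.0, 9.0])    # hypothetical rare trigger input

def teacher(x):
    clean = x @ CLEAN_W
    # Backdoor: malicious offset only on the exact trigger pattern.
    if np.allclose(x, TRIGGER):
        return clean + 100.0
    return clean

# Distillation: fit the student to the teacher's outputs on ordinary
# inputs drawn from the data distribution (the trigger is never sampled).
X = rng.normal(size=(500, 3))
y = np.array([teacher(x) for x in X])
w_student, *_ = np.linalg.lstsq(X, y, rcond=None)

print(TRIGGER @ w_student)   # student on trigger: ~ clean prediction (-4.5)
print(teacher(TRIGGER))      # teacher on trigger: backdoored output (95.5)
```

The same logic explains the caveat in the text: if the trigger were a *common* input pattern, it would appear in the distillation data, the student would fit it like any other behavior, and the "backdoor" would survive.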
Distillation Variants
Not all distillation is the same. This section covers the major variants: standard knowledge distillation (single teacher, single student), self-distillation (where a model distills into a smaller version of itself), progressive distillation (multiple stages of size reduction), and task-specific distillation (distilling only the teacher’s behavior on a particular task). Each variant has different implications for what transfers and what is lost. Progressive distillation, for example, may further attenuate artifacts at each stage, while task-specific distillation may inadvertently preserve only the subset of biases relevant to the task.
Real-World Distillation Pipelines
This section grounds the theory in practice: how organizations actually perform distillation, what decisions they make about training data and evaluation, and what quality checks are typical. It covers the gap between the academic description of distillation (clean, controlled, well-understood) and the reality (often ad hoc, under-documented, driven by compute budgets rather than thoroughness). This gap matters because security analysis that assumes ideal distillation conditions may overestimate the reliability of distillation as a mitigation.
Key Takeaway: Distillation as Partial Risk Laundering
This closing section synthesizes the chapter into the framing that Part II relies on: distillation is neither a clean pass-through of risk nor a reliable sanitization step. It is a partial filter whose properties depend on the specific types of risk being evaluated. For biases and behavioral patterns, distillation is largely transparent — risk passes through. For weight-level backdoors, distillation is a meaningful barrier — but not a guarantee. This “partial risk laundering” framing is used directly in the comparative risk assessment (Chapter 105) and the distillation scenario walkthrough (Chapter 104, Scenario C).
Summary
[Chapter summary to be written after full content is drafted.]