Front Matter

Untrusted Weights

What Open Weights Don't Tell You

Author Christopher Whitworth Date March 2026 Status Draft -- for peer review Chapters 20 Citations 71+

Purpose

This research compendium provides a comprehensive, evidence-based technical assessment of the security risks associated with deploying open-weight large language models in sensitive environments, using Chinese-origin models (Alibaba Qwen, DeepSeek, and similar) as the primary case study.

"Can Chinese-origin open-weight models ever be made safe for sensitive use, or is the risk irreducible?"

The majority of threat vectors analyzed -- backdoor persistence, tool-use exploitation, data exfiltration, supply chain compromise -- apply to any untrusted open-weight model regardless of origin. China-specific factors are quantified as a marginal risk increment rather than treated as the entire problem.

Key Findings

Can backdoors be embedded in model weights? Yes -- PROVEN Sleeper Agents, BadNets, and related work

Do backdoors survive safety training? Yes -- PROVEN RLHF, SFT, adversarial training all fail

Can you detect backdoors in LLMs? No -- PROVEN No tool scales to billion-parameter models

Could a state actor insert undetectable backdoors? Feasible -- PLAUSIBLE All technical requirements available

Does air-gapping solve the problem? No -- PROVEN Output encoding and analytical bias remain

Does fine-tuning remove backdoors? No -- PROVEN Sleeper Agents paper

Is Chinese model risk different? Yes, but 70-80% is origin-agnostic 1.3-2.0x marginal risk multiplier

Can Chinese models be used for defense? Context-dependent Not for analytical/tool-enabled use

Evidence Classification

Throughout this research, findings are classified by evidence level:

PROVEN

Demonstrated in peer-reviewed research or documented in practice

PLAUSIBLE

Theoretically sound with supporting evidence, not yet demonstrated at scale

SPECULATIVE

Possible but no concrete evidence or identified mechanism

Research Log

Tracking the evolution of this research over time.

Week 1 -- Mar 2-8, 2026

Initial Research Sprint

Built complete compendium: threat analysis across 4 vectors, mitigation assessment (6 methods rated), 3 attack scenario walkthroughs, comparative risk assessment, policy landscape, tiered recommendations, at-risk model catalog (30+ families), attack lab scaffolded, 71+ citations compiled.

4 threat vectors 6 mitigations rated 71+ citations 30+ model families cataloged

Week 2 -- Mar 9-15, 2026

In Progress

Attack lab experiments, steganographic encoding validation.

Week 3 -- Mar 16-22, 2026

In Progress

Continued research and validation.

License

Attribution-NonCommercial License with Permitted Use Carve-Outs

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of this work and associated files (the "Work"), to use, copy, modify, merge, publish, and distribute the Work subject to the following conditions:

1. Attribution Required

All copies, substantial portions, or derivative works must clearly credit "Christopher Whitworth" as the original author. Attribution must be reasonable to the medium: a citation in academic work, a credit line in documentation, a comment in source code, or equivalent.

2. Permitted Uses

The following uses are allowed without additional permission, even if they involve monetary compensation:

Educational content -- courses, lectures, textbooks, training materials, and instructional media (including monetized video content such as YouTube) where the primary purpose is teaching or raising awareness about AI security topics.
Academic and research use -- citation, quotation, analysis, or incorporation into academic papers, theses, dissertations, and research publications, whether open-access or behind a paywall.
Nonprofit and public interest -- use by charitable organizations, nonprofits, NGOs, and public interest groups for mission-aligned work.
Government internal use -- use by government agencies and their contractors for internal analysis, policy development, or operational decision-making, provided the Work itself is not resold as a standalone product.
Reference and supporting material -- citation or incorporation as supporting evidence in proposals, reports, assessments, or contract deliverables, provided the Work is not the primary deliverable being sold.
Journalism and commentary -- reporting, critique, review, and public discussion, including in monetized publications and media.

3. Restricted Uses

The following uses require express written permission of the author:

Resale -- selling or sublicensing the Work itself, in whole or substantial part, as a standalone product or primary deliverable.
Repackaging -- incorporating a substantial portion of the Work into a commercial product, service, or platform where the Work constitutes a core component of the value being sold.
Commercial training data -- using the Work as training data for commercial AI/ML models or products.

THE WORK IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.