Part I · Chapter 010

Deployment Models and Inference Infrastructure

Open vs closed vs API, Ollama/vLLM, and how deployment choices determine your threat surface.

Status Draft -- for peer review Updated 2026-03-21

There is a gap between the models that security researchers study and the models that practitioners actually run. The research literature focuses on pre-trained base models at full precision, but production systems run quantized derivatives served through inference frameworks like Ollama, vLLM, or llama.cpp, often behind API abstractions that hide the model identity from the end user entirely. This chapter covers the deployment layer — the infrastructure that turns model weights into a running service — because the deployment model determines which threat vectors from Part II actually apply to a given organization.

An organization that accesses GPT-4 through OpenAI’s API faces a fundamentally different risk profile than one that downloads Qwen2.5-72B from Hugging Face and runs it locally. The API user has no access to model weights (and therefore no model-compromise risk) but sends all their data to a third party (creating data-exposure risk). The local deployment user controls their data but takes full responsibility for the integrity of the model they downloaded — including its weights, tokenizer, configuration, and any custom code that came with it. Neither choice is inherently safer; they trade one category of risk for another.

This chapter also covers a category of AI model that is frequently overlooked in security discussions: embedding models. These are not generative — they do not produce text — but they are essential infrastructure in RAG pipelines (Chapter 008), search systems, and data processing workflows. The most widely deployed embedding models in the open-source ecosystem are Chinese-origin: BGE (from BAAI), GTE (from Alibaba), and similar models. Organizations that would never consciously deploy a Chinese-origin generative LLM may already be running Chinese-origin embedding models as invisible dependencies. Chapter 109’s at-risk model catalog covers these explicitly.

The deployment model is where abstract risk becomes concrete. This chapter provides the context for the comparative risk assessment in Chapter 105, the deployment-specific mitigations in Chapter 103, and the tiered recommendations in Chapter 107.

Open-Weight vs. Closed-Source vs. API-Based Deployment

The LLM deployment spectrum ranges from full local control (open-weight models running on your hardware) to full delegation (closed-source models accessed through an API). This section defines the three primary deployment models and their trust assumptions. Open-weight deployment gives you control over the model and your data, but requires you to trust the model’s provenance. API-based deployment delegates model integrity to the provider but requires you to trust them with your data. Hybrid approaches (self-hosted API services, on-premises cloud deployments) exist between these poles. Each deployment model enables and precludes different threat vectors, and Chapter 105 scores risk across all three.

Why “Open-Weight” Is Not “Open-Source” and Not “Safe”

The term “open-weight” has become conflated with transparency and safety in public discourse, and this section corrects that conflation. “Open-weight” means the model weights are publicly available — nothing more. It does not mean the training data is available, the training code is published, the training process is reproducible, or the model has been independently audited. Most critically, it does not mean the model is safe. Having access to weights does not provide practical transparency because the weight space is too large to inspect meaningfully, and the relationship between weight values and model behavior is not interpretable by current tools. This section establishes the precise meaning of “open-weight” to prevent false confidence in subsequent analysis.

Local Inference Frameworks

This section covers the tools practitioners actually use to run open-weight models: Ollama (which provides a simplified Docker-like workflow for downloading, managing, and serving models), llama.cpp (the C++ inference engine that powers most CPU and mixed CPU/GPU local inference, with its GGUF format), vLLM (a high-throughput Python-based serving framework optimized for GPU deployment), and text-generation-inference (Hugging Face’s serving solution). Each framework has different security properties: different model loading paths, different configuration surfaces, and different levels of isolation between the model and the host system. Understanding these frameworks is necessary for evaluating the practical deployment scenarios analyzed in Part II.

How Inference Works in Practice

This section traces the practical mechanics of running a model: loading weights into memory (or across multiple GPUs), processing the system prompt and user input through the model’s forward pass, managing the KV cache for efficient generation, and streaming output tokens back to the client. It covers context windows (the maximum input length, typically 4K to 128K tokens), token budgets (how long the model can generate), and the memory requirements that constrain what models can run on what hardware. These practical constraints shape what organizations actually deploy, which in turn determines their exposure to the threat vectors analyzed in Part II.

System Prompts and Runtime Configuration

System prompts are the primary mechanism for controlling model behavior at deployment time, and this section explains where the trust boundaries live. A system prompt is text prepended to every conversation that instructs the model on its role, constraints, and behavior — but as Chapter 009 establishes, this authority is behavioral rather than architectural. This section covers how system prompts are configured in different frameworks, what runtime parameters affect model behavior (temperature, top-p, repetition penalties), and where configuration is stored and managed. The security concern is that system prompts and runtime configuration are deployment-time controls that can be overridden by a compromised model or bypassed through prompt injection.

Embedding Models vs. Generative Models

This section introduces a critical distinction that the broader security conversation often overlooks. Generative models (GPT, LLaMA, Qwen, DeepSeek) produce text and are the focus of most threat analysis. Embedding models (BGE, E5, GTE, sentence-transformers variants) convert text into fixed-dimensional vector representations and are used for semantic search, RAG retrieval, classification, and clustering. Embedding models do not generate text, but they influence what content enters the generative model’s context through RAG pipelines. A compromised embedding model could manipulate what documents are retrieved — and therefore what the generative model sees and acts on — without generating any visible output itself.

Embedding Models in Practice: BGE, E5, and GTE

This section covers the specific embedding models most widely deployed in the open-source ecosystem. BGE (BAAI General Embedding), developed by the Beijing Academy of Artificial Intelligence, is among the most popular embedding models on Hugging Face. GTE (General Text Embedding) is developed by Alibaba. E5 is developed by Microsoft. These models are frequently adopted as dependencies in RAG frameworks, vector database configurations, and search infrastructure — often without the deploying organization being aware of the model’s origin or evaluating its provenance. Chapter 109 catalogs these models and their deployment prevalence, and Chapter 105 includes embedding models in its comparative risk assessment.

The Deployment Security Surface

This section maps the security surface specific to the deployment layer: model loading (where serialization vulnerabilities from Chapter 006 become relevant), runtime configuration (where misconfiguration can weaken security controls), API exposure (where the model’s capabilities become accessible to users and potentially to adversaries), and monitoring (where the ability to detect anomalous model behavior depends on what instrumentation the deployment framework provides). Each element of the deployment surface interacts with the threat vectors analyzed in Part II, and this section establishes those connections explicitly.

Trade-Offs: Model-Compromise Risk vs. Data-Exposure Risk

This section frames the central deployment decision as a risk trade-off. Open-weight local deployment accepts model-compromise risk (you might be running a backdoored model) but eliminates data-exposure risk (your data never leaves your infrastructure). Closed-source API deployment eliminates model-compromise risk (you trust the provider to serve a clean model) but accepts data-exposure risk (all your data transits to and is processed by a third party). This trade-off is not abstract — it is the decision framework that Chapter 105 evaluates quantitatively and that Chapter 107 translates into tiered recommendations based on an organization’s risk tolerance and sensitivity requirements.

Key Takeaway: Deployment Model Choice Determines Which Threat Vectors Apply

The closing section makes the chapter’s central argument: the deployment model is the most consequential decision an organization makes about its LLM risk posture. It determines whether model-integrity threats (backdoors, supply chain compromise, weight-level attacks) or data-confidentiality threats (data exposure to providers, training data extraction) dominate the risk profile. It determines whether embedding model provenance matters. It determines whether the mitigations from Chapter 103 are applicable. Understanding the deployment landscape is prerequisite for the risk-informed decision-making that the rest of the book supports.

Summary

[Chapter summary to be written after full content is drafted.]