Reference architecture

The sovereign LLM/RAG pattern, open for you to read.

A purpose-built reference architecture for document-intensive AI workloads on EKS. Every component runs inside the customer's VPC and is defined in Terraform, with published SLOs and cost-per-token figures. It's written so a staff engineer can evaluate it in five minutes and decide whether the design holds up.

At a glance

The whole stack, inside your VPC.

Layered architecture of the sovereign LLM/RAG stack: an interface layer (OpenAI-compatible API, FastAPI gateway, Open WebUI), a guardrails and evaluation layer (PII redaction, grounding and citation checks, eval harness), an applications layer (claims and invoice intake, document Q&A), a model-serving layer (vLLM serving an open-weight LLM on GPU, embeddings and reranker on CPU), a data and retrieval layer (S3, Postgres with pgvector, in-VPC parsing), and a platform layer (EKS, Karpenter, Terraform, Helm). Security and observability span every layer. The only egress is a one-time, build-time pull of model weights into in-account S3.

Six layers, every one inside the customer's VPC. Component choices change per workload and jurisdiction. The constraint that holds across every variant is no third-party data egress: model weights are pulled once, at build time, into in-account S3, never on the serving path.
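One way to make the no-egress constraint checkable is to validate, at deploy time, that the serving configuration points only at in-account S3 for model weights. The sketch below is illustrative: the bucket name and the function are hypothetical and do not come from the repo.

```python
from urllib.parse import urlparse

# Hypothetical in-account bucket name; substitute the bucket your Terraform creates.
ALLOWED_WEIGHT_BUCKETS = frozenset({"acme-llm-weights"})


def is_in_account_weight_source(uri: str, allowed_buckets=ALLOWED_WEIGHT_BUCKETS) -> bool:
    """Return True only for s3:// URIs whose bucket is on the allow-list.

    Anything else (https:// model hubs, unknown buckets) is rejected, so a
    serving config that would fetch weights from outside the account fails fast.
    """
    parsed = urlparse(uri)
    return parsed.scheme == "s3" and parsed.netloc in allowed_buckets


if __name__ == "__main__":
    for candidate in (
        "s3://acme-llm-weights/models/example/",
        "https://huggingface.co/some/model",
    ):
        print(candidate, "->", is_in_account_weight_source(candidate))
```

A check like this only covers configuration. It would sit alongside network-level controls such as VPC endpoints and egress rules, not replace them.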

Performance & cost

Reproducible benchmarks.

We measure latency, throughput, cost per million tokens, extraction quality, and retrieval quality across the open-weight model set on synthetic regulated-document workloads. Each model runs on our own stack and is compared with the same weights on Bedrock and with frontier models. The result that matters: the same weights score the same regardless of who serves them, so the choice of who runs a model comes down to cost and control. The methodology, hardware, and harness are open in the repo.
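For self-hosted serving, cost per million tokens follows from two measured quantities: what the GPU capacity costs per hour and how many tokens it sustains per second. A minimal sketch of that arithmetic, assuming a single flat hourly rate; the function name and the example inputs are illustrative and are not benchmark results.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly compute cost and sustained throughput into $ per 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Illustrative inputs only: a $4.00/hour node sustaining 1,000 tokens/second.
    print(f"${cost_per_million_tokens(4.0, 1000.0):.2f} per million tokens")
```

Throughput depends heavily on batch size, context length, and quantization, so the figure is only meaningful alongside the workload it was measured on.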

Enterprise OSS LLM Index · Q2 2026
Retrieval quality (one fixed judge) · $ per million output tokens
Kimi-K2.5: 38% · $2.18
GLM-4.7: 37% · $1.27
DeepSeek-V3.2: 35% · $3.56
Qwen3-VL-235B: 29% · $1.30
Llama-4-Scout: 25% · $1.01
Extraction F1 on degraded scans (vision models): Qwen3-VL-235B 95.0%, Llama-4-Scout 84.6%. The same weights served on Bedrock score within CI.
# reproducible · synthetic data · no customer data
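The page does not say how its confidence interval is computed. One common way to test whether two serving paths "score within CI" is a paired percentile bootstrap on per-document scores, sketched below. The function names, resample count, and 95% level are assumptions for illustration, not the harness's actual method.

```python
import random


def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), paired by document index."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        # Resample document indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def within_ci(a, b, **kwargs):
    """True when the CI of the paired difference contains zero."""
    lo, hi = bootstrap_diff_ci(a, b, **kwargs)
    return lo <= 0.0 <= hi
```

Here `a` and `b` would be per-document scores for the same weights on two serving paths, for example self-hosted and Bedrock, evaluated on the same documents.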

What's in the writeup

The reasoning behind the diagram.

STACK

Stack & rationale

Model selection, vLLM serving, quantization, compute sizing, networking, observability, and the security posture, with the reasoning behind each choice.

DEPLOY

Deployment & IaC

The Terraform modules, the EKS topology, and how it deploys into a customer account the same way every time.

COST

Cost & break-even

Self-host vs. API at realistic volumes, the break-even point, and the lessons from running it in production.
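The break-even point can be framed as the monthly token volume at which the fixed cost of self-hosting is paid back by the per-token saving over an API. A minimal sketch under that simplified model; the function name and all inputs are hypothetical, and the writeup's own cost model may differ.

```python
from typing import Optional


def break_even_million_tokens(
    fixed_monthly_usd: float,
    api_usd_per_million: float,
    self_host_usd_per_million: float,
) -> Optional[float]:
    """Monthly volume (in millions of tokens) where self-hosting matches API cost.

    Returns None when self-hosting has no per-token saving, in which case
    the fixed cost is never recovered.
    """
    saving_per_million = api_usd_per_million - self_host_usd_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly_usd / saving_per_million


if __name__ == "__main__":
    # Illustrative inputs only: $10,000/month fixed, $3.00 vs $1.00 per million tokens.
    print(break_even_million_tokens(10_000, 3.0, 1.0), "million tokens per month")
```

Below the returned volume the API is cheaper under this model; above it, self-hosting is.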

Read the full writeup →

Read it, then pressure-test it with us.

The repo is open. The 30-minute fit check is where you find out whether it fits your constraints.