Reference architecture
A purpose-built reference architecture for document-intensive AI workloads on EKS. Every component runs inside the customer's VPC and is defined in Terraform, and the design ships with published SLOs and cost-per-token figures. It is written so that a staff engineer can evaluate it in five minutes and decide whether the design holds up.
At a glance
Six layers, every one inside the customer's VPC. Component choices change per workload and jurisdiction. The constraint that holds across every variant is no third-party data egress: model weights are pulled once, at build time, into in-account S3 and are never fetched on the serving path.
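One way to make the no-egress constraint checkable is to validate, before deploy, that every model weight location in a serving config points at an in-account S3 bucket. The following is a minimal sketch under stated assumptions, not the repo's implementation: the bucket name, the function names, and the idea of a flat list of weight URIs are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: buckets that live in the customer's own AWS account.
IN_ACCOUNT_BUCKETS = frozenset({"example-model-weights"})


def is_in_account_weight_uri(
    uri: str, allowed_buckets: frozenset[str] = IN_ACCOUNT_BUCKETS
) -> bool:
    """Return True only for s3:// URIs whose bucket is on the allow-list."""
    parsed = urlparse(uri)
    return parsed.scheme == "s3" and parsed.netloc in allowed_buckets


def find_egress_violations(weight_uris: list[str]) -> list[str]:
    """Return the URIs that would pull weights from outside the account."""
    return [uri for uri in weight_uris if not is_in_account_weight_uri(uri)]


if __name__ == "__main__":
    uris = [
        "s3://example-model-weights/model-a/weights.safetensors",
        "https://weights.example.com/model-b/weights.safetensors",
    ]
    print(find_egress_violations(uris))
```

A check like this would typically run in CI against the rendered serving config, so a third-party weight URL fails the build rather than surfacing at serving time.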
Performance & cost
We measure latency, throughput, cost per million tokens, extraction quality, and retrieval quality across the open-weight model set on synthetic regulated-document workloads. Each model runs on our own stack, alongside the same weights served on Bedrock and alongside frontier models. The result that matters: the same weights score the same regardless of who serves them, so the choice of who runs a model comes down to cost and control. The methodology, hardware, and harness are open in the repo.
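For readers who want to sanity-check a cost-per-million-tokens figure, the arithmetic for a self-hosted model is the instance's hourly price divided by the tokens it sustains per hour. This is a minimal sketch, assuming a single instance at steady throughput; the function name and the example numbers are illustrative, not measurements from the benchmark.

```python
def cost_per_million_tokens(instance_usd_per_hour: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return instance_usd_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Illustrative inputs only: a $3.60/hour instance sustaining 1,000 tokens/s.
    print(round(cost_per_million_tokens(3.60, 1000.0), 4))
```

The sketch ignores idle time, so a real figure rises as utilization drops below the sustained throughput assumed here.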
What's in the writeup
Model selection, vLLM serving, quantization, compute sizing, networking, observability, and the security posture, with the reasoning behind each choice.
The Terraform modules, the EKS topology, and how it deploys into a customer account the same way every time.
Self-host vs. API at realistic volumes, the break-even point, and the lessons from running it in production.
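The self-host vs. API break-even can be reasoned about with a simple linear model: self-hosting carries a fixed monthly cost plus a marginal cost per token, while an API charges per token only. The sketch below is an assumption-laden illustration, not the writeup's model; the function name, parameters, and dollar figures are hypothetical.

```python
from typing import Optional


def break_even_tokens_per_month(
    self_host_fixed_usd_per_month: float,
    api_usd_per_million_tokens: float,
    self_host_marginal_usd_per_million_tokens: float = 0.0,
) -> Optional[float]:
    """Monthly token volume above which self-hosting is cheaper than the API.

    Solves fixed + marginal * V = api * V for V (in tokens).
    Returns None when the API is never more expensive per token.
    """
    saving_per_million = api_usd_per_million_tokens - self_host_marginal_usd_per_million_tokens
    if saving_per_million <= 0:
        return None
    return self_host_fixed_usd_per_month / saving_per_million * 1_000_000


if __name__ == "__main__":
    # Illustrative inputs only: $6,000/month fixed cost, $3.00 per million API tokens.
    print(break_even_tokens_per_month(6000.0, 3.0))
```

Below the returned volume the API is cheaper on cost alone; control and data-residency requirements are separate from this arithmetic.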
The repo is open. The 30-minute fit check is where you find out whether it fits your constraint.