Abstract

There are two reasons to run a model on infrastructure you own instead of calling an API: your data isn't allowed to leave, or at high enough volume it costs less. Enterprises weighing that choice face three options: self-host open weights, call the same weights on a managed provider (Amazon Bedrock), or pay for a frontier model. We built one workload and one serving and scoring harness, then held everything constant except the substrate, so each comparison isolates who runs the model and on what. The workload is the document pipeline this architecture targets: structured field extraction from invoices and claims (520 documents, deterministic per-field F1 against ground truth) and retrieval-augmented question answering over a 2,686-document financial-regulation corpus (250 questions, scored by one independent fixed judge). We evaluated six open-weight families (Qwen3-VL-235B, Llama-4-Scout, GLM-4.7, Kimi-K2.5, DeepSeek-V3.2, GLM-5/5.2) and two frontier models (Claude Opus 4.8, Sonnet 4.6) across self-hosted vLLM, Bedrock, and direct-API substrates. Three findings hold: (1) substrate-invariance (the same weights produce the same quality whether we serve them or Bedrock does, within confidence intervals, on every axis); (2) the frontier quality premium is confined to extraction on degraded document layouts and does not transfer to grounded RAG, where an open-model cluster ties the frontier; (3) self-hosting wins on cost-per-token at high utilization for most of the lineup, with reasoning models and the cheapest managed model (Llama-4-Scout) the exceptions. Everything was measured on synthetic data, inside one VPC, and is reproducible from the open repository.

1. Background and objective

This is the reference architecture we built for workloads where the data can't leave or the volume makes self-hosting cheaper, and the measurements that show when each reason holds. It runs inside the customer's AWS account, with no third-party data egress. The infrastructure, the applications, and the benchmark harness are open.

1.1 The architecture

The stack is six layers, all inside one VPC. From the bottom: a platform layer (EKS with Karpenter, which provisions GPUs on demand), a data layer (S3, a Postgres vector store, and document parsing that stays on the network), the model-serving layer (vLLM running the open-weight model on a GPU, with the embedding and reranking models on cheaper CPU), the applications, a guardrails-and-evaluation layer, and the interface. Security and observability run across all six.

Figure. Layered architecture of the sovereign LLM/RAG stack inside a single VPC: platform, data and retrieval, model serving, applications, guardrails and evaluation, and interface, with security and observability spanning every layer.

Model weights are downloaded once, at build time, into a storage bucket in the customer's account. After that, nothing on the path that serves a request leaves the VPC.

1.2 Objective and hypothesis

Buyers ask a concrete question: how much do I give up by self-hosting an open-weight model instead of just calling Bedrock or just using the frontier? Answering it requires holding the model and the workload fixed and varying only the substrate. We test one hypothesis explicitly — substrate-invariance: identical weights yield identical task quality regardless of who serves them — and, having established that the harness is calibrated, we quantify the quality, speed, and cost differences that actually distinguish the three options. Every figure below was measured on synthetic data — generated medical claims and commercial invoices, not real documents — so the benchmark can be published and reproduced. Where a result isn't final, the text says so.

2. Methods

2.1 Experimental design

One workload, one harness, three substrates. Every request — self-hosted, managed, or frontier — routes through the same gateway, which performs identical retrieval, prompting, vision routing, and scoring. The only manipulated variable is the substrate.

Layer | Substrate | What it isolates
A: Sovereign self-hosted | our vLLM on rented GPUs, in-VPC | open weights re-run on our stack
B: Managed | the same family/checkpoint on AWS Bedrock | our serving vs the provider's, same weights
C: Frontier | Claude Opus 4.8 / Sonnet 4.6, via Bedrock and direct Anthropic | the ceiling, inside AWS's walls and out

A vs B is the calibration check (same weights, two operators). B vs C is the open-vs-frontier gap. A leaderboard that pits our run of a model last quarter against Bedrock's run of a newer version this quarter isn't a fair test — the weights moved underneath it — so we refreshed our self-hosted lineup to the exact current-generation checkpoints Bedrock serves, and run the identical weights both ways.

2.2 Models under test

Family | Params | Self-host precision / hardware | Managed
Qwen3-VL-235B | 235B | FP8, 8×H200 (TP=8) | Bedrock
Llama-4-Scout (same-weights pair) | 109B MoE | BF16, 8×H100 (TP=8) | Bedrock
GLM-4.7 | - | FP8, 8×H100 (TP=8) | Bedrock
Kimi-K2.5 | - | INT4 (native), 8×H200 (TP=8) | Bedrock
DeepSeek-V3.2 | - | FP8 (official), 8×H200 (TP=8) | Bedrock
GLM-5 / GLM-5.2 (family pair) | 745 / 753B | FP8, 8×H200 (TP=8) | Bedrock (GLM-5)
Claude Opus 4.8 (frontier) | - | - | Bedrock + direct
Claude Sonnet 4.6 (frontier) | - | - | Bedrock + direct

All self-hosted open models were served single-box on one 8-GPU node under a pinned stable vLLM release (v0.23.0). Llama-4-Scout is the strict same-weights pair (the identical published checkpoint runs both ways). GLM-5/5.2 is a family pair, not same-weights (Bedrock serves GLM-5; we self-host the half-version-newer GLM-5.2). A small dev-tier vision model, Qwen2.5-VL 7B on a single L40S (with a 72B reference), anchors the low end of the cost and routing analysis below.

2.3 Task 1 — Structured document extraction

Goal. Read a claim or an invoice and pull out the structured fields — patient, provider, amounts, identifiers — as JSON, then score each field against the known-correct value.

Data. Two synthetic corpora with machine-emitted gold labels, each rendered at increasing real-world difficulty:

  • Medical claims (Synthea, MITRE): 400 documents → CMS-1500 / EOB / medical-invoice layouts. Eight scored fields: patient_name, payer_name, provider_name, provider_npi, service_date, total_billed, balance_due, num_line_items. Three tiers (134 / 133 / 133): clean-digital (PDF with a text layer), scanned-clean (image, no text layer), scanned-degraded (image + scan wear).
  • Commercial invoices (FATURA): 120 documents, real invoice layouts with synthetic content, 50 templates. Five scored fields: invoice_number, invoice_date, due_date, total, buyer_name. Two tiers (60 / 60): scanned-clean, scanned-degraded.

The five scored fields for the commercial invoice pictured below — its gold record:

{"invoice_number": "2970-559", "invoice_date": "23-Jan-2002",
 "due_date": "06-Dec-2018", "total": "828.69", "buyer_name": "Alexander Williams"}

Difficulty tiers: what keeps the benchmark representative. Each corpus is rendered at up to three tiers (all three for medical claims, the two scanned tiers for commercial invoices): clean digital (the original file, perfect text), clean scan, and degraded scan (skewed, blurred, downscaled, and JPEG-compressed with sensor noise, the way a fax or a phone photo arrives), produced deterministically per seed. A degraded image has no text layer, so it forces the vision path (reading pixels) rather than a pristine digital parse. The hardest tier is the headline number, because it's the one that looks like a real intake queue. Text-only models (Kimi, DeepSeek, GLM-4.7, GLM-5.2) have no vision path and are evaluated only on the clean-digital tier; vision models (Qwen3-VL, Llama-4-Scout, Opus, Sonnet) run all image tiers. One caveat: the synthetic documents share structure within each family, so they vary less than real-world paperwork. Read the scores as a comparison between models, not a prediction for your own corpus.

Figure. The same synthetic FATURA commercial invoice as a clean scan and as a degraded scan. Clean scan: crisp text, with invoice number, dates, buyer, line items, and total all clearly legible. Degraded scan: the same invoice after the benchmark's level-2 scanned-degraded pipeline (skew, downscaling, blur, sensor noise, and JPEG compression). The degraded image has no text layer, so it forces the model's vision path: the hard tier that looks like a real intake queue.

Prompt (verbatim, medical track; temperature 0, JSON-object mode):

System: You extract structured data from a medical claim/invoice. Return ONLY a
JSON object with exactly these keys: patient_name, payer_name, provider_name,
provider_npi, service_date (YYYY-MM-DD), total_billed (number), balance_due
(number), num_line_items (integer). Use null for any field not present.
User:   INVOICE:\n<document text or image>

Scoring (deterministic, per field). A field counts as correct under type-appropriate matching: money within $0.01; counts as exact integers; names by alpha-token containment (so Dr. Rhett Smith · Cardiology matches gold Rhett Smith — the rendered "Name · Specialty" and Synthea's numeric suffixes are not extraction errors); identifiers and dates by normalized containment. F1 = correct fields / (documents × fields), reported per tier with a Wilson 95% confidence interval. The headline number is the hardest tier (scanned-degraded, abbreviated med-deg / com-deg).
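The matching rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's bench/quality/score.py: the function names, the direction of containment (gold inside prediction), and the handling of null fields are all hypothetical choices made for the example.

```python
import re

def alpha_tokens(value):
    """Lower-cased alphabetic tokens; digits and punctuation are dropped."""
    return set(re.findall(r"[a-z]+", str(value).lower()))

def normalize(value):
    """Lower-case and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", str(value).lower())

def match_money(pred, gold, tol=0.01):
    """Money: within $0.01 of gold (a tiny epsilon absorbs float rounding)."""
    try:
        return abs(float(pred) - float(gold)) <= tol + 1e-9
    except (TypeError, ValueError):
        return False

def match_count(pred, gold):
    """Counts: exact integer equality."""
    try:
        value = float(pred)
    except (TypeError, ValueError):
        return False
    return value.is_integer() and int(value) == int(gold)

def match_name(pred, gold):
    """Names: every alphabetic gold token appears in the prediction (assumed direction)."""
    gold_tokens = alpha_tokens(gold)
    return bool(gold_tokens) and gold_tokens <= alpha_tokens(pred)

def match_contains(pred, gold):
    """Identifiers and dates: normalized gold is a substring of the normalized prediction."""
    g = normalize(gold)
    return bool(g) and g in normalize(pred)

# Field -> matcher for the eight medical fields named in the prompt.
MEDICAL_MATCHERS = {
    "patient_name": match_name,
    "payer_name": match_name,
    "provider_name": match_name,
    "provider_npi": match_contains,
    "service_date": match_contains,
    "total_billed": match_money,
    "balance_due": match_money,
    "num_line_items": match_count,
}

def field_score(preds, golds, matchers):
    """correct fields / (documents x fields), the ratio the text reports as F1."""
    correct = 0
    for pred, gold in zip(preds, golds):
        for field, matcher in matchers.items():
            p, g = pred.get(field), gold.get(field)
            if g is None:
                correct += p is None   # assumed: null is right when the field is absent
            elif p is not None:
                correct += matcher(p, g)
    return correct / (len(golds) * len(matchers))
```

Under these rules a prediction of "Dr. Rhett Smith · Cardiology" matches gold "Rhett Smith", and a gold name carrying Synthea's numeric suffixes still matches its plain rendering, because digits never become tokens.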

2.4 Task 2 — Retrieval-augmented QA (sovereign RAG)

Goal. Answer a financial-regulation question grounded only in retrieved passages, with citations or an explicit refusal.

Corpus and pipeline (all in-VPC). 2,686 public US financial-regulation documents (CFPB consumer-finance reports and Federal Register rules — e.g. the Consumer Credit Card Market Report, CARD Act notices). Ingestion: chunk (1,400 chars, 200 overlap) → embed (TEI / BGE, 768-dim) → store in Postgres + pgvector. Query: embed → cosine top-5 → prompt → cited answer.
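As a rough sketch of the chunking and top-5 steps, the following uses plain Python lists in place of the embedding service and the vector store. The function names are hypothetical, and in the real pipeline the ranking is done by Postgres + pgvector rather than in application code.

```python
import math

def chunk_text(text, size=1400, overlap=200):
    """Fixed-size character windows; consecutive windows share `overlap` characters."""
    if not text:
        return []
    step = size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, rows, k=5):
    """rows: iterable of (chunk_id, vector); returns the k most similar chunk ids."""
    ranked = sorted(rows, key=lambda row: cosine(query_vec, row[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

With the stated settings each window starts 1,200 characters after the previous one, so a 3,000-character document yields three chunks (1,400, 1,400, and 600 characters).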

An example evaluation question (from a set of 250): "What act requires the CFPB to review the consumer credit card market?" — reference answer: "The Credit Card Accountability Responsibility and Disclosure Act of 2009 (CARD Act)…"

Prompt (verbatim; temperature 0, top-5 context):

System: You answer strictly from the provided CONTEXT about financial regulation.
Cite the sources you use with bracketed numbers like [1]. If the answer is not in
the context, say exactly: "I don't know based on the provided documents."
User:   CONTEXT:\n[1] (source: …)\n…\n\nQUESTION: <q>\n\nAnswer with citations:
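One way to assemble the user message and inspect the answer is sketched below. The template elides how each passage is laid out after its "(source: …)" header, so the one-passage-per-numbered-entry layout here is an assumption, as are the helper names.

```python
import re

# The exact refusal sentence required by the system prompt.
REFUSAL = "I don't know based on the provided documents."

def build_user_message(question, passages):
    """passages: list of (source, text) in retrieval order, numbered from [1]."""
    context = "\n".join(
        f"[{i}] (source: {source})\n{text}"
        for i, (source, text) in enumerate(passages, start=1)
    )
    return f"CONTEXT:\n{context}\n\nQUESTION: {question}\n\nAnswer with citations:"

def cited_indices(answer):
    """Sorted, de-duplicated bracketed citation numbers found in the answer."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})

def is_refusal(answer):
    """True when the answer contains the exact refusal sentence."""
    return REFUSAL in answer
```

Helpers like these are enough to compute simple citation and refusal rates over a set of collected answers.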

Scoring needs one fixed judge — and here is why. The tempting shortcut is to let each model grade its own answers, but that breaks the comparison: a stronger model grades harder and gives itself a lower score. We measured it — the same pipeline scored 38.8% when a 7B model graded it and 30.0% when a 72B model graded the exact same answers. So RAG answers are graded by an LLM-as-judge held fixed across all models: Qwen2.5-VL-72B-Instruct on pinned vLLM v0.23.0. The judge sees (question, reference, candidate) and returns a strict verdict:

System: You are a strict grader for a financial-regulation QA system. Compare the
CANDIDATE answer to the REFERENCE answer for the QUESTION. Return JSON
{"correct": true|false, "score": 0-100, "reason": "..."}. correct=true only if the
candidate is factually consistent with the reference and does not add unsupported
claims. Brevity is fine.

To avoid confounding the judge with the model under test, answers are collected first and judged in a separate pass by the one fixed judge. Accuracy = % correct, with a Wilson 95% confidence interval (n ≈ 250).
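The accuracy and interval computation can be sketched as follows. The Wilson formula is standard; the verdict-parsing helper and the choice to count malformed verdicts as wrong are assumptions, not necessarily what bench/quality/stats.py does.

```python
import json
import math

def count_correct(verdict_jsons):
    """Count verdicts whose "correct" field is true; malformed verdicts count as wrong."""
    correct = 0
    for raw in verdict_jsons:
        try:
            verdict = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue
        correct += isinstance(verdict, dict) and verdict.get("correct") is True
    return correct

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes in n trials (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 94 correct out of 250 is 37.6%; the interval is roughly (0.318, 0.437).
low, high = wilson_interval(94, 250)
```

At n = 250 the interval is about ±6 points wide around scores in the 30s, which is why small differences in rank should not be read as real.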

2.5 Speed and cost

  • Latency (managed / frontier): per-request end-to-end through the gateway, reported as p50/p95.
  • Throughput (self-hosted): peak aggregate tokens/s under a 64-concurrent load generator — a throughput you own, not a single-request latency. The two speed metrics are not directly comparable; we report each where it applies. (When you push a managed API, the ceiling usually shows up as added latency — the provider quietly queues you — rather than an error; you can make managed throughput fixed by buying dedicated capacity, e.g. Bedrock Provisioned Throughput, but that turns the bill back into a by-the-hour number that looks a lot like self-hosting. We measured the on-demand, pay-per-token path, because that's where a team comparing "just call the API" starts.)
  • Cost — managed: provider list price, $ per million output tokens (Anthropic first-party list; OSS-on-Bedrock from the AWS Price List API, us-west-2 on-demand).
  • Cost — self-hosted: measured spot $/hr ÷ peak aggregate tok/s = $ per million output tokens — a best case (peak utilization, point-in-time spot). The two cost bases differ on purpose: managed bundles the operator's margin; self-host is raw rental you operate.
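The self-hosted cost figure needs a unit conversion that the formula above leaves implicit: tokens per second becomes tokens per hour, and the result is scaled to a million tokens. A minimal sketch follows; the `utilization` parameter is an addition for illustration and is not part of the reported best-case numbers.

```python
def usd_per_million_tokens(usd_per_hour, tokens_per_second, utilization=1.0):
    """Rental cost per million output tokens at a given fraction of peak throughput."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * utilization * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

# 8xH100 at $14.22/hr and 3,927 tok/s peak (the Llama-4-Scout figures):
scout_best_case = usd_per_million_tokens(14.22, 3927)
# The same node at half of peak throughput costs twice as much per token.
scout_half_util = usd_per_million_tokens(14.22, 3927, utilization=0.5)
```

The best case rounds to $1.01 per million tokens. Because cost scales inversely with utilization, the reported self-host figures are a floor rather than an average.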

2.6 Controls and reproducibility

Temperature 0 everywhere; one retrieval pipeline; one fixed judge; identical prompts and concurrency profiles. Data is fully synthetic or public-domain, deterministic given pinned seeds (Synthea v4.0.0 / seed 1337; FATURA at a pinned revision; degradation seeded per item index), so a clean checkout regenerates byte-identical documents and gold. A completeness guard fails any run whose section completes under 80% of attempts, so a silent mass-drop cannot masquerade as a passing score.
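A completeness guard of the kind described can be very small. The sketch below is hypothetical (the names and the exception type are invented for illustration) and assumes the harness tracks attempted and completed items per section.

```python
class IncompleteRunError(RuntimeError):
    """Raised when a benchmark section completes too few of its attempts."""

def check_completeness(section, attempted, completed, min_ratio=0.80):
    """Return the completion ratio, or raise if it falls below min_ratio."""
    if attempted <= 0:
        raise IncompleteRunError(f"{section}: no attempts recorded")
    ratio = completed / attempted
    if ratio < min_ratio:
        raise IncompleteRunError(
            f"{section}: {completed}/{attempted} completed "
            f"({ratio:.1%}), below the {min_ratio:.0%} floor"
        )
    return ratio
```

Failing loudly here matters because accuracy computed only over completed items would otherwise look normal even if most requests had been dropped.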

3. Results

3.1 Per-model scorecard (the headline)

Each model on every axis, self-hosted and managed. Extraction shows the hardest applicable tier; RAG is fixed-judge accuracy (n ≈ 250); speed is latency p50 for managed and peak throughput for self-host; cost is $ per million output tokens.

Model | Substrate | Extraction¹ | RAG² | Speed³ | $/M-out⁴
Qwen3-VL-235B | Self-host (FP8, 8×H200) | 95.0 / 83.7 | 28.8 | 3,838 tok/s | $1.30
 | Bedrock | 94.7 / 83.7 | 30.0 | 2.9 s | $2.66
Llama-4-Scout⁵ | Self-host (BF16, 8×H100) | 84.6 / 72.3 | 25.2 | 3,927 tok/s | $1.01
 | Bedrock | 84.7 / 71.7 | 24.8 | 0.8 s | $0.66
GLM-4.7 | Self-host (FP8, 8×H100) | 95.8 (clean) | 36.8 | 3,120 tok/s | $1.27
 | Bedrock | - | 32.8 | 1.3 s | $2.20
Kimi-K2.5 | Self-host (INT4, 8×H200) | 97.2 (clean) | 37.6 | 2,295 tok/s | $2.18
 | Bedrock | - | 32.4 | 1.4 s | $3.00
DeepSeek-V3.2 | Self-host (FP8, 8×H200) | 95.2 (clean) | 34.8 | 1,403 tok/s | $3.56
 | Bedrock | - | 30.8 | 1.8 s | $1.85
GLM-5.2 / GLM-5⁶ | Self-host GLM-5.2 (FP8, 8×H200) | 97.0 (clean) | 33.2 | 1,357 tok/s | $3.69
 | Bedrock (GLM-5) | 96.1 (clean) | 30.8 | 2.4 s | $3.20
Opus 4.8 (frontier) | Bedrock | 96.4 / 93.7 | 32.4 | 4.7 s | $25.00
 | Anthropic direct | 96.3 / 94.0 | 32.8 | 2.8 s | $25.00
Sonnet 4.6 (frontier) | Bedrock | 96.1 / 88.7 | 24.6 | 3.5 s | $15.00
 | Anthropic direct | 96.0 / 89.3 | 25.2 | 3.6 s | $15.00

¹ Vision models: medical-degraded / commercial-degraded F1 (hardest scanned tiers). Text-only models: clean-digital F1 (their only tier; no vision path).
² RAG fixed-judge accuracy; read the cluster, not the rank (§4.2).
³ Managed = per-request latency p50; self-host = peak aggregate throughput at 64-concurrent. These are different metrics (§2.5).
⁴ $/M output; managed list price, self-host = spot $/hr ÷ peak tok/s (best case).
⁵ Strict same-weights pair.
⁶ Family pair, not same-weights.

3.2 Extraction quality, by tier

On the hardest tier — degraded scans — the frontier leads, the best open vision model is close on medical and ~10 points back on commercial, and the natively-multimodal Scout is weakest:

Model | Substrate | med-deg | com-deg
Opus 4.8 | Bedrock | 96.4 | 93.7
Opus 4.8 | Anthropic direct | 96.3 | 94.0
Sonnet 4.6 | Bedrock | 96.1 | 88.7
Qwen3-VL-235B | Self-host | 95.0 | 83.7
Qwen3-VL-235B | Bedrock | 94.7 | 83.7
Llama-4-Scout | Self-host | 84.6 | 72.3

Field-level F1 against gold labels. Text-only models on their clean-digital tier: Kimi-K2.5 97.2, GLM-5.2 97.0, GLM-4.7 95.8, DeepSeek-V3.2 95.2 (self-host); GLM-5 96.1 (Bedrock). NPI is the lone weak field for every model. As a dev-tier reference, the small Qwen2.5-VL 7B scores 93.7 med-deg / 70.6 com-deg and the 72B 96.5 / 74.5 — a bigger model narrows the gap on hard documents but doesn't close it; cleaning up the image first (de-skew, sharpen) often does more, for less.

3.3 A model that does vision reads clean text worse — so route by input

One result changed how we deploy. A vision-capable model reads clean digital text worse than its same-size, same-price text-only version: it treats the page as an image and picks up image errors on text it could have read directly. On clean invoices, the text-only Qwen2.5-7B scored 99% and its vision sibling 87.5%, at the same cost. So the gateway routes by input — digital files to the text model, scans and photos to the vision model — and pays for vision only when the input is an image.
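The routing rule reduces to one check: does the file carry a usable text layer? A minimal sketch follows, in which the route labels and the 20-character threshold are assumptions rather than the gateway's actual settings.

```python
from typing import Optional

TEXT_ROUTE = "text"      # hypothetical label for the text-only model
VISION_ROUTE = "vision"  # hypothetical label for the vision model

def choose_route(text_layer: Optional[str], min_chars: int = 20) -> str:
    """Send documents with a real text layer to the text model, everything else to vision."""
    if text_layer and len(text_layer.strip()) >= min_chars:
        return TEXT_ROUTE
    return VISION_ROUTE
```

The threshold guards against scans whose embedded text layer is empty or only whitespace; any such file falls through to the vision path.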

3.4 Retrieval quality (all models, n ≈ 250, fixed judge)

Model | Self-host | Bedrock / frontier
Kimi-K2.5 | 37.6 [31.8, 43.7] | 32.4
GLM-4.7 | 36.8 [31.1, 42.9] | 32.8
DeepSeek-V3.2 | 34.8 [29.2, 40.9] | 30.8
GLM-5.2 / GLM-5 | 33.2 [27.7, 39.3] | 30.8
Qwen3-VL-235B | 28.8 [23.5, 34.7] | 30.0
Opus 4.8 | - | 32.4 / 32.8
Llama-4-Scout | 25.2 [20.2, 30.9] | 24.8
Sonnet 4.6 | - | 24.6 / 25.2

On the Bedrock and frontier figures, the ~30–33% group (Kimi, GLM-4.7, DeepSeek, GLM-5, Opus, Qwen3-VL) is a statistical tie: confidence intervals span about ±6 points and overlap heavily. Self-hosted Kimi-K2.5 (37.6%) and GLM-4.7 (36.8%) come out nominally at the top; the rest are within the margin.

3.5 Speed and cost

Managed latency p50 (p95 where notable), seconds: Scout 0.8 · GLM-4.7 1.3 · Kimi 1.4 · DeepSeek 1.8 · GLM-5 2.4 (p95 24.7, thinking tail) · Qwen3-VL 2.9 · Sonnet 3.5 · Opus-direct 2.8 vs Opus-Bedrock 4.7 (p95 ~11.6). Self-hosted peak throughput, tok/s: Scout 3,927 · Qwen3-VL 3,838 · GLM-4.7 3,120 · Kimi 2,295 · DeepSeek 1,403 · GLM-5.2 1,357 (small dev-tier Qwen2.5-VL 7B: 2,375 on one L40S).

Cost. Opus output tokens run 8–38× the OSS-on-Bedrock models ($25/M vs $0.66–$3.20/M); Sonnet, at $15/M, runs about 5–23×. Self-hosting the same open weights at peak utilization beats Bedrock for most of the lineup (GLM-4.7 $1.27 vs $2.20, Qwen3-VL $1.30 vs $2.66, Kimi $2.18 vs $3.00) but loses for DeepSeek-V3.2 ($3.56 self vs $1.85 Bedrock) and Llama-4-Scout ($1.01 self vs $0.66 Bedrock, dirt-cheap managed). The small dev-tier model serves a million tokens for about $0.22; against a typical small-model API near $0.30/M, self-hosting it pays off above roughly 150 million tokens a day, and below that the API is cheaper. Spot basis: 8×H100 (p5.48xlarge) $14.22/hr; 8×H200 (p5e.48xlarge) ~$18/hr, shortage-elevated on the run date. At the typical ~$14 the H200 self-host rows fall about a fifth.
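The break-even volume for the dev-tier model follows from the figures in this paragraph. The sketch below assumes the node is rented around the clock and back-derives its hourly price from the $0.22/M best case and the 2,375 tok/s peak, since the hourly rate for the single L40S is not stated directly.

```python
def implied_hourly_usd(usd_per_million, tokens_per_second):
    """Hourly rental implied by a best-case $/M figure at peak throughput."""
    return usd_per_million * tokens_per_second * 3600 / 1_000_000

def break_even_tokens_per_day(hourly_usd, api_usd_per_million):
    """Daily volume at which a node rented 24 hours costs the same as the API."""
    return hourly_usd * 24 / api_usd_per_million * 1_000_000

hourly = implied_hourly_usd(0.22, 2375)               # about $1.88/hr
break_even = break_even_tokens_per_day(hourly, 0.30)  # about 150 million tokens/day
daily_capacity = 2375 * 86_400                        # tokens/day at sustained peak
```

The break-even lands at roughly 73% of what the node could produce in a day at sustained peak, which is why the result is so sensitive to utilization.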

3.6 Calibration check (A vs B)

The test of a fair harness: each self-hosted score should land on its managed twin. It does, on both axes.

  • Extraction: Qwen3-VL commercial-degraded identical to the decimal (83.7 / 83.7), medical-degraded within noise (95.0 vs 94.7). Llama-4-Scout (exact same weights) matches on both vision axes — med-deg 84.6 vs 84.7, com-deg 72.3 vs 71.7.
  • RAG: every self-hosted score sits inside its managed twin's confidence interval. Notably the self-hosted figure lands nominally higher on four of the five pairs (Kimi +5.2, DeepSeek +4.0, GLM-4.7 +4.0, Llama-4-Scout +0.4; Qwen3-VL is the exception at −1.2). The differences are within noise, but consistent enough to flag a possible real serving/precision edge for the sovereign stack.

Substrate-invariance holds. Same weights → same quality regardless of who serves them.
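The RAG half of the calibration check can be reproduced from the reported percentages. The sketch below assumes exactly n = 250 questions for each managed run (the text says n ≈ 250) and rebuilds each managed twin's Wilson 95% interval before testing whether the self-hosted score falls inside it.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes in n trials."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def inside_twin_interval(self_pct, managed_pct, n=250):
    """True if the self-hosted score lies within the managed score's interval."""
    low, high = wilson_interval(round(managed_pct / 100 * n), n)
    return low <= self_pct / 100 <= high

# (self-hosted %, Bedrock %) from the RAG table; GLM-5/5.2 is omitted as a family pair.
PAIRS = {
    "Kimi-K2.5": (37.6, 32.4),
    "DeepSeek-V3.2": (34.8, 30.8),
    "GLM-4.7": (36.8, 32.8),
    "Qwen3-VL-235B": (28.8, 30.0),
    "Llama-4-Scout": (25.2, 24.8),
}

all_inside = all(inside_twin_interval(s, m) for s, m in PAIRS.values())
deltas = {name: round(s - m, 1) for name, (s, m) in PAIRS.items()}
```

Under that assumption every pair passes, and four of the five deltas are positive.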

4. Discussion

4.1 The frontier premium is an extraction-on-degraded phenomenon

On commercial-degraded scans, Opus (94%) leads the best open model, Qwen3-VL-235B (83.7%), by about 10 points; Sonnet sits between (88.7%). On medical-degraded the gap nearly closes (96.4 vs 95.0). Llama-4-Scout is the weakest vision extractor (72%) — natively multimodal does not mean good at document vision. The bigger model helps on one specific thing: the long identifiers and account numbers on degraded scans, the strings that break up on a noisy image. So the case for paying 8–38× per token is specifically hard, degraded-layout extraction accuracy, and nowhere else.

4.2 The premium does not transfer to RAG

Grounded QA clusters at ~30–33% across Opus, GLM-4.7, Kimi, DeepSeek, GLM-5, and Qwen3-VL with overlapping confidence intervals — the frontier does not lead. Two cautions on the metric: (a) it scores answer-match against terse reference answers, so it rewards concise, extractive answers over synthesis; (b) Sonnet's low 24.6% is a style artifact, not weaker RAG — Sonnet refuses less (6% vs 7%) and cites more (99% vs 94%) than Opus, but writes about 30% longer enumerated answers that the strict gold-match judge penalizes. A containment-style judge would neutralize this; changing the judge, however, breaks comparability with the version-locked baseline, so it is a v2 decision. The honest reading: on RAG, choose by cost and latency, not by RAG rank.

4.3 Self-hosting economics: utilization and sovereignty, not a blanket win

Per output token at peak utilization, self-hosting beats Bedrock for most of the lineup — but not the cheap-on-Bedrock model (Scout) or the slow reasoner (DeepSeek). The lever is utilization: the self-host figures assume peak 64-concurrent throughput on shortage-elevated spot; average utilization is lower and shortage pricing inflates the H200 rows. The durable self-host arguments are therefore (1) cost at high, sustained utilization, and (2) sovereignty and data control — not a universal per-token saving. Reasoning models carry a throughput tax: DeepSeek-V3.2 (1,403 tok/s) and GLM-5.2 (1,357 tok/s) generate long chains per request, which is why they are the priciest to self-host.

4.4 A trillion parameters no longer needs a cluster

Kimi-K2.5 is the interesting one, and it overturned our own assumption. At a trillion parameters we expected it to need more than one machine. It doesn't. The model ships as a natively four-bit checkpoint: its makers trained it to run at low precision, so the public weights are about 595 GB, not the two terabytes a trillion parameters would take at 16-bit precision. That fits on a single eight-GPU box with room to spare. So we serve it like everything else (one machine, rented by the hour), and it posted the highest clean-digital extraction score (97.2) and the highest retrieval score (37.6%) of the open models we tested. Two things follow. "Trillion-parameter" no longer means "needs a cluster." And the trillion-parameter model you rent from a cloud may be the very same compact four-bit checkpoint you could run yourself: providers don't publish the precision they serve at, and for this model there is only one public version to compare against.
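The arithmetic behind "fits on one box" is short enough to check directly. In the sketch below the 141 GB per-GPU figure is an assumption taken from NVIDIA's published H200 specification, not from the measurements above, and the calculation ignores KV cache, activations, and runtime overhead.

```python
def weights_gb(params, bits_per_param):
    """Raw weight size in decimal gigabytes for a given parameter count and precision."""
    return params * bits_per_param / 8 / 1e9

H200_GB = 141            # assumed HBM capacity per H200 GPU
NODE_GB = 8 * H200_GB    # one eight-GPU node

sixteen_bit = weights_gb(1e12, 16)   # a trillion parameters at 16 bits
four_bit = weights_gb(1e12, 4)       # the same count at a flat 4 bits
published_gb = 595                   # checkpoint size quoted in the text
headroom_gb = NODE_GB - published_gb # left over for KV cache and activations
```

A flat four bits per parameter would give 500 GB; the published checkpoint is somewhat larger, and either way it sits well inside the node's 1,128 GB of GPU memory.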

4.5 Threats to validity

  1. FATURA buyer-name gold bug (fixed). About 45–50% of buyer_name gold values were the literal label "Bill to"; we corrected the gold and re-scored all managed models on the corrected 5-field gold. The fix raised commercial F1 across the board (e.g. Opus 92.9→93.7, Scout 66.2→71.7).
  2. RAG judge style bias (§4.2) — the version-locked judge penalizes verbose, multi-part answers; read the RAG cluster as tied.
  3. Latency ≠ throughput — managed latency is best-effort per-request; self-host throughput is owned capacity. Reported separately, never merged.
  4. GLM-5/5.2 is a family pair, not same-weights — do not read that row as a calibration check.
  5. Self-host $/M is best-case — peak utilization, point-in-time shortage-elevated spot. The numbers were measured on AWS; the platform runs the same on Azure and GCP, and equivalent results there are planned.

5. Conclusions

5.1 Best all-around

For a mixed enterprise document workload (scanned intake and grounded QA), the best all-around open-weight model is Qwen3-VL-235B. It is the only open model strong on both axes: the best open document-vision extraction (matching its Bedrock twin to the decimal at 83.7% on commercial-degraded scans) and a competitive ~29–30% on RAG, at low cost ($1.30/M self-hosted, $2.66 Bedrock) and with the second-highest open-model throughput tested (3,838 tok/s, just behind Llama-4-Scout's 3,927). If the workload is text-only (no scanned images), GLM-4.7 is the better all-rounder: top-cluster RAG (36.8% self-hosted), strong clean-digital extraction (95.8%), the cheapest self-host among the text-only models ($1.27/M), and the fastest managed latency among them (1.3 s). Pay for the frontier (Opus 4.8) only when degraded-document extraction accuracy is the priority. That is the one axis where its ~10-point lead and 8–38× cost are justified.

5.2 Situational guide

If your priority is… | Pick | Why (from the data)
Hardest scanned-document extraction accuracy | Opus 4.8 | 94% com-deg, ~10 pts over the best open model; the frontier premium is real here
Best open vision+text all-rounder | Qwen3-VL-235B | matches Bedrock to the decimal on extraction, ~30% RAG, $1.30 self, 3.8k tok/s
Text-only docs, best RAG + economics | GLM-4.7 | 36.8% RAG self, $1.27/M self, 1.3 s managed, 3.1k tok/s
Lowest cost / latency, clean inputs | Llama-4-Scout | $0.66/M Bedrock, 0.8 s, 3.9k tok/s, but weakest extraction (72% com-deg)
Clean docs, modest volume | small model on one GPU, or a managed API | $0.22/M self; under ~150M tokens/day the API is cheaper
Maximum data sovereignty | any self-host (Qwen3-VL / GLM-4.7 best value) | calibration proves self-host reproduces managed quality
Grounded RAG specifically | treat as a tie; choose on cost/latency | the ~30–33% cluster's intervals overlap; RAG rank is not decisive

5.3 Where this nets out

If your documents are clean and your volume is modest, a small model on one GPU is cheap and good enough, and under the break-even volume a managed API or Bedrock-in-VPC is the cheaper choice. If your inputs are messy, a larger model is worth its cost on the hard fields, though none of them makes degraded scans easy. And sometimes the reason to self-host isn't cost at all: it's that the data can't leave, which a dollars-per-token table doesn't capture.

The sovereign stack is not a quality compromise: on identical inputs it reproduces the managed provider's output on every axis. The real decision is an economic and operational one — utilization, latency profile, and data-control posture — and, for the single case of degraded-layout extraction, whether the frontier's accuracy edge is worth its premium. The point of running each model on Bedrock and beside the frontier is to let those tables say "just use Bedrock" or "just use Opus" out loud, in numbers, when that's the honest answer. A comparison you trust when it favors self-hosting is one that was willing to come out the other way.

Appendix — provenance & reproducibility

Runner: bench/managed-sweep.sh (+ job-quality-managed.yaml); self-host throughput via bench/loadgen.py. Scorer: bench/quality/score.py (+ judge.py, stats.py). Judge: Qwen2.5-VL-72B on vLLM v0.23.0. Data generators: data/synthea/, data/fatura/build.py, data/corpus/build.py; pins in data/README.md (Synthea v4.0.0 / seed 1337; FATURA pinned revision). Everything is in the open repository alongside the Terraform, the Helm charts, and the harness that produced these numbers.