Enterprise OSS LLM Index · Q2 2026

The open-weight models worth self-hosting, measured.

A quarterly, reproducible benchmark of open-weight models on the document work regulated firms actually run: extraction from degraded scans, retrieval under a fixed judge, and the cost to serve each at scale. Every model is run on our own in-VPC stack and beside the same weights on Amazon Bedrock. The Index is open to read, with no email gate. The email field below is only there if you want a note when the next edition ships.

Q2 2026 edition · reproducible · synthetic data · no customer data · full writeup →

Executive summary

The host doesn't change the score. The trade-offs are yours to make.

Q2 2026 is the first edition. We ran six open-weight models on three measures that matter for regulated document work, on our own stack inside a VPC and beside the same weights on Bedrock and the frontier.

The headline result is the one that lets you stop worrying about who hosts the model: the same weights score the same regardless of who serves them. Extraction F1 on our stack matched Bedrock to the decimal (97.9% against 97.9% on the strongest vision model), and retrieval matched within the confidence interval (CI) across the set. So where you run an open-weight model is a cost-and-control decision, not a quality one.

On the numbers, no model wins every axis. DeepSeek-V4-Pro, the newest and largest at 1.6 trillion parameters, led retrieval at 40% but was by far the most expensive to serve, at $11.71 per million output tokens. Qwen3-VL-235B led extraction at 97.9% F1. At the other end of the cost range, Llama-4-Scout was the cheapest to serve at $1.01 per million output tokens. The Index is a map of those trade-offs, not a leaderboard with one trophy.

Results

Q2 2026, per model.

Retrieval quality · self-hosted in-VPC

One fixed judge · bars to a 40% axis · $/M output at right

DeepSeek-V4-Pro · 40% · $11.71

Kimi-K2.5 · 38% · $2.18

GLM-4.7 · 37% · $1.27

DeepSeek-V3.2 · 35% · $3.56

Qwen3-VL-235B · 29% · $1.30

Llama-4-Scout · 25% · $1.01


Full table · Q2 2026

Model	Extract F1	Retrieval	$/M out
DeepSeek-V4-Pro	—	40%	$11.71
Qwen3-VL-235B	97.9%	29%	$1.30
Llama-4-Scout	91.3%	25%	$1.01
GLM-4.7	—	37%	$1.27
Kimi-K2.5	—	38%	$2.18
DeepSeek-V3.2	—	35%	$3.56

Extraction = F1 on degraded scans, vision models only. Retrieval under one fixed judge. $/M out = US$ per million output tokens (serving cost), at spot pricing and peak utilization. Same weights on Bedrock score within CI.
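One way to read "within CI" is as a paired comparison: score the same items on both hosts, then check whether a confidence interval for the mean difference includes zero. The sketch below uses a percentile bootstrap. It is an illustration only; the function names, the 95% level, the resample count, and the toy scores are assumptions, not the Index's harness.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of (scores_a - scores_b), paired by item."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample item-level differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def hosts_match(scores_a, scores_b, **kwargs):
    """True when the CI for the mean score difference includes zero."""
    lo, hi = paired_bootstrap_ci(scores_a, scores_b, **kwargs)
    return lo <= 0.0 <= hi

# Toy per-item scores (hypothetical, not Index data).
in_vpc = [0.2, 0.4, 0.4, 0.6, 0.3]
bedrock = [0.2, 0.4, 0.4, 0.6, 0.3]
print(hosts_match(in_vpc, bedrock))
```

Pairing by item matters here: both hosts see the same documents, so resampling the differences removes item difficulty from the comparison.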

Methodology

What we measured, and what we didn't.

01 Extraction. F1 on field-level extraction from intentionally degraded scans, run on the vision-capable models. It rewards reading the right value off a messy page, not guessing plausibly.
02 Retrieval. Answer grounding scored by one fixed LLM judge with a pre-registered rubric, identical across every model. The absolute numbers are low by design: the judge is strict, so the ranking is what carries the signal.
03 Cost. Dollars per million output tokens, self-hosted at spot pricing and peak utilization. It is the serving cost you would actually plan against, not a list price.
04 Same weights, two substrates. Every model was scored on our in-VPC stack and on the same weights served by Bedrock, so the host effect could be measured rather than assumed.
05 What we did not measure this quarter. Function calling, long-context behavior, and end-to-end agentic completion are in the harness but not in this edition. We would rather publish three measures we trust than five we are still hardening.
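For readers who want the extraction measure in concrete terms, here is a minimal sketch of a field-level F1, micro-averaged over (field, value) pairs. The exact-match rule, the whitespace trimming, the function name, and the sample invoice fields are assumptions for illustration; the harness may normalize or match values differently.

```python
def field_f1(predicted, gold):
    """Micro-averaged F1 over (field, value) pairs across documents.

    predicted, gold: lists of dicts (one per document) mapping field -> value.
    A predicted field is correct only on an exact value match after trimming
    whitespace. A value of None means the field was left blank.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same documents")
    tp = fp = fn = 0
    for pred_doc, gold_doc in zip(predicted, gold):
        pred_pairs = {(k, str(v).strip()) for k, v in pred_doc.items() if v is not None}
        gold_pairs = {(k, str(v).strip()) for k, v in gold_doc.items() if v is not None}
        tp += len(pred_pairs & gold_pairs)   # right field, right value
        fp += len(pred_pairs - gold_pairs)   # value asserted but wrong or spurious
        fn += len(gold_pairs - pred_pairs)   # gold value missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: one of two fields is misread off the scan.
gold = [{"invoice_no": "A-17", "total": "120.00"}]
pred = [{"invoice_no": "A-17", "total": "128.00"}]
print(field_f1(pred, gold))
```

Under this scoring, a wrong value costs both a false positive and a false negative, while a blank field costs only a false negative. A plausible guess is therefore penalized more than an honest blank.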

Read the full writeup → Methodology & harness on GitHub ↗

Per-model notes

Where each one fits.

01 DeepSeek-V4-Pro. Top retrieval in the set at 40% and the newest, largest open flagship at 1.6 trillion parameters, with extraction saturated near the top of the field. The catch is cost: at $11.71 to serve it is the most expensive by far, roughly three times the next. A quality pick, not a cost pick: choose it when retrieval quality justifies the serving bill. Read the writeup →
02 Qwen3-VL-235B. Strongest extraction in the set at 97.9% F1 and vision-capable, which makes it the one to beat on scanned-document work. Retrieval is second-lowest in the set at 29%, and it costs $1.30 to serve.
03 Kimi-K2.5. Top of the affordable retrievers at 38%, just behind V4-Pro on the metric at a fraction of the price ($2.18). The pick when retrieval matters and volume is high.
04 GLM-4.7. Within a point of Kimi on retrieval at 37%, and the cheapest of the strong retrievers at $1.27. The value pick when retrieval is the priority.
05 DeepSeek-V3.2. Solid retrieval at 35%, but pricey for a mid-size model at $3.56, which is hard to justify unless something else in your stack already depends on it.
06 Llama-4-Scout. Cheapest to serve at $1.01 and vision-capable (91.3% extraction), with the lowest retrieval at 25%. A strong default when cost dominates and retrieval is not the bottleneck.
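The retrieval-versus-cost picks above can be read as a Pareto frontier: a model stays on it if no other model is both at least as good on retrieval and at least as cheap. The sketch below computes that frontier from the Q2 2026 table values. The function names are illustrative, and the two-axis view is a simplification that ignores extraction.

```python
# (retrieval %, US$ per million output tokens), copied from the Q2 2026 table.
MODELS = {
    "DeepSeek-V4-Pro": (40, 11.71),
    "Kimi-K2.5": (38, 2.18),
    "GLM-4.7": (37, 1.27),
    "DeepSeek-V3.2": (35, 3.56),
    "Qwen3-VL-235B": (29, 1.30),
    "Llama-4-Scout": (25, 1.01),
}

def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    (retr_a, cost_a), (retr_b, cost_b) = a, b
    return (retr_a >= retr_b and cost_a <= cost_b
            and (retr_a > retr_b or cost_a < cost_b))

def pareto_frontier(models):
    """Names of models that no other model dominates, in input order."""
    return [
        name
        for name, point in models.items()
        if not any(
            dominates(other, point)
            for other_name, other in models.items()
            if other_name != name
        )
    ]

print(pareto_frontier(MODELS))
# ['DeepSeek-V4-Pro', 'Kimi-K2.5', 'GLM-4.7', 'Llama-4-Scout']
```

On these two axes alone, DeepSeek-V3.2 and Qwen3-VL-235B are dominated by GLM-4.7. That does not make Qwen3-VL-235B a poor choice: it leads extraction, which this view leaves out.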

License terms and per-model deployment notes are in the writeup. Always confirm a model's license against your own use before deploying.

Get the next edition when it ships.

The Index is open and free to read. Leave your email and we'll send one note when the Q3 2026 edition is out. One field, no sequence.

Read the full writeup →

Editions: Q2 2026 — current. This is the first edition; the archive grows each quarter.