Enterprise OSS LLM Index · Q2 2026

The open-weight models worth self-hosting, measured.

A quarterly, reproducible benchmark of open-weight models on the document work regulated firms actually run: extraction from degraded scans, retrieval under a fixed judge, and the cost to serve each at scale. Every model is run on our own in-VPC stack and beside the same weights on Amazon Bedrock. The Index is open to read, with no email gate. The email field below is only for a note when the next edition ships.

Q2 2026 edition · reproducible · synthetic data · no customer data · full writeup →

Executive summary

The host doesn't change the score. The trade-offs are yours to make.

Q2 2026 is the first edition. We ran five open-weight models on three measures that matter for regulated document work: extraction, retrieval, and cost to serve. Each ran on our own stack inside a VPC, beside the same weights on Bedrock, with frontier models alongside for comparison.

The headline result is the one that lets you stop worrying about who hosts the model: the same weights score the same regardless of who serves them. Extraction F1 on our stack matched Bedrock within the confidence interval (95.0% against 94.7% on the strongest vision model), and retrieval matched within CI across the set. So where you run an open-weight model is a cost-and-control decision, not a quality one.
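As an illustration of what "matched within the confidence interval" can mean in practice, here is a minimal sketch of a paired percentile bootstrap on the score difference between two hosts. It is not the Index's harness: the function names, the per-document score lists, and the choice of a bootstrap are all assumptions made for this example, and the published figures may come from a different procedure.

```python
import random


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Scores are assumed to be paired: index i is the same document
    scored on host A and on host B (hypothetical input format).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("paired score lists must be non-empty and equal length")
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample documents with replacement, keeping pairs together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def hosts_match(scores_a, scores_b, **kwargs):
    """True when the CI on the difference contains zero."""
    lo, hi = bootstrap_diff_ci(scores_a, scores_b, **kwargs)
    return lo <= 0.0 <= hi
```

Under this reading, two hosts "match" when zero lies inside the interval on their difference, so a gap like 95.0% against 94.7% counts as a tie only if the interval is wide enough to cover it.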

On the numbers, no model wins every axis. Qwen3-VL-235B led extraction at 95.0% F1. Kimi-K2.5 and GLM-4.7 led retrieval at 38% and 37% under a deliberately strict single-judge metric. On cost, Llama-4-Scout was the cheapest to serve at $1.01 per million output tokens, and DeepSeek-V3.2 the most expensive at $3.56. The Index is a map of those trade-offs, not a leaderboard with one trophy.
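One way to read "no model wins every axis" is as a Pareto front. The sketch below is illustrative and not part of the Index's harness: it takes the Q2 2026 retrieval scores and serving costs as published and keeps every model that no other model beats on both axes at once. The names `RESULTS`, `dominates`, and `pareto_front` are ours, and extraction F1 is left out because only the vision models have it.

```python
# (model, retrieval %, US$ per million output tokens), from the Q2 2026 results.
RESULTS = [
    ("Kimi-K2.5", 38, 2.18),
    ("GLM-4.7", 37, 1.27),
    ("DeepSeek-V3.2", 35, 3.56),
    ("Qwen3-VL-235B", 29, 1.30),
    ("Llama-4-Scout", 25, 1.01),
]


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    _, retrieval_a, cost_a = a
    _, retrieval_b, cost_b = b
    no_worse = retrieval_a >= retrieval_b and cost_a <= cost_b
    strictly_better = retrieval_a > retrieval_b or cost_a < cost_b
    return no_worse and strictly_better


def pareto_front(rows):
    """Names of the models that no other model dominates."""
    return [r[0] for r in rows if not any(dominates(o, r) for o in rows)]


print(pareto_front(RESULTS))
```

On these two axes alone, Kimi-K2.5, GLM-4.7, and Llama-4-Scout stay on the front. Counting extraction as a third axis would keep Qwen3-VL-235B in play, since it leads there.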

Results

Q2 2026, per model.

Retrieval quality · self-hosted in-VPC
One fixed judge · bars to a 40% axis · $/M output at right
Kimi-K2.5 · 38% · $2.18
GLM-4.7 · 37% · $1.27
DeepSeek-V3.2 · 35% · $3.56
Qwen3-VL-235B · 29% · $1.30
Llama-4-Scout · 25% · $1.01
Full table · Q2 2026
Model            Extract F1   Retrieval   $/M out
Qwen3-VL-235B    95.0%        29%         $1.30
Llama-4-Scout    84.6%        25%         $1.01
GLM-4.7          n/a          37%         $1.27
Kimi-K2.5        n/a          38%         $2.18
DeepSeek-V3.2    n/a          35%         $3.56
Extraction = F1 on degraded scans, vision models only. Retrieval under one fixed judge. $/M out = US$ per million output tokens (serving cost), at spot pricing and peak utilization. Same weights on Bedrock score within CI.
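For readers who want to sanity-check a $/M out figure against their own hardware, the usual arithmetic is hourly cost divided by output tokens produced per hour, scaled to a million. The sketch below assumes that definition; the function name, the `utilization` parameter, and the example inputs are hypothetical and are not taken from the Index's harness.

```python
def cost_per_million_output_tokens(hourly_cost_usd, tokens_per_second, utilization=1.0):
    """US$ per million output tokens for one serving deployment.

    hourly_cost_usd:   what the instance(s) cost per hour (e.g. a spot price)
    tokens_per_second: aggregate output throughput at full load
    utilization:       fraction of the hour spent at that throughput (0 < u <= 1)
    """
    if hourly_cost_usd < 0 or tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("need cost >= 0, throughput > 0, and 0 < utilization <= 1")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs, for illustration only: $10/hour and 2,000 output tokens/s.
example = cost_per_million_output_tokens(10.0, 2000.0)
print(f"${example:.2f} per million output tokens")
```

Because the figures above are quoted at peak utilization, a deployment that sits idle part of the time will see a higher effective cost: halving `utilization` doubles the result.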

Methodology

What we measured, and what we didn't.

Read the full writeup →
Methodology & harness on GitHub ↗

Per-model notes

Where each one fits.

License terms and per-model deployment notes are in the writeup. Always confirm a model's license against your own use before deploying.

Get the next edition when it ships.

The Index is open and free to read. Leave your email and we'll send one note when the Q3 2026 edition is out. One field, no follow-up email sequence.

Editions: Q2 2026 — current. This is the first edition; the archive grows each quarter.