When you benchmark models on real document workloads instead of leaderboards, the findings worth keeping are the ones that contradict the obvious choice. Two of them changed how we size deployments.
Pay for vision only when the input needs it
Most teams reach for a vision-language model for document extraction, because it reads scans and photos without a separate OCR step. So we benchmarked one against its same-size, same-price, text-only sibling on clean digital invoices, which are real text with no scanning involved. We held the field schema, the gold labels, and the cost per token constant.
| Model (7B class or smaller) | Type | F1 on clean digital |
|---|---|---|
| Qwen2.5-7B | text-only | 99.0% |
| Phi-3.5-mini | text-only, cheapest | 92.7% |
| Qwen2.5-VL-7B | vision-language | 87.5% |
The vision model, at the same parameter count and the same cost, scored worse on documents it could have parsed as text, because it processes the page as pixels and inherits the error modes that come with that.
The pattern is to route by input type: send digital PDFs to the text model and scans and photos to the vision model, using whichever model scores best on each. In practice we run a single vision-language model for both, because operating one model is simpler than operating two, and we accept the lower clean-text accuracy that comes with it. A workload that needed that accuracy back would split the two paths. Note the cheapest model in the table as well: at 92.7% it trailed the best text model by 6.3 points and still beat the vision model. A higher price does not buy safety on its own.
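The routing rule described above can be sketched in a few lines. This is a minimal illustration, not a production router: the `Document` type, the `route` function, the model labels, and the 200-characters-per-page threshold are all assumptions made for the example. It treats a file as digital when its embedded text layer carries enough characters per page, and falls back to the vision path otherwise.

```python
from dataclasses import dataclass

# Hypothetical labels; substitute whatever models your deployment serves.
TEXT_MODEL = "text-only"
VISION_MODEL = "vision-language"


@dataclass
class Document:
    text_layer: str  # text read from the file's embedded text layer, "" if none
    page_count: int


def route(doc: Document, min_chars_per_page: int = 200) -> str:
    """Pick a model by input type.

    A document with a usable text layer goes to the text model;
    anything else (scan, photo, image-only PDF) goes to the vision model.
    The threshold is an assumed starting point, not a measured value.
    """
    if doc.page_count <= 0:
        raise ValueError("page_count must be positive")
    chars_per_page = len(doc.text_layer.strip()) / doc.page_count
    if chars_per_page >= min_chars_per_page:
        return TEXT_MODEL
    return VISION_MODEL


if __name__ == "__main__":
    digital = Document(text_layer="Invoice 1042 " * 100, page_count=2)
    scanned = Document(text_layer="", page_count=3)
    print(route(digital), route(scanned))
```

A scanned PDF usually has an empty or near-empty text layer, so it falls through to the vision path. The threshold would need tuning against your own intake.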
Scale pays only where vision is hard
The other reflexive thought is, "a bigger model will fix it." We ran the same extraction eval across three difficulty tiers: clean digital, clean scan, and degraded scan (skewed, blurred, and compressed, the way a real fax or phone photo arrives). We compared a 7B model with its 72B sibling, which costs about ten times as much per token.
| Workload and tier | 7B F1 | 72B F1 | Δ (points) |
|---|---|---|---|
| Medical, clean digital | 96.4% | 96.8% | tie |
| Medical, degraded scan | 93.7% | 96.5% | +2.8 |
| Commercial, degraded scan | 70.6% | 74.5% | +3.9 |
On clean documents it was a statistical tie: the 0.4-point difference sits inside the confidence intervals, so paying ten times more bought nothing measurable. On degraded scans the 72B pulled ahead, but only by a few points, and only on the fields where the small model fails: long identifiers, account numbers, the alphanumeric strings that fall apart on a noisy scan. Even the 72B model reached only 74.5% on the worst of the degraded real-world scans. Scale helps at the margin. It does not solve the hard tier.
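One way to check whether a gap like this is a tie is a paired bootstrap over per-document scores. The sketch below is an assumption about method, not a description of how the numbers in the table were computed: the function names, the resample count, and the choice to average per-document scores are all illustrative.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Confidence interval for mean(scores_b) - mean(scores_a).

    Both lists hold one score per document, in the same document order,
    so each resample keeps the pairing between the two models.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample documents with replacement and record the mean difference.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_tie(ci):
    """Treat the two models as tied when the interval contains zero."""
    lo, hi = ci
    return lo <= 0.0 <= hi


if __name__ == "__main__":
    # Synthetic scores for illustration only; not benchmark data.
    small = [0.9, 1.0, 0.8, 1.0, 0.9, 1.0]
    large = [1.0, 0.9, 0.8, 1.0, 1.0, 0.9]
    ci = paired_bootstrap_ci(small, large)
    print(ci, is_tie(ci))
```

If the interval straddles zero, the extra spend is not buying a difference you can distinguish from sampling noise on that tier.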
So the benchmark informs a decision per tier rather than per model: route clean intake to the cheap model where it ties the expensive one, and reserve the expensive model for the degraded inputs where its few points land. Preprocessing, such as de-skewing and upscaling, is the other lever, and in many workloads the cheaper one.
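The per-tier decision can be written down directly from the table. In this sketch the F1 values are copied from the table above, while the tier keys, the `choose_model` function, and the one-point `min_gain_points` threshold are hypothetical. A real cutoff would come from the confidence intervals and from what a point of F1 is worth to the workload.

```python
# F1 per tier in percent, copied from the table above.
# The tier keys are made-up identifiers for this example.
BENCHMARK = {
    "medical_clean_digital": {"7B": 96.4, "72B": 96.8},
    "medical_degraded_scan": {"7B": 93.7, "72B": 96.5},
    "commercial_degraded_scan": {"7B": 70.6, "72B": 74.5},
}


def choose_model(tier: str, min_gain_points: float = 1.0) -> str:
    """Use the 72B model only where it beats the 7B by a meaningful margin.

    min_gain_points is an assumed threshold standing in for a proper
    significance check; it is not a value from the benchmark.
    """
    scores = BENCHMARK[tier]
    gain = scores["72B"] - scores["7B"]
    return "72B" if gain >= min_gain_points else "7B"


if __name__ == "__main__":
    for tier in BENCHMARK:
        print(tier, "->", choose_model(tier))
```

Raising the threshold shifts more traffic to the small model. With roughly a tenfold price gap per token, each tier's gain has to justify that multiple.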
Why we run our own documents
None of this shows up on a generic leaderboard, because a leaderboard does not run your documents at your difficulty. We publish a quarterly benchmark on regulated-document workloads because the decision-useful answers are specific: which model, at which input quality, for which field. "Bigger" and "multimodal" are line items on a bill, and each one should earn its cost.