A vision model is a worse text extractor than its text-only twin

When you benchmark models on real document workloads instead of leaderboards, the findings worth keeping are the ones that contradict the obvious choice. Two of them changed how we size deployments.

Pay for vision only when the input needs it

Most teams reach for a vision-language model for document extraction, because it reads scans and photos without a separate OCR step. So we benchmarked one against its same-size, same-price, text-only sibling on clean digital invoices: files that carry real text, with no scanning involved. We held the field schema, the gold labels, and the cost per token constant.

Model (7B class)	Type	F1 on clean digital
Qwen2.5-7B	text-only	99.0%
Phi-3.5-mini	text-only, cheapest	92.7%
Qwen2.5-VL-7B	vision-language	87.5%

The vision model, at the same parameter count and the same cost, scored 11.5 points lower (87.5% against 99.0%) on documents it could have parsed as text, because it processes the page as pixels and inherits the error modes that come with that.
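The F1 figures in the table are scored against gold labels on a fixed field schema. The exact scoring rule is not spelled out here, so the following is only a minimal sketch of one common choice: micro-averaged, exact-match, field-level F1. The function name `score_fields` and the field names in the usage note are hypothetical.

```python
def score_fields(predictions, gold):
    """Micro-averaged field-level precision, recall and F1 (exact match).

    predictions, gold: equal-length lists of dicts, field name -> string.
    A missing or empty value counts as "not extracted".
    Assumption: exact string match after stripping whitespace; a real
    benchmark may normalize dates, amounts, or casing first.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")

    tp = fp = fn = 0
    for pred, truth in zip(predictions, gold):
        for field in set(pred) | set(truth):
            p = (pred.get(field) or "").strip()
            g = (truth.get(field) or "").strip()
            if p and g and p == g:
                tp += 1          # correct value
            else:
                if p:
                    fp += 1      # wrong or spurious prediction
                if g:
                    fn += 1      # gold value missed or extracted wrongly

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this rule, a prediction that gets one of three gold fields right, one wrong, and omits the third has precision 1/2 and recall 1/3. A wrong value is penalized twice (once as a false positive, once as a false negative), which is why noisy extractions lose F1 quickly.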

Vision is a capability you pay for in accuracy on the inputs that do not need it.

The pattern is to route by input type: send digital PDFs to the text model and scans and photos to the vision model, using whichever model scores best on each. In practice we run a single vision-language model for both, because operating one model is simpler than operating two, and we accept the loss of clean-text accuracy that comes with it. A workload that needed that accuracy back would split the two paths. Note also the cheapest model in the table: at 92.7% it trailed the best text model by about six points but still beat the vision model by about five. A higher price does not buy safety on its own.
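Routing by input type can be sketched as a small function. This is an illustration under stated assumptions, not the production router: it assumes some PDF library has already produced the text layer for each page (an empty string where none exists), and the `IntakeDocument` type, the `choose_model` function, and the 50-character threshold are all hypothetical.

```python
from dataclasses import dataclass, field

TEXT_MODEL = "Qwen2.5-7B"        # text-only model from the table above
VISION_MODEL = "Qwen2.5-VL-7B"   # vision-language model from the table above


@dataclass
class IntakeDocument:
    name: str
    # Text layer per page, as returned by whatever PDF library is in use.
    # Scanned pages and photos are expected to yield "" here.
    page_texts: list[str] = field(default_factory=list)


def choose_model(doc: IntakeDocument, min_chars_per_page: int = 50) -> str:
    """Send a document to the text model only if every page has a usable
    text layer; otherwise fall back to the vision model.

    Assumption: min_chars_per_page is an arbitrary cutoff that should be
    tuned on real intake, not a measured value.
    """
    has_text_on_every_page = bool(doc.page_texts) and all(
        len(text.strip()) >= min_chars_per_page for text in doc.page_texts
    )
    return TEXT_MODEL if has_text_on_every_page else VISION_MODEL
```

The fallback is deliberately conservative: a document with even one image-only page goes to the vision model, because a text model cannot recover fields from a page it never sees.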

Scale pays only where vision is hard

The other reflexive thought is, "a bigger model will fix it." We ran the same extraction eval across three difficulty tiers (clean digital, clean scan, and degraded scan, meaning skewed, blurred, and compressed the way a real fax or phone photo arrives) on a 7B model and its 72B sibling, which costs about ten times as much per token.

Tier	7B F1	72B F1	Δ
Medical, clean digital	96.4%	96.8%	tie
Medical, degraded scan	93.7%	96.5%	+2.8
Commercial, degraded scan	70.6%	74.5%	+3.9

On clean documents it was a statistical tie: paying ten times more bought nothing inside the confidence intervals. On degraded scans the 72B pulled ahead, but only by a few points, and only on the fields the small model fails: long identifiers, account numbers, the alphanumeric strings that fall apart on a noisy scan. Even the 72B model reached only 74.5% on the worst of the degraded real-world scans. Scale helps at the margin. It does not solve the hard tier.
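Whether a gap is a "statistical tie" depends on the confidence interval around the difference. The method behind the intervals is not described here, so the following is only a sketch of one standard approach, a paired bootstrap over documents. It assumes each model has a per-document score (for example, the fraction of fields extracted correctly) on the same documents in the same order; `paired_bootstrap_ci` and `is_tie` are hypothetical names.

```python
import random


def paired_bootstrap_ci(scores_small, scores_large,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(large - small).

    scores_small, scores_large: per-document scores for the two models,
    aligned so that index i refers to the same document in both lists.
    """
    if len(scores_small) != len(scores_large) or not scores_small:
        raise ValueError("need two non-empty, equal-length score lists")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [big - small for small, big in zip(scores_small, scores_large)]
    n = len(diffs)

    resampled_means = []
    for _ in range(n_resamples):
        # Resample documents with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)

    resampled_means.sort()
    low_index = int((alpha / 2) * n_resamples)
    high_index = max(low_index, int((1 - alpha / 2) * n_resamples) - 1)
    return resampled_means[low_index], resampled_means[high_index]


def is_tie(interval):
    """Treat the models as tied when the interval contains zero."""
    low, high = interval
    return low <= 0.0 <= high
```

Resampling paired differences, rather than each model separately, removes the document-to-document variation that both models share, so a small but consistent gain on hard documents can still be detected.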

So the benchmark informs a decision per tier rather than per model: route clean intake to the cheap model where it ties the expensive one, and reserve the expensive model for the degraded inputs where its few points land. Preprocessing, such as de-skewing and upscaling, is the other lever, and in many workloads the cheaper one.
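The per-tier decision can be written down with the numbers from the table. This is a sketch, not a sizing tool: the one-point `min_gain_points` cutoff is an assumed stand-in for a proper significance check, the traffic mix in the usage note is invented for illustration, and `route_tiers` and `relative_cost` are hypothetical names.

```python
COST_RATIO = 10.0  # the 72B model costs about ten times the 7B per token

# (7B F1, 72B F1) per tier, copied from the table above.
TIERS = {
    "medical_clean_digital": (96.4, 96.8),
    "medical_degraded_scan": (93.7, 96.5),
    "commercial_degraded_scan": (70.6, 74.5),
}


def route_tiers(tiers, min_gain_points=1.0):
    """Pick the 72B model only where it gains at least min_gain_points."""
    return {
        name: "72B" if large_f1 - small_f1 >= min_gain_points else "7B"
        for name, (small_f1, large_f1) in tiers.items()
    }


def relative_cost(routes, traffic_share):
    """Blended cost per token, relative to sending everything to the 7B.

    traffic_share: tier name -> fraction of tokens in that tier (sums to 1).
    """
    return sum(
        share * (COST_RATIO if routes[tier] == "72B" else 1.0)
        for tier, share in traffic_share.items()
    )
```

With these inputs, `route_tiers(TIERS)` keeps clean digital intake on the 7B model and sends both degraded tiers to the 72B. For a hypothetical mix of 80% clean digital, 15% degraded medical, and 5% degraded commercial, `relative_cost` comes to about 2.8 times the all-7B cost, against 10 times for sending everything to the 72B.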

Why we run our own documents

None of this shows up on a generic leaderboard, because a leaderboard does not run your documents at your difficulty. We publish a quarterly benchmark on regulated-document workloads because the decision-useful answers are specific: which model, at which input quality, for which field. "Bigger" and "multimodal" are line items on a bill, and each one should earn its cost.

The benchmark behind these findings is open.