01 · Evaluation & Selection

Know which open-weight model to trust before you spend a dime on it.

New open-weight models ship every few weeks, and public leaderboards don't run your workload. We benchmark the real candidates on your task, in your own cloud, and hand back a recommendation you can defend to your CISO and your CFO.

What you get

A recommendation with its work shown.

We'll provide a succinct recommendation that includes:

01 Written model recommendation: the model (or models) to deploy for your workload, and why, including the cases where we'd advise against self-hosting at all.
02 Eval results: accuracy, latency, and cost across candidates on your task, with the prompt sets and scoring criteria spelled out.
03 Harness code: the evaluation harness itself, yours to keep and re-run as new models ship.
04 Cost projection: self-host vs. API at your real volumes, with the break-even point.
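
To make the break-even point in the cost projection concrete, here is a minimal sketch of the arithmetic, assuming a simple model: API cost scales linearly with tokens, while self-hosting has a fixed monthly cost plus a smaller per-token cost. The names (`CostInputs`, `break_even_million_tokens`) and every dollar figure are hypothetical and for illustration only. A real projection would use your measured volumes and quoted prices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostInputs:
    # All figures are illustrative assumptions, in USD.
    api_price_per_million_tokens: float       # what the API provider charges
    selfhost_fixed_monthly: float             # GPUs, hosting, ops time per month
    selfhost_price_per_million_tokens: float  # marginal cost when self-hosting


def monthly_cost_api(c: CostInputs, million_tokens: float) -> float:
    """API cost: purely proportional to volume."""
    return c.api_price_per_million_tokens * million_tokens


def monthly_cost_selfhost(c: CostInputs, million_tokens: float) -> float:
    """Self-host cost: fixed monthly cost plus a marginal per-token cost."""
    return (c.selfhost_fixed_monthly
            + c.selfhost_price_per_million_tokens * million_tokens)


def break_even_million_tokens(c: CostInputs) -> Optional[float]:
    """Monthly volume (in millions of tokens) where both options cost the same.

    Returns None when self-hosting never catches up, i.e. its marginal
    cost is not lower than the API price.
    """
    saving_per_million = (c.api_price_per_million_tokens
                          - c.selfhost_price_per_million_tokens)
    if saving_per_million <= 0:
        return None
    return c.selfhost_fixed_monthly / saving_per_million


if __name__ == "__main__":
    # Hypothetical inputs: $10 per million tokens via API,
    # $6,000/month fixed plus $2 per million tokens self-hosted.
    example = CostInputs(10.0, 6000.0, 2.0)
    print(break_even_million_tokens(example))  # 750.0
```

With these made-up inputs, self-hosting only pays off above 750 million tokens a month. Below that volume the API is cheaper, which is one of the cases where a recommendation against self-hosting would apply.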

Fit

Who should take this on.

This is for you if

You're choosing among open-weight models for a workload you've already defined.
You need the decision to pass review by a staff engineer and a compliance officer.
You want the evaluation method itself, so you can repeat it as new models ship.

This isn't for you if

The workload isn't defined yet — we'll tell you to scope it first, and help you do that.
Getting to something that works is more urgent than selecting the best possible solution.
You've already decided and just want a rubber stamp.

FAQ

Questions we get first.

How do you handle our data during the evaluation?

We evaluate inside your account or in a clean, isolated environment you approve. Eval data is destroyed at the end of the engagement, and nothing is retained or used to train anything. These defaults are written into the statement of work.

What won't you benchmark?

Anything we can't measure honestly. We don't publish vibes-based scores, we don't benchmark a workload that isn't defined, and we won't compare models on a task that doesn't match what you'll actually run. If a fair test isn't possible yet, we say so.

How does this relate to the next engagement?

The harness you keep becomes the eval suite for a Sovereign RAG Platform or a Document Workflow Pilot, if you take one on. The evaluation is a complete deliverable on its own. There's no obligation to continue.

Do you only ever recommend open-weight, self-managed models?