The benchmark that came back almost empty

A quick note on what we were measuring, because it explains the symptom. To compare models, we run a fixed set of test documents, a few hundred of them, through the whole system and score each result against a known-correct answer. The output is a table with one row per document. A healthy run fills in every row, and the more rows that come back, the more of the test set completed.

We were testing a 671-billion-parameter open-weight model, the most expensive run in our benchmarking to that point, on rented high-end GPUs billed by the minute. The results table came back with four rows. Out of four hundred. The speed measurements taken at the same moment were perfect, so the model itself was working. Something was failing the other ninety-nine percent of requests before they could be scored.

The benchmark does not talk to the model itself. It talks to a small service we call the gateway, which does the surrounding work: it looks up relevant documents, strips out sensitive data, checks that the answer is grounded in the source, and then calls the model. Once we turned on logging for every failure, the same line came back 357 times:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In plain terms, that error means the gateway returned an empty response, with no error message and no timeout. And it happened only on requests that needed to look something up in the database. Requests that did not touch the database worked every time.
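That message is what Python's standard json module raises when it is asked to parse an empty string, which is how an empty response body shows up in a client. A minimal sketch, with an empty string standing in for the gateway's response (the variable names are ours, for illustration only):

```python
import json

# Stand-in for the empty body the gateway returned (an assumption for illustration).
empty_body = ""

try:
    json.loads(empty_body)
except json.JSONDecodeError as exc:
    message = str(exc)
    print(message)  # Expecting value: line 1 column 1 (char 0)
```

The parser fails at the very first character because there is no first character, hence "char 0".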

The database here is Postgres, which stores the document index the gateway searches. It was refusing new connections, because it had hit the maximum number it was configured to allow at once. The reason: the gateway opened a new connection for every lookup and never closed it. Here is the code. It looks correct, and it passes every test you would run on a single machine:

import psycopg2

# One new connection per lookup, as the gateway did.
with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(SQL, params)
        rows = cur.fetchall()

The catch is specific to this database library, psycopg2. Writing with ... as conn does not close the connection when the block ends. It ends the current transaction, committing on success or rolling back if the block raised, and leaves the connection open. One request at a time, you would never notice. But a benchmark sends many requests at once, and within seconds every allowed connection had been opened and left open, so every request after that got nothing back. That is why it looked fine in testing and fell over under real load.
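The mechanism can be sketched without a database. The block below is a simulation, not the gateway's code: FakeServer, FakeConnection, and the limit of 5 are hypothetical stand-ins that mimic the behavior described above, where leaving the with block ends the transaction but keeps the connection open. It is deliberately simplified: it runs sequentially, and a connection that is never closed counts against the limit for the rest of the run, which models a burst in which nothing reclaims the connections left open.

```python
from contextlib import closing

MAX_CONNECTIONS = 5  # hypothetical server limit, kept small for the demo


class FakeServer:
    """Tracks how many connections are open against a fixed limit."""

    def __init__(self, limit):
        self.limit = limit
        self.open_count = 0


class FakeConnection:
    """Stand-in connection: exiting the with block does NOT close it."""

    def __init__(self, server):
        if server.open_count >= server.limit:
            raise ConnectionError("too many connections")
        self.server = server
        self.closed = False
        server.open_count += 1

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # The transaction would end here; the connection stays open.
        return False

    def close(self):
        if not self.closed:
            self.closed = True
            self.server.open_count -= 1


def lookup_leaky(server):
    # Same shape as the original pattern: nothing ever calls close().
    with FakeConnection(server) as conn:
        return "rows"


def lookup_closed(server):
    # closing() calls conn.close() when the block exits.
    with closing(FakeConnection(server)) as conn:
        return "rows"


def count_successes(lookup, requests):
    """Run `requests` lookups against a fresh server and count the ones that work."""
    server = FakeServer(MAX_CONNECTIONS)
    ok = 0
    for _ in range(requests):
        try:
            lookup(server)
            ok += 1
        except ConnectionError:
            pass
    return ok


leaky_ok = count_successes(lookup_leaky, 400)
closed_ok = count_successes(lookup_closed, 400)
print(leaky_ok, closed_ok)  # 5 400
```

With the leaky pattern, only as many lookups succeed as the server allows connections. With the connection closed on the way out, all four hundred go through.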

The fix is one import and one wrapper that closes the connection on the way out:

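from contextlib import closing

# closing() calls conn.close() when the block exits, even if it raised.
# It does not commit, which is fine for a read-only lookup like this one.
with closing(psycopg2.connect(DSN)) as conn:
    with conn.cursor() as cur:
        cur.execute(SQL, params)
        rows = cur.fetchall()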

After that, the run completed all four hundred documents and the database stayed healthy under heavy concurrent load. (A connection pool, a fixed set of reusable connections, is the better long-term answer. This is the one-line version that stops the immediate problem.)
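For readers who have not met the idea, here is a rough sketch of what a connection pool does. This is an illustration only, not what the gateway runs: SimplePool and fake_connect are hypothetical names, and the pool accepts any zero-argument function that returns a connection. psycopg2 ships its own pool classes in psycopg2.pool, which would be the natural starting point in real code.

```python
import queue
from contextlib import contextmanager


class SimplePool:
    """A fixed set of reusable connections, handed out one at a time."""

    def __init__(self, connect, size):
        # `connect` is any zero-argument callable that returns a connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Wait for a free connection instead of opening a new one.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the connection back for reuse rather than closing it.
            self._idle.put(conn)


opened = []


def fake_connect():
    # Stand-in for a real connect call; records how many connections get opened.
    conn = object()
    opened.append(conn)
    return conn


pool = SimplePool(fake_connect, size=3)
for _ in range(400):
    with pool.connection() as conn:
        pass  # run the lookup here

print(len(opened))  # 3
```

Four hundred lookups reuse the same three connections, so the total the database sees is capped by the pool size rather than by the request rate. A production pool would also need to deal with broken connections and with transactions left open when a connection is returned, which this sketch ignores.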

What we took from it

Two things. The narrow one is worth filing away if you use this library: that common pattern leaks database connections, and the leak stays invisible until enough requests arrive together. The broader one saved us money. We found this on a run that billed by the minute, but the bug had nothing to do with the model or the expensive hardware. It was an ordinary load problem in an ordinary service, and we can reproduce it on a laptop with a small model and a script that sends many requests at once. So now that is the first thing we do. The failure that showed up at the most expensive moment turned out to be the cheapest one to reproduce.
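A script that sends many requests at once can be very small. The sketch below is one way to write it, not our actual harness: run_burst and fake_request are hypothetical names, and the request function is passed in so the same driver can wrap a real HTTP call to the gateway or, as here, a stand-in that fails on purpose.

```python
from concurrent.futures import ThreadPoolExecutor


def run_burst(send_request, total, concurrency):
    """Call send_request(i) for i in range(total) on `concurrency` threads.

    Returns (successes, failures). Any exception counts as a failure.
    """

    def attempt(i):
        try:
            send_request(i)
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        results = list(executor.map(attempt, range(total)))

    ok = sum(results)
    return ok, total - ok


def fake_request(i):
    # Stand-in for a call to the gateway; fails on every fourth request.
    if i % 4 == 3:
        raise RuntimeError("empty response")


ok, failed = run_burst(fake_request, total=400, concurrency=32)
print(ok, failed)  # 300 100
```

Pointing send_request at a local copy of the service is enough to see whether failures appear as concurrency goes up, with no expensive hardware involved.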

We publish what broke, not just what worked.