Self-hosting open-weight models means renting GPUs by the hour, so a benchmark that runs for an hour should cost an hour. Keeping that true comes down to one part of the system: the thing that turns the GPUs off when a run ends. We rebuilt ours twice, each time after it failed and the meter kept running.
The laptop in the loop
The first eval runs were driven from a laptop: port-forwards into the cluster, requests issued from the laptop, and a "scale the GPU node down when the run finishes" step at the end. One evening a run died when the developer's AWS SSO token hit its eight-hour expiry. The tunnels dropped, the run hung, and the scale-down step, which was chained to the run completing, never fired. The box billed until morning.
The root cause was structural: a laptop in the control loop, holding credentials that expire, with the cost-stop sitting downstream of a fragile process finishing. So we took the laptop out of the path. The cluster does the work and the laptop only submits. Evals run as in-cluster Jobs that hold their own identity for storage, so once a run is submitted there is no laptop, tunnel, or token left in the path. The job writes its results and logs to S3, where they survive whatever happens to the laptop. The cost-stop also no longer waits on a run to finish, and that new, separate trigger is where the second failure happened.
The kill switch that couldn't pull
Taking the laptop out meant the cost-stop could no longer ride on a run finishing, so we gave it its own trigger: a dead-man's switch, a nightly scheduled job that scales the GPU nodes to zero no matter what any run is doing. We verified that the schedule fired. We did not verify the executor.
The job ran a third-party kubectl image. The registry deprecated the tags, and the image was gone. So the dead-man fired on schedule and then could not start: it sat in ImagePullBackOff, unable to pull the image it needed, for seven hours while a spot GPU box idled. The schedule worked. The thing it was supposed to run did not exist.
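To make that state concrete: a pod that cannot pull its image reports a waiting reason of ImagePullBackOff or ErrImagePull. The sketch below is an illustration, not something from our setup. It filters "pod-name reason" lines for those two reasons. The pod names are made up, and it assumes one container per pod.

```shell
#!/bin/sh
# Reads "pod-name waiting-reason" lines on stdin and prints the names of
# pods whose container is stuck pulling its image.
pull_failures() {
  awk '$2 == "ImagePullBackOff" || $2 == "ErrImagePull" { print $1 }'
}

# Sample input shaped like the output of this command (not run here):
#   kubectl get pods -n "$NS" -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
# Running pods have no waiting reason, so their second field is empty.
# The pod names are hypothetical.
sample='gpu-nightly-off-28391-abcde ImagePullBackOff
eval-run-7f9c2
vllm-0'

stuck=$(printf '%s\n' "$sample" | pull_failures)
echo "$stuck"
```

A pod in this state has been scheduled but has not run a single instruction, which is why a schedule that fires says nothing about whether the work happened.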
What we changed
An emergency shut-off inherits every weakness of whatever it depends on, so we gave it as few dependencies as we could. Instead of a downloaded third-party image, the shut-off now asks the Kubernetes API directly to scale the model server to zero, using the credentials its own job already carries, from a base image the cluster already has:
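# In-cluster service account credentials mounted into the job's pod
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
APISERVER=https://kubernetes.default.svc
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# Use NS if it is already set; otherwise assume the job runs in the model server's namespace
NS=${NS:-$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)}
# --fail makes curl exit non-zero on an HTTP error, so a rejected PATCH fails the job
# (curl itself has to be present in the image)
curl --fail --silent --show-error --cacert "$CACERT" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/merge-patch+json" -X PATCH \
  "$APISERVER/apis/apps/v1/namespaces/$NS/deployments/vllm/scale" \
  -d '{"spec":{"replicas":0}}'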
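The PATCH above needs exactly one permission: patch on the scale subresource of the vllm Deployment. A minimal sketch of a Role that grants only that is shown here. The Role name and the default namespace are assumptions for illustration, not our actual manifest.

```shell
#!/bin/sh
# Prints a Role that permits only PATCH on the scale subresource of one
# Deployment. The Role name and the NS default are hypothetical.
NS="${NS:-evals}"

render_role() {
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-nightly-off
  namespace: ${NS}
rules:
  - apiGroups: ["apps"]
    resources: ["deployments/scale"]
    resourceNames: ["vllm"]
    verbs: ["patch"]
EOF
}

role_yaml=$(render_role)
echo "$role_yaml"
# To apply (not run here): render_role | kubectl apply -f -
```

The Role takes effect only once a RoleBinding attaches it to the service account the job runs as. With resourceNames set, the same token cannot scale any other Deployment in the namespace.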
We run the shut-off from a python:3.12-slim image, which stays available, unlike the third-party one. The job is allowed to do one thing: scale the model server to zero. Nothing in the shut-off path depends on anything downloaded from outside the cluster. And we test the executor when we install it, rather than trusting that it works:
kubectl create job --from=cronjob/gpu-nightly-off gpu-killtest # then confirm the node drains, don't assume it did
A schedule firing tells you nothing about whether the job behind it can do its work. So we run the kill once, on purpose, and watch the node drain before we trust it.
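Watching the drain can be scripted as a poll with a timeout. The sketch below is one way to do it, not our exact script. wait_for_zero, fake_node_count, and the GPU_NODE_LABEL variable in the comments are hypothetical names. A stand-in counter replaces the real node count so the example runs without a cluster.

```shell
#!/bin/sh
# wait_for_zero TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Runs CMD until it prints 0 (returns 0), or gives up once TIMEOUT_S
# seconds of waiting have elapsed (returns 1).
wait_for_zero() {
  timeout=$1 interval=$2; shift 2
  waited=0
  while :; do
    count=$("$@" | tr -d '[:space:]')
    if [ "$count" = "0" ]; then return 0; fi
    if [ "$waited" -ge "$timeout" ]; then return 1; fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Stand-in for the real node count: reports 2, then 1, then 0.
state=$(mktemp)
echo 2 > "$state"
fake_node_count() {
  n=$(cat "$state")
  echo "$n"
  if [ "$n" -gt 0 ]; then echo $((n - 1)) > "$state"; fi
}

# Against a real cluster the counter could look like this (label is an assumption):
#   real_node_count() { kubectl get nodes -l "$GPU_NODE_LABEL" --no-headers 2>/dev/null | wc -l; }
if wait_for_zero 10 1 fake_node_count; then result=drained; else result=timeout; fi
echo "$result"
rm -f "$state"
```

Run right after the kill test job is created, a non-zero exit from wait_for_zero means the executor did not do its work, whatever the schedule says.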
The lesson
Two rules came out of this. Keep the kill path's dependencies near zero: no third-party images, no credentials that expire, nothing that waits on the work to finish. And test the executor, not just the schedule. The only dead-man's switch you can trust is one you have watched turn the GPUs off.
We also put a limit underneath all of it. The GPU node pools are scoped so the dev cluster cannot launch the large instances, so if something misfires the worst case is a pod stuck Pending, which is free, instead of an eight-GPU box that bills. The strongest cost control is the one the infrastructure enforces, whoever is at the keyboard.