Eval Bench.

An internal tool that lets teams ship LLM features with the same confidence as normal code: behavioural specs, regression tests, and a 200ms feedback loop.

Role: Internal Tool · Stellar Labs
Client: Stellar Labs
Year: 2025
Duration: 4 months

// preview: Eval Bench

teams using: 12

median spec pass rate: 94.8%

median spec runtime: 200ms

regressions caught · m1: 6

/ 01 — The problem

What was broken.

LLM features were shipping with no regression safety. A prompt change at 4pm broke production at 5pm. No team had a clear way to say 'is this prompt still good?'

/ 02 — The approach

How I tackled it.

Designed a spec format that reads like English but compiles to a deterministic check. The spec runner caches results deterministically, so rerunning unchanged specs costs nothing. It is built into the existing CI, so there are no new tools to learn.
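A minimal sketch of the idea in Python, not the actual Eval Bench implementation. The two spec phrasings, the `compile_spec` function, the `SpecRunner` class, and the cache key (a hash of prompt plus spec) are all assumptions made for illustration; the real spec grammar and caching scheme are not described here.

```python
import hashlib
import re
from typing import Callable, Dict

Check = Callable[[str], bool]


def compile_spec(line: str) -> Check:
    """Compile one English-like spec line into a deterministic predicate.

    Only two hypothetical phrasings are supported in this sketch.
    """
    m = re.fullmatch(r"output must contain '(.+)'", line)
    if m:
        needle = m.group(1)
        return lambda out: needle in out
    m = re.fullmatch(r"output must be under (\d+) words", line)
    if m:
        limit = int(m.group(1))
        return lambda out: len(out.split()) < limit
    raise ValueError(f"unrecognised spec line: {line!r}")


class SpecRunner:
    """Runs specs against a model and caches results by content hash."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self.model = model              # any callable: prompt -> output text
        self.cache: Dict[str, bool] = {}
        self.model_calls = 0            # counts cache misses, for illustration

    def run(self, prompt: str, spec: str) -> bool:
        # Same prompt and same spec give the same key, so the cached
        # result is reused and the model is not called again.
        key = hashlib.sha256(f"{prompt}\x00{spec}".encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]
        self.model_calls += 1
        result = compile_spec(spec)(self.model(prompt))
        self.cache[key] = result
        return result
```

In this sketch, editing either the prompt or the spec changes the hash and forces a fresh run, while an unchanged pair is answered from the cache without calling the model.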

/ 03 — The outcome

What shipped.

Adopted by every team shipping LLM features. 6 prompt regressions caught before deploy in the first month. Engineers describe it as 'finally feels like normal software'.
