/ 06 · eval-bench · 2025
Internal platform · 2025

Eval Bench.

An internal tool that lets teams ship LLM features with the same confidence as ordinary code: behavioural specs, regression tests, and a 200ms feedback loop.

Role
Internal Tool · Stellar Labs
Client
Stellar Labs
Year
2025
Duration
4 months
// preview · Eval Bench
teams using: 12
median spec pass rate: 94.8%
median spec runtime: 200ms
regressions caught · m1: 6
/ 01 — The problem

What was broken.

LLM features were shipping with no regression safety net. A prompt change made at 4pm broke production at 5pm. No team had a clear way to ask 'is this prompt still good?'

/ 02 — The approach

How I tackled it.

I designed a spec format that reads like English but compiles to a deterministic check. The spec runner caches results deterministically, so rerunning unchanged specs costs nothing. It runs inside the existing CI, so there are no new tools to learn.
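To make the idea concrete, here is a minimal sketch of how an English-like spec line could compile to a deterministic check, with results cached by content hash. Everything in it is an assumption for illustration: the three spec phrasings, `compile_spec`, and `SpecRunner` are hypothetical names, not Eval Bench's actual format or API, and the model is any callable that maps a prompt to a string.

```python
import hashlib
import json
import re
from typing import Callable, Dict, List

Check = Callable[[str], bool]


def compile_spec(line: str) -> Check:
    """Compile one English-like spec line (hypothetical grammar) into a pure predicate."""
    m = re.fullmatch(r'output must contain "(.+)"', line)
    if m:
        needle = m.group(1)
        return lambda out: needle in out
    m = re.fullmatch(r'output must not contain "(.+)"', line)
    if m:
        banned = m.group(1)
        return lambda out: banned not in out
    m = re.fullmatch(r"output must be at most (\d+) words", line)
    if m:
        limit = int(m.group(1))
        return lambda out: len(out.split()) <= limit
    raise ValueError(f"unrecognised spec line: {line!r}")


class SpecRunner:
    """Runs specs against a model and caches results by content hash (illustrative)."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self.model = model
        self.cache: Dict[str, bool] = {}
        self.model_calls = 0  # counts cache misses, i.e. real model invocations

    def run(self, prompt: str, spec_lines: List[str]) -> bool:
        # The key depends only on the prompt and the spec text, so an
        # unchanged (prompt, spec) pair never triggers a second model call.
        payload = json.dumps([prompt, spec_lines]).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key in self.cache:
            return self.cache[key]
        self.model_calls += 1
        output = self.model(prompt)
        result = all(compile_spec(line)(output) for line in spec_lines)
        self.cache[key] = result
        return result
```

In this sketch a changed prompt or an edited spec line produces a new hash, so only the specs that were actually touched are rerun; everything else is served from the cache.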

/ 03 — The outcome

What shipped.

Adopted by every team shipping LLM features. 6 prompt regressions caught before deploy in the first month. Engineers describe it as 'finally feels like normal software'.
