A self-improving workbench for prompt engineering, RAG, synthetic data, and evals. One realistic task threads through every stage — orchestrated by an agent that explains its reasoning out loud.
// anchor task: classify support tickets by urgency
▶ eval prompt.v3 → accuracy 0.72 · tone 0.81 · p95 312ms
↳ regression on 14/200 examples in {billing, churn}
// orchestrator reasoning
decide → retrieval likely the bottleneck. switching naive → semantic chunking.
▶ rag.rebuild chunker=semantic k=6
▶ eval prompt.v3+rag.v2 → accuracy 0.89 (+0.17) ↑ graduating to v4
The pipeline
Each stage emits typed artifacts the next one consumes. Pick up at any node — the orchestrator routes around what you skip.
Version, diff, and auto-refine prompts with a critic LLM.
Chunk, embed, index. Inspect retrieval traces and grounding.
Generate training pairs from approved prompts. Curate inline.
Score accuracy, tone, latency. Flag regressions per commit.
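To make "typed artifacts" concrete, here is a minimal sketch of how one stage's output might be handed to the next, and how a skipped stage could be routed around. Every name in it (`PromptArtifact`, `IndexArtifact`, `EvalRequest`, `plan_eval`) is hypothetical and chosen for illustration; it is not the workbench's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptArtifact:
    """Hypothetical output of the prompt stage."""
    version: str
    template: str


@dataclass(frozen=True)
class IndexArtifact:
    """Hypothetical output of the RAG stage."""
    chunker: str
    k: int


@dataclass(frozen=True)
class EvalRequest:
    """Hypothetical input of the eval stage."""
    prompt: PromptArtifact
    index: Optional[IndexArtifact]  # None when the RAG stage was skipped

    @property
    def label(self) -> str:
        base = f"prompt.{self.prompt.version}"
        if self.index is None:
            return base
        return f"{base}+rag[{self.index.chunker},k={self.index.k}]"


def plan_eval(prompt: PromptArtifact,
              index: Optional[IndexArtifact] = None) -> EvalRequest:
    """Build the eval input from whichever upstream artifacts exist."""
    return EvalRequest(prompt=prompt, index=index)


prompt = PromptArtifact(version="v3", template="Classify urgency: {ticket}")
print(plan_eval(prompt).label)
print(plan_eval(prompt, IndexArtifact(chunker="semantic", k=6)).label)
```

Because the eval input declares the index as optional, skipping retrieval is a legal state of the type rather than a special case in the eval code.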
Orchestrator agent
Watches eval scores, picks the next technique, narrates the decision. The glue between stages.
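As a rough illustration of "picks the next technique, narrates the decision", the sketch below uses a fixed rule table in place of an agent. The thresholds, action names, and the `decide` function are assumptions made for this example; the real orchestrator may work quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    accuracy: float
    tone: float
    p95_ms: float


@dataclass(frozen=True)
class Decision:
    action: str   # hypothetical action name
    reason: str   # the narrated explanation


def decide(result: EvalResult,
           target_accuracy: float = 0.85,      # assumed target
           target_tone: float = 0.80,          # assumed target
           latency_budget_ms: float = 500.0    # assumed budget
           ) -> Decision:
    """Pick one next step and say why, checking accuracy first."""
    if result.accuracy < target_accuracy:
        return Decision(
            "rag.rebuild",
            f"accuracy {result.accuracy:.2f} is below {target_accuracy:.2f}; "
            "retrieval is the first suspect",
        )
    if result.tone < target_tone:
        return Decision(
            "prompt.refine",
            f"tone {result.tone:.2f} is below {target_tone:.2f}",
        )
    if result.p95_ms > latency_budget_ms:
        return Decision(
            "rag.reduce_k",
            f"p95 {result.p95_ms:.0f}ms exceeds {latency_budget_ms:.0f}ms",
        )
    return Decision("graduate", "all targets met")


before = decide(EvalResult(accuracy=0.72, tone=0.81, p95_ms=312))
after = decide(EvalResult(accuracy=0.89, tone=0.81, p95_ms=312))
print(before.action, "|", before.reason)
print(after.action, "|", after.reason)
```

Returning the reason alongside the action is what keeps the decision observable: the narration is data that can be logged and reviewed, not a side effect.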
Design principles
Versioned prompts. Typed artifacts. Eval suites that block bad regressions. The agentic parts are observable, not magic.
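One way an eval suite could block regressions is to compare per-example results between a baseline and a candidate, then fail the gate when too many previously passing examples now fail. The sketch below assumes results are stored as a mapping from example ID to pass/fail; `find_regressions`, `gate`, and the `max_regressions` tolerance are hypothetical names, not part of the workbench.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool],
                     candidate: Dict[str, bool]) -> List[str]:
    """Return IDs of examples that passed under baseline but not candidate.

    An example missing from the candidate run counts as a failure.
    """
    return sorted(
        example_id
        for example_id, passed in baseline.items()
        if passed and not candidate.get(example_id, False)
    )


def gate(baseline: Dict[str, bool],
         candidate: Dict[str, bool],
         max_regressions: int = 0) -> bool:
    """True if the candidate may ship, False if the gate blocks it."""
    return len(find_regressions(baseline, candidate)) <= max_regressions


baseline = {"t1": True, "t2": True, "t3": False}
candidate = {"t1": True, "t2": False, "t3": True}

print(find_regressions(baseline, candidate))
print(gate(baseline, candidate))
```

Note that the candidate here fixes `t3` while breaking `t2`, so aggregate accuracy is unchanged; a per-example comparison catches what a single headline number would hide.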
Under the hood
Boring foundations on purpose. The interesting parts live in the orchestrator.
Runtime
AI
Data
Infra
Run the whole loop — prompt → retrieval → eval — and let the orchestrator tell you what to try next.