tested.dev / v0

Your agent ships faster when it can test itself.

The testing layer for AI-native teams. Feedback loop for the agent. Gate for the PR.

bare prompt closes 26%. with tested.dev: 100%, 2× faster than iterating without coverage data.

by jorge modesto · 10y shipping software · built agents + mcp tooling in production

$tested diff --base origin/main

patch42.7%·project88.1%

see the code on github →

first cohort — 50 spots — no spam, ever

runs locally · no creds stored · source visible on green-light

works with

Claude Code
Cursor
Codex
Cline
Aider
Continue

01 / the bench

Every layer pays for itself.

Four-arm ablation, 13 TypeScript fixtures, N=3 trials — 156 test-closure attempts against DeepSeek V4 Pro. Each arm adds one capability over the previous. All three deltas separate at 95% Wilson CI.

bare prompt, one-shot

25.6%

no iteration · no coverage signal · no tools10/39 · 95% CI [14.6–41.1]

+46pp·iteration alone

bare prompt, multi-turn

71.8%

model reads stderr, retries · still no coverage signal28/39 · 95% CI [56.2–83.5]

+15pp·coverage data

raw diff (CLI)

87.2%

patch-scoped coverage JSON fed back to the model34/39 · 95% CI [73.3–94.4]

+13pp·MCP tooling

MCP tool loop

100.0%

write_and_verify · per-write coverage feedback39/39 · 95% CI [91.0–100.0]

per fixture · 13 × 4 · closures of 3 trials

fixture

async-crud-service

cron-parser

event-bus

json-patch-applier

lru-cache

markdown-link-extractorcanary

multi-file-controller

rate-limiter

retry-with-backoff

state-machine

string-diff

url-router

util-math

markdown-link-extractoris the canary — a regex over nested brackets. A and A* never close it. B closes it 1/3. Only the MCP loop closes it consistently. The wedge isn't an average; it's real on the hard fixtures.

model

deepseek-v4-pro

+ V4 Flash cross-model: same A* < B < C ordering

total spend

$4.79

186 cells · 19m wall-clock

caveat

mutation kill N=1

A* 90 > C 86 > B 83 — suggestive, not significant

reproduce

pnpm bench

github.com/tested-hq/evals →

02 / the problem

AI codes faster than humans can verify it.

Your agent can't see what it didn't test.

No structured coverage signal — so your agent guesses what needs tests, ships incomplete PRs, and you catch it in review. The loop never closes because the agent never had the signal to begin with.

Your team's standards live in your head.

You care about coverage. Half your team doesn't. PRs ship without tests, badges stay green, debt compounds. Without an enforceable gate, "we test our code" is a culture line, not a guarantee.

100% coverage isn't 100% trust.

Line coverage doesn't catch `expect(true).toBe(true)`. Mutation testing does. Smell scanning does. Both are missing from every tool your agent can actually call.

03 / the answer

Two surfaces over the same signal.

Agent reads it.

MCP server + JSON CLI. Coverage data lands in the same tool turn the agent calls it. Patch coverage in 30ms. Your agent writes the test that closes the gap and verifies the result before opening the PR.

> coverage.get_uncovered_diff

{
  "files": [
    {
      "path": "src/auth.ts",
      "ranges": [
        { "start": 14, "end": 22, "kind": "line" }
      ]
    }
  ]
}

$pnpm dlx@tested/cliinit# on public release

private alpha

Team enforces it.

GitHub App posts patch + project coverage on every PR. Configurable gate blocks merge under threshold. Your senior taste becomes everyone's gate — including the agent's. Standards stop being a culture line.

Quality, not vanity.

Mutation testing built in, not a paid add-on. Smell scan catches fake assertions. The agent gets told "this test passes but verifies nothing" and fixes it before opening the PR.

04 / what's next

The roadmap, on the record.

tested fix

Agent writes the missing test, verifies it covers, opens the PR. You review only the test.

Remote test execution

Opt-in: tested.dev runs your suite on isolated sandboxes. Agent doesn't block your laptop. CI in the loop, not after.

GitHub App

Patch + project coverage on every PR. Configurable gate blocks merge under threshold. Standards become non-negotiable.

Mutation testing GA

Stryker integration. Agent fixes surviving mutants. Coverage stops being a proxy for test quality and becomes proof of it.

Smell catalog

`expect(true)`, missing await, never-thrown errors, mocked-everything. Surfaced as tool calls the agent fixes itself.

[ first cohort shapes the order. ship feedback that lands in v1, get founder pricing for life. ]

05 / who built this

Built in public.
Because the tool doesn't exist yet.

“Tests pass. Production breaks. Engineers end up slower chasing bugs and validating things manually. Users get broken software.

Having a solid CI/CD pipeline flips it.

Now agents amplify the speed and the damage. They only ship fast when the feedback loop works.

— Jorge Modesto

10 years writing software

CLI · source open →

MCP server · source open →

Eval harness · source open →

Build log · live →

Get early access.
50 spots. First cohort. Free feedback loop.

06 / questions

FAQ

when does it launch?

Alpha is rolling now to the first 50. CLI + MCP + GitHub App land first. Hosted dashboard follows once the gate is real.

is it free during alpha?

Yes. Free for the first cohort, no card required. We will tell you what changes before it changes.

what stacks do you support today?

TypeScript and Python first — Vitest, Jest, pytest. Go and Ruby follow. If your stack is missing, email me and you skip the queue.

what about my code privacy?

Local-first. CLI runs in your shell, MCP runs in your editor. Coverage data and diffs stay on your machine. The GitHub App reads PR metadata — nothing leaves your repo except the threshold verdict.