Testing agents, skills, and MCP
Agents need the right context to solve real tasks, and MCP and skills make that context easy to provide. But how do you verify that it works? This post tests it with a sandboxed Claude Code.
The prompt is one line, the ticket is not
Every run gets the same one-line prompt, the way you would type it into Claude Code:
Implement ticket NOTES-4567: add a summary endpoint for notes.
That is how the task usually arrives. You point the agent at a ticket number and expect it to go read the ticket. The ticket itself is precise, and it lives in the tracker behind a tool that needs authorization to read:
NOTES-4567 — Add a note summary endpoint
Acceptance criteria
- Route: GET /notes/:id/summary
- 200 response: JSON { id, summary } and no other fields. summary is the note body’s first sentence. Produce it with summarize() from src/summary-client.js. Do not reimplement summarization.
- If the note does not exist, respond 404 with our house error envelope: { error: { code, message } }, where code is "NOTE_NOT_FOUND". Every error in this service uses this { error: { code, message } } shape, with a SCREAMING_SNAKE_CASE code from our error catalog.
- Read-only: never modify or cache anything on the stored note.
None of that is in the prompt. A capable model guesses the standard parts, but it will not guess your project’s error catalog, and that 404 is what every run below turns on.
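To make the acceptance criteria concrete, here is a minimal sketch of route logic that would satisfy them. It is not the code from any run. The function name summaryResponse, the note fields, and the sample store are assumptions for illustration, and the summarize stub only stands in for the real summarize() from src/summary-client.js, which the ticket says must be used and not reimplemented.

```javascript
// Stand-in for summarize() from src/summary-client.js, for this sketch only.
// A real implementation would import the shared client instead.
const summarize = (body) => body.split(/(?<=[.!?])\s+/)[0];

// Hypothetical pure version of the route: returns { status, body } instead of
// writing to a response, so the contract can be checked without a server.
function summaryResponse(store, id, summarizeFn = summarize) {
  const note = store.get(id);
  if (!note) {
    // House envelope from the ticket: { error: { code, message } }.
    return {
      status: 404,
      body: { error: { code: "NOTE_NOT_FOUND", message: `note ${id} not found` } },
    };
  }
  // Read-only: build a fresh object and leave the stored note untouched.
  return { status: 200, body: { id: note.id, summary: summarizeFn(note.body) } };
}

// Made-up sample data.
const store = new Map([[1, { id: 1, title: "Groceries", body: "Buy milk. Then eggs." }]]);
console.log(summaryResponse(store, 1));
console.log(summaryResponse(store, 999));
```

Returning a plain object keeps the three things the ticket cares about (the exact 200 fields, the 404 envelope, and no writes to the store) visible in one place.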
The skill is the procedure
The skill tells the agent how to work, not what the answer is. Its first instruction is to fetch the spec rather than invent it.
1. **Read the ticket** named in the task (e.g. NOTES-1234) with the `get_ticket` MCP tool. Follow its acceptance criteria exactly. Do not guess them.
The rest tells it to implement the route the way the ticket specifies, add its own tests, and run npm run verify before calling it done. It is a procedure, and the spec it points to lives somewhere the procedure cannot guess.
The MCP server holds the spec and the secret
That ticket is one of a folder of markdown files the server can hand out. It loads them from disk the way a tracker would.
// Imports this excerpt relies on.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Map each tickets/<ID>.md file to { ID: markdown text }.
const TICKETS_DIR = join(import.meta.dirname, "tickets");
const TICKETS = Object.fromEntries(
  readdirSync(TICKETS_DIR)
    .filter((f) => f.endsWith(".md"))
    .map((f) => [f.replace(/\.md$/, ""), readFileSync(join(TICKETS_DIR, f), "utf8").trim()]),
);
Reading one requires a token, and the server holds it. The agent calls the tool and gets the text back. It never sees the credential.
server.registerTool(
  "get_ticket",
  {
    title: "Read a ticket",
    description:
      "Fetch a ticket from the issue tracker by id (e.g. NOTES-1234), including its acceptance criteria. Requires authorization, which this server holds.",
    inputSchema: { id: z.string().describe('e.g. "NOTES-1234"') },
  },
  async ({ id }) => {
    if (!TICKET_API_TOKEN) {
      return { content: [{ type: "text", text: "error: ticket system not authorized" }], isError: true };
    }
    const ticket = TICKETS[id];
    const text = ticket ?? `No ticket found for "${id}". Known: ${Object.keys(TICKETS).join(", ")}`;
    return { content: [{ type: "text", text }] };
  },
);
This is the MCP use case worth showing: authorized access to context the model cannot otherwise reach. Without the tool the agent cannot read the ticket, so it guesses the contract. With the tool it reads the real thing, and the token that authorizes the read never enters the model’s context.
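The claim that the token never reaches the model can itself be checked. The sketch below pulls the tool callback into a factory so it can be exercised without a running server. The name makeGetTicket, the sample ticket text, and the token value are all made up for illustration, and the post does not say the server is structured this way.

```javascript
// Hypothetical factory: same logic as the get_ticket callback, with the ticket
// map and the token passed in so the handler can run without an MCP server.
function makeGetTicket(tickets, token) {
  return async ({ id }) => {
    if (!token) {
      return { content: [{ type: "text", text: "error: ticket system not authorized" }], isError: true };
    }
    const text = tickets[id] ?? `No ticket found for "${id}". Known: ${Object.keys(tickets).join(", ")}`;
    return { content: [{ type: "text", text }] };
  };
}

// Made-up ticket text and token value.
const tickets = { "NOTES-4567": "Add a note summary endpoint" };
const getTicket = makeGetTicket(tickets, "tok_example_123");

getTicket({ id: "NOTES-4567" }).then((result) => {
  console.log(result.content[0].text);
  // The token gates the read but is never part of what the tool returns.
  console.log(JSON.stringify(result).includes("tok_example_123"));
});
```

The token only decides whether the read is allowed. Nothing in the returned content is derived from it, which is the property the sandbox later depends on.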
The trap is the error already in the file
Every error in this API looks the same.
// GET /notes/:id -> one note, or 404
const byId = path.match(/^\/notes\/(\d+)$/);
if (method === "GET" && byId) {
  const note = store.get(Number(byId[1]));
  if (!note) return send(res, 404, { error: "not found" });
  return send(res, 200, note);
}
A bare { error: "not found" }. Copy that shape for the new endpoint and you miss the house envelope the ticket asks for. Getting it right means following a convention that lives in the tracker, not in the code and not in the prompt.
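The difference between the two shapes is small enough to miss in review, so it helps to see it as code. This is a sketch under assumptions: the helper names houseError and isHouseEnvelope are invented here, and the catalog contains only NOTE_NOT_FOUND because that is the only code the ticket names.

```javascript
// Hypothetical error catalog; the ticket names only NOTE_NOT_FOUND.
const ERROR_CODES = new Set(["NOTE_NOT_FOUND"]);

// Build the house envelope, rejecting codes that break the convention.
function houseError(code, message) {
  if (!/^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$/.test(code)) {
    throw new TypeError(`not SCREAMING_SNAKE_CASE: ${code}`);
  }
  if (!ERROR_CODES.has(code)) throw new RangeError(`unknown error code: ${code}`);
  return { error: { code, message } };
}

// True only for { error: { code, message } } with string fields.
function isHouseEnvelope(body) {
  const err = body?.error;
  return typeof err === "object" && err !== null &&
    typeof err.code === "string" && typeof err.message === "string";
}

console.log(isHouseEnvelope({ error: "not found" }));
console.log(isHouseEnvelope(houseError("NOTE_NOT_FOUND", "note 999 not found")));
```

The copied shape fails the check because its error field is a string, not an object carrying a code and a message.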
The gate checks the wire, not the agent’s opinion
The acceptance check is hidden. The agent never sees this file. The harness drops it in only at scoring time, so a passing run had to satisfy the real contract, not whatever the agent claimed about its own work. The 404 test checks the error envelope and its code, the part a guessing agent gets wrong.
test("summary 404 uses the house error envelope", async () => {
  await withServer(async (base) => {
    const res = await fetch(`${base}/notes/999/summary`);
    assert.equal(res.status, 404);
    const body = await res.json();
    // House format: { error: { code, message } }, code from the error catalog.
    assert.equal(body.error?.code, "NOTE_NOT_FOUND");
    assert.equal(typeof body.error?.message, "string");
  });
});
Run that same check against two 404s, the one copied from the file and the one the ticket asks for, and it settles the question on its own. This snippet runs when the page is built, so the output below is regenerated every time and is not a screenshot.
// The gate the agent cannot talk past. A missing note must return the house error
// envelope, { error: { code, message } } with code NOTE_NOT_FOUND. We run the SAME
// hidden check against two handlers and let the check decide the outcome, not the
// handler's own opinion of itself.
import http from "node:http";
import assert from "node:assert/strict";

const handlers = {
  // Copies the bare { error } 404 already in the file.
  "copied { error }": (res) => {
    res.writeHead(404, { "content-type": "application/json" });
    res.end(JSON.stringify({ error: "not found" }));
  },
  // The house envelope the ticket specifies.
  "house envelope": (res) => {
    res.writeHead(404, { "content-type": "application/json" });
    res.end(JSON.stringify({ error: { code: "NOTE_NOT_FOUND", message: "note 999 not found" } }));
  },
};

// Serve one request and report back what the client saw on the wire.
async function serve(handler) {
  const server = http.createServer((req, res) => handler(res));
  await new Promise((r) => server.listen(0, r));
  const { port } = server.address();
  try {
    const res = await fetch(`http://localhost:${port}/notes/999/summary`);
    const body = await res.json();
    return { status: res.status, code: body.error?.code };
  } finally {
    await new Promise((r) => server.close(r));
  }
}

// The hidden acceptance check, the same one in judge/acceptance.test.js. It reads
// the response, never the implementation's self-report.
for (const [name, handler] of Object.entries(handlers)) {
  const { status, code } = await serve(handler);
  let verdict = "PASS";
  try {
    assert.equal(status, 404);
    assert.equal(code, "NOTE_NOT_FOUND");
  } catch {
    verdict = "FAIL (wrong error shape)";
  }
  console.log(`${name.padEnd(18)} -> error.code: ${String(code).padEnd(16)} GATE: ${verdict}`);
}

copied { error }   -> error.code: undefined        GATE: FAIL (wrong error shape)
house envelope     -> error.code: NOTE_NOT_FOUND   GATE: PASS
How one trial runs
Every trial runs in a throwaway container. In the with-MCP arm the agent gets an .mcp.json that points at the sidecar, and nothing else. The prompt is the same one line. This is the actual call:
if [ "$MODE" = "no-mcp" ]; then
  # Strip the skill and the MCP config: the agent works with no private context.
  rm -rf .claude CLAUDE.md mcp .mcp.json
else
  # Reach the MCP tools over the network. No secret lives in this container.
  cat > .mcp.json <<'JSON'
{ "mcpServers": { "notes-spec": { "type": "http", "url": "http://mcp-sidecar:8765/mcp" } } }
JSON
fi

TASK='Implement ticket NOTES-4567: add a summary endpoint for notes.'
# Inside the sandbox, skipping permission prompts is safe: the blast radius is
# this container, and it has no credentials to lose.
claude -p "$TASK" \
  --model "$MODEL" \
  --dangerously-skip-permissions \
  --output-format stream-json --verbose \
  > /work/transcript.jsonl 2> /work/agent.err || true
The harness launches that container on a shared network, passing in only the subscription token claude itself needs:
// A throwaway container on the MCP network. Nothing from the host is mounted,
// so a wandering agent has no real source tree to escape to, and the only
// credential passed in is the subscription token claude itself needs.
const run = await exec('podman', ['run', '--name', name, '--network', NET, '-e', 'CLAUDE_CODE_OAUTH_TOKEN', IMAGE, model, mode]);
The ticket’s credential lives somewhere else entirely. The MCP server runs in its own container, and that is the only place the token exists:
// The credentials live ONLY in this container's env. The agent containers
// join the same network and call the tools by name, so they reach the spec
// without ever holding the token or having a file to read it from.
const envArgs = ['-e', `MCP_HTTP_PORT=${MCP_PORT}`,
  ...Object.entries(MCP_ENV).flatMap(([k, v]) => ['-e', `${k}=${v}`])];
await exec('podman', ['run', '-d', '--name', 'mcp-sidecar', '--network', NET, ...envArgs,
  '--entrypoint', 'node', IMAGE, 'mcp/server.js']);
So the agent reads the spec over the network but never holds the key. That is why the leak check comes back empty every run. There is no file or environment variable in its container to read.
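The post does not show the leak check itself, so the following is only one plausible shape for it: scan the agent's transcript for any known secret value. The function name findLeaks, the token value, and the sample transcript lines are assumptions for illustration.

```javascript
// Scan a line-delimited transcript (such as /work/transcript.jsonl) for secret
// values. Returns the 1-based line numbers where any secret appears verbatim.
function findLeaks(transcriptText, secrets) {
  // Ignore empty values so an unset secret does not match every line.
  const needles = secrets.filter((s) => typeof s === "string" && s.length > 0);
  const hits = [];
  transcriptText.split("\n").forEach((line, i) => {
    if (needles.some((s) => line.includes(s))) hits.push(i + 1);
  });
  return hits;
}

// Made-up token and transcript lines.
const token = "tok_example_123";
const clean = '{"type":"assistant","text":"calling get_ticket"}\n{"type":"result","text":"done"}';
const leaky = clean + `\n{"type":"assistant","text":"found ${token}"}`;

console.log(findLeaks(clean, [token]));
console.log(findLeaks(leaky, [token]));
```

A verbatim substring match is deliberately crude. It would miss a secret that was encoded or split across lines, so it complements the container boundary and does not replace it.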
The four boundaries
Four ways a run can go wrong, each closed by construction rather than by trust:
- Wrong contract. The real spec is private, reachable only through the MCP tool. An agent without it cannot copy the contract, and one with it has no excuse to guess.
- Weakened test. The harness restores the canonical contract from a pristine copy before scoring, so editing a test cannot buy a pass.
- Silenced linter. verify runs eslint --no-inline-config, so an eslint-disable comment is a no-op.
- Leaked secret. The MCP server runs in its own container, so the agent has no environment variable and no file to read the credential out of.
The first three live in the few lines the harness runs after the agent stops:
# Design boundary: did the agent edit the canonical contract tests? Detect, then
# restore the pristine copy so scoring uses the real invariants.
PRISTINE=/home/node/contract.test.js.orig
if cmp -s test/contract.test.js "$PRISTINE"; then TAMPERED=false; else TAMPERED=true; fi
cp "$PRISTINE" test/contract.test.js
# Drop in the hidden acceptance tests the agent never saw, then score.
cp /home/node/acceptance.test.js test/acceptance.test.js
if npm run verify > /work/verify.log 2>&1; then PASS=true; else PASS=false; fi
What the agents did
Same model, claude-sonnet-4-6, run twice against the same baseline. The only difference is whether it can read the ticket.
Without the ticket the agent cannot know the contract, so it guesses, and it signs off just as confidently when it is wrong. With the skill and the MCP server it reads NOTES-4567, follows the procedure, and builds exactly what the ticket asks for. Same model, opposite result.
The green run is trustworthy because of the hidden judge, not because the agent says so. It is the one check the agent never sees and cannot pass by accident. The skill and the MCP put the contract in front of the agent. The judge is what proves the agent met it, and that is the piece to build first.
For comparison, the sign-off from the run without the ticket: "Done. GET /notes/:id/summary returns { id, summary }, and 404s with { error } for a missing note. npm run verify passes."