Postman · Workspace Context CLI · Evaluation Series

How should a coding agent get API context?

What was tested

The product question behind wcc (Workspace Context CLI): when an AI coding agent needs to write working integration code against an unfamiliar API, what's the best way to hand it the API's context? Each condition below is one answer to that question, run over the same task corpus and scored identically.

The corpus — 22 APIs, T2 tasks

Every task: write a Python CLI (argparse + requests) that chains 5–6 real API calls, where outputs of earlier calls feed later ones (create customer → create product → create price → subscribe → fetch). Tiers grade task complexity:

tier	shape	this report
T1	3 independent calls — simple create/retrieve	not run
T2	5–6 chained calls with data threading between steps	✓ all results here
T3	discovery flows — filter/search, then act on results	not run
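To make the T2 shape concrete, here is a minimal sketch of a chained flow in which each call consumes an ID returned by an earlier one. The base URL, paths, and payload fields are hypothetical placeholders rather than any corpus API, and `session` is assumed to behave like a `requests.Session`.

```python
def run_chain(session, base_url):
    """Create customer -> product -> price -> subscription, then fetch it.

    `session` is assumed to expose request(method, url, json=...) the way
    requests.Session does. All paths and field names are illustrative.
    """
    def call(method, path, **payload):
        resp = session.request(method, base_url + path, json=payload or None)
        resp.raise_for_status()  # surface API errors instead of continuing
        return resp.json()

    customer = call("POST", "/customers", name="Ada")
    product = call("POST", "/products", name="Pro plan")
    # Data threading: the price needs the product ID from the previous call.
    price = call("POST", "/prices", product=product["id"], unit_amount=1000)
    # The subscription needs both the customer ID and the price ID.
    sub = call("POST", "/subscriptions",
               customer=customer["id"], price=price["id"])
    # Final step: fetch the resource that the chain produced.
    return call("GET", f"/subscriptions/{sub['id']}")
```

With `requests` installed, this could be driven by `run_chain(requests.Session(), "https://api.example.com")`. A generated script would additionally wrap each step in an `argparse` subcommand.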

Why these 22 (and not more)

The eval needs every condition to compete on the same API. That requires each API to have both an OpenAPI spec (for full-spec) and a provisioned Postman workspace (for the task-brief-compiler conditions). The corpus is that intersection, restricted to T2: 22 APIs — a mix of household names (Stripe, Slack, Datadog) and long-tail APIs the models can't lean on memory for (Apacta, GSMtasks, PayRun.io, Big Red Cloud).

In total, 49 workspaces are provisioned, and T1/T3 task sets exist for many of them. Expanding the matrix (more APIs × more tiers × repeat runs) is a cost decision, not a tooling one: each real-agent condition costs ~$5–7 per 22-task sweep.

Scoring: an LLM extractor lists the HTTP calls the generated script makes, and these are matched against the task's expected sequence (method + normalized path). Precision is the share of calls the code makes that are right. Recall is the share of expected calls that appear in the code.
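The precision and recall definitions can be sketched as follows. This is an illustration under stated assumptions, not the harness's scorer: the normalization rule (templated or digit-bearing path segments collapse to `{id}`) and the order-insensitive matching are guesses, and the example paths are made up.

```python
import re

VERSION = re.compile(r"^v\d+$")  # keep segments like "v1" as-is

def normalize(method, path):
    """Upper-case the method; collapse ID-like path segments to {id}."""
    parts = []
    for seg in path.split("?")[0].strip("/").split("/"):
        # Assumption: templated segments and segments containing a digit
        # (other than version prefixes) are resource identifiers.
        if seg.startswith("{") or (re.search(r"\d", seg)
                                   and not VERSION.match(seg)):
            seg = "{id}"
        parts.append(seg)
    return method.upper(), "/" + "/".join(parts)

def score(made, expected):
    """Return (precision, recall) over normalized (method, path) pairs."""
    made_n = [normalize(m, p) for m, p in made]
    expected_n = [normalize(m, p) for m, p in expected]
    correct = sum(1 for call in made_n if call in expected_n)
    found = sum(1 for call in expected_n if call in made_n)
    precision = correct / len(made_n) if made_n else 0.0
    recall = found / len(expected_n) if expected_n else 0.0
    return precision, recall

# Illustrative data: five expected calls, six extracted calls (one extra).
expected = [
    ("POST", "/v1/customers"),
    ("POST", "/v1/products"),
    ("POST", "/v1/prices"),
    ("POST", "/v1/subscriptions"),
    ("GET", "/v1/subscriptions/{subscription_id}"),
]
made = [
    ("post", "/v1/customers"),
    ("POST", "/v1/products"),
    ("POST", "/v1/prices"),
    ("POST", "/v1/subscriptions"),
    ("GET", "/v1/subscriptions/sub_123"),
    ("GET", "/v1/invoices"),  # not in the expected sequence
]
precision, recall = score(made, expected)
```

In this made-up case the extra `GET /v1/invoices` call lowers precision to 5/6 while recall stays at 1.0, since every expected call appears.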

Conditions at a glance

All figures are per-task means over the 22 APIs: an entry of 358,727 tokens is the average for one task, so multiply by 22 for the sweep total. Token columns use the glossary below; raw vendor numbers are not comparable across OpenAI and Anthropic without this normalization.

uncached tok: Tokens billed at the full input/output rate: input the model hadn't seen before, plus everything it generated. This is the "real work" number.

cached tok: Input re-read from the prompt cache at a roughly 4–10× discount on the input rates listed under Pricing ($0.175–0.50/M). Agentic loops re-send their whole history every round, and caching is why that's affordable.

total tok: uncached + cached = everything the model processed. Codex reports cache reads inside its input count; Claude reports them separately. Both are normalized to this split.
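The normalization described in the glossary can be sketched like this. The usage-dict field names are assumptions chosen for illustration and may not match what either tool emits. How the report counts cache-write tokens is not stated, so the sketch leaves them out.

```python
def normalize_usage(vendor, usage):
    """Map a vendor-style usage dict to the uncached/cached/total split.

    Field names below are hypothetical stand-ins for each tool's report.
    """
    output = usage["output_tokens"]
    if vendor == "codex":
        # Cache reads are counted inside the input figure, so subtract them
        # to recover the fresh (full-price) input.
        cached = usage["cached_input_tokens"]
        fresh_input = usage["input_tokens"] - cached
    elif vendor == "claude":
        # Cache reads are reported separately from fresh input.
        cached = usage["cache_read_input_tokens"]
        fresh_input = usage["input_tokens"]
    else:
        raise ValueError(f"unknown vendor: {vendor}")
    uncached = fresh_input + output  # fresh input plus everything generated
    return {"uncached": uncached, "cached": cached,
            "total": uncached + cached}

# Made-up numbers: the same workload as each vendor might report it.
codex = normalize_usage("codex", {
    "input_tokens": 1000, "cached_input_tokens": 800, "output_tokens": 50})
claude = normalize_usage("claude", {
    "input_tokens": 200, "cache_read_input_tokens": 800, "output_tokens": 50})
```

Both made-up reports normalize to the same split (250 uncached, 800 cached, 1,050 total), which is the point of the normalization: the raw input counts (1,000 vs 200) are not comparable on their own.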

The money chart

Each condition plotted by fresh (uncached) tokens per task (x-axis) against endpoint precision (y-axis); bubble size = total tokens processed. Up and to the left is better — more accurate and leaner (less context).

Accuracy vs context size (fresh tokens/task)

The task-brief-compiler conditions (codex, claude) sit up and to the left: precision at or slightly above the full-spec baseline, on a fraction of its fresh context. The x-axis is fresh tokens rather than cost, so conditions are comparable across models.

Accuracy

Endpoint precision and recall, mean over the corpus.

Precision & recall by condition

A wash at the top — every condition lands 92–95%. The task-brief-compiler drivers edge out the full-spec baseline on precision while working from a compact brief instead of the whole spec.

Token anatomy

Where the tokens actually go. Headline totals mislead — the agentic conditions are mostly cached re-reads of their own accumulated context. The uncached slice (full-price work) is remarkably similar across very different setups.

Uncached vs cached tokens per task (mean)

Working from the compact task brief, the real agents process only ~26K (codex) / ~45K (claude) total tokens per task — mostly cache reads. Uncached work is a small slice everywhere; one-shot conditions have no cache at all.

Context size

Fresh (uncached) tokens per task

Fresh input tokens the model must process per task — the context footprint, independent of model price. The full spec dwarfs the compiled brief.

Per-company precision matrix

All 22 APIs × all conditions. Datadog and Sentry fail every condition identically: scoring/spec-shape issues, not retrieval failures.

The prompts

The exact prompt templates each condition runs with (placeholders in {braces} are filled per task). These are the conditions' real definitions — differences in results trace back to differences here.

1 · full-spec: single prompt, whole spec inlined (gpt-4.1 · 1 round-trip)

The entire OpenAPI spec (often 100s of KB) is pasted into one prompt. No retrieval, no tools.

You are a Python developer. Build a CLI tool using `argparse` and `requests` for the {company} API.

The CLI should have exactly {N} subcommands, one for each of these operations:
  1. `{method}-1` — {step description}
  2. ...

Requirements:
1. Each subcommand maps to ONE API operation from the list above
2. Use `argparse` with subparsers
3. Use only `requests` for HTTP calls (no SDK)
4. Include proper authentication handling (API key via environment variable or --api-key flag)
5. Each subcommand should accept the arguments needed for its operation
6. Include error handling for API errors
7. Print results as formatted JSON

Here is the OpenAPI specification for the {company} API:

{FULL OPENAPI SPEC — inlined, can be 100K+ tokens}

Use the correct endpoints from this specification for each subcommand. Return your code inside a single ```python code fence.

What this means

Harness

Custom TypeScript harness (api-evaluator) — not promptfoo/braintrust/inspect. One runner per condition. Data pipeline: result.json → evals/v3/eval-results.db (canonical store, build-eval-db.ts --ingest) → report-data.json (build-report-data.ts, SQL) → this report. Raw artifact trees live on the archive/wcc-eval-2026-06 branch.

Runs & variance

full-spec: deterministic (1 run). Real agents (codex, claude): 1 run each (stochastic — treat single-company deltas as noise; means are the signal).

Pricing /1M tokens

gpt-4.1 $2 in / $0.50 cached / $8 out · gpt-4.1-mini $0.40 in / $1.60 out · gpt-5.3-codex $1.75 in / $0.175 cached / $14 out (reasoning tokens bill as output) · Sonnet 4.6 $3 in / $15 out (+cache-write premium).
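As a rough illustration of how these rates turn token counts into dollars, the sketch below prices one task from its fresh-input, cached, and output tokens. Only the two models with all three rates listed are included. The token counts in the example are invented for the arithmetic and are not results from this eval.

```python
# USD per 1M tokens, copied from the pricing line above.
RATES = {
    "gpt-4.1": {"input": 2.00, "cached": 0.50, "output": 8.00},
    "gpt-5.3-codex": {"input": 1.75, "cached": 0.175, "output": 14.00},
}

def task_cost(model, fresh_input, cached, output):
    """Dollar cost of one task from its three token buckets."""
    r = RATES[model]
    return (fresh_input * r["input"]
            + cached * r["cached"]
            + output * r["output"]) / 1_000_000

# Hypothetical token counts, chosen only to show the calculation.
example = task_cost("gpt-5.3-codex",
                    fresh_input=10_000, cached=100_000, output=2_000)
print(f"${example:.4f} per task, ${example * 22:.2f} per 22-task sweep")
```

The glossary's "uncached" figure combines fresh input and output, but the two bill at different rates, so a cost estimate needs them as separate inputs.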

Sandboxing

Real agents run isolated from local installs: CODEX_HOME=/tmp/eval-codex, CLAUDE_CONFIG_DIR=/tmp/eval-claude-home. Claude Code uses org OAuth (managed device policy rejects API keys); Codex uses a stored API-key login in its sandbox home.
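The isolation amounts to launching each agent with an overridden config-home environment variable. A minimal Python sketch of that idea follows; the actual harness is TypeScript, and the child process here is a stand-in that only echoes the variable, not a real agent invocation.

```python
import os
import subprocess
import sys

# Environment overrides taken from the paragraph above.
SANDBOX = {
    "codex": {"CODEX_HOME": "/tmp/eval-codex"},
    "claude": {"CLAUDE_CONFIG_DIR": "/tmp/eval-claude-home"},
}

def sandbox_env(agent):
    """Copy the current environment and apply the agent's overrides."""
    env = dict(os.environ)
    env.update(SANDBOX[agent])
    return env

# Stand-in child process: prints the variable it sees, to show that the
# override reaches the subprocess without touching the parent environment.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CODEX_HOME'])"],
    env=sandbox_env("codex"),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```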