Postman · Workspace Context CLI · Evaluation Series

How should a coding agent
get API context?

00

What was tested

The product question behind wcc (Workspace Context CLI): when an AI coding agent needs to write working integration code against an unfamiliar API, what's the best way to hand it the API's context? Each condition below is one answer to that question, run over the same task corpus and scored identically.

The corpus — 22 APIs, T2 tasks

Every task: write a Python CLI (argparse + requests) that chains 5–6 real API calls, where outputs of earlier calls feed later ones (create customer → create product → create price → subscribe → fetch). Tiers grade task complexity:
tier | shape | this report
T1 | 3 independent calls: simple create/retrieve | not run
T2 | 5–6 chained calls with data threading between steps | ✓ all results here
T3 | discovery flows: filter/search, then act on results | not run
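
To make the T2 shape concrete, here is a minimal sketch of a five-subcommand CLI in which IDs returned by earlier calls are passed into later ones. The subcommand names, paths, and body fields are illustrative assumptions, not taken from any API in the corpus, and the actual HTTP send via `requests` is left out so the sketch runs offline.

```python
import argparse


def build_parser():
    # One subcommand per API operation, as the tasks require.
    parser = argparse.ArgumentParser(prog="billing-cli")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("create-customer").add_argument("--email", required=True)
    sub.add_parser("create-product").add_argument("--name", required=True)
    price = sub.add_parser("create-price")
    price.add_argument("--product-id", required=True)   # threaded from create-product
    price.add_argument("--unit-amount", type=int, required=True)
    subscribe = sub.add_parser("subscribe")
    subscribe.add_argument("--customer-id", required=True)  # threaded from create-customer
    subscribe.add_argument("--price-id", required=True)     # threaded from create-price
    sub.add_parser("fetch-subscription").add_argument("--subscription-id", required=True)
    return parser


def to_request(args):
    """Map parsed arguments to a (method, path, body) triple. Paths are hypothetical."""
    if args.command == "create-customer":
        return ("POST", "/customers", {"email": args.email})
    if args.command == "create-product":
        return ("POST", "/products", {"name": args.name})
    if args.command == "create-price":
        return ("POST", "/prices",
                {"product": args.product_id, "unit_amount": args.unit_amount})
    if args.command == "subscribe":
        return ("POST", "/subscriptions",
                {"customer": args.customer_id, "price": args.price_id})
    return ("GET", f"/subscriptions/{args.subscription_id}", None)
```

The `--product-id`, `--customer-id`, `--price-id`, and `--subscription-id` flags carry the data threading: each takes a value printed by an earlier subcommand.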

Why these 22 (and not more)

The eval needs every condition to compete on the same API. That requires each API to have both an OpenAPI spec (for full-spec / task-brief) and a provisioned Postman workspace (for the wcc conditions). The corpus is that intersection, restricted to T2: 22 APIs — a mix of household names (Stripe, Slack, Datadog) and long-tail APIs the models can't lean on memory for (Apacta, GSMtasks, PayRun.io, Big Red Cloud).

49 workspaces are provisioned in total; T1/T3 task sets exist for many of these. Expanding the matrix (more APIs × more tiers × repeat runs) is a cost decision, not a tooling one — each real-agent condition costs ~$5–7 per 22-task sweep.

Scoring: an LLM extractor lists the HTTP calls the generated script makes; these match against the task's expected sequence (method + normalized path). Precision = of the calls the code makes, how many are right. Recall = of the expected calls, how many appear.
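
The scoring rule above can be sketched as follows. The normalization here is an assumption (strip query strings and trailing slashes, and collapse `{param}` templates and digit-bearing ID segments to a placeholder); the report does not spell out the harness's exact rules, and the LLM extraction step is replaced by a plain list of calls.

```python
import re

VERSION_SEGMENT = re.compile(r"^v\d+$")


def normalize_path(path):
    """Assumed normalization: drop query and trailing slash, collapse IDs to {}."""
    path = path.split("?", 1)[0].rstrip("/")
    segments = []
    for seg in path.split("/"):
        if seg.startswith("{") and seg.endswith("}"):
            segments.append("{}")            # templated parameter such as {id}
        elif any(ch.isdigit() for ch in seg) and not VERSION_SEGMENT.match(seg):
            segments.append("{}")            # concrete ID such as sub_123
        else:
            segments.append(seg.lower())
    return "/".join(segments)


def score(extracted, expected):
    """Return (precision, recall) over (method, normalized path) pairs."""
    made = [(m.upper(), normalize_path(p)) for m, p in extracted]
    want = [(m.upper(), normalize_path(p)) for m, p in expected]
    want_set, made_set = set(want), set(made)
    precision = sum(c in want_set for c in made) / len(made) if made else 0.0
    recall = sum(c in made_set for c in want) / len(want) if want else 0.0
    return precision, recall


# Illustrative data only: four extracted calls against five expected ones.
expected = [("POST", "/v1/customers"), ("POST", "/v1/products"), ("POST", "/v1/prices"),
            ("POST", "/v1/subscriptions"), ("GET", "/v1/subscriptions/{id}")]
extracted = [("post", "/v1/customers/"), ("POST", "/v1/products"),
             ("POST", "/v1/plans"), ("GET", "/v1/subscriptions/sub_123")]
precision, recall = score(extracted, expected)  # 3 of 4 right, 3 of 5 found
```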
01

Conditions at a glance

All figures are per-task means over the 22 APIs: a 358,727-token entry is the average per task, so multiply by 22 for the sweep total. Token columns use the glossary below; raw vendor numbers are not comparable across OpenAI and Anthropic without this normalization.

uncached tok: tokens billed at the full input/output rate, i.e. input the model hadn't seen before plus everything it generated. This is the "real work" number.
cached tok: input re-read from prompt cache at a ~4–10× discount on the input rate ($0.175–0.50/M). Agentic loops re-send their whole history every round; caching is why that's affordable.
total tok: uncached + cached = everything the model processed. Codex reports cache reads inside its input count; Claude reports them separately. Both are normalized to this split.
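
A minimal sketch of that normalization, assuming simplified usage dictionaries. The field names are hypothetical stand-ins for what each agent reports, and the numbers are invented to show that the two reporting styles collapse to the same split.

```python
def normalize_codex(usage):
    """Codex-style usage: cache reads are counted inside input_tokens."""
    cached = usage["cached_input_tokens"]
    uncached = (usage["input_tokens"] - cached) + usage["output_tokens"]
    return {"uncached": uncached, "cached": cached, "total": uncached + cached}


def normalize_claude(usage):
    """Claude-style usage: cache reads are reported separately from input_tokens."""
    cached = usage["cache_read_input_tokens"]
    uncached = usage["input_tokens"] + usage["output_tokens"]
    return {"uncached": uncached, "cached": cached, "total": uncached + cached}


# Invented numbers: the same workload reported in each vendor's style.
codex_split = normalize_codex(
    {"input_tokens": 100_000, "cached_input_tokens": 80_000, "output_tokens": 5_000})
claude_split = normalize_claude(
    {"input_tokens": 20_000, "cache_read_input_tokens": 80_000, "output_tokens": 5_000})
```

Both calls yield 25,000 uncached, 80,000 cached, and 105,000 total tokens, which is what makes the token columns comparable across vendors.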
02

The money chart

Every condition placed by client cost per task (log scale) against endpoint precision. Bubble area = total tokens processed. Up and left is better.

Precision vs cost per task

task-brief dominates on economics at equal precision. The real agents cluster at ~90% precision but at roughly 20× the cost.
03

Accuracy

Endpoint precision and recall, mean over the corpus.

Precision & recall by condition

A wash: every condition lands at 88–92%. (An earlier Codex run on the older gpt-4.1 model scored 79%; switching to Codex's native gpt-5.3-codex closed the gap entirely on the same harness. That run is omitted here; see RESULTS.md.)
04

Token anatomy

Where the tokens actually go. Headline totals mislead — the agentic conditions are mostly cached re-reads of their own accumulated context. The uncached slice (full-price work) is remarkably similar across very different setups.

Uncached vs cached tokens per task (mean)

Codex processes ~2.2× as many total tokens as Claude Code (359K vs 163K); in both, ~88% are cache reads. Uncached work is ~19–39K tokens in every agentic condition; one-shot conditions have no cache at all.
05

Cost

Client cost per task (log scale)

Cache-aware. task-brief's one-time backend compile (Postman-side, gpt-4.1-mini) shown as the pale stack.

task-brief amortization

The backend compile is one-time per task; reuse drives effective cost toward the client-only floor. Real agents pay full freight every run.
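
The amortization argument reduces to one line of arithmetic, sketched here. The dollar figures are placeholders chosen for illustration, not measured values from this report.

```python
def effective_cost(client_cost, compile_cost, runs):
    """Per-run cost when a one-time compile is spread over `runs` reuses of the brief."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return client_cost + compile_cost / runs


# Placeholder prices: $0.01 client generation, $0.04 one-time backend compile.
first_run = effective_cost(0.01, 0.04, 1)       # full compile cost lands on one run
after_ten = effective_cost(0.01, 0.04, 10)      # compile share shrinks tenfold
after_hundred = effective_cost(0.01, 0.04, 100) # approaches the client-only floor
```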
06

Per-company precision matrix

All 22 APIs × all conditions. Datadog and Sentry fail every condition identically; these are scoring/spec-shape issues, not retrieval failures.

07

Real-agent matchups

Current frontier models (gpt-5.3-codex vs Sonnet 4.6), identical tasks, both sandboxed. Two questions: which agent, and — for the same agent — which context surface: read the pulled folder directly (wcc v0.1.3 pull) or query the 16 structured retrieval subcommands (wcc v0.1.2)?

Cost per company — Codex 5.3 vs Claude Code (folder surface), precision deltas marked ▲

Dead heat on accuracy (91% vs 89%), with Codex ~28% cheaper ($0.222 vs $0.307/task): its tokens are mostly $0.175/M cache reads, while Claude's leaner stream bills output at Sonnet's $15/M. Claude is steadier; Codex spikes on big workspaces (Datadog: 1.34M tokens).
08

The prompts

The exact prompt templates each condition runs with (placeholders in {braces} are filled per task). These are the conditions' real definitions — differences in results trace back to differences here.

1 · full-spec: single prompt, whole spec inlined
gpt-4.1 · 1 round-trip
The entire OpenAPI spec (often 100s of KB) is pasted into one prompt. No retrieval, no tools.
You are a Python developer. Build a CLI tool using `argparse` and `requests` for the {company} API.

The CLI should have exactly {N} subcommands, one for each of these operations:
  1. `{method}-1` — {step description}
  2. ...

Requirements:
1. Each subcommand maps to ONE API operation from the list above
2. Use `argparse` with subparsers
3. Use only `requests` for HTTP calls (no SDK)
4. Include proper authentication handling (API key via environment variable or --api-key flag)
5. Each subcommand should accept the arguments needed for its operation
6. Include error handling for API errors
7. Print results as formatted JSON

Here is the OpenAPI specification for the {company} API:

{FULL OPENAPI SPEC — inlined, can be 100K+ tokens}

Use the correct endpoints from this specification for each subcommand. Return your code inside a single ```python code fence.

2 · task-brief: backend compiler + client generator
gpt-4.1-mini compile · gpt-4.1 generate
Step 1 (backend, one-time): gpt-4.1-mini reads the full spec + task and compiles a ~1.5K-token brief. Step 2 (client): gpt-4.1 generates code from the brief alone.
— COMPILER (backend, gpt-4.1-mini) —
You are an API context compiler. Your job is to read an OpenAPI spec and a task description, then produce a minimal TASK BRIEF that contains ONLY what an AI coding agent needs to complete this specific task.

TASK: {task}

The task brief must include:
1. AUTH — the exact authentication method, headers, and setup steps
2. BASE URL — the correct base URL for API calls
3. STEPS — for each step in the task, provide:
   - The exact HTTP method and path (e.g., POST /v2/customers)
   - Required parameters with names, types, and where they go (path, query, body)
   - What the response returns (specifically the fields needed by subsequent steps)
   - Data threading: which field from a previous step's response feeds into this step
4. GOTCHAS — non-obvious things that will cause first-call failures:
   - Unusual content types (form-encoded instead of JSON, line-delimited JSON, etc.)
   - Unusual parameter formats (cents vs dollars, full URIs vs plain IDs, PascalCase, etc.)
   - Required fields that aren't obvious from the path
   - Auth quirks (Token vs Bearer prefix, API key in body vs header, etc.)

Rules:
- Include ONLY endpoints needed for this specific task. Not the whole API.
- Be precise: exact paths, exact parameter names, exact header values.
- Keep it under 1500 tokens. Brevity is critical — every extra token is wasted.
- Use a compact structured format, not prose.
- If the spec has multiple endpoints that could work, pick the most standard one:
  - Prefer native CRUD endpoints over SCIM, provisioning, or enterprise SSO endpoints.
  - Prefer the simplest path structure (e.g., /teams over /scim/v2/Groups for team creation).
  - When both v1 and v2 exist for the same resource, prefer v1 unless v2 is clearly the replacement.
  - Prefer PUT for full updates, PATCH for partial updates. Match the HTTP method to the task.
  - Do NOT add extra steps beyond what the task describes.

Here is the OpenAPI specification:
{FULL SPEC}

Produce the task brief now. Use this exact format:
TASK BRIEF: [task summary] ([company])
AUTH / BASE URL / STEPS (method, path, params, returns, threading) / GOTCHAS

— GENERATOR (client, gpt-4.1) —
You are a Python developer. Build a CLI tool using `argparse` and `requests` for the {company} API.
[same CLI requirements as full-spec]
Here is the task brief for the {company} API — it contains the exact endpoints, auth, parameters, and gotchas you need:
{TASK BRIEF}
Use ONLY the endpoints from this task brief. Return your code inside a single ```python code fence.

3 · wcc 16-tool: harness loop, run_wcc function tool
gpt-4.1 · OpenAI function calling · wcc v0.1.2
Our harness exposes one function tool, run_wcc(args), that executes the wcc CLI's 16 retrieval subcommands. ~9 round-trips typical.
— SYSTEM —
You are a Python developer building a CLI for the {company} API using `argparse` and `requests`.

You do NOT have the API spec. Instead you have a tool, run_wcc, that retrieves API context from the API's Postman workspace. You MUST verify the actual endpoints (method + path), parameters, auth, and request bodies via run_wcc before writing code — do NOT rely on prior knowledge of this API, which may be wrong or outdated. Read only what's relevant to the task; you don't need to read everything.

{WCC COMMAND REFERENCE — overview, list-folders, list-endpoints, find-endpoints,
 get-endpoint, list-examples, get-example, list-envs, get-env, auth, list-specs,
 spec-overview, spec-list-operations, spec-get-operation, spec-get-schema, grep}

Workflow: call run_wcc to explore (start with overview and list-folders, then drill into the specific endpoints for the task with get-endpoint). Confirm each operation's real endpoint. When you have verified everything you need, STOP calling the tool and reply with ONLY the final Python code in a single ```python fence.

— USER —
Task: {task}
Build a CLI with exactly {N} subcommands, one per operation: {steps}
Requirements: [same CLI requirements]. Explore with run_wcc first, then return the final code in a single ```python fence.
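
A sketch of how a single `run_wcc` function tool might be declared and dispatched. The tool schema follows the OpenAI function-calling format; the shape of `args` (a list of strings), the error payload, and the timeout are assumptions, since the report only states that the harness exposes `run_wcc(args)`. The `--workspace` flag is the one the agent prompts in this report describe.

```python
import json
import subprocess

# The 16 retrieval subcommands listed in the command reference above.
WCC_SUBCOMMANDS = {
    "overview", "list-folders", "list-endpoints", "find-endpoints",
    "get-endpoint", "list-examples", "get-example", "list-envs", "get-env",
    "auth", "list-specs", "spec-overview", "spec-list-operations",
    "spec-get-operation", "spec-get-schema", "grep",
}

# Function-tool declaration passed to the model (parameter shape is assumed).
RUN_WCC_TOOL = {
    "type": "function",
    "function": {
        "name": "run_wcc",
        "description": "Run one wcc retrieval subcommand and return its JSON output.",
        "parameters": {
            "type": "object",
            "properties": {"args": {"type": "array", "items": {"type": "string"}}},
            "required": ["args"],
        },
    },
}


def run_wcc(args, workspace_id, runner=subprocess.run):
    """Execute an allowlisted wcc subcommand; `runner` is injectable for testing."""
    if not args or args[0] not in WCC_SUBCOMMANDS:
        return json.dumps({"error": "unknown wcc subcommand"})
    cmd = ["wcc", *args, "--workspace", workspace_id]
    proc = runner(cmd, capture_output=True, text=True, timeout=60)
    return proc.stdout or proc.stderr
```

In a loop of this kind, the harness would feed each tool result back to the model until it replies with code instead of another tool call.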

3 · codex / claude-code (tools surface): real agent + wcc v0.1.2's 16 subcommands
gpt-5.3-codex / Sonnet 4.6 · shell → wcc CLI
Same agents, but instead of files they get the v0.1.2 retrieval CLI on PATH (Claude restricted to Bash(wcc:*); Codex read-only sandbox). The cache is pre-warmed so tools run offline.
You are building a Python CLI for the {company} API using argparse and requests.

The API's documentation lives in a Postman workspace. Explore it ONLY with the `wcc` CLI (run it via shell) to find the exact endpoints (method + path), parameters, auth, and request bodies — do NOT rely on prior knowledge of this API, and do NOT read files directly.

Workspace ID: {workspaceId}
Every command takes --workspace {workspaceId}. Available commands:
  wcc overview / list-folders / list-endpoints [--folder --method --page-size --cursor]
  wcc find-endpoints --query TEXT / get-endpoint --id ID [--include-examples]
  wcc list-examples --endpoint ID / get-example --id ID
  wcc list-envs / get-env --id NAME [--keys-only] / auth [--folder ID]
  wcc list-specs / spec-overview / spec-list-operations / spec-get-operation / spec-get-schema
  wcc grep --pattern REGEX [--in endpoint|env|spec|example|all]
All output is JSON. Start with `overview` or `find-endpoints`; use `auth` for authentication details.

Task: {task}
Build a CLI with exactly {N} subcommands, one per operation: {steps}
Requirements: [same as folder surface]
CRITICAL OUTPUT RULE: [same — final message must be only the python script]

4 · wcc-files: harness loop, read-only bash over the pulled folder
gpt-4.1 · OpenAI function calling · wcc v0.1.3 pull
wcc pull downloads the workspace; the harness exposes run_bash(cmd) (read-only: grep/cat/find/sed allowed, writes/network blocked), cwd = the folder. Capped at ~20 rounds.
— SYSTEM —
You are a Python developer building a CLI for the {company} API using `argparse` and `requests`.

You do NOT have the spec. The API's Postman workspace has been downloaded locally and you have a read-only shell over it via run_bash (cwd = the workspace root). Explore the files to find the exact endpoints (method + path), parameters, auth, and request bodies — do NOT rely on prior knowledge of this API.

Workspace layout: {layout hint from wcc pull}
Counts: {collections/endpoints/environments/specs}
A `*.request.yaml` file holds one endpoint's method, url, headers, query/path params, and body. Auth is usually defined once at the collection or folder level in a `.resources/definition.yaml` file (look there, not in each request). Example requests/responses live under `.resources/<request>.resources/examples/`.

Explore efficiently like a real coding agent: grep/find first to locate the relevant request files, then read only the slices you need (e.g. sed -n / head) — don't cat large files wholesale. When you have verified everything, STOP calling the tool and reply with ONLY the final Python code in a single ```python fence.

— USER —
Task: {task} … [same shape as wcc 16-tool user prompt, with run_bash]
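
One way a read-only `run_bash` guard could work is sketched below. This is an assumption about the approach, not the harness's actual implementation: it parses the command with `shlex`, allowlists the binary, rejects known write flags, and executes without a shell so pipes and redirects are never interpreted. It is not a complete sandbox (for example, a `sed` script can still use the `w` command), so real isolation would need OS-level controls as well.

```python
import shlex
import subprocess

ALLOWED = {"grep", "cat", "find", "sed", "head"}  # head is assumed from the prompt's hint
SHELL_TOKENS = {"|", ">", ">>", "<", ";", "&", "&&", "||"}
FIND_WRITE_FLAGS = {"-delete", "-exec", "-execdir", "-ok", "-okdir",
                    "-fprint", "-fprint0", "-fprintf", "-fls"}


def parse_read_only(cmd):
    """Return argv if the command passes this sketch's read-only rules, else None."""
    try:
        argv = shlex.split(cmd)
    except ValueError:          # unbalanced quotes
        return None
    if not argv or argv[0] not in ALLOWED:
        return None
    if any(tok in SHELL_TOKENS for tok in argv):
        return None             # no shell is used, so reject rather than misread
    if argv[0] == "find" and any(a in FIND_WRITE_FLAGS for a in argv[1:]):
        return None
    if argv[0] == "sed":
        for a in argv[1:]:
            if a.startswith("--in-place"):
                return None
            if a.startswith("-") and not a.startswith("--") and "i" in a[1:]:
                return None     # -i, -i.bak, or a flag cluster containing i
    return argv


def run_bash(cmd, workspace_root):
    """Run an allowed command with cwd set to the pulled workspace folder."""
    argv = parse_read_only(cmd)
    if argv is None:
        return "error: command not allowed in read-only mode"
    proc = subprocess.run(argv, cwd=workspace_root, capture_output=True,
                          text=True, timeout=30)
    return (proc.stdout + proc.stderr)[:20000]  # output cap is an arbitrary choice
```

The trade-off of running without a shell is that pipelines such as `grep ... | head` are unavailable; the model has to use flags like `grep -m` or `sed -n` to limit output instead.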

4 · codex / claude-code (folder surface): real agent over the pulled folder
gpt-5.3-codex / Sonnet 4.6 · native tools · wcc v0.1.3 pull
One prompt to the real agent CLI (codex exec -s read-only / claude -p --allowedTools Read,Grep,Glob), cwd = the pulled folder. The agent uses its own Read/Grep/shell tools, caching, and compaction. A retry asks for code-only if the agent ends on a plan.
You are building a Python CLI for the {company} API using argparse and requests.

The API's Postman workspace has been downloaded into the CURRENT DIRECTORY. Explore it with your own tools to find the exact endpoints (method + path), parameters, auth, and request bodies — do NOT rely on prior knowledge of this API.
Layout: {layout hint from wcc pull}
Counts: {counts}
Each `*.request.yaml` holds one endpoint (method, url, headers, query/path params, body). Auth is usually defined at the collection/folder level in a `.resources/definition.yaml` file. Example requests/responses are under `.resources/<request>.resources/examples/`.

Task: {task}

Build a CLI with exactly {N} subcommands, one per operation: {steps}

Requirements: each subcommand maps to ONE real endpoint found in the workspace; argparse + subparsers; only `requests` (no SDK); auth via env var or --api-key as the workspace indicates; error handling; print JSON.

CRITICAL OUTPUT RULE: do NOT end on a plan or a summary. Your FINAL message must be the COMPLETE Python script itself — nothing before or after — inside a single ```python code fence. Write the actual code, do not describe it.
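
The code-only retry mentioned for this condition could look roughly like the sketch below. `ask_agent` is a hypothetical callable wrapping one agent CLI invocation, and the retry wording is invented; only the behaviour (retry when the agent ends on a plan instead of a fenced script) comes from the description above.

```python
import re

FENCE = "`" * 3  # built at runtime so this example does not close its own fence
SCRIPT_PATTERN = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)

# Hypothetical retry wording; the harness's real retry text is not shown in the report.
RETRY_PROMPT = "Reply with ONLY the complete Python script in a single python code fence."


def extract_script(message):
    """Return the last fenced python block in the message, or None if there is none."""
    blocks = SCRIPT_PATTERN.findall(message)
    return blocks[-1].strip() if blocks else None


def final_script(ask_agent, prompt, retries=1):
    """Ask once, then re-ask for code only while the reply contains no script."""
    script = extract_script(ask_agent(prompt))
    while script is None and retries > 0:
        script = extract_script(ask_agent(RETRY_PROMPT))
        retries -= 1
    return script
```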
09

What this means

Harness

Custom TypeScript harness (api-evaluator) — not promptfoo/braintrust/inspect. One runner per condition under validation-runner/src/codegen/; results in evals/v3/condition-*/run-N/…/result.json; this report is built by build-report-data.ts.

Runs & variance

full-spec & task-brief: deterministic, 1 run. wcc harness conditions: mean of 3 runs. Real agents: 1 run each (stochastic — treat single-company deltas as noise; means are the signal). wcc-files hit its 20-round cap on most APIs.

Pricing /1M tokens

gpt-4.1 $2 in / $0.50 cached / $8 out · gpt-4.1-mini $0.40 in / $1.60 out · gpt-5.3-codex $1.75 in / $0.175 cached / $14 out (reasoning tokens bill as output) · Sonnet 4.6 $3 in / $15 out (+cache-write premium).
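
A sketch of a cache-aware cost calculation using the rates above. Only the two models with all three rates listed here are included; Sonnet 4.6 is left out because its cache-read and cache-write rates are not given in this report. The token counts in the example are invented.

```python
# USD per 1M tokens, copied from the pricing line above.
PRICES = {
    "gpt-4.1": {"input": 2.00, "cached": 0.50, "output": 8.00},
    "gpt-5.3-codex": {"input": 1.75, "cached": 0.175, "output": 14.00},
}


def task_cost(model, uncached_input, cached_input, output):
    """Cost in USD for one task, billing cache reads at the discounted rate."""
    rate = PRICES[model]
    return (uncached_input * rate["input"]
            + cached_input * rate["cached"]
            + output * rate["output"]) / 1_000_000


# Invented token counts for an agentic run dominated by cache reads.
agentic = task_cost("gpt-5.3-codex", uncached_input=20_000, cached_input=300_000, output=5_000)
# Invented token counts for a one-shot run with no cache.
one_shot = task_cost("gpt-4.1", uncached_input=10_000, cached_input=0, output=2_000)
```

Note that the glossary's "uncached tok" folds input and output together, while this function keeps them apart because they bill at different rates.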

Sandboxing

Real agents run isolated from local installs: CODEX_HOME=/tmp/eval-codex, CLAUDE_CONFIG_DIR=/tmp/eval-claude-home. Claude Code uses org OAuth (managed device policy rejects API keys); Codex uses a stored API-key login in its sandbox home.
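
The isolation above could be wired up as in this sketch. The environment variables, paths, and CLI flags are the ones quoted in this report; the function names and the placement of the prompt argument are assumptions.

```python
import os
import subprocess


def agent_env(agent, base=None):
    """Copy the environment and point the agent at its sandbox home."""
    env = dict(os.environ if base is None else base)
    if agent == "codex":
        env["CODEX_HOME"] = "/tmp/eval-codex"
    elif agent == "claude":
        env["CLAUDE_CONFIG_DIR"] = "/tmp/eval-claude-home"
    else:
        raise ValueError(f"unknown agent: {agent}")
    return env


def agent_command(agent, prompt):
    """Build the folder-surface command line (prompt position is assumed)."""
    if agent == "codex":
        return ["codex", "exec", "-s", "read-only", prompt]
    if agent == "claude":
        return ["claude", "-p", prompt, "--allowedTools", "Read,Grep,Glob"]
    raise ValueError(f"unknown agent: {agent}")


def run_agent(agent, prompt, workspace_dir):
    """Run the agent with cwd set to the pulled workspace and return its stdout."""
    proc = subprocess.run(agent_command(agent, prompt), cwd=workspace_dir,
                          env=agent_env(agent), capture_output=True, text=True)
    return proc.stdout
```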