The product question behind wcc (Workspace Context CLI): when an AI coding agent needs to write working integration code against an unfamiliar API, what's the best way to hand it the API's context? Each condition below is one answer to that question, run over the same task corpus and scored identically.
| tier | shape | this report |
|---|---|---|
| T1 | 3 independent calls — simple create/retrieve | not run |
| T2 | 5–6 chained calls with data threading between steps | ✓ all results here |
| T3 | discovery flows — filter/search, then act on results | not run |
All figures are per-task means over the 22 APIs: an entry of 358,727 tokens is the average for one task, so multiply by 22 for the sweep total. Token columns use the glossary below; raw vendor numbers are not comparable across OpenAI and Anthropic without this normalization.
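The per-task convention is plain arithmetic, sketched below using the 358,727 figure from the text. The function names are illustrative and not part of the eval harness.

```python
from statistics import mean

N_APIS = 22  # corpus size stated in the report


def per_task_mean(per_api_tokens):
    """Average one token column over the APIs in a sweep."""
    return mean(per_api_tokens)


def sweep_total(mean_tokens_per_task, n_apis=N_APIS):
    """Recover the sweep-wide total from a reported per-task mean."""
    return mean_tokens_per_task * n_apis


print(sweep_total(358_727))  # 7891994
```

A 358,727-token entry therefore corresponds to roughly 7.9M tokens across the whole sweep.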
Every condition is plotted by client cost per task (log scale) against endpoint precision. Bubble area = total tokens processed. Up and left is better.
Endpoint precision and recall, mean over the corpus.
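As a rough illustration of these two metrics for a single task, the sketch below compares endpoints as (method, path) pairs. This is an assumption: the report does not spell out the scorer's matching rules (for example how path parameters are normalized), and every path other than `/v2/customers` is invented for the example.

```python
def normalize(method, path):
    """Canonical (METHOD, path) pair; a trailing slash is ignored."""
    return method.upper(), path.rstrip("/") or "/"


def endpoint_precision_recall(generated, expected):
    """Precision and recall of generated endpoints against the expected set."""
    gen = {normalize(m, p) for m, p in generated}
    exp = {normalize(m, p) for m, p in expected}
    hits = len(gen & exp)
    precision = hits / len(gen) if gen else 0.0
    recall = hits / len(exp) if exp else 0.0
    return precision, recall


generated = [("post", "/v2/customers"), ("GET", "/v2/customers/{id}"), ("GET", "/v2/orders")]
expected = [
    ("POST", "/v2/customers"),
    ("GET", "/v2/customers/{id}"),
    ("POST", "/v2/invoices"),
    ("GET", "/v2/invoices/{id}"),
]
precision, recall = endpoint_precision_recall(generated, expected)
```

Here two of three generated endpoints are correct (precision 2/3) and two of four expected endpoints are covered (recall 0.5). The corpus figure would then be the mean of these per-task values.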
Where the tokens actually go. Headline totals mislead — the agentic conditions are mostly cached re-reads of their own accumulated context. The uncached slice (full-price work) is remarkably similar across very different setups.
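One plausible way to separate the cached from the uncached slice across vendors is sketched below. It is not the report's glossary, which is not reproduced in this section. It assumes OpenAI chat-completions style usage fields, where `prompt_tokens` includes cached tokens, and Anthropic Messages usage fields, where `input_tokens` excludes cache reads and writes. Counting cache writes as uncached work is a further assumption.

```python
from dataclasses import dataclass


@dataclass
class NormalizedUsage:
    uncached_input: int  # input tokens processed at full price
    cached_input: int    # input tokens served from the prompt cache
    output: int

    @property
    def total(self):
        return self.uncached_input + self.cached_input + self.output


def from_openai(usage):
    # Assumed shape: prompt_tokens INCLUDES the cached portion.
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    return NormalizedUsage(usage["prompt_tokens"] - cached, cached, usage["completion_tokens"])


def from_anthropic(usage):
    # Assumed shape: input_tokens EXCLUDES cache reads and cache writes.
    # Cache writes are counted as uncached here (an assumption of this sketch).
    uncached = usage["input_tokens"] + usage.get("cache_creation_input_tokens", 0)
    return NormalizedUsage(uncached, usage.get("cache_read_input_tokens", 0), usage["output_tokens"])
```

After this mapping, both vendors report the same three buckets, so the totals and the uncached slice can be compared directly.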
All 22 APIs × all conditions. Datadog and Sentry fail every condition identically, which points to scoring/spec-shape issues rather than retrieval failures.
Current frontier models (gpt-5.3-codex vs Sonnet 4.6), identical tasks, both sandboxed. Two questions: which agent, and — for the same agent — which context surface: read the pulled folder directly (wcc v0.1.3 pull) or query the 16 structured retrieval subcommands (wcc v0.1.2)?
The exact prompt templates each condition runs with (placeholders in {braces} are filled per task). These are the conditions' real definitions — differences in results trace back to differences here.
You are a Python developer. Build a CLI tool using `argparse` and `requests` for the {company} API.
The CLI should have exactly {N} subcommands, one for each of these operations:
1. `{method}-1` — {step description}
2. ...
Requirements:
1. Each subcommand maps to ONE API operation from the list above
2. Use `argparse` with subparsers
3. Use only `requests` for HTTP calls (no SDK)
4. Include proper authentication handling (API key via environment variable or --api-key flag)
5. Each subcommand should accept the arguments needed for its operation
6. Include error handling for API errors
7. Print results as formatted JSON
Here is the OpenAPI specification for the {company} API:
{FULL OPENAPI SPEC — inlined, can be 100K+ tokens}
Use the correct endpoints from this specification for each subcommand. Return your code inside a single ```python code fence.
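For orientation, a minimal sketch of the kind of script this prompt asks for is shown below. The base URL, environment variable name, subcommand names, and Bearer auth scheme are all hypothetical. A real answer would take them from the spec.

```python
import argparse
import json
import os
import sys

BASE_URL = "https://api.example.com"  # hypothetical base URL


def build_parser():
    parser = argparse.ArgumentParser(description="Example API CLI")
    # Auth: API key via flag, falling back to an environment variable.
    parser.add_argument("--api-key", default=os.environ.get("EXAMPLE_API_KEY"))
    sub = parser.add_subparsers(dest="command", required=True)
    create = sub.add_parser("create-customer")
    create.add_argument("--email", required=True)
    get = sub.add_parser("get-customer")
    get.add_argument("--customer-id", required=True)
    return parser


def build_request(args):
    """Map one subcommand to exactly one API operation: (method, url, json body)."""
    if args.command == "create-customer":
        return "POST", f"{BASE_URL}/v2/customers", {"email": args.email}
    if args.command == "get-customer":
        return "GET", f"{BASE_URL}/v2/customers/{args.customer_id}", None
    raise ValueError(f"unknown command: {args.command}")


def main(argv=None):
    import requests  # third-party; imported here so the parser works without it

    args = build_parser().parse_args(argv)
    if not args.api_key:
        sys.exit("Provide --api-key or set EXAMPLE_API_KEY")
    method, url, body = build_request(args)
    try:
        resp = requests.request(
            method, url, json=body,
            headers={"Authorization": f"Bearer {args.api_key}"},  # assumed scheme
            timeout=30,
        )
        resp.raise_for_status()
    except requests.RequestException as exc:
        sys.exit(f"API error: {exc}")
    print(json.dumps(resp.json(), indent=2))
```

Calling `main()` from a `__main__` guard turns this into the CLI. Separating `build_request` from the HTTP call keeps the endpoint mapping, which is what the precision and recall scores measure, easy to inspect.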
— COMPILER (backend, gpt-4.1-mini) —
You are an API context compiler. Your job is to read an OpenAPI spec and a task description, then produce a minimal TASK BRIEF that contains ONLY what an AI coding agent needs to complete this specific task.
TASK: {task}
The task brief must include:
1. AUTH — the exact authentication method, headers, and setup steps
2. BASE URL — the correct base URL for API calls
3. STEPS — for each step in the task, provide:
- The exact HTTP method and path (e.g., POST /v2/customers)
- Required parameters with names, types, and where they go (path, query, body)
- What the response returns (specifically the fields needed by subsequent steps)
- Data threading: which field from a previous step's response feeds into this step
4. GOTCHAS — non-obvious things that will cause first-call failures:
- Unusual content types (form-encoded instead of JSON, line-delimited JSON, etc.)
- Unusual parameter formats (cents vs dollars, full URIs vs plain IDs, PascalCase, etc.)
- Required fields that aren't obvious from the path
- Auth quirks (Token vs Bearer prefix, API key in body vs header, etc.)
Rules:
- Include ONLY endpoints needed for this specific task. Not the whole API.
- Be precise: exact paths, exact parameter names, exact header values.
- Keep it under 1500 tokens. Brevity is critical — every extra token is wasted.
- Use a compact structured format, not prose.
- If the spec has multiple endpoints that could work, pick the most standard one:
- Prefer native CRUD endpoints over SCIM, provisioning, or enterprise SSO endpoints.
- Prefer the simplest path structure (e.g., /teams over /scim/v2/Groups for team creation).
- When both v1 and v2 exist for the same resource, prefer v1 unless v2 is clearly the replacement.
- Prefer PUT for full updates, PATCH for partial updates. Match the HTTP method to the task.
- Do NOT add extra steps beyond what the task describes.
Here is the OpenAPI specification:
{FULL SPEC}
Produce the task brief now. Use this exact format:
TASK BRIEF: [task summary] ([company])
AUTH / BASE URL / STEPS (method, path, params, returns, threading) / GOTCHAS
— GENERATOR (client, gpt-4.1) —
You are a Python developer. Build a CLI tool using `argparse` and `requests` for the {company} API.
[same CLI requirements as full-spec]
Here is the task brief for the {company} API — it contains the exact endpoints, auth, parameters, and gotchas you need:
{TASK BRIEF}
Use ONLY the endpoints from this task brief. Return your code inside a single ```python code fence.
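The two prompts above form a compile-then-generate pipeline, which might be wired together roughly as follows. The `LLM` callable interface, the abbreviated prompt strings, and the whitespace-based token estimate are assumptions of this sketch, not the harness's actual code.

```python
from typing import Callable, Tuple

LLM = Callable[[str], str]  # prompt in, completion out (hypothetical interface)

BRIEF_TOKEN_BUDGET = 1500  # limit stated in the compiler prompt


def rough_token_count(text: str) -> int:
    # Crude whitespace proxy; a real harness would use the model's tokenizer.
    return len(text.split())


def compile_then_generate(spec: str, task: str, compiler: LLM, generator: LLM) -> Tuple[str, str]:
    """Stage 1 reads the full spec; stage 2 sees only the compiled brief."""
    brief = compiler(f"TASK: {task}\n\nHere is the OpenAPI specification:\n{spec}")
    if rough_token_count(brief) > BRIEF_TOKEN_BUDGET:
        raise ValueError("task brief exceeds the stated budget")
    code = generator(f"Here is the task brief:\n{brief}")
    return brief, code
```

The point of the split is that the large spec is paid for once by the cheaper backend model, while the client model only ever receives the short brief.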
run_wcc(args), that executes the wcc CLI's 16 retrieval subcommands. ~9 round-trips typical.
— SYSTEM —
You are a Python developer building a CLI for the {company} API using `argparse` and `requests`.
You do NOT have the API spec. Instead you have a tool, run_wcc, that retrieves API context from the API's Postman workspace. You MUST verify the actual endpoints (method + path), parameters, auth, and request bodies via run_wcc before writing code — do NOT rely on prior knowledge of this API, which may be wrong or outdated. Read only what's relevant to the task; you don't need to read everything.
{WCC COMMAND REFERENCE — overview, list-folders, list-endpoints, find-endpoints,
get-endpoint, list-examples, get-example, list-envs, get-env, auth, list-specs,
spec-overview, spec-list-operations, spec-get-operation, spec-get-schema, grep}
Workflow: call run_wcc to explore (start with overview and list-folders, then drill into the specific endpoints for the task with get-endpoint). Confirm each operation's real endpoint. When you have verified everything you need, STOP calling the tool and reply with ONLY the final Python code in a single ```python fence.
— USER —
Task: {task}
Build a CLI with exactly {N} subcommands, one per operation: {steps}
Requirements: [same CLI requirements]. Explore with run_wcc first, then return the final code in a single ```python fence.
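A harness-side `run_wcc` tool could be a thin wrapper that allows only the 16 subcommands named in the command reference. The sketch below is a guess at that shape, not the real harness. Appending `--workspace` follows the CLI convention quoted in the CLI-surface prompt, and the workspace ID in the usage note is made up.

```python
import shlex
import subprocess

# The 16 retrieval subcommands listed in the command reference above.
WCC_SUBCOMMANDS = frozenset({
    "overview", "list-folders", "list-endpoints", "find-endpoints",
    "get-endpoint", "list-examples", "get-example", "list-envs", "get-env",
    "auth", "list-specs", "spec-overview", "spec-list-operations",
    "spec-get-operation", "spec-get-schema", "grep",
})


def build_wcc_argv(args: str, workspace_id: str) -> list:
    """Turn the model's tool argument string into a safe argv list."""
    tokens = shlex.split(args)
    if not tokens or tokens[0] not in WCC_SUBCOMMANDS:
        raise ValueError(f"not an allowed wcc subcommand: {args!r}")
    return ["wcc", *tokens, "--workspace", workspace_id]


def run_wcc(args: str, workspace_id: str) -> str:
    """Execute one retrieval call and hand its output back to the model."""
    proc = subprocess.run(
        build_wcc_argv(args, workspace_id),
        capture_output=True, text=True, timeout=60,
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr
```

For example, `build_wcc_argv('find-endpoints --query "create customer"', "ws-123")` yields `["wcc", "find-endpoints", "--query", "create customer", "--workspace", "ws-123"]`. Passing a list to `subprocess.run` avoids shell interpretation of model-supplied text.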
Bash(wcc:*); Codex read-only sandbox). The cache is pre-warmed so tools run offline.
You are building a Python CLI for the {company} API using argparse and requests.
The API's documentation lives in a Postman workspace. Explore it ONLY with the `wcc` CLI (run it via shell) to find the exact endpoints (method + path), parameters, auth, and request bodies — do NOT rely on prior knowledge of this API, and do NOT read files directly.
Workspace ID: {workspaceId}
Every command takes --workspace {workspaceId}. Available commands:
wcc overview / list-folders / list-endpoints [--folder --method --page-size --cursor]
wcc find-endpoints --query TEXT / get-endpoint --id ID [--include-examples]
wcc list-examples --endpoint ID / get-example --id ID
wcc list-envs / get-env --id NAME [--keys-only] / auth [--folder ID]
wcc list-specs / spec-overview / spec-list-operations / spec-get-operation / spec-get-schema
wcc grep --pattern REGEX [--in endpoint|env|spec|example|all]
All output is JSON. Start with `overview` or `find-endpoints`; use `auth` for authentication details.
Task: {task}
Build a CLI with exactly {N} subcommands, one per operation: {steps}
Requirements: [same as folder surface]
CRITICAL OUTPUT RULE: [same — final message must be only the python script]
wcc pull downloads the workspace; the harness exposes run_bash(cmd) (read-only: grep/cat/find/sed allowed, writes/network blocked), cwd = the folder. Capped at ~20 rounds.
— SYSTEM —
You are a Python developer building a CLI for the {company} API using `argparse` and `requests`.
You do NOT have the spec. The API's Postman workspace has been downloaded locally and you have a read-only shell over it via run_bash (cwd = the workspace root). Explore the files to find the exact endpoints (method + path), parameters, auth, and request bodies — do NOT rely on prior knowledge of this API.
Workspace layout: {layout hint from wcc pull}
Counts: {collections/endpoints/environments/specs}
A `*.request.yaml` file holds one endpoint's method, url, headers, query/path params, and body. Auth is usually defined once at the collection or folder level in a `.resources/definition.yaml` file (look there, not in each request). Example requests/responses live under `.resources/<request>.resources/examples/`.
Explore efficiently like a real coding agent: grep/find first to locate the relevant request files, then read only the slices you need (e.g. sed -n / head) — don't cat large files wholesale. When you have verified everything, STOP calling the tool and reply with ONLY the final Python code in a single ```python fence.
— USER —
Task: {task} … [same shape as wcc 16-tool user prompt, with run_bash]
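The read-only restriction on `run_bash` could be approximated with a conservative allowlist check like the one below. This is an illustrative sketch, not the harness's implementation and not a substitute for a real sandbox. The allowed programs are the ones named in this section, and the blocked flags are a hand-picked, incomplete list.

```python
import shlex

ALLOWED = {"grep", "cat", "find", "sed", "head"}
WRITE_FLAGS = {
    "sed": ("-i", "--in-place"),
    "find": ("-delete", "-exec", "-ok", "-fprint", "-fls"),
}
SHELL_OPERATORS = set("();<>|&")


def is_read_only(cmd: str) -> bool:
    """Accept pipelines of allowed programs; reject redirects, chaining, substitution."""
    if "`" in cmd:
        return False
    lexer = shlex.shlex(cmd, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    try:
        tokens = list(lexer)
    except ValueError:  # unbalanced quotes
        return False
    program = None  # None means the next token must be a program name
    for tok in tokens:
        if tok == "|":
            if program is None:
                return False
            program = None
        elif tok and set(tok) <= SHELL_OPERATORS:
            return False  # >, >>, ;, &&, ||, ( and ) are all refused
        elif program is None:
            if tok not in ALLOWED:
                return False
            program = tok
        elif tok.startswith(WRITE_FLAGS.get(program, ())):
            return False
    return program is not None
```

The check errs on the side of refusing: a quoted `">"` search pattern is rejected, for instance. That is acceptable for an exploration tool where the agent can simply rephrase the command.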
codex exec -s read-only / claude -p --allowedTools Read,Grep,Glob), cwd = the pulled folder. The agent uses its own Read/Grep/shell tools, caching, and compaction. A retry asks for code-only if the agent ends on a plan.
You are building a Python CLI for the {company} API using argparse and requests.
The API's Postman workspace has been downloaded into the CURRENT DIRECTORY. Explore it with your own tools to find the exact endpoints (method + path), parameters, auth, and request bodies — do NOT rely on prior knowledge of this API.
Layout: {layout hint from wcc pull}
Counts: {counts}
Each `*.request.yaml` holds one endpoint (method, url, headers, query/path params, body). Auth is usually defined at the collection/folder level in a `.resources/definition.yaml` file. Example requests/responses are under `.resources/<request>.resources/examples/`.
Task: {task}
Build a CLI with exactly {N} subcommands, one per operation: {steps}
Requirements: each subcommand maps to ONE real endpoint found in the workspace; argparse + subparsers; only `requests` (no SDK); auth via env var or --api-key as the workspace indicates; error handling; print JSON.
CRITICAL OUTPUT RULE: do NOT end on a plan or a summary. Your FINAL message must be the COMPLETE Python script itself — nothing before or after — inside a single ```python code fence. Write the actual code, do not describe it.
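The output rule and the code-only retry could be enforced by a small post-processing step, sketched below. The `run_agent` callable and the retry wording are hypothetical. Only the single python fence requirement comes from the prompt.

```python
import re
from typing import Callable, Optional

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence
PY_BLOCK = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)

# Hypothetical retry wording; the harness's actual retry prompt is not shown here.
RETRY_PROMPT = "Reply with ONLY the complete Python script in a single python code fence."


def extract_script(message: str) -> Optional[str]:
    """Return the script if the message holds exactly one python fence, else None."""
    blocks = PY_BLOCK.findall(message)
    return blocks[0].strip() if len(blocks) == 1 else None


def get_script(run_agent: Callable[[str], str], prompt: str, retries: int = 1) -> Optional[str]:
    """Run the agent, re-asking for code only when it ends on a plan or summary."""
    message = run_agent(prompt)
    for _ in range(retries):
        script = extract_script(message)
        if script is not None:
            return script
        message = run_agent(RETRY_PROMPT)
    return extract_script(message)
```

Requiring exactly one fence keeps scoring unambiguous: a message containing a plan plus several partial snippets is treated the same as a message with no code at all.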
api-evaluator) — not promptfoo/braintrust/inspect. One runner per condition under validation-runner/src/codegen/; results in evals/v3/condition-*/run-N/…/result.json; this report is built by build-report-data.ts.
CODEX_HOME=/tmp/eval-codex, CLAUDE_CONFIG_DIR=/tmp/eval-claude-home. Claude Code uses org OAuth (managed device policy rejects API keys); Codex uses a stored API-key login in its sandbox home.