How to Use Claude Code to Write Tests: API and E2E

Claude Code can write both API tests and E2E tests directly from your terminal. For API tests, it reads your route handlers and generates test suites covering CRUD operations, auth flows, and edge cases. For E2E tests, it pairs with the Playwright MCP server to navigate your actual application and write tests grounded in your real UI — not guessed from training data.
This guide covers both workflows: what they look like in practice, where they work well, where they fall short, and what to reach for when Claude Code alone isn't enough.
Part 1: API Tests with Claude Code
What Claude Code brings to API testing
Claude Code has multi-file awareness. When you point it at your codebase, it doesn't just read your route file — it reads your models, middleware, validation logic, and existing tests together. That holistic view produces test output that's more accurate than anything generated from a prompt alone.
It also understands REST conventions. It knows a successful POST should return 201, a DELETE should return 204, that missing or invalid credentials are 401 while insufficient permissions are 403, and that missing resources are 404. When you ask it to generate tests, these patterns show up correctly without prompting.
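Those conventions amount to a small mapping from API outcome to expected status code. A sketch of that mapping (the outcome names here are illustrative, not from any library):

```typescript
// Illustrative map from API outcomes to the status codes a generated
// test suite should assert. Outcome names are made up for this sketch.
type Outcome =
  | "created"         // successful POST
  | "deleted"         // successful DELETE, no body
  | "unauthenticated" // missing or invalid credentials
  | "forbidden"       // authenticated but not permitted
  | "notFound";       // resource does not exist

const expectedStatus: Record<Outcome, number> = {
  created: 201,
  deleted: 204,
  unauthenticated: 401,
  forbidden: 403,
  notFound: 404,
};
```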
Setup
No special MCP server is required for API testing. Open a Claude Code session in your project root and point it at your API.
For best results, make sure Claude can read your:
Route/controller files
Schema or model definitions
Any existing test files (Claude uses these as style references)
.env.example (Claude uses this to understand your environment, not your secrets)
Generating API tests: the basic workflow
Start with your route file.
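A minimal opening prompt might look like this (the file path and test stack are placeholders for your own):

```text
Read src/routes/users.ts and write a supertest suite covering every
endpoint it defines. Follow the style of our existing test files.
```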
Claude reads the route file, maps every endpoint, and generates test cases. A good output will include:
Happy path tests for each endpoint
Validation error cases (missing required fields, wrong types, invalid formats)
Authentication and authorization failures
Not-found and conflict responses
Be specific about what you want covered:
"Write tests for the users endpoint" produces generic coverage. This produces targeted tests:
The more context you give about your product's actual behavior, the less you need to clean up afterward.
Authentication in API tests
Claude handles auth test setup well when you give it a concrete pattern to follow. Ask it to write a beforeAll block that creates a test user and generates a token, or to use your existing test helpers.
If you don't have test helpers yet, ask Claude to write them first.
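A generated helper might look roughly like this. It is a sketch with a hand-rolled HMAC token; in a real suite you would reuse your app's own signing logic, and the secret and payload shape here are assumptions:

```typescript
import { createHmac } from "node:crypto";

// Test-only secret; a real helper would share the app's signing config.
const TEST_SECRET = "test-only-secret";

function base64url(input: string): string {
  return Buffer.from(input).toString("base64url");
}

// Builds a JWT-shaped token for a hypothetical test user.
export function makeTestToken(userId: string, role = "member"): string {
  const header = base64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const payload = base64url(JSON.stringify({ sub: userId, role }));
  const signature = createHmac("sha256", TEST_SECRET)
    .update(`${header}.${payload}`)
    .digest("base64url");
  return `${header}.${payload}.${signature}`;
}
```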
Reading your OpenAPI spec
If your API has an OpenAPI/Swagger spec, give Claude access to it before generating tests.
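For instance (the spec filename is a placeholder):

```text
Read openapi.yaml. For each operation, write a test that sends a valid
request and asserts the response status and body match the documented
schema. Flag any endpoint where the implementation disagrees with the
spec.
```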
This produces contract tests — tests that fail when your implementation drifts from its documented spec. They're more valuable than happy-path-only coverage because they catch the subtle regressions that break API consumers.
What to review before committing
Claude's API tests tend toward testing the presence of fields rather than their correctness. A test might confirm that response.body.user exists without checking that response.body.user.role is what it should be after a role update. Go through the generated tests and strengthen assertions around fields that carry real business logic.
Also check for test isolation. Claude sometimes writes tests that rely on shared state — a user created in one test that a later test modifies. Review the beforeEach/afterEach hooks and make sure each test is self-contained.
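The isolated shape to look for is roughly this, with an in-memory map standing in for your test database (names are illustrative):

```typescript
// Each test gets a fresh, known state; no test depends on another's writes.
const db = new Map<string, { role: string }>();

function beforeEachTest(): void {
  db.clear();                        // wipe anything a previous test left
  db.set("u1", { role: "member" });  // recreate the baseline fixture
}

// A state-changing operation one test might exercise.
function promoteUser(id: string): void {
  const user = db.get(id);
  if (user) user.role = "admin";
}
```

With this shape, a test that promotes u1 cannot leak an admin role into the next test, because the next beforeEachTest resets it.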
Part 2: E2E Tests with Claude Code
The Playwright MCP server
The previous Decipher post on Claude Code + Playwright MCP covers the E2E setup in detail. Quick summary:
Start Claude Code, run /mcp, and verify playwright appears with tools like browser_navigate, browser_click, and browser_snapshot. The first time you use it in a session, say "use playwright mcp" explicitly — otherwise Claude may fall back to running Playwright via bash.
Writing E2E tests from a real session
The key advantage of the MCP approach is that Claude interacts with your actual DOM instead of generating test code from memory. Tests based on real selectors break less. Tests based on guesses break constantly.
A workflow that works well:
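Describe the flow in plain language and ask Claude to walk it before writing anything. For example (the URL and flow are illustrative):

```text
Use playwright mcp. Walk through the signup flow at localhost:3000:
fill the form, submit, and confirm the welcome screen appears. Then
write a Playwright test for the flow using the selectors you actually
observed.
```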
Claude opens a browser, walks through the steps, observes selectors and state changes, and produces test code. It's not perfect — complex interactions, drag-and-drop, and heavily dynamic UIs require back-and-forth — but the selector quality is meaningfully better than generated-from-memory tests.
Beyond single flows: testing the API-to-UI contract
One of the more powerful Claude Code workflows combines API and E2E testing in a single session. You ask Claude to:
Set up test data via the API (creating known state)
Open the browser and verify the UI reflects that state
Perform a UI action
Hit the API to confirm the action was persisted correctly
This kind of test catches the failure mode that neither pure API tests nor pure UI tests surface on their own: cases where the UI sends the right request but the backend doesn't persist it, or where the backend updates but the UI shows stale state.
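The shape of such a test, simulated here with an in-memory backend so the four steps are visible (a real version would use Playwright's request fixture for steps 1 and 4 and the page object for steps 2 and 3; all names are illustrative):

```typescript
type Order = { id: string; status: string };
const backend = new Map<string, Order>(); // stands in for the real API/DB

// Step 1: seed known state via the API.
function apiCreateOrder(id: string): Order {
  const order = { id, status: "pending" };
  backend.set(id, order);
  return order;
}

// Step 2: verify the UI reflects that state.
function renderOrderRow(id: string): string {
  const o = backend.get(id);
  return o ? `${o.id}: ${o.status}` : "not found";
}

// Step 3: perform a UI action that should persist a change.
function uiMarkShipped(id: string): void {
  const o = backend.get(id);
  if (o) o.status = "shipped";
}

// Step 4: hit the API again to confirm the change was persisted.
function apiGetOrder(id: string): Order | undefined {
  return backend.get(id);
}
```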
Playwright Agents for larger suites
Playwright (v1.56+) ships with three specialized Claude Code subagents — the Planner, Generator, and Healer — designed for building out fuller test suites. See the Playwright Agents guide for the setup walkthrough.
The short version:
The Planner explores your app and produces a structured Markdown test plan. The Generator turns that plan into Playwright test files. The Healer runs failing tests and attempts to fix them.
The pipeline works, with caveats. The Planner documents what it observes in the UI — not your product's business rules. It will miss domain-specific edge cases unless you feed it context. The Generator's assertions skew toward element visibility over business logic correctness. The Healer handles selector-level fixes well; it can't rewrite tests when a flow fundamentally changes.
Treat each stage as a starting point, not finished output.
Where both approaches fall short
API tests: gaps Claude reliably leaves
Claude generates happy-path and error-code tests well. It's weaker on:
Business logic validation. A test might confirm that POST /api/orders returns 201 without checking that inventory was decremented, that the confirmation email queue was triggered, or that the order total math is correct. These assertions require you to know what the endpoint is supposed to do, not just what HTTP shape it returns.
Negative and edge case coverage. Without explicit prompting, Claude tends to miss cases like concurrent requests, partial failures in multi-step operations, and behavior at quota limits. One engineering team found that Claude consistently tested only happy paths until explicitly asked to "think about what could go wrong at every step."
Test isolation. As mentioned above, shared state across tests is a common issue in Claude-generated suites. Always review the setup and teardown logic.
E2E tests: the maintenance problem
The Playwright Healer handles selector-level failures — a button that moved, a class that changed. What it cannot do is rewrite tests when a product flow changes. New onboarding steps, a redesigned checkout, features that got combined or split — these require human intervention to update the tests. Claude Code generates tests; it doesn't own them.
There's also no managed infrastructure. You run the browsers, manage the test environments, handle parallelization, and wire everything into CI/CD. For small suites this is fine. As the suite grows, the infrastructure overhead grows with it.
And Playwright assertions are string-matched. expect(page.getByText('Order confirmed')).toBeVisible() fails if the copy changes to "Your order is confirmed." It passes if the page loads but shows the wrong order. Claude Code generates surface-level checks, and strengthening them is manual work on your end.
Context window limits
Large suites push against Claude's context window. A realistic per-session capacity is 5-15 well-structured tests before things start degrading. For larger suites, break the work into focused sessions by feature area and plan for multiple iterations.
When to use Claude Code vs. a dedicated AI testing tool
Claude Code makes sense in two scenarios: you want raw Playwright files you own and control completely, or your budget is zero.
Outside of those two cases, the tradeoffs don't favor it. You're managing your own browser infrastructure, manually invoking the Healer when tests break, accepting string-matched assertions that miss functional regressions, and getting no visibility into what's actually happening in production. Each individual piece is manageable. Together, they add up to a testing workflow that still requires significant ongoing engineering time — which is what you were trying to avoid.
Decipher handles the full lifecycle: generating E2E tests from your actual product, maintaining them as flows change (not just selector patches — full flow-level updates), running on managed cloud infrastructure, and monitoring real production sessions to catch bugs that escaped the test suite entirely. When something fails, you get an explanation — real regression, environment fluke, or intentional change — with session recordings and impact data.
If you want to use Claude Code for specific flows where you need code-level control, that's reasonable. But it shouldn't be your testing infrastructure.
Practical tips
Provide a seed test. Both API and E2E test generation improve significantly when Claude has an existing test file to follow. Before asking Claude to generate a new suite, point it at your best existing test: "Write tests in the same style as tests/routes/auth.test.js." The output quality jumps.
Be specific about edge cases. Claude won't invent business rules it doesn't know. If a discount code should fail for users on a free plan, tell it. If a quantity update should recalculate a cart total, tell it. Concrete context produces concrete tests.
Review selectors before committing. Generated tests sometimes use unstable selectors — nth-child, CSS classes, or positional locators. Look for these and replace them with getByRole, getByLabel, or data-testid selectors that won't break on a UI refresh.
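Typical swaps look like this (getByRole, getByLabel, and getByTestId are standard Playwright locators; the selector names are illustrative):

```text
page.locator('li:nth-child(3) .btn')   →  page.getByRole('button', { name: 'Delete' })
page.locator('.form-input.email')      →  page.getByLabel('Email')
page.locator('div > span.status')      →  page.getByTestId('order-status')
```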
Run the tests before committing. This sounds obvious but it's easy to skip. Claude can generate plausible-looking tests that don't pass. Running them in your local environment before putting them in a PR saves review cycles.
Use the codegen flag for E2E. When running Playwright MCP, add --codegen to get reusable TypeScript output as Claude navigates.
FAQ
Q: Does Claude Code write better API tests with or without an OpenAPI spec? A: Noticeably better with one. The spec gives Claude explicit shape information for every endpoint — expected request bodies, response schemas, status codes — so the tests validate contracts rather than just checking that requests don't 500.
Q: Can Claude Code generate tests for a backend it can't run locally? A: Yes, but the quality drops. Without a running server, Claude can't verify that endpoints behave the way the code suggests. It's generating based on reading the code, not running it. Expect more manual correction of edge case assumptions.
Q: How is using Claude Code for E2E tests different from just using Playwright codegen? A: Playwright's built-in codegen records your interactions and produces code. Claude Code does more: it navigates the app autonomously, reasons about what to test, handles multi-step flows with conditional logic, and writes test files with assertions rather than just action recordings. The output requires more review, but it starts from a higher baseline.
Q: What's the most common mistake teams make when generating tests with Claude Code? A: Treating generated output as done. The output is a starting point. It handles the mechanical work — writing the test structure, mapping endpoints to tests, setting up assertions. What it doesn't do is understand your product's behavior well enough to catch logic bugs or verify that the right things are correct. Plan for a review pass on every generated suite.
Q: How does Claude Code + Playwright compare to Decipher? A: Claude Code is a good fit if you want raw Playwright files you fully control, or if cost is the deciding factor. Otherwise, you're taking on browser infrastructure, manual maintenance every time a flow changes, surface-level assertions, and no production visibility — all things Decipher handles for you. The tradeoff isn't really "code control vs. convenience." It's "how much ongoing engineering time do you want to spend keeping your test suite alive."
Written by:

Michael Rosenfield
Co-founder