How to Debug Playwright Tests with Claude Code: The 2026 Guide

Writing a Playwright test with Claude Code is the easy part. Debugging the test when it fails is where most of the time goes. A test that passed yesterday, a flake that only shows up in CI, an assertion that breaks after a UI refresh, a timeout that points at a button that definitely exists — Claude Code has become a genuinely useful partner for each of these, but only if you set up the loop correctly.

This guide is a practical tour of Playwright debugging with Claude Code in 2026. It covers the Playwright debug tools that already exist, how to pair them with Claude Code and the Playwright MCP server, specific workflows for the failure modes you actually hit (selector drift, timing flakes, hydration races, post-refactor breakage), and where Claude Code hits a wall that needs something else.

The Playwright debug toolkit, briefly

Claude Code is most useful when it has the same information a human debugger would reach for. Before bringing it into the loop, you want these in place:

  • Trace Viewer. trace: 'on-first-retry' in playwright.config.ts produces a .zip that contains the full action timeline, DOM snapshots, network, and console logs. Open it with npx playwright show-trace trace.zip.

  • UI Mode. npx playwright test --ui gives a time-travel UI with per-step screenshots, the ability to re-run individual actions, and a watch mode that reruns tests on file change. For interactive debugging, this is still the fastest loop.

  • Headed + --debug. npx playwright test --debug launches the Playwright Inspector, pausing before the first action so you can step through selectors against a real page.

  • Verbose logs. DEBUG=pw:api surfaces Playwright's internal decision-making — which is the single fastest way to understand why a locator is waiting.
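The trace setting above as a minimal config fragment (a sketch; merge it into your existing playwright.config.ts rather than replacing it):

```typescript
// playwright.config.ts (sketch; adapt to your existing config)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // 'on-first-retry' records a trace only when a failed test retries,
  // so passing runs stay fast and failures still ship a full trace.zip.
  retries: process.env.CI ? 1 : 0,
  use: {
    trace: 'on-first-retry',
  },
});
```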

Claude Code doesn't replace any of these. It reads their output and makes faster sense of it.

Setting up the debugging loop with Claude Code

The workflow that works: open Claude Code in your test project, make sure it can read your test files and the Playwright trace, and add the Playwright MCP server so it can interact with the app live when needed.

Start Claude Code, run /mcp, and confirm the Playwright tools show up. Tell it explicitly to use Playwright MCP in the first prompt of a session — otherwise it tends to shell out to npx playwright test and you lose the live inspection channel.

A useful first message:
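Something in this shape works; the file name is a placeholder for your own spec:

```
I have a failing Playwright test in tests/checkout.spec.ts. Use the
Playwright MCP tools to inspect the live app instead of just re-running
the suite. Walk me through what the test expects versus what the page
actually does. Don't rewrite the test yet; first tell me what you think
the root cause is, and we'll agree on it before changing anything.
```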




The "don't rewrite yet" instruction matters. Without it, Claude will jump to a fix — usually a waitForTimeout or a CSS selector swap — before you've figured out what's actually wrong.

Debugging a failed trace with Claude Code

Traces are the richest debugging artifact Playwright produces. Claude Code can read them directly from disk.
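A prompt along these lines (the path is illustrative):

```
The trace for the failing run is at test-results/checkout-failed/trace.zip.
Unzip it, read the action timeline, and tell me what the page looked like
at the moment the failing action ran: which element the locator resolved
to (if any), what network requests were still in flight, and anything in
the console log around that timestamp. Don't suggest a fix yet.
```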




Claude unzips the trace, reads the serialized timeline, and correlates the failing action with DOM and network state. In practice, it's noticeably better than eyeballing the trace yourself for three specific patterns:

  • A selector that was stable, then wasn't. Claude compares the pre-failure and failure-time DOM snapshots and notices the element moved, was renamed, or got wrapped in a new container.

  • A network race. Claude correlates the failed click with an in-flight request to the same resource and flags the timing.

  • Hydration lag. Claude spots that the click landed on the server-rendered button before the client handler attached, and points at the missing hydration checkpoint.

It is noticeably worse at reasoning about business logic. If the real bug is "the cart subtotal math is wrong," Claude sees "the subtotal says $48.99" in the DOM and has no way to know that's incorrect unless you tell it.

Debugging flaky tests

Flakes are the hardest class of failure to debug because they don't reproduce on command. A structured loop with Claude Code cuts down the diagnosis time significantly.
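One loop that works (spec name and counts are illustrative): reproduce the flake with repeated runs, then hand all the resulting traces to Claude at once.

```
Run npx playwright test tests/search.spec.ts --repeat-each=15 --workers=4,
then read every trace in test-results/. Compare the failing runs against
the passing ones and tell me what the failing runs have in common: timing,
ordering, network, or focus. I want the differentiating factor, not a fix.
```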




The pattern that emerges in practice: Claude finds the common failure mode after 3-5 failed runs. Typical root causes it surfaces:

  • Timing-sensitive assertions. An expect(...).toBeVisible() that races a CSS transition. Claude spots that the failing runs assert on the element mid-animation.

  • Order-dependent test data. A test that relies on a specific record being first in a list, which fails when the DB returns results in a different order.

  • Third-party iframes. Stripe, Intercom, reCAPTCHA — iframes that load at different speeds and steal focus from Playwright's click. Claude reads the trace timeline and points at the focus shift.

For each category, Claude's suggested fix is usually close to right: role-based locators with built-in auto-waiting, explicit test data setup, or scoped waits for third-party frames. Review the fix; it's a starting point, not a finished patch.

Live-inspecting a broken test with Playwright MCP

When the trace isn't enough — usually because the test passes locally but fails in CI — the next step is to run the target flow live and let Claude Code observe.
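A session prompt along these lines (the URL and spec name are placeholders):

```
Use the Playwright MCP tools to open http://localhost:3000 and walk
through the flow that tests/onboarding.spec.ts automates, step by step.
At each step, compare what the spec expects (selectors, URLs, visible
text) with what the live page actually shows, and list every divergence.
```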




This works well for:

  • Tests written against an older UI that has shifted.

  • Tests where the selector in the spec file no longer uniquely identifies the element.

  • Tests where the flow itself has changed — a step got split, combined, or moved to a different page.

It does not work well when the failure is environmental: a test that passes in MCP but fails in your CI container because of a locale, timezone, or headless-mode difference. For that, the gap is between your local environment and CI, not between Claude and the app.

Post-refactor debugging: the "half the suite is red" scenario

A UI refactor typically breaks 20-40% of an E2E suite at once. The Playwright Healer handles selector-level drift reasonably well; it does not handle flow changes (new pages, combined steps, renamed routes). Claude Code can bridge the gap if you set up the session correctly.
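A triage prompt in this shape works; the numbers and names are illustrative:

```
The suite has 27 failing specs after the dashboard redesign. For each
failure, classify it: (a) selector drift only (same flow, element moved
or renamed); (b) flow change (a step was added, removed, or moved to
another page); (c) genuine regression. Fix category (a) directly. For
(b) and (c), don't touch the spec; give me a one-line summary per spec
so I can review them with the product team.
```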




What you get back is a triage list. The specs that need a three-line change get fixed quickly; the specs that need a rewrite (the flow changed) are flagged for product review, not silent patching. That second pile is where most bad test-suite debt accumulates — tests that were "fixed" by making them pass against the new UI without anyone checking whether they still test the right thing.

Common debugging patterns, concretely

"The selector is correct but the element isn't found"

Ask Claude to compare the locator string to the current DOM via MCP:
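For example (the locator is illustrative):

```
The spec uses page.getByRole('button', { name: 'Archive' }). Use
Playwright MCP to load the page this step runs on, dump the accessibility
tree around where that button should be, and tell me what the closest
matching elements actually are: their roles, accessible names, and
whether anything now wraps or replaces the old button.
```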




Most of the time, the answer is "the button is now inside a dropdown" or "the name is 'Archive item' not 'Archive.'"

"The test hangs on page.click()"

Almost always an element that's visible but not interactable — covered by a cookie banner, disabled, or waiting for a hydration pass. Ask Claude to check:
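A checklist prompt along these lines (the selector is a placeholder):

```
The test hangs on page.click('#submit-order'). Via Playwright MCP, load
the page and check, in order: (1) is the element present and unique,
(2) is it covered by another element at its center point (cookie banner,
modal, overlay), (3) is it disabled or aria-disabled, (4) does clicking
work after the page finishes hydrating. Report what you find before
proposing any change.
```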




"toBeVisible times out on something I can see in a screenshot"

The element is probably attached, but it or an ancestor has visibility: hidden, display: none, or a zero-size bounding box, so Playwright treats it as hidden even if a screenshot taken a moment earlier or later shows it rendered. Claude can spot this from the trace's HTML snapshot. Alternatively, use .toBeAttached() for presence and reserve .toBeVisible() for when you really mean "visible to the user."

"A network-dependent assertion fails only in CI"

CI is slower; the click fires before the network has responded. Claude's suggested fix — page.waitForResponse — is usually wrong. The better fix is to assert on the DOM outcome of the request (a row appearing, a status message updating), which Playwright's built-in auto-waiting handles. Push back on a waitForResponse patch and ask Claude to assert on the UI state instead.

Where Claude Code runs out of road

Claude Code is useful for debugging tests inside your repo. It is not useful for:

  • Understanding why a test failed in production. It doesn't have the session replay, the user's state, or the error logs. It can read your code; it can't read what happened on a real user's browser yesterday at 3pm.

  • Distinguishing a real regression from an environment blip. A test failed on Tuesday and passed on Wednesday. Was it flaky? A real bug that got fixed? An infra issue? Claude will generate plausible explanations for any of those without evidence.

  • Maintaining a suite through a redesign. The Planner/Generator/Healer loop helps, but every medium-to-large redesign needs a human to decide which flows are still meaningful.

  • Explaining a failure to someone who wasn't in the session. The debugging context lives in your Claude Code window. Once the session closes, the reasoning is gone.

Where Decipher fits

Decipher sits on the other side of the debugging problem: it generates and maintains E2E tests for your product, and when a test fails — in CI or in production — it tells you why. Every failure ships with a session video, the request/response data, a natural-language explanation, and impact data. Regression, environment fluke, or intentional change — you see the answer without re-running the suite locally or opening a trace.

Claude Code is a useful hands-on debugger inside your repo. Decipher handles the ongoing layer: test maintenance as flows change, managed infrastructure so you don't babysit browsers, and the production observability that catches what tests miss. The combination works better than either alone.

Practical tips

Keep a "known failures" file. When Claude Code diagnoses a root cause, paste the summary into a FAILURES.md in your test directory. Next time you hit a similar failure, include the file in the Claude context — it dramatically speeds up the second diagnosis.

Capture the full trace, not just screenshots. Screenshots alone lose the action timeline. trace: 'on-first-retry' costs a little disk space per failure and saves hours per incident.

Don't let Claude silently downgrade assertions. A common "fix" from a hurried session is replacing toBeVisible() with toBeAttached() or adding a waitForTimeout(2000). Both make tests green without making them correct. Call these out and ask for the real root cause.

Scope each session to one failure class. Debug one spec at a time, not "fix the suite." Multi-spec debugging sessions run into Claude's context window limits, and the quality of diagnoses drops sharply.

Use --repeat-each for flakes. npx playwright test spec.ts --repeat-each=20 --workers=4 is the fastest way to reproduce a flake locally. Feed the resulting traces to Claude in the same session.

FAQ

Q: Can Claude Code read Playwright trace files directly? A: Yes. The .zip is a structured bundle — Claude unzips and reads the action timeline, DOM snapshots, and network log. The richer the trace, the better the diagnosis.

Q: Should I let Claude Code run playwright test --debug? A: Only in an interactive session you're watching. --debug pauses at every step and waits for input. Unattended debug runs hang CI pipelines and don't produce usable output.

Q: Is the Playwright MCP server required to debug with Claude Code? A: Not for trace-based debugging. It's required when you need Claude to observe the live app — for selector drift, hydration issues, or UI-after-refactor debugging. For CI failures that reproduce only on the remote environment, MCP can only help if you can reproduce the env locally.

Q: What's the best way to stop Claude Code from jumping to a fix? A: Tell it up front. "Don't suggest a fix until we agree on root cause" as the first message of the session changes the dynamic. Without that instruction, Claude defaults to the most common fix for the observed failure — which is often wrong, because the observed failure is usually a downstream symptom.

Q: How does this compare to Playwright's Healer agent? A: The Healer is narrow and automatic — it rewrites failing selectors based on the current DOM. Claude Code in an interactive session is broader and slower, handling root-cause analysis, flow changes, and cross-spec issues the Healer can't touch. Use the Healer for selector drift at scale; use Claude Code when you need to actually understand why something broke.

Q: How does debugging with Claude Code compare to Decipher? A: Claude Code is the right tool when you have the code and want to fix a specific spec. Decipher is the right layer when the question is "what's happening in production that my suite isn't catching, and which tests are still pulling their weight six months in?" They're complementary: one is a local, code-level debugger; the other is the ongoing system that keeps the suite useful as the product changes.

Written by:

Michael Rosenfield

Co-founder
