What AI-Generated Playwright Tests Get Wrong in Production

AI can generate Playwright tests quickly. That part is real.
What is less real is the idea that generated tests are automatically good tests.
A generated Playwright suite can look impressive in a pull request. It can cover a login flow, a checkout path, a settings page, and a few happy-path assertions in minutes. It can even run successfully in CI the first time. But that does not mean it will hold up in production, and it definitely does not mean it is catching the failures that matter.
That gap is where a lot of teams get burned.
If your team is using Claude Code, Copilot, or another coding agent to create Playwright tests, the right question is not "Can AI write the test?" It is "What kind of test did it actually write?"
The core problem
Most AI-generated Playwright tests optimize for completion, not coverage quality.
The model is trying to produce something that compiles, runs, and appears reasonable. It is not accountable for whether the assertion is strong enough, whether the flow still reflects the real product behavior a month later, or whether the test would catch the kind of bug that costs you money.
That matters because production bugs rarely come from total page failure. They usually come from more subtle things:
the wrong item was saved
the UI looked successful but the backend write failed
the confirmation message appeared for the wrong state
the user completed the flow, but the side effect never happened
a redesign changed the meaning of the page without obviously breaking selectors
Generated Playwright tests often miss those failures.
1. They over-index on selectors and under-index on intent
The first failure mode is simple: AI is usually better at finding something to click than proving the user achieved the right outcome.
A generated test might do this:
navigate to the page
click a button
fill a form
wait for a toast
assert that some text is visible
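In Playwright terms, that pattern often looks like the sketch below. The page route, labels, and toast copy are hypothetical, not taken from any real app:

```typescript
import { test, expect } from '@playwright/test';

// A typical AI-generated "happy path" test (hypothetical app).
test('saving settings shows success', async ({ page }) => {
  await page.goto('/settings');
  await page.getByLabel('Display name').fill('Ada');
  await page.getByRole('button', { name: 'Save' }).click();

  // The only assertion: a toast appeared.
  // Nothing here proves the data was actually saved.
  await expect(page.getByText('Saved successfully')).toBeVisible();
});
```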
That sounds fine until you ask what the test actually validated.
Did it prove the payment method was stored correctly? Did it prove the workflow saved the right model setting? Did it prove the onboarding change took effect in the database? Or did it just prove that the page showed some success state?
This is one of the reasons we wrote about how to use Claude Code to write Playwright tests with so much emphasis on grounding tests in the real DOM. Real selectors are better than guessed selectors. But even real selectors are not enough if the assertion is shallow.
2. They generate visibility checks where business assertions should be
A lot of AI-generated Playwright output looks like this in spirit:
click "Save"
expect success banner to be visible
expect redirected URL to match /settings
Those checks are easy to produce, and sometimes they are useful. But they are also cheap. They validate the shell of the interaction, not the business outcome.
In production, the more important questions are usually:
was the correct record created?
was the right account updated?
did the state persist after reload?
did the backend emit the expected side effect?
did the UI reflect the actual saved data and not cached local state?
AI does not naturally invent those checks unless you force it to.
That is why teams often get false confidence from generated tests. The suite is green, but the tests are green for the wrong reasons.
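One way to harden that kind of check is to force the assertion through a reload, so the value must come back from the server rather than from local component state. A minimal sketch, assuming a hypothetical settings page (the labels and routes are illustrative):

```typescript
import { test, expect } from '@playwright/test';

test('display name actually persists', async ({ page }) => {
  await page.goto('/settings');
  await page.getByLabel('Display name').fill('Ada Lovelace');
  await page.getByRole('button', { name: 'Save' }).click();
  await expect(page.getByText('Saved')).toBeVisible();

  // Reload to bypass any client-side cache: if the write silently
  // failed, this assertion fails even though the toast appeared.
  await page.reload();
  await expect(page.getByLabel('Display name')).toHaveValue('Ada Lovelace');
});
```

The reload is the whole point: it converts a presentation check into a persistence check without touching the backend directly.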
3. They struggle with the API-to-UI contract
Modern product bugs often live in the gap between frontend behavior and backend state.
The UI says "saved." The API returned 200. But the persisted record is wrong, incomplete, or scoped to the wrong entity. Or the backend updated correctly but the UI is still rendering stale data.
Purely generated Playwright tests usually do not probe that boundary deeply enough.
The stronger pattern is to combine UI actions with API verification or controlled setup/teardown. We talked about this in our post on how to use Claude Code to write tests: API and E2E. The important point is that the highest-value test is often not just "click and assert visible text." It is "create known state, perform the UI action, then verify the persisted result."
That takes more intent than most one-shot AI generation gives you.
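That "create known state, act in the UI, verify the persisted result" pattern can be sketched with Playwright's built-in `request` fixture. Every endpoint and field name below is an assumption for illustration, including the test-only setup route:

```typescript
import { test, expect } from '@playwright/test';

test('saving a card persists the right record', async ({ page, request }) => {
  // Known starting state: create the account via the API, not the UI.
  const setup = await request.post('/api/test/accounts', {
    data: { plan: 'pro' },
  });
  const { accountId } = await setup.json();

  // Perform the action through the UI, like a user would.
  await page.goto(`/accounts/${accountId}/billing`);
  await page.getByLabel('Card number').fill('4242 4242 4242 4242');
  await page.getByRole('button', { name: 'Save card' }).click();

  // Verify the persisted result, not just the UI shell.
  const res = await request.get(`/api/accounts/${accountId}/payment-methods`);
  expect(res.ok()).toBeTruthy();
  const methods = await res.json();
  expect(methods).toHaveLength(1);
  expect(methods[0].last4).toBe('4242');
});
```

A test like this catches the "UI says saved, backend wrote nothing" class of bug that a visibility assertion never will.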
4. They fail quietly when the product evolves
When people say AI-generated Playwright tests are brittle, they usually mean selectors break.
That is true, but it is not the most dangerous version of brittleness.
The more dangerous case is when the product changes enough that the test still runs, but no longer represents the real workflow you meant to protect.
Examples:
a new onboarding step changes the critical path
a settings page gets reorganized so the same click path now configures something else
a modal becomes optional for one segment of users but required for another
a button label stays the same while the underlying action changes
In those cases, the test can keep passing and still stop being a good regression check.
That is why "self-healing" alone is not the answer either. If a system only repairs selectors, it does not know whether the repaired test still captures the right user intent.
5. Generated suites tend to be happy-path heavy
This is one of the most common patterns we see in AI-written tests: strong initial coverage of the default path, weak coverage of the cases that actually hurt you.
The model will happily write:
sign up succeeds
login succeeds
checkout succeeds
settings save succeeds
What it is much less likely to generate on its own:
a race condition between two saves
partial failure after a backend timeout
a malformed but technically accepted input
account-scoping problems for multi-tenant data
stale state after refresh
flows that differ by plan, role, or entitlement
Those are the places where production regressions live.
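Some of these variations are cheap to cover once you ask for them explicitly. Plan- or role-specific behavior, for example, can be parameterized in a loop. The plans, routes, and button copy below are hypothetical:

```typescript
import { test, expect } from '@playwright/test';

// A one-shot generation usually covers only the default plan.
// Looping over entitlements covers the variation that actually regresses.
for (const { plan, canExport } of [
  { plan: 'free', canExport: false },
  { plan: 'pro', canExport: true },
]) {
  test(`export is ${canExport ? 'enabled' : 'gated'} on ${plan}`, async ({ page }) => {
    await page.goto(`/test/login-as?plan=${plan}`); // hypothetical test-only route
    await page.goto('/reports');

    const exportButton = page.getByRole('button', { name: 'Export CSV' });
    if (canExport) {
      await expect(exportButton).toBeEnabled();
    } else {
      await expect(exportButton).toBeDisabled();
    }
  });
}
```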
6. The maintenance cost shows up later
The first draft is not the expensive part. The expensive part is carrying the suite forward.
This is where AI-generated Playwright tests often underperform compared with how they are sold.
At the start, the team sees leverage: dozens of tests generated quickly. A few weeks later, the reality shows up:
flaky timing behavior
fragile assumptions about copy and layout
duplicated tests with slightly different selectors
assertions that are too weak to be useful but too strong to ignore when they fail
engineers spending time re-prompting the model instead of improving test architecture
This is one reason the comparison between frameworks misses the larger issue. As we argued in Playwright vs Cypress vs Selenium in 2026, the hard part is not choosing the browser automation framework. The hard part is owning the maintenance model.
What to do instead
If you are going to use AI to generate Playwright tests, use it with a stricter standard.
1. Treat generated tests as drafts, not finished assets
The model should give you a starting point. It should not get automatic trust.
Review every test for:
what user behavior it is meant to protect
whether the assertions validate the business outcome, not just visibility
whether the setup is deterministic
whether the flow would still make sense after a moderate UI change
2. Strengthen assertions around state, not presentation
If the most important thing in the test is a toast banner, the test is probably weak.
Prefer checks like:
reloading the page and verifying persistence
verifying the API response or resulting backend state
checking that the correct entity changed, not just that an entity changed
asserting role- or plan-specific behavior
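"The correct entity changed" is checkable by capturing the mutation the UI actually sends and asserting on its scope. A sketch using `page.waitForRequest`, with a hypothetical account id and endpoint:

```typescript
import { test, expect } from '@playwright/test';

test('save targets the account the user is viewing', async ({ page }) => {
  // Register the listener before the click so the request is not missed.
  const savePromise = page.waitForRequest(
    (req) => req.url().includes('/api/accounts/') && req.method() === 'PATCH',
  );

  await page.goto('/accounts/acct_123/settings');
  await page.getByLabel('Billing email').fill('ops@example.com');
  await page.getByRole('button', { name: 'Save' }).click();

  // The correct entity changed: the PATCH targets acct_123,
  // not a default or previously cached account id.
  const saveRequest = await savePromise;
  expect(saveRequest.url()).toContain('/api/accounts/acct_123');
  expect(saveRequest.postDataJSON().billingEmail).toBe('ops@example.com');
});
```

This catches account-scoping bugs, where a save succeeds but writes to the wrong tenant, which a success banner assertion would wave through.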
3. Use AI where it is strong
AI is genuinely useful for:
generating a first-pass test skeleton
finding real selectors from the DOM
expanding obvious edge cases when given context
converting a manual QA flow into a starting test
producing variations of an existing strong pattern
It is weaker at:
inferring business-critical assertions without context
deciding what outcomes truly matter
maintaining long-lived test intent through product changes
4. Build for the full lifecycle, not just generation
A testing workflow is not just creation. It is:
generation
execution
maintenance
triage
regression explanation
production follow-up when something escapes
That is the place where most AI-generated Playwright stories start to break down.
The practical takeaway
AI-generated Playwright tests are useful. They are just not reliable by default.
They tend to get four things wrong in production:
they prove interaction, not intent
they prefer visibility checks over business checks
they under-test the API-to-UI boundary
they accumulate maintenance debt faster than teams expect
If you treat them as finished output, you will probably end up with a green CI pipeline that still lets the important bugs through.
If you treat them as a starting draft and apply pressure on assertions, state validation, and maintenance, they can save real time.
That is the distinction that matters.
Where Decipher fits
If your goal is simply to get Playwright code generated faster, coding agents can help.
If your goal is to own fewer brittle scripts, spend less time on maintenance, and catch real regressions before users do, you need more than code generation. You need a system that understands whether the workflow is still correct, not just whether the selector still exists.
That is the problem Decipher is built to solve.
If you want to see how teams are using AI to generate tests without inheriting the usual maintenance trap, book a demo or explore how Decipher handles the full lifecycle from generation to production monitoring.
Written by:

Michael Rosenfield
Co-founder