What AI-Generated Playwright Tests Get Wrong in Production

AI can generate Playwright tests quickly. That part is real.

What is less real is the idea that generated tests are automatically good tests.

A generated Playwright suite can look impressive in a pull request. It can cover a login flow, a checkout path, a settings page, and a few happy-path assertions in minutes. It can even run successfully in CI the first time. But that does not mean it will hold up in production, and it definitely does not mean it is catching the failures that matter.

That gap is where a lot of teams get burned.

If your team is using Claude Code, Copilot, or another coding agent to create Playwright tests, the right question is not "Can AI write the test?" It is "What kind of test did it actually write?"

The core problem

Most AI-generated Playwright tests optimize for completion, not coverage quality.

The model is trying to produce something that compiles, runs, and appears reasonable. It is not accountable for whether the assertion is strong enough, whether the flow still reflects the real product behavior a month later, or whether the test would catch the kind of bug that costs you money.

That matters because production bugs rarely come from total page failure. They usually come from more subtle things:

  • the wrong item was saved

  • the UI looked successful but the backend write failed

  • the confirmation message appeared for the wrong state

  • the user completed the flow, but the side effect never happened

  • a redesign changed the meaning of the page without obviously breaking selectors

Generated Playwright tests often miss those failures.

1. They over-index on selectors and under-index on intent

The first failure mode is simple: AI is usually better at finding something to click than proving the user achieved the right outcome.

A generated test might do this:

  • navigate to the page

  • click a button

  • fill a form

  • wait for a toast

  • assert that some text is visible

That sounds fine until you ask what the test actually validated.

Did it prove the payment method was stored correctly? Did it prove the workflow saved the right model setting? Did it prove the onboarding change took effect in the database? Or did it just prove that the page showed some success state?

This is one of the reasons we wrote about how to use Claude Code to write Playwright tests with so much emphasis on grounding tests in the real DOM. Real selectors are better than guessed selectors. But even real selectors are not enough if the assertion is shallow.
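As a sketch, the difference looks something like this. Everything here is hypothetical: the selectors, labels, and the /settings page are stand-ins, not a real app.

```ts
import { test, expect } from '@playwright/test';

// Shallow: proves the page showed a success state.
test('save settings (shallow)', async ({ page }) => {
  await page.goto('/settings');
  await page.getByLabel('Default model').selectOption('gpt-4');
  await page.getByRole('button', { name: 'Save' }).click();
  // Asserts the toast, not the outcome.
  await expect(page.getByText('Settings saved')).toBeVisible();
});

// Stronger: proves the setting actually persisted.
test('save settings (outcome)', async ({ page }) => {
  await page.goto('/settings');
  await page.getByLabel('Default model').selectOption('gpt-4');
  await page.getByRole('button', { name: 'Save' }).click();
  // Reload wipes local state; whatever survives came from the backend.
  await page.reload();
  await expect(page.getByLabel('Default model')).toHaveValue('gpt-4');
});
```

Both tests click the same button. Only the second one would catch a save that silently failed.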

2. They generate visibility checks where business assertions should be

A lot of AI-generated Playwright output looks like this in spirit:

  • click "Save"

  • expect success banner to be visible

  • expect redirected URL to match /settings

Those checks are easy to produce, and sometimes they are useful. But they are also cheap. They validate the shell of the interaction, not the business outcome.

In production, the more important questions are usually:

  • was the correct record created?

  • was the right account updated?

  • did the state persist after reload?

  • did the backend emit the expected side effect?

  • did the UI reflect the actual saved data and not cached local state?

AI does not naturally invent those checks unless you force it to.

That is why teams often get false confidence from generated tests. The suite is green, but the tests are green for the wrong reasons.
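One way to force a business assertion is to ask the backend what was actually written, using Playwright's `request` fixture. The endpoint, fields, and `TEST_USER_ID` variable below are hypothetical placeholders for whatever your app exposes.

```ts
import { test, expect } from '@playwright/test';

test('creating a record persists the right data', async ({ page, request }) => {
  await page.goto('/records/new');
  await page.getByLabel('Name').fill('Q3 forecast');
  await page.getByRole('button', { name: 'Create' }).click();

  // Don't stop at the banner: check the persisted record.
  const res = await request.get('/api/records?name=Q3%20forecast');
  expect(res.ok()).toBeTruthy();
  const [record] = await res.json();
  expect(record.name).toBe('Q3 forecast');
  // The right account, not just any account.
  expect(record.ownerId).toBe(process.env.TEST_USER_ID);
});
```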

3. They struggle with the API-to-UI contract

Modern product bugs often live in the gap between frontend behavior and backend state.

The UI says "saved." The API returned 200. But the persisted record is wrong, incomplete, or scoped to the wrong entity. Or the backend updated correctly but the UI is still rendering stale data.

Purely generated Playwright tests usually do not probe that boundary deeply enough.

The stronger pattern is to combine UI actions with API verification or controlled setup/teardown. We talked about this in our post on how to use Claude Code to write tests: API and E2E. The important point is that the highest-value test is often not just "click and assert visible text." It is "create known state, perform the UI action, then verify the persisted result."

That takes more intent than most one-shot AI generation gives you.
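Sketched out, the pattern is three steps: seed known state through the API, act through the UI, then verify the persisted result through the API again. The `/api/test/customers` setup endpoint and the response shapes here are assumptions about your backend, not Playwright features.

```ts
import { test, expect } from '@playwright/test';

test('updating a payment method writes through to the backend', async ({ page, request }) => {
  // 1. Create known state through the API, not the UI.
  const setup = await request.post('/api/test/customers', { data: { plan: 'pro' } });
  const { id } = await setup.json();

  // 2. Perform the action through the UI.
  await page.goto(`/customers/${id}/billing`);
  await page.getByLabel('Card number').fill('4242 4242 4242 4242');
  await page.getByRole('button', { name: 'Update card' }).click();

  // 3. Verify the persisted result, not the toast.
  const res = await request.get(`/api/customers/${id}`);
  const customer = await res.json();
  expect(customer.paymentMethod.last4).toBe('4242');
});
```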

4. They fail quietly when the product evolves

When people say AI-generated Playwright tests are brittle, they usually mean selectors break.

That is true, but it is not the most dangerous version of brittleness.

The more dangerous case is when the product changes enough that the test still runs, but no longer represents the real workflow you meant to protect.

Examples:

  • a new onboarding step changes the critical path

  • a settings page gets reorganized so the same click path now configures something else

  • a modal becomes optional for one segment of users but required for another

  • a button label stays the same while the underlying action changes

In those cases, the test can keep passing and still stop being a good regression check.

That is why "self-healing" alone is not the answer either. If a system only repairs selectors, it does not know whether the repaired test still captures the right user intent.

5. Generated suites tend to be happy-path heavy

This is one of the most common patterns we see in AI-written tests: strong initial coverage of the default path, weak coverage of the cases that actually hurt you.

The model will happily write:

  • sign up succeeds

  • login succeeds

  • checkout succeeds

  • settings save succeeds

What it is much less likely to generate on its own:

  • a race condition between two saves

  • partial failure after a backend timeout

  • a malformed but technically accepted input

  • account-scoping problems for multi-tenant data

  • stale state after refresh

  • flows that differ by plan, role, or entitlement

Those are the places where production regressions live.
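Some of these failure modes can be provoked deliberately. For example, Playwright's route interception can simulate a backend timeout on a save endpoint (the URL pattern and selectors are hypothetical):

```ts
import { test, expect } from '@playwright/test';

test('save surfaces an error when the backend times out', async ({ page }) => {
  // Force the save endpoint to fail the way a timeout would.
  await page.route('**/api/settings', route => route.abort('timedout'));

  await page.goto('/settings');
  await page.getByRole('button', { name: 'Save' }).click();

  // The UI must not pretend the save succeeded.
  await expect(page.getByText('Settings saved')).not.toBeVisible();
  await expect(page.getByRole('alert')).toBeVisible();
});
```

Tests like this rarely come out of one-shot generation, because the model has no reason to believe the backend ever fails.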

6. The maintenance cost shows up later

The first draft is not the expensive part. The expensive part is carrying the suite forward.

This is where AI-generated Playwright tests often underperform compared with how they are sold.

At the start, the team sees leverage: dozens of tests generated quickly. A few weeks later, the reality shows up:

  • flaky timing behavior

  • fragile assumptions about copy and layout

  • duplicated tests with slightly different selectors

  • assertions that are too weak to be useful but too strong to ignore when they fail

  • engineers spending time re-prompting the model instead of improving test architecture

This is one reason the comparison between frameworks misses the larger issue. As we argued in Playwright vs Cypress vs Selenium in 2026, the hard part is not choosing the browser automation framework. The hard part is owning the maintenance model.

What to do instead

If you are going to use AI to generate Playwright tests, use it with a stricter standard.

1. Treat generated tests as drafts, not finished assets

The model should give you a starting point. It should not get automatic trust.

Review every test for:

  • what user behavior it is meant to protect

  • whether the assertions validate the business outcome, not just visibility

  • whether the setup is deterministic

  • whether the flow would still make sense after a moderate UI change

2. Strengthen assertions around state, not presentation

If the most important thing in the test is a toast banner, the test is probably weak.

Prefer checks like:

  • reloading the page and verifying persistence

  • verifying the API response or resulting backend state

  • checking that the correct entity changed, not just that an entity changed

  • asserting role- or plan-specific behavior
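A scoping check is one concrete version of "the correct entity changed, not just that an entity changed." This sketch assumes hypothetical project endpoints and a seeded sibling project named "Beta":

```ts
import { test, expect } from '@playwright/test';

test('renaming a project only touches that project', async ({ page, request }) => {
  await page.goto('/projects/alpha/settings');
  await page.getByLabel('Project name').fill('Alpha v2');
  await page.getByRole('button', { name: 'Save' }).click();

  // The correct entity changed...
  const alpha = await (await request.get('/api/projects/alpha')).json();
  expect(alpha.name).toBe('Alpha v2');

  // ...and a sibling entity did not.
  const beta = await (await request.get('/api/projects/beta')).json();
  expect(beta.name).toBe('Beta');
});
```

The second assertion is the one that catches multi-tenant scoping bugs, and it is almost never generated unprompted.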

3. Use AI where it is strong

AI is genuinely useful for:

  • generating a first-pass test skeleton

  • finding real selectors from the DOM

  • expanding obvious edge cases when given context

  • converting a manual QA flow into a starting test

  • producing variations of an existing strong pattern

It is weaker at:

  • inferring business-critical assertions without context

  • deciding what outcomes truly matter

  • maintaining long-lived test intent through product changes

4. Build for the full lifecycle, not just generation

A testing workflow is not just creation. It is:

  • generation

  • execution

  • maintenance

  • triage

  • regression explanation

  • production follow-up when something escapes

That is the place where most AI-generated Playwright stories start to break down.

The practical takeaway

AI-generated Playwright tests are useful. They are just not reliable by default.

They tend to get four things wrong in production:

  • they prove interaction, not intent

  • they prefer visibility checks over business checks

  • they under-test the API-to-UI boundary

  • they accumulate maintenance debt faster than teams expect

If you treat them as finished output, you will probably end up with a green CI pipeline that still lets the important bugs through.

If you treat them as a starting draft and apply pressure on assertions, state validation, and maintenance, they can save real time.

That is the distinction that matters.

Where Decipher fits

If your goal is simply to get Playwright code generated faster, coding agents can help.

If your goal is to own fewer brittle scripts, spend less time on maintenance, and catch real regressions before users do, you need more than code generation. You need a system that understands whether the workflow is still correct, not just whether the selector still exists.

That is the problem Decipher is built to solve.

If you want to see how teams are using AI to generate tests without inheriting the usual maintenance trap, book a demo or explore how Decipher handles the full lifecycle from generation to production monitoring.

Written by:

Michael Rosenfield

Co-founder
