How to Test AI Features in Your App: A 2026 Guide to QA-ing LLM-Powered Features

In 2026, "testing your product" has quietly become two jobs. The first is the traditional one: unit tests, integration tests, E2E specs. The second is new — making sure the AI features you just shipped behave the way you want them to across real users, real prompts, and a model vendor that updates its own weights on its own schedule.
Most teams are underinvested in the second job. The tools are newer, the failure modes are different, and the instinct from traditional testing — "pin down every input, assert every output" — doesn't transfer cleanly to a system that's nondeterministic by design. The result is a category of bugs that ships to production untested: subtle hallucinations, tone regressions, the AI answering the wrong question, and costs that silently spike when a prompt change triples the average token count.
This guide is a practical 2026 playbook for testing AI features. It covers what's actually worth testing, how to structure the test pyramid for LLM-powered code, the tooling that works, what breaks the moment the model vendor changes something, and the part of the job that no test can do — observing what real users actually send your AI feature and how it responds.
Why testing AI features is different
Every bug in a traditional feature has a root cause you can chase in code. An AI feature fails in ways code doesn't:
Nondeterminism. The same prompt produces different outputs. A test that pins down "the response must say X" fails on a perfectly good variation of X.
Model drift. Providers update their models (sometimes silently via a pinned alias). Behavior that passed yesterday can regress overnight.
Long-tail inputs. Your test fixtures cover the prompts you imagined. Users send the prompts you didn't imagine. A feature that works for 95% of inputs can be the feature users complain about, because the 5% is a clear signal and the 95% is invisible.
Correctness is fuzzy. "Is this response good?" rarely has a boolean answer. It depends on intent, tone, accuracy, and context — most of which your test harness doesn't know.
Cost and latency as first-class failure modes. A prompt change that keeps responses correct but triples the token count is a regression you'll only see in billing, not in tests.
Effective AI testing acknowledges each of these. The goal isn't to pretend LLM outputs are deterministic — it's to build a test surface that catches the real failure modes without fighting the probabilistic nature of the system.
What's worth testing
Not everything about an LLM-powered feature needs a test. The high-value surfaces are:
Prompt construction. The code that assembles the user input, system message, retrieved context, and tool definitions into a request. This is normal code with normal failure modes — test it like normal code.
Input guards. PII redaction, input length caps, content-filter pre-checks. These have deterministic pass/fail and should fail loudly.
Tool calls. If the model calls your tools, the tool handlers must do the right thing for the right arguments. Test the handlers independently of the model.
Output handling. Parsing, validation, fallbacks, and UI rendering of model outputs. Does malformed JSON crash the UI? Does a refusal get shown to the user as a helpful message?
Behavioral expectations. The model answers in the right language, stays on-topic, refuses the right things, follows the tone. These need evals, not unit tests.
Cost and latency envelopes. A regression in tokens per call or time to first token is a real bug even if the output is fine.
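The output-handling surface is the easiest to make concrete. Here's a minimal sketch (all names hypothetical) of a parser that validates a model's JSON reply and falls back to a safe message instead of crashing the UI:

```typescript
// Hypothetical output handler: parse the model's JSON reply, and fall back
// to a safe user-facing message on malformed or incomplete output.
type Suggestion = { title: string; url: string };

function parseSuggestion(raw: string): Suggestion | { fallback: string } {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.title === "string" && typeof parsed.url === "string") {
      return { title: parsed.title, url: parsed.url };
    }
  } catch {
    // Malformed JSON: fall through to the fallback below.
  }
  return { fallback: "Sorry, I couldn't generate a suggestion." };
}
```

Both branches are deterministic, so this is ordinary unit-test material: feed it valid JSON, broken JSON, and JSON with missing fields.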
The AI testing pyramid
Traditional pyramid: unit tests at the base, integration in the middle, E2E at the top. For AI features, the shape is similar but each layer has its own flavor.
Layer 1: Deterministic unit tests
Everything that doesn't involve the model. Prompt builders, context retrievers, tool handlers, output parsers, validators. These are pure functions with expected inputs and outputs; run them in Jest, Vitest, or whatever your stack uses.
This layer should cover most of your AI-feature code. If it doesn't, you have too much logic sitting behind the model.
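A sketch of what this layer looks like, assuming a hypothetical buildSupportPrompt helper: ordinary string assembly, testable with ordinary assertions and no model call anywhere.

```typescript
// Hypothetical prompt builder: plain code that assembles the system message,
// retrieved context, and user input into a single prompt string.
function buildSupportPrompt(userMessage: string, docs: string[]): string {
  const context = docs.length
    ? `Relevant docs:\n${docs.map((d, i) => `[${i + 1}] ${d}`).join("\n")}`
    : "No relevant docs found.";
  return [
    "You are a support assistant. Answer only from the docs.",
    context,
    `User: ${userMessage}`,
  ].join("\n\n");
}
```

In Jest or Vitest, the assertions are the usual `expect(prompt).toContain(...)` checks: context is numbered and included, the empty-retrieval case is handled, the user message lands at the end.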
Layer 2: Fixture-based integration tests
The middle layer mocks the model with recorded responses. Run the real prompt-building, retrieval, and output-handling code, but replace the model call with a fixture.
Fixture tests are where most teams find their bugs. They're fast, deterministic, and exercise the full code path. Collect fixtures from real production traces — they represent behaviors that actually happen.
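A minimal sketch of the pattern, with hypothetical names: the model client is an interface, so the test injects a recorded fixture while the real prompt-building and output-handling code runs.

```typescript
// The model client is an interface so tests can inject a recorded response
// instead of making a live call. All names here are hypothetical.
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

// A recorded production response, checked into the repo as a fixture.
const fixtureClient: ModelClient = {
  complete: async () =>
    '{"answer":"Refunds take 5 business days.","source":"refund-policy"}',
};

// The real code path: prompt construction, model call, output handling.
async function answerQuestion(client: ModelClient, question: string): Promise<string> {
  const raw = await client.complete(`Answer briefly: ${question}`);
  const parsed = JSON.parse(raw);
  return `${parsed.answer} (source: ${parsed.source})`;
}
```

Swapping the fixture for one with malformed JSON or a refusal gives you the negative-path tests for free, using the same harness.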
Layer 3: Evals against real models
Evals run the real model against a curated input set and score the outputs. They're slow, flaky, and expensive — so run them less often (pre-release, on prompt changes, on model version bumps), not on every commit.
A minimum viable eval setup:
A versioned dataset of {input, context, expected_behavior} records (200-500 is plenty to start).
An automated scorer: LLM-as-judge for fuzzy checks ("does this answer stay on-topic?"), assertion functions for deterministic ones ("does it include the source URL?").
A report that compares the current run against the last baseline and flags regressions, not raw scores.
Tools like Promptfoo, LangSmith, and OpenAI evals handle the scaffolding. Pick one and commit — they're all better than rolling your own.
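Whichever tool you pick, the scoring core has the same shape. A sketch with hypothetical names, showing a deterministic assertion scorer over a versioned dataset and the pass rate you'd compare against the last baseline:

```typescript
// Hypothetical eval scaffolding: dataset records, a deterministic scorer,
// and a pass rate to diff against the previous baseline run.
type EvalCase = { input: string; context: string; expectedBehavior: string };
type Scorer = (output: string, c: EvalCase) => boolean;

// Deterministic assertion scorer: "does it include the source URL?"
const includesSourceUrl: Scorer = (output) => /https?:\/\/\S+/.test(output);

function runEval(cases: EvalCase[], outputs: string[], scorer: Scorer): number {
  const passed = cases.filter((c, i) => scorer(outputs[i], c)).length;
  return passed / cases.length;
}
```

The report layer then diffs this rate against the stored baseline and flags only regressions, which is the part the off-the-shelf tools do well.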
Layer 4: E2E tests for the full user flow
Your AI feature lives inside a real UI. The end-to-end flow — user types in the input, sees a loading state, sees the response, does something with it — needs an E2E spec like any other feature.
Two caveats:
Point the E2E suite at a recording proxy or a sandbox model endpoint — not the live model. Playwright specs shouldn't be paying for inference every run.
Keep assertions loose. Match on intent (/refund/i), not exact strings. Pinning a specific output guarantees a flaky spec.
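Intent-level matching can live in a small helper shared across specs. A sketch (hypothetical name) that passes on any phrasing covering the expected topics:

```typescript
// Loose assertion helper for E2E specs: the response passes if it covers
// every expected intent, regardless of exact wording.
function matchesIntent(response: string, intents: RegExp[]): boolean {
  return intents.every((re) => re.test(response));
}
```

In a Playwright spec this is the difference between asserting on `/refund/i` and `/5 days/i` versus pinning one exact sentence the model may never produce again.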
LLM-as-judge: when and how
For fuzzy correctness checks, a second LLM call can grade the first. Example: asking a stronger model whether a support-bot response is on-topic, polite, and factually consistent with the retrieved documents.
It works, with rules:
Use a different model as the judge. Grading with the same model that generated the output inflates scores.
Ask for a rubric, not a number. "Score 1-10" produces noisy output. "Is the answer on-topic: yes/no" is more reliable.
Calibrate against humans. Hand-label 50 outputs. Compare the judge's verdict against your own. If they disagree more than 20% of the time, the rubric needs work.
Log judge disagreements. Any time the judge flags a regression, store the case. Over time, you build a gold set of real failures that doubles as a regression fixture.
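A sketch of the rubric side, with hypothetical names: a binary-rubric judge prompt and a strict verdict parser. The judge call itself is whichever stronger model your SDK exposes, so it's left abstract here.

```typescript
// Hypothetical LLM-as-judge helpers: a binary rubric ("YES"/"NO"), and a
// strict parser that refuses to guess when the judge goes off-script.
function buildJudgePrompt(question: string, answer: string, docs: string): string {
  return [
    "You are grading a support-bot answer. Reply with exactly YES or NO.",
    "Rubric: is the answer on-topic for the question and consistent with the docs?",
    `Question: ${question}`,
    `Docs: ${docs}`,
    `Answer: ${answer}`,
  ].join("\n");
}

function parseVerdict(judgeOutput: string): boolean | null {
  const t = judgeOutput.trim().toUpperCase();
  if (t.startsWith("YES")) return true;
  if (t.startsWith("NO")) return false;
  return null; // Judge didn't follow the rubric: log the case, don't guess.
}
```

Returning `null` on an off-rubric judge response is deliberate: those cases go into the disagreement log rather than silently counting as a pass or fail.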
The failure modes tests actually catch
In practice, a well-structured AI test suite catches:
Prompt regressions. Someone edits the system prompt, tokens per call jumps 3x, the fixture tests catch the cost change before it ships.
Output parser crashes. The model returns a response the parser didn't expect. Integration fixtures catch this.
Tool handler bugs. The model calls lookupRefund({ orderId: "invalid" }). A unit test on the handler exercises the branch.
UI rendering bugs. The response contains markdown the renderer doesn't handle. E2E specs surface this.
Topical drift. A new prompt starts answering off-topic questions. Evals with an on-topic judge flag the drift.
Refusal regressions. The model stops refusing prompts it should refuse. Safety evals catch this if you maintained them.
The failure modes tests miss
This is the part most AI-feature teams don't internalize until after an incident.
Prompts you didn't imagine. Your eval set has 300 inputs. Your feature processes 50,000 prompts a day. The prompts that matter most are the ones no engineer thought to include.
Hallucinations on long-tail queries. The model confidently invents a number, a policy, a URL. Unless the eval set happens to include that exact prompt, the test suite never sees it.
Tone-deaf outputs. The response is technically correct but awful — lecturing the user, using jargon, getting cute when the user is frustrated. These are obvious in a replay and invisible in a fixture.
User-side confusion. The user typed something the model couldn't handle, re-tried three times, and gave up. The response was "correct" for each input, but the feature still failed the user.
Silent model changes. A pinned alias points at a new underlying model. Outputs shift in ways evals don't flag because the eval set wasn't designed to surface that delta.
Every one of these is caught by watching real sessions, not by running tests.
Observing production AI features
A usable production observability loop for AI features has four parts:
Prompt + completion logs. Every call, with input, output, model version, token counts, latency, and a redaction pass on PII. Your AI SDK usually emits these already.
User context. Which user sent the prompt, what they were doing before, what they did after. This is where a session-replay product becomes useful.
Outcome signals. Did the user click the suggested action? Close the chat? Re-prompt with frustration? These are the real labels on your production data.
Automated triage. A pass that groups failures by pattern — "30 sessions today with a retry within 5 seconds of the first response" — so the humans triage patterns, not raw logs.
Without this, the feedback loop on your AI feature is "customer emails us when something's really bad." That's a very slow loop.
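The triage pass in particular is plain code. A sketch with hypothetical event shapes, flagging sessions where the user re-prompted within 5 seconds of a response:

```typescript
// Hypothetical triage pass: group prompt/response events by session and flag
// sessions with a quick retry (a new prompt within 5s of the last response).
type TraceEvent = { sessionId: string; type: "prompt" | "response"; ts: number };

function quickRetrySessions(events: TraceEvent[], windowMs = 5000): string[] {
  const bySession = new Map<string, TraceEvent[]>();
  for (const e of events) {
    const list = bySession.get(e.sessionId) ?? [];
    list.push(e);
    bySession.set(e.sessionId, list);
  }
  const flagged: string[] = [];
  for (const [id, evs] of bySession) {
    evs.sort((a, b) => a.ts - b.ts);
    for (let i = 0; i < evs.length - 1; i++) {
      if (
        evs[i].type === "response" &&
        evs[i + 1].type === "prompt" &&
        evs[i + 1].ts - evs[i].ts < windowMs
      ) {
        flagged.push(id);
        break;
      }
    }
  }
  return flagged;
}
```

The output is a pattern ("30 sessions today with a quick retry"), not raw logs, which is what makes the human triage step tractable.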
Where Decipher fits
Decipher gives AI-feature teams the production side of this loop. It records real sessions — the user's prompt, your feature's response, what they did next — and groups them by behavior. The sessions where the user re-tried three times. The sessions where the response looked fine but the user left. The sessions where the model hallucinated a number that nobody on your team wrote a fixture for.
It also pairs with your existing test suite: Decipher watches for regressions that tests didn't catch, and when a real user hits one, you get a video of the flow, the full prompt-and-response pair, and a natural-language explanation of what went wrong. That's the layer no test suite can fill, and it's the layer AI features need most.
Evals tell you whether your fixed test set still passes. Production observability tells you whether the product your users are actually touching is still good. You need both.
Practical tips
Seed your eval set from production. Don't write eval inputs from your imagination. Pull 200 real prompts out of production logs, remove PII, and make those the baseline. Refresh every release.
Pin model versions explicitly. Aliases like gpt-4 or claude-sonnet drift. Pin to a specific version in production and in evals. Bump both together, with a full eval run.
Track token counts as a test assertion. expect(usage.total_tokens).toBeLessThan(2000) is a valid regression test. Prompt changes that 2x the cost are real regressions.
Log prompt hashes, not raw prompts. System prompts can contain sensitive or copyrighted content. Hash them and log the hash + version so you can correlate behavior changes to deploys without leaking prompts in logs.
Write one adversarial eval set and reuse it. Prompt injections, jailbreak attempts, off-topic steering, nonsense inputs. Keep a small dedicated set (100-200) and run it on every model version bump.
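The prompt-hash tip is a one-liner in practice. A sketch (hypothetical function name) using Node's built-in crypto module:

```typescript
import { createHash } from "node:crypto";

// Log a version tag plus a short hash of the system prompt, so behavior
// changes can be correlated to deploys without the prompt text ever
// appearing in logs.
function promptFingerprint(systemPrompt: string, version: string): string {
  const hash = createHash("sha256").update(systemPrompt).digest("hex").slice(0, 12);
  return `${version}:${hash}`;
}
```

The fingerprint is deterministic, so the same deployed prompt always logs the same value, and any edit to the prompt shows up as a new fingerprint in the next deploy's traffic.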
FAQ
Q: Do I need evals if I already have integration tests? A: Yes. Integration tests with fixtures verify that your code handles specific responses. Evals verify that the model still produces reasonable responses. Those are different questions.
Q: How often should evals run? A: On every prompt change, every model version change, and on a nightly schedule against production-seeded data. Not on every PR — they're too slow and noisy to gate a merge on.
Q: Is LLM-as-judge reliable enough to use in CI? A: For binary rubrics (on-topic / off-topic, refused / answered) with a stronger model judge, yes. For numerical scores, no. Calibrate against hand-labels once a quarter so you notice when the judge drifts.
Q: Should my Playwright E2E specs hit the real model? A: No. Use a sandbox endpoint, a recording proxy, or stubbed responses. Specs that pay real inference costs every run become the flakiest and most expensive part of the suite.
Q: How do I test an AI feature with streaming responses? A: Split it into two: test the stream-handling code (buffer chunks, parse, render) with fixtures; test the end state of the UI after streaming completes with Playwright. Don't assert on partial stream states — they're inherently racy.
Q: What's the single highest-leverage thing to add to an AI testing setup today? A: A production eval seeded from real user prompts. Everything else is downstream of knowing what your users actually ask. Without that, you're testing a product that doesn't match the one running in production.
Written by:

Michael Rosenfield
Co-founder