You Can’t QA Vibes: How to Test AI Products When Outputs Are Never the Same Twice
AI outputs are different every time — so how do you test them? Shape tests, semantic evaluation, mock boundaries, visual regression, and production monitoring. The practical testing playbook for AI products.
Here’s the testing problem nobody talks about: how do you write automated tests for a system where the output is different every time? You can’t assert that the AI returns exactly “The project deadline is March 15th” because tomorrow it might return “The deadline for this project is 3/15” — same meaning, different string, test fails.
Traditional testing assumes deterministic outputs. AI products break that assumption. After shipping 7 AI products, here’s how I actually test them.
The Testing Pyramid for AI Products
The classic testing pyramid — unit tests at the base, integration tests in the middle, E2E tests at the top — still applies. But each layer needs a different strategy for AI outputs.
Layer 1: Test Everything That Isn’t AI
Most of your product is deterministic. API routes parse inputs, validate data, manage state, format responses. All of this is testable with standard unit tests. The AI call is one step in a larger pipeline — test everything around it.
In my products, the AI-dependent code accounts for maybe 15% of the total codebase. The other 85% — auth, CRUD operations, data transformations, UI components, routing — is fully deterministic and fully testable.
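For example, a helper that normalizes AI-returned tags before they reach the UI is a pure function and tests like any other code. A minimal sketch (the helper name is hypothetical; the point is that nothing about it needs the model):

// Hypothetical deterministic helper that cleans up AI-returned tags before they hit the UI
function normalizeTags(tags: string[]): string[] {
  return [...new Set(tags.map((t) => t.trim().toLowerCase()))].filter(Boolean);
}

test('normalizes, deduplicates, and drops empty tags', () => {
  expect(normalizeTags([' Running ', 'running', 'Shoes', ''])).toEqual(['running', 'shoes']);
});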
Layer 2: Test AI Outputs by Shape, Not Value
Instead of asserting what the AI said, assert the shape of what it returned. Did the response include all required fields? Is the confidence score between 0 and 1? Does the generated text contain fewer than 500 characters? Is the JSON parseable?
test('AI analysis returns expected shape', async () => {
  const result = await analyzeContent(sampleInput);

  expect(result).toHaveProperty('summary');
  expect(result).toHaveProperty('confidence');
  expect(result.confidence).toBeGreaterThan(0);
  expect(result.confidence).toBeLessThanOrEqual(1);
  expect(result.tags).toBeInstanceOf(Array);
  expect(result.tags.length).toBeGreaterThan(0);
});
Shape tests catch structural failures — the API changed, the response format broke, a required field is missing — without being brittle to natural language variation.
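One way to keep those checks in a single place is a schema shared by the test suite and the production parser. A sketch assuming Zod, which is one common choice rather than anything prescribed here:

import { z } from 'zod';

// One schema doubles as the shape test and the runtime guard on AI responses
const AnalysisResultSchema = z.object({
  summary: z.string().max(500),
  confidence: z.number().gt(0).lte(1),
  tags: z.array(z.string()).nonempty(),
});

test('AI analysis conforms to the schema', async () => {
  const result = await analyzeContent(sampleInput);
  // parse() throws, and fails the test, if any field is missing or out of range
  AnalysisResultSchema.parse(result);
});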
Layer 3: Semantic Evaluation for Critical Paths
For critical features where the AI’s meaning matters — not just its structure — I use semantic evaluation. The test sends a known input and evaluates the output against criteria, not exact strings.
Example: “Given a product description about running shoes, the AI-generated ad copy should mention comfort, performance, and include a call-to-action.” I evaluate these criteria programmatically using keyword matching for simple cases or a second AI call (an LLM-as-judge) for complex semantic evaluation.
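A rough sketch of both flavors; generateAdCopy, runningShoeDescription, and the askLLM judge wrapper are hypothetical names, and the rubric prompt is just one way to phrase it:

// Simple case: keyword criteria
function mentionsAll(text: string, keywords: string[]): boolean {
  const lower = text.toLowerCase();
  return keywords.every((k) => lower.includes(k.toLowerCase()));
}

test('running-shoe ad copy hits the required themes', async () => {
  const copy = await generateAdCopy(runningShoeDescription);
  expect(mentionsAll(copy, ['comfort', 'performance'])).toBe(true);
});

// Complex case: LLM-as-judge (askLLM is a hypothetical wrapper around a second model call)
async function hasCallToAction(copy: string): Promise<boolean> {
  const verdict = await askLLM(
    `Does the following ad copy contain a clear call-to-action? Answer only "yes" or "no".\n\n${copy}`
  );
  return verdict.trim().toLowerCase().startsWith('yes');
}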
The Mock Boundary
The most important architectural decision for testability: define a clear boundary between AI and non-AI code. I wrap every AI call in a service function with a typed interface. In tests, I mock that service. In production, it calls the real model.
// The boundary
interface AIService {
  analyze(input: string): Promise<AnalysisResult>;
  generate(prompt: string): Promise<GenerationResult>;
}
This boundary lets me test 100% of my application logic without ever making an AI API call. The AI service gets its own test suite with shape tests and semantic evaluations. The rest of the app gets standard unit and integration tests.
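In practice the mock can be a plain object that satisfies the interface. A sketch of the pattern; buildReport and the generate result's field are placeholders, not a specific product:

// A hand-rolled stub that satisfies AIService: no network, no API keys
const fakeAI: AIService = {
  analyze: async () => ({
    summary: 'Fixture summary',
    confidence: 0.92,
    tags: ['fixture'],
  }),
  generate: async () => ({ text: 'Fixture output' }), // field name assumed for illustration
};

test('report pipeline runs end to end without the real model', async () => {
  // buildReport stands in for application logic that takes the service as a dependency
  const report = await buildReport(fakeAI, sampleInput);
  expect(report.sections.length).toBeGreaterThan(0);
});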
Visual Regression for AI-Rendered Content
When AI generates content that the UI renders — markdown, charts, dynamic layouts — visual regression tests catch rendering bugs that content-level tests miss. I snapshot the rendered component with various AI outputs (stored as fixtures) and compare against baselines.
The fixtures aren’t “correct” AI outputs — they’re representative outputs that exercise different rendering paths. A short response, a long response, a response with code blocks, a response with lists, an empty response, a malformed response. Each fixture tests a different rendering edge case.
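A sketch of this with DOM-level snapshots, assuming React, Testing Library, and Jest-style snapshotting; the component and fixture module names are placeholders, and pixel-level tools follow the same fixture pattern at a different layer:

import { render } from '@testing-library/react';
import { AIResponseView } from './AIResponseView';
// Each fixture is a stored AI output that exercises a different rendering path
import * as fixtures from './fixtures/ai-outputs';

test.each(Object.entries(fixtures))('renders the %s fixture consistently', (name, output) => {
  const { container } = render(<AIResponseView output={output} />);
  // Fails when the rendered markup drifts from the committed baseline
  expect(container).toMatchSnapshot();
});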
Production Monitoring Is Testing Too
For AI products, production monitoring is the final test layer. I track:
- Confidence distribution: If average confidence drops, the model or the inputs have changed.
- Latency percentiles: P50, P95, P99 for AI calls. Latency spikes indicate model issues or rate limiting.
- User edit rates: If users are editing AI outputs more frequently, the quality has degraded.
- Retry rates: Frequent retries mean the first output isn’t meeting expectations.
These metrics are often more valuable than pre-production tests because they measure real-world quality with real data and real users.
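As a sketch, recording a few of these signals at the AI service boundary might look like this; the metrics client and event names are placeholders for whatever observability stack is already in place:

import { metrics } from './observability'; // hypothetical metrics client

async function analyzeWithTelemetry(ai: AIService, input: string): Promise<AnalysisResult> {
  const start = Date.now();
  const result = await ai.analyze(input);

  // Feeds the latency percentiles and the confidence-distribution dashboard
  metrics.histogram('ai.analyze.latency_ms', Date.now() - start);
  metrics.histogram('ai.analyze.confidence', result.confidence);
  return result;
}

// Called from the UI when a user edits or regenerates an AI output
function recordOutputFeedback(action: 'edited' | 'retried'): void {
  metrics.increment(`ai.analyze.${action}`);
}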
The Pragmatic Approach
You can’t test AI products the way you test CRUD apps. Accepting that — and designing your testing strategy around it — is the first step. Test the deterministic code thoroughly. Test AI outputs by shape. Evaluate semantics for critical paths. Monitor quality in production. And always maintain the mock boundary so your test suite runs in seconds, not minutes.