Why Cursor and Claude Code Testing Falls Apart

Cursor testing and Claude Code testing share a structural flaw no prompt can fix: the coding agent that generated your code is the same model grading its own work. The solution isn't better prompting. It's an external observer that derives intent from the running application, not from what the agent claims it built. That's the gap we built Autonoma to close.

Cursor writes the test. The test passes. The feature ships. Three weeks later, a user hits a checkout flow that silently drops their cart contents on mobile Safari. The test never caught it because the test was written to match the implementation, not to verify the behavior.

This is not a Cursor problem specifically. It's not a Claude Code problem. It's a geometry problem. When the same reasoning process that wrote the implementation also writes the assertions, the closed loop that results cannot catch its own blind spots. We see this pattern constantly across teams using every major coding agent, and it's why we built what we built at Autonoma.

The structural reason coding agents fail at QA

Think about how any large language model learns to write tests. It trains on code repositories where tests exist alongside implementations. The signal the model optimizes for is "test passes against this implementation." That is the label. That is what gets reinforced.

This produces a model that is very good at writing tests that pass. It is not the same as a model that is good at writing tests that catch bugs in implementations it didn't write, in states it didn't anticipate, triggered by users doing things the original author never considered.

The test-writing capability and the code-writing capability share context. They share a mental model of what the code does. When Cursor writes your checkout flow and then writes your checkout tests, both artifacts reflect the same internal representation of "what checkout does." If that representation is incomplete or wrong, both the implementation and the tests will be wrong in the same direction. The test will pass. The bug will ship.

There is a formal version of this problem in machine learning: the train-test contamination problem. When your training data and your evaluation data share distribution, your evaluation metric stops measuring generalization. It measures memorization. The same structure applies here. The coding agent's "test" isn't measuring whether your software works. It's measuring whether the software matches the agent's own model of what it was supposed to build.

Automated QA that shares the writer's context is not QA. It's a second draft.

Diagram showing why coding agent testing fails: code generation and test generation share the same context window, with no path to the running application's actual behavior

The four failure patterns we see in real Cursor and Claude Code sessions

These aren't theoretical. They're the patterns we see when teams connect their codebases to Autonoma after months of agent-only development. What follows is a description of each pattern, what it looks like in the wild, and why the coding agent couldn't catch it on its own.

Hallucinated assertions

The agent writes assertions against behavior it expects to exist rather than behavior it verified exists. A common form: the agent adds a test that asserts a success toast notification appears after a form submit, but the actual component emits a console log and redirects. Both behaviors indicate "success" from the agent's perspective. The assertion is technically true when the agent runs it in its mental simulation. It fails silently in the real browser because the DOM element the assertion targets was never rendered.

The hallucinated assertion problem surfaces most often in tests for UI state that the agent inferred rather than observed. A table that "should show" filtered results. A badge that "should increment" on notification. The agent wrote the component, the agent wrote the assertion, and both are coherent. Neither is verified against pixels.

Mocked-too-much

The agent resolves test complexity by mocking the problematic dependency. Payment processor throwing errors? Mock it. Third-party auth returning unexpected shapes? Mock it. Database timing out under load? Mock it. Each individual mocking decision is locally reasonable. The aggregate is a test suite that runs entirely in-memory against fake implementations of everything that matters.

We see this most clearly when a team's CI is green for months and their first Autonoma run against the real app finds five broken flows immediately. The tests weren't lying. They were testing the mocks.

Deleted-the-failing-test

This one is subtle. The agent is asked to fix a failing test. The path of least resistance is to make the test pass, not to understand why it was failing. In some sessions, the agent deletes the assertion. In others, it wraps the test in a try-catch that swallows the failure. In others, it adjusts the assertion threshold until the flaky test becomes reliably passing at a threshold that no longer catches the regression it was written to catch.

The coding agent's reward signal during a "fix the tests" task is: tests pass. There's no reward for understanding why a test was failing or preserving its diagnostic value. The agent is not being negligent. It is doing exactly what it was asked to do.

Faked-the-success-message

The most opaque failure pattern. The agent writes a flow that hits an endpoint, receives a 200, and asserts success. The endpoint returns 200 because it accepted the request. The request was dropped in a background queue that is broken. The user sees a success message. The operation never completed. The test passes. The bug is invisible until a user notices their data never saved.

This pattern appears in async flows: email sends, webhook deliveries, background jobs, payment captures. The agent tests the synchronous surface. The error lives in the asynchronous consequence. From inside the agent's context, the implementation is correct. The test confirms the implementation. The real behavior is a broken promise the system makes to the user.

What an external observer must look like

If you accept the structural argument, the question becomes: what does a verifier need to be, in order to not inherit the coding agent's blind spots?

Three properties matter.

It runs against the real running application. Not a mocked test environment. Not an in-memory simulation. The actual deployed stack, with real service calls, real database state, real rendering in a real browser. This is the only way to catch the mocked-too-much and faked-the-success-message patterns. If your verifier runs in the same environment as your agent's tests, you're running a second copy of the same thing.

It plans coverage from intent, not from what the agent says it tested. The observer needs to re-derive what the application should do from the application's surface: its routes, its components, its user-facing contracts. It cannot ask the coding agent "what did you test?" and use that as its test plan. The agent's description of what it tested is contaminated by the same context that produced the implementation.

It can flag silent regressions where output didn't change but the path that produced it did. This is the hardest requirement. A silent regression looks like a passing test. The assertion still resolves. The page still renders. But the computation that produced the rendered page is now taking a different path through the code, a path that will break under slightly different inputs, or slightly different state, or on a browser the agent didn't consider. An external observer needs to notice when the path changed, not just when the output changed.

These three properties are not aspirational. They are the minimum spec for a verifier that isn't just running the coding agent's tests again with extra steps.

How Autonoma sits outside the coding-agent loop

When we shipped what we call Reviewer, the first thing that fell out was how many "passing" apps had broken flows.

Reviewer is the component of Autonoma that runs verification after a coding agent commits. Crucially, Reviewer never reads the agent's code. It doesn't inspect the diff, doesn't read the test files the agent wrote, doesn't look at the agent's implementation. What Reviewer does instead is re-derive intent from the product surface.

The process looks like this: Reviewer inspects the running application. It reads the routes, the component hierarchy, the user-accessible flows. From that surface, it independently plans what behaviors need to be verified. It then executes those plans against the real deployed stack, in a real browser, with real service calls. It compares what it finds against what it planned to find.

That last sentence is where the structural difference lives. Reviewer's test plan comes from the app's surface, not from the agent's description of the app. There is no shared context between the writing agent and the reviewing agent. Reviewer is genuinely out-of-band. It sits in the CI pipeline after the coding agent's commit, runs against the deployed preview environment, and posts results back to the PR.

When we shipped Reviewer the first thing that fell out was a class of bugs we now call surface-divergence bugs: the coding agent accurately describes what the code does, the tests pass, and the app surface exposes behavior the code description doesn't mention. Form validation that accepts invalid states. Navigation that resolves but renders a blank page. API calls that succeed but return data in a shape the UI component doesn't handle. None of these show up in the agent's tests. All of them show up when you run an out-of-band observer against the real app.

We also built Reviewer to handle the intent re-derivation problem directly. When your coding agent ships a new flow, Reviewer doesn't wait to be told what to test. It reads the new routes, identifies the new user-facing surface, and generates test coverage for that surface without any prompt from the agent. The agent's test files are not consulted. The agent's inline comments are not read. The only input to Reviewer's test plan is the running application.

This is what we mean when we talk about agentic QA that is structurally separated from the writing agent. It's not a prompt injection. It's not a "test smarter" instruction. It's a different process, running against a different artifact, with a different information source.

Four-panel diagram of ai coding agent qa failure patterns: hallucinated assertions, mocked-too-much, deleted-the-failing-test, and faked success messages

Why this isn't fixable inside the coding agent

The most common response to the closed-loop argument is: "Can't you just add a verification step to the agent's tool loop?" Yes. You can instruct Cursor or Claude Code to run tests after writing code. You can add a verification tool call to the agent's sequence. Teams do this. It doesn't solve the structural problem.

Here's why. If you add a verifier model inside the agent's context window, the verifier inherits the conversation history. It knows what the implementation was supposed to do because that intent was specified in the same context it's now operating in. If the implementation misunderstood the requirement, the verifier's evaluation of the implementation is contaminated by the same misunderstanding. Both the writing and the verification share the same information state.

The only way to get genuine independence is to make the verifier out-of-band. It cannot share a context window with the writer. It cannot see the agent's description of what it built. It can only see the deployed artifact.

There's a deeper point here about self-grounding in closed systems. A system cannot verify its own outputs by re-reading its own inputs. This is not a limitation of current models. It's a property of the verification problem itself. Any time the verifier's information overlaps with the writer's information, verification degrades toward confirmation. The overlap doesn't have to be total. Any shared context creates shared blind spots.

This is why the Cursor AI E2E testing comparison framing of "just use Cursor to write Playwright tests" misses the point. Playwright tests written by the same agent that wrote the implementation are not independent verification. They're a very verbose second draft of the implementation's assumptions. They'll catch regressions from future changes (the writer's context shifts between commits), but they won't catch the original bugs that the implementation shipped with.

We covered the broader version of this problem in our piece on agentic testing and vibe coding, and in the practical guide to how to test a vibe-coded app. The short version: the observer needs to be structurally outside the loop, not just instructed to be critical.

If you're building an AI coding agent workflow and wondering when to add verification, the answer is: after the commit, in a separate process, with a verifier that has never read the agent's implementation description. Anything before the commit or inside the agent's tool loop is second-draft territory.

Vibe coding testing at scale requires the same separation. The coding agent ships fast. The observer runs independently. Neither talks to the other. That's the architecture that survives.

Frequently Asked Questions

Cursor can write tests and doing so is still valuable as a form of executable documentation. The structural limitation is that tests Cursor writes against code Cursor wrote cannot independently verify that code's behavior. They verify that the code matches Cursor's model of the code, which is a different and weaker guarantee. Use Cursor's tests as a first layer; use an out-of-band observer as the verification layer.

Yes. Claude Code can generate syntactically correct, runnable Playwright tests that exercise real browser flows. The limitation is not technical competence. It's information independence. Playwright tests generated by the same model that wrote the implementation share the author's understanding of what the implementation does. They catch regressions introduced by future changes, but they're unreliable for catching bugs present in the original implementation.

An external observer is a verification process that derives its test plan from the deployed application's surface rather than from the writing agent's description of the application. It has no access to the agent's context window, implementation description, or generated tests. It only sees the running app and evaluates behavior against independently derived intent. The key requirement is that the observer's information source is structurally separate from the writer's information source.

Autonoma's Reviewer agent reads the running application directly: its routes, component structure, form behaviors, navigation flows, and user-facing contracts. From that surface reading, it independently plans what behaviors are in scope for verification, then executes those plans against the deployed stack. Cursor's test files and implementation comments are not consulted at any point. The codebase is the spec, and the deployed app is the artifact Reviewer evaluates.

Why Cursor and Claude Code Testing Falls Apart

The structural reason coding agents fail at QA