Auto-triage

Every failed run answers the first question: is this our bug?

When a flow fails, klera classifies the failure into one of four verdicts and ships PM-readable and engineer-readable prose alongside it. The classifier is deterministic; the prose is LLM-narrated; both run on every failed run with no flags to set.

failure · flows/checkout-android.flow.md · step 4 of 6 · failed at 00:12.4 · 2026-04-29

last frame · captured

Element graph: 124 nodes · Last 3 frames: 60fps · Stack trace: 2 sources

verdict

regression · drift · flake · data

The runtime tapped “Place order”, but the next screen never mounted. The element graph shows the button transitioning to disabled — no navigation event followed.

suspect commit

a1c4f29 · checkout: gate submit on payment-method validity · @miyu · 2h ago · packages/checkout/src/PlaceOrderButton.tsx · first flow run after this commit to fail

proposed fix · pick a payment method before tapping

- Tap “Place order” and confirm the order receipt appears.
+ Pick a saved card, tap “Place order”, and confirm the order receipt appears.

Open PR with this fix · View element graph · __failure-evidence__/checkout-android/14-22

The four verdicts

Verdict	Meaning	What it tells you
regression	The matcher could not resolve the target and the planner could not find a working path either	Likely a real bug; klera surfaces a suspect commit window for the failed file(s)
drift	The matcher could not resolve the cached target, but the planner produced a different working step	The screen shifted enough to outrun the matcher; klera proposes the test update
flake	A `waitForIdle` gate in the same flow timed out and the planner agrees the cached IR is correct	Environment noise, not a product bug; rerun before opening a ticket
data	The step’s error string matches a known value-mismatch pattern (e.g. `hasText mismatch`)	The product is doing what was asked; the test fixture / seeded data has shifted

The classifier picks one verdict per failed flow. Earlier rules short-circuit later ones — data wins over a matcher-based call, regression wins over a flake heuristic. The full rule order lives in pickVerdict in packages/engine/src/triage.ts.

How the classifier decides

The classifier is pure. No LLM call. It reads three signals off the failed step:

matcherTrace — every ladder rung the matcher probed, how many candidates each rung saw, and whether the ladder resolved (match, drift, or fail). Built by the matcher; lives on every step result. See self-healing matcher for the trace shape.
Planner replan record — the engine optionally re-runs the planner against the failure-state element graph and the original prose. If the planner produces a different step at the failed index, that is the drift signal; if it produces an equivalent step or errors out, that is the regression signal. The replan never overwrites the cached IR — it is read-only input to the classifier.
Error strings: the executor’s per-step handlers emit specific error shapes (hasText mismatch, expected notVisible). Those land the data verdict directly.

A few derived signals show up too: a waitForIdle step earlier in the flow that timed out is the flake heuristic; a network-mock divergence between the cached IR and the runtime call log is part of how the planner replan decides whether to propose a different step. When the classifier degrades — no replan available, no source links emitted in production builds — it picks the conservative call. A matcher-fail without a replan answer is presumed regression.
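The decision described above can be sketched as a small pure function. This is a minimal illustration, not the real pickVerdict: the `FailedStepSignals` shape, the `ReplanOutcome` names, the regexes, and the exact rule order beyond "data first, regression before flake" are all assumptions made for the example.

```typescript
// Hypothetical signal shape; the real types live in packages/engine/src/triage.ts.
type Verdict = "regression" | "drift" | "flake" | "data";
type ReplanOutcome = "different" | "equivalent" | "error";

interface FailedStepSignals {
  errorMessage: string;
  matcherOutcome: "match" | "drift" | "fail";
  replan?: ReplanOutcome; // undefined when no replan was available
  waitForIdleTimedOut: boolean; // an earlier waitForIdle gate in the flow timed out
}

// Error shapes named in the text; the regexes themselves are assumptions.
const DATA_PATTERNS: RegExp[] = [/hasText mismatch/i, /expected notVisible/i];

function pickVerdictSketch(s: FailedStepSignals): Verdict {
  // 1. A known value-mismatch error string lands the data verdict directly.
  if (DATA_PATTERNS.some((p) => p.test(s.errorMessage))) return "data";

  // 2. Matcher failure: a different replanned step means drift; an equivalent
  //    step, a planner error, or a missing replan is presumed regression.
  if (s.matcherOutcome === "fail") {
    return s.replan === "different" ? "drift" : "regression";
  }

  // 3. Flake heuristic: a waitForIdle timeout plus a planner that agrees
  //    the cached IR is still correct.
  if (s.waitForIdleTimedOut && s.replan === "equivalent") return "flake";

  // 4. Assumed fallback: when nothing else applies, take the conservative call.
  return "regression";
}
```

Because the function reads only plain values, it can be unit-tested without a device, a planner, or an API key, which is the practical benefit of keeping the classifier pure.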

Worked example: button text changed

A flow asserts tap "Sign In". Engineering renames the button to "Log in" without touching the test.

The matcher tries testID (rename, miss), accessibility-label (miss), and exact-text (no element with that text), then falls all the way down to fuzzy-text, which finds "Log in" at score 0.78. If self-healing is enabled and that score clears the threshold, the run passes with a drift annotation; otherwise the run fails. Either way the planner re-runs against the post-rename element graph and emits tap "Log in" instead.

Verdict: drift. The triage block ships:

A PM narrative explaining that the button copy changed.
An engineer narrative pointing at the matcher trace and the planner’s replan diff.
A proposedSteps array — the IR the planner thinks the cache should hold next.

The triage card in the HTML report has a one-click “Open PR with this fix” affordance that turns proposedSteps into a prose update.

Worked example: API contract changed

A flow does tap "Place order" then asserts visible "Order placed". A backend deploy starts returning HTTP 500 for the order endpoint; the in-app handler shows an error toast and disables the button.

The matcher resolves "Place order" cleanly. The tap fires. The following assert for "Order placed" exhausts the ladder — the text is genuinely not on screen. The planner re-runs against the post-failure element graph; the order screen has not navigated, the toast carries "Network error", and there is no obvious replanned path. The planner returns error: "no working path".

Verdict: regression. The triage block ships:

A PM narrative explaining what the user would see.
An engineer narrative naming the failed assertion.
A ranked suspectFiles list, derived from the elements involved in the matcher trace (the dev-only __source denormalisation; see failure evidence).
A suspectCommit pointer — the most recent commit touching any of the suspect files within the last 200 commits.

The HTML report renders the suspect commit author, message, and SHA inline. The PR comment has a clickable link straight to that commit.
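The suspectCommit lookup can be pictured as a scan over recent history. The sketch below assumes the history has already been parsed into a newest-first array (for example from `git log --name-only`); the `CommitInfo` shape and the `findSuspectCommit` name are hypothetical, and only the 200-commit window comes from the text above.

```typescript
// Hypothetical commit record; newest commit is expected at index 0.
interface CommitInfo {
  sha: string;
  author: string;
  subject: string;
  files: string[]; // paths touched by the commit
}

function findSuspectCommit(
  history: CommitInfo[],
  suspectFiles: string[],
  windowSize = 200,
): CommitInfo | undefined {
  const suspects = new Set(suspectFiles);
  // Only look inside the window, and stop at the first (most recent) hit.
  return history
    .slice(0, windowSize)
    .find((commit) => commit.files.some((file) => suspects.has(file)));
}
```

Returning `undefined` when nothing in the window touches a suspect file keeps the pointer optional, so a report can omit it rather than guess.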

Suspect-file ranking

The deterministic step uses the failed step’s sourceLinks. Each linked node carries (elementId, fileName, lineNumber) from the dev-only __source denormalisation that React Native’s babel plugin emits. The classifier walks the matcher trace’s matchedElementId and candidateIds, looks each up in the failure-state snapshot, and collects __source from the element itself plus its three nearest ancestors. Composite components that wrap interactive primitives are the typical suspects, so ancestors carry weight.

The LLM narrator may re-rank or trim the list. The deterministic list is always populated when __source is available; it survives an LLM outage as the fallback.
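A rough sketch of the deterministic walk follows. It assumes the snapshot is available as a map from element id to a node with an optional parent pointer and optional source info; `SnapshotNode`, `SourceLink`, and `collectSourceLinks` are illustrative names, and the real ranking may weight entries differently from the simple "element first, then ancestors" order used here.

```typescript
interface SourceLink {
  elementId: string;
  fileName: string;
  lineNumber: number;
}

// Hypothetical snapshot node; source is absent in production builds.
interface SnapshotNode {
  id: string;
  parentId?: string;
  source?: { fileName: string; lineNumber: number };
}

function collectSourceLinks(
  snapshot: Map<string, SnapshotNode>,
  elementIds: string[], // matchedElementId followed by candidateIds
  ancestorDepth = 3,
): SourceLink[] {
  const links: SourceLink[] = [];
  const seen = new Set<string>();
  for (const startId of elementIds) {
    let node = snapshot.get(startId);
    // hops = 0 is the element itself; 1..ancestorDepth are its ancestors.
    for (let hops = 0; node !== undefined && hops <= ancestorDepth; hops++) {
      if (node.source && !seen.has(node.id)) {
        seen.add(node.id);
        links.push({ elementId: node.id, ...node.source });
      }
      node = node.parentId !== undefined ? snapshot.get(node.parentId) : undefined;
    }
  }
  return links;
}
```

When no node carries source info, the function returns an empty list, which matches the degraded case where source links are not emitted.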

The LLM narrator

After the classifier picks the verdict, klera invokes the planner LLM (Claude Sonnet 4.6 by default) with a structured prompt:

The verdict.
The flow name + failed step index.
The cached IR step that ran.
The replanned step the planner produced (drift case).
The matcher trace, error message and details, source links, frames.

The narrator writes a tweet-length PM narrative (≤280 chars) and a paragraph-length engineer narrative (≤800 chars), plus its own ranked suspectFiles (≤5 entries). The schema is enforced by Zod — malformed narrator output falls back to the deterministic suspect-file list and short stub prose.
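The length limits and the fallback can be illustrated with a dependency-free validator. The text says the real schema is enforced by Zod; the hand-rolled checks below are a stand-in so the example runs without installing anything, and `parseNarration`, the constants, and the stub strings are assumptions made for the sketch.

```typescript
interface SuspectFile {
  fileName: string;
  lineNumber: number;
  reason: string;
}

interface Narration {
  pmNarrative: string;
  engineerNarrative: string;
  suspectFiles: SuspectFile[];
}

// Limits quoted in the text. String length here counts UTF-16 code units.
const PM_MAX = 280;
const ENGINEER_MAX = 800;
const FILES_MAX = 5;

function isSuspectFile(v: unknown): v is SuspectFile {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  return (
    typeof r["fileName"] === "string" &&
    typeof r["lineNumber"] === "number" &&
    typeof r["reason"] === "string"
  );
}

// Returns the narrator output when it is well-formed, otherwise stub prose
// plus the deterministic suspect-file list.
function parseNarration(raw: unknown, deterministicFiles: SuspectFile[]): Narration {
  if (typeof raw === "object" && raw !== null) {
    const r = raw as Record<string, unknown>;
    const pm = r["pmNarrative"];
    const eng = r["engineerNarrative"];
    const files = r["suspectFiles"];
    if (
      typeof pm === "string" && pm.length <= PM_MAX &&
      typeof eng === "string" && eng.length <= ENGINEER_MAX &&
      Array.isArray(files) && files.length <= FILES_MAX &&
      files.every(isSuspectFile)
    ) {
      return { pmNarrative: pm, engineerNarrative: eng, suspectFiles: files };
    }
  }
  return {
    pmNarrative: "Narrative unavailable (hypothetical stub).",
    engineerNarrative: "Narrator output was malformed (hypothetical stub).",
    suspectFiles: deterministicFiles,
  };
}
```

Validating the whole object before using any field means a partially valid response never mixes LLM prose with the deterministic list by accident.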

Escape hatches

The classifier and narrator are both opt-out:

klera run --no-triage: skip the triage block entirely. The report still ships, just without the verdict / narrative / suspect files.
KLERA_NO_TRIAGE=1: environment-variable form of the same flag. Useful when you want triage on for normal runs but off for a single job, such as a one-off CI rerun.

Graceful degradation without an API key

The classifier never needs an LLM. The narrator does. When ANTHROPIC_API_KEY is unset (or the local-CLI planner transport is unreachable), klera ships:

The deterministic verdict.
A short stub PM narrative ("Auto-triage narrative pending — wire ANTHROPIC_API_KEY for prose.").
A short stub engineer narrative.
The deterministic suspectFiles list.
The suspectCommit pointer when the verdict is regression and git log succeeds.

The HTML report renders all of the above without a placeholder. Add the API key (or wire claude / codex / gemini on PATH) and the next failure ships full narratives.

Where the triage block lives

The triage block is part of the JSON report’s top-level shape:


{
  "schemaVersion": 4,
  "flow": { ... },
  "steps": [ ... ],
  "triage": {
    "verdict": "regression",
    "failedStepIndex": 4,
    "pmNarrative": "Tapping Place order didn't navigate forward...",
    "engineerNarrative": "Step 4 (assert visible 'Order placed')...",
    "suspectFiles": [
      { "fileName": "packages/checkout/src/PlaceOrderButton.tsx", "lineNumber": 42, "reason": "..." }
    ],
    "suspectCommit": {
      "sha": "a1c4f29",
      "author": "miyu",
      "subject": "checkout: gate submit on payment-method validity"
    }
  }
}

klera report --html renders the triage block as the card you saw at the top of this page. klera report --junit folds the verdict into the JUnit <system-out> so it shows up in PR test panes.
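For tooling that consumes the JSON report directly, the shape above can be mirrored in types. The interfaces below are inferred from the sample only, so treat them as a sketch: the optionality of `triage`, `suspectCommit`, and `proposedSteps`, and the `summarizeTriage` helper itself, are assumptions rather than a published schema.

```typescript
type Verdict = "regression" | "drift" | "flake" | "data";

interface TriageBlock {
  verdict: Verdict;
  failedStepIndex: number;
  pmNarrative: string;
  engineerNarrative: string;
  suspectFiles: { fileName: string; lineNumber: number; reason: string }[];
  // Assumed optional: described only for the regression verdict.
  suspectCommit?: { sha: string; author: string; subject: string };
  // Assumed optional: described only for the drift verdict; IR shape not shown.
  proposedSteps?: unknown[];
}

interface KleraReport {
  schemaVersion: number;
  // Assumed absent when triage is skipped.
  triage?: TriageBlock;
}

// Produces a one-line summary suitable for a CI log.
function summarizeTriage(reportJson: string): string {
  const report = JSON.parse(reportJson) as KleraReport;
  const triage = report.triage;
  if (triage === undefined) return "no triage block";
  const commit = triage.suspectCommit
    ? `, suspect commit ${triage.suspectCommit.sha} by ${triage.suspectCommit.author}`
    : "";
  return `${triage.verdict} at step ${triage.failedStepIndex}${commit}`;
}
```

The cast after `JSON.parse` does no runtime validation, so a production consumer would likely want to check `schemaVersion` before trusting the rest of the shape.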
