Failure taxonomy (F1-F8)

Every code in the catalog maps to a failure mode. The taxonomy is the conceptual axis: it answers how a test goes green without protecting anything, independent of language. The product policy (block, warn, or diagnostic) is a separate axis, decided per code.

Family	Failure mode	Risk axis	Layer that resolves it
F1	Checks nothing (no oracle)	effectiveness	static, per file
F2	The check exists but never runs	execution	static, per file
F3	The check is trivial (always passes)	effectiveness	static, per file
F4	Checks the wrong thing	effectiveness (oracle)	static (partial) + skill
F5	The test drops out of the count (skip / not collected)	execution	static + project layer
F6	Passes or fails by luck (non-determinism)	non-determinism	static (proxy) + runtime
F7	Circular or semantic oracle	semantic	skill + mutation testing
F8	Hygiene / readability (not false-green)	structure	diagnostic opt-in / linter
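To make the first three rows concrete, here is a minimal Python sketch. The function `apply_discount` and the test names are hypothetical, invented for illustration; they are not taken from the catalog. Each of the first three tests passes without constraining the result, and the last one is included for contrast.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical unit under test (not part of any real catalog)."""
    return round(price * (1 - percent / 100), 2)


def test_f1_no_oracle():
    # F1: the code runs, but nothing is asserted about the result.
    apply_discount(100.0, 10.0)


def test_f2_check_never_runs():
    # F2: the assertion sits in a loop over an empty list, so it never executes.
    for case in []:
        assert apply_discount(*case) == 90.0


def test_f3_trivial_check():
    # F3: the assertion compares a value with itself and cannot fail
    # for any ordinary numeric result.
    result = apply_discount(100.0, 10.0)
    assert result == result


def test_protecting():
    # Contrast: a specific expected value that a broken implementation would miss.
    assert apply_discount(100.0, 10.0) == 90.0
```

All four tests report green against this implementation, but only the last one would turn red if `apply_discount` were changed to return the wrong amount.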

How to read it

F1-F6 are false-green. The test reports success while the code it covers may be broken. These block or warn, depending on confidence.
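As one illustration of a false-green result, the sketch below shows an F5 case using Python's standard `unittest` module. The class `TestCheckout` and the helper `run_suite` are hypothetical names chosen for this example. The skipped test contains an assertion that would fail, yet the run is still reported as successful.

```python
import unittest


class TestCheckout(unittest.TestCase):
    @unittest.skip("flaky, fix later")
    def test_total(self):
        # This assertion would fail if the test were ever executed.
        self.assertEqual(1 + 1, 3)


def run_suite() -> unittest.TestResult:
    # Load and run the hypothetical test case, collecting the outcome.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCheckout)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
# The run counts as successful even though the only check was skipped.
print(result.wasSuccessful(), len(result.skipped), len(result.failures))
```

The skip is visible in `result.skipped`, but a pipeline that only looks at the success flag would not notice it.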

F7 is semantic. The oracle is circular (the test re-derives the expected value from the code, or mocks the unit it claims to test). No parser proves intent. The skill reads it; a live gate with mutation testing confirms it.
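A circular oracle can be sketched in a few lines of Python. The function `shipping_cost` and its rates are hypothetical, made up for illustration. The first test re-derives the expected value with the same formula as the code; the second takes it from an independent, hand-computed figure.

```python
def shipping_cost(weight_kg: float) -> float:
    """Hypothetical unit under test: flat fee plus a per-kilogram rate."""
    return 5.0 + 2.0 * weight_kg


def test_circular_oracle():
    # F7: the expected value repeats the implementation's own formula,
    # so a wrong rate in shipping_cost would be copied into the expectation.
    weight = 3.0
    expected = 5.0 + 2.0 * weight
    assert shipping_cost(weight) == expected


def test_independent_oracle():
    # The expected value is computed by hand from the (assumed) specification.
    assert shipping_cost(3.0) == 11.0
```

Both tests look equally specific to a parser, which is why this family is left to a reader and to mutation testing rather than to static analysis.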

F8 is not false-green. The test still protects; it is just hard to read or maintain (assertion roulette, an over-long body, a magic number). These are diagnostic, off by default, and dedicated linters (ruff, ESLint, Robocop) also cover them. Where a linter covers it, the scanners defer.

Why a separate axis from the product groups

A scanner sorts its output into three CLI groups: false-positive (blocks or warns), diagnostic (opt-in), and coupling. That is the policy. F1-F8 is the map. The two do not collide: F1-F6 land in the false-positive group, F7 routes to the skill and mutation testing, F8 is the diagnostic group.
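The routing described above can be written down as a small lookup table. This is only a sketch: the names `ROUTE` and `route` are hypothetical, and the scanners' actual data structures are not shown in this document.

```python
# Hypothetical lookup from failure family to where its findings are handled.
ROUTE = {
    **{f"F{n}": "false-positive" for n in range(1, 7)},  # F1-F6 block or warn
    "F7": "skill + mutation testing",                    # no static group
    "F8": "diagnostic",                                  # opt-in
}


def route(family: str) -> str:
    """Return the destination for a family code such as 'F3'."""
    return ROUTE[family]


print(route("F2"))  # false-positive
print(route("F7"))  # skill + mutation testing
```

Keeping the map as data makes the separation explicit: the family says what went wrong, and the destination says what the product does about it.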

The taxonomy also names the denominator for measurement. When the catalog is cross-walked against the published test-smell literature, almost every external smell maps onto an F1-F8 mode, which keeps the academic claims honest: the tools measure against a named set of failure modes, not an open-ended list.

The two layers a parser cannot reach

A clean file is not a clean suite. Two failure modes survive a per-file scan:

F5 at the project level. The file asserts correctly, but the runner is configured to let an empty or partial run pass: --passWithNoTests, a coverage gate that is never enforced, a filterwarnings setting that never escalates warnings to errors. The --config-audit mode reads the project and CI config to catch these.
F7 at the semantic level. The assertion runs and looks specific, but the expected value comes from the code itself, or the mock stands in for the unit under test. Catching this means reading the test against its intent, which is the skill's job, and confirming it with mutation testing, which happens at runtime.
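The project-level check for F5 can be sketched as a scan of a CI command for flags that let an empty run pass. This is a minimal illustration, not the implementation of --config-audit: the names `PERMISSIVE_FLAGS` and `permissive_flags` are hypothetical, and the flag set contains only the one flag named above.

```python
import shlex

# Assumption: only the flag mentioned in the text; a real audit would cover more.
PERMISSIVE_FLAGS = {"--passWithNoTests"}


def permissive_flags(command: str) -> list[str]:
    """Return the tokens in a shell command that allow an empty run to pass."""
    tokens = shlex.split(command)
    # Compare the part before any '=' so '--flag=true' is also matched.
    return [t for t in tokens if t.split("=", 1)[0] in PERMISSIVE_FLAGS]


print(permissive_flags("jest --ci --passWithNoTests"))  # ['--passWithNoTests']
print(permissive_flags("jest --ci"))                    # []
```

A per-file scanner never sees this command line, which is why the check belongs to a layer that reads project and CI configuration.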

See judgments (J1-J6) for the per-test questions that decide which family a finding belongs to.