Research foundation
falsegreen is a research project as much as a tool, and it serves two purposes. The academic one calls for a defensible taxonomy, a named denominator, and stated threats to validity. The industrial one calls for low false positives, real patterns, and something that runs in CI. Every code in the catalog traces back to a failure mode and a judgment, so the claim behind it can be checked rather than taken as folklore.
The methodology (our base)
The whole approach rests on four pillars, each with its own page:
- Failure taxonomy F1-F8 - the conceptual axis: how a test passes green without protecting anything, independent of language.
- Judgments J1-J6 - six questions asked of a single test; a finding names the exact guarantee that fails, not a vague smell.
- The oracle hierarchy - the expected value must come from a source independent of the code; promoting the code itself to oracle is how a bug freezes as "correct".
- The AI-fix gate (F7) - a bidirectional mutation gate: a strengthened test must pass on clean code and fail on the reintroduced bug, or it is rejected.
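The gate in the last pillar can be sketched in a few lines. This is a minimal illustration, not the falsegreen implementation: the `apply_discount_*` functions, both tests, and `bidirectional_gate` are hypothetical names invented for this example. The weak test shows a false green (it passes with or without the bug), and the strengthened test takes its expected value from hand arithmetic rather than from the code.

```python
from typing import Callable

Impl = Callable[[float, float], float]
Test = Callable[[Impl], None]


# Hypothetical function under test: the clean version.
def apply_discount_clean(price: float, percent: float) -> float:
    return price * (1 - percent / 100)


# The same function with the bug reintroduced: the discount is ignored.
def apply_discount_buggy(price: float, percent: float) -> float:
    return price


def weak_test(impl: Impl) -> None:
    # False green: only the return type is checked, so any float passes.
    assert isinstance(impl(200.0, 25.0), float)


def strengthened_test(impl: Impl) -> None:
    # Independent oracle: 25% off 200 is 150, worked out by hand.
    assert impl(200.0, 25.0) == 150.0


def passes(test: Test, impl: Impl) -> bool:
    """Run one test against one implementation and report the outcome."""
    try:
        test(impl)
    except AssertionError:
        return False
    return True


def bidirectional_gate(test: Test, clean: Impl, buggy: Impl) -> bool:
    """Accept a test only if it passes on clean code and fails on the bug."""
    return passes(test, clean) and not passes(test, buggy)


weak_accepted = bidirectional_gate(
    weak_test, apply_discount_clean, apply_discount_buggy
)
strong_accepted = bidirectional_gate(
    strengthened_test, apply_discount_clean, apply_discount_buggy
)
print(weak_accepted, strong_accepted)
```

Checking both directions matters: a test that fails on clean code is simply broken, and a test that passes on the reintroduced bug protects nothing, so either outcome leads to rejection.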
The denominator and threats to validity
Precision and recall are reported against a named universe, not an open-ended list. The falsegreen family measures against the Open Catalog of Test Smells (517 documented smells, 1621 references, 166 sources), and only the false-green slice is in scope. What stays out, and why, is documented on the coverage vs the literature page, which serves as the threats-to-validity statement in public form.
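As a rough illustration of what measuring against a named universe means, the sketch below restricts both the detector output and the ground truth to an in-scope set before counting. It is one possible reading, not the project's evaluation code: the function name, the `(test id, smell name)` shape, and every smell and test name in the sample data are assumptions made up for this example.

```python
# A finding is a (test id, smell name) pair; both fields are hypothetical.
Finding = tuple[str, str]


def precision_recall(
    detected: set[Finding],
    actual: set[Finding],
    in_scope: set[str],
) -> tuple[float, float]:
    """Precision and recall counted only over smells in the named scope."""
    # Findings for out-of-scope smells count neither for nor against.
    detected_in = {f for f in detected if f[1] in in_scope}
    actual_in = {f for f in actual if f[1] in in_scope}
    true_positives = detected_in & actual_in
    precision = len(true_positives) / len(detected_in) if detected_in else 0.0
    recall = len(true_positives) / len(actual_in) if actual_in else 0.0
    return precision, recall


# Made-up data: two smells in scope, one ("slow-test") out of scope.
in_scope = {"assertion-free", "tautological-assert"}
actual = {
    ("t1", "assertion-free"),
    ("t2", "tautological-assert"),
    ("t3", "assertion-free"),
    ("t6", "tautological-assert"),
    ("t4", "slow-test"),
}
detected = {
    ("t1", "assertion-free"),
    ("t2", "tautological-assert"),
    ("t5", "assertion-free"),
    ("t4", "slow-test"),
}

precision, recall = precision_recall(detected, actual, in_scope)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Fixing the scope up front keeps the denominator of recall stable: a detector cannot look better by ignoring hard smells, and it is not penalized for smells it never claimed to cover.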
Baselines from the literature
For comparison context, the published detectors and studies in the adjacent space:
| Tool / study | Precision | Recall | F1 score | Scope |
|---|---|---|---|---|
| xNose (Paul, 2024) | 96.97% | 96.03% | - | C#, 16 smells |
| srcML (Lopes, 2023) | 87.25% | 100% | - | C++ and Java, 7 smells |
| JNose (Goes, 2024) | 85-100% | 90-100% | - | Java, 6 smells |
| LLM CoT + one-shot (Santana, 2025) | - | - | 0.732 Py / 0.763 Java | Python and Java |
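The F1 column in this table is the F1 score, the harmonic mean of precision and recall, and is unrelated to failure mode F1 in the taxonomy. For rough side-by-side reading, the sketch below derives an F1 score from the precision and recall given in the first two rows. These derived values are not figures reported by the studies, and they assume the published precision and recall describe the same aggregate, which may not hold if a study averaged per smell.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Inputs are the table's percentages written as fractions.
xnose_f1 = f1_score(0.9697, 0.9603)
srcml_f1 = f1_score(0.8725, 1.0)
print(f"xNose: {xnose_f1:.3f}, srcML: {srcml_f1:.3f}")
```

The JNose row gives ranges rather than point values, so no single derived score is meaningful for it.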
Our own evaluation against this denominator lives in the research hub; its numbers will appear here once they are published, not before.
The study
The product code and this documentation are public. The dataset, the per-smell adjudication, and unpublished results live in a private research hub, so no unpublished number or evidence appears in a public repository. Results and any paper are linked here when published.
Public study materials:
- falsegreen (Python), falsegreen-js (JS/TS), robotframework-falsegreen (Robot), falsegreen-skill (semantic).
- The founding work and full reference list: credits and references.
- The literature denominator: Open Catalog of Test Smells.
How to cite
If you use falsegreen in academic work, cite the relevant product repository and the founding rotten-green-test literature listed in credits. A canonical citation entry is added here once the study is published.