falsegreen-skill (semantic LLM pass)¶

The semantic layer and a superset of the three static scanners. It reads a test against its intent, the spec, and the production code to catch the false-greens no parser sees (F4, F7), and it carries every structural code of the scanners plus the AI-only S-series.

Repository: github.com/vinicq/falsegreen-skill
Catalog: semantic codes

What it is¶

Not Claude-specific. The same protocol is packaged for Claude Code, Codex, Gemini, Cursor, plain LLM prompts, API usage, and an npm CLI. For Python it applies the complete falsegreen catalog directly; for TypeScript, JavaScript, and Robot Framework it is the primary detection tool.

The protocol (J1-J6)¶

Every test is read through six judgments: does the assertion run, is the expected value from an independent oracle, is the real unit exercised, is the assertion sufficient, is it free of coupling to internals, does it pass in isolation. A false positive is worse than a miss, so a wrong-value finding is not reported without citing an oracle.

Install and first run¶

The skill runs on several hosts (Claude Code, Codex, Gemini, Cursor) and as a standalone npm CLI. The CLI needs only Node 18+ and an API key for the provider you choose:

export ANTHROPIC_API_KEY=sk-ant-...
npx falsegreen-skill analyze tests/test_demo.py

--provider openai or --provider gemini switches the model; --json and --fail-on-high wire it into CI. Per-host setup (the Claude Code plugin, the Codex and Gemini extensions, the Cursor rule) is in the skill README.

First finding¶

Given a test that asserts the mock back to itself:

# tests/test_tax.py
def test_calculate_tax(mock_calc):
    mock_calc.return_value = 0.15
    result = calculate_tax(100, mock_calc)
    assert result == mock_calc.return_value

npx falsegreen-skill analyze tests/test_tax.py reports:

CASE 11 (J2) - HIGH - Python - unit - behavior

Test: test_calculate_tax (line 2-5)
Finding: The assertion checks mock_calc.return_value - the same value the mock
was configured to return. It passes for any result, including a wrong one.
Evidence:
  mock_calc.return_value = 0.15
  assert result == mock_calc.return_value
Fix hint: Assert against an independently computed value, e.g. assert result == 15.0.

Reading a finding¶

Each finding names five things:

CASE 11 - the semantic code. The skill also emits the structural C*, JS*, and R* codes the scanners use.
(J2) - the failed judgment: the expected value is not from an independent oracle.
HIGH - confidence. HIGH means no plausible legitimate reading; LOW warns.
Python - unit - behavior - language, pyramid level, and test intent.
Finding / Evidence / Fix hint - what is wrong, the lines that prove it, and how to repair it.

What it covers¶

The broadest tool of the family. It is the superset, so its coverage is the union of everything the scanners catch plus what only a reader of intent can:

Layer	Coverage
All structural codes	every `C` (Python), `JS`, and `R*` code from the three scanners, applied by reading the source
Semantic cases	10 (mocks the unit), 11 (asserts the stub), 12 (re-implements the formula), 15 (shared state), 18 (contradicts the spec)
The S-series (AI-only)	`S1`-`S13`: intent mismatch, irrelevant oracle, plausible-but-wrong value, coarse oracle, tests the framework, happy-path-only, value lifted from output, mock through indirection, self-fulfilling arrangement, asserts the log, negative-only security check, patches core logic, cross-file order dependence
DSL passes	Gherkin `.feature` and Tavern `*.tavern.yaml` (see Gherkin and Tavern)
Level awareness	reads unit / integration / E2E from signals and adjusts the oracle

Modes¶

Detect - read a suite and report findings (J1-J6, level, evidence, fix hint).
Author - generate tests that are not false-green by construction, one spec per pyramid level.
AI-fix gate (F7) - propose a strengthened test and validate it with a bidirectional mutation gate (pass on clean code, fail on the reintroduced bug).

Complete usage and configuration¶

The first-run above is the five-minute path. This section is the full reference: the CLI (analyze and the fix mutation-gate mode), provider configuration for every supported backend, and the per-host enable steps. It mirrors what the project README, docs/cli.md, and providers.md document.

The CLI¶

Node 18 or newer, zero dependencies. Install or run on demand:

npm install -g falsegreen-skill
npx falsegreen-skill analyze tests/test_payment.py

Two commands:

falsegreen-skill analyze <file...> [options]
falsegreen-skill fix <test-file> --case <code> --line <n> [options]

The CLI sends each file to an LLM provider with the J1-J6 protocol as the system prompt and prints the findings report. It identifies the language from the file extension, so Python, TypeScript, and JavaScript work the same way with no extra flag.

`analyze` flags¶

npx falsegreen-skill analyze tests/test_orders.py                 # single file
npx falsegreen-skill analyze tests/test_orders.py tests/test_pay.py  # multiple (separate calls)
npx falsegreen-skill analyze tests/ --json --fail-on-high          # CI gate: exits 2 on a HIGH finding
npx falsegreen-skill analyze tests/ --model claude-opus-4-8        # deeper model for case 18
npx falsegreen-skill analyze tests/ --temperature 0.0              # more deterministic (default 0.2)

Flag	Description	Default
`--provider <name>`	`anthropic`, `openai`, `gemini`, or `openai-compatible`	`anthropic`
`--model <model>`	Model override. Required for `openai-compatible`	per provider (below)
`--base-url <url>`	API base URL. Required for `openai-compatible`	none
`--json`	Validate and output JSON conforming to `schema/report.json`	off
`--conventions <file>`	Conventions YAML/text block injected per SKILL.md Step 0	none
`--temperature <n>`	Sampling temperature 0.0-1.0. Skipped for OpenAI o-series (o3, o4-mini)	`0.2`
`--max-tokens <n>`	Max output tokens per request	`4096`
`--fail-on-high`	Exit 2 when any HIGH finding is present. Requires `--json`	off

With --json, each model response is validated against the canonical schema and emitted as one aggregate report. If your project uses custom assertion helpers or intentional patterns that look like smells, declare them in a conventions file and pass --conventions; the block is injected before the file content, per Step 0 of the protocol, so the model reads it before judging.

`fix` mode (mutation gate)¶

analyze finds a false-green; fix proposes a stronger test and proves it before you trust it. It is opt-in, Python/pytest only, and propose-only: it prints a test-file patch but never applies it and never edits production code.

# propose a patch for a C2b finding and run the gate against the real SUT
npx falsegreen-skill fix tests/test_discount.py --case C2b --line 14 --sut src/discount.py

# parse + preserve only, no mutation gate (no runnable SUT, or a quick pass)
npx falsegreen-skill fix tests/test_discount.py --case C5 --line 9 --cheap

# machine-readable gate verdict (schema/fix-validation.json)
npx falsegreen-skill fix tests/test_discount.py --case C20 --line 22 --sut src/discount.py --json

Flag	Description
`--case <code>`	Catalog code of the finding to fix (`C2b`, `C20`, `C21`, `C5`, `C7`)
`--line <n>`	Line of the finding in the test file (1-indexed)
`--sut <file>`	Production file the test protects. Required for a validated fix; without it the gate degrades to propose-only
`--sut-line <n>`	Line in the SUT to mutate (the behavioural line the finding names). Defaults to `--line`
`--cheap`	Validation tier: parse + preserve only, no mutation gate. Default tier is strong (parse + preserve + mutation)

The gate runs three checks on a clean replica: the patch parses, it passes pytest against the real code, and it fails on a line-scoped mutation of the SUT. A patch is accepted only when it both passes on correct code and goes red on the mutant, which is what proves the new assertion catches a bug instead of being a fresh tautology. Without --sut it degrades to propose-only and says the fix is unvalidated. The honest limit: the gate proves the fix catches the targeted mutant, not every possible bug.

Exit codes¶

Code	Meaning
`0`	Analysis completed (findings may still exist; `analyze` is an analysis tool, not a gate)
`1`	Error: missing file, missing API key, bad flag, invalid JSON, schema mismatch, non-2xx API response
`2`	`--fail-on-high` was set and the JSON report contains at least one HIGH finding

Provider configuration¶

The skill is not tied to Claude. The protocol is provider-agnostic; pick a backend with --provider and set the matching API key in the environment.

Variable	Used by
`ANTHROPIC_API_KEY`	`--provider anthropic`
`OPENAI_API_KEY`	`--provider openai` (and fallback for `openai-compatible`)
`GEMINI_API_KEY`	`--provider gemini`
`FALSEGREEN_API_KEY`	`--provider openai-compatible` (takes precedence over `OPENAI_API_KEY`)

Default models: anthropic uses claude-sonnet-4-6, openai uses gpt-4o, gemini uses gemini-2.5-pro. For deep case 18 analysis, pass --model claude-opus-4-8 (Anthropic) or --model o3 (OpenAI); when using o3, --temperature is ignored automatically.

# Anthropic (default)
export ANTHROPIC_API_KEY=sk-ant-...
falsegreen-skill analyze tests/test_payment.py

# OpenAI
export OPENAI_API_KEY=sk-...
falsegreen-skill analyze tests/test_payment.py --provider openai

# Google Gemini
export GEMINI_API_KEY=...
falsegreen-skill analyze tests/test_payment.py --provider gemini

openai-compatible (base-url)¶

Any OpenAI-compatible endpoint works through --provider openai-compatible with an explicit --base-url and --model. Set FALSEGREEN_API_KEY to that provider's key.

# Groq (fast LLaMA)
export FALSEGREEN_API_KEY=gsk_...
falsegreen-skill analyze tests/test_payment.py \
  --provider openai-compatible \
  --base-url https://api.groq.com/openai/v1 \
  --model llama-3.3-70b-versatile

# Ollama (local)
export FALSEGREEN_API_KEY=ollama
falsegreen-skill analyze tests/test_payment.py \
  --provider openai-compatible \
  --base-url http://localhost:11434/v1 \
  --model qwen2.5-coder:32b

Reasoning models on openai-compatible hosts use the same shape, just a stronger model id:

# Nvidia NIM (DeepSeek R1 reasoning)
export FALSEGREEN_API_KEY=nvapi-...
falsegreen-skill analyze tests/test_orders.py \
  --provider openai-compatible \
  --base-url https://integrate.api.nvidia.com/v1 \
  --model deepseek-ai/deepseek-r1

# Fireworks (DeepSeek R1 reasoning)
export FALSEGREEN_API_KEY=fw_...
falsegreen-skill analyze tests/test_orders.py \
  --provider openai-compatible \
  --base-url https://api.fireworks.ai/inference/v1 \
  --model accounts/fireworks/models/deepseek-r1

# Groq (DeepSeek R1 distill, reasoning)
export FALSEGREEN_API_KEY=gsk_...
falsegreen-skill analyze tests/test_orders.py \
  --provider openai-compatible \
  --base-url https://api.groq.com/openai/v1 \
  --model deepseek-r1-distill-llama-70b

For the hardest case 18 findings (expected value contradicts the spec), the maintained pattern is a two-pass finder/refuter call with a reasoning model, documented in the project's providers.md.

Per-host enable¶

The same protocol is packaged for each host. Pick the path that matches your editor or agent.

Claude Code (primary path). Add the marketplace and install the plugin:

/plugin marketplace add vinicq/falsegreen-skill
/plugin install falsegreen-skill@falsegreen

Then invoke /falsegreen-skill:falsegreen-llm, or attach a test file and ask for false-positive analysis, the skill triggers on intent. For Python it applies the full pattern catalog directly; optionally run the static Python scanner first and hand its output to the skill as the structural pass.

OpenAI Codex CLI. Two paths: the plugin marketplace (codex plugin marketplace add vinicq/falsegreen-skill), or clone the repo, where the root AGENTS.md loads automatically when Codex starts a session inside the clone.

Gemini CLI. Install the extension:

gemini extensions install https://github.com/vinicq/falsegreen-skill

The manifest loads GEMINI.md as persistent context, so every session carries the J1-J6 protocol; ask in natural language. A workspace skill at .gemini/skills/falsegreen-skill/SKILL.md is the alternative when you want Gemini's skill discovery rather than extension-wide context.

Cursor. Copy the rule template into .cursor/rules/falsegreen-skill.mdc (full template in the project's contexts/cursor.md). Cursor loads it on a matching test file; ask Cursor to analyze the file and the J1-J6 protocol runs.

Plain LLM / API. Use the SDK snippets in the project's providers.md: the system prompt is SKILL.md, the user message is the test file, and structured JSON output follows schema/report.json. The same base-url pattern as the CLI covers any OpenAI-compatible backend.

What it does not cover, and why¶

The runtime half of F7¶

The skill proposes a strengthened test and self-checks that it can fail, but it does not run mutation testing - that is the host's job (mutmut, cosmic-ray, Stryker). A strengthened test is only accepted after the bidirectional gate runs, and the skill never invokes the mutation tool. So the live gate result is out of the skill's hands by design.

The wrong axis¶

Even as the broadest tool, it stays false-green only. Brittleness/false-red, pure hygiene, slow, design, naming, and runtime-culture smells are out, the same boundary as the scanners. See coverage vs the literature.

Determinism trade-off¶

The semantic findings are operator-confirmed, not deterministic: confidence is LOW or HIGH by how clear the contradiction is, and the skill shows its reasoning instead of auto-blocking. Where a parser can prove a pattern, the static scanners are the faster, deterministic pre-filter; the skill is the complete multi-stack net. See scope and honesty.