From pup
Investigates a specific flaky test by retrieving its history, failure pattern, and category, then recommends fix, quarantine, or escalate. Best for Datadog CI users.
Slash command
/pup:dd-triage-flaky-test
One-line summary: Investigate a specific flaky test — get history, failure pattern, and category, then recommend fix, quarantine, or escalate.
Requires: dd-pup skill (pup CLI installed and authenticated).
| Parameter | Description |
|---|---|
| Test name | Fully qualified test name (e.g. TestMyFunc or com.example.MyTest) |
| Repository | Lowercase, no-schema URL (e.g. github.com/org/repo). Derive from git remote get-url origin if not provided. |
Derive repository ID from git if not provided:
```shell
git remote get-url origin
# Strip protocol and trailing .git, then lowercase the result
# e.g. https://github.com/DataDog/my-repo.git → github.com/datadog/my-repo
```
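The normalization described above can be sketched as a small helper. This is a minimal sketch, not part of pup: the function name `normalize_repo_id` is hypothetical, and the handling of scp-style SSH remotes (`git@host:org/repo.git`) is an assumption beyond what the comments above state.

```shell
# Hypothetical helper: turn a git remote URL into a lowercase, no-schema repository ID.
# The body runs in a subshell, so the url variable does not leak into the caller.
normalize_repo_id() (
  url="$1"
  case "$url" in
    *://*) url="${url#*://}" ;;                              # strip scheme, e.g. https://
    *@*:*) url=$(printf '%s' "$url" | sed 's|:|/|') ;;       # assumption: scp-style host:org/repo
  esac
  url="${url#*@}"                                            # strip a user@ prefix if present
  url="${url%.git}"                                          # strip trailing .git
  printf '%s\n' "$url" | tr '[:upper:]' '[:lower:]'          # lowercase the result
)

# Usage: normalize_repo_id "$(git remote get-url origin)"
```

The result can be passed as the `<repo>` value in the queries that follow.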
Validation fallback: If STEP 1 returns no results, confirm the correct repository by searching without a repo filter:
```shell
pup cicd tests search \
  --query "@test.name:\"<test-name>\"" \
  --from 30d \
  --limit 5
```
Extract @git.repository.id_v2 from results and retry STEP 1 with the confirmed value.
Preferred — use fingerprint_fqn if known (fingerprint_fqn is a valid CI Visibility search facet, distinct from flaky_state):
```shell
pup cicd flaky-tests search \
  --query "fingerprint_fqn:<fqn>" \
  --sort="-last_flaked" \
  --limit 5
```
Fallback — use name + suite + repo:
```shell
pup cicd flaky-tests search \
  --query "@test.name:\"<test-name>\" @test.suite:\"<suite>\" @git.repository.id_v2:\"<repo>\"" \
  --sort="-last_flaked" \
  --limit 10
```
Omit @test.suite if unknown; if the same test name appears in multiple suites, pick the entry whose suite matches the failing test.
Do not filter by flaky_test_state — return the test regardless of state.
Note: the query filter facet is flaky_test_state; the returned response attribute is flaky_state — these are different names for the same concept; do not use flaky_state:active as a query filter.
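Assembling the fallback query by hand is error-prone when the suite is unknown. The sketch below builds the query string and drops the `@test.suite` clause when no suite is given; it never adds a `flaky_test_state` filter. The function name `build_flaky_query` is hypothetical, and the sketch assumes the test name, suite, and repo contain no double quotes.

```shell
# Hypothetical helper: build the name + suite + repo fallback query.
# Arguments: <test-name> <suite or empty string> <repo>
build_flaky_query() (
  name="$1"; suite="$2"; repo="$3"
  query="@test.name:\"$name\""
  if [ -n "$suite" ]; then
    query="$query @test.suite:\"$suite\""     # only added when the suite is known
  fi
  query="$query @git.repository.id_v2:\"$repo\""
  printf '%s\n' "$query"
)

# Usage (command and flags as in the fallback above):
#   pup cicd flaky-tests search \
#     --query "$(build_flaky_query 'TestMyFunc' '' 'github.com/org/repo')" \
#     --sort="-last_flaked" --limit 10
```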
Extract from results:
- fingerprint_fqn: unique test identifier; used as the id in the STEP 5 write call. If absent, do not proceed to quarantine (see STEP 5).
- flaky_state: current state (active / quarantined / disabled / fixed)
- test_stats.failure_rate_pct: percentage of runs that fail
- flaky_category: root cause category
- codeowners: owning team
- pipeline_stats.total_lost_time_ms: total CI time lost

```shell
pup cicd tests search \
  --query "@test.name:\"<test-name>\" @test.suite:\"<suite>\" @test.status:fail @git.repository.id_v2:\"<repo>\"" \
  --from 7d \
  --limit 20
```
Extract:
- Error details (@error.message, @error.stack)
- Branches affected (@git.branch): branch-specific vs. widespread
- @ci.pipeline.id values for blast radius (STEP 3)

Count distinct pipelines impacted using pipeline IDs from STEP 2:
```shell
pup cicd events aggregate \
  --query "@ci.status:error @ci.pipeline.id:(<id1> OR <id2> OR ...) @git.repository.id_v2:\"<repo>\"" \
  --compute count \
  --group-by "@ci.pipeline.name" \
  --from 7d
```
Use the first 10 pipeline IDs from STEP 2 (cap at 10; if more are available, run a second batch and merge results by summing counts per @ci.pipeline.name across batches). Report blast radius as: total number of unique pipelines impacted and whether failures are branch-specific or widespread.
Note: a pipeline failure is not necessarily caused solely by this flaky test — treat blast radius as a signal, not a definitive count.
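The batching and merging rule above can be sketched with two small helpers. Both function names are hypothetical. The merge helper assumes each batch has already been reduced to tab-separated `<pipeline name><TAB><count>` lines, which is an illustrative format and not pup's native output.

```shell
# Hypothetical helper: read one pipeline ID per line on stdin and emit one
# "@ci.pipeline.id:(a OR b ...)" clause per batch of up to 10 IDs.
pipeline_id_clauses() {
  awk -v size=10 '
    NF == 0 { next }                                   # skip blank lines
    {
      if (n % size == 0) clause = $0                   # first ID of a new batch
      else               clause = clause " OR " $0
      n++
      if (n % size == 0) print "@ci.pipeline.id:(" clause ")"
    }
    END { if (n % size != 0) print "@ci.pipeline.id:(" clause ")" }
  '
}

# Hypothetical helper: sum counts per pipeline name across batches.
# Assumed input: "<pipeline name><TAB><count>" per line.
merge_pipeline_counts() {
  awk -F '\t' '
    { total[$1] += $2 }
    END { for (name in total) printf "%s\t%d\n", name, total[name] }
  ' | sort
}
```

Each emitted clause replaces the `@ci.pipeline.id:(...)` part of one aggregate query; the number of lines left after merging is the count of unique pipelines to report.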
Use flaky_category from STEP 1 and error messages from STEP 2.
Root cause first:
infra and recommend retry instead.
Fix at the correct layer:
Forbidden — do not propose these:
Fix patterns by category:
| Category | Approach |
|---|---|
| timeout | Identify the slow operation and make it synchronous or deterministic; do NOT simply raise the timeout constant |
| concurrency | Add deterministic synchronization (barriers, channels, locks); remove shared mutable state between tests |
| network | Mock or stub network calls at the boundary; if the test requires a real connection, isolate it with a test server |
| time | Inject a controllable clock; replace wall-clock assertions with relative or event-driven checks |
| order_dependency | Isolate test state with setup/teardown; eliminate dependencies on execution order or global state |
| environment_dependency | Mock env variables and external config; use test-local fixtures, not shared directories or singletons |
| resource_leak | Ensure every resource opened in a test is closed in teardown; use cleanup hooks that run even on failure |
| randomness | Fix the random seed for the test run; use deterministic inputs instead of random generation |
| asynchronous_wait | Replace fixed sleeps with condition polling or event/signal-driven waits with a hard timeout |
| io | Use temp files/dirs cleaned up in teardown; mock or stub filesystem interactions |
| unknown | Skip fix attempt → go to quarantine |
Before proposing code changes, verify all of the following — if any fails, skip fix and recommend quarantine:
Decision:
- Category unknown OR verification above fails → skip fix, recommend quarantine

```text
Flaky Test Triage Brief
=======================
Test: <fully qualified test name>
Service: <@test.service>
Category: <flaky_category>
Failure Rate: <test_stats.failure_rate_pct>%
Duration Lost: <pipeline_stats.total_lost_time_ms>ms
Codeowners: <codeowners>
Blast Radius: <N> pipelines (<branch-specific | widespread>) [approximate; other failures in the same pipeline runs may not be related]
Evidence:
<1-2 key error message lines from STEP 2>
Recommendation: <fix | quarantine | escalate>
Confidence: <high | medium | low>
Action: <specific next step>
```
Decision thresholds:
- failure_rate_pct > 10 OR blast radius > 5 pipelines → quarantine
- failure_rate_pct ≤ 10 AND known category AND clear fix → fix
- failure_rate_pct ≤ 10 AND category unknown → escalate to codeowners with triage brief

If recommending quarantine, present the proposed action and require explicit user approval before writing:
```text
Proposed action: quarantine "<test-name>"
id (fingerprint_fqn): <fingerprint_fqn from STEP 1>
Effect: test still runs but failures are suppressed (CI will not be blocked)
Reversible: yes (update new_state to active to restore)
Approve? (yes/no)
```
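The decision thresholds listed above can be sketched as a single helper. The function name `recommend` and the yes/no "clear fix" argument are hypothetical. The thresholds do not say what to do for a known category without a clear fix, so the sketch reports that case as `undecided` instead of guessing.

```shell
# Hypothetical helper: apply the decision thresholds.
# Arguments: <failure_rate_pct> <blast_radius_pipelines> <flaky_category> <clear_fix: yes|no>
recommend() {
  awk -v rate="$1" -v radius="$2" -v category="$3" -v clear_fix="$4" 'BEGIN {
    if (rate + 0 > 10 || radius + 0 > 5) print "quarantine"
    else if (category == "unknown")      print "escalate"
    else if (clear_fix == "yes")         print "fix"
    else                                 print "undecided"   # not covered by the stated thresholds
  }'
}

# Usage: recommend 12.5 3 timeout yes
```

awk handles the comparison because failure_rate_pct may be fractional, which plain shell integer tests would reject.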
If fingerprint_fqn was not returned in STEP 1 (test not yet in FTM or query returned no results): do not attempt the write. Surface an error and ask the user to open the Flaky Test Management UI directly to quarantine manually.
Only after explicit approval and a confirmed fingerprint_fqn, write the body file and run:
```shell
cat > /tmp/flaky-update.json <<'EOF'
{
  "data": {
    "type": "UpdateFlakyTestsRequest",
    "attributes": {
      "tests": [{"id": "<fingerprint_fqn>", "new_state": "quarantined"}]
    }
  }
}
EOF
pup test-optimization flaky-tests update --file /tmp/flaky-update.json
```
To undo: repeat with "new_state": "active".
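The two preconditions for the write (explicit approval and a confirmed fingerprint_fqn) can be enforced before the body file is created. The sketch below is one way to do that; the function name `write_quarantine_body` is hypothetical, and it only writes the file, leaving the pup update command to be run separately.

```shell
# Hypothetical guard: write the update body only with a fingerprint and an explicit "yes".
# Arguments: <fingerprint_fqn> <user approval> <output path>
write_quarantine_body() {
  if [ -z "$1" ]; then
    echo "error: fingerprint_fqn missing; quarantine manually in the Flaky Test Management UI" >&2
    return 1
  fi
  if [ "$2" != "yes" ]; then
    echo "not approved; nothing written" >&2
    return 1
  fi
  # Escape backslashes and double quotes so the ID stays a valid JSON string
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"data":{"type":"UpdateFlakyTestsRequest","attributes":{"tests":[{"id":"%s","new_state":"quarantined"}]}}}\n' "$escaped" > "$3"
}

# Usage:
#   write_quarantine_body "<fingerprint_fqn>" "$answer" /tmp/flaky-update.json \
#     && pup test-optimization flaky-tests update --file /tmp/flaky-update.json
```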
npx claudepluginhub datadog/pup --plugin pup