Detects poisoned training data and backdoored ML models using ART, Cleanlab, and integrity checks for data provenance, label quality, activation clustering, and spectral signatures.
Slash command: /cybersecurity-skills:detecting-data-and-model-poisoning
> **Authorized-use-only notice:** This skill includes routines that craft poisoned samples and backdoor triggers for *defensive validation*. Generate and use poisoned data and backdoored models only in isolated test environments you control. Never deploy a backdoored model or distribute poisoned datasets.
Data poisoning and model backdooring attack the integrity of an ML system at training time rather than at inference. In data poisoning (MITRE ATLAS AML.T0020 Poison Training Data), an adversary injects manipulated samples into the training, fine-tuning, or RAG corpus so the resulting model misbehaves — degraded accuracy, targeted misclassification, or an attacker-chosen bias. In model backdooring (MITRE ATLAS AML.T0018 Backdoor ML Model), the model behaves normally on clean inputs but produces an attacker-chosen output whenever a hidden trigger (a pixel patch, a rare token, a phrase) is present. Both are amplified by ML supply-chain compromise (AML.T0010): poisoned public datasets, trojaned pre-trained weights downloaded from a hub, or a malicious model serialization. This is OWASP LLM04:2025 Data and Model Poisoning.
Detection spans the pipeline. On the data side: provenance and integrity checks, statistical outlier and label-flip detection, and de-duplication of suspiciously near-identical samples. On the model side: activation-clustering and spectral-signature analysis (both exploit the fact that poisoned samples activate the network differently from clean ones) and trigger reconstruction. On the supply-chain side: verifying weight hashes and signatures, and refusing unsafe serialization formats (pickle-based .bin/.pt) in favor of safetensors. This skill implements all three using IBM's Adversarial Robustness Toolbox (ART), Cleanlab for label-quality issues, and integrity tooling.
python -m venv .venv && source .venv/bin/activate
# IBM Adversarial Robustness Toolbox — poisoning detection defenses
pip install adversarial-robustness-toolbox
# Cleanlab — label/data quality issue detection
pip install cleanlab
# Modeling + safe serialization + hashing
pip install numpy scikit-learn safetensors
# (Choose one framework backend ART can wrap)
pip install tensorflow # or: pip install torch
| ID | Official Name | Relevance |
|---|---|---|
| AML.T0020 | Poison Training Data | Injection of manipulated samples into the training corpus |
| AML.T0018 | Backdoor ML Model | Trigger-activated hidden behavior in the trained model |
| AML.T0010 | ML Supply Chain Compromise | Poisoned public datasets / trojaned downloaded weights |
| AML.T0024 | Exfiltration via ML Inference API | Some poisoning aims to leak data via the model's responses |
Refuse artifacts whose hash/signature you cannot verify, and prefer safetensors over pickle-based formats (pickle can execute code on load).
# Verify a downloaded checkpoint against a published SHA-256
sha256sum model.safetensors
# compare to the hub-published digest
# Flag unsafe pickle-based weights in a directory
find ./models -type f \( -name "*.bin" -o -name "*.pt" -o -name "*.pkl" -o -name "*.ckpt" \)
# safe_load.py — load weights without executing pickle
from safetensors.numpy import load_file
weights = load_file("model.safetensors") # no arbitrary code execution
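The hash comparison and the pickle-extension check above can also be scripted so that a pipeline refuses an artifact automatically. The following is a minimal sketch using only the standard library; the names `sha256_of`, `verify_artifact`, and `PICKLE_SUFFIXES` are hypothetical helpers, not part of safetensors or any hub client, and the expected digest is assumed to come from a source you trust.

```python
import hashlib
import hmac
from pathlib import Path

# Extensions treated as pickle-based, mirroring the find command above (assumption).
PICKLE_SUFFIXES = {".bin", ".pt", ".pkl", ".ckpt"}


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large checkpoints are never read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Return True only if the file is not pickle-based and its digest matches."""
    if Path(path).suffix.lower() in PICKLE_SUFFIXES:
        return False
    actual = sha256_of(path)
    # Normalize the published digest (case, trailing newline) before comparing.
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

A caller would load weights only when `verify_artifact("model.safetensors", published_digest)` returns True, where `published_digest` is the hub-published value.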
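Cleanlab finds mislabeled, outlier, and near-duplicate samples, all common signatures of label-flip poisoning. The scan below uses find_label_issues, which covers label issues only; outlier and near-duplicate checks are exposed through Cleanlab's Datalab interface.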
# cleanlab_scan.py
import numpy as np
from cleanlab.filter import find_label_issues
# pred_probs: out-of-sample predicted probabilities (n_samples x n_classes)
# labels: given integer labels (n_samples,)
def scan(labels: np.ndarray, pred_probs: np.ndarray):
    issues = find_label_issues(
        labels=labels, pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
    print(f"[*] {len(issues)} suspected label issues (potential poisoning)")
    return issues
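To make the "self_confidence" ranking used above more concrete, here is a small NumPy-only illustration of the score itself: the probability the model assigns to each sample's given label, with low values ranked first. It is a sketch of the ranking criterion only, not of Cleanlab's filtering logic, which additionally decides which samples count as issues. The helper names are hypothetical.

```python
import numpy as np


def self_confidence(labels, pred_probs):
    """Probability the model assigns to each sample's given label."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    return pred_probs[np.arange(len(labels)), labels]


def rank_by_self_confidence(labels, pred_probs):
    """Indices ordered from least to most trusted label.

    Illustration only: this ranks every sample and does not decide
    which of them are actual label issues.
    """
    return np.argsort(self_confidence(labels, pred_probs), kind="stable")


labels = np.array([0, 1, 1, 0])
pred_probs = np.array([
    [0.9, 0.1],  # agrees with label 0
    [0.8, 0.2],  # model prefers class 0, but the label says 1
    [0.3, 0.7],  # agrees with label 1
    [0.4, 0.6],  # mild disagreement with label 0
])
order = rank_by_self_confidence(labels, pred_probs)
print(order.tolist())  # [1, 3, 2, 0]
```

Sample 1 ranks first because the model gives its given label only 0.2 probability, which is the pattern a flipped label tends to produce.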
ActivationDefence clusters per-class activations; a class whose activations split into two distinct clusters indicates injected (poisoned) samples.
# activation_defence.py
import numpy as np
from art.estimators.classification import KerasClassifier
from art.defences.detector.poison import ActivationDefence
# Note: ART's poison detectors generally expect one-hot encoded y_train, and
# KerasClassifier under TensorFlow 2 usually needs eager execution disabled
# first (tf.compat.v1.disable_eager_execution()).
def detect(model, x_train, y_train):
    classifier = KerasClassifier(model=model)  # wrap your trained model
    defence = ActivationDefence(classifier, x_train, y_train)
    report, is_clean_lst = defence.detect_poison(
        nb_clusters=2, nb_dims=10, reduce="PCA"
    )
    # is_clean_lst[i] == 0 marks a suspected poisoned sample
    poisoned_idx = np.where(np.array(is_clean_lst) == 0)[0]
    print(f"[*] activation clustering flagged {len(poisoned_idx)} samples")
    return poisoned_idx, report
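For intuition about what the clustering step is doing, the idea can be sketched for a single class in plain NumPy: reduce the activations with PCA, split them into two clusters, and flag the smaller cluster when it is suspiciously small. This is an illustrative approximation, not ART's implementation; the function names, the deterministic 2-means seeding, and the 0.35 size cut-off are all assumptions made for the example.

```python
import numpy as np


def pca_reduce(acts, nb_dims):
    """Project centered activations onto their top principal directions."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:nb_dims].T


def two_means(points, n_iter=50):
    """Lloyd's algorithm with k=2, seeded from the extremes of the first component."""
    first = points[:, 0]
    centers = np.stack([points[first.argmin()], points[first.argmax()]])
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new_centers = np.stack([
            points[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
            for k in range(2)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign


def flag_small_cluster(acts, nb_dims=2, max_fraction=0.35):
    """Boolean mask of samples in the smaller cluster, if it is small enough.

    acts: activations of the samples of ONE class, shape (n_samples, n_features).
    max_fraction is an illustrative threshold, not a value taken from ART.
    """
    reduced = pca_reduce(np.asarray(acts, dtype=float), nb_dims)
    assign = two_means(reduced)
    sizes = np.bincount(assign, minlength=2)
    small = int(sizes.argmin())
    if sizes[small] / len(assign) >= max_fraction:
        return np.zeros(len(assign), dtype=bool)  # balanced split: nothing flagged
    return assign == small
```

A class whose activations split roughly evenly is left alone here, on the assumption that poisoned samples are a minority within the class they were injected into.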
Spectral signatures project each class's centered feature representations onto their top singular direction; poisoned samples tend to score as outliers along it, which gives a second signal that does not depend on clustering.
# spectral.py
import numpy as np
from art.estimators.classification import KerasClassifier
from art.defences.detector.poison import SpectralSignatureDefense
def detect(model, x_train, y_train):
    classifier = KerasClassifier(model=model)  # wrap your trained model
    defence = SpectralSignatureDefense(
        classifier, x_train, y_train,
        expected_pp_poison=0.05, batch_size=128, eps_multiplier=1.5,
    )
    report, is_clean_lst = defence.detect_poison()
    # is_clean_lst[i] == 0 marks a suspected poisoned sample
    poisoned_idx = np.where(np.array(is_clean_lst) == 0)[0]
    print(f"[*] spectral signatures flagged {len(poisoned_idx)} samples")
    return poisoned_idx, report
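The scoring behind spectral signatures can be sketched for one class in a few lines of NumPy: center the representations, take the top right singular vector, and score each sample by its squared projection onto it. The quantile cut-off below reuses the `expected_pp_poison` and `eps_multiplier` names from the call above, but how they combine here is an assumption for illustration and may differ from ART's exact rule.

```python
import numpy as np


def spectral_scores(reps):
    """Squared projection of each centered representation onto the top
    right singular vector of the class's representation matrix."""
    reps = np.asarray(reps, dtype=float)
    centered = reps - reps.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2


def flag_by_spectrum(reps, expected_pp_poison=0.05, eps_multiplier=1.5):
    """Flag the highest-scoring fraction of one class's samples.

    The flagged fraction is eps_multiplier * expected_pp_poison, so the
    detector deliberately over-removes relative to the expected poison rate.
    """
    scores = spectral_scores(reps)
    keep_quantile = max(1.0 - eps_multiplier * expected_pp_poison, 0.0)
    cutoff = np.quantile(scores, keep_quantile)
    return scores > cutoff
```

Because a fixed fraction is always flagged, this score says which samples are most suspicious, not whether the class is poisoned at all; it is most useful alongside the other signals.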
Test whether a candidate trigger flips predictions to an attacker target class far above the clean baseline.
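A candidate trigger for this kind of probe can be as simple as a fixed patch stamped in one corner of the image. The sketch below builds such a callable; `make_corner_patch_trigger` is a hypothetical helper, and it assumes channels-last images with pixel values scaled to [0, 1]. As with the rest of this skill, it is meant for probing models in an isolated test environment.

```python
import numpy as np


def make_corner_patch_trigger(patch_size=3, value=1.0):
    """Build an apply_trigger(x) callable that paints a square patch in the
    bottom-right corner of a single H x W or H x W x C image."""
    def apply_trigger(x):
        x = np.array(x, copy=True)  # never modify the caller's image
        x[-patch_size:, -patch_size:, ...] = value
        return x
    return apply_trigger


image = np.zeros((8, 8, 1))
stamped = make_corner_patch_trigger(patch_size=3)(image)
```

The returned callable has the shape the probe expects for its `apply_trigger` argument: it takes one sample and returns a stamped copy.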
# trigger_probe.py
import numpy as np
def test_trigger(model, x_clean, target_class, apply_trigger):
    """apply_trigger(x) stamps a candidate trigger (e.g. a corner pixel patch)."""
    clean_preds = model.predict(x_clean).argmax(axis=1)
    x_trig = np.stack([apply_trigger(x.copy()) for x in x_clean])
    trig_preds = model.predict(x_trig).argmax(axis=1)
    asr = float(np.mean(trig_preds == target_class))  # attack success rate
    base = float(np.mean(clean_preds == target_class))
    print(f"[*] target-class rate clean={base:.3f} triggered={asr:.3f}")
    return {"baseline": base, "trigger_success_rate": asr,
            "backdoor_suspected": asr - base > 0.5}
Remove flagged samples (intersection of Cleanlab + ART signals is highest-confidence), retrain on the cleaned set, and re-test for the trigger. Document: artifact provenance, samples flagged by each method, trigger ASR before/after, and ATLAS mapping. Recommend dataset provenance controls, signed weights (safetensors + sigstore/cosign), and ongoing pipeline scanning.
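Combining the detectors' outputs before retraining can be sketched as a simple vote count over sample indices. The helper below is hypothetical; `min_votes=2` is an assumed default meaning "flagged by at least two methods", and raising it to the number of methods gives the strict intersection.

```python
import numpy as np


def combine_flags(n_samples, flagged_by_method, min_votes=2):
    """Merge per-method flagged indices into a high-confidence set.

    flagged_by_method: dict mapping a method name to an iterable of flagged
    sample indices (for example the arrays returned by the scans above).
    Returns (high_confidence_indices, keep_mask) for building the cleaned set.
    """
    votes = np.zeros(n_samples, dtype=int)
    for indices in flagged_by_method.values():
        # np.unique guards against a method reporting the same index twice.
        votes[np.unique(np.asarray(indices, dtype=int))] += 1
    high_confidence = np.flatnonzero(votes >= min_votes)
    keep_mask = votes < min_votes
    return high_confidence, keep_mask


flagged, keep = combine_flags(
    6,
    {"cleanlab": [1, 2, 5], "activation": [2, 5], "spectral": [5, 0]},
)
print(flagged.tolist())  # [2, 5]
```

The cleaned training set is then `x_train[keep]` and `y_train[keep]`, and the per-method index lists can go straight into the report of what each method flagged.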
| Tool | Purpose | Source |
|---|---|---|
| Adversarial Robustness Toolbox | Activation clustering & spectral-signature poisoning defenses | https://github.com/Trusted-AI/adversarial-robustness-toolbox |
| Cleanlab | Label/data-quality issue detection | https://github.com/cleanlab/cleanlab |
| safetensors | Safe (non-pickle) weight serialization | https://github.com/huggingface/safetensors |
| OWASP LLM04:2025 | Data and Model Poisoning reference | https://genai.owasp.org/llmrisk/llm042025-data-and-model-poisoning/ |
| MITRE ATLAS | AI threat technique taxonomy | https://atlas.mitre.org/ |
| Layer | Method | Tool | Signal |
|---|---|---|---|
| Supply chain | Hash/signature + safe format | sha256/safetensors | Tampered or unsafe artifact |
| Data | Label issues / outliers | Cleanlab | Mislabeled / injected samples |
| Model | Activation clustering | ART ActivationDefence | Per-class activation split |
| Model | Spectral signatures | ART SpectralSignatureDefense | Outlier covariance spectrum |
| Model | Trigger probing | custom | High trigger attack-success-rate |
npx claudepluginhub costrict-plugins-repo/mukul975-anthropic-cybersecurity-skills-cybersecurity-skills