Skills for behavioral evaluation of LLMs using Petri and Bloom
A Claude Code plugin with skills for behavioral evaluation of LLMs using Petri and Bloom.
# Add the repo as a marketplace
claude plugin marketplace add https://github.com/k3nnethfrancis/machine-psychology-fieldkit
# Install the plugin
claude plugin install machine-psychology-fieldkit
# Alternative: clone the repo and load the plugin from a local directory
git clone https://github.com/k3nnethfrancis/machine-psychology-fieldkit.git
# Run Claude Code with the plugin directory
claude --plugin-dir /path/to/machine-psychology-fieldkit
# Verify the installation
claude plugin list
You should see machine-psychology-fieldkit in the list.
# Clone both repos
git clone https://github.com/safety-research/petri.git
git clone https://github.com/safety-research/bloom.git
# Set up Petri
cd petri
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -e ".[dev]"
# Set API key
export ANTHROPIC_API_KEY="your-key-here"
# Set up Bloom (run from the directory that contains both clones)
cd bloom
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -e .
# Set API key
export ANTHROPIC_API_KEY="your-key-here"
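Before running either tool, it can help to confirm that the setup steps above actually took effect. The following is a minimal sketch, not part of the plugin: the `preflight` helper is a hypothetical name, and it assumes the repos were cloned as `petri` and `bloom` with a `.venv` directory inside each, as in the commands above.

```python
import os
from pathlib import Path


def preflight(repo_dir, env=None):
    """Return a list of human-readable setup problems (empty if none found)."""
    env = os.environ if env is None else env
    repo = Path(repo_dir)
    problems = []
    if not repo.is_dir():
        problems.append(f"{repo} not found: clone the repo first")
    elif not (repo / ".venv").is_dir():
        problems.append(f"{repo / '.venv'} not found: run python -m venv .venv")
    # An empty string counts as unset, since it cannot be a usable key.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems


if __name__ == "__main__":
    # Run from the directory that contains both clones.
    for name in ("petri", "bloom"):
        for problem in preflight(name):
            print(f"[{name}] {problem}")
```

The check only looks for the directories and the environment variable; it does not validate the key or the installed dependencies.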
Run adversarial audits with Petri.
Quick start:
cd petri
inspect eval src/petri/tasks/petri.py --model anthropic/claude-sonnet-4-20250514
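To repeat the same audit against several models, the quick start command can be generated per model rather than typed by hand. This is a rough sketch under stated assumptions: the task path and `--model` flag are copied from the command above, while `petri_command` and `print_commands` are hypothetical helper names. It only prints commands and does not run anything.

```python
import shlex

# Task path copied from the quick start command above.
PETRI_TASK = "src/petri/tasks/petri.py"


def petri_command(model):
    """Build the argv list for one Petri run against `model`."""
    return ["inspect", "eval", PETRI_TASK, "--model", model]


def print_commands(models):
    """Print one shell-ready command per model without executing anything."""
    for model in models:
        print(shlex.join(petri_command(model)))


if __name__ == "__main__":
    # Only this model ID comes from the quick start; add your own targets here.
    print_commands(["anthropic/claude-sonnet-4-20250514"])
```

Printing first makes it easy to review the commands before pasting them into a shell inside the `petri` directory.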
Generate evaluation scenarios with Bloom.
Quick start:
cd bloom
python -m bloom.run --config configs/your_config.yaml
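If you keep several config files, for example one per framing of a behavior, the same command can be built for each file. The sketch below assumes the `python -m bloom.run --config ...` invocation shown above and a `configs` directory of `*.yaml` files. The one-file-per-framing layout and the helper names `bloom_command` and `commands_for_configs` are illustrative assumptions, not Bloom conventions.

```python
import shlex
from pathlib import Path


def bloom_command(config_path):
    """Build the argv list for one Bloom run, mirroring the quick start command."""
    return ["python", "-m", "bloom.run", "--config", str(config_path)]


def commands_for_configs(config_dir="configs"):
    """Return one command per *.yaml file in config_dir, in sorted order."""
    return [bloom_command(p) for p in sorted(Path(config_dir).glob("*.yaml"))]


if __name__ == "__main__":
    # Prints the commands instead of running them, so you can review them first.
    for argv in commands_for_configs():
        print(shlex.join(argv))
```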
Once installed, Claude Code automatically activates these skills when you're working on behavioral evaluation tasks. You can also invoke them directly by typing /petri-collaborator or /bloom-collaborator.
| Use Case | Tool |
|---|---|
| Broad audit across 36 dimensions | Petri |
| Test a specific behavior hypothesis | Bloom |
| Compare models on standard battery | Petri |
| Measure robustness across framings | Bloom |
License: MIT