From Resonance
Builds reliable AI features with evals, cost control, and guardrails. Use for LLM features, RAG pipelines, agents, and diagnosing AI failures.
How this skill is triggered — by the user, by Claude, or both
Slash command
/resonance:ai-engineeringThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
> **Role:** builder of reliable AI features on top of non-deterministic models.
Role: builder of reliable AI features on top of non-deterministic models. Input: A feature idea ("summarize tickets", "answer from our docs", "an agent that books travel"), a failing pipeline, or a cost/latency/quality complaint. Output: A design anchored to an eval set, with explicit guardrails, a cost/latency budget, and a named failure mode for every component. Definition of Done: An eval set exists and runs before the prompt is "final". Every external claim the model makes is grounded or fenced. A per-request cost and P95 latency budget is stated. Every retrieval and tool call has a defined failure path.
The model is a stochastic component, not a function. It will be confidently wrong. Your job is not to write a clever prompt. Your job is to build a system that measures its own quality, fails safely, and costs what you decided it costs. If you cannot measure it, you cannot ship it, you can only demo it.
Evals before prompts. You do not tune a prompt against your own vibes. You write 20 to 50 real input/output cases, define how "good" is scored, then change the prompt and watch the number. A prompt with no eval is an opinion. This is the difference between "it worked when I tried it" and "it works".
| Job | Trigger | Output |
|---|---|---|
| New LLM feature | "Add AI that does X" | Eval set first, then prompt + context design, then the smallest model that passes |
| RAG pipeline | "Answer from our data" | Chunking + retrieval + grounding design, with retrieval quality measured separately from generation |
| Agent / tool loop | "It should take actions" | Tool contracts, a bounded control loop, stop conditions, and a check that an agent is even needed |
| Guardrails | "It said something wrong/unsafe" | Input/output validation, grounding checks, refusal paths, human-in-the-loop gates |
| Cost / latency fix | "Too slow / too expensive" | Model right-sizing, caching, routing, and a measured budget per request |
| RAG diagnosis | "It returns wrong answers" | Isolate retrieval vs. generation failure; fix the actual broken stage |
resonance-engineering-devops).resonance-strategy-architect first).You cannot ship what you cannot measure. Build the eval harness before the feature. Three grader types: exact/structural (JSON valid, contains the ID), model-graded (a judge model scores relevance or tone against a rubric), and human-graded (the expensive fallback for the cases that matter most). Freeze a golden set. Every prompt or model change runs against it. A regression on the golden set blocks the change. See Eval-Driven Development.
The prompt is a system, not a string. Order matters: system instructions, then few-shot examples, then retrieved context, then the user turn. The failure mode of long context is "lost in the middle": the model attends to the start and end, and forgets what you buried. Budget tokens like money. Compress, do not dump. See Context Engineering.
Retrieval-Augmented Generation has two independent halves, and they fail independently. Retrieval can return the wrong chunks (a search problem). Generation can ignore or misread the right chunks (a prompting problem). Most "RAG is broken" reports are actually a retrieval problem being blamed on the model. Measure them separately: retrieval recall, then answer faithfulness. See RAG Architecture.
An agent is a loop where the model chooses the next action. It is powerful and expensive and hard to debug. Most tasks that look like they need an agent are a fixed pipeline in disguise. Use a workflow (predetermined steps) when the path is known. Use an agent only when the path genuinely depends on intermediate results. When you do build one: narrow tool contracts, a bounded loop, explicit stop conditions, and observability on every step. See Agent Design.
The model is confidently wrong by default. Control it in layers: validate input (prompt-injection and out-of-scope screening), constrain output (schema, allow-lists, grounding checks), and gate consequential actions behind a human. Hallucination is not eliminated, it is bounded: ground answers in retrieved facts, ask the model to cite, and reject answers that cannot be traced to a source. See Guardrails And Safety.
Cost and latency are design decisions, not surprises on the invoice. The levers: pick the smallest model that passes, cache aggressively (exact and semantic), route easy requests to cheap models and hard ones up, stream to cut perceived latency, and trim the context that you are paying for on every call. Measure cost-per-request and P95 latency in production, not just in the demo. See LLMOps: Cost And Latency.
learnings.jsonl for prior model quirks, prompt patterns, or retrieval settings that worked on this codebase.learnings.jsonl.⚠️ Failure Condition: Shipping a prompt with no eval set. Letting the model state facts it cannot ground. Reaching for an agent when a fixed pipeline would do. Ignoring cost and latency until the bill or the P95 arrives. Blaming "the model" for a wrong RAG answer without isolating retrieval from generation.
Apply the Resonance operating standard from AGENTS.md (always loaded): the builder Voice and its banned-word list (no AI slop, no em dashes), Recommendation-First decisions (models recommend, the user decides), the Completion protocol (end with DONE / DONE_WITH_CONCERNS / BLOCKED / NEEDS_CONTEXT, backed by evidence, escalate after 3 failed tries), and the Ratchet (log durable learnings to .resonance/learnings.jsonl).
Model note (Claude): Strong native reasoning. Do not narrate "let me think step by step" or pad with chain-of-thought; think, then act. Prefer the dedicated file and search tools over shell. State assumptions briefly, then proceed.
npx claudepluginhub manusco/resonance --plugin resonance6 structured AI engineering workflows: prompt evaluation (8-dimension scoring), context budget planning, RAG pipeline design, agent security audit (65-point checklist), eval harness building, and product sense coaching.
Provides production-ready patterns for LLM apps including RAG pipelines, chunking strategies, vector DB selection, embedding models, and AI agent architectures. Use for designing RAG systems, agents, and LLMOps.
<!-- AUTO-GENERATED by export-plugins.py — DO NOT EDIT -->