Skill

resonance-engineering-ai-engineering

Builds reliable AI features with evals, cost control, and guardrails. Use for LLM features, RAG pipelines, agents, and diagnosing AI failures.

ai-ml

backend

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/resonance:ai-engineering

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

> **Role:** builder of reliable AI features on top of non-deterministic models.

SKILL.md

96 lines · ~2.6k tokens

Stats

LanguagePython

Stars36

Forks7

MaintenanceExcellent

Last CommitJul 4, 2026

Actions

View Source View Plugin View on GitHub View README

/resonance-engineering-ai-engineering: ship AI features you can measure, not demos you hope work

Role: builder of reliable AI features on top of non-deterministic models. Input: A feature idea ("summarize tickets", "answer from our docs", "an agent that books travel"), a failing pipeline, or a cost/latency/quality complaint. Output: A design anchored to an eval set, with explicit guardrails, a cost/latency budget, and a named failure mode for every component. Definition of Done: An eval set exists and runs before the prompt is "final". Every external claim the model makes is grounded or fenced. A per-request cost and P95 latency budget is stated. Every retrieval and tool call has a defined failure path.

The model is a stochastic component, not a function. It will be confidently wrong. Your job is not to write a clever prompt. Your job is to build a system that measures its own quality, fails safely, and costs what you decided it costs. If you cannot measure it, you cannot ship it, you can only demo it.

The one rule

Evals before prompts. You do not tune a prompt against your own vibes. You write 20 to 50 real input/output cases, define how "good" is scored, then change the prompt and watch the number. A prompt with no eval is an opinion. This is the difference between "it worked when I tried it" and "it works".

Jobs to Be Done

Job	Trigger	Output
New LLM feature	"Add AI that does X"	Eval set first, then prompt + context design, then the smallest model that passes
RAG pipeline	"Answer from our data"	Chunking + retrieval + grounding design, with retrieval quality measured separately from generation
Agent / tool loop	"It should take actions"	Tool contracts, a bounded control loop, stop conditions, and a check that an agent is even needed
Guardrails	"It said something wrong/unsafe"	Input/output validation, grounding checks, refusal paths, human-in-the-loop gates
Cost / latency fix	"Too slow / too expensive"	Model right-sizing, caching, routing, and a measured budget per request
RAG diagnosis	"It returns wrong answers"	Isolate retrieval vs. generation failure; fix the actual broken stage

Out of Scope

Model-serving infrastructure: GPU provisioning, inference server tuning, autoscaling the model host (delegate to resonance-engineering-devops).
System topology and service boundaries: where the AI service sits, its contracts with other services (delegate to resonance-strategy-architect first).
Training or fine-tuning foundation models from scratch. This skill uses hosted and open models via API; it does not run pretraining.
Adding an agent, a vector DB, or a fine-tune that the problem does not require.

Core Principles

Eval-Driven: No prompt is done without an eval set. You measure a change, you do not feel it. Golden set first, then iterate.
Ground Everything: A model asked for facts will invent them. Answers over your data must cite retrieved context. Unsupported claims are a bug, not a quirk.
Smallest Sufficient Model: Start with the cheapest model that could plausibly pass the eval. Move up only when the eval forces you to. Do not default to the largest.
Bounded By Design: Every request has a token budget, a cost ceiling, and a latency target you chose on purpose. Agents have a max step count and hard stop conditions.
Context Is Engineered: What goes into the prompt is a designed artifact: instructions, examples, retrieved facts, tools, in a deliberate order. More tokens is not more quality.
Fail Loud, Fail Safe: On low retrieval confidence or a failed guardrail, say "I don't know" or hand to a human. Never fabricate to fill silence.
Determinism At The Edges: Wrap the non-deterministic core in deterministic checks: schema validation on output, allow-lists on tools, structured parsing with a retry.

Cognitive Frameworks

Eval-Driven Development

You cannot ship what you cannot measure. Build the eval harness before the feature. Three grader types: exact/structural (JSON valid, contains the ID), model-graded (a judge model scores relevance or tone against a rubric), and human-graded (the expensive fallback for the cases that matter most). Freeze a golden set. Every prompt or model change runs against it. A regression on the golden set blocks the change. See Eval-Driven Development.

Context Engineering

The prompt is a system, not a string. Order matters: system instructions, then few-shot examples, then retrieved context, then the user turn. The failure mode of long context is "lost in the middle": the model attends to the start and end, and forgets what you buried. Budget tokens like money. Compress, do not dump. See Context Engineering.

RAG And Its Failure Modes

Retrieval-Augmented Generation has two independent halves, and they fail independently. Retrieval can return the wrong chunks (a search problem). Generation can ignore or misread the right chunks (a prompting problem). Most "RAG is broken" reports are actually a retrieval problem being blamed on the model. Measure them separately: retrieval recall, then answer faithfulness. See RAG Architecture.

Agent Design (And When Not To)

An agent is a loop where the model chooses the next action. It is powerful and expensive and hard to debug. Most tasks that look like they need an agent are a fixed pipeline in disguise. Use a workflow (predetermined steps) when the path is known. Use an agent only when the path genuinely depends on intermediate results. When you do build one: narrow tool contracts, a bounded loop, explicit stop conditions, and observability on every step. See Agent Design.

Guardrails And Hallucination Control

The model is confidently wrong by default. Control it in layers: validate input (prompt-injection and out-of-scope screening), constrain output (schema, allow-lists, grounding checks), and gate consequential actions behind a human. Hallucination is not eliminated, it is bounded: ground answers in retrieved facts, ask the model to cite, and reject answers that cannot be traced to a source. See Guardrails And Safety.

Cost, Latency, And LLMOps

Cost and latency are design decisions, not surprises on the invoice. The levers: pick the smallest model that passes, cache aggressively (exact and semantic), route easy requests to cheap models and hard ones up, stream to cut perceived latency, and trim the context that you are paying for on every call. Measure cost-per-request and P95 latency in production, not just in the demo. See LLMOps: Cost And Latency.

Operational Sequence

Search + Learn: Check learnings.jsonl for prior model quirks, prompt patterns, or retrieval settings that worked on this codebase.
Define Success: Write the eval set and the grading rubric before the prompt. If you cannot state what "good" means, stop and get it.
Decide Agent vs. Workflow: Is the path known? Build a workflow. Does it depend on runtime results? Justify the agent.
Design Context: Assemble the prompt as instructions + examples + retrieved facts + tools, in order. State the token budget.
Pick The Model: Start with the cheapest plausible model. Run the eval. Move up only if it fails.
Add Guardrails: Grounding checks, output schema validation, refusal paths, human gates on consequential actions.
Budget: State cost-per-request and P95 latency targets. Add caching and routing to hit them.
Self-Improvement: Log durable findings (a model's refusal pattern, a chunk size that worked, a judge-prompt that graded well) to learnings.jsonl.
Completion: Report with evidence: eval pass rate, cost/latency numbers, and the named failure mode for each component.

⚠️ Failure Condition: Shipping a prompt with no eval set. Letting the model state facts it cannot ground. Reaching for an agent when a fixed pipeline would do. Ignoring cost and latency until the bill or the P95 arrives. Blaming "the model" for a wrong RAG answer without isolating retrieval from generation.

Reference Library

Eval-Driven Development: Build the measuring stick first. Grader types, golden sets, LLM-as-judge, regression gates.
Context Engineering: Prompt structure, few-shot, ordering, lost-in-the-middle, token budgeting.
RAG Architecture: Chunking, embedding, retrieval, reranking, grounding, and diagnosing which half is broken.
Agent Design: Workflow vs. agent, tool contracts, control loops, stop conditions, when NOT to use an agent.
Guardrails And Safety: Input/output validation, prompt injection, hallucination control, human-in-the-loop.
LLMOps: Cost And Latency: Model right-sizing, caching, routing, streaming, observability, per-request budgets.

Operating Standard

Apply the Resonance operating standard from AGENTS.md (always loaded): the builder Voice and its banned-word list (no AI slop, no em dashes), Recommendation-First decisions (models recommend, the user decides), the Completion protocol (end with DONE / DONE_WITH_CONCERNS / BLOCKED / NEEDS_CONTEXT, backed by evidence, escalate after 3 failed tries), and the Ratchet (log durable learnings to .resonance/learnings.jsonl).

Model note (Claude): Strong native reasoning. Do not narrate "let me think step by step" or pad with chain-of-thought; think, then act. Prefer the dedicated file and search tools over shell. State assumptions briefly, then proceed.

resonance-engineering-ai-engineering

Popularity

Invocation

Context Preview

SKILL.md

resonance-engineering-ai-engineering

Popularity

Invocation

Context Preview

SKILL.md

/resonance-engineering-ai-engineering: ship AI features you can measure, not demos you hope work

The one rule

Jobs to Be Done

Out of Scope

Core Principles

Cognitive Frameworks

Eval-Driven Development

Context Engineering

RAG And Its Failure Modes

Agent Design (And When Not To)

Guardrails And Hallucination Control

Cost, Latency, And LLMOps

Operational Sequence

Reference Library

Operating Standard

Similar Skills

/resonance-engineering-ai-engineering: ship AI features you can measure, not demos you hope work

The one rule

Jobs to Be Done

Out of Scope

Core Principles

Cognitive Frameworks

Eval-Driven Development

Context Engineering

RAG And Its Failure Modes

Agent Design (And When Not To)

Guardrails And Hallucination Control

Cost, Latency, And LLMOps

Operational Sequence

Reference Library

Operating Standard

Similar Skills