Langfuse LLM observability operator, CLI guide, and RAG evaluation analyst. Use when the user asks to inspect traces, observations, sessions, prompts, datasets, scores, costs, latency, exceptions, agent tracing, OpenTelemetry integration, Langfuse CLI usage, or RAG evaluation results. Default to the official langfuse-cli through scripts/lf_cli.py for generic API discovery and operations; use bundled Python scripts for curated trace trees, reports, prompt extraction, chat export, and existing RAG-result interpretation. Do not run legacy FlowWise RAG execution scripts until their project-local config dependencies are ported.
Slash command: /langfuse:langfuse
Files:
- README.md
- references/agent-tracing.md
- references/api_endpoints.md
- references/cli.md
- references/concepts.md
- references/error-analysis.md
- references/evaluators.md
- references/integrations.md
- references/judge-calibration.md
- references/prompt_migration.md
- references/rag-eval-interpretation.md
- references/rag-flowwise-port-plan.md
- references/sdk-upgrade.md
- references/setup.md
- references/tool_reference.md
- references/trace-nesting-validation-links.md
- references/user-feedback.md
- scripts/extract_graph_prompts.py
- scripts/lf_api.py
- scripts/lf_cli.py

Operator, guide, and RAG evaluation analyst for Langfuse projects. Four modes:
- langfuse-cli via scripts/lf_cli.py for current API coverage.
- references/.

- scripts/lf_cli.py; it delegates to the official langfuse-cli, which tracks Langfuse's OpenAPI surface.
- ~/.skills/langfuse/credentials.json; setup runs outside the model conversation.
- .env runtime reads. Runtime scripts must use the local credential loader or explicit non-secret project config. Project-local .env imports are blocked.

Before first operation, verify connectivity:
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_cli.py" --dry-run api __schema
python "${CLAUDE_SKILL_DIR}/scripts/lf_client.py" --action health
python "${CLAUDE_SKILL_DIR}/scripts/lf_client.py" --action list-profiles
```
Credentials live at ~/.skills/langfuse/credentials.json (mode 600, in a mode-700 directory), structured as named profiles. This follows R50 v2.0.0 of the credential storage convention. There is no .env discovery and no --env flag.
First-run setup runs OUTSIDE the model conversation (secrets never enter the transcript):
```bash
python3 "${CLAUDE_SKILL_DIR}/scripts/setup_credentials.py" [PROFILE]
```
The setup script prompts for keys via getpass and writes atomically with mode 600. It does not discover, parse, or migrate project .env files.
File shape — ~/.skills/langfuse/credentials.json:
```json
{
  "default": { "secret_key": "sk-lf-...", "public_key": "pk-lf-...", "host": "https://cloud.langfuse.com", "project_name": "Default" },
  "iurfriend": { "secret_key": "sk-lf-...", "public_key": "pk-lf-...", "host": "https://cloud.langfuse.com", "project_name": "iurFriend" },
  "simulator": { "secret_key": "sk-lf-...", "public_key": "pk-lf-...", "host": "https://cloud.langfuse.com", "project_name": "Simulator" }
}
```
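The setup script itself is not shown in this section. As a minimal sketch of what "writes atomically with mode 600" could look like on a POSIX filesystem, the hypothetical helper `write_credentials` below writes to a temporary file in the same directory and renames it into place. The function name and the temporary-file naming are illustrative assumptions, not taken from setup_credentials.py.

```python
import json
import os
import tempfile
from pathlib import Path


def write_credentials(path: Path, profiles: dict) -> None:
    """Write profiles as JSON to `path` atomically, readable by the owner only."""
    path.parent.mkdir(parents=True, exist_ok=True)
    os.chmod(path.parent, 0o700)  # set explicitly; mkdir's mode is subject to umask
    # mkstemp creates the file with mode 0o600 in the same directory, so the
    # final rename stays on one filesystem and is atomic on POSIX.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".credentials-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(profiles, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # readers see either the old file or the new one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Writing next to the target rather than in the system temp directory is what keeps `os.replace` a same-filesystem rename.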
Profile selection — five-tier precedence (tier 4 skipped, langfuse is not host-aware):
- --profile NAME flag on any script
- LANGFUSE_PROFILE=NAME environment variable
- <project>/.skills/langfuse.profile (one-line, safe to commit): the loader walks up from cwd and stops at the first project-root marker (.git, pyproject.toml, package.json, Cargo.toml, MANIFEST.md); it never goes above $HOME
- default profile

Per-key env-var overrides stack on top of the resolved profile (useful in CI / emergency rotations):
```bash
LANGFUSE_SECRET_KEY=sk-...
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_BASE_URL=https://...
LANGFUSE_HOST=https://...   # legacy alias; LANGFUSE_BASE_URL wins if both are set
LANGFUSE_PROJECT_NAME=...
```
All scripts accept --profile <name>. No script reads .env files at runtime. The local loader lives at utilities/credentials.py (this plugin only — it does NOT import from any sibling plugin per the standalone-skill principle).
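The precedence rules above can be sketched as follows. This is an illustration only, not the code in utilities/credentials.py: the function names are hypothetical, and it assumes the profile file is read from the first directory that contains a project-root marker.

```python
from pathlib import Path
from typing import Mapping, Optional

ROOT_MARKERS = (".git", "pyproject.toml", "package.json", "Cargo.toml", "MANIFEST.md")


def find_project_profile(start: Path, home: Path) -> Optional[str]:
    """Walk up from `start` to the first project root and read its profile file."""
    start, home = start.resolve(), home.resolve()
    for directory in (start, *start.parents):
        if any((directory / marker).exists() for marker in ROOT_MARKERS):
            profile_file = directory / ".skills" / "langfuse.profile"
            if profile_file.is_file():
                return profile_file.read_text(encoding="utf-8").strip() or None
            return None  # first project root reached; do not look further up
        if directory == home:
            return None  # never search above $HOME
    return None


def resolve_profile_name(flag: Optional[str], env: Mapping[str, str], cwd: Path, home: Path) -> str:
    """Flag beats environment variable, which beats the project file, which beats 'default'."""
    return (
        flag
        or env.get("LANGFUSE_PROFILE")
        or find_project_profile(cwd, home)
        or "default"
    )


def apply_env_overrides(profile: Mapping[str, str], env: Mapping[str, str]) -> dict:
    """Layer per-key environment overrides on top of the resolved profile."""
    resolved = dict(profile)
    # LANGFUSE_HOST is applied before LANGFUSE_BASE_URL so the latter wins if both are set.
    for variable, key in (
        ("LANGFUSE_SECRET_KEY", "secret_key"),
        ("LANGFUSE_PUBLIC_KEY", "public_key"),
        ("LANGFUSE_HOST", "host"),
        ("LANGFUSE_BASE_URL", "host"),
        ("LANGFUSE_PROJECT_NAME", "project_name"),
    ):
        if env.get(variable):
            resolved[key] = env[variable]
    return resolved
```

Passing `env`, `cwd`, and `home` explicitly instead of reading `os.environ` inside the functions keeps the resolution order easy to test.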
Use scripts/lf_cli.py first when the user asks for a Langfuse API operation that is not already a curated workflow below.
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_cli.py" api __schema
python "${CLAUDE_SKILL_DIR}/scripts/lf_cli.py" api <resource> --help
python "${CLAUDE_SKILL_DIR}/scripts/lf_cli.py" --profile simulator api <resource> <action> [flags]
```
The wrapper resolves the selected profile, exports LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_BASE_URL to the subprocess, and never prints credential values. Prefer --dry-run before first use in a project:
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_cli.py" --profile simulator --dry-run api __schema
```
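To make the wrapper's contract concrete, here is a minimal sketch of passing credentials to a subprocess through its environment while keeping them out of any printed text. The helper names are hypothetical, and the dry-run behavior shown (print a redacted description and exit) is an assumption; lf_cli.py may behave differently.

```python
import os
import subprocess
from typing import Mapping, Sequence


def build_cli_env(profile: Mapping[str, str], base_env: Mapping[str, str]) -> dict:
    """Copy the parent environment and add the three variables the CLI reads."""
    env = dict(base_env)
    env["LANGFUSE_PUBLIC_KEY"] = profile["public_key"]
    env["LANGFUSE_SECRET_KEY"] = profile["secret_key"]
    env["LANGFUSE_BASE_URL"] = profile["host"]
    return env


def describe_invocation(command: Sequence[str], profile: Mapping[str, str]) -> str:
    """Dry-run text: shows the command and host, never the key values."""
    return f"would run: {' '.join(command)} (host={profile['host']}, keys=<redacted>)"


def run_cli(command: Sequence[str], profile: Mapping[str, str], dry_run: bool = False) -> int:
    """Run the command with credentials in its environment only, or describe it."""
    if dry_run:
        print(describe_invocation(command, profile))
        return 0
    completed = subprocess.run(list(command), env=build_cli_env(profile, os.environ), check=False)
    return completed.returncode
```

Because the keys travel only in the `env` argument, they never appear in the argument list, where process listings could expose them.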
| Intent | Script | Key actions |
|---|---|---|
| Use current official API | lf_cli.py | Delegates to langfuse-cli api ... |
| Explore traces | lf_traces.py | list, get, tree, search, stats |
| Inspect observations | lf_observations.py | list, get (supports --type GENERATION\|SPAN\|EVENT) |
| Browse sessions | lf_sessions.py | list, get, users, timeline |
| Manage datasets | lf_datasets.py | list, create, items, add-item, add-items-bulk, export, runs |
| Manage prompts | lf_prompts.py | list, get, create, from-file, from-trace, update-labels, diff, history |
| Extract chat logs | lf_extract_chat.py | session, trace, batch → JSONL + Markdown |
| Manage scores | lf_scores.py | list, create, bulk-create, delete, analyze, export |
| Run reports | lf_report.py | overview, cost, latency, quality, full |
| Run RAG evaluations | quarantined pending port | FlowWise RAG execution scripts are project-specific until ported off core.config / .env; analyze existing result files instead |
| Find exceptions | lf_exceptions.py | find, file, details, count |
| Look up schema | lf_schema.py | list, show, fields, hierarchy, endpoints |
| Raw API call fallback | lf_api.py | Any METHOD /path --params '{}' --body '{}'; use only when CLI is unavailable |
| Question type | Reference file |
|---|---|
| "What is a trace/span/generation?" | references/concepts.md |
| "How do evaluations work?" | references/evaluators.md |
| "How to integrate with Python/JS SDK?" | references/integrations.md |
| "What API endpoints exist?" | references/api_endpoints.md |
| Setup, credentials, troubleshooting | references/setup.md |
| Official CLI usage and API discovery | references/cli.md |
| "How should development projects or agents trace to Langfuse?" | references/agent-tracing.md |
| Error/exception triage | references/error-analysis.md |
| Evaluator and judge calibration | references/judge-calibration.md |
| SDK v4 / OpenTelemetry upgrade planning | references/sdk-upgrade.md |
| User feedback ingestion | references/user-feedback.md |
| Full parameter reference for all tools | references/tool_reference.md |
| "How to migrate prompts to Langfuse?" | references/prompt_migration.md |
| "Full FlowWise prompt lifecycle (capture → promote)?" | FlowWise Prompt Lifecycle section below |
| "Why are traces nesting unexpectedly?" | references/trace-nesting-validation-links.md |
| "How to interpret RAG evaluation scores?" | references/rag-eval-interpretation.md |
| "What is needed to re-enable FlowWise RAG runners?" | references/rag-flowwise-port-plan.md |
| Object fields and relationships | python "${CLAUDE_SKILL_DIR}/scripts/lf_schema.py" show <type> |
When the user asks to analyze RAG results, interpret evaluation scores, diagnose content gaps, or compare experiment runs, read references/rag-eval-interpretation.md and follow its structured analysis protocol. The guide covers score semantics, diagnostic patterns, category analysis, run comparison, root cause reasoning, and action recommendations.
Quick workflow:
- data/rag-eval/results/<run-name>.json for local results, or use lf_datasets.py --profile simulator runs rag-eval-baseline-v01 to list Langfuse runs
- data/rag-eval/rag-eval-baseline-v01.json for item category/variant metadata, join by question text

```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_client.py" --action list-profiles
# Then either:
# - Pass --profile NAME on each command, OR
# - Set LANGFUSE_PROFILE=NAME in the session env, OR
# - Drop a one-line ".skills/langfuse.profile" at the project root (auto-routes per-project).
python "${CLAUDE_SKILL_DIR}/scripts/lf_cli.py" api __schema   # discover current API surface
```
```bash
# Traces
python "${CLAUDE_SKILL_DIR}/scripts/lf_traces.py" list --limit 20 --age 60          # last hour
python "${CLAUDE_SKILL_DIR}/scripts/lf_traces.py" list --name "agent-run" --user "user-123"
python "${CLAUDE_SKILL_DIR}/scripts/lf_traces.py" get TRACE_ID --compact            # truncated output
python "${CLAUDE_SKILL_DIR}/scripts/lf_traces.py" tree TRACE_ID                     # observation hierarchy
python "${CLAUDE_SKILL_DIR}/scripts/lf_traces.py" search "error" --age 1440         # text search

# Observations
python "${CLAUDE_SKILL_DIR}/scripts/lf_observations.py" list --type GENERATION --age 60
python "${CLAUDE_SKILL_DIR}/scripts/lf_observations.py" list --trace-id TRACE_ID
python "${CLAUDE_SKILL_DIR}/scripts/lf_observations.py" get OBS_ID --compact

# Sessions
python "${CLAUDE_SKILL_DIR}/scripts/lf_sessions.py" list --age 1440
python "${CLAUDE_SKILL_DIR}/scripts/lf_sessions.py" get SESSION_ID                  # traces in session
python "${CLAUDE_SKILL_DIR}/scripts/lf_sessions.py" timeline SESSION_ID             # chronological view
python "${CLAUDE_SKILL_DIR}/scripts/lf_sessions.py" users --age 1440                # group by user
```
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_datasets.py" create --name "eval-v1" --description "Core eval set"
python "${CLAUDE_SKILL_DIR}/scripts/lf_datasets.py" add-item "eval-v1" --input '{"q":"What is AI?"}' --expected '{"a":"..."}'
python "${CLAUDE_SKILL_DIR}/scripts/lf_datasets.py" add-items-bulk "eval-v1" data.json   # JSON or CSV
python "${CLAUDE_SKILL_DIR}/scripts/lf_datasets.py" items "eval-v1"
python "${CLAUDE_SKILL_DIR}/scripts/lf_datasets.py" export "eval-v1" --format json
python "${CLAUDE_SKILL_DIR}/scripts/lf_datasets.py" runs "eval-v1"
python "${CLAUDE_SKILL_DIR}/scripts/lf_datasets.py" run-items "eval-v1" "run-name"
```
Bulk import formats: JSON array of {input, expectedOutput, metadata}, or CSV with input/expected_output columns (or input_*/expected_* prefixed columns).
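As an illustration of how the CSV shape could map onto the JSON shape, the sketch below converts CSV rows into {input, expectedOutput} items. It is not the logic in lf_datasets.py: the helper names are hypothetical, and both the grouping of prefixed columns into objects and the attempt to parse cells as JSON are assumptions.

```python
import csv
import io
import json


def _parse_cell(text: str):
    """Parse a cell as JSON when possible, otherwise keep the raw string."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def csv_to_items(csv_text: str) -> list:
    """Convert CSV rows to dataset items with `input` and `expectedOutput` keys."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = {}
        if "input" in row:
            item["input"] = _parse_cell(row["input"])
        else:
            # Assumption: input_* columns are collected into one object.
            item["input"] = {k[len("input_"):]: v for k, v in row.items() if k.startswith("input_")}
        if "expected_output" in row:
            item["expectedOutput"] = _parse_cell(row["expected_output"])
        else:
            # Assumption: expected_* columns are collected into one object.
            item["expectedOutput"] = {
                k[len("expected_"):]: v for k, v in row.items() if k.startswith("expected_")
            }
        items.append(item)
    return items
```

Note that with JSON parsing enabled a cell such as `42` becomes a number rather than a string, which may or may not be what a given dataset needs.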
This skill is the canonical tool for managing FlowWise prompt versions in Langfuse Prompt Management. Use the five-step workflow below together with references/prompt_migration.md; there is no separate packaged references/prompt-lifecycle.md file.
Quick reference:
```bash
# Step 1: Capture baseline from a production trace
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" from-trace TRACE_ID --upload --prefix "salesbot-v2" --labels "baseline"
# Step 2: Create fix version (prompt text from the approved tuning workflow)
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" create --name "salesbot-v2-interviewer" --prompt "..." --labels "fix-b2,candidate"
# Step 3: Test via regression (see regression-runner skill)
uv run regression_runner.py --dataset <name> --prompt-name "salesbot-v2-interviewer" --prompt-label candidate
# Step 4: Compare versions
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" diff "salesbot-v2-interviewer" --v1 1 --v2 2
# Step 5: Promote verified version
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" update-labels "salesbot-v2-interviewer" --version 2 --add "production,latest"
```
Label conventions: production, latest, baseline, candidate, fix-<code> (e.g. fix-b2), deprecated.
For generic prompt migration, read references/prompt_migration.md. FlowWise-specific SimHuman lifecycle details (DNA catalog → simulation-tuner → apply_sim_tuning.py) are project-local and must not be assumed to ship with this plugin unless they are explicitly packaged in the consuming project.
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" list
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" get "my-prompt" --label production
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" create --name "qa" --prompt "Rate: {{answer}}"
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" from-file --name "system" prompt.txt
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" from-trace TRACE_ID                     # discover system prompts
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" from-trace TRACE_ID --upload --prefix "mybot"
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" from-trace --session SID --strategy max_coverage --upload
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" update-labels "my-prompt" --version 3 --add production
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" diff "my-prompt" --v1 1 --v2 3
python "${CLAUDE_SKILL_DIR}/scripts/lf_prompts.py" history "my-prompt"
```
Prompt types: text (string with {{vars}}), chat (JSON array of {role, content}). Auto-detected on create.
from-trace extracts system prompts from generation observations. Strategies for --session: latest (most recent trace), first_last, max_coverage (all traces, keep longest per node), all.
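The four session strategies can be pictured with a small sketch. It assumes each trace has already been reduced to a mapping from node name to system prompt text, with traces ordered oldest first; that data shape and the function name `select_prompts` are hypothetical and not how lf_prompts.py necessarily represents them.

```python
from typing import Dict, List

TracePrompts = Dict[str, str]  # node name -> system prompt text found in one trace


def select_prompts(traces: List[TracePrompts], strategy: str) -> List[TracePrompts]:
    """Pick which traces' prompts to keep; `traces` is ordered oldest first."""
    if not traces:
        return []
    if strategy == "latest":
        return [traces[-1]]
    if strategy == "first_last":
        return [traces[0]] if len(traces) == 1 else [traces[0], traces[-1]]
    if strategy == "all":
        return list(traces)
    if strategy == "max_coverage":
        # Scan every trace and keep the longest prompt seen for each node.
        longest: TracePrompts = {}
        for prompts in traces:
            for node, text in prompts.items():
                if len(text) > len(longest.get(node, "")):
                    longest[node] = text
        return [longest]
    raise ValueError(f"unknown strategy: {strategy}")
```

In this reading, max_coverage is the only strategy that merges across traces, so it can surface a node that appears in only one trace of the session.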
Runtime status: execution scripts in this section are blocked until ported.
scripts/lf_rag_eval.py, scripts/lf_rag_question_gen.py, and
scripts/lf_rag_retrieval_audit.py currently import project-local
core.config and load .env. That violates the runtime credential contract for
an installed skill. Do not run these scripts from the shipped plugin until they
are ported to the local credential loader and explicit non-secret project
configuration.
Important: RAG datasets and experiment runs live in the Simulator Langfuse project (--profile simulator), not FlowWise. This avoids polluting production traces.
Run an evaluation after the port is complete:
```bash
# Full run (all evaluators: semantic similarity + LLM judge + compliance)
uv run python "${CLAUDE_SKILL_DIR}/scripts/lf_rag_eval.py" --dataset-name rag-eval-baseline-v01 --run-name my-run
# Fast run (skip LLM judge, Tier 1 only, cheaper)
uv run python "${CLAUDE_SKILL_DIR}/scripts/lf_rag_eval.py" --dataset-name rag-eval-baseline-v01 --run-name my-fast-run --skip-llm-judge
# Test with small sample
uv run python "${CLAUDE_SKILL_DIR}/scripts/lf_rag_eval.py" --dataset-name rag-eval-baseline-v01 --run-name test-3 --limit 3
```
Access results:
```bash
# Local results (includes bot responses, judge reasoning)
# data/rag-eval/results/<run-name>.json
# List runs in Langfuse
python "${CLAUDE_SKILL_DIR}/scripts/lf_datasets.py" --profile simulator runs rag-eval-baseline-v01
# Get dataset item metadata (categories, variants)
# data/rag-eval/rag-eval-baseline-v01.json
```
Analyze existing results: Read references/rag-eval-interpretation.md for the full interpretation framework — score semantics, diagnostic patterns, category analysis protocol, run comparison methodology, root cause reasoning, and action recommendations. The guide teaches you how to reason about the scores, not just compute them.
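A category-level pass over an existing results file might look like the sketch below, which joins results to dataset items by question text as the quick workflow describes. The layout of both JSON files is not documented in this section, so the keys used here (`question`, `category`, `scores`) and the function name are hypothetical; adapt them to the actual files.

```python
from collections import defaultdict
from statistics import mean


def scores_by_category(results, dataset_items, score_name):
    """Average one score per category, joining results to items by question text."""
    # Hypothetical keys: each dataset item has "question" and "category".
    category_of = {item["question"]: item["category"] for item in dataset_items}
    buckets = defaultdict(list)
    for result in results:
        # Hypothetical keys: each result has "question" and a "scores" mapping.
        category = category_of.get(result["question"], "uncategorized")
        if score_name in result["scores"]:
            buckets[category].append(result["scores"][score_name])
    return {category: round(mean(values), 3) for category, values in buckets.items()}
```

Results whose question text has no match in the dataset fall into an "uncategorized" bucket, which makes join failures visible instead of silently dropping rows.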
Porting plan: Read references/rag-flowwise-port-plan.md before re-enabling any FlowWise RAG runner.
Evaluator tiers:
- Tier 1: semantic_similarity (embedding cosine), response_quality (heuristics), no_marketing (compliance regex)
- Tier 2 (skipped with --skip-llm-judge): llm_relevance, llm_completeness, llm_accuracy (LLM-as-judge via OpenRouter)
- Aggregates: avg_semantic_similarity, avg_llm_quality, marketing_violation_rate

```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_scores.py" list --name "accuracy" --age 1440
python "${CLAUDE_SKILL_DIR}/scripts/lf_scores.py" create --trace ID --name "accuracy" --value 0.85
python "${CLAUDE_SKILL_DIR}/scripts/lf_scores.py" bulk-create scores.json
python "${CLAUDE_SKILL_DIR}/scripts/lf_scores.py" analyze --name "accuracy"
python "${CLAUDE_SKILL_DIR}/scripts/lf_scores.py" export --format csv --output scores.csv
python "${CLAUDE_SKILL_DIR}/scripts/lf_scores.py" delete SCORE_ID
```
Score types: NUMERIC (float), CATEGORICAL (string), BOOLEAN (0/1). Sources: API (code), ANNOTATION (human), EVAL (automated evaluator).
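A small pre-flight check can catch a value that does not match its score type before anything is sent. The sketch below is a hypothetical helper written for illustration; lf_scores.py and the Langfuse API may validate differently.

```python
def validate_score(data_type: str, value):
    """Return a normalized value for the given score type or raise ValueError."""
    if data_type == "NUMERIC":
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("NUMERIC scores need a number")
        return float(value)
    if data_type == "CATEGORICAL":
        if not isinstance(value, str) or not value:
            raise ValueError("CATEGORICAL scores need a non-empty string")
        return value
    if data_type == "BOOLEAN":
        if isinstance(value, (bool, int)) and value in (0, 1):
            return int(value)
        raise ValueError("BOOLEAN scores need 0 or 1")
    raise ValueError(f"unknown score type: {data_type}")
```

For example, `validate_score("BOOLEAN", True)` normalizes to `1`, while `validate_score("BOOLEAN", 2)` raises `ValueError`.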
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_report.py" overview                          # project snapshot
python "${CLAUDE_SKILL_DIR}/scripts/lf_report.py" cost --age 1440                   # cost by trace name
python "${CLAUDE_SKILL_DIR}/scripts/lf_report.py" latency --age 1440 --name "agent-run"
python "${CLAUDE_SKILL_DIR}/scripts/lf_report.py" quality --score-name "accuracy"
python "${CLAUDE_SKILL_DIR}/scripts/lf_report.py" full --age 1440 --output report.json
```
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_exceptions.py" find --age 1440 --group-by type
python "${CLAUDE_SKILL_DIR}/scripts/lf_exceptions.py" find --group-by filepath
python "${CLAUDE_SKILL_DIR}/scripts/lf_exceptions.py" file "src/agent.py" --age 1440
python "${CLAUDE_SKILL_DIR}/scripts/lf_exceptions.py" details TRACE_ID
python "${CLAUDE_SKILL_DIR}/scripts/lf_exceptions.py" count --age 60
```
Extract session conversations into simplified JSONL (machine-readable) + Markdown (human-readable):
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_extract_chat.py" session SESSION_ID                  # single session
python "${CLAUDE_SKILL_DIR}/scripts/lf_extract_chat.py" trace TRACE_ID                      # single trace
python "${CLAUDE_SKILL_DIR}/scripts/lf_extract_chat.py" batch sessions.txt                  # batch from file
python "${CLAUDE_SKILL_DIR}/scripts/lf_extract_chat.py" session SID --output-dir ./chats    # custom output dir
```
Output per session: {session_id}.jsonl with {session_id, source, conversation.turns[], metadata} and {session_id}.md with numbered turns in blockquotes.
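To show how one session record could yield both outputs, here is a minimal sketch. The record keys follow the shape listed above, but the per-turn keys (`role`, `content`), the exact Markdown layout, and the function name are assumptions rather than the output format of lf_extract_chat.py.

```python
import json


def render_session(record: dict):
    """Return (jsonl_line, markdown) for one session record."""
    jsonl_line = json.dumps(record, ensure_ascii=False)
    lines = [f"# Session {record['session_id']}", ""]
    for number, turn in enumerate(record["conversation"]["turns"], start=1):
        # Assumption: each turn carries a "role" and a "content" string.
        lines.append(f"{number}. **{turn['role']}**")
        lines.append("")
        for text_line in turn["content"].splitlines() or [""]:
            lines.append(f"   > {text_line}")  # indented so the quote stays in the list item
        lines.append("")
    return jsonl_line, "\n".join(lines)
```

`json.dumps` escapes newlines inside strings, so each record stays on a single line as JSONL requires.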
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_schema.py" list                # all object types
python "${CLAUDE_SKILL_DIR}/scripts/lf_schema.py" show trace          # fields + endpoints
python "${CLAUDE_SKILL_DIR}/scripts/lf_schema.py" show generation     # alias for observation
python "${CLAUDE_SKILL_DIR}/scripts/lf_schema.py" fields userId       # search across types
python "${CLAUDE_SKILL_DIR}/scripts/lf_schema.py" hierarchy           # object tree
python "${CLAUDE_SKILL_DIR}/scripts/lf_schema.py" endpoints dataset   # API endpoints
```
For any endpoint not covered by convenience scripts, prefer the official CLI:
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_cli.py" api __schema
python "${CLAUDE_SKILL_DIR}/scripts/lf_cli.py" api <resource> <action> [flags]
```
Use the local raw HTTP script only when the official CLI is unavailable in the shell:
```bash
python "${CLAUDE_SKILL_DIR}/scripts/lf_api.py" GET /api/public/traces --params '{"limit":5}'
python "${CLAUDE_SKILL_DIR}/scripts/lf_api.py" POST /api/public/comments --body '{"objectId":"...","objectType":"TRACE","content":"flagged"}'
python "${CLAUDE_SKILL_DIR}/scripts/lf_api.py" GET /api/public/annotation-queues
```
Most scripts support --age <minutes> for relative time ranges:
- --age 60 → last hour
- --age 1440 → last 24 hours
- --age 10080 → last 7 days

Some also support --from-ts and --to for absolute ISO timestamps.
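The relationship between a relative age and an absolute window is simple arithmetic, sketched below. The function name is hypothetical, and the use of UTC and of ISO 8601 strings with an explicit offset is an assumption about what the scripts send.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def age_to_window(age_minutes: int, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Turn a relative age in minutes into (from, to) ISO 8601 timestamps."""
    if age_minutes <= 0:
        raise ValueError("age must be a positive number of minutes")
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=age_minutes)
    return start.isoformat(), end.isoformat()
```

With `now` fixed at 2024-01-08T12:00:00+00:00, an age of 10080 yields a window starting at 2024-01-01T12:00:00+00:00, exactly seven days earlier.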
Generic Langfuse API work goes through lf_cli.py, which delegates to the official langfuse-cli and maps this plugin's credential profile to the CLI environment. Curated scripts use lf_client.py — a zero-dependency HTTP client (stdlib only: urllib, json, base64). Project-specific FlowWise RAG execution scripts are excluded until their core.config / .env dependency is ported.
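lf_client.py is not reproduced here. As a rough sketch of what a stdlib-only client involves, the hypothetical helpers below build a request with urllib, json, and base64. They rely on the Langfuse public API's HTTP Basic authentication, with the public key as username and the secret key as password; the function names and header choices are illustrative.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_request(host, public_key, secret_key, method, path, params=None, body=None):
    """Build an authenticated urllib Request without sending it."""
    url = host.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def send(request, timeout=30.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating `build_request` from `send` lets the URL and headers be inspected, for example in a dry run, without making a network call.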
Key shared features:
- ~/.skills/langfuse/credentials.json (per R50 v2.0.0)
- utilities/credentials.py: five-tier precedence ladder; no walk-up .env discovery
- scripts/lf_cli.py: exports LANGFUSE_BASE_URL, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY to the subprocess only
- api_call_paginated()
- --compact flag (preserves essential fields, truncates large ones)
- LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_BASE_URL, LANGFUSE_HOST, LANGFUSE_PROJECT_NAME

- lf_cli.py --dry-run api __schema before first use
- lf_client.py --action health when using bundled scripts/ (e.g., eval/accuracy)
- tree command shows ⚠ indicators on ERROR/FATAL observations
- --compact on get actions truncates input/output/stacktrace for readability

Standard runs end after credentials/tooling are verified, traces/prompts/datasets/scores are inspected or exported, and findings are surfaced to the user. No self-improvement prompt fires for uneventful work.
Prompt for skill improvement only when the run deviated from the documented flow: a Langfuse API surface was missing from lf_cli.py/local scripts, credential setup hit an unsupported profile shape, quarantined FlowWise RAG code was needed, a repeated analysis pattern should become a reference, or the skill had to improvise around CLI/runtime drift. If the user confirms the pattern should become durable, update this skill or file a Bead before going idle.
npx claudepluginhub cmgramse/skill-development --plugin langfuse