Enables AI agents to use Braintrust for LLM evaluation, logging, and observability. Includes scripts for querying logs with SQL, running evals, and logging data.
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Scripts: scripts/log_data.py, scripts/query_logs.py, scripts/run_eval.py

Braintrust is a platform for evaluating, logging, and monitoring LLM applications.
Use the query_logs.py script to run SQL queries against Braintrust logs.
Always share the SQL query you used when reporting results, so the user understands what was executed.
Script location: scripts/query_logs.py (relative to this file)
Run from the user's project directory (where .env with BRAINTRUST_API_KEY exists):
uv run /path/to/scripts/query_logs.py --project "Project Name" --query "SQL_QUERY"
Count logs from last 24 hours:
SELECT count(*) as count FROM logs WHERE created > now() - interval 1 day
Get recent logs:
SELECT input, output, created FROM logs ORDER BY created DESC LIMIT 10
Filter by metadata:
SELECT input, output FROM logs WHERE metadata.user_id = 'user123' LIMIT 20
Filter by time range:
SELECT * FROM logs WHERE created > now() - interval 7 day LIMIT 50
Aggregate by field:
SELECT metadata.model, count(*) as count FROM logs GROUP BY metadata.model
Group by hour:
SELECT hour(created) as hr, count(*) as count FROM logs GROUP BY hour(created)
SQL dialect notes:
- Use hour(), day(), month(), year() instead of date_trunc(): write hour(created), not date_trunc('hour', created)
- Intervals are written as interval 1 day, interval 7 day, interval 1 hour (no quotes, singular unit)
- Nested fields use dot notation: metadata.user_id, scores.Factuality, metrics.duration
- Always query FROM logs (the script handles project scoping)

Operators: =, !=, >, <, >=, <=; IS NULL, IS NOT NULL; LIKE 'pattern%'; AND, OR, NOT

Aggregations: count(*), count(field), avg(field), sum(field), min(field), max(field)

Time filters: created > now() - interval 1 day, created > now() - interval 7 day, created > now() - interval 1 hour

Use scripts/log_data.py to log data to a project:
uv run /path/to/scripts/log_data.py --project "Project Name" --input "query" --output "response"
With metadata:
--input "query" --output "response" --metadata '{"user_id": "123"}'
Batch from JSON:
--data '[{"input": "a", "output": "b"}, {"input": "c", "output": "d"}]'
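For larger batches, it can be easier to build the JSON in Python and pass it through to the script. A minimal sketch, assuming --data accepts the same JSON shape shown above and that the script path and project name are adjusted to your setup:

import json
import subprocess

# Hypothetical records matching the --data format shown above.
records = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is 2+2?", "output": "4"},
]

# Adjust the script path and project name to your environment.
subprocess.run(
    [
        "uv", "run", "/path/to/scripts/log_data.py",
        "--project", "Project Name",
        "--data", json.dumps(records),
    ],
    check=True,
)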
Use scripts/run_eval.py to run evaluations:
uv run /path/to/scripts/run_eval.py --project "Project Name" --data '[{"input": "test", "expected": "test"}]'
From file:
--data-file test_cases.json --scorer factuality
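The exact schema run_eval.py expects in a data file isn't shown here; assuming it mirrors the inline --data format (a list of objects with input and expected), a small sketch that writes a test_cases.json:

import json

# Hypothetical cases mirroring the inline --data format shown above.
test_cases = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2+2?", "expected": "4"},
]

with open("test_cases.json", "w") as f:
    json.dump(test_cases, f, indent=2)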
Create a .env file in your project directory:
BRAINTRUST_API_KEY=your-api-key-here
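The scripts read BRAINTRUST_API_KEY from the environment. If you need the key in your own Python code, one option is python-dotenv; a minimal sketch, assuming the python-dotenv package is installed:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["BRAINTRUST_API_KEY"]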
For custom evaluation logic, use the SDK directly.
IMPORTANT: First argument to Eval() is the project name (positional).
import braintrust
from autoevals import Factuality

braintrust.Eval(
    "My Project",  # Project name (required, positional)
    data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
    task=lambda input: my_llm_call(input),
    scores=[Factuality],
)
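Custom scorers can be plain functions alongside (or instead of) the autoevals classes. A minimal sketch, assuming the SDK accepts a function returning a score between 0 and 1 (verify the exact scorer signature against the Braintrust docs); my_llm_call is the same placeholder used above:

import braintrust

def exact_match(input, output, expected):
    # Hypothetical scorer: 1.0 if the model output matches the expected answer exactly.
    return 1.0 if output == expected else 0.0

braintrust.Eval(
    "My Project",
    data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
    task=lambda input: my_llm_call(input),
    scores=[exact_match],
)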
Common mistakes:
- Eval(project_name="My Project", ...) - Wrong!
- Eval(name="My Project", ...) - Wrong!
- Eval("My Project", data=..., task=..., scores=...) - Correct!

To log data directly with the SDK, use a logger:

import braintrust
logger = braintrust.init_logger(project="My Project")
logger.log(input="query", output="response", metadata={"user_id": "123"})
logger.flush() # Always flush!
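The logger can also carry richer fields than input and output. A hedged sketch, assuming expected, scores, and metadata are supported keyword arguments on logger.log (check the SDK version you have installed):

import braintrust

logger = braintrust.init_logger(project="My Project")
logger.log(
    input="What is 2+2?",
    output="4",
    expected="4",                 # assumed supported: ground-truth answer
    scores={"exact_match": 1.0},  # assumed supported: per-log scores in [0, 1]
    metadata={"user_id": "123"},
)
logger.flush()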
Remember:
- Call logger.flush() after logging
- Keep a .env file with BRAINTRUST_API_KEY=your-key in the project directory