Enables AI agents to use Braintrust for LLM evaluation, logging, and observability. Includes scripts for querying logs with SQL, running evals, and logging data.
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Scripts: scripts/log_data.py, scripts/query_logs.py, scripts/run_eval.py

Braintrust is a platform for evaluating, logging, and monitoring LLM applications.
Use the query_logs.py script to run SQL queries against Braintrust logs.
Always share the SQL query you used when reporting results, so the user understands what was executed.
Script location: scripts/query_logs.py (relative to this file)
Run from the user's project directory (where .env with BRAINTRUST_API_KEY exists):
uv run /path/to/scripts/query_logs.py --project "Project Name" --query "SQL_QUERY"
Count logs from last 24 hours:
SELECT count(*) as count FROM logs WHERE created > now() - interval 1 day
Get recent logs:
SELECT input, output, created FROM logs ORDER BY created DESC LIMIT 10
Filter by metadata:
SELECT input, output FROM logs WHERE metadata.user_id = 'user123' LIMIT 20
Filter by time range:
SELECT * FROM logs WHERE created > now() - interval 7 day LIMIT 50
Aggregate by field:
SELECT metadata.model, count(*) as count FROM logs GROUP BY metadata.model
Group by hour:
SELECT hour(created) as hr, count(*) as count FROM logs GROUP BY hour(created)
SQL dialect notes:
- Use hour(), day(), month(), year() instead of date_trunc(): write hour(created), not date_trunc('hour', created)
- Intervals are written as interval 1 day, interval 7 day, interval 1 hour (no quotes, singular unit)
- Nested fields use dot notation: metadata.user_id, scores.Factuality, metrics.duration
- Always query FROM logs (the script handles project scoping)

Operators: =, !=, >, <, >=, <=; IS NULL, IS NOT NULL; LIKE 'pattern%'; AND, OR, NOT

Aggregations: count(*), count(field), avg(field), sum(field), min(field), max(field)

Time filters: created > now() - interval 1 day, created > now() - interval 7 day, created > now() - interval 1 hour

Use scripts/log_data.py to log data to a project:
uv run /path/to/scripts/log_data.py --project "Project Name" --input "query" --output "response"
With metadata:
--input "query" --output "response" --metadata '{"user_id": "123"}'
Batch from JSON:
--data '[{"input": "a", "output": "b"}, {"input": "c", "output": "d"}]'
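For larger batches, it can be easier to build the JSON in Python and pass it through to the script. A minimal sketch, assuming --data accepts the same JSON shape shown above and that the script path and project name are adjusted to your setup:

import json
import subprocess

# Hypothetical records matching the --data format shown above.
records = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is 2+2?", "output": "4"},
]

# Adjust the script path and project name to your environment.
subprocess.run(
    [
        "uv", "run", "/path/to/scripts/log_data.py",
        "--project", "Project Name",
        "--data", json.dumps(records),
    ],
    check=True,
)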
Use scripts/run_eval.py to run evaluations:
uv run /path/to/scripts/run_eval.py --project "Project Name" --data '[{"input": "test", "expected": "test"}]'
From file:
--data-file test_cases.json --scorer factuality
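The exact schema run_eval.py expects in a data file isn't shown here; assuming it mirrors the inline --data format (a list of objects with input and expected), a small sketch that writes a test_cases.json:

import json

# Hypothetical cases mirroring the inline --data format shown above.
test_cases = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2+2?", "expected": "4"},
]

with open("test_cases.json", "w") as f:
    json.dump(test_cases, f, indent=2)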
Create a .env file in your project directory:
BRAINTRUST_API_KEY=your-api-key-here
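The scripts read BRAINTRUST_API_KEY from the environment. If you need the key in your own Python code, one option is python-dotenv; a minimal sketch, assuming the python-dotenv package is installed:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["BRAINTRUST_API_KEY"]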
For custom evaluation logic, use the SDK directly.
IMPORTANT: First argument to Eval() is the project name (positional).
import braintrust
from autoevals import Factuality

braintrust.Eval(
    "My Project",  # Project name (required, positional)
    data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
    task=lambda input: my_llm_call(input),
    scores=[Factuality],
)
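Custom scorers can be plain functions alongside (or instead of) the autoevals classes. A minimal sketch, assuming the SDK accepts a function returning a score between 0 and 1 (verify the exact scorer signature against the Braintrust docs); my_llm_call is the same placeholder used above:

import braintrust

def exact_match(input, output, expected):
    # Hypothetical scorer: 1.0 if the model output matches the expected answer exactly.
    return 1.0 if output == expected else 0.0

braintrust.Eval(
    "My Project",
    data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
    task=lambda input: my_llm_call(input),
    scores=[exact_match],
)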
Common mistakes:
- Eval(project_name="My Project", ...) - Wrong!
- Eval(name="My Project", ...) - Wrong!
- Eval("My Project", data=..., task=..., scores=...) - Correct!

To log data directly with the SDK, use a logger:

import braintrust
logger = braintrust.init_logger(project="My Project")
logger.log(input="query", output="response", metadata={"user_id": "123"})
logger.flush() # Always flush!
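The logger can also carry richer fields than input and output. A hedged sketch, assuming expected, scores, and metadata are supported keyword arguments on logger.log (check the SDK version you have installed):

import braintrust

logger = braintrust.init_logger(project="My Project")
logger.log(
    input="What is 2+2?",
    output="4",
    expected="4",                 # assumed supported: ground-truth answer
    scores={"exact_match": 1.0},  # assumed supported: per-log scores in [0, 1]
    metadata={"user_id": "123"},
)
logger.flush()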
Remember:
- Call logger.flush() after logging
- Keep a .env file with BRAINTRUST_API_KEY=your-key in the project directory