From evsys-sdk
Teaches how to use evsys-sdk to read project goals and experiment history, and to create and launch experiments via EvsysStore and Workspace.
Slash command: /evsys-sdk:using-the-sdk
Two objects. EvsysStore routes every call through the backend gateway
with your Bearer API key — no Supabase key in the SDK; the backend checks
project membership and does the DB I/O. Workspace caches remote datasets to
local JSONL for fast training.
export EVSYS_API_URL="https://<backend>" # backend base URL
export EVSYS_API_KEY="sk_..." # dashboard → Settings → API keys
export EVSYS_PROJECT_ID="<uuid>" # default project for project-scoped calls
from evsys_sdk import EvsysStore, Workspace
store = EvsysStore() # reads the env above
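Because EvsysStore() reads its configuration from the environment, a small pre-flight check can catch a missing variable before the first call. The sketch below is only an illustration: the helper name missing_evsys_env is hypothetical, and treating all three variables as required is an assumption (EVSYS_PROJECT_ID is described above only as a default).

```python
import os

# Variable names are taken from the exports above.
# Assumption: all three are treated as required for this check.
REQUIRED_VARS = ("EVSYS_API_URL", "EVSYS_API_KEY", "EVSYS_PROJECT_ID")


def missing_evsys_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_evsys_env() with no argument inspects the real environment; an empty list means every variable above is set.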
project → goals[] (versioned) + datasets[]/benchmarks[] → experiments → groups → runs → checkpoints/evals/metrics
store.current_goal() # project's active goal (read-only!)
store.list_goals() # all goal versions, oldest→newest
store.experiment_summaries() # [{id,name,hypothesis,conclusion,is_valid}] — broad scan
store.experiment_detail(exp_id) # {experiment, groups:[{...,runs:[{run_config,evals,checkpoints}]}]}
store.experiment_detail(exp_id, include_metrics=True) # + per-run train-log series
store.get_metrics(run_id, name="val_loss", split="val") # a single metric series
store.list_datasets(); store.list_benchmarks()
run_config holds the run's hyperparameters (including the model). evals carry the
benchmark result metrics (pass_at_1, …). Do not treat experiments with is_valid=False as evidence.
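The note about is_valid can be applied as a simple filter over the summaries. This is a minimal sketch that assumes store.experiment_summaries() returns a list of dicts with the keys listed above; the helper name split_by_validity is hypothetical, and treating a missing is_valid as valid is an assumption.

```python
def split_by_validity(summaries):
    """Split experiment summaries into (valid, invalidated) lists.

    Only an explicit is_valid == False marks an experiment as invalidated.
    """
    valid, invalidated = [], []
    for summary in summaries:
        if summary.get("is_valid") is False:
            invalidated.append(summary)
        else:
            valid.append(summary)
    return valid, invalidated
```

With a live store this might be used as valid, _ = split_by_validity(store.experiment_summaries()), reading hypothesis and conclusion only from the valid list.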
exp = store.create_experiment(experiment_name="small_vs_large_model",
                              hypothesis="the smaller model matches the larger on this benchmark",
                              project_goal_id=store.current_goal()["id"])
g4 = store.create_group(exp["id"], "Qwen 4B", description="v2 dataset, train Qwen 4B")
for seed in (1, 2):
    run = store.create_run(experiment_id=exp["id"], group_id=g4["id"], seed=seed,
                           recipe_kind="sft",
                           run_config={"model": "Qwen/Qwen3-4B", "lr": 1e-5})  # model = hyperparam
    # ... train ... then:
    store.log_metrics(run_id=run["id"], step=10, metrics={"loss": 1.2})
    store.log_metrics(run_id=run["id"], step=10, split="val", metrics={"val_loss": 1.4})
    ckpt = store.add_checkpoint(run_id=run["id"], uri="tinker://final", label="final",
                                step=100, is_final=True)
    # bm is a benchmark record obtained beforehand (for example, one entry of store.list_benchmarks())
    store.create_eval(run_id=run["id"], benchmark_id=bm["id"], checkpoint_id=ckpt["id"],
                      metrics={"pass_at_1": 0.83})
    store.update_run(run["id"], status="completed")
store.set_conclusion(exp["id"], "4B matched 9B at half the cost — promote.")
# store.invalidate_experiment(exp_id, reason="...") # if a bug is found later
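Since each group is run with several seeds, it can help to average an eval metric per group before writing the conclusion. The sketch below works on the dict shape shown for store.experiment_detail(exp_id). It assumes each eval exposes a "metrics" dict (mirroring the metrics= argument of create_eval) and each group a "name" key; both are assumptions, and the helper name mean_metric_by_group is hypothetical.

```python
from statistics import mean


def mean_metric_by_group(detail, metric="pass_at_1"):
    """Average one eval metric over every run (seed) in each group."""
    result = {}
    for group in detail.get("groups", []):
        values = [
            ev["metrics"][metric]
            for run in group.get("runs", [])
            for ev in run.get("evals", [])
            if metric in ev.get("metrics", {})
        ]
        if values:  # groups without any eval for this metric are left out
            # Fall back to the group id when no name is present (assumed keys).
            result[group.get("name", group.get("id"))] = mean(values)
    return result
```

Averaging over seeds keeps a single lucky run from deciding the comparison; a fuller analysis would also look at the spread across seeds.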
Remote-first: pull a dataset to local JSONL once, then train from the file.
ws = Workspace(store) # root: $EVSYS_WORKSPACE or ./.evsys (gitignored)
mat = ws.pull_dataset(dataset_id) # cache-hit if already local; else fetch+write
# mat.path → JSONL of RAW rows (one per line)
# mat.format, mat.transform → how to render raw → typed (source_kind=jsonl + transform)
bench = ws.pull_benchmark(benchmark_id)
script = ws.script_path(exp["id"]); out = ws.outputs_dir(run["id"])
Rows are stored raw, together with the dataset's recorded transform; typed rows are
rendered on read. pull_dataset(..., force=True) re-pulls; otherwise a valid local copy is
reused (manifest-guarded).
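Once a dataset has been pulled, the file at mat.path can be read with the standard library alone. The sketch below is a generic JSONL reader, not part of evsys-sdk; the name iter_raw_rows is hypothetical, and rendering raw rows into typed rows via mat.transform is deliberately left out because its interface is not described here.

```python
import json
from pathlib import Path


def iter_raw_rows(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

A training script might then iterate with for row in iter_raw_rows(mat.path), which streams rows instead of loading the whole file into memory.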
Do not call store.set_goal(...) unless the user explicitly asks to change the goal.
The model goes in run_config, not in an experiment field.
hypothesis and conclusion are the experiment's commit message, so write them well.
Verifiers and metrics are registered in code (@register_verifier / @register_metric); benchmark rows reference a verifier by verifier_name, not inline code.
Install the plugin: npx claudepluginhub ev-sys/evsys-sdk --plugin evsys-sdk
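The rule that benchmark rows reference a verifier by verifier_name rather than carrying inline code follows a common name-based registry pattern. The sketch below illustrates that pattern only: the signature of the SDK's @register_verifier is not shown in this document, so register_local_verifier, the VERIFIERS dict, and the "expected" row field are all hypothetical stand-ins.

```python
# Hypothetical local registry; it illustrates lookup by name and is not
# the evsys-sdk implementation.
VERIFIERS = {}


def register_local_verifier(name):
    """Decorator that stores a verifier function under a string name."""
    def decorator(fn):
        if name in VERIFIERS:
            raise ValueError(f"verifier {name!r} is already registered")
        VERIFIERS[name] = fn
        return fn
    return decorator


@register_local_verifier("exact_match")
def exact_match(expected, actual):
    """Compare two strings after trimming surrounding whitespace."""
    return expected.strip() == actual.strip()


def verify_row(row, model_output):
    """Resolve the verifier named by the row and apply it to the output."""
    verifier = VERIFIERS[row["verifier_name"]]
    return verifier(row["expected"], model_output)
```

Keeping only a name in each row means the stored data stays plain and the checking logic lives in version-controlled code.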