Skill

using-the-sdk

Teaches how to use evsys-sdk to read project goals, experiment history, and to create/launch experiments via EvsysStore and Workspace.

Python

ai-ml

backend

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/evsys-sdk:using-the-sdk

User invocable

Model invocable

Inline context

Default effort

SKILL.md

96 lines · ~1.1k tokens

Stats

Language: Python

Stars: 27

Forks: 2

Maintenance: Excellent

Last commit: Jun 23, 2026

Using the evsys-sdk SDK

The SDK exposes two objects. EvsysStore routes every call through the backend gateway, authenticated with your Bearer API key; the SDK holds no Supabase key, and the backend checks project membership and performs the database I/O. Workspace caches remote datasets as local JSONL for fast training.

export EVSYS_API_URL="https://<backend>"   # backend base URL
export EVSYS_API_KEY="sk_..."               # dashboard → Settings → API keys
export EVSYS_PROJECT_ID="<uuid>"            # default project for project-scoped calls

from evsys_sdk import EvsysStore, Workspace
store = EvsysStore()           # reads the env above
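
Because EvsysStore() takes its configuration from the environment, a pre-flight check can catch a missing variable before the first call. The sketch below is not part of the SDK: `missing_evsys_env` is a hypothetical helper, and treating all three variables as required is an assumption (EVSYS_PROJECT_ID may only matter for project-scoped calls).

```python
import os

# Variable names come from the setup above; the helper itself is hypothetical.
REQUIRED_ENV = ("EVSYS_API_URL", "EVSYS_API_KEY", "EVSYS_PROJECT_ID")


def missing_evsys_env(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]
```

Calling `missing_evsys_env()` with no argument inspects the real environment; passing a plain dict makes the check easy to exercise in tests.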

Hierarchy

project → goals[] (versioned) + datasets[]/benchmarks[] → experiments → groups → runs → checkpoints/evals/metrics

Reading (agent context)

store.current_goal()                       # project's active goal (read-only!)
store.list_goals()                         # all goal versions, oldest→newest

store.experiment_summaries()               # [{id,name,hypothesis,conclusion,is_valid}] — broad scan
store.experiment_detail(exp_id)            # {experiment, groups:[{...,runs:[{run_config,evals,checkpoints}]}]}
store.experiment_detail(exp_id, include_metrics=True)   # + per-run train-log series
store.get_metrics(run_id, name="val_loss", split="val") # a single metric series
store.list_datasets(); store.list_benchmarks()

run_config holds the run's hyperparameters (including the model). evals carry the benchmark result metrics (pass_at_1, …). Do not treat experiments with is_valid=False as evidence.
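
One way to turn these reads into evidence is to drop invalidated experiments and then average an eval metric over the seeds in each group. The following is a minimal sketch over plain dicts shaped like the return values described above. It assumes each group dict has a "name" key and each eval dict has a "metrics" mapping, neither of which is confirmed here; both helper names are hypothetical.

```python
from statistics import mean


def valid_experiment_ids(summaries):
    """Keep only the ids of experiments that have not been invalidated."""
    return [s["id"] for s in summaries if s["is_valid"]]


def mean_metric_by_group(detail, metric="pass_at_1"):
    """Average one eval metric over all runs (seeds) in each group.

    Assumes group["name"] and eval["metrics"] exist; adjust to the real shape.
    """
    result = {}
    for group in detail["groups"]:
        values = [
            ev["metrics"][metric]
            for run in group["runs"]
            for ev in run["evals"]
            if metric in ev.get("metrics", {})
        ]
        if values:  # groups with no matching evals are left out
            result[group["name"]] = mean(values)
    return result
```

With a live store this might be used as `mean_metric_by_group(store.experiment_detail(exp_id))` for each id returned by `valid_experiment_ids(store.experiment_summaries())`.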

Writing (launching an experiment)

exp = store.create_experiment(experiment_name="small_vs_large_model",
                              hypothesis="the smaller model matches the larger on this benchmark",
                              project_goal_id=store.current_goal()["id"])
g4 = store.create_group(exp["id"], "Qwen 4B", description="v2 dataset, train Qwen 4B")
for seed in (1, 2):
    run = store.create_run(experiment_id=exp["id"], group_id=g4["id"], seed=seed,
                           recipe_kind="sft",
                           run_config={"model": "Qwen/Qwen3-4B", "lr": 1e-5})  # model = hyperparam
    # ... train ... then:
    store.log_metrics(run_id=run["id"], step=10, metrics={"loss": 1.2})
    store.log_metrics(run_id=run["id"], step=10, split="val", metrics={"val_loss": 1.4})
    ckpt = store.add_checkpoint(run_id=run["id"], uri="tinker://final", label="final",
                               step=100, is_final=True)
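    # bm must be a benchmark record chosen before the loop,
    # e.g. one of the entries returned by store.list_benchmarks()
    store.create_eval(run_id=run["id"], benchmark_id=bm["id"], checkpoint_id=ckpt["id"],
                      metrics={"pass_at_1": 0.83})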
    store.update_run(run["id"], status="completed")

store.set_conclusion(exp["id"], "4B matched 9B at half the cost — promote.")
# store.invalidate_experiment(exp_id, reason="...")   # if a bug is found later
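
The loop above only reaches update_run(..., status="completed") when every step succeeds; if training raises, the run's status is never updated. A small wrapper can record the outcome either way. This is a sketch under assumptions: "failed" is an assumed status value that the source does not list, and `RecordingStore` is a hypothetical stand-in used only so the example runs without a backend.

```python
def run_with_status(store, run_id, train_fn):
    """Call train_fn and record the outcome on the run via store.update_run."""
    try:
        train_fn()
    except Exception:
        # "failed" is an assumed status value; check what the backend accepts.
        store.update_run(run_id, status="failed")
        raise  # re-raise so the caller still sees the original error
    store.update_run(run_id, status="completed")


class RecordingStore:
    """Hypothetical stand-in for EvsysStore that only records update_run calls."""

    def __init__(self):
        self.statuses = {}

    def update_run(self, run_id, status):
        self.statuses[run_id] = status
```

In real use, `store` would be the EvsysStore instance and `train_fn` a closure that trains, logs metrics, and adds the checkpoint and eval.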

Materializing data for training (Workspace)

Remote-first: pull a dataset to local JSONL once, then train from the file.

ws = Workspace(store)                       # root: $EVSYS_WORKSPACE or ./.evsys (gitignored)
mat = ws.pull_dataset(dataset_id)           # cache-hit if already local; else fetch+write
# mat.path  → JSONL of RAW rows (one per line)
# mat.format, mat.transform → how to render raw → typed (source_kind=jsonl + transform)
bench = ws.pull_benchmark(benchmark_id)
script = ws.script_path(exp["id"]); out = ws.outputs_dir(run["id"])

Rows are stored raw, alongside the dataset's recorded transform; render them into typed rows at read time. pull_dataset(..., force=True) forces a re-pull; otherwise a valid local copy (guarded by a manifest) is reused.
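
Rendering on read can be sketched as a generator over the JSONL file. The format of mat.transform is not specified here, so the example takes a caller-supplied `render` callable instead of interpreting it; `iter_rendered_rows` is a hypothetical helper, not an SDK function.

```python
import json


def iter_rendered_rows(path, render=lambda row: row):
    """Yield one rendered row per non-empty JSONL line in `path`."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield render(json.loads(line))
```

A training script might call `iter_rendered_rows(mat.path, render=my_render)`, where `my_render` is a function you write based on mat.format and mat.transform.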

Rules of thumb

Never call store.set_goal(...) unless the user explicitly asks to change the goal.
The model is a per-run hyperparameter in run_config, not an experiment field.
hypothesis and conclusion are the experiment's commit message — write them well.
Verifiers/metrics are SDK-registered by name (@register_verifier / @register_metric); benchmark rows reference a verifier by verifier_name, not inline code.
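
The register-by-name rule can be illustrated with a minimal registry. This is a stand-in, not the SDK's implementation: the real @register_verifier signature is not shown here, and the `VERIFIERS` dict, the `exact_match` verifier, the `verify_row` helper, and the row's "expected" field are all hypothetical. Only verifier_name comes from the rule above.

```python
VERIFIERS = {}  # name -> verifier function


def register_verifier(name):
    """Decorator that stores a verifier function under `name`."""
    def decorator(fn):
        if name in VERIFIERS:
            raise ValueError(f"verifier {name!r} is already registered")
        VERIFIERS[name] = fn
        return fn
    return decorator


@register_verifier("exact_match")
def exact_match(expected, actual):
    """Pass when the answers match after trimming surrounding whitespace."""
    return expected.strip() == actual.strip()


def verify_row(row, actual):
    """Look up the verifier named by the row and apply it to the model output."""
    verifier = VERIFIERS[row["verifier_name"]]
    return verifier(row["expected"], actual)
```

Keeping only a name in each benchmark row means the row data stays plain JSON and the verifier logic is versioned with the code.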
