From arbor
Enforces merge/evaluation discipline with B_dev/B_test separation, score parsing, GitMergeBranch behavior, protected paths, medal detection, and tree metadata updates.
Slash command
/arbor:arbor-agent-merge-eval
Use this whenever scores, metadata, merge decisions, or final validation are involved.
Persist evaluation metadata early and update it after merges:
baseline_score: unmodified B_dev score.
trunk_score: current trunk B_dev score.
test_baseline_score: unmodified B_test score.
test_trunk_score: current trunk B_test score.
eval_cmd: B_dev command.
eval_cmd_test: B_test command.
eval_timeout, eval_retries, eval_retry_base_delay, eval_retry_max_delay.
dataset_info: paths and split descriptions.
metric_direction: maximize or minimize.
trunk_branch: non-protected branch that receives verified merges.
submission_path, sample_submission_path.
Use {cwd} and {node_id} placeholders. Example:
cd {cwd} && uv run python run_eval.py --split dev --run-name {node_id}
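The placeholder substitution can be sketched as follows. This is a minimal illustration, not Arbor's actual implementation: the helper name render_eval_cmd, the sample path, and the sample node id are hypothetical, and quoting the values with shlex.quote is an added precaution that the source does not mention.

```python
import shlex


def render_eval_cmd(template: str, cwd: str, node_id: str) -> str:
    """Fill the {cwd} and {node_id} placeholders in an eval command template.

    Hypothetical helper: str.replace is used instead of str.format so that
    any other braces in the command (for example inline JSON) stay untouched.
    """
    return (
        template
        .replace("{cwd}", shlex.quote(cwd))        # quote in case the path has spaces
        .replace("{node_id}", shlex.quote(node_id))
    )


eval_cmd = "cd {cwd} && uv run python run_eval.py --split dev --run-name {node_id}"
rendered = render_eval_cmd(eval_cmd, "/work/project", "node-7")
print(rendered)
```

Using the node id as the run name keeps each evaluation traceable to the tree node that produced it.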
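score is an absolute B_dev metric value.
metric_direction controls improvement: with maximize a higher score is better, with minimize a lower score is better.
If the eval output provides an explicit score, prefer it. Otherwise extract the primary
metric from text (primary_score, score, accuracy, acc, etc.).
Native GitMergeBranch: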
main or master.
trunk_branch. main/master are base branches, not merge targets.
source_branch.
eval_cmd_test with {cwd} and {node_id} substituted.
test_trunk_score or test_baseline_score.
--no-ff.
After success:
TreeSetMeta(test_trunk_score=<verified score>).
TreeSetMeta(trunk_score=<dev score>).
TreeUpdateNode(node_id=<id>, status="merged").
For plugins such as MLE/Kaggle:
data/**, private/**, or evaluation/**.
submission.csv do not exist on the branch.
merge_threshold is a soft coordinator guideline, not a substitute for B_test.
A small improvement can merge when performance-first mode says every gain
counts and B_test verifies it. A large B_dev improvement must still be rejected
if B_test fails.
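The scoring and merge rules in this document can be sketched as a few small functions. This is an illustration under stated assumptions, not Arbor's real code: the function names are hypothetical, the explicit score is assumed to arrive as a one-line JSON object with a "score" key, and "B_test verifies it" is read as a strict improvement over the B_test reference score.

```python
import json
import re
from typing import Optional

# Metric names searched in plain text, in priority order.
METRIC_NAMES = ("primary_score", "score", "accuracy", "acc")


def parse_score(output: str) -> Optional[float]:
    """Prefer an explicit JSON "score"; otherwise take the first known metric in text."""
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON, fall through to text matching
        if isinstance(data, dict) and isinstance(data.get("score"), (int, float)):
            return float(data["score"])
    for name in METRIC_NAMES:
        # \b keeps "score" from matching inside "primary_score" and "acc" inside "accuracy".
        match = re.search(rf"\b{name}\b\s*[:=]\s*(-?\d+(?:\.\d+)?)", output)
        if match:
            return float(match.group(1))
    return None


def improves(new: float, reference: float, direction: str) -> bool:
    """Apply metric_direction: maximize wants higher, minimize wants lower."""
    if direction == "maximize":
        return new > reference
    if direction == "minimize":
        return new < reference
    raise ValueError(f"unknown metric_direction: {direction!r}")


def should_merge(test_score: Optional[float], test_reference: float, direction: str) -> bool:
    """B_test is the gate: the size of the B_dev gain is deliberately not an input."""
    if test_score is None:
        return False  # no verified B_test score, so no merge
    return improves(test_score, test_reference, direction)


print(parse_score('{"score": 0.91}'))
print(should_merge(0.80, 0.82, "maximize"))
```

Because should_merge never sees the B_dev score, a large B_dev improvement cannot override a failed B_test check, and a small gain still merges once B_test confirms it.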
Before stopping:
test_trunk_score.
If test_baseline_score is missing and a baseline test run is feasible,
record it.
arbor-agent-resume-report.
For smoke/forward tests, do not run B_test or merge verification unless the
user explicitly requested a real run. Record test_trunk_score as unavailable,
state that no separate B_test was used, run arbor_state.py check, and hand
off to report generation.
If native GitMergeBranch is unavailable, use arbor-agent-tools:
python <tools>/arbor_state.py eval --cwd <project> --run-name <run> \
--split dev --cmd "<eval_cmd>" --set-meta baseline
python <tools>/arbor_state.py meta --cwd <project> --run-name <run> \
--set "trunk_branch=<trunk_branch>"
python <tools>/arbor_state.py merge --cwd <project> --run-name <run> \
--source-branch <branch> --node-id <id>
Pass --target-branch <trunk_branch> explicitly only when metadata is not set.
If a manual merge would touch live work, prefer --dry-run first.
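When these commands are driven from a script, the argument list for the merge step could be assembled as sketched below. The helper build_merge_cmd and all sample values are hypothetical. Only flags that appear in the commands above are used, and the sketch assumes --dry-run is accepted by the merge subcommand.

```python
import shlex
from typing import List, Optional


def build_merge_cmd(
    tools_dir: str,
    project: str,
    run_name: str,
    source_branch: str,
    node_id: str,
    target_branch: Optional[str] = None,
    dry_run: bool = True,
) -> List[str]:
    """Build the argv for `arbor_state.py merge` (hypothetical helper)."""
    cmd = [
        "python", f"{tools_dir}/arbor_state.py", "merge",
        "--cwd", project,
        "--run-name", run_name,
        "--source-branch", source_branch,
        "--node-id", node_id,
    ]
    if target_branch is not None:
        # Only needed when trunk_branch is not already set in metadata.
        cmd += ["--target-branch", target_branch]
    if dry_run:
        # Default to a dry run so live work is not touched by accident.
        cmd.append("--dry-run")
    return cmd


cmd = build_merge_cmd("tools", "/work/project", "run-1", "node-7-branch", "node-7")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the dry run looks correct, with dry_run=False for the real merge.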
npx claudepluginhub ruc-nlpir/arbor --plugin arbor