name: baseline-replication
description: "Replicate published ML baseline experiments with exact reproducibility
\ (\xB11% tolerance) for Deep Research SOP Pipeline D. Use when validating baselines,
\ reproducing experiments, verifying published results, or preparing for novel method
\ development."
version: 1.0.0
category: research
tags:
Replicates published machine learning baseline methods with exact reproducibility, ensuring results match within ±1% tolerance. This skill implements Deep Research SOP Pipeline D baseline validation, which is a prerequisite for developing novel methods.
# 1. Specify baseline to replicate
BASELINE_PAPER="BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2019)"
BASELINE_CODE="https://github.com/google-research/bert"
TARGET_METRIC="Accuracy on SQuAD 2.0"
PUBLISHED_RESULT=0.948
# 2. Run replication workflow
./scripts/replicate-baseline.sh \
--paper "$BASELINE_PAPER" \
--code "$BASELINE_CODE" \
--metric "$TARGET_METRIC" \
--expected "$PUBLISHED_RESULT"
# 3. Review results
cat output/baseline-bert/replication-report.md
Expected output:
✓ Paper analyzed: Extracted 47 hyperparameters
✓ Dataset validated: SQuAD 2.0 matches baseline
✓ Implementation complete: 12 BERT layers, 110M parameters
✓ Training complete: 3 epochs, 26.3 GPU hours
✓ Results validated: 0.945 vs 0.948 (within ±1% tolerance)
✓ Reproducibility verified: 3/3 fresh reproductions successful
→ Quality Gate 1: APPROVED
# Coordinate with researcher agent
./scripts/analyze-paper.sh --paper "arXiv:2103.00020"
The script extracts the baseline's methodology, hyperparameters, dataset details, and evaluation protocol.
Output: baseline-specification.md with all extracted details
# Check for missing hyperparameters
./scripts/validate-spec.sh baseline-specification.md
Common missing details include learning-rate schedules, warmup steps, and preprocessing choices. If details are missing, check the official code release, search its GitHub issues, or contact the authors (commands for each appear in the troubleshooting section below).
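A minimal sketch of the kind of completeness check `validate-spec.sh` might perform once the specification is parsed into a dict; the required keys listed here are illustrative, not exhaustive:

```python
# Hypothetical completeness check for an extracted baseline specification.
REQUIRED_KEYS = [
    "learning_rate", "batch_size", "num_epochs",
    "warmup_steps", "weight_decay", "dropout", "random_seed",
]

def find_missing_hyperparameters(spec: dict) -> list[str]:
    """Return the hyperparameters the specification does not document."""
    return [key for key in REQUIRED_KEYS if spec.get(key) is None]

spec = {"learning_rate": 5e-5, "batch_size": 32, "num_epochs": 3}
print(find_missing_hyperparameters(spec))  # ['warmup_steps', 'weight_decay', 'dropout', 'random_seed']
```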
# Validate dataset matches baseline specs
./scripts/validate-dataset.sh \
--dataset "SQuAD 2.0" \
--splits "train:130k,dev:12k" \
--preprocessing "WordPiece tokenization, max_length=384"
data-steward checks:
Output: dataset-validation-report.md
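A sketch of the checks the data-steward step might run against the downloaded data, assuming expected split sizes and a published archive checksum (the checksum value here is a placeholder):

```python
import hashlib

EXPECTED_SPLITS = {"train": 130_319, "dev": 11_873}  # approximate SQuAD 2.0 question counts (train:130k, dev:12k)
EXPECTED_SHA256 = "<checksum from the dataset release notes>"  # placeholder

def sha256_of(path: str) -> str:
    """Stream a file and return its SHA256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_splits(actual_sizes: dict) -> list[str]:
    """Report any split whose size differs from the baseline specification."""
    return [
        f"{name}: expected {expected}, got {actual_sizes.get(name)}"
        for name, expected in EXPECTED_SPLITS.items()
        if actual_sizes.get(name) != expected
    ]
```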
# Implement baseline with exact specifications
./scripts/implement-baseline.sh \
--spec baseline-specification.md \
--framework pytorch \
--template resources/templates/bert-base.py
coder creates:
# baseline-bert-implementation.py
import torch
import random
import numpy as np
# CRITICAL: Set all random seeds for reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
np.random.seed(42)
random.seed(42)
# CRITICAL: Enable deterministic mode
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
# Exact hyperparameters from paper
config = {
"num_layers": 12,
"hidden_size": 768,
"num_attention_heads": 12,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"learning_rate": 5e-5, # From paper section 4.2
"batch_size": 32, # Per-GPU batch size
"num_epochs": 3,
"warmup_steps": 10000, # 10% of training steps
"weight_decay": 0.01,
"dropout": 0.1
}
Unit Tests:
pytest baseline-bert-implementation_test.py -v
Output: Fully tested implementation matching baseline exactly
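A sketch of the kind of checks the unit tests might perform; the model construction itself comes from the implementation file and is not shown, so `model` here is a stand-in:

```python
# Illustrative checks a pytest file might wrap around the baseline implementation.
import torch

def check_parameter_count(model: torch.nn.Module) -> None:
    n_params = sum(p.numel() for p in model.parameters())
    # BERT-base is roughly 110M parameters; allow a small margin for variants
    assert 100_000_000 < n_params < 120_000_000, f"unexpected size: {n_params:,}"

def check_deterministic_forward(model: torch.nn.Module) -> None:
    torch.manual_seed(42)
    batch = torch.randint(0, 30522, (2, 128))  # dummy WordPiece token IDs
    model.eval()
    with torch.no_grad():
        assert torch.equal(model(batch), model(batch))
```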
# Run experiments with monitoring
./scripts/run-experiments.sh \
--implementation baseline-bert-implementation.py \
--config config/bert-squad.yaml \
--gpus 4 \
--monitor true
tester executes:
Environment Setup:
# Create deterministic environment
docker build -t baseline-bert:v1.0 -f Dockerfile .
docker run --gpus all -v $(pwd):/workspace baseline-bert:v1.0
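It can also help to record the exact software stack next to the run artifacts; a small sketch (the output path is illustrative):

```python
import json
import platform
import torch

# Capture the environment actually used for this run, for the reproducibility package.
env = {
    "python": platform.python_version(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,
    "cudnn": torch.backends.cudnn.version(),
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
}
with open("logs/environment.json", "w") as f:
    json.dump(env, f, indent=2)
```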
Training with Monitoring:
# Log training curves
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('logs/baseline-bert')
global_step = 0
for epoch in range(3):
    for batch in dataloader:
        loss = model(batch)  # model and dataloader come from the implementation above
        writer.add_scalar('Loss/train', loss, global_step)
        writer.add_scalar('LR', optimizer.param_groups[0]['lr'], global_step)
        global_step += 1
Checkpoint Saving:
# Save best checkpoint
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
'accuracy': accuracy
}, 'checkpoints/best-model.pt')
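Before archiving, it is worth confirming the checkpoint restores cleanly in a fresh process; a minimal sketch, assuming the model and optimizer are rebuilt exactly as in training:

```python
import torch

def restore_checkpoint(model, optimizer, path='checkpoints/best-model.pt'):
    """Reload the saved checkpoint into freshly built model/optimizer objects."""
    checkpoint = torch.load(path, map_location='cpu')
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return {'epoch': checkpoint['epoch'], 'accuracy': checkpoint['accuracy']}
```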
Output:
- `training.log` - Complete training logs
- `best-model.pt` - Best checkpoint
- `metrics.json` - All evaluation metrics

# Compare reproduced vs published results
./scripts/compare-results.sh \
--reproduced 0.945 \
--published 0.948 \
--tolerance 0.01
Validation checks:
import numpy as np
import scipy.stats as stats

# One-sample t-test against the published value
reproduced = [0.945, 0.946, 0.944]  # 3 runs
published = 0.948
difference = np.mean(reproduced) - published
percent_diff = (difference / published) * 100
# Within tolerance?
within_tolerance = abs(difference / published) <= 0.01
# Statistical significance
t_stat, p_value = stats.ttest_1samp(reproduced, published)
confidence_interval = stats.t.interval(0.95, len(reproduced)-1,
loc=np.mean(reproduced),
scale=stats.sem(reproduced))
print(f"Reproduced: {np.mean(reproduced):.3f} ± {np.std(reproduced):.3f}")
print(f"Published: {published:.3f}")
print(f"Difference: {difference:.3f} ({percent_diff:.2f}%)")
print(f"Within ±1% tolerance: {within_tolerance}")
print(f"95% CI: [{confidence_interval[0]:.3f}, {confidence_interval[1]:.3f}]")
If results differ > 1%:
# Debug systematically
./scripts/debug-divergence.sh \
--reproduced 0.932 \
--published 0.948
Common causes include mismatched hyperparameters, different framework or CUDA versions, data preprocessing differences, and unseeded randomness.
Output: baseline-bert-comparison.md with statistical analysis
# Create complete reproducibility package
./scripts/create-repro-package.sh \
--name baseline-bert \
--code baseline-bert-implementation.py \
--model best-model.pt \
--env requirements.txt
archivist creates:
baseline-bert-repro.tar.gz
├── README.md # ≤5 steps to reproduce
├── requirements.txt # Exact versions
├── Dockerfile # Exact environment
├── src/
│ ├── baseline-bert-implementation.py
│ ├── data_loader.py
│ └── train.py
├── data/
│ └── download_instructions.txt
├── models/
│ └── best-model.pt
├── logs/
│ └── training.log
├── results/
│ ├── metrics.json
│ └── comparison.csv
└── MANIFEST.txt # SHA256 checksums
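A sketch of how the MANIFEST.txt checksums might be generated over the package contents (the package path is illustrative):

```python
import hashlib
from pathlib import Path

def write_manifest(package_dir: str, manifest_name: str = "MANIFEST.txt") -> None:
    """Write a SHA256 checksum line for every file in the package."""
    root = Path(package_dir)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file() and p.name != manifest_name):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root)}")
    (root / manifest_name).write_text("\n".join(lines) + "\n")

write_manifest("baseline-bert-repro")
```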
README.md (≤5 steps):
# BERT SQuAD 2.0 Baseline Reproduction
## Quick Reproduction (3 steps)
1. Build Docker environment:
   ```bash
   docker build -t bert-squad:v1.0 .
   ```
2. Download SQuAD 2.0 dataset:
   ```bash
   ./download_data.sh
   ```
3. Run training:
   ```bash
   docker run --gpus all -v $(pwd):/workspace bert-squad:v1.0 python src/train.py
   ```

Expected result: 0.945 ± 0.001 accuracy (within ±1% of published 0.948)
#### Test Reproducibility
```bash
# Fresh Docker reproduction
./scripts/test-reproducibility.sh --package baseline-bert-repro.tar.gz --runs 3
```

Output: 3 successful reproductions with deterministic results
# Validate Quality Gate 1 requirements
./scripts/validate-gate-1.sh --baseline baseline-bert
evaluator checks:
Decision Logic:
if results_within_tolerance and reproducibility_verified:
decision = "APPROVED"
elif minor_gaps_fixable:
decision = "CONDITIONAL" # e.g., 1.2% difference but deterministic
else:
decision = "REJECT" # e.g., 5% difference, non-deterministic
Output: gate-1-validation-checklist.md
# Quality Gate 1: Baseline Validation
## Status: APPROVED
### Requirements
- [x] Baseline specification document complete
- [x] Dataset validation passed
- [x] Implementation tested and reviewed
- [x] Results within ±1% of published (0.945 vs 0.948)
- [x] Reproducibility package tested in fresh environment
- [x] Documentation complete
- [x] All artifacts archived
### Evidence
- Baseline spec: `baseline-bert-specification.md`
- Dataset validation: `dataset-validation-report.md`
- Implementation: `baseline-bert-implementation.py` (100% test coverage)
- Results comparison: `baseline-bert-comparison.md`
- Reproducibility package: `baseline-bert-repro.tar.gz` (3/3 successful)
### Approval
**Date**: 2025-11-01
**Approved By**: evaluator agent
**Next Step**: Proceed to Pipeline D novel method development
# Compare multiple baselines simultaneously
./scripts/compare-baselines.sh \
--baselines "bert-base,roberta-base,electra-base" \
--dataset "SQuAD 2.0" \
--metrics "accuracy,f1,em"
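A sketch of how per-baseline results might be aggregated into one comparison table, assuming each run wrote an `output/<name>/metrics.json` file (that layout is an assumption):

```python
import json
from pathlib import Path

baselines = ["bert-base", "roberta-base", "electra-base"]
metrics = ["accuracy", "f1", "em"]

# Collect one row per baseline from its metrics file.
rows = []
for name in baselines:
    data = json.loads(Path(f"output/{name}/metrics.json").read_text())
    rows.append([name] + [f"{data.get(m, float('nan')):.3f}" for m in metrics])

print(" | ".join(["baseline"] + metrics))
for row in rows:
    print(" | ".join(row))
```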
# Once baseline validated, run ablations
./scripts/run-ablations.sh \
--baseline baseline-bert \
--ablations "no-warmup,no-weight-decay,smaller-lr"
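A sketch of how the named ablations might map onto single-ingredient config overrides before re-running the validated training loop (the override values are assumptions):

```python
import copy

base_config = {"learning_rate": 5e-5, "warmup_steps": 10000, "weight_decay": 0.01}

# Each ablation changes exactly one ingredient of the validated baseline.
ablations = {
    "no-warmup":       {"warmup_steps": 0},
    "no-weight-decay": {"weight_decay": 0.0},
    "smaller-lr":      {"learning_rate": 2e-5},  # illustrative value
}

for name, overrides in ablations.items():
    config = copy.deepcopy(base_config)
    config.update(overrides)
    print(f"[{name}] {config}")
    # train_and_evaluate(config)  # hypothetical hook into the existing training loop
```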
# Set up monitoring for baseline drift
./scripts/setup-monitoring.sh \
--baseline baseline-bert \
--schedule "weekly" \
--alert-threshold 0.02
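A sketch of the drift check such a weekly job might run against the archived baseline result; the 2% threshold mirrors `--alert-threshold` above, and the function name is illustrative:

```python
def check_baseline_drift(current_accuracy, archived_accuracy=0.945, threshold=0.02):
    """Alert if a fresh rerun drifts more than the allowed fraction from the archive."""
    drift = abs(current_accuracy - archived_accuracy) / archived_accuracy
    if drift > threshold:
        return f"ALERT: baseline drifted {drift:.1%} (threshold {threshold:.0%})"
    return f"OK: drift {drift:.1%} within {threshold:.0%}"

print(check_baseline_drift(0.921))  # ~2.5% drift would trigger an alert
```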
**Symptoms**: Specification validation fails with missing details
**Cause**: Paper doesn't document all hyperparameters
**Solution**:
# Check official code config files
grep -r "learning_rate\|batch_size\|warmup" ${BASELINE_CODE}/
# Check GitHub issues
gh issue list --repo ${BASELINE_REPO} --search "hyperparameter"
# Contact authors
./scripts/contact-authors.sh --paper "arXiv:2103.00020" --question "learning rate schedule"
**Symptoms**: Reproduced 0.932 vs. published 0.948 (1.7% difference)
**Solution**:
# Systematic debugging
./scripts/debug-divergence.sh --detailed
# Check random seeds
python -c "import torch; print(torch.initial_seed())"
# Check framework version
python -c "import torch; print(torch.__version__)"
# Enable detailed logging
python baseline-bert-implementation.py --debug --log-level DEBUG
**Symptoms**: 3 runs produce 0.945, 0.951, 0.938 (high variance)
**Solution**:
# Force deterministic mode (set the env var before any CUDA work)
import os
import torch

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
torch.use_deterministic_algorithms(True)  # raises an error if a non-deterministic op is used
# Note: torch.set_deterministic() is the deprecated predecessor of the call above
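If variance persists, DataLoader shuffling and worker processes are a common remaining source of nondeterminism; a sketch of seeding them explicitly (the dataset itself is assumed to exist elsewhere):

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive each worker's seed from the main-process seed set via torch.manual_seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

generator = torch.Generator()
generator.manual_seed(42)

# loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4,
#                     worker_init_fn=seed_worker, generator=generator)
```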
**Symptoms**: Docker build or run errors
**Solution**:
# Check Docker resources
docker system df
docker system prune -a # Free up space if needed
# Use pre-built base image
docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
# Debug interactively
docker run -it --gpus all pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime bash
| File | Description | Size |
|---|---|---|
| `baseline-{method}-specification.md` | Extracted methodology | ~5KB |
| `dataset-validation-report.md` | Dataset validation results | ~2KB |
| `baseline-{method}-implementation.py` | Clean implementation | ~10KB |
| `baseline-{method}-implementation_test.py` | Unit tests | ~5KB |
| `training.log` | Complete training logs | ~100MB |
| `best-model.pt` | Best checkpoint | ~400MB |
| `metrics.json` | All evaluation metrics | ~1KB |
| `baseline-{method}-comparison.md` | Results comparison | ~3KB |
| `baseline-{method}-comparison.csv` | Metrics table | ~1KB |
| `baseline-{method}-repro.tar.gz` | Reproducibility package | ~450MB |
| `gate-1-validation-checklist.md` | Quality Gate 1 evidence | ~3KB |
Baseline replication is mandatory before novel method development:
Pipeline D Flow:
1. Replicate Baseline (this skill) → Quality Gate 1
2. Develop Novel Method (method-development skill)
3. Ablation Studies (5+ ablations required)
4. Statistical Validation (p < 0.05)
5. Submit for Gate 2 review
Sequential workflow:
researcher:
- Analyze paper
- Extract methodology
- Identify data sources
↓
data-steward:
- Validate datasets
- Check integrity
- Verify preprocessing
↓
coder:
- Implement baseline
- Add unit tests
- Code review
↓
tester:
- Run experiments
- Monitor training
- Collect metrics
↓
archivist:
- Create repro package
- Test fresh reproduction
- Archive artifacts
↓
evaluator:
- Validate Gate 1
- Generate checklist
- Approve/Conditional/Reject
# Store baseline specification
memory-store --key "sop/pipeline-d/baseline-bert/specification" \
--value "$(cat baseline-bert-specification.md)" \
--layer long_term
# Store validation results
memory-store --key "sop/pipeline-d/baseline-bert/gate-1" \
--value "$(cat gate-1-validation-checklist.md)" \
--layer long_term
**Related Files**:
- `docs/deep-research-sop-gap-analysis.md`
- `CLAUDE.md`
- `agents/research/`
- `.claude/commands/research/`

**Created**: 2025-11-01
**Version**: 1.0.0
**Category**: Deep Research SOP
**Pipeline**: D (Method Development)
**Quality Gate**: 1 (Baseline Validation)
**Estimated Time**: 8-12 hours (first baseline), 4-6 hours (subsequent)