WHEN: Machine Learning/Deep Learning code review, PyTorch/TensorFlow patterns, Model training optimization, MLOps checks WHAT: Model architecture review + Training patterns + Data pipeline checks + GPU optimization + Experiment tracking WHEN NOT: Data analysis only → python-data-reviewer, General Python → python-reviewer
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Reviews Machine Learning and Deep Learning code for PyTorch, TensorFlow, scikit-learn, and MLOps best practices.
**Detection triggers**:
- torch, tensorflow, keras, sklearn in requirements.txt / pyproject.toml
- .pt, .pth, .h5, .pkl model files
- train.py, model.py, dataset.py files

**Framework**: PyTorch / TensorFlow / scikit-learn
**Python**: 3.10+
**CUDA**: 11.x / 12.x
**Task**: Classification / Regression / NLP / CV
**Stage**: Research / Production
AskUserQuestion:
"Which areas to review?"
Options:
- Full ML pattern check (recommended)
- Model architecture review
- Training loop optimization
- Data pipeline efficiency
- MLOps/deployment patterns
multiSelect: true
### PyTorch Patterns

| Check | Risk | Severity |
|---|---|---|
| Missing model.eval() | Dropout/BatchNorm behave inconsistently at inference | HIGH |
| Missing torch.no_grad() | Autograd graph kept in memory, slower inference | HIGH |
| In-place operations in autograd | Gradient computation errors | CRITICAL |
| DataLoader num_workers=0 | CPU-bound data loading bottleneck | MEDIUM |
| Missing gradient clipping | Exploding gradients | MEDIUM |
```python
# BAD: Missing eval() and no_grad()
def predict(model, x):
    return model(x)  # Dropout/BatchNorm inconsistent!

# GOOD: Proper inference mode
def predict(model, x):
    model.eval()
    with torch.no_grad():
        return model(x)
```
```python
# BAD: In-place operation breaking autograd
x = torch.randn(10, requires_grad=True)
x += 1  # In-place op on a leaf tensor that requires grad raises a RuntimeError

# GOOD: Out-of-place operation
x = torch.randn(10, requires_grad=True)
x = x + 1
```
```python
# BAD: DataLoader bottleneck
loader = DataLoader(dataset, batch_size=32)  # num_workers=0

# GOOD: Parallel data loading
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True,  # For GPU
    persistent_workers=True,
)

# BAD: No gradient clipping
optimizer.step()

# GOOD: Clip gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```
### TensorFlow/Keras Patterns

| Check | Risk | Severity |
|---|---|---|
| Missing @tf.function | No graph compilation, slower training steps | MEDIUM |
| Eager mode in production | Slow inference (see sketch below) | HIGH |
| Large model in memory | OOM risk | HIGH |
| Missing mixed precision | Wasted GPU throughput during training | MEDIUM |
```python
# BAD: No @tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x)
        loss = loss_fn(y, pred)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

# GOOD: Use @tf.function
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
```
```python
# BAD: Missing mixed precision
model.fit(x_train, y_train, epochs=10)

# GOOD: Enable mixed precision
# Note: set the policy BEFORE building the model; it only affects layers
# created after the call.
tf.keras.mixed_precision.set_global_policy('mixed_float16')
model.fit(x_train, y_train, epochs=10)
```
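The "Eager mode in production" check above has no snippet; here is a minimal sketch of compiling the inference path with `@tf.function` and exporting a SavedModel. The toy `tf.Module`, input shape, and export path are placeholders standing in for the real model.

```python
import tensorflow as tf

# Hypothetical stand-in for a trained model; substitute your own.
class Model(tf.Module):
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([32, 10]))

    @tf.function(input_signature=[tf.TensorSpec([None, 32], tf.float32)])
    def serve(self, x):
        # Traced once; later calls run the compiled graph, not eager Python.
        return tf.matmul(x, self.w)

model = Model()
# Export a SavedModel whose serving signature is the compiled function.
tf.saved_model.save(model, "export/1", signatures={"serving_default": model.serve})
```

Serving the exported graph (via TF Serving or `tf.saved_model.load`) avoids per-call eager overhead in production.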
### scikit-learn Patterns

| Check | Risk | Severity |
|---|---|---|
| fit_transform on test data | Data leakage | CRITICAL |
| Missing cross-validation | Overfitting risk | HIGH |
| No feature scaling | Degraded performance for scale-sensitive models | MEDIUM |
| Missing random_state | Non-reproducible results (see sketch below) | LOW |
```python
# BAD: Data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)  # LEAK! Re-fitting on test data

# GOOD: transform only on test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # No re-fit

# BAD: No cross-validation
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

# GOOD: Use cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")

# BAD: Pipeline without scaling
model = LogisticRegression()
model.fit(X_train, y_train)

# GOOD: Use Pipeline with scaling
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
```
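The random_state row likewise has no example; a small sketch of pinning seeds on the split and the estimator follows. The dataset, estimator choice, and the value 42 are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)

# Pin random_state on every stochastic step so reruns give identical splits and models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```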
### Data Pipeline

| Check | Problem | Solution |
|---|---|---|
| Loading full dataset into memory | OOM | Use generators / tf.data |
| No data augmentation | Overfitting | Add augmentation (tf.data sketch below) |
| Unbalanced classes | Biased model | Oversampling/undersampling/class weights |
| No validation split | No early stopping | Hold out a validation set (see sketch below) |
```python
# BAD: Full dataset in memory
images = []
for path in all_image_paths:
    images.append(load_image(path))  # OOM for large datasets!

# GOOD: Use a generator
def data_generator(paths, batch_size):
    for i in range(0, len(paths), batch_size):
        batch_paths = paths[i:i+batch_size]
        yield np.array([load_image(p) for p in batch_paths])

# GOOD: Use tf.data
dataset = tf.data.Dataset.from_tensor_slices(paths)
dataset = dataset.map(load_and_preprocess)
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
```
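The data-augmentation row in the table above has no snippet; a hedged sketch of adding augmentation to the same tf.data pipeline could look like the following. It assumes the `paths` and `load_and_preprocess` from the snippet above, with `load_and_preprocess` returning a single decoded image tensor; the flip/brightness choices are placeholders to tune per task.

```python
import tensorflow as tf

def augment(image):
    # Cheap, label-preserving augmentations; adjust for your task.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image

# Augment only the training pipeline: after decoding, before batching.
train_ds = (
    tf.data.Dataset.from_tensor_slices(paths)
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```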
```python
# BAD: No class weights for imbalanced data
model.fit(X_train, y_train)

# GOOD: Add class weights
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
class_weights = dict(enumerate(weights))
model.fit(X_train, y_train, class_weight=class_weights)
```
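For the validation-split row, a sketch using a Keras-style `model.fit` follows. It reuses `X_train`, `y_train`, and `class_weights` from the snippet above; the 20% split and patience of 5 are arbitrary choices.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,                   # stop after 5 epochs without improvement
    restore_best_weights=True,
)

# Hold out 20% of the training data so early stopping is driven by unseen examples.
model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=100,
    class_weight=class_weights,
    callbacks=[early_stop],
)
```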
### GPU Optimization

| Check | Risk | Severity |
|---|---|---|
| Tensor operations left on CPU | GPU sits idle | HIGH |
| Frequent GPU-CPU transfers | Pipeline stalls on every sync | HIGH |
| No gradient accumulation | OOM when a large effective batch is needed | MEDIUM |
| Missing torch.cuda.empty_cache() between phases | Memory fragmentation (see sketch below) | LOW |
```python
# BAD: CPU operations
x = torch.randn(1000, 1000)
y = torch.randn(1000, 1000)
z = x @ y  # CPU computation

# GOOD: GPU operations
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)
z = x @ y  # GPU computation
```
```python
# BAD: Frequent CPU-GPU transfer
for x, y in dataloader:
    x = x.cuda()
    y = y.cuda()
    loss = model(x, y)
    print(loss.item())  # Forces a GPU sync every iteration!

# GOOD: Batch logging
losses = []
for step, (x, y) in enumerate(dataloader):
    x, y = x.to(device), y.to(device)
    loss = model(x, y)
    losses.append(loss.detach())  # detach so the autograd graph is not retained
    if (step + 1) % log_interval == 0:
        print(torch.stack(losses).mean().item())
        losses.clear()
```
```python
# Gradient accumulation for a large effective batch size
accumulation_steps = 4
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = model(x, y) / accumulation_steps  # scale so accumulated grads average correctly
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
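For the empty_cache row, a minimal sketch of releasing the allocator cache between phases; the memory logging is optional and the 1e9 scaling is just for readable GB output.

```python
import torch

def log_gpu_memory(tag):
    # allocated = live tensors, reserved = blocks cached by the allocator
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"{tag}: allocated={allocated:.2f} GB, reserved={reserved:.2f} GB")

if torch.cuda.is_available():
    log_gpu_memory("after training")
    # Only worth calling between distinct phases (e.g. training -> evaluation);
    # inside a training loop it just slows things down.
    torch.cuda.empty_cache()
    log_gpu_memory("after empty_cache")
```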
### MLOps & Reproducibility

| Check | Risk | Severity |
|---|---|---|
| No experiment tracking | Results hard to reproduce or compare | HIGH |
| Hardcoded hyperparameters | No config management, painful tuning | MEDIUM |
| No model versioning | Deployment and rollback issues (MLflow sketch below) | MEDIUM |
| Missing seed setting | Non-reproducible runs | HIGH |
```python
# BAD: No seed setting
model = train_model(X, y)

# GOOD: Set all seeds
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_seed(42)
```
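When DataLoader workers are used (see num_workers above), each worker has its own RNG streams that `set_seed` alone does not cover. A sketch following the standard PyTorch reproducibility recipe, reusing the `dataset` from earlier:

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Each worker gets a distinct but deterministic seed derived from the base seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    worker_init_fn=seed_worker,  # seeds numpy/random inside each worker
    generator=g,                 # controls shuffling and worker base seeds
)
```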
```python
# BAD: Hardcoded hyperparameters
lr = 0.001
batch_size = 32
epochs = 100

# GOOD: Use a config file or Hydra
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="train")
def train(cfg: DictConfig):
    model = build_model(cfg.model)
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg.lr)
```
```python
# GOOD: Use experiment tracking
import wandb

wandb.init(project="my-project", config=cfg)
for epoch in range(epochs):
    loss = train_epoch(model, dataloader)
    wandb.log({"loss": loss, "epoch": epoch})
wandb.finish()
```
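For the model-versioning row, MLflow (already listed under Recommended Actions) is one option. A minimal sketch, where the experiment name, logged values, and registry name are placeholders and a tracking server with a model registry is assumed:

```python
import mlflow
import mlflow.pytorch

mlflow.set_experiment("my-project")

with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "batch_size": 32})
    mlflow.log_metric("loss", float(loss))
    # Logs the model as a versioned artifact; registering it creates a new
    # version under the given registry name on every run.
    mlflow.pytorch.log_model(model, "model", registered_model_name="my-model")
```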
## ML Code Review Results
**Project**: [name]
**Framework**: PyTorch/TensorFlow/scikit-learn
**Task**: Classification/Regression/NLP/CV
**Files Analyzed**: X
### Model Architecture
| Status | File | Issue |
|--------|------|-------|
| MEDIUM | models/resnet.py | Missing dropout for regularization |
| LOW | models/transformer.py | Consider gradient checkpointing |
### Training Loop
| Status | File | Issue |
|--------|------|-------|
| HIGH | train.py | Missing model.eval() in validation (line 45) |
| HIGH | train.py | No gradient clipping (line 67) |
### Data Pipeline
| Status | File | Issue |
|--------|------|-------|
| CRITICAL | data/dataset.py | fit_transform on test data (line 23) |
| HIGH | data/loader.py | DataLoader num_workers=0 |
### MLOps
| Status | File | Issue |
|--------|------|-------|
| HIGH | train.py | No seed setting for reproducibility |
| MEDIUM | train.py | Hardcoded hyperparameters |
### Recommended Actions
1. [ ] Add model.eval() and torch.no_grad() for inference
2. [ ] Fix data leakage in preprocessing
3. [ ] Set random seeds for reproducibility
4. [ ] Add experiment tracking (wandb/mlflow)
**Related skills**:
- python-reviewer skill: General Python code quality
- python-data-reviewer skill: Data preprocessing patterns
- test-generator skill: ML test generation
- docker-reviewer skill: ML containerization