Skill

split-strategy

From ds

Selects and implements train/validation/test split strategies based on data characteristics like time, groups, imbalance, and size. Guides sklearn usage for model evaluation frameworks.

Python

Pandas

ai-ml


Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ds:split-strategy

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Select the right train/validation/test split based on your data characteristics. Follow the decision tree below.

SKILL.md

85 lines · ~739 tokens

Stats

Language: Python

Stars: 11

Forks: 1

Maintenance: Excellent

Last Commit: Feb 25, 2026


Split Strategy

Select the right train/validation/test split based on your data characteristics. Follow the decision tree below.

Decision Tree

1. Is there a time dimension?

Yes -> Temporal split: Train on past, validate on recent, test on most recent. Never shuffle across time.

Use TimeSeriesSplit for cross-validation
Set the embargo gap (the gap parameter of TimeSeriesSplit) to at least the largest feature look-back window
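
A minimal sketch of a temporal cross-validation with an embargo gap. The synthetic data and the 7-row `lookback` are assumptions made for illustration; the only library feature relied on is the `gap` parameter of `TimeSeriesSplit`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered data: row i is observed before row i + 1.
n_rows = 100
X = np.arange(n_rows).reshape(-1, 1)

# Hypothetical: the longest rolling-window feature looks back 7 rows.
lookback = 7

tscv = TimeSeriesSplit(n_splits=4, gap=lookback)
folds = list(tscv.split(X))

for train_idx, val_idx in folds:
    # Training rows always precede validation rows, and `lookback`
    # rows between the two blocks are dropped from both.
    print(f"train ends at row {train_idx[-1]}, validation starts at row {val_idx[0]}")
```

The dropped rows keep a validation feature from being computed over a window that overlaps the training period.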

No -> Continue to next question.

2. Are observations grouped?

Examples: multiple rows per customer, multiple images per patient, repeated measurements.

Yes -> Group-aware split: Keep all observations of a group in the same fold.

Use GroupKFold or GroupShuffleSplit
Never let the same group appear in both train and test
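
A small sketch of a single group-aware holdout using `GroupShuffleSplit`, followed by a check that no group leaks across the split. The customer data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows belonging to 40 customers (5 rows each).
groups = np.repeat(np.arange(40), 5)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# test_size applies to groups, not rows: about 20% of customers are held out.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# No customer should appear on both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
print(f"groups shared between train and test: {len(shared)}")
```

Running the overlap check after every group split is cheap insurance against passing the wrong `groups` array.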

No -> Continue.

3. Is the target imbalanced?

Rule of thumb: the minority class makes up less than 10% of all rows.

Yes -> Stratified split: Preserve class ratios across folds.

Use StratifiedKFold or StratifiedShuffleSplit
Combine with group awareness if needed: StratifiedGroupKFold
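
To illustrate what "preserve class ratios" means in practice, the sketch below builds a synthetic target with a 5% minority class (an assumption for the example) and prints the positive rate in each `StratifiedKFold` validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced target: 50 positives out of 1,000 rows (5%).
y = np.zeros(1000, dtype=int)
y[:50] = 1
X = np.arange(1000).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_rates = []
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold should show roughly the overall positive rate.
    fold_rates.append(y[val_idx].mean())

print("positive rate per validation fold:", fold_rates)
```

When rows are also grouped, `StratifiedGroupKFold` takes an additional `groups` argument in `split` and balances class ratios as well as it can while keeping groups intact.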

No -> Simple random split: Standard train_test_split with a fixed random_state.

4. Is the dataset small?

Fewer than 5,000 rows.

Yes -> Cross-validation: Use 5-fold or 10-fold CV instead of a single holdout. Report mean and std of metrics.

No -> Single holdout is fine (70/15/15 or 80/10/10).
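
For the small-data branch, a minimal sketch of 5-fold cross-validation that reports the mean and standard deviation of a metric. The dataset, the logistic regression model, and the accuracy metric are placeholders chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a small dataset (well under 5,000 rows).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

# One score per fold; report the spread as well as the average.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```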

Split Ratios

Dataset Size	Recommended Split	Notes
<1,000	Leave-one-out or 10-fold CV	Every data point matters
1,000-10,000	5-fold CV or 80/10/10	CV preferred for reliable estimates
10,000-100,000	80/10/10	Single holdout usually sufficient
>100,000	90/5/5 or 95/2.5/2.5	Even 2.5% leaves at least 2,500 test rows, so large test fractions are unnecessary
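
`train_test_split` produces only two parts, so a three-way split such as 80/10/10 takes two calls. The helper below, `three_way_split`, is a hypothetical name and one possible sketch; it passes integer row counts to avoid rounding surprises with fractional sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_frac=0.1, test_frac=0.1, seed=42):
    """Split into train/validation/test with fractions of the full dataset."""
    n = len(X)
    n_val = int(round(val_frac * n))
    n_test = int(round(test_frac * n))
    # First carve off the test set, then take validation from the remainder.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Synthetic data: 10,000 rows, 2 features.
X = np.arange(20000).reshape(10000, 2)
y = np.arange(10000) % 2

X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y)
print(len(X_train), len(X_val), len(X_test))
```

This sketch uses a plain random split; for imbalanced targets, `stratify` would need to be passed to both calls.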

Common Mistakes

Shuffling time-series data -- Destroys temporal structure, causes leakage
Fitting preprocessors before splitting -- Scalers and encoders must be fit only on training data, otherwise test-set statistics leak into training
Using test set for tuning -- Test set should be touched only once, at the very end
Ignoring groups -- Correlated observations in different folds inflate performance estimates
Not setting random seed -- Results are not reproducible without random_state
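
One way to avoid the preprocessing and test-set mistakes above is to wrap preprocessing and model in a scikit-learn pipeline, so the scaler is refit on each training fold only. The data and model below are placeholders for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

# Split first; nothing is fit before this point.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The scaler lives inside the pipeline, so each CV fold fits it
# on that fold's training portion only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Touch the test set once, after model selection is finished.
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_score:.3f}")
```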

Implementation

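from sklearn.model_selection import train_test_split, GroupKFold, TimeSeriesSplit

# Assumes X is a pandas DataFrame, y a pandas Series, and groups holds one group label per row.

# Simple stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Temporal split (rows must already be sorted by time; pass gap=... to add an embargo)
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Fold-local names so the holdout split above is not overwritten
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]

# Group-aware split (no group appears in both train and validation)
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]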
