Comprehensive feature engineering for ML pipelines: data quality assessment, feature creation, selection, transformation, and encoding. Activates for "feature engineering", "create features", "feature selection", "data preprocessing", "handle missing values", "encode categorical", "scale features", "feature importance". Ensures features are production-ready with automated validation, documentation, and integration with SpecWeave increments.
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Feature engineering often makes the difference between mediocre and excellent ML models. This skill transforms raw data into model-ready features through systematic data quality assessment, feature creation, selection, and transformation—all integrated with SpecWeave's increment workflow.
Before creating features, understand your data:
```python
from specweave import DataQualityReport

# Automated data quality check
report = DataQualityReport(df, increment="0042")

# Generates:
# - Missing value analysis
# - Outlier detection
# - Data type validation
# - Distribution analysis
# - Correlation matrix
# - Duplicate detection
```
Quality Report Output:
```markdown
# Data Quality Report

## Dataset Overview
- Rows: 100,000
- Columns: 45
- Memory: 34.2 MB

## Missing Values
| Column          | Missing | Percentage |
|-----------------|---------|------------|
| email           | 15,234  | 15.2%      |
| phone           | 8,901   | 8.9%       |
| purchase_date   | 0       | 0.0%       |

## Outliers Detected
- transaction_amount: 234 outliers (>3 std dev)
- user_age: 12 outliers (<18 or >100)

## Data Type Issues
- user_id: Stored as float, should be int
- date_joined: Stored as string, should be datetime

## Recommendations
1. Impute email/phone or create "missing" indicator features
2. Cap/remove outliers in transaction_amount
3. Convert data types for efficiency
```
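If you want to see roughly what such a report computes, the core checks are a few lines of pandas. A minimal sketch, not the SpecWeave implementation (assumes a pandas DataFrame `df`; the outlier rule mirrors the 3-standard-deviation threshold above):

```python
import numpy as np
import pandas as pd

def quick_quality_report(df: pd.DataFrame) -> None:
    # Dataset overview
    print(f"Rows: {len(df):,}  Columns: {df.shape[1]}")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

    # Missing values per column
    missing = df.isna().sum()
    pct = (missing / len(df) * 100).round(1)
    print(pd.DataFrame({"missing": missing, "pct": pct}).query("missing > 0"))

    # Duplicate rows
    print(f"Duplicate rows: {df.duplicated().sum()}")

    # Outliers: values more than 3 standard deviations from the column mean
    for col in df.select_dtypes(include=np.number).columns:
        z = (df[col] - df[col].mean()) / df[col].std()
        n_outliers = int((z.abs() > 3).sum())
        if n_outliers:
            print(f"{col}: {n_outliers} outliers (>3 std dev)")
```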
Create features from domain knowledge:
```python
from specweave import FeatureCreator

creator = FeatureCreator(df, increment="0042")

# Temporal features (from datetime)
creator.add_temporal_features(
    date_column="purchase_date",
    features=["hour", "day_of_week", "month", "is_weekend", "is_holiday"]
)

# Aggregation features (user behavior)
creator.add_aggregation_features(
    group_by="user_id",
    target="purchase_amount",
    aggs=["mean", "std", "count", "min", "max"]
)
# Creates: user_purchase_amount_mean, user_purchase_amount_std, etc.

# Interaction features
creator.add_interaction_features(
    features=[("age", "income"), ("clicks", "impressions")],
    operations=["multiply", "divide", "subtract"]
)
# Creates: age_x_income, clicks_per_impression, etc.

# Ratio features
creator.add_ratio_features([
    ("revenue", "cost"),
    ("conversions", "visits")
])
# Creates: revenue_to_cost_ratio, conversion_rate

# Binning (discretization)
creator.add_binned_features(
    column="age",
    bins=[0, 18, 25, 35, 50, 65, 100],
    labels=["child", "young_adult", "adult", "middle_aged", "senior", "elderly"]
)

# Text features (from text columns)
creator.add_text_features(
    column="product_description",
    features=["length", "word_count", "unique_words", "sentiment"]
)

# Generate all features
df_enriched = creator.generate()

# Auto-documents in increment folder
creator.save_feature_definitions(
    path=".specweave/increments/0042.../features/feature_definitions.yaml"
)
```
Feature Definitions (auto-generated):
```yaml
# .specweave/increments/0042.../features/feature_definitions.yaml
features:
  - name: purchase_hour
    type: temporal
    source: purchase_date
    description: Hour of purchase (0-23)

  - name: user_purchase_amount_mean
    type: aggregation
    source: purchase_amount
    group_by: user_id
    description: Average purchase amount per user

  - name: age_x_income
    type: interaction
    sources: [age, income]
    operation: multiply
    description: Product of age and income

  - name: conversion_rate
    type: ratio
    sources: [conversions, visits]
    description: Conversion rate (conversions / visits)
```
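To prototype the same transformations without the SpecWeave wrapper, pandas covers the common cases. A hedged sketch reusing the column names from the examples above (holiday flags omitted):

```python
import pandas as pd

df["purchase_date"] = pd.to_datetime(df["purchase_date"])

# Temporal features
df["purchase_hour"] = df["purchase_date"].dt.hour
df["day_of_week"] = df["purchase_date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

# Aggregation features: per-user purchase statistics
user_stats = (
    df.groupby("user_id")["purchase_amount"]
      .agg(["mean", "std", "count"])
      .add_prefix("user_purchase_amount_")
      .reset_index()
)
df = df.merge(user_stats, on="user_id", how="left")

# Interaction and ratio features
df["age_x_income"] = df["age"] * df["income"]
df["conversion_rate"] = df["conversions"] / df["visits"].where(df["visits"] > 0)

# Binning
df["age_bucket"] = pd.cut(
    df["age"],
    bins=[0, 18, 25, 35, 50, 65, 100],
    labels=["child", "young_adult", "adult", "middle_aged", "senior", "elderly"],
)
```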
Reduce dimensionality, improve performance:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from specweave import FeatureSelector

selector = FeatureSelector(X_train, y_train, increment="0042")

# Method 1: Correlation-based (remove redundant features)
selector.remove_correlated_features(threshold=0.95)
# Removes features with >95% correlation

# Method 2: Variance-based (remove near-constant features)
selector.remove_low_variance_features(threshold=0.01)
# Removes features with variance below 0.01

# Method 3: Statistical tests
selector.select_by_statistical_test(k=50)
# SelectKBest with chi2/f_classif

# Method 4: Model-based (tree importance)
selector.select_by_model_importance(
    model=RandomForestClassifier(),
    threshold=0.01
)
# Removes features with <1% importance

# Method 5: Recursive Feature Elimination
selector.select_by_rfe(
    model=LogisticRegression(),
    n_features=30
)

# Get selected features
selected_features = selector.get_selected_features()

# Generate selection report
selector.generate_report()
```
Feature Selection Report:
```markdown
# Feature Selection Report

## Original Features: 125
## Selected Features: 35 (72% reduction)

## Selection Process
1. Removed 12 correlated features (>95% correlation)
2. Removed 8 low-variance features
3. Statistical test: Selected top 50 (chi-squared)
4. Model importance: Removed 15 low-importance features (<1%)

## Top 10 Features (by importance)
1. user_purchase_amount_mean (0.18)
2. days_since_last_purchase (0.12)
3. total_purchases (0.10)
4. age_x_income (0.08)
5. conversion_rate (0.07)
...

## Removed Features
- user_id_hash (constant)
- temp_feature_1 (99% correlated with temp_feature_2)
- random_noise (0% importance)
...
```
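The same stages map onto scikit-learn primitives. A sketch of the variance, univariate, and model-importance steps; correlation pruning is written by hand since scikit-learn has no built-in for it:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    SelectFromModel, SelectKBest, VarianceThreshold, f_classif,
)

def drop_correlated(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    # Drop one feature from every pair correlated above the threshold
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=to_drop)

X_reduced = drop_correlated(X_train)

# Remove near-constant features
vt = VarianceThreshold(threshold=0.01)
X_reduced = pd.DataFrame(
    vt.fit_transform(X_reduced),
    columns=X_reduced.columns[vt.get_support()],
)

# Keep the strongest univariate features
kbest = SelectKBest(f_classif, k=min(50, X_reduced.shape[1])).fit(X_reduced, y_train)
X_reduced = X_reduced.loc[:, kbest.get_support()]

# Drop features below 1% of total tree importance
sfm = SelectFromModel(RandomForestClassifier(random_state=0), threshold=0.01)
sfm.fit(X_reduced, y_train)
selected = X_reduced.columns[sfm.get_support()]
```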
Scale, normalize, encode for model compatibility:
```python
import numpy as np
from specweave import FeatureTransformer

transformer = FeatureTransformer(increment="0042")

# Numerical transformations
transformer.add_numerical_transformer(
    columns=["age", "income", "purchase_amount"],
    method="standard_scaler"  # Or: min_max, robust, quantile
)

# Categorical encoding
transformer.add_categorical_encoder(
    columns=["country", "device_type", "product_category"],
    method="onehot",  # Or: label, target, binary
    handle_unknown="ignore"
)

# Ordinal encoding (for ordered categories)
transformer.add_ordinal_encoder(
    column="education",
    order=["high_school", "bachelors", "masters", "phd"]
)

# Log transformation (for skewed distributions)
transformer.add_log_transform(
    columns=["transaction_amount", "page_views"],
    method="log1p"  # log(1 + x) to handle zeros
)

# Box-Cox transformation (for normalization)
transformer.add_power_transform(
    columns=["revenue", "engagement_score"],
    method="box-cox"
)

# Custom transformation
def clip_outliers(x):
    return np.clip(x, x.quantile(0.01), x.quantile(0.99))

transformer.add_custom_transformer(
    columns=["outlier_prone_feature"],
    func=clip_outliers
)

# Fit on train, then transform both splits
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)

# Save the fitted transformer pipeline
transformer.save(
    path=".specweave/increments/0042.../features/transformer.pkl"
)
```
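Conceptually this maps onto scikit-learn's `ColumnTransformer`. A minimal sketch of an equivalent pipeline, assuming the same column names as above (not the actual SpecWeave internals):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    FunctionTransformer, OneHotEncoder, OrdinalEncoder, StandardScaler,
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income", "purchase_amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["country", "device_type", "product_category"]),
    ("ord", OrdinalEncoder(categories=[["high_school", "bachelors", "masters", "phd"]]),
     ["education"]),
    ("log", FunctionTransformer(np.log1p), ["transaction_amount", "page_views"]),
])

# Fit on train only; reuse the fitted object on test/production data
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
```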
Ensure features are production-ready:
```python
from specweave import FeatureValidator

validator = FeatureValidator(
    X_train, X_test,
    increment="0042"
)

# Check for data leakage
leakage_report = validator.check_data_leakage()
# Detects: perfectly correlated features, future data in training

# Check for distribution drift
drift_report = validator.check_distribution_drift()
# Compares train vs test distributions

# Check for missing values after transformation
missing_report = validator.check_missing_values()

# Check for infinite/NaN values
invalid_report = validator.check_invalid_values()

# Generate validation report
validator.generate_report()
```
Validation Report:
```markdown
# Feature Validation Report

## Data Leakage: ✅ PASS
No perfect correlations detected between train and test.

## Distribution Drift: ⚠️ WARNING
Features with significant drift (KS test p < 0.05):
- user_age: p=0.023 (minor drift)
- device_type: p=0.001 (major drift)

Recommendation: Check if test data is from a different time period.

## Missing Values: ✅ PASS
No missing values after transformation.

## Invalid Values: ✅ PASS
No infinite or NaN values detected.

## Overall: READY FOR TRAINING
2 warnings, 0 critical issues.
```
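The drift check is a two-sample Kolmogorov-Smirnov test per feature. A minimal version with scipy for numeric columns (a small p-value means the train and test distributions likely differ):

```python
from scipy.stats import ks_2samp

def drift_check(X_train, X_test, alpha: float = 0.05) -> dict:
    # Flag numeric features whose train/test distributions differ significantly
    drifted = {}
    for col in X_train.select_dtypes(include="number").columns:
        stat, p_value = ks_2samp(X_train[col].dropna(), X_test[col].dropna())
        if p_value < alpha:
            drifted[col] = round(p_value, 4)
    return drifted

print(drift_check(X_train, X_test))  # e.g. {'user_age': 0.023}
```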
Every step is logged as an experiment tied to the increment:

```python
from specweave import track_experiment

# All feature engineering steps logged to the increment
with track_experiment("feature-engineering-v1", increment="0042") as exp:
    # Create features
    df_enriched = creator.generate()

    # Select features
    selected = selector.select()

    # Transform features
    X_transformed = transformer.fit_transform(X)

    # Validate
    validation = validator.validate()

    # Auto-logs:
    exp.log_param("original_features", 125)
    exp.log_param("created_features", 45)
    exp.log_param("selected_features", 35)
    exp.log_metric("feature_reduction", 0.72)
    exp.save_artifact("feature_definitions.yaml")
    exp.save_artifact("transformer.pkl")
    exp.save_artifact("validation_report.md")
```
After completing feature engineering, sync the living docs:

```bash
/sw:sync-docs update
```

This updates:
```markdown
<!-- .specweave/docs/internal/architecture/feature-engineering.md -->
## Recommendation Model Features (Increment 0042)

### Feature Engineering Pipeline
1. Data Quality: 100K rows, 45 columns
2. Created: 45 new features (temporal, aggregation, interaction)
3. Selected: 35 features (72% reduction via importance + RFE)
4. Transformed: StandardScaler for numerical, OneHot for categorical

### Key Features
- user_purchase_amount_mean: Average user spend (top feature, 18% importance)
- days_since_last_purchase: Recency indicator (12% importance)
- age_x_income: Interaction feature (8% importance)

### Feature Store
All features documented in: `.specweave/increments/0042.../features/`
- feature_definitions.yaml: Feature catalog
- transformer.pkl: Production transformation pipeline
- validation_report.md: Quality checks
```
Document the rationale behind every feature:

```python
# Bad: create features without explanation
df["feature_1"] = df["col_a"] * df["col_b"]

# Good: document why features were created
creator.add_interaction_feature(
    sources=["age", "income"],
    operation="multiply",
    rationale="High-income older users have different behavior patterns"
)
```
Handle missing values deliberately; each option encodes a different assumption:

```python
from sklearn.ensemble import RandomForestRegressor

# 1. Imputation (mean, median, mode)
creator.impute_missing(column="age", strategy="median")

# 2. Indicator features (flag missingness as signal)
creator.add_missing_indicator(column="email")
# Creates: email_missing (0/1)

# 3. Forward/backward fill (for time series)
creator.fill_missing(column="sensor_reading", method="ffill")

# 4. Model-based imputation
creator.impute_with_model(column="income", model=RandomForestRegressor())
```
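With plain scikit-learn, options 1 and 2 combine into one step: `SimpleImputer`'s `add_indicator` flag appends a missing-value indicator column alongside the imputed values:

```python
from sklearn.impute import SimpleImputer

# Median imputation plus a binary "was missing" indicator column
imputer = SimpleImputer(strategy="median", add_indicator=True)
age_imputed = imputer.fit_transform(df[["age"]])
```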
```python
# ❌ WRONG: fit on all data (includes the test set!)
scaler.fit(X)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# ✅ CORRECT: fit only on train, transform both
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# SpecWeave's transformer enforces this pattern
transformer.fit_transform(X_train)  # Fits
transformer.transform(X_test)       # Only transforms
```
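A scikit-learn `Pipeline` makes the correct pattern structural: during cross-validation the scaler is refit inside each training fold, so statistics from held-out rows can never leak in. A minimal sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_train, y_train, cv=5)  # scaler refit per fold
```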
```python
# Version features with the increment
transformer.save(
    path=".specweave/increments/0042.../features/transformer-v1.pkl",
    metadata={
        "version": "v1",
        "features": selected_features,
        "transformations": ["standard_scaler", "onehot"]
    }
)

# Load a specific version for reproducibility
transformer_v1 = FeatureTransformer.load(
    ".specweave/increments/0042.../features/transformer-v1.pkl"
)
```
```python
# Before deploying, test the pipeline on a sample of real production data
X_production_sample = load_production_data()

try:
    X_transformed = transformer.transform(X_production_sample)
except Exception as e:
    raise FeatureEngineeringError(f"Failed on production data: {e}")

# Check for unexpected values
validator = FeatureValidator(X_train, X_production_sample)
validation_report = validator.validate()

if validation_report["status"] == "CRITICAL":
    raise FeatureEngineeringError("Feature engineering failed validation")
```
```python
# For e-commerce / customer analytics
creator.add_rfm_features(
    user_id="user_id",
    transaction_date="purchase_date",
    transaction_amount="purchase_amount"
)
# Creates:
# - recency: days since last purchase
# - frequency: total purchases
# - monetary: total spend
```
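RFM is one groupby in pandas. A sketch assuming one row per transaction (`now` anchors recency to the latest date in the data):

```python
now = df["purchase_date"].max()
rfm = df.groupby("user_id").agg(
    recency=("purchase_date", lambda d: (now - d.max()).days),
    frequency=("purchase_date", "count"),
    monetary=("purchase_amount", "sum"),
)
```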
```python
# For time series
creator.add_rolling_features(
    column="daily_sales",
    windows=[7, 14, 30],
    aggs=["mean", "std", "min", "max"]
)
# Creates: daily_sales_7day_mean, daily_sales_7day_std, etc.
```
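The pandas equivalent, with a one-step shift so each window sees only past values; without the shift, the current row leaks into its own feature:

```python
for window in [7, 14, 30]:
    rolled = df["daily_sales"].shift(1).rolling(window)
    df[f"daily_sales_{window}day_mean"] = rolled.mean()
    df[f"daily_sales_{window}day_std"] = rolled.std()
```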
```python
# Encode a categorical as the target mean (careful: can leak!)
creator.add_target_encoding(
    column="product_category",
    target="purchase_amount",
    cv_folds=5  # Cross-validation to prevent leakage
)
# Creates: product_category_target_encoded
```
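Out-of-fold target encoding written by hand shows what the `cv_folds` option guards against: each row's encoding comes only from folds it was not part of. A sketch:

```python
import numpy as np
from sklearn.model_selection import KFold

encoded = np.zeros(len(df))
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    # Category means computed on the other folds only
    fold_means = df.iloc[train_idx].groupby("product_category")["purchase_amount"].mean()
    encoded[val_idx] = (
        df.iloc[val_idx]["product_category"].map(fold_means)
          .fillna(df.iloc[train_idx]["purchase_amount"].mean())
    )
df["product_category_target_encoded"] = encoded
```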
```python
# For non-linear relationships
creator.add_polynomial_features(
    columns=["age", "income"],
    degree=2,
    interaction_only=False  # True would drop the squared terms
)
# Creates: age^2, income^2, age*income
```
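The scikit-learn primitive underneath, for reference:

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[["age", "income"]])
print(poly.get_feature_names_out())
# ['age' 'income' 'age^2' 'age income' 'income^2']
```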
```bash
# Generate feature engineering pipeline for increment
/ml:engineer-features 0042

# Validate features before training
/ml:validate-features 0042

# Generate feature importance report
/ml:feature-importance 0042
```
Feature engineering is where much of an ML project's success is decided. This skill makes it systematic: assess data quality first, create documented features, select against measurable criteria, transform without leakage, and validate before training.

Good features make mediocre models great; great features make them excellent.