WHEN: Pandas/NumPy code review, data processing, vectorization, memory optimization WHAT: Vectorization patterns + Memory efficiency + Data validation + Performance optimization + Best practices WHEN NOT: Web framework → fastapi/django/flask-reviewer, General Python → python-reviewer
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Reviews data science code for Pandas/NumPy efficiency, memory usage, and best practices.
**Triggers**: pandas/numpy in requirements.txt, .ipynb Jupyter notebooks, data/ or notebooks/ directories
**Pandas**: 2.0+
**NumPy**: 1.24+
**Other**: polars, dask, vaex
**Visualization**: matplotlib, seaborn, plotly
**ML**: scikit-learn, xgboost
AskUserQuestion:
"Which areas to review?"
Options:
- Full data code review (recommended)
- Vectorization and performance
- Memory optimization
- Data validation
- Code organization
multiSelect: true
| Check | Recommendation | Severity |
|---|---|---|
| iterrows() loop | Use vectorized operations | CRITICAL |
| apply() with simple func | Use built-in vectorized | HIGH |
| Manual loop over array | Use NumPy broadcasting | HIGH |
| List comprehension on Series | Use .map() or vectorize | MEDIUM |
# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
df.loc[idx, "total"] = row["price"] * row["quantity"]
# GOOD: Vectorized operation
df["total"] = df["price"] * df["quantity"]
# BAD: apply with simple operation
df["upper_name"] = df["name"].apply(lambda x: x.upper())
# GOOD: Built-in string method
df["upper_name"] = df["name"].str.upper()
# BAD: apply with condition
df["status"] = df["score"].apply(lambda x: "pass" if x >= 60 else "fail")
# GOOD: np.where or np.select
df["status"] = np.where(df["score"] >= 60, "pass", "fail")
# Multiple conditions
conditions = [
df["score"] >= 90,
df["score"] >= 60,
df["score"] < 60,
]
choices = ["A", "B", "F"]
df["grade"] = np.select(conditions, choices)
| Check | Recommendation | Severity |
|---|---|---|
| int64 for small ints | Use int8/int16/int32 | MEDIUM |
| object dtype for categories | Use category dtype | HIGH |
| Loading full file | Use chunks or usecols | HIGH |
| Keeping unused columns | Drop early | MEDIUM |
# BAD: Default dtypes waste memory
df = pd.read_csv("large_file.csv") # All int64, object
# GOOD: Specify dtypes
dtype_map = {
"id": "int32",
"age": "int8",
"status": "category",
"price": "float32",
}
df = pd.read_csv("large_file.csv", dtype=dtype_map)
# GOOD: Load only needed columns
df = pd.read_csv(
"large_file.csv",
usecols=["id", "name", "price"],
dtype=dtype_map,
)
# GOOD: Process in chunks
chunks = pd.read_csv("huge_file.csv", chunksize=100_000)
result = pd.concat([process_chunk(chunk) for chunk in chunks])
# Memory check (df.info prints its report; no print() wrapper needed)
df.info(memory_usage="deep")
# Convert object to category if low cardinality
for col in df.select_dtypes(include=["object"]).columns:
if df[col].nunique() / len(df) < 0.5: # < 50% unique
df[col] = df[col].astype("category")
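A hedged sketch for downcasting numeric columns on an already-loaded frame (the column selection here is illustrative):
# Downcast oversized numeric dtypes after load
for col in df.select_dtypes(include=["int64"]).columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include=["float64"]).columns:
    df[col] = pd.to_numeric(df[col], downcast="float")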
| Check | Recommendation | Severity |
|---|---|---|
| No null check | Validate nulls early | HIGH |
| No dtype validation | Assert expected types | MEDIUM |
| No range validation | Check value bounds | MEDIUM |
| Silent data issues | Raise or log warnings | HIGH |
import logging

logger = logging.getLogger(__name__)

# GOOD: Data validation function
def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
"""Validate and clean input DataFrame."""
# Required columns
required_cols = ["id", "name", "price", "quantity"]
missing = set(required_cols) - set(df.columns)
if missing:
raise ValueError(f"Missing columns: {missing}")
# Null checks
null_counts = df[required_cols].isnull().sum()
if null_counts.any():
logger.warning(f"Null values found:\n{null_counts[null_counts > 0]}")
# Type validation
assert df["id"].dtype in ["int32", "int64"], "id must be integer"
assert df["price"].dtype in ["float32", "float64"], "price must be float"
# Range validation
invalid_prices = df[df["price"] < 0]
if len(invalid_prices) > 0:
logger.warning(f"Found {len(invalid_prices)} negative prices")
df = df[df["price"] >= 0]
# Duplicate check
duplicates = df.duplicated(subset=["id"])
if duplicates.any():
logger.warning(f"Removing {duplicates.sum()} duplicates")
df = df.drop_duplicates(subset=["id"])
return df
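Typical call site, sketched; the file path is hypothetical:
df = validate_dataframe(pd.read_csv("data/orders.csv"))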
| Check | Recommendation | Severity |
|---|---|---|
| Python loop over array | Use broadcasting | CRITICAL |
| np.append in loop | Pre-allocate array | HIGH |
| Repeated array creation | Reuse arrays | MEDIUM |
| Copy when not needed | Use views | MEDIUM |
# BAD: Python loop
result = []
for i in range(len(arr)):
result.append(arr[i] * 2 + 1)
result = np.array(result)
# GOOD: Vectorized
result = arr * 2 + 1
# BAD: np.append in loop (creates new array each time)
result = np.array([])
for x in data:
result = np.append(result, process(x))
# GOOD: Pre-allocate
result = np.empty(len(data))
for i, x in enumerate(data):
result[i] = process(x)
# NOTE: np.vectorize is only a convenience wrapper, not a true speedup
result = np.vectorize(process)(data)
# BEST: Express process() directly with NumPy ufuncs when possible
result = np.sqrt(data) + np.log(data)
# BAD: Unnecessary copy of a slice
subset = arr[10:100].copy()  # copy() not needed for read-only use
# GOOD: Basic slicing returns a view (no data copied)
subset = arr[10:100]
# NOTE: Boolean indexing (arr[arr > 0]) always returns a copy, not a view
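A sketch of the "reuse arrays" check above, assuming equally sized batches and a hypothetical consume() step:
# Reuse one pre-allocated buffer instead of creating a new array per batch
buf = np.empty(batch_size)
for batch in batches:
    np.multiply(batch, 2.0, out=buf)  # writes into buf, no new allocation
    np.add(buf, 1.0, out=buf)
    consume(buf)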
| Check | Recommendation | Severity |
|---|---|---|
| Chained indexing | Use .loc/.iloc | HIGH |
| inplace=True | Assign result instead | MEDIUM |
| reset_index() abuse | Keep index when useful | LOW |
| df.append (removed in pandas 2.0) | Use pd.concat | HIGH |
# BAD: Chained indexing (assignment may be silently dropped)
df[df["price"] > 100]["status"] = "premium"  # SettingWithCopyWarning
# GOOD: Use .loc
df.loc[df["price"] > 100, "status"] = "premium"
# BAD: df.append (removed in pandas 2.0)
result = pd.DataFrame()
for chunk in chunks:
result = result.append(chunk)
# GOOD: pd.concat
result = pd.concat(chunks, ignore_index=True)
# BAD: Repeated reassignment, harder to read as one pipeline
df = df.dropna()
df = df.reset_index(drop=True)
df = df.sort_values("date")
# GOOD: Method chaining
df = (
df
.dropna()
.reset_index(drop=True)
.sort_values("date")
)
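The inplace=True row from the table above, sketched:
# BAD: inplace=True (usually no memory benefit, and it blocks chaining)
df.dropna(inplace=True)
df.sort_values("date", inplace=True)
# GOOD: Assign the result
df = df.dropna().sort_values("date")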
## Python Data Code Review Results
**Project**: [name]
**Pandas**: 2.1 | **NumPy**: 1.26 | **Size**: ~1M rows
### Vectorization
| Status | File | Issue |
|--------|------|-------|
| CRITICAL | process.py:45 | iterrows() loop - use vectorized ops |
### Memory
| Status | File | Issue |
|--------|------|-------|
| HIGH | load.py:23 | object dtype for 'status' - use category |
### Data Validation
| Status | File | Issue |
|--------|------|-------|
| HIGH | etl.py:67 | No null value handling |
### NumPy
| Status | File | Issue |
|--------|------|-------|
| HIGH | calc.py:12 | np.append in loop - pre-allocate |
### Recommended Actions
1. [ ] Replace iterrows with vectorized operations
2. [ ] Convert low-cardinality columns to category
3. [ ] Add data validation before processing
4. [ ] Pre-allocate NumPy arrays in loops
# Profile memory usage
df.info(memory_usage="deep")
# Profile time
%timeit df.apply(func) # Jupyter magic
%timeit df["col"].map(func)
# Use query() for complex filters
df.query("price > 100 and category == 'A'")
# Use eval() for complex expressions (returns a new frame, so assign it)
df = df.eval("profit = revenue - cost")
- python-reviewer: General Python patterns
- perf-analyzer: Performance profiling
- coverage-analyzer: Test coverage for data code