From analytics
Compares Bayesian models using ArviZ 1.1 LOO-CV, ELPD, stacking, and Bayes factors. Provides Pareto k diagnostics, moment matching, and k-fold cross-validation for PyMC 6 models.
How this skill is triggered — by the user, by Claude, or both
Slash command
/analytics:model-evaluationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
CRITICAL: PyMC 6 returns xarray DataTree objects by default, and ArviZ 1.1 stats/plots are DataTree-first while still accepting idata-like inputs. `az.waic` is removed entirely — use PSIS-LOO-CV exclusively. Default credible intervals are 0.89 ETI, controlled via `ci_prob=` and `ci_kind=` for summaries/plots; low-level `hdi()` uses `prob=`.
CRITICAL: PyMC 6 returns xarray DataTree objects by default, and ArviZ 1.1 stats/plots are DataTree-first while still accepting idata-like inputs. az.waic is removed entirely — use PSIS-LOO-CV exclusively. Default credible intervals are 0.89 ETI, controlled via ci_prob= and ci_kind= for summaries/plots; low-level hdi() uses prob=.
For model building context, prior selection, and convergence diagnostics, see the pymc-modeling skill.
Leave-one-out cross-validation via Pareto-smoothed importance sampling (PSIS).
import arviz_stats as azs
import arviz_plots as azp
# dt is a DataTree from pm.sample()
loo_result = azs.loo(dt)
print(loo_result)
# Returns: ELPDData with elpd, se, p, n_data_points, pareto_k
# Equivalent via the xarray accessor when arviz_stats is imported:
loo_result = dt.azstats.loo()
Pareto k values indicate reliability of PSIS approximation for each observation:
| k value | Interpretation | Action |
|---|---|---|
| k < 0.5 | Good | LOO estimate reliable |
| 0.5 < k < 0.7 | Marginal | Results usable but less accurate |
| 0.7 < k < 1.0 | Bad | Estimate unreliable — use moment matching or k-fold |
| k > 1.0 | Very bad | PSIS fails entirely — must use k-fold CV |
# Check Pareto k values
print(loo_result.pareto_k)
# Plot Pareto k diagnostics
azp.plot_khat(loo_result)
# Count problematic observations
import numpy as np
k_values = loo_result.pareto_k.values
print(f"k > 0.7: {np.sum(k_values > 0.7)} observations")
Automatically refit problematic observations using moment matching:
# Requires log_likelihood in the DataTree
loo_mm = azs.loo_moment_match(dt)
This importance-weights the posterior for each problematic observation, improving the PSIS approximation without refitting the model. Much faster than k-fold.
When LOO is unreliable for many observations, use exact k-fold CV:
# Perform 10-fold cross-validation
kfold_result = azs.loo_kfold(dt, K=10)
print(kfold_result)
This refits the model K times, so it is K times slower than LOO. Use only when LOO diagnostics indicate problems.
Compare multiple models on predictive accuracy:
# dt1, dt2, dt3 are DataTree objects from pm.sample()
comparison = azs.compare(
{"linear": dt1, "quadratic": dt2, "spline": dt3},
)
print(comparison)
Note: compare in ArviZ 1.1 only supports LOO, so the old ic= and scale= arguments have been dropped.
| Column | Meaning |
|---|---|
rank | Model rank (0 = best) |
elpd | Expected log pointwise predictive density |
p | Effective number of parameters |
elpd_diff | Difference in ELPD from the reference model |
weight | Stacking weight (sums to 1) |
se | Standard error of ELPD |
dse | Standard error of the ELPD difference |
diag_elpd | Pareto-k diagnostic issues for each model's ELPD |
diag_diff | Small-data or practically-equivalent-difference diagnostics |
elpd_diff = 0: reference/best model|elpd_diff| < 4: models are practically indistinguishable — prefer simpler one|elpd_diff| > 4 and |elpd_diff / dse| > 2: meaningful difference in predictive accuracydiag_elpd: LOO unreliable for this model — investigate Pareto k values# Visualize comparison
azp.plot_compare(comparison)
# Detailed forest plot of ELPD differences
azp.plot_elpd({"linear": dt1, "quadratic": dt2, "spline": dt3})
See references/model_comparison.md for detailed usage.
Stacking minimizes KL divergence from the true predictive distribution to the weighted mixture. This is the recommended default.
comparison = azs.compare({"m1": dt1, "m2": dt2, "m3": dt3})
# Stacking weights are in the "weight" column by default
print(comparison["weight"])
Alternative weighting based on Bayesian bootstrap of ELPD:
comparison = azs.compare(
{"m1": dt1, "m2": dt2, "m3": dt3},
method="BB-pseudo-BMA",
)
| Method | Use When |
|---|---|
| Stacking | Default. Best for prediction when true model is not in the set |
| Pseudo-BMA+ | Want Bayesian uncertainty over weights |
| Equal weights | Models represent different scientific hypotheses to average over |
weights = comparison["weight"].values
# Manually mix posterior predictive samples
# weighted by stacking weights
See references/stacking.md for detailed averaging workflows.
Bayes factors compare marginal likelihoods. Conceptually different from LOO (predictive accuracy vs. evidence).
# Bayes factors are difficult to compute reliably
# Bridge sampling is the most reliable method but requires specialized setup
# For most applied work, LOO-CV is preferred
# Approximate Bayes factor from LOO (rough):
# BF ~ exp(elpd_m1 - elpd_m2)
# This is a very rough approximation — use with caution
LOO probability integral transform checks if the model is calibrated:
azp.plot_loo_pit(dt, var_names=["observed_data_name"])
This is a powerful diagnostic that LOO uniquely provides — it checks calibration without held-out data.
Compute LOO-weighted posterior expectations (mean, variance, quantile) for each observation. Requires both posterior_predictive and log_likelihood groups on the DataTree:
# LOO-weighted posterior predictive mean for each observation
loo_mean = azs.loo_expectations(dt, kind="mean")
loo_var = azs.loo_expectations(dt, kind="var")
loo_q = azs.loo_expectations(dt, kind="quantile", probs=[0.055, 0.945])
Compute common LOO-based predictive metrics (RMSE, MAE, etc.) from posterior_predictive and log_likelihood:
metrics = azs.loo_metrics(dt, kind="rmse")
import arviz_stats as azs registers an .azstats accessor on DataArray, Dataset, and DataTree. This gives a fluent xarray-native interface alongside the function API:
import arviz_stats as azs
dt.azstats.loo() # same as azs.loo(dt)
dt["posterior"].azstats.rhat() # on a Dataset
dt["posterior"].azstats.ess()
dt["posterior"].azstats.summary()
dt["posterior"].azstats.hdi()
dt["posterior"].azstats.eti()
Bayesian R-squared via LOO:
r2 = azs.loo_r2(dt)
print(f"LOO-R2: {r2.mean():.3f} [{r2.quantile(0.055):.3f}, {r2.quantile(0.945):.3f}]")
Compute LOO-based scoring rules (CRPS, log score):
score = azs.loo_score(dt, score_func="crps")
LOO with subsampling for large datasets:
# When n > 10000, subsample for speed
loo_sub = azs.loo_subsample(dt, observations=1000)
Exact refit LOO for observations with high Pareto k:
# Refits the model for problematic observations
loo_exact = azs.reloo(dt, loo_result, model=model)
import arviz_stats as azs
import arviz_plots as azp
# 1. Compute LOO
loo = azs.loo(dt)
print(loo)
# 2. Check Pareto k
azp.plot_khat(loo)
# 3. If k > 0.7, try moment matching
if (loo.pareto_k > 0.7).any():
loo = azs.loo_moment_match(dt)
# 4. LOO-PIT calibration
azp.plot_loo_pit(dt, var_names=["y"])
# 5. Compare models
comparison = azs.compare({"model_a": dt_a, "model_b": dt_b})
azp.plot_compare(comparison)
print(comparison)
# 6. Predictive R2
r2 = azs.loo_r2(dt)
npx claudepluginhub pymc-labs/python-analytics-skills --plugin analyticsBayesian modeling with PyMC: hierarchical models, MCMC (NUTS), variational inference, LOO/WAIC comparison, posterior predictive checks. Use for fitting Bayesian models and estimating posteriors.
Builds and fits Bayesian models using PyMC: hierarchical models, MCMC (NUTS), variational inference, LOO/WAIC comparison, posterior checks for probabilistic programming.
Builds Bayesian models with PyMC: hierarchical models, MCMC (NUTS), variational inference, LOO/WAIC comparison, posterior checks.