From dqx
Profiles Spark DataFrames or Unity Catalog tables and generates DQX data quality rule candidates with summary statistics. Supports sampling, filters, DLT expectations, and AI-assisted variants.
Slash command
/dqx:dqx-profile-and-generate
Typical one-shot bootstrap for a new table:
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.sdk import WorkspaceClient
ws = WorkspaceClient()
profiler = DQProfiler(ws)
generator = DQGenerator(ws)
df = spark.read.table("catalog.schema.input")
# Step 1 — profile. Returns summary stats + DQProfile candidates per column.
# Three entry points, pick by what you have on hand:
# - profiler.profile(df, ...) — in-memory DataFrame
# - profiler.profile_table(input_config=..., ...) — single Unity Catalog table by InputConfig
# - profiler.profile_tables_for_patterns(patterns=["catalog.schema.*"], ...)
#   many tables; returns dict[table_fqn -> (stats, profiles)]
summary_stats, profiles = profiler.profile(df)
# Step 2 — turn candidates into DQX checks (declarative list[dict]).
checks = generator.generate_dq_rules(profiles) # default criticality="error"
# Step 3 — inspect / edit, then persist. See dqx-storage for save targets.
for c in checks:
    print(c)
Profiling is a one-time bootstrap action per dataset. The candidate checks need human review before they are applied; don't auto-apply the raw output to production data.
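The review step can be partly scripted. The following is a minimal sketch, not part of DQX: `downgrade_checks` is a hypothetical helper, and the sample candidates assume the declarative layout (`name`, `criticality`, `check.function`, `check.arguments.column`), so compare it against what `print(c)` actually shows before relying on it.

```python
from copy import deepcopy


def downgrade_checks(checks, columns, criticality="warn"):
    """Return a copy of `checks` with a new criticality for rules on `columns`.

    Hypothetical helper: assumes each rule is a dict whose target column
    lives under rule["check"]["arguments"]["column"].
    """
    reviewed = deepcopy(checks)  # leave the generator output untouched
    for rule in reviewed:
        arguments = rule.get("check", {}).get("arguments", {})
        if arguments.get("column") in columns:
            rule["criticality"] = criticality
    return reviewed


# Hypothetical candidates, shaped like the assumed declarative format.
candidates = [
    {
        "name": "order_id_is_null",
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"column": "order_id"}},
    },
    {
        "name": "total_amount_isnt_in_range",
        "criticality": "error",
        "check": {
            "function": "is_in_range",
            "arguments": {"column": "total_amount", "min_limit": 0, "max_limit": 5000},
        },
    },
]

# Soften the inferred range rule until a human has confirmed the bounds.
reviewed = downgrade_checks(candidates, columns={"total_amount"})
```

Working on a copy keeps the raw generator output available for comparison while the reviewed list is what gets persisted.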
DQProfiler.profile(df, columns=None, options=None) — columns is a top-level kwarg limiting the profiled columns; the following optional keys are set via the options dict:
sample_fraction: float 0–1 (e.g. 0.1 for a 10% sample). Use on large tables.
sample_seed: int; pair with sample_fraction for reproducible runs.
limit: absolute row cap (e.g. 1_000_000).
filter: SQL string applied before profiling ("event_date >= '2026-01-01'").
criticality: default for every generated rule ("error" or "warn", default "error").
summary_stats, profiles = profiler.profile(
    df,
    columns=["order_id", "total_amount", "country_code"],
    options={"sample_fraction": 0.1, "sample_seed": 42, "criticality": "warn"},
)
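Because the options dict is free-form, a typo or an out-of-range value can be easy to miss. A small sketch of a guard that could sit in front of the call: `build_profile_options` is a hypothetical helper, not a DQX API, and the validation rules are assumptions drawn only from the key descriptions above.

```python
def build_profile_options(sample_fraction=None, sample_seed=None, limit=None,
                          filter_sql=None, criticality="error"):
    """Build an options dict using only the keys described above.

    Hypothetical helper: the checks below are assumptions, not DQX behavior.
    """
    if sample_fraction is not None and not 0.0 < sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be in (0, 1]")
    if limit is not None and limit <= 0:
        raise ValueError("limit must be a positive row count")
    if criticality not in ("error", "warn"):
        raise ValueError('criticality must be "error" or "warn"')

    options = {"criticality": criticality}
    if sample_fraction is not None:
        options["sample_fraction"] = sample_fraction
    if sample_seed is not None:
        options["sample_seed"] = sample_seed
    if limit is not None:
        options["limit"] = limit
    if filter_sql is not None:
        options["filter"] = filter_sql  # stored under the documented key name
    return options


opts = build_profile_options(sample_fraction=0.1, sample_seed=42, criticality="warn")
```

The resulting dict would then be passed as `options=opts` to `profiler.profile(...)`.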
from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator
dlt_expectations = DQDltGenerator(ws).generate_dlt_rules(profiles, language="python")
# language can be "python" or "sql"
DQX can generate rules from natural-language requirements via DSPy-backed LLMs. See the companion skills and docs rather than hand-rolling prompts.
databricks labs dqx install # once per workspace
databricks labs dqx profile # all run configs
databricks labs dqx profile --run-config default # one run config
databricks labs dqx profile --run-config default \
  --patterns "main.product001.*;main.product002" \
  --exclude-patterns "*_output;*_quarantine"
The workflow writes the generated candidates + summary stats to the checks_location on the run config (see dqx-storage).
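To get a feel for which tables a pattern pair might select, the include/exclude logic can be approximated locally. This is only an illustration that assumes glob-style matching on fully qualified names; it is not how DQX resolves patterns internally, and `select_tables` and the table list are hypothetical.

```python
from fnmatch import fnmatchcase


def select_tables(tables, patterns, exclude_patterns=""):
    """Keep tables matching any include pattern and no exclude pattern.

    Patterns are semicolon-separated, as in the CLI flags above.
    Glob-style matching is an assumption made for this sketch.
    """
    include = [p for p in patterns.split(";") if p]
    exclude = [p for p in exclude_patterns.split(";") if p]
    return [
        t for t in tables
        if any(fnmatchcase(t, p) for p in include)
        and not any(fnmatchcase(t, p) for p in exclude)
    ]


# Hypothetical table names for illustration only.
tables = [
    "main.product001.orders",
    "main.product001.orders_output",
    "main.product001.orders_quarantine",
    "main.product002",
    "main.product003.items",
]

selected = select_tables(
    tables,
    patterns="main.product001.*;main.product002",
    exclude_patterns="*_output;*_quarantine",
)
```

Under these assumptions the output and quarantine tables drop out even though they match an include pattern, which is the reason for excluding those suffixes when profiling.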
criticality / bounds before rolling to production.
_dq_output / _dq_quarantine suffixes; keep the convention.
limit or sample_fraction against the current backfill.
Canonical docs: https://databrickslabs.github.io/dqx/docs/guide/data_profiling
npx claudepluginhub databrickslabs/dqx --plugin dqx