Skill

aidp-data-quality

From oracle-ai-data-platform-workbench-engineer-agent

Validates AIDP tables against data-quality rules (not-null, uniqueness, range/set, referential integrity, freshness) using bounded Spark SQL. Reports pass/fail with violation counts and can persist rule sets for re-runs.

Python

SQL

data-engineering

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/oracle-ai-data-platform-workbench-engineer-agent:aidp-data-quality

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Validate AIDP tables against explicit data-quality rules, each compiled to bounded Spark SQL and executed

SKILL.md

57 lines · ~939 tokens

Stats

Language: Python

Parent stars: 36

Parent forks: 21

Maintenance: Good

Last Commit: Jun 12, 2026

`aidp-data-quality` — rule checks via Spark SQL

Validate AIDP tables against explicit data-quality rules, each compiled to bounded Spark SQL and executed with the bundled helper — no MCP and no ai-data-engineer-agent repo required.

When to use

Use this skill for requests such as "check for nulls/duplicates", "validate <table>", "are there orphan rows", or "is the data fresh", or when gating a pipeline on quality.

Rule types (each → a counting SQL that should return 0 violations)

| Rule | Check (violations) |
| --- | --- |
| not-null | `COUNT(*) WHERE col IS NULL` |
| unique | `COUNT(*) - COUNT(DISTINCT key)` (or `GROUP BY key HAVING COUNT(*) > 1`) |
| range / set | `COUNT(*) WHERE col NOT BETWEEN lo AND hi` / `col NOT IN (...)` |
| referential | `COUNT(*) FROM child LEFT JOIN parent ... WHERE parent.key IS NULL` |
| freshness | `MAX(ts)` vs SLA (e.g. passes when `datediff(current_date, MAX(ts)) <= N`) |
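As a rough illustration of how each rule type might compile to a single violation-count query, here is a minimal Python sketch. The function names are hypothetical and not part of the skill; identifiers are interpolated as-is, so they are assumed to come from the catalog rather than from untrusted input. The freshness check is folded into a 0/1 "violation" value so that it follows the same PASS-when-0 convention as the other rules, which is one possible framing rather than the skill's documented behaviour.

```python
from typing import Sequence


def not_null_sql(table: str, col: str) -> str:
    # Rows where the column is NULL are violations.
    return f"SELECT COUNT(*) AS v FROM {table} WHERE {col} IS NULL"


def unique_sql(table: str, key: str) -> str:
    # Counts distinct key values that occur more than once.
    return (
        f"SELECT COUNT(*) AS v FROM "
        f"(SELECT {key} FROM {table} GROUP BY {key} HAVING COUNT(*) > 1) d"
    )


def range_sql(table: str, col: str, lo: float, hi: float) -> str:
    # Rows outside the closed interval [lo, hi] are violations.
    return f"SELECT COUNT(*) AS v FROM {table} WHERE {col} NOT BETWEEN {lo} AND {hi}"


def set_sql(table: str, col: str, allowed: Sequence[str]) -> str:
    # Assumption: allowed values are plain strings without quotes or
    # backslashes, so no SQL escaping is attempted here.
    for value in allowed:
        if "'" in value or "\\" in value:
            raise ValueError(f"unsupported character in allowed value: {value!r}")
    quoted = ", ".join(f"'{value}'" for value in allowed)
    return f"SELECT COUNT(*) AS v FROM {table} WHERE {col} NOT IN ({quoted})"


def referential_sql(child: str, child_key: str, parent: str, parent_key: str) -> str:
    # Child rows with no matching parent row are violations.
    return (
        f"SELECT COUNT(*) AS v FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{child_key} = p.{parent_key} "
        f"WHERE p.{parent_key} IS NULL"
    )


def freshness_sql(table: str, ts: str, max_age_days: int) -> str:
    # Returns 0 when the newest timestamp is within the SLA, else 1.
    # An empty table yields MAX(ts) = NULL, which falls through to 1.
    return (
        f"SELECT CASE WHEN datediff(current_date(), MAX({ts})) <= {int(max_age_days)} "
        f"THEN 0 ELSE 1 END AS v FROM {table}"
    )
```

For example, `not_null_sql("cat.sch.t", "col")` produces the same query used in the helper invocation shown in the workflow.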

Workflow

1. Resolve table(s)/columns; use join keys from .aidp/catalog.md for referential checks (don't guess). Pull rule definitions from .aidp/semantic.md value dictionaries where available.
2. Ensure the cluster is RUNNING (aidp-cluster-ops / oci raw-request), then for each rule run the violation-count SQL with the bundled helper (PASS if 0, else FAIL):
   ```
   python "$PLUGIN_DIR/scripts/aidp_sql.py" --region <region> --datalake <DATALAKE_OCID> --workspace <ws> \
     --cluster <cluster-key> \
     --code "spark.sql('''SELECT COUNT(*) AS v FROM cat.sch.t WHERE col IS NULL''').show()"
   ```
   It mints a UPST from the api_key DEFAULT profile, auto-creates a scratch notebook, and returns JSON with status / outputs / spark_job_ids. No AIDP_SESSION required (--session-profile optional).
3. On a non-zero count, FAIL and pull a few example offending rows with a separate bounded LIMIT query.
4. Report a summary table: rule · target · result · violation count.
5. Offer to (a) persist the rule set for re-runs (see below), and (b) wire checks into a Job (aidp-pipelines) as a gating task.
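The workflow above could be scripted roughly as follows. This is a sketch under stated assumptions: the helper names (`build_helper_argv`, `offenders_sql`, `summarize`) are hypothetical, the command-line flags are copied from the invocation shown in the workflow, and the step that actually runs the helper and extracts the count from its JSON is deliberately left out because the exact output schema is not specified here.

```python
from typing import List, Sequence, Tuple


def build_helper_argv(plugin_dir: str, region: str, datalake: str,
                      workspace: str, cluster: str, sql: str) -> List[str]:
    # Wraps the SQL in the same spark.sql('''...''').show() form used above.
    if "'''" in sql:
        raise ValueError("SQL must not contain a triple single quote")
    code = f"spark.sql('''{sql}''').show()"
    return [
        "python", f"{plugin_dir}/scripts/aidp_sql.py",
        "--region", region,
        "--datalake", datalake,
        "--workspace", workspace,
        "--cluster", cluster,
        "--code", code,
    ]


def offenders_sql(table: str, predicate: str, limit: int = 5) -> str:
    # Bounded sample of offending rows; never an unbounded SELECT *.
    return f"SELECT * FROM {table} WHERE {predicate} LIMIT {int(limit)}"


def summarize(results: Sequence[Tuple[str, str, int]]) -> str:
    # results holds (rule, target, violation_count) from completed runs.
    lines = ["rule · target · result · violation count"]
    for rule, target, count in results:
        verdict = "PASS" if count == 0 else "FAIL"
        lines.append(f"{rule} · {target} · {verdict} · {count}")
    return "\n".join(lines)
```

The argument list can be passed to `subprocess.run` without a shell, which avoids quoting problems with the embedded triple-quoted SQL. Only counts obtained from a completed helper run should be fed into `summarize`.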

Persisting a re-runnable rule set

Register validated rules in .aidp/dq-rules.md so they can be re-run later (the quality analogue of .aidp/verified-queries.md). One entry per rule records the target table/column, rule-type (the five types above), the violation-SQL (counts violations → PASS when 0), and last-result / last-checked. To re-run, execute each entry's stored violation-SQL via scripts/aidp_sql.py, set the result to PASS (0) or FAIL (<count>), and record the cluster + date — never mark PASS without a status: ok run returning 0. Format and re-run rules: references/dq-rules.md.
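To make the re-run bookkeeping concrete, here is a minimal in-memory sketch. The `DqRule` class and `record_run` function are hypothetical illustrations; the actual on-disk layout of .aidp/dq-rules.md is defined in references/dq-rules.md and is not reproduced here. The sketch only shows the guard described above: a result is recorded only after a run that reported `status: ok`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DqRule:
    target: str                      # table or table.column under test
    rule_type: str                   # one of the five rule types
    violation_sql: str               # counts violations; PASS when 0
    last_result: Optional[str] = None
    last_checked: Optional[str] = None


def record_run(rule: DqRule, status: str, count: Optional[int],
               cluster: str, date: str) -> DqRule:
    # Refuse to record anything unless the run itself succeeded.
    if status != "ok" or count is None:
        raise ValueError("no 'status: ok' run with a count; result not recorded")
    rule.last_result = "PASS (0)" if count == 0 else f"FAIL ({count})"
    rule.last_checked = f"{date} on {cluster}"
    return rule
```

Raising on a failed run, instead of silently keeping the previous result, makes it harder for a stale PASS to be mistaken for a fresh one.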

Reliability rules

Run real SQL via scripts/aidp_sql.py; never assert a rule passed without a status: ok result.
Keep checks bounded; sample example offenders rather than dumping full result sets.
If a cell returns status: error, read the error, fix the SQL grounded in the catalog, and retry.

References

references/dq-rules.md (.aidp/dq-rules.md rule-set format + re-run)
scripts/aidp_sql.py · references/no-mcp-rest-map.md · references/oci-raw-request.md · references/semantic-model.md
