About sparkdq

A lightweight, declarative PySpark framework for data quality validation — check columns, rows, and entire datasets directly in your Spark pipelines

Published by

sparkdq-community

SparkDQ — Data Quality Validation for Apache Spark

SparkDQ is a lightweight data quality framework built natively for PySpark — no JVM bridge like PyDeequ, no complexity overhead like Great Expectations, and no platform lock-in like Databricks dqx. Define checks declaratively via YAML/JSON or through a type-safe Python API, validate at row and aggregate level in a single pass, and extend the framework via a plugin system without touching the core.

One dependency. No wrappers. No bloat.

Quickstart

Declarative: checks are passed as dicts, which can be loaded from any source, such as YAML files, JSON, databases, or APIs:

from pyspark.sql import SparkSession
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

check_set = CheckSet()
check_set.add_checks_from_dicts([
    {"check": "null-check", "check-id": "no-null-name", "columns": ["name"]},
])

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
# Validation Summary (2024-01-01 00:00:00)
# Total records:   3
# Passed records:  2
# Failed records:  1
# Warnings:        0
# Pass rate:       67.00%
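Because the declarative API takes plain dicts, check definitions can live outside the pipeline code. The sketch below parses a JSON document into the list-of-dicts shape used above. `load_check_dicts` and its required-key validation are hypothetical helpers, not part of SparkDQ; the only keys assumed are the `check`, `check-id`, and `columns` keys from the example.

```python
import json

# Hypothetical helper (not part of SparkDQ): parse check definitions from JSON
# text into the list-of-dicts shape passed to add_checks_from_dicts above.
# Assumption: every definition needs at least the keys used in the example.
REQUIRED_KEYS = {"check", "check-id"}


def load_check_dicts(json_text: str) -> list[dict]:
    checks = json.loads(json_text)
    if not isinstance(checks, list):
        raise ValueError("expected a JSON array of check definitions")
    for i, check in enumerate(checks):
        if not isinstance(check, dict):
            raise ValueError(f"check #{i} is not a JSON object")
        missing = REQUIRED_KEYS - check.keys()
        if missing:
            raise ValueError(f"check #{i} is missing keys: {sorted(missing)}")
    return checks


# In practice this text would come from a file, a database, or an API.
CONFIG = """
[
  {"check": "null-check", "check-id": "no-null-name", "columns": ["name"]}
]
"""

check_dicts = load_check_dicts(CONFIG)
print(check_dicts)
```

The resulting list can be handed to `CheckSet().add_checks_from_dicts(check_dicts)` in the same way as the inline dicts in the example above.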

Python-native — full type safety and IDE autocompletion:

from pyspark.sql import SparkSession
from sparkdq.checks import NullCheckConfig
from sparkdq.core import Severity
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

check_set = (
    CheckSet()
    .add_check(NullCheckConfig(check_id="no-null-name", columns=["name"], severity=Severity.CRITICAL))
)

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
# Validation Summary (2024-01-01 00:00:00)
# Total records:   3
# Passed records:  2
# Failed records:  1
# Warnings:        0
# Pass rate:       67.00%

SparkDQ ships with 30+ built-in checks across null validation, numeric ranges, string patterns, date boundaries, schema enforcement, uniqueness, and referential integrity.

🚀 See the official documentation to learn more.

Installation

For Local Development / Standalone Clusters

Install with PySpark included:

pip install "sparkdq[spark]"

For Databricks / Managed Platforms

Install without PySpark (runtime provided by platform):

pip install sparkdq

SparkDQ supports Python 3.11+ and is fully tested with PySpark 3.5.x. On import, it checks whether PySpark is available and raises a clear error message if it is missing from your environment.
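An import-time availability check of this kind can be sketched with the standard library alone. This is an illustration of the general technique, not SparkDQ's actual implementation; `require_module` and the error wording are hypothetical.

```python
import importlib.util


def require_module(name: str, install_hint: str) -> None:
    """Raise a descriptive ImportError if the top-level module `name` is not installed."""
    # find_spec returns None when no module with this name can be located.
    if importlib.util.find_spec(name) is None:
        raise ImportError(f"{name} is required but not installed. {install_hint}")


# Hypothetical call a package could make in its __init__.py:
# require_module("pyspark", 'Install it with: pip install "sparkdq[spark]"')
```

Using `find_spec` avoids importing the module just to test for it, so the check stays cheap even when the dependency is large.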

Why SparkDQ?

Extensible by design: Add custom checks via a simple plugin system — no changes to the core required
Declarative or Pythonic: YAML/JSON configs or type-safe Python — your choice
Severity-aware: Distinguish between hard failures (CRITICAL) and soft constraints (WARNING)
Row-level and aggregate: Validate individual records and entire datasets in a single pass
Minimal footprint: Pydantic is the only required dependency; PySpark is provided by your platform or installed through the optional spark extra
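To make the severity distinction concrete, here is a plain-Python sketch of one possible interpretation: a row counts as failed only when a CRITICAL check fails, while a row that fails only WARNING checks still passes and is counted as a warning. These semantics are an assumption for illustration and may differ from how SparkDQ computes its summary. The `Severity` enum below is a stand-in, not `sparkdq.core.Severity`, and `summarize` is a hypothetical helper.

```python
from enum import Enum


class Severity(Enum):  # stand-in for illustration, not sparkdq.core.Severity
    CRITICAL = "critical"
    WARNING = "warning"


def summarize(rows, checks):
    """Count rows under the assumed severity semantics.

    `checks` is a list of (severity, predicate) pairs; a predicate returns
    True when the row passes that check.
    """
    failed = warnings = 0
    for row in rows:
        failed_severities = {sev for sev, passes in checks if not passes(row)}
        if Severity.CRITICAL in failed_severities:
            failed += 1  # hard failure: the row is rejected
        elif Severity.WARNING in failed_severities:
            warnings += 1  # soft constraint: the row still passes
    return {
        "total": len(rows),
        "passed": len(rows) - failed,
        "failed": failed,
        "warnings": warnings,
    }


rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": None},
    {"id": 3, "name": "Bob"},
]
checks = [
    (Severity.CRITICAL, lambda r: r["name"] is not None),  # hard failure on null name
    (Severity.WARNING, lambda r: r["id"] < 3),  # illustrative soft constraint
]

summary = summarize(rows, checks)
print(summary)
```

In a real pipeline the predicates would be Spark column expressions evaluated by the engine; the sketch only shows how the two severities could feed different counters.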

Let’s Build Better Data Together

⭐️ Found this useful? Give it a star and help spread the word!

📣 Questions, feedback, or ideas? Open an issue or discussion — we’d love to hear from you.

🤝 Want to contribute? Check out CONTRIBUTING.md to get started.
