sparkdq

Open source Apache-2.0 Python

About sparkdq

A lightweight, declarative PySpark framework for data quality validation — check columns, rows, and entire datasets directly in your Spark pipelines

Platforms

Web Self-hosted

Languages

Python

SparkDQ — Data Quality Validation for Apache Spark

SparkDQ is a lightweight data quality framework built natively for PySpark — no JVM bridge like PyDeequ, no complexity overhead like Great Expectations, and no platform lock-in like Databricks dqx. Define checks declaratively via YAML/JSON or through a type-safe Python API, validate at row and aggregate level in a single pass, and extend the framework via a plugin system without touching the core.

One dependency. No wrappers. No bloat.

Quickstart

Declarative: checks are passed as dicts, which can be loaded from anywhere (YAML files, JSON, databases, or APIs):

from pyspark.sql import SparkSession
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

# Register a null check on the "name" column from a plain dict definition
check_set = CheckSet()
check_set.add_checks_from_dicts([
    {"check": "null-check", "check-id": "no-null-name", "columns": ["name"]},
])

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
# Validation Summary (2024-01-01 00:00:00)
# Total records:   3
# Passed records:  2
# Failed records:  1
# Warnings:        0
# Pass rate:       67.00%
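
Because the declarative API takes plain dicts, check definitions can live outside the code. The following is a minimal sketch, assuming the definitions are stored as a JSON array that uses the same keys as the example above (check, check-id, columns). The helper name load_check_definitions is hypothetical and not part of SparkDQ, and treating check and check-id as mandatory is an assumption made for illustration:

```python
import json


def load_check_definitions(json_text: str) -> list[dict]:
    """Hypothetical helper (not part of SparkDQ): parse a JSON array of check
    definitions into the list of dicts used by the Quickstart example."""
    definitions = json.loads(json_text)
    if not isinstance(definitions, list):
        raise ValueError("expected a JSON array of check definitions")
    for definition in definitions:
        if not isinstance(definition, dict):
            raise ValueError(f"expected an object, got: {definition!r}")
        # Assumption: every definition carries these two keys, as in the Quickstart.
        for key in ("check", "check-id"):
            if key not in definition:
                raise ValueError(f"check definition is missing {key!r}: {definition}")
    return definitions


# Same check as in the Quickstart, expressed as JSON text.
CHECKS_JSON = """
[
  {"check": "null-check", "check-id": "no-null-name", "columns": ["name"]}
]
"""

definitions = load_check_definitions(CHECKS_JSON)
```

The resulting list can then be handed to check_set.add_checks_from_dicts(definitions), as in the Quickstart above.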

Python-native — full type safety and IDE autocompletion:

from pyspark.sql import SparkSession
from sparkdq.checks import NullCheckConfig
from sparkdq.core import Severity
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

# Same null check as above, built from a typed config object with an explicit severity
check_set = (
    CheckSet()
    .add_check(NullCheckConfig(check_id="no-null-name", columns=["name"], severity=Severity.CRITICAL))
)

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
# Validation Summary (2024-01-01 00:00:00)
# Total records:   3
# Passed records:  2
# Failed records:  1
# Warnings:        0
# Pass rate:       67.00%

SparkDQ ships with 30+ built-in checks across null validation, numeric ranges, string patterns, date boundaries, schema enforcement, uniqueness, and referential integrity.

🚀 See the official documentation to learn more.

Installation

For Local Development / Standalone Clusters

Install with PySpark included:

pip install "sparkdq[spark]"

For Databricks / Managed Platforms

Install without PySpark (runtime provided by platform):

pip install sparkdq

The framework supports Python 3.11+ and is fully tested with PySpark 3.5.x. On import, SparkDQ checks whether PySpark is available and reports a clear error message if it is missing from your environment.
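
An availability check of this kind can be written with the standard library alone. The sketch below is not SparkDQ's actual implementation. It only illustrates the general technique, and the function name ensure_module_available and the wording of the error message are assumptions:

```python
import importlib.util


def ensure_module_available(module_name: str, install_hint: str) -> None:
    """Raise a descriptive ImportError when a top-level module cannot be found."""
    # find_spec returns None for a top-level module that is not installed,
    # and it does so without importing the module.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name!r} is required but not installed. {install_hint}"
        )


# The standard-library module "json" is always present, so this passes silently.
ensure_module_available("json", "")

# For PySpark the call might look like this (left commented out so the
# sketch runs in environments without PySpark):
# ensure_module_available("pyspark", 'Install it with: pip install "sparkdq[spark]"')
```

Checking with find_spec keeps the failure early and explicit: a missing runtime surfaces as one readable message instead of a traceback from deep inside the first Spark call.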

Why SparkDQ?

  • Extensible by design: Add custom checks via a simple plugin system — no changes to the core required

  • Declarative or Pythonic: YAML/JSON configs or type-safe Python — your choice

  • Severity-aware: Distinguish between hard failures (CRITICAL) and soft constraints (WARNING)

  • Row-level and aggregate: Validate individual records and entire datasets in a single pass

  • Minimal footprint: Pydantic is the only required dependency; PySpark comes from your platform or from the optional sparkdq[spark] extra
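
The severity distinction in the list above can be illustrated in plain Python, without Spark. This is a conceptual sketch, not SparkDQ's implementation. It assumes that a record fails only when a CRITICAL check fails, while a failed WARNING check is merely counted. The Severity enum here is a local stand-in rather than sparkdq.core.Severity, and summarize is a hypothetical helper:

```python
from enum import Enum


class Severity(Enum):
    # Local stand-in for illustration only, not sparkdq.core.Severity.
    CRITICAL = "critical"
    WARNING = "warning"


def summarize(records, checks):
    """Count passed, failed, and warned records.

    `checks` is a list of (predicate, severity) pairs; a predicate returns
    True when the record satisfies the check.
    """
    passed = failed = warnings = 0
    for record in records:
        critical_failure = False
        has_warning = False
        for predicate, severity in checks:
            if predicate(record):
                continue
            if severity is Severity.CRITICAL:
                critical_failure = True
            else:
                has_warning = True
        # Assumption: only CRITICAL failures make a record fail.
        if critical_failure:
            failed += 1
        else:
            passed += 1
        if has_warning:
            warnings += 1
    return {"total": len(records), "passed": passed, "failed": failed, "warnings": warnings}


# Same three records as in the Quickstart.
records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": None},
    {"id": 3, "name": "Bob"},
]

critical_only = summarize(records, [(lambda r: r["name"] is not None, Severity.CRITICAL)])
with_warning = summarize(
    records,
    [
        (lambda r: r["name"] is not None, Severity.CRITICAL),
        (lambda r: r["id"] <= 2, Severity.WARNING),  # soft constraint
    ],
)
```

With the null check alone, the counts match the Quickstart summary (3 total, 2 passed, 1 failed, 0 warnings). Adding the soft constraint leaves the pass and fail counts unchanged and only raises the warning count.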

Let’s Build Better Data Together

⭐️ Found this useful? Give it a star and help spread the word!

📣 Questions, feedback, or ideas? Open an issue or discussion — we’d love to hear from you.

🤝 Want to contribute? Check out CONTRIBUTING.md to get started.