Comprehensive Rules for building robust, maintainable data-validation layers and pipelines in Python using Pydantic, Pandera, and Great Expectations.
Your data pipelines are breaking at 3 AM. Again. Invalid dates, missing required fields, corrupted numeric values—the same preventable issues that could have been caught hours earlier with proper validation. You know the drill: emergency fixes, data backfills, and stakeholder explanations that could have been avoided entirely.
Modern data engineering teams face a brutal truth: data validation isn't optional anymore. With data volumes exploding and business-critical decisions depending on your pipelines, a single bad record can cascade into system failures, incorrect analytics, and lost revenue.
The traditional approach—scattered validation logic buried in transformation code, inconsistent error handling, and manual quality checks—doesn't scale. You need a systematic approach that treats data validation as a first-class engineering concern.
These Cursor Rules transform how you build data quality into your Python pipelines. Instead of reactive firefighting, you get proactive validation that catches issues before they propagate, with comprehensive tooling across three battle-tested frameworks: Pydantic for record-level models, Pandera for DataFrame contracts, and Great Expectations for pipeline-wide expectation suites.
The rules establish validation as code—version controlled, tested, and deployed through your existing CI/CD pipeline.
Stop chasing data issues through complex transformation chains. Validation failures include precise error locations, sample values, and actionable fix suggestions:
```
# Instead of cryptic downstream errors
ValidationError: "order_date is in the future (2028-01-01) for order_id 12345"
```
Pre-configured severity levels automatically route issues to the right place. Built-in dead-letter tables, structured JSON logging, and audit trails turn your validation failures into valuable debugging data instead of silent corruption.
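For instance, a severity-routing function might look like this; `write_to_dead_letter_table` and `log_failures` are hypothetical sinks, not framework APIs:

```python
def route_failure(severity: str, failed_rows: list) -> None:
    """CRITICAL stops the pipeline, HIGH quarantines rows, LOW logs and continues."""
    if severity == 'CRITICAL':
        raise RuntimeError(f'critical data defect: {len(failed_rows)} rows')
    if severity == 'HIGH':
        write_to_dead_letter_table(failed_rows)  # hypothetical quarantine sink
        return
    log_failures(failed_rows)  # hypothetical structured logger
```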
Before: Manual type checking scattered across transformation functions

```python
# Fragile, hard to maintain
def process_orders(df):
    # Hidden validation logic
    df = df[df['amount'] > 0]  # Where did this rule come from?
    df['order_date'] = pd.to_datetime(df['order_date'])  # Silent failures
    return df
```
After: Declarative schema with comprehensive validation

```python
from datetime import datetime

from pydantic import BaseModel, EmailStr, PositiveInt, confloat, root_validator


class Order(BaseModel):
    id: PositiveInt
    amount_usd: confloat(gt=0)
    customer_email: EmailStr
    order_date: datetime

    @root_validator
    def validate_business_rules(cls, values):
        order_date = values.get('order_date')  # absent if field validation failed
        if order_date is not None and order_date > datetime.utcnow():
            raise ValueError('order_date cannot be in the future')
        return values


# Pipeline becomes self-documenting
validated_orders = [Order(**row) for row in raw_data]
```
Before: Runtime errors with unclear origins

```python
# Fails silently or with cryptic pandas errors
result = df.groupby('category').sum()  # What if category has nulls?
```
After: Schema-first validation with early error detection

```python
import pandera as pa
from pandera.typing import Series


class SalesSchema(pa.SchemaModel):
    category: Series[str] = pa.Field(nullable=False)
    amount: Series[float] = pa.Field(gt=0)


# Explicit contract validation; lazy=True reports every failure at once
validated_df = SalesSchema.validate(df, lazy=True)
result = validated_df.groupby('category').sum()
```
Install the three frameworks:

```bash
pip install pydantic pandera great-expectations
```

Create the directory structure:

```
project_root/
├─ src/validators/
│  ├─ pydantic_models.py
│  └─ dataframe_schemas.py
├─ expectations/
└─ tests/
```
Define record-level models:

```python
# src/validators/pydantic_models.py
from datetime import datetime

from pydantic import BaseModel, EmailStr, PositiveInt, validator


class CustomerRecord(BaseModel):
    customer_id: PositiveInt
    email: EmailStr
    signup_date: datetime

    class Config:
        # Pydantic v1: immutable records, re-validated on assignment.
        # (Strict mode is a v2 feature: model_config = ConfigDict(strict=True).)
        validate_assignment = True
        allow_mutation = False

    @validator('signup_date')
    def signup_date_not_future(cls, v):
        if v > datetime.utcnow():
            raise ValueError('signup_date cannot be in the future')
        return v
```
Use the model to validate batches, routing failures to a dead-letter sink:

```python
from datetime import datetime

from pydantic import ValidationError

from src.validators.pydantic_models import CustomerRecord


def process_customer_batch(raw_records):
    validated_records = []
    error_records = []
    for record in raw_records:
        try:
            validated = CustomerRecord(**record)
            validated_records.append(validated)
        except ValidationError as e:
            error_records.append({
                'record': record,
                'errors': e.errors(),
                'timestamp': datetime.utcnow(),
            })
    # Handle errors according to your business rules
    if error_records:
        write_to_dead_letter_queue(error_records)  # your dead-letter sink
    return validated_records
```
For DataFrame-level contracts, define a Pandera schema:

```python
# src/validators/dataframe_schemas.py
import pandera as pa
from pandera.typing import Series


class OrdersSchema(pa.SchemaModel):  # pa.DataFrameModel in newer pandera releases
    order_id: Series[int] = pa.Field(gt=0, unique=True)
    customer_id: Series[int] = pa.Field(gt=0)
    amount: Series[float] = pa.Field(gt=0)
    order_date: Series[str] = pa.Field(str_matches=r'\d{4}-\d{2}-\d{2}')

    @pa.check('amount')
    def amount_within_limits(cls, series):
        """Business rule: order amounts stay between 0.01 and 10,000."""
        return series.between(0.01, 10_000)
```
These Cursor Rules give you everything needed to build enterprise-grade validation into your Python data pipelines. Stop treating data quality as an afterthought—make it a core engineering practice that prevents issues before they impact your systems.
Your stakeholders will thank you when the dashboards stay green, and you'll sleep better knowing your data pipelines have comprehensive quality controls built in from day one.
Ready to eliminate those 3 AM data quality alerts? Implementation starts with your next pipeline.
You are an expert in Python, SQL, PySpark, Pydantic, Pandera, Great Expectations, and modern ETL tooling.
Key Principles
- Treat data validation as code: place rules under version control, code-review them, and ship through CI/CD.
- Validate as early as possible (ingress) and as late as necessary (egress) to guarantee contract safety at every stage.
- Make all validation rules explicit, declarative, and test-covered; avoid implicit or scattered checks.
- Fail fast and loudly: stop the pipeline on critical data defects; route non-critical issues to alerting/monitoring.
- Keep the happy path last: handle nulls, type mismatches, and boundary cases first with early returns (see the guard-clause sketch after this list).
- Prefer immutable, pure validation functions; avoid hidden state to keep behaviour predictable and cache-friendly.
- Document every rule in-code via docstrings and external README/auto-generated docs.
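A minimal sketch of that early-return style (the function name and checks are illustrative):

```python
from typing import Any


def validate_amount(raw: Any) -> float:
    """Guard clauses first; the happy path is the final return."""
    if raw is None:
        raise ValueError('amount is required')
    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
        raise TypeError(f'amount must be numeric, got {type(raw).__name__}')
    if raw <= 0:
        raise ValueError(f'amount must be > 0, got {raw}')
    return float(raw)
```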
Python
- Enable strict mypy in CI (`mypy --strict`) and exhaustively annotate all public functions and models.
- Use `pydantic.BaseModel` or `dataclasses.dataclass(frozen=True)` for all DTOs; forbid bare dictionaries.
- Name validators with imperative verbs (`validate_email_format`, `check_range`).
- Raise `ValueError` for user-input issues, `TypeError` for type violations, and a custom `DataValidationError` (derived from `Exception`) for business-rule failures (sketched after this list).
- Never catch broad `Exception`; catch specific subclasses and re-raise with context.
- Use `pathlib` for file paths, `decimal.Decimal` for monetary values, and timezone-aware `datetime` objects by default.
- Keep individual validator functions under 40 LOC; split complex logic into helpers.
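A minimal sketch of that exception convention (the `rule_id` and `sample_ids` fields are illustrative assumptions, not a fixed API):

```python
from typing import Optional


class DataValidationError(Exception):
    """Base class for business-rule failures; carries context for debugging."""

    def __init__(self, message: str, *, rule_id: str,
                 sample_ids: Optional[list] = None):
        super().__init__(message)
        self.rule_id = rule_id
        self.sample_ids = sample_ids or []


class FutureDateError(DataValidationError):
    """A date column contains values after the current time."""
```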
Error Handling and Validation
- Centralise error handling via a `ValidationReport` object containing: failed_rows, error_type, rule_id, severity (see the sketch after this list).
- Use structured logging (`json` format) for all validation errors; include rule name, counts, and sample ids.
- Distinguish severities: `CRITICAL` → stop pipeline, `HIGH` → quarantine rows, `LOW` → log & continue.
- Provide actionable messages: "order_date is in the future (2028-01-01) for order_id 12345".
- Batch pipelines: write rejected records to a dead-letter table with identical schema + `__rejection_reason` column.
- Stream/real-time: return HTTP 422 with an error payload conforming to RFC 7807.
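One possible shape for the `ValidationReport` object described above; the field names follow the list, everything else is an assumption:

```python
import json
import logging
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    CRITICAL = 'CRITICAL'  # stop the pipeline
    HIGH = 'HIGH'          # quarantine rows
    LOW = 'LOW'            # log and continue


@dataclass(frozen=True)
class ValidationReport:
    rule_id: str
    error_type: str
    severity: Severity
    failed_rows: list = field(default_factory=list)

    def log(self, logger: logging.Logger) -> None:
        """Structured JSON record: rule name, counts, and sample ids only."""
        logger.error(json.dumps({
            'rule_id': self.rule_id,
            'error_type': self.error_type,
            'severity': self.severity.value,
            'failed_count': len(self.failed_rows),
            'sample_ids': self.failed_rows[:5],
        }))
```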
Pydantic Rules
- Lock models down via `Config`: `validate_assignment = True`, `allow_mutation = False` (Pydantic v1); in Pydantic v2, use `model_config = ConfigDict(strict=True, validate_assignment=True, frozen=True)`.
- Use `constr`, `conint`, `PositiveFloat` etc. to encode primitive rules; avoid manual `@validator` when a built-in type suffices.
- Group business rules in class-level `@root_validator(pre=True)` – validate cross-field dependencies once.
- Cascade models: top-level ingestion model → domain model → persistence model; validate at each hop (see the sketch after this list).
- Provide example JSON via `schema_extra` for documentation & tests.
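A minimal sketch of the cascade (`RawOrder` → `DomainOrder`; names and fields are illustrative):

```python
from pydantic import BaseModel, PositiveInt, confloat


class RawOrder(BaseModel):
    """Ingestion model: lenient types, exactly as the source sends them."""
    order_id: int
    amount: float


class DomainOrder(BaseModel):
    """Domain model: business rules enforced at the second hop."""
    order_id: PositiveInt
    amount_usd: confloat(gt=0)

    @classmethod
    def from_raw(cls, raw: RawOrder) -> 'DomainOrder':
        # Re-validating at each hop means downstream code can trust its input
        return cls(order_id=raw.order_id, amount_usd=raw.amount)
```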
Pandera Rules (DataFrame validation)
- Define one schema per data contract (`OrdersSchema`, `CustomersSchema`). Store them in `schemas/`.
- Set `coerce=True` to auto-cast on read; forbid implicit string→numeric coercions.
- Use `Column(..., unique=True, nullable=False)` for PKs; add `Check` objects for custom logic (`lambda s: s > 0`).
- Validate lazily: `schema.validate(df, lazy=True)` raises a single `SchemaErrors` that gathers all failures at once instead of stopping at the first (see the sketch after this list).
- Pipe style: `validated_df = df.pipe(clean_columns).pipe(OrdersSchema.validate, lazy=True)`.
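When a lazy validation fails, pandera raises `SchemaErrors` carrying a `failure_cases` DataFrame you can route to quarantine. A minimal sketch, where `raw_df` and `quarantine` are placeholders for your data and sink:

```python
import pandera as pa

from src.validators.dataframe_schemas import OrdersSchema

try:
    validated_df = OrdersSchema.validate(raw_df, lazy=True)
except pa.errors.SchemaErrors as err:
    # One row per failing check/value; the 'index' column points at the bad rows
    quarantine(err.failure_cases)  # hypothetical dead-letter sink
    validated_df = raw_df.drop(err.failure_cases['index'].dropna())
```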
Great Expectations Rules
- Store expectation suites alongside data asset in repository (`expectations/asset_name/`).
- Use `expect_column_values_to_match_regex_list` instead of generic regex checks to whitelist patterns.
- Tag expectations with `meta: {owner: "data-team", jira_ticket: "DQ-123"}` (see the sketch after this list).
- Run `great_expectations checkpoint run asset_name` in CI on sampled data; block merge on failures.
- Auto-generate data docs and host them on internal S3/static site for transparency.
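For a feel of how a tagged expectation reads in code, here is a sketch using the legacy pandas-native API (`ge.from_pandas`, GE < 0.18); newer projects declare the same expectation inside a suite and run it through checkpoints:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({'order_date': ['2024-01-15', '2024-02-01']})

df_ge = ge.from_pandas(df)
result = df_ge.expect_column_values_to_match_regex_list(
    'order_date',
    regex_list=[r'\d{4}-\d{2}-\d{2}'],
    meta={'owner': 'data-team', 'jira_ticket': 'DQ-123'},
)
assert result.success
```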
Testing
- Unit test every custom validator: happy path + at least 3 edge cases (`pytest` with `@pytest.mark.parametrize`; see the sketch after this list).
- Integration test pipelines with synthetic fixtures generated by Faker and Hypothesis for property-based tests.
- Snapshot test Great Expectations data docs to catch unintended rule removals.
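A minimal sketch of the parametrized edge-case pattern, reusing the `CustomerRecord` model from above (the chosen bad inputs are illustrative):

```python
import pytest
from pydantic import ValidationError

from src.validators.pydantic_models import CustomerRecord


@pytest.mark.parametrize('bad_email', ['', 'not-an-email', 'user@'])
def test_customer_email_rejected(bad_email):
    with pytest.raises(ValidationError):
        CustomerRecord(customer_id=1, email=bad_email,
                       signup_date='2020-01-01T00:00:00')
```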
Performance
- Use lazy validation where order independence allows; eagerly validate ordering/sequence-dependent rules.
- Memoize deterministic pure validators via `functools.lru_cache(maxsize=1024)` when reused across rows.
- For PySpark, push predicates to the cluster: convert Python checks into `Column` expressions instead of `.collect()`.
- Chunk large CSV files (`pd.read_csv(..., chunksize=50_000)`) or read Parquet by row group to keep memory constant, as sketched below.
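A sketch of constant-memory chunked validation, where `'orders.csv'` and `load_to_warehouse` are placeholders:

```python
import pandas as pd

from src.validators.dataframe_schemas import OrdersSchema

# Each chunk is validated and loaded independently, so memory stays flat
for chunk in pd.read_csv('orders.csv', chunksize=50_000):
    validated = OrdersSchema.validate(chunk, lazy=True)
    load_to_warehouse(validated)  # hypothetical sink
```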
Security & Compliance
- Mask PII in logs: replace the local part of emails with `***` (see the sketch after this list); never log full SSNs.
- Encrypt quarantine tables (KMS) and restrict GRANTs to data-quality role.
- Traceability: append `__validated_at`, `__validator_version` columns to persisted datasets.
- Maintain an audit trail of rule modifications via Git tags and CHANGELOG.
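A minimal sketch of the email-masking rule (the function name is illustrative):

```python
import re


def mask_email(value: str) -> str:
    """Strip the local part before the address reaches any log line."""
    return re.sub(r'^[^@]+', '***', value)


assert mask_email('jane.doe@example.com') == '***@example.com'
```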
Documentation
- Autogenerate model docs from each model's `schema_json()` output (Pydantic v1) and serve them via MkDocs (see the sketch below).
- Each schema file must contain a header with: Author, Created, Modified, Jira.
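A sketch of the doc-generation step, assuming the MkDocs site picks up JSON Schema files from `docs/schemas/` (a path chosen here for illustration):

```python
import json
from pathlib import Path

from src.validators.pydantic_models import CustomerRecord

docs_dir = Path('docs/schemas')
docs_dir.mkdir(parents=True, exist_ok=True)

# Pydantic v1: schema_json() emits the model's JSON Schema
schema = json.loads(CustomerRecord.schema_json())
(docs_dir / 'customer_record.json').write_text(json.dumps(schema, indent=2))
```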
Directory Layout
```
project_root/
├─ src/
│ ├─ validators/
│ │ ├─ __init__.py
│ │ ├─ pydantic_models.py
│ │ └─ dataframe_schemas.py
│ ├─ pipelines/
│ └─ utils/
├─ expectations/
├─ tests/
└─ docs/
```
Common Pitfalls & How to Avoid
- Pitfall: Silent type coercion (`"123" → 123`). Fix: set `strict=True`, `coerce=False`.
- Pitfall: Catch-all `except` hides errors. Fix: catch `ValidationError` / specific subclasses only.
- Pitfall: Duplicate rule logic across models. Fix: centralise shared validators in `validators/common.py` (see the sketch after this list).
- Pitfall: Validation disabled in production for performance. Fix: enable sampling mode (e.g., 5% rows) instead of disabling.
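A sketch of the shared-validator module, using Pydantic v1's `allow_reuse` pattern:

```python
# src/validators/common.py — shared rules, written once
from datetime import datetime

from pydantic import BaseModel, validator


def not_in_future(value: datetime) -> datetime:
    if value > datetime.utcnow():
        raise ValueError(f'{value} is in the future')
    return value


# Any model attaches the shared rule instead of re-implementing it
class CustomerRecord(BaseModel):
    signup_date: datetime

    _signup_not_future = validator('signup_date', allow_reuse=True)(not_in_future)
```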
Ready-to-Use Snippet
```python
from datetime import datetime

from pydantic import BaseModel, EmailStr, PositiveInt, ValidationError, root_validator


class Order(BaseModel):
    id: PositiveInt
    amount_usd: float  # validated > 0 below
    customer_email: EmailStr
    order_date: datetime

    @root_validator
    def check_amount_and_date(cls, values):
        # Use .get(): fields that already failed validation are absent here
        amount, order_date = values.get('amount_usd'), values.get('order_date')
        if amount is not None and amount <= 0:
            raise ValueError('amount_usd must be > 0')
        if order_date is not None and order_date > datetime.utcnow():
            raise ValueError('order_date cannot be in the future')
        return values


try:
    Order(id=0, amount_usd=-10, customer_email='bad', order_date=datetime(2028, 1, 1))
except ValidationError as e:
    print(e.json())
```