Why Pandera
A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them. This post explains why we chose to use pandera for automating this process.
Context and problem statement
A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them. Metadata are stored in JSON files following the Data Package standard. Checking data against metadata has two components: verification and validation. Verification involves checking whether the overall structure of the data (e.g. column number, column data type) is as expected, while validation involves checking that all individual data items meet constraints listed in the metadata (e.g. maximum values, specific formats). We are looking for a tool to automate the data verification and validation process. The question then is:
Which data verification and validation tools are available and which one should we use?
Decision drivers
- The new tool should support both data verification and validation.
- Ideally, it should support multiple tabular data formats, including polars data frames.
- It should be easy to transform JSON metadata into the representation required by the tool.
- The tool should be able to handle relatively large datasets efficiently.
- Support for extracting metadata from data would be a plus.
Considered options
frictionless-py
frictionless-py is the Python implementation of the Data Package standard by its parent organisation, and as such it would be the obvious choice for our use case. As well as functionality for data verification and validation, it supports checking metadata against the Data Package standard and building pipelines for transforming data.
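To give a sense of the workflow, a minimal sketch of validating a Data Package with frictionless-py might look like the following; the descriptor file name is a placeholder and the report fields assume a recent frictionless release.

```python
from frictionless import validate

# Validate every resource described by a Data Package descriptor
# (the file name here is just a placeholder).
report = validate("datapackage.json")

if not report.valid:
    # Flatten the report into (row number, field number, error type) tuples.
    for error in report.flatten(["rowNumber", "fieldNumber", "type"]):
        print(error)
```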
Benefits
- Supports both data verification and validation, although it is not possible to run these checks separately.
- Multiple tabular data formats are supported, including pandas data frames.
- Directly compatible with our JSON metadata, as it implements the Data Package standard.
- Supports large data files.
- Supports extracting metadata from data, matching the Data Package standard.
Drawbacks
- The API suggests that it is possible to filter for specific errors, but this functionality does not seem to work fully.
- There are a number of different entry points to the verification/validation flow and it is quite difficult to foresee how these differ in behaviour.
- polars data frames are not supported.
- So far we’ve found it a bit difficult to navigate the frictionless-py codebase and documentation.
Pandera
pandera is a flexible data validation library operating on data frames. Its validation mechanism is based on the concept of a schema expressing expectations about the data. It also has capabilities for preprocessing data and generating synthetic data from pandera schemas.
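As a rough illustration of the schema-based approach, the sketch below declares a pandera schema with invented columns and constraints and validates a pandas data frame against it; the polars integration works along the same lines but is newer (see the drawbacks below).

```python
import pandas as pd
import pandera as pa

# A schema expresses both structural expectations (column names and types)
# and value-level constraints (ranges, patterns, and so on).
schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int),
        "age": pa.Column(int, pa.Check.in_range(0, 120)),
        "name": pa.Column(str),
    }
)

df = pd.DataFrame({"id": [1, 2], "age": [34, 58], "name": ["Ada", "Grace"]})

# Returns the validated data frame, or raises a schema error on failure;
# lazy=True collects all errors instead of stopping at the first one.
validated = schema.validate(df, lazy=True)
```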
Benefits
- Supports both data verification and validation, and can run these checks separately.
- Supports polars data frames to a large extent (see the drawbacks below).
- Supports large datasets.
- Offers schema inference, although not with polars.
- pandera is widely used, extensively tested, and has good documentation.
Drawbacks
- Only data frames are accepted as input, so other formats (e.g. CSV) have to be loaded into a data frame first.
- While polars is supported, the integration is not yet complete. E.g., it cannot yet extract metadata from polars data frames.
- We would need to write custom code to translate our table metadata from JSON to pandera schemas in Python. For its own schemas, pandera provides JSON conversion out of the box.
Great Expectations
Great Expectations is a larger framework for testing and validating data. It also offers a range of other functionality, which includes data visualisation, data collation from remote sources, and statistical summary generation. It is structured around expectations about the data, which are organised into expectation suites.
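For a flavour of the API, the sketch below declares a single expectation against a pandas data frame; note that Great Expectations' interface has changed considerably between releases, so this uses the older pandas-backed entry point and an invented column name.

```python
import great_expectations as ge
import pandas as pd

# Wrap a pandas data frame so that expectations can be declared on it
# (older-style API; newer releases organise this around data contexts).
df = ge.from_pandas(pd.DataFrame({"age": [34, 58, 19]}))

# Each expectation is evaluated immediately and returns a validation result.
result = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
print(result.success)
```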
Benefits
- Supports both data verification and validation, and can run these checks separately.
- Supports a wide range of data formats, although not polars.
- Supports large datasets.
- Can generate an expectation suite based on data.
Drawbacks
- No support for polars.
- We would need to write custom code to translate our table metadata from JSON to expectations in Python. For its own expectation suites, Great Expectations provides JSON conversion out of the box.
- The API for declaring expectations matches the structure of the Data Package standard less closely than that of the other options.
- Significantly larger and more complex to set up than any of the other options.
Pydantic
Pydantic is the most popular library for matching data against a schema in Python. Its basic use case is describing how data should be structured in a Pydantic model and checking an object against this model to confirm that it conforms. Model requirements are expressed using type hints and the matching behaviour is highly customisable.
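As a small sketch, a Pydantic model for a single row of data could look like this; the field names and constraints are invented for illustration.

```python
from pydantic import BaseModel, Field


class Participant(BaseModel):
    # Requirements are expressed as type hints, optionally with constraints.
    id: int
    name: str
    age: int = Field(ge=0, le=120)


# Each record has to be fed to the model individually, e.g. as a dict;
# a ValidationError is raised if the record does not match the model.
record = {"id": 1, "name": "Ada", "age": 36}
participant = Participant(**record)
```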
Benefits
- Supports data validation.
Drawbacks
- No out-of-the-box support for data verification.
- We would need to translate our JSON metadata into Pydantic models.
- Pydantic only accepts dictionary-like objects as input, so data files would need to be loaded into Python manually and fed to the Pydantic model row by row.
- The above means that support for large datasets would depend on our implementation.
- No support for model extraction.
Decision outcome
We decided to use pandera because it is a great match for our use case, has extensive documentation, and its behaviour is easy to tailor to our needs. While frictionless-py is a direct implementation of the Data Package standard, it is less mature and less widely used than pandera. We have found some inconsistencies in its verification/validation behaviour and feel that we would need to customise it using somewhat brittle and inelegant workarounds for it to fit into our workflow.
As for the remaining options, we decided not to go with Pydantic because its use case is not verifying or validating datasets. Although Great Expectations offers most of the functionality we need, it is a complete framework with many parts we don’t need, is rather complex to set up, and integrating with it would shape our codebase more than any of the other tools.
Consequences
- We will have to write custom logic for transforming JSON metadata into pandera schemas (see the sketch after this list).
- We will have to find a solution for extracting metadata from data, as pandera cannot currently infer schemas from polars data frames.
- If we want to add any checks or behaviours based on file-level properties of the data (e.g. file size, hash, encoding), these will have to be implemented outside of pandera.
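To illustrate the first consequence, a hypothetical sketch of translating the fields of a Data Package table schema into a pandera schema could look like the one below; the type mapping and the handful of constraints covered are illustrative assumptions, not Sprout's actual implementation.

```python
import pandera as pa

# Illustrative mapping from Data Package field types to Python types.
TYPE_MAP = {"integer": int, "number": float, "string": str, "boolean": bool}


def fields_to_schema(fields: list[dict]) -> pa.DataFrameSchema:
    """Build a pandera schema from the fields of a Data Package table schema."""
    columns = {}
    for field in fields:
        constraints = field.get("constraints", {})
        checks = []
        if "minimum" in constraints:
            checks.append(pa.Check.ge(constraints["minimum"]))
        if "maximum" in constraints:
            checks.append(pa.Check.le(constraints["maximum"]))
        if "pattern" in constraints:
            checks.append(pa.Check.str_matches(constraints["pattern"]))
        columns[field["name"]] = pa.Column(
            TYPE_MAP.get(field.get("type", "string"), str),
            checks=checks,
            nullable=not constraints.get("required", False),
        )
    return pa.DataFrameSchema(columns)
```

A real implementation would have to cover the full range of field types and constraints in the standard, but the translation layer itself can stay fairly small.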