Why Pandera
A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them. This post explains why we chose to use pandera for automating this process.
Context and problem statement
A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them. Metadata are stored in JSON files following the Data Package standard. Checking data against metadata has two components: verification and validation. Verification involves checking whether the overall structure of the data (e.g. column number, column data type) is as expected, while validation involves checking that all individual data items meet constraints listed in the metadata (e.g. maximum values, specific formats). We are looking for a tool to automate the data verification and validation process. The question then is:
Which data verification and validation tools are available and which one should we use?
Decision drivers
- The new tool should support both data verification and validation.
- Ideally, it should support multiple tabular data formats, including polars data frames.
- It should be easy to transform JSON metadata into the representation required by the tool.
- The tool should be able to handle relatively large datasets efficiently.
- Support for extracting metadata from data would be a plus.
Considered options
frictionless-py
frictionless-py is the Python implementation of the Data Package standard by its parent organisation, and as such it would be the obvious choice for our use case. As well as functionality for data verification and validation, it supports checking metadata against the Data Package standard and building pipelines for transforming data.
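To give a sense of the workflow, a minimal sketch of validating a Data Package with frictionless-py might look like the following; the descriptor file name is a placeholder and the report fields assume a recent frictionless release.

```python
from frictionless import validate

# Validate every resource described by a Data Package descriptor
# (the file name here is just a placeholder).
report = validate("datapackage.json")

if not report.valid:
    # Flatten the report into (row number, field number, error type) tuples.
    for error in report.flatten(["rowNumber", "fieldNumber", "type"]):
        print(error)
```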
Benefits
- Supports both data verification and validation, although it is not possible to run these checks separately.
- Multiple tabular data formats are supported, including pandas data frames.
- Directly compatible with our JSON metadata, as it implements the Data Package standard.
- Supports large data files.
- Supports extracting metadata from data, matching the Data Package standard.
Drawbacks
- The API suggests that it is possible to filter for specific errors, but this functionality does not seem to work fully.
- There are a number of different entry points to the verification/validation flow and it is quite difficult to foresee how these differ in behaviour.
- polars data frames are not supported.
- So far we’ve found it a bit difficult to navigate the frictionless-py codebase and documentation.
Pandera
pandera is a flexible data validation library operating on data frames. Its validation mechanism is based on the concept of a schema expressing expectations about the data. It also has capabilities for preprocessing data and generating synthetic data from pandera schemas.
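As a rough illustration of the schema-based approach, the sketch below declares a pandera schema with invented columns and constraints and validates a pandas data frame against it; the polars integration works along the same lines but is newer (see the drawbacks below).

```python
import pandas as pd
import pandera as pa

# A schema expresses both structural expectations (column names and types)
# and value-level constraints (ranges, patterns, and so on).
schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int),
        "age": pa.Column(int, pa.Check.in_range(0, 120)),
        "name": pa.Column(str),
    }
)

df = pd.DataFrame({"id": [1, 2], "age": [34, 58], "name": ["Ada", "Grace"]})

# Returns the validated data frame, or raises a schema error on failure;
# lazy=True collects all errors instead of stopping at the first one.
validated = schema.validate(df, lazy=True)
```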
Benefits
- Supports both data verification and validation, and can run these checks separately.
- Supports polars data frames to a large extent (see the drawbacks below).
- Supports large datasets.
- Offers schema inference, although not with polars.
- pandera is widely used, extensively tested, and has good documentation.
Drawbacks
- Only data frames are accepted as input, so other formats (e.g. CSV) have to be loaded into a data frame first.
- While polars is supported, the integration is not yet complete. E.g., it cannot yet extract metadata from polars data frames.
- We would need to write custom code to translate our table metadata from JSON to pandera schemas in Python. For its own schemas, pandera provides JSON conversion out of the box.
Great Expectations
Great Expectations is a larger framework for testing and validating data. It also offers a range of other functionality, which includes data visualisation, data collation from remote sources, and statistical summary generation. It is structured around expectations about the data, which are organised into expectation suites.
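For a flavour of the API, the sketch below declares a single expectation against a pandas data frame; note that Great Expectations' interface has changed considerably between releases, so this uses the older pandas-backed entry point and an invented column name.

```python
import great_expectations as ge
import pandas as pd

# Wrap a pandas data frame so that expectations can be declared on it
# (older-style API; newer releases organise this around data contexts).
df = ge.from_pandas(pd.DataFrame({"age": [34, 58, 19]}))

# Each expectation is evaluated immediately and returns a validation result.
result = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
print(result.success)
```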
Benefits
- Supports both data verification and validation, and can run these checks separately.
- Supports a wide range of data formats, although not polars.
- Supports large datasets.
- Can generate an expectation suite based on data.
Drawbacks
- No support for polars.
- We would need to write custom code to translate our table metadata from JSON to expectations in Python. For its own expectation suites, Great Expectations provides JSON conversion out of the box.
- The API for declaring expectations matches the structure of the Data Package standard less closely than that of the other options.
- Significantly larger and more complex to set up than any of the other options.
Pydantic
Pydantic is the most popular library for matching data against a schema in Python. Its basic use case is describing how data should be structured in a Pydantic model and checking an object against this model to confirm that it conforms. Model requirements are expressed using type hints and the matching behaviour is highly customisable.
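As a small sketch, a Pydantic model for a single row of data could look like this; the field names and constraints are invented for illustration.

```python
from pydantic import BaseModel, Field


class Participant(BaseModel):
    # Requirements are expressed as type hints, optionally with constraints.
    id: int
    name: str
    age: int = Field(ge=0, le=120)


# Each record has to be fed to the model individually, e.g. as a dict;
# a ValidationError is raised if the record does not match the model.
record = {"id": 1, "name": "Ada", "age": 36}
participant = Participant(**record)
```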
Benefits
- Supports data validation.
Drawbacks
- No out-of-the-box support for data verification.
- We would need to translate our JSON metadata into Pydantic models.
- Pydantic only accepts dictionary-like objects as input, so data files would need to be loaded into Python manually and fed to the Pydantic model row by row.
- The above means that support for large datasets would depend on our implementation.
- No support for model extraction.
Decision outcome
We decided to use pandera because it is a great match for our use case, has extensive documentation, and its behaviour is easy to tailor to our needs. While frictionless-py is a direct implementation of the Data Package standard, it is less mature and less widely used than pandera. We have found some inconsistencies in its verification/validation behaviour and feel that we would need to customise it using somewhat brittle and inelegant workarounds for it to fit into our workflow.
As for the remaining options, we decided not to go with Pydantic because its use case is not verifying or validating datasets. Although Great Expectations offers most of the functionality we need, it is a complete framework with many parts we don’t need, is rather complex to set up, and integrating with it would shape our codebase more than any of the other tools.
Consequences
- We will have to write custom logic for transforming JSON metadata into pandera schemas (see the sketch after this list).
- We will have to find a solution for extracting metadata from data, as pandera cannot currently infer schemas from polars data frames.
- If we want to add any checks or behaviours based on file-level properties of the data (e.g. file size, hash, encoding), these will have to be implemented outside of pandera.
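To illustrate the first consequence, a hypothetical sketch of translating the fields of a Data Package table schema into a pandera schema could look like the one below; the type mapping and the handful of constraints covered are illustrative assumptions, not Sprout's actual implementation.

```python
import pandera as pa

# Illustrative mapping from Data Package field types to Python types.
TYPE_MAP = {"integer": int, "number": float, "string": str, "boolean": bool}


def fields_to_schema(fields: list[dict]) -> pa.DataFrameSchema:
    """Build a pandera schema from the fields of a Data Package table schema."""
    columns = {}
    for field in fields:
        constraints = field.get("constraints", {})
        checks = []
        if "minimum" in constraints:
            checks.append(pa.Check.ge(constraints["minimum"]))
        if "maximum" in constraints:
            checks.append(pa.Check.le(constraints["maximum"]))
        if "pattern" in constraints:
            checks.append(pa.Check.str_matches(constraints["pattern"]))
        columns[field["name"]] = pa.Column(
            TYPE_MAP.get(field.get("type", "string"), str),
            checks=checks,
            nullable=not constraints.get("required", False),
        )
    return pa.DataFrameSchema(columns)
```

A real implementation would have to cover the full range of field types and constraints in the standard, but the translation layer itself can stay fairly small.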