Why jsonschema
As we follow the Data Package standard for structuring metadata, we have to make sure that our metadata files conform to this standard. This post explains why we chose jsonschema for automating the process of checking metadata files and objects against the standard.
Context and problem statement
We follow the Data Package standard for organising metadata. This means that the metadata stored by Sprout in JSON files have to conform to this standard. These metadata files evolve over the life span of a project: they may be generated by Sprout, updated by the user through Sprout, and/or edited directly in a text editor outside of Sprout. In order to avoid inconsistencies and prevent errors, every time we access or process metadata in Sprout, we have to make sure that they still conform to the standard. We therefore need a tool to automate the process of checking metadata against the standard. The question then is:
Which tools are available for this purpose and which one should we use?
Decision drivers
- The Data Package standard consists of requirements and recommendations. The former are expressed formally in a JSON schema, while the latter are presented in a more discursive textual form. An example of a requirement is that a package must include a resource, while a recommendation is that the names of packages and resources should follow a particular pattern. Whether we want to adhere to requirements and recommendations from both of these sources will inform which tool we choose.
- We would like the logic for checking metadata against the standard to be extracted from the standard automatically, as far as possible. This way, we wouldn’t have to transcribe or translate the requirements of the Data Package standard into Python ourselves, which would be both error-prone and liable to fall out of sync with the standard when it is updated.
- We already have a set of classes representing the Data Package standard. Their aim is to allow users to construct metadata more easily via Sprout. Ideally, our new tool will not require us to develop a parallel set of classes, as we would like to avoid having multiple (potentially inconsistent) internal representations of the standard.
- As we treat package, resource and table metadata separately, it would be helpful if we could check these separately against the standard as well.
- Another relevant factor is whether we want to allow blank values in our JSON metadata. On the one hand, it may be helpful for users to see a sort of template JSON file with the most important properties present but empty (e.g. set to the empty string or an empty list) when a data package is first set up. On the other hand, such a JSON file does not, technically speaking, conform to the JSON schema of the standard, as the standard disallows empty lists, for example. As different tools come with different default behaviours, different amounts of customisation work are needed depending on our treatment of blank values.
- Finally, we would like the mechanism underlying the process of checking metadata against the standard to be transparent, reliable, consistent, and as simple as possible.
Considered options
We consider the following options:
- frictionless-py
- Pydantic
- jsonschema
There exist alternatives to both Pydantic and jsonschema, but the pros and cons of these are largely similar for the purpose of this comparison. As Pydantic and jsonschema are the most popular and best-supported tools in their categories, only these are discussed, along with the frictionless-py package, which is developed to follow the Data Package standard.
frictionless-py
frictionless-py is the Python implementation of the Data Package standard by its parent organisation, and as such it would be the obvious choice for our use case. As well as structuring metadata, it has functionality to extract metadata from data, run checks on data and metadata, and transform data. A variety of sources is supported for both data and metadata.
Benefits
- Comes with a mechanism for checking metadata against the Data Package standard out of the box (see the sketch below).
- Includes requirements from both the JSON schema and the text-based description of the Data Package standard.
- Can check metadata without checking associated data.
- Comes with other functionality that is useful to us, such as extracting metadata from data and validating data.
- We will receive any updates or fixes to the standard through frictionless-py automatically.
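For illustration, this is roughly what the out-of-the-box check could look like. This is a sketch only, assuming frictionless-py version 5 and a hypothetical datapackage.json descriptor on disk:

```python
from frictionless import Package

# Load the descriptor and run frictionless-py's built-in checks on it.
# The file name is hypothetical; any Data Package descriptor would do.
package = Package("datapackage.json")
report = package.validate()

print(report.valid)  # True if all checks passed
for error in report.flatten(["type", "message"]):
    print(error)
```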
Drawbacks
- There are minor discrepancies between the JSON schema and the behaviour of the `validate` method. E.g., the JSON schema does not allow empty lists for properties such as `resources` or `licenses`, but frictionless-py does. The schema specifies patterns for all `path`s, but frictionless-py checks these only selectively (e.g. it checks the `path` of a resource, but not of a `license` or `contributor`).
- Needs custom logic to be able to run checks on package, resource and table metadata separately, even though the API suggests that it is possible to filter for these based on error type.
- The system of error codes is not fine-grained enough for our use case. E.g., there's no separate error code for format/pattern violations or for missing required fields.
- Does not always collect all errors in a single round of checks.
- When we use this package alongside our metadata classes, we end up with two parallel representations of data packages in Sprout (even if one of these is not exposed to the user).
- So far, we've had to dig around in the frictionless-py codebase quite a bit to understand some behaviours, and we've found it a bit difficult to navigate.
- The documentation is the least extensive and the library the least widely used of the three options considered.
Pydantic
Pydantic is the most popular library for matching data against a schema in Python. Its basic use case is describing how data should be structured in a Pydantic model and checking an object against this model to confirm that they match. Model requirements are expressed using type hints and the matching behaviour is highly customisable.
Benefits
- As we would need to translate the Data Package standard into Pydantic models in any case, we would be fully in control of deciding what optional recommendations to take from the standard and how to treat blank values.
- We could integrate our existing metadata classes with Pydantic models to avoid having two parallel representations of the standard. This would make it possible to check user-supplied metadata against the standard at the time of object instantiation.
- We can use `datamodel_code_generator` to convert the JSON schema of the standard to Pydantic models automatically. We could then fetch and integrate updates to the standard by regenerating the models.
- It would be easy to check package, resource, and table metadata separately.
- All errors are collected in a single round of checks (see the sketch after this list).
- Error codes are detailed and easy to extract.
- Pydantic is widely used, extensively tested, and has good documentation.
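As a rough illustration of how error collection could look, here is a deliberately simplified, hypothetical resource model checked with Pydantic v2 (a real model would be generated from the standard's JSON schema):

```python
from pydantic import BaseModel, ValidationError

# Hypothetical, heavily simplified model; field names only loosely follow
# the Data Package resource properties.
class ResourceProperties(BaseModel):
    name: str
    path: str

try:
    ResourceProperties.model_validate({"name": 123})
except ValidationError as error:
    # Every problem is reported in one go, each with a machine-readable
    # type, a location and a human-readable message.
    for detail in error.errors():
        print(detail["type"], detail["loc"], detail["msg"])
```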
Drawbacks
- We would need to translate the Data Package standard into Pydantic models to be able to use Pydantic for checking objects. However, we could use `datamodel_code_generator` to automate this process (see Benefits above).
- While integrating our existing metadata classes with Pydantic models would lead to a single representation of the standard in Sprout, our previous experience with generating classes from the schema has shown that the resulting class system is more complex than what we would like. If we choose Pydantic, a similar exercise of simplifying the class system without losing compatibility with the standard would likely be needed. This would lead to losing the ability to receive updates to the standard automatically.
- As Pydantic models have a `schema` attribute, we would need to alias the `schema` attribute of our resource metadata class internally to avoid ambiguity (see the sketch after this list).
- Currently, our metadata classes are allowed to represent partial metadata objects (e.g. with just one field specified). Adding checks directly to these classes at instantiation time could make it difficult to instantiate partial metadata objects, which might not meet the standard in that form.
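A sketch of how that aliasing might look, again assuming Pydantic v2 and hypothetical, simplified fields:

```python
from pydantic import BaseModel, Field

class ResourceProperties(BaseModel):
    name: str
    path: str
    # "schema" clashes with Pydantic's built-in BaseModel.schema() method,
    # so the field is stored as schema_ and aliased to its standard name.
    schema_: dict = Field(default_factory=dict, alias="schema")

resource = ResourceProperties.model_validate(
    {"name": "my-resource", "path": "data.csv", "schema": {"fields": []}}
)
print(resource.schema_)
```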
jsonschema
jsonschema is the most widely used implementation of the JSON Schema specification in Python. Its main use case is checking that a JSON object conforms to a JSON schema.
Benefits
- Can check JSON metadata files against the JSON schema of the Data Package standard directly.
- Matches the schema exactly, e.g., it disallows empty lists.
- Does not introduce a parallel set of classes representing the standard.
- We can receive updates to the standard automatically by downloading the new version of the schema.
- It is possible to collect all errors in a single round of checks (see the sketch below).
- It is possible to filter errors both by source (which field did not match) and by cause (why that field did not match).
- jsonschema is widely used, extensively tested, and has good documentation.
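A minimal sketch of what this could look like, assuming the Data Package JSON schema has been downloaded from the standard's repository (file names are hypothetical):

```python
import json
from pathlib import Path

from jsonschema import validators

schema = json.loads(Path("data-package.schema.json").read_text())
metadata = json.loads(Path("datapackage.json").read_text())

# Pick the validator class matching the draft declared in the schema,
# then collect *all* errors in one pass rather than stopping at the first.
ValidatorClass = validators.validator_for(schema)
errors = list(ValidatorClass(schema).iter_errors(metadata))

for error in errors:
    # json_path points at the offending field (the source); validator names
    # the failed keyword, e.g. "required", "pattern" or "minItems" (the cause).
    print(error.json_path, error.validator, error.message)
```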
Drawbacks
- Does not cover the recommendations from the textual description of the standard.
- Cannot check package, resource and table metadata separately out of the box, but this would be fairly easy to implement.
- Does not provide error codes as such, but it does expose the cause of each error (e.g. a missing required field or a mismatched pattern) in a property separate from the error message.
Decision outcome
We decided to use jsonschema for automating checks on metadata. It is a great match for our use case: it offers a single, focused piece of functionality with a robust implementation, integrates into our codebase with minimal work, and makes it easy to keep Sprout up to date with the Data Package standard.
Although frictionless-py implements the Data Package standard directly, it is less mature and less widely used than jsonschema. We have found some inconsistencies in its verification/validation behaviour and feel that we would need to customise it using somewhat brittle and inelegant workarounds for it to fit into our workflow. Pydantic would have been a fair choice, but it would have involved the most integration and customisation work, resulting in far more complicated code to achieve the same behaviour as the other options.
Consequences
- Any recommendations not included in the schema of the Data Package standard will have to be added (if needed) as custom validators (as sketched below).
- We will need to decide how we want to treat blank values, as some of these violate the schema (e.g. empty lists).
- If we want to keep checks for different types of metadata separate, we will need to write custom logic for this.
- We will have to think about how we want to incorporate the schema of the Data Package standard into Sprout. Although the Data Package standard is hosted as a project on GitHub, it is maintained using npm, so we cannot add it directly as a dependency.
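For example, a recommendation like the suggested naming pattern could be covered by a small custom check run alongside the schema check. A rough sketch (the helper and the exact pattern are illustrative only; the real recommendation should be taken from the standard's text):

```python
import re

from jsonschema.exceptions import ValidationError

# Illustrative pattern: lower-case alphanumeric characters plus ".", "_" and "-".
RECOMMENDED_NAME_PATTERN = re.compile(r"^[a-z0-9._-]+$")

def check_name_recommendation(metadata: dict) -> list[ValidationError]:
    """Return errors for names that do not follow the recommended pattern."""
    name = metadata.get("name", "")
    if name and not RECOMMENDED_NAME_PATTERN.fullmatch(name):
        return [ValidationError(f"{name!r} does not follow the recommended name pattern.")]
    return []
```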