Why Parquet


The format used to store data directly impacts how that data can be used later. Parquet is a format that achieves very good compression, is very fast to process and analyse, and is well integrated into the R and Python ecosystems.

Published

March 10, 2025

Context and problem statement

The core aim of Seedcase is to structure and organise data so that it is modern, FAIR, and standardised, and can be more easily used and shared later on. How we store the data directly affects how it can be used. There is a large variety of file formats, so we need to decide on the one that best fits our needs. So the question is:

What file format should we use for storing data organised and structured by Seedcase software?

Decision drivers

  • Will mainly be used for storing datasets of varying sizes.
  • We don’t do “transactional” data processing or data entry, so we don’t need to worry much about ACID compliance, nor about row-level write speeds.
  • Stores the schema within the format itself, and can also handle schema evolution.
  • Suitable for both local, single-node processing and remote, distributed processing.
  • Has good compressed file sizes.
  • Is relatively simple to use and understand for those doing data analysis, and integrates easily with software like R and Python.

Considered options

Based on our needs, we considered the following options: Parquet, Avro, and SQL (specifically SQLite). Each is described in its own section below.

Commonly used file formats that we don’t consider are:

  • CSV, JSON, and other text-based formats are some of the most commonly used file formats for data. However, their biggest disadvantages are that they don’t store the data schema, they compress poorly, and they are slow to analyse. So we don’t consider them.
  • ORC is a format that is similar to Parquet, but is used mostly within distributed computing and Big Data environments like in Hadoop or Hive systems. We don’t use these technologies, so we are not considering ORC.
  • HDFS, which is a distributed file system commonly used in Big Data environments. We don’t consider it because Hadoop is not a common use case for the problems we aim to solve, and because both Parquet and Avro are already natively supported within HDFS systems.
  • Non-embedded SQL engines (like Postgres or MySQL) because they require a server to use and aren’t stored as a file on a local filesystem.
  • Excel or similar spreadsheet programs, as they aren’t technically open source nor do they store any schema information.

Parquet

Parquet is a column-oriented binary file format: data is stored together by column rather than by row, unlike spreadsheet-style formats such as CSV where each line is a row. The Parquet file format is optimised for compression and storage as well as for very large-scale data processing and analysis. It is designed to handle complex, nested data structures and is natively supported by many big data processing frameworks like Spark and Hadoop.
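
To give a sense of how this looks in practice, below is a minimal sketch of writing and reading a Parquet file from Python, assuming the pandas and pyarrow packages are installed (the data and file names are purely illustrative).

```python
import pandas as pd
import pyarrow.parquet as pq

# Purely illustrative example data
df = pd.DataFrame({"participant_id": [1, 2, 3], "age": [34, 51, 29]})

# Write to a Parquet file (pandas uses pyarrow or fastparquet under the hood)
df.to_parquet("participants.parquet")

# Read it back; column names and types come from the schema stored in the file
df2 = pd.read_parquet("participants.parquet")

# The schema can also be inspected without loading any of the data
print(pq.read_schema("participants.parquet"))
```

The equivalent in R goes through the arrow package’s write_parquet() and read_parquet() functions.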

Benefits

  • Designed for batch data processing, which is how most research data analysis is performed.
  • Stores the schema within the format itself, and can also handle schema evolution.
  • Of the options considered, it has the best compressed file sizes because it stores data by column rather than by row, so similar values sit next to each other and compress much more efficiently than in row-based formats like CSV.
  • Integrates very easily into both R and Python ecosystems for data analysis via popular packages like arrow.
  • Is natively supported by the incredibly powerful in-process analytical SQL engine DuckDB, which is a great tool for data analysis (see the sketch after this list).
  • Has plugins for most common data processing tools, including SQLite.
  • Can handle complex, nested data structures.
  • Designed to handle datasets with many columns relative to rows (though still with lots of rows). Much research data falls under this category, for instance -omic type data, where there are often many hundreds of columns and maybe a few thousand rows.
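
As a small illustration of the DuckDB point above, the sketch below queries a Parquet file directly with SQL from Python. It assumes a recent version of the duckdb package and the hypothetical participants.parquet file from the earlier example.

```python
import duckdb

# DuckDB can scan a Parquet file in place, without importing it first;
# the file name is the hypothetical one from the earlier example.
result = duckdb.sql(
    "SELECT age, COUNT(*) AS n FROM 'participants.parquet' GROUP BY age"
).df()
print(result)
```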

Drawbacks

  • Not particularly well designed for inserting new rows into the dataset. This isn’t strictly an issue for our purposes, since we don’t do “transactional” data processing or entry.
  • Is in a binary format, so is not human-readable. This is a drawback for people who want to quickly look at their data.
  • Not very good at row-level scans or lookups, though this isn’t a common use case in research data analysis.

Avro

Avro is a row-oriented binary file format that was designed as a compact and fast format for serialising data within the Hadoop ecosystem. It is also designed as a data exchange format, to move data between systems more effectively.
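
For comparison, here is a minimal sketch of writing and reading Avro from Python. It assumes the third-party fastavro package (the official avro package works similarly); the schema, records, and file name are purely illustrative.

```python
from fastavro import parse_schema, reader, writer

# Avro requires an explicit schema, which gets embedded in the resulting file
schema = parse_schema({
    "type": "record",
    "name": "Participant",
    "fields": [
        {"name": "participant_id", "type": "int"},
        {"name": "age", "type": "int"},
    ],
})

records = [
    {"participant_id": 1, "age": 34},
    {"participant_id": 2, "age": 51},
]

# Write the records row by row, then read them back
with open("participants.avro", "wb") as out:
    writer(out, schema, records)

with open("participants.avro", "rb") as src:
    for record in reader(src):
        print(record)
```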

Benefits

  • Can handle complex, nested data structures.
  • Has the schema stored within the format itself.
  • Very good file size compression.
  • Better at writing new rows to the dataset than Parquet.
  • Has better schema evolution features than Parquet.
  • For individual row lookups, Avro is faster than Parquet since it stores data by row, not column.

Drawbacks

  • Not as fast as Parquet at reading data.
  • Doesn’t have as good compressed file sizes as Parquet because it stores data by row, not column.
  • Isn’t as well integrated into common research analysis tools like R and Python.

SQL (e.g. SQLite)

SQL is the standard query language for relational databases and is used throughout much of the internet. SQLite is a popular embedded SQL database engine used in many applications. An SQLite database is file-based, meaning the entire database is contained in a single file.
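
To illustrate what the file-based, embedded nature of SQLite looks like in practice, here is a minimal sketch using Python’s built-in sqlite3 module (the table, column, and file names are purely illustrative).

```python
import sqlite3

# The whole database lives in this single file on disk
con = sqlite3.connect("participants.sqlite")

con.execute(
    "CREATE TABLE IF NOT EXISTS participants (participant_id INTEGER, age INTEGER)"
)
con.executemany(
    "INSERT INTO participants VALUES (?, ?)",
    [(1, 34), (2, 51), (3, 29)],
)
con.commit()

# Querying goes through SQL, unlike reading a Parquet file straight into a data frame
for row in con.execute("SELECT age, COUNT(*) FROM participants GROUP BY age"):
    print(row)

con.close()
```

Compared to the single read_parquet() call in the earlier sketch, this already involves some SQL and connection management, which is the extra integration overhead mentioned in the drawbacks below.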

Benefits

  • Is a well established, classic relational, embedded SQL database format.
  • Very fast at reading and writing data, though not as fast as Parquet or Avro for analytical reads.
  • Can handle unstructured data (e.g. stored as JSON or blob columns).
  • Integrates with a wide range of software tools.
  • Great for row-level scans, insertions, and lookups. However, this isn’t a common use case for our purposes.

Drawbacks

  • Is a relational database, which requires some knowledge of SQL to use.
  • It is row-oriented, so isn’t as good a format for processing or analysing data compared to Parquet.
  • Doesn’t have as good compressed file sizes as Parquet or Avro.
  • Doesn’t integrate well with the Frictionless Data Package standard.
  • While it can integrate with R and Python, it does require some additional knowledge and code compared to using a file format like Parquet.

Decision outcome

We’ve decided to use the Parquet file format for storing our data as it is the best file format for our needs, which are mainly storing and processing large amounts of data for research purposes.

Consequences

  • Since Parquet is not a text-based format, it is not human-readable. This means that people who want to have a quick look at their data won’t be able to open the file directly. A consequence is that Parquet won’t be ideal for some use cases, so people with smaller datasets may choose not to use our solutions.

Resources used for this post