Why Python for data engineering

develop

process

script

Our core, value-driven work revolves around engineering data in various ways and developing data into data packages. This decision post explains why we use Python for building data packages and for developing libraries of common functionalities that simplify building data packages.

Published

June 1, 2026

Context and problem statement

Our core, stream-aligned work is creating, building, and continually developing data packages. Our data engineering tasks revolve around designing, building, and processing data from various unclean sources into a final, structured and tidied data package. This data package can then be used for research analysis. Using an effective and powerful language can greatly simplify this type of work.
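As a rough illustration of this kind of work, the sketch below tidies a few messy records using only Python’s standard library. The column names (`id`, `height_cm`), the sample values, and the `tidy_rows` helper are hypothetical and are not taken from any of our data packages.

```python
import csv
import io

# Hypothetical raw data: stray whitespace, a missing value, and an empty record.
RAW = """id, height_cm
 1 , 172.5
2,
,
3,168
"""


def tidy_rows(text: str) -> list[dict]:
    """Parse raw CSV text into cleaned records with typed values."""
    # skipinitialspace drops the spaces that follow each delimiter.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for record in reader:
        # Strip any remaining whitespace; treat absent values as empty text.
        cleaned = {key: (value or "").strip() for key, value in record.items()}
        if not cleaned["id"]:
            # Drop records that have no identifier.
            continue
        height = cleaned["height_cm"]
        rows.append(
            {
                "id": int(cleaned["id"]),
                # Keep missing measurements as None rather than guessing.
                "height_cm": float(height) if height else None,
            }
        )
    return rows


tidy = tidy_rows(RAW)
```

The record without an identifier is dropped, and the missing height is kept as `None` so that later steps can decide how to handle it.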

We previously decided to use Python both for data engineering work and for developing all of our software tools, which meant we compared languages that could do both of those things rather than just one of them. However, our needs for building software tools have changed, especially as we’ve started to build more command-line and web-based tools. Because of this, we reconsidered Python for building software tools and switched to Rust instead. We still need a language suitable for data engineering, and with this narrower focus we can ask the question:

Which programming language is best suited for data engineering work, including building libraries to support that work?

Decision drivers

We need a language that:

Can be used for data engineering directly and for building libraries supporting data engineering workflows that can be used across projects. A library is a collection of functionality that can be installed and used across different projects. A library is always a package, but a package isn’t always a library. A package could be a set of command-line tools, but not contain any library functionality. We want to be able to build library packages to support and simplify data package development, which has different requirements from building non-library packages.
Has a strong ecosystem of library packages for data engineering. Unlike when we build software tools, we don’t need to worry as much about minimizing the number of dependencies we use.
Is widely used in the data science and data engineering communities, both so that people in those communities can contribute to our projects and so that they can more easily review or reuse our work.
Has powerful and fast packages for data engineering.

Considered options

Our previous post was about what language to use for all purposes, not just data engineering. Because of that, we compared a wider range of languages, even if they weren’t very well suited for data engineering. For example, we compared Java and C++ alongside R and Python. For this decision post, we are focusing firmly on languages that are well suited to data engineering and that have decent capabilities for building library packages to simplify that work. Based on our drivers and this focus, we are only comparing two languages:

R
Python

There are other languages that match some of our drivers but not all of them:

SQL is a powerful language for data engineering and processing. However, it is not a general-purpose programming language and doesn’t have the capabilities for building library packages to support data engineering work, so we are excluding it from consideration.
Scala is a commonly used language in data engineering, but is nearly unknown in data science and the research community.
Julia is a fairly new language that is mainly designed for data science, with powerful data engineering capabilities. But it isn’t very popular or widely adopted. Its primary aim is mathematical computation and statistical analysis rather than data engineering tasks.

Although we decided on using Rust for building software tools, we are not considering it for data engineering work. The software tools we build are used for data engineering purposes, but they don’t do the data engineering work themselves. This post is about choosing a language for the data engineering work itself, which is different from building software tools that support that work. These are some reasons why we don’t consider Rust:

It isn’t widely used in the data science and data engineering communities.
There isn’t yet a strong ecosystem of library packages for data engineering work (aside from Polars).
Rust is a lower-level language with high-level ergonomics, with strict and enforced types and a memory ownership model. These features all require substantially more and deeper knowledge of how computers operate compared to other languages like Python and R. For developing robust software, these are all good things. But for noncritical data engineering tasks that don’t need high-performance features, these extra features and restrictions make it difficult to write code to build up data packages quickly and effectively. It also doesn’t have a built-in REPL, and a REPL is what makes data engineering tasks interactive and supports iterative feedback loops.

R

R is an open-source programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data analysts for developing statistical methods and doing data analysis.

Benefits

Aside from Python and SQL, R is one of the most commonly used open-source languages in data science and the research community.
The community is massive, friendly, and supportive and there is a large ecosystem of packages for a wide range of tasks and purposes.
It integrates very well with many powerful data engineering tools, such as Arrow and DuckDB, and can easily connect with databases via DBI.
There are amazing and beginner-friendly data processing and analysis tools in R via the tidyverse, which provides a consistent and user-friendly interface to most data engineering tasks, including through tools like those mentioned above.
It has an effective and powerful library package development system that makes it easy to quickly build and share packages.
It has an extremely flexible programming environment, where there are few rules on how things are done. It was designed in a way to do analysis interactively in the R console, rather than to write scripts that are run from start to finish. This means it is very easy to get started, write code that works, and get stuff done quickly.

Drawbacks

Package dependency management isn’t really built into the language, so pinning specific versions of packages used is quite difficult and depends on external tools like renv (which itself has a number of limitations).
It is not as commonly used within the data engineering community, likely because it mainly came from statistics and analysis rather than from data engineering.
It isn’t as well designed for building packages as some other languages, though it is fairly similar to Python in this respect (see Python’s drawbacks for how the two differ). For example, it doesn’t have the capability to integrate continuous delivery processes for distributing packages. Nor are the checks that CRAN (the package registry) uses publicly available, which makes it very difficult to build packages that pass the CRAN checks without first uploading the package to their system and having them run their checks.
Running local CRAN checks (which aren’t exactly the same as the official checks) is very slow, which also makes continuous integration checks slow.
The flexibility of its programming environment is also its weakness, as it can lead to unexpected issues and behaviours if you don’t know the language well enough to avoid those pitfalls. Because of this, writing code that is robust, reliable, reusable, and reproducible is a bit harder than in other languages. However, Python and R are similar in this regard, each in its own way, as they are both flexible. See Python’s drawbacks below.

Python

Python is an interpreted, high-level, general-purpose programming language. Python’s design philosophy emphasizes code readability, for example by using whitespace as part of its syntax. It supports an object-oriented approach and is often described as “the second-best language for everything” and as the “glue” between other languages and environments. This means that it can do a lot of different things, even if it isn’t the best at any one of them. This makes it a flexible language for a wide variety of tasks.

Benefits

It has a massive community with a large number of packages for many different programming tasks, including data engineering tasks.
It is widely used in the data science, machine learning, and data engineering communities.
Many tools have some bindings or connections to Python. For example, SQL databases can be easily connected to Python with packages like SQLAlchemy, and many data engineering tools have Python bindings, such as Arrow, DuckDB, Spark, and Polars.
It has packages for easily connecting to different sources, such as web APIs and other programming languages.
It has a lot of programming features from different paradigms, which means it is flexible in how it can be used and written.
Like R, it is dynamically typed, which means you don’t need to declare types when writing code, so it can be easy to write something that works (or at least seems to) and get stuff done quickly.
Its package development features are quite good, especially when paired with tools like uv, and distributing packages through PyPI can be done with a single command. It can also integrate easily into continuous delivery pipelines.
Because it is used in so many different industries and communities and has so many packages, there is a robust ecosystem of tools to make it easier to build stable and reliable packages (at least compared to R). For example, the relatively recently created tools uv and Ruff have made it much easier to develop packages.
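To give a sense of how little code a database connection needs, here is a minimal sketch. It uses the standard library’s `sqlite3` module as a stand-in for a package like SQLAlchemy so that it runs without installing anything. The `measurements` table and its values are made up for illustration.

```python
import sqlite3

# An in-memory database, used here only so the sketch is self-contained.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE measurements (id INTEGER, height_cm REAL)")
connection.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, 172.5), (2, None), (3, 168.0)],
)

# Let the database do the aggregation, then pull the result into Python.
# COUNT(column) and AVG(column) both skip NULL values.
count, mean_height = connection.execute(
    "SELECT COUNT(height_cm), AVG(height_cm) FROM measurements"
).fetchone()
connection.close()
```

The same pattern of connecting, querying, and fetching carries over to other database packages, though their exact function names differ.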

Drawbacks

The documentation for different packages is of inconsistent quality, even within the official Python documentation. This can make it difficult to effectively use some functionality or packages.
Like R, Python’s flexibility is also its weakness, as it doesn’t have strong guardrails when writing code. This means it can be harder to know if the code you wrote is actually doing what you think it’s doing, even if it runs and outputs something.
While there are lots of packages to help with package development, it is still a fragmented ecosystem and experience for building packages, even when compared to R.
Its multiparadigm nature is also a drawback, as it can lead to a lot of different ways to do the same thing and it doesn’t implement any one of those paradigms very well. So the coding experience can feel a bit fragmented, especially when trying to use different packages that use different paradigms.
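The lack of guardrails can be shown with a small sketch. The values below are hypothetical and stand in for numbers read from a text file, where everything arrives as a string. Both calls to `max` run without errors or warnings, but only the second gives the intended answer.

```python
# Hypothetical values as they would arrive from a CSV file: all strings.
raw_ages = ["9", "10", "100"]

# Runs without complaint, but compares the text character by character,
# so "9" sorts after "10" and "100".
largest_as_text = max(raw_ages)

# Converting to integers first gives the intended numeric comparison.
largest_as_number = max(int(age) for age in raw_ages)
```

Here `largest_as_text` is `"9"` while `largest_as_number` is `100`, and nothing in the language flags the first result as suspicious.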

Decision outcome

While both R and Python provide a lot of power and are similarly capable of doing both data engineering work and building library packages to support that work, we’ve decided to use Python. One of the two main reasons is that Python is more widely used in the data engineering community, so we expect it to have a stronger ecosystem of data engineering tools over time. The other reason is that Python’s package development experience, while not great, is better than R’s. This is primarily because of the tools now available, like uv, and because it is much easier to distribute packages through PyPI than through R’s CRAN system (even if the CRAN system might ensure higher-quality packages overall).

Consequences

We don’t foresee any major consequences to this decision, especially since we’ve already been using Python for the data engineering work and haven’t encountered any major issues with it in that area.