Why poly-repo

organization
management
Our reasons for structuring projects (both software and websites) in a poly-repo rather than mono-repo style.
Published

December 5, 2023

Context and problem statement

The core issue and question here is:

How do we decide to structure and organize our projects, both software products as well as documentation and training material?

We initially set out to build a single final software product that can be installed on servers and used as is. However, some components of our software could be useful on their own, so we’ve started building another product in its own Git repo. This has led us to the question above, since we don’t want to start doing something major without first considering why we are doing it and what the impact might be, and then coming to a conscious and agreed-upon decision.

We will also be working on side projects for Steno Aarhus and on other potential projects that are related, but not connected, to the core Seedcase product. For example, we will be helping design and build CPR validation checks or volunteer databases. Since we will be creating multiple repositories/projects, a streamlined approach and workflow for creating and developing projects across multiple repositories would make them easier to manage.

See some discussion of the beginning of this issue here.

Decision drivers

  • Tracking and managing projects:
    • Looking over the list of issues in any given project right now can feel overwhelming because of the number of issues (the list will only grow). Trying to look through the list to find the ones relevant to the component of the project you are working on can lead to a bit of “analysis paralysis”.
    • Tracking issues in sub-projects/components within a single repo means making heavy use of custom labels, which can be a challenge from a management point of view.
    • I’d like to eventually get to a place where we can dedicate a chunk of time to working on one specific project. My inspiration for how to do that comes from the tidyverse team at RStudio/Posit. They let issues build up in an R package/project over time before dedicating time to working through as many of those issues as possible. Once done, they switch to another project that has a lot of issues.
  • Setting up continuous integration and deployment: We haven’t yet set up CI/CD, but a lot of standard templates are based on a “one repo is one package/app/product” approach (for instance, CI workflows for testing and building Python packages or deploying Docker images to Docker Hub).
  • We’re already splitting projects into other repos.
    • We’re building a separate seedcase-registry product, which is the data project registration component of seedcase. This product will be imported/loaded into seedcase.
  • Open source projects usually follow a “one repo is one product/output” format (e.g. an R or Python package). Contributors will likely be people who have experience working in open source communities and projects.
  • We are starting to create, and will continue to build, multiple products, both within the overall aim of the Seedcase Project and as sub-projects at Steno and potential (independent) extensions to Seedcase products.
  • We will be providing common repository templates, build processes, and CI/CD for Steno Aarhus (and others in the future), so we’ll need to build these anyway.
  • We need to consider the workflow for the team over the long term. Some decisions make sense in the short term but not in the long term.

Considered options

There are multiple existing posts about this exact issue of mono- vs poly-repos, which I am listing below, that helped me write up this decision.

Almost all sources say that which approach you choose depends on your own situation.

Mono-repo

A mono-repo is where all code for a product is kept and developed in one Git repository. Many large companies like Google and Facebook use a mono-repo approach to developing their software products. This approach looks a bit like this, where each project is a folder under the main repo:

main/
├── .git/
├── project1/
│   ├── tests/
│   ├── docs/
│   ├── src/
│   └── build
├── project2/
│   ├── tests/
│   ├── docs/
│   ├── src/
│   └── build
└── project3/
    ├── tests/
    ├── docs/
    ├── src/
    └── build

Pros:

  • It works quite well when the core software service will inevitably be deployed and used as a single service, for instance with Google’s Search Engine.
  • It’s a bit easier for smaller teams to use a mono-repo approach since it allows the team to move a bit faster in developing the product.
  • Project management and issue tracking all happens in one location, so it’s easier for a dedicated coordinator or manager to track the project.
  • Components can be tightly coupled and easily updated within a mono-repo.
  • A single codebase is enticing, since it theoretically makes it easier to get onboarded and to manage and track progress on a project.
  • All code is in one location, so finding code might be easier.

Cons:

  • Paradoxically, it is harder for smaller teams to manage the complexity of a mono-repo because of the reasons below.
  • Within the open source world, mono-repos are not common, so contributors might not know how to navigate the repo.
  • Tend to require custom-built CI/CD tooling rather than making use of the many open source templates available.
  • Versioning of the software happens all at once, so a change in one component requires a version update for the whole product, even though the other components haven’t changed.
  • Project and issue management all happens in the same repo, so for a small team with many tasks to manage, it becomes overwhelming to focus on what needs to be done.
  • If one component could be used as an independent product, it would have to be split out of the mono-repo, otherwise people would have to install the whole product just to use the small component they actually need or want.
  • Deploying and testing may take longer because you have to deploy and test the whole repo, so if a small change was made, that would trigger long deployment and testing times.
  • Effective management of a project may require more complex git branching processes.
  • It’s more difficult to manage ownership of code on a per-directory level.
  • If a bug or conflict occurs, it breaks the whole product.
  • As a codebase grows, the complexity involved in managing a mono-repo can increase substantially.
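
To make the testing and tooling points above concrete, here is a minimal sketch of the kind of custom logic a mono-repo tends to need: mapping changed files to the projects that must be re-tested. The helper name `affected_projects` is hypothetical, and the folder names are taken from the example layout above. This is an illustration under those assumptions, not a description of any particular CI tool.

```python
from pathlib import PurePosixPath


def affected_projects(changed_paths):
    """Return the sorted top-level project folders touched by the changed files.

    Hypothetical helper: assumes each project is a top-level folder of the
    mono-repo, as in the example layout.
    """
    projects = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Files directly under the repo root (e.g. README.md) belong to no project.
        if len(parts) > 1:
            projects.add(parts[0])
    return sorted(projects)


changed = [
    "project1/src/app.py",
    "project1/tests/test_app.py",
    "project3/docs/index.md",
    "README.md",
]
print(affected_projects(changed))  # ['project1', 'project3']
```

In a poly-repo this mapping is implicit, since every change in a repository belongs to exactly one project.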

Poly-repo

(also known as multi-repo)

While many large companies use a mono-repo approach, there are also large companies who use a poly-repo approach, like Amazon. Within the open source world, poly-repos are extremely common. For instance, the tidyverse, ROpenSci, or Gen3 teams develop dozens of packages, and their teams are quite small.

This structure looks a bit like:

project1/
├── .git/
├── tests/
├── docs/
├── src/
└── build
project2/
├── .git/
├── tests/
├── docs/
├── src/
└── build
project3/
├── .git/
├── tests/
├── docs/
├── src/
└── build

In many ways, the pros and cons are the reverse of those for the mono-repo, but there are also other considerations included here.

Pros:

  • Testing and deployment per unit of change in code is faster.
  • Versioning of components is easier since each component is its own repo.
  • Project management on a repo level is easier, since the issues and progress to track are fewer and more focused.
  • Standard open source templates for CI/CD and other basic repo and build files can be used.
  • Onboarding for contributors is easier if they come from the open source community.
  • Open source projects very often follow this approach, so it’s easier to draw inspiration and learn how other projects do things.
  • Using version control, managing the commit history, and working with pull requests can be easier, since they will be specific to the project.

Cons:

  • Creating another project requires time and setup.
  • Project management at an organization level is a bit more challenging, since there are now multiple repos to manage rather than one.
  • Onboarding of team members can be trickier because there are now more repos to consider and keep mental track of than before.
  • When something changes in one repo, managing its impact on other repos might be tricky if strict de-coupling is not managed well and architectural designs aren’t followed or developed soon enough.

Hybrid

In general, whether you decide on following a mono-repo or poly-repo approach, there is always some level of mono- or poly- structure in the codebase. It is a bit of a spectrum and no project is truly at either end. However, explicitly deciding on a hybrid approach doesn’t seem to offer any benefits.

Decision outcome

Ultimately, we decided on a poly-repo approach because we’ll need to build workflows and processes for developing multiple repositories simultaneously anyway. It also gives us easier project-level management and lets users install individual components. Finally, it fits our ideal team workflow of working on individual projects in rotations.

Consequences

  • We will need to develop templates for different project types (website, Python package, R package, Django app).
  • We’ll need to connect and sync common files across projects.
  • We’ll have to learn and apply team-based workflows around using this approach.
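
As a rough illustration of what keeping common files in sync could involve, the sketch below reports which repositories have a copy of a shared file that is missing or differs from a template copy. The function names, the idea of a local template folder, and the example file names in the usage note are all assumptions made for illustration. They are not an existing Seedcase tool or workflow.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_drift(template_dir, repo_dirs, common_files):
    """Return (repo, file) pairs whose copy is missing or differs from the template.

    Hypothetical helper: assumes the template and every repo are checked out
    as local folders, and that common files use the same relative path in each.
    """
    template_dir = Path(template_dir)
    drift = []
    for repo in repo_dirs:
        for name in common_files:
            source = template_dir / name
            target = Path(repo) / name
            # A missing file counts as drift, as does any difference in content.
            if not target.exists() or file_digest(target) != file_digest(source):
                drift.append((str(repo), name))
    return drift
```

A call such as `find_drift("template", ["project1", "project2"], ["LICENSE.md"])` would return an empty list when every copy matches the template.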