Working with a large Git monorepo can introduce continuous integration challenges, but there are some techniques you and your team can employ to overcome common hurdles.
What is a Monorepo?
A monorepo is a single repository that holds the code for many projects and services. As an example, let’s look at the numbers behind the repository at Oscar Health, a full-stack health insurance company. The organization’s Git monorepo contains:
- 100,000 files
- 19 million lines of code
  - 5 million lines of Python & SQL each
  - 770,000 lines of Go
- Microservices
  - 2,300 binary targets
  - 12,000 library tests
  - 6,100 test targets
- CI/CD
  - 5,000 deployments a month
  - 4 million tests a month
  - 15TB of artifacts per month
Continuous Integration Challenges with Monorepos
As large monorepos continue to grow over time, teams can experience continuous integration (CI) challenges. Running the entire test suite every day, for example, is no longer an option.
Common continuous integration challenges with monorepos include:
- Calculating changed targets on a Git diff (or pull request)
- Caching on remote CI servers:
  - Git cloning an entire monorepo and fetching related third parties from scratch on remote CI servers takes longer as your repo grows
  - You must figure out caching techniques to ensure the large objects can be accessed on the remote CI server
Calculating Changed Targets
The first step is finding the changed files in your Git monorepo. Next, you need to find all the test and binary targets that transitively depend on those changed files and retest them. The latter step can get complex.
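The first step can be sketched in Python as below. This is a minimal illustration, not the tooling any particular team uses: the function names and the default `origin/main` base branch are assumptions, and it only relies on the standard `git diff --name-only` command.

```python
import subprocess
from typing import List


def parse_changed_files(diff_output: str) -> List[str]:
    """Turn `git diff --name-only` output into a sorted list of unique paths."""
    return sorted({line for line in diff_output.splitlines() if line})


def changed_files(base: str = "origin/main", head: str = "HEAD") -> List[str]:
    """List files changed on `head` since it diverged from `base`.

    The default branch name is an assumption; use whatever your repo calls it.
    """
    # The three-dot form compares `head` against its merge base with `base`,
    # so unrelated commits that landed on `base` are not reported.
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_changed_files(result.stdout)
```

Calling `changed_files()` inside a checkout returns the paths touched by the current branch, which then feed the dependency calculation.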
A structure for calculating changed targets for a monorepo can look something like this:
You have four targets: A -> B -> C <- D (A depends on B, B depends on C, and D depends on C).
Thus, a change to file C affects targets A, B, C, and D; a change to file B affects targets A and B; a change to file A affects only target A; and a change to file D affects only target D.
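One way to compute this is to invert the dependency graph and walk it breadth-first from each changed target. The sketch below assumes, as the example does, that each file maps to the target of the same name; `DEPS`, `reverse_graph`, and `affected_targets` are illustrative names, and a real build tool would supply the graph.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

# Direct dependencies from the example: A depends on B, B on C, D on C.
DEPS: Dict[str, List[str]] = {"A": ["B"], "B": ["C"], "C": [], "D": ["C"]}


def reverse_graph(deps: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each target to the set of targets that depend on it directly."""
    rdeps: Dict[str, Set[str]] = {target: set() for target in deps}
    for target, direct in deps.items():
        for dep in direct:
            rdeps.setdefault(dep, set()).add(target)
    return rdeps


def affected_targets(deps: Dict[str, List[str]], changed: Iterable[str]) -> Set[str]:
    """Return the changed targets plus everything that transitively depends on them."""
    rdeps = reverse_graph(deps)
    affected = set(changed)
    queue = deque(affected)
    while queue:
        current = queue.popleft()
        for dependent in rdeps.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(affected_targets(DEPS, ["C"])))  # ['A', 'B', 'C', 'D']
print(sorted(affected_targets(DEPS, ["B"])))  # ['A', 'B']
```

Each target is visited at most once, so the walk stays linear in the size of the graph even when many changed files share dependents.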
As an example, in the Oscar Health monorepo, every build or diff can potentially have 10,000 targets that need to be tested, but on average they only need to test a few hundred using this calculation model.
Caching on Remote CI Servers
As you work with a monorepo, two types of objects will continue growing linearly over time: the monorepo itself and the third parties defined in the monorepo.
Techniques for caching these types of large objects can differ depending on the remote CI server type, whether long-running or ad hoc, with the latter being more complicated.
For long-running servers, you can employ the following strategies:
- Clone the monorepo once and use Git commands to fetch new changes
- Sync third parties when necessary
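The clone-once, fetch-afterwards strategy might look like the following sketch. The function names, the `origin` remote, and the paths are assumptions for illustration; the Git commands themselves (`clone`, `fetch`, `checkout --force`, and the `-C` option) are standard.

```python
import os
import subprocess
from typing import List


def sync_commands(repo_url: str, workdir: str, commit: str,
                  already_cloned: bool) -> List[List[str]]:
    """Build the Git commands that bring `workdir` to `commit`."""
    if already_cloned:
        # Reuse the existing clone and download only the new objects.
        commands = [["git", "-C", workdir, "fetch", "origin"]]
    else:
        # First run on this server: pay for the full clone once.
        commands = [["git", "clone", repo_url, workdir]]
    # Check out the commit under test, discarding leftover local changes.
    commands.append(["git", "-C", workdir, "checkout", "--force", commit])
    return commands


def sync(repo_url: str, workdir: str, commit: str) -> None:
    """Run the commands, cloning only when no checkout exists yet."""
    already_cloned = os.path.isdir(os.path.join(workdir, ".git"))
    for command in sync_commands(repo_url, workdir, commit, already_cloned):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes the logic easy to test without touching a real repository. Syncing third parties would follow the same pattern, rerunning the dependency install only when its manifest changes.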
Ad hoc servers autoscale by nature, so you will need to be more thoughtful about using them in your workflow and will need to use caching techniques.
One step you can take with ad hoc servers is to build a Docker image in the middle of the night: start a container from the base package, clone the monorepo, and fetch all third parties. Next, you commit that container to obtain an image and then cache the image.
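A nightly job along these lines could be sketched as a list of Docker CLI commands. Everything specific here is an assumption made for illustration: the function name, the container name, the `/workspace` path, the idea that the base image already has Git installed, the `deps_command` placeholder for however your repo fetches third parties, and pushing to a registry as the way the image gets cached.

```python
import shlex
from typing import List


def nightly_image_commands(base_image: str, repo_url: str, deps_command: str,
                           image_tag: str,
                           container: str = "ci-cache-build") -> List[List[str]]:
    """Build the Docker commands for a nightly pre-warmed CI image."""
    # Runs inside the container: clone the monorepo, then fetch third parties.
    setup = (
        f"git clone {shlex.quote(repo_url)} /workspace"
        f" && cd /workspace && {deps_command}"
    )
    return [
        # Start from the base image and populate the container.
        ["docker", "run", "--name", container, base_image, "sh", "-c", setup],
        # Snapshot the populated container as an image.
        ["docker", "commit", container, image_tag],
        # Publish the image so freshly started servers can pull it.
        ["docker", "push", image_tag],
        # Remove the stopped build container.
        ["docker", "rm", container],
    ]
```

A server that starts from this image then only needs to fetch the commits made since the nightly build, rather than cloning from scratch.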