The benefits of using a version control system are numerous. If you are new to this, or not yet convinced, you may want to read this article. Assuming you have embraced a version control system, you store all your code, documentation and other files in a repository. Perfect, but what can you expect as the years go by? Both the size of the software and its history grow. As the size increases, so does the time to clone the repository. But why bother, you only clone it once, right?
Maybe, but what about new employees? Or if your computer breaks down? All of these are one-time costs, after which many years may go by before you have to pay them again. But what about your build and test infrastructure? Depending on the approach, the cost differs:
- The build/test starts from a clean state: a VM that has just booted and needs to clone the repository.
- The build/test starts from a clean state, but with an attached disk that holds the repository. Pulling the latest changes may be required, but that cost is negligible.
- The build/test has a local repository, as the machine does not revert itself to a clean state. Only pulling the latest changes is required.
The first approach suffers the most from a big repository, but it is the most agile and scalable. The second works well if all machines work on the same branch or do not run in parallel, but it will hit its limits fast. The last approach does allow parallel execution, but it requires a persistent disk for each machine. This quickly adds up to a couple of terabytes, which increases the cost of maintaining your infrastructure. Not to mention that if you plan on running your builds/tests in the cloud, the cost of storing all that data will rapidly exceed your budget. Is there an alternative to this huge repository, or is it a necessary evil?
Cloning a repository can be sped up by avoiding cloning everything. The amount of history that is cloned can be limited both horizontally (the history of a branch) and vertically (the number of branches). Git, for instance, allows you to specify a depth when cloning, so instead of cloning the entire repository we can clone just a single version. This does not solve everything, as it does not allow you to push changes. An alternative is to fetch the entire history of just one branch; while this is more data than a single version, it allows two-way traffic.
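Both variants can be demonstrated with a few Git commands. This is a minimal sketch using a throwaway local repository as a stand-in for your real server; all paths and commit messages are placeholders:

```shell
# Demo setup: a throwaway repository with three commits, standing in
# for the big server-side repository.
tmpdir=$(mktemp -d)
git init -q "$tmpdir/big-repo"
for i in 1 2 3; do
    git -C "$tmpdir/big-repo" -c user.email=demo@example.com -c user.name=demo \
        commit -q --allow-empty -m "commit $i"
done

# Horizontal limit: a shallow clone fetches only the latest commit.
# (file:// is needed; plain local paths ignore --depth.)
git clone -q --depth 1 "file://$tmpdir/big-repo" "$tmpdir/shallow"
git -C "$tmpdir/shallow" rev-list --count HEAD     # prints 1, not 3

# Vertical limit: clone the full history, but of one branch only.
git clone -q --single-branch "file://$tmpdir/big-repo" "$tmpdir/one-branch"
git -C "$tmpdir/one-branch" rev-list --count HEAD  # prints 3
```

The shallow copy knows only one commit, while the single-branch copy has the branch's full history and can therefore push changes back.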
This solves the problem for your builds and tests, as cloning a single version should be enough; I have never encountered a situation where a build or test has to make changes to the repository and push them back. To save disk space locally, you can remove branches you don't need. It is, however, preferred (or sometimes even required) to have a single server that contains all the information, acting as a backup and a single source of truth. If the size of the repository becomes too large even there, the only option left is to throw away some history.
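Removing unneeded branches from a local clone is a one-liner. A small sketch, again using a throwaway repository; the branch name old-feature is a placeholder for whatever you no longer need:

```shell
# Demo: a clone with a merged feature branch we no longer need.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "initial"
git -C "$repo" branch old-feature

# Delete the branch; -d refuses if it were not fully merged
# (use -D to force-delete unmerged work).
git -C "$repo" branch -d old-feature
git -C "$repo" branch --list old-feature   # prints nothing: it is gone
```

Note that this only affects the local clone; the server still has everything, which is exactly the backup role described above.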
Just as with cloning, throwing away history can be done both horizontally and vertically: removing old commits and deleting branches that have been merged into master, respectively. Both should be done with care, to avoid deleting history that is still useful. It is hard to determine when something is no longer relevant, as this differs from project to project. Factors to take into account include the speed at which the code changes and the need for traceability.
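Deleting already-merged branches can be automated safely, because `git branch -d` refuses to delete anything unmerged. A sketch on a throwaway repository; branch names are placeholders:

```shell
# Demo: a trunk plus one merged and one unmerged branch.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "initial"
trunk=$(git -C "$repo" symbolic-ref --short HEAD)   # master or main
git -C "$repo" branch merged-work                   # points at trunk: merged
git -C "$repo" checkout -q -b unmerged-work
git -C "$repo" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "work in progress"
git -C "$repo" checkout -q "$trunk"

# List branches already merged into the trunk, skip the current branch,
# and delete the rest; -d (not -D) protects anything unmerged.
git -C "$repo" branch --merged "$trunk" | grep -v '^\*' \
    | xargs git -C "$repo" branch -d
```

After this, merged-work is gone while unmerged-work survives. Removing old commits is more invasive: it rewrites history, and is usually done with a dedicated tool such as git filter-repo rather than plain Git commands.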
After exploring ways to minimize the size of a single repository, an alternative is to use multiple smaller repositories. At first sight this might be the answer to everything: if it becomes too big, just cut it into smaller pieces. But this only helps if you do not always need to clone all repositories. Instead of one build, you will have a separate build for each repository, which saves time not only when cloning but also when building: instead of compiling all the code, only the small part that actually changed needs to be compiled.
A new problem pops up, however: how do you determine the number of repositories, and what do you put where? This is crucial, as it determines how much time can be saved, if any at all. Since most companies offer either multiple programs, a single program with different components, or different microservices, it makes sense to have a repository for each of them, plus an extra repository containing common functionality. On top of saving time, this also enables you to check and block dependencies between components that should not exist. A downside, however, is that you have to decide where to put each piece of code: can it be reused by different components or not? Teams that work on different components cannot (and should not have to) look into repositories of components they do not work on; they only see what is in the shared repository.
While splitting up a huge repository into smaller ones may solve the problem of long cloning times, cloning a subset of the repository solves it as well, so long cloning times alone are not a reason to split. The real benefits of multiple repositories lie elsewhere: faster builds, restrictions on dependencies between components, and separate history pruning per repository. The last of these enables you to keep the full history of a critical repository while keeping only a short history of fast-changing or old code, depending on your needs. The combination of multiple repositories, selective cloning and throwing away obsolete history will prevent your repository from becoming a burden.