Developing games that have thousands of users and deliver constant updates is a complex endeavor. Beyond just working on new content and features, it also involves optimizing processes so that everything goes smoothly while allowing developers, analysts, and QA engineers to constantly receive new builds.
DevOps is a set of practices that helps cut development time and speed up update releases. CI/CD (continuous integration and delivery) is not so much a technology as an entire culture: it lets you make small changes to a game with frequent commits, more often and more reliably, deliver project modules to different departments, and automate their testing. All of this is a layer of work entirely invisible to the players.
Before starting work on War Robots Remastered, we already had a CI/CD pipeline in place for all projects (and the original War Robots was no exception). At that time, project builds took 40–100 minutes on average. But the further we progressed with the remaster, the more problems we had with build speed. Six months in, build times began to reach 3 or more hours, with periodic freezes that could last up to 7–10 hours.
This became a completely untenable situation: QA couldn't check builds dynamically, and developers had to wait just as long to see a result or start profiling. We had to think seriously about how to fix all this and bring the build time back to its original duration. My name is Alexandr Panov, I'm Deputy Technical Officer at Pixonic, MY.GAMES, and today I'll tell you how we made it happen.
The problems we faced in our previous situation
Let's talk about what our application build pipeline looked like before. We used TeamCity from JetBrains as our CI server, and at that time 22 agents that could execute build scripts were allocated to War Robots. They were located on four physical computers, or nodes. The node configurations were similar; an average machine had the following characteristics:
OS: Windows
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor 3.40 GHz
RAM: 128 GB
Disks: SSDs of at least 1 TB, with no more than two agents per disk
For the Unity Cache Server V1 we used:
VM: Linux CentOS 7
RAM: 24 GB
Disk: 150 GB
Additionally, we had a Mac Mini installed for iOS builds.
Our agents ran as services in the background, independently of any user, rather than as separate applications. This was done for more flexible agent management; it also let us give each agent its own user with its own rights and registry. Convenient as this turned out to be, it caused a problem that wasn't easy to detect: during builds, Unity suddenly started crashing without any identifying logs. A little searching on the Internet pointed to the cause: the desktop heap.
What was going on? Windows has an unobvious limitation: the desktop heap is shared, so the more processes run at the same time, the faster it overflows. Since we had several Unity builds running in parallel, this greatly accelerated the overflow.
By experimentally increasing the value of HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows, we managed to get rid of the crashes. But you should be careful here: Microsoft doesn't recommend allocating more than 20480 KB of memory to the heap:
If you allocate too much memory to the desktop heap, negative performance may occur. Therefore, we don’t recommend setting a value greater than 20480 KB.
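For reference, the relevant setting is the SharedSection parameter inside that key's Windows value; its third number is the desktop heap size in KB for non-interactive window stations, which is what services (and therefore our agents) use. An illustrative fragment of the value with the third number raised (the string is abbreviated and the numbers are an example, not a recommendation):
%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,2048 ...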
The more graphics presets, the longer the build process. What to do?
As mentioned above, build times had noticeably increased from 40–100 minutes to 3–4 hours, with periodic import freezes of up to 10 hours. This was because we had split up the existing graphics quality presets (and added some new ones). Where previously we had only one preset, Legacy, there were now four: Legacy, ULD (Ultra-Low Definition), LD (Low Definition), and HD (High Definition).
Difference between ULD, LD and HD presets
We decided to start searching for the problem inside the code and began to look into our asset pre-import event handlers.
We use a custom tool that applies various graphics settings based on asset path masks. Thanks to this tool, during import we can apply texture settings to individual assets, to an entire folder, or to selected project directories. On top of that, we had additional custom importers for different asset types, mostly created to work with shaders and materials; a sketch of the general idea follows below.
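To illustrate the path-mask approach (a minimal sketch, not our actual tool; the path and settings here are hypothetical), such a rule can live in an AssetPostprocessor:
using UnityEditor;
// Hypothetical sketch: apply per-directory texture settings during import.
class PathMaskTexturePostprocessor : AssetPostprocessor
{
    void OnPreprocessTexture()
    {
        // "Assets/Content/HD/" is an illustrative path mask.
        if (!assetPath.StartsWith("Assets/Content/HD/"))
            return;
        var importer = (TextureImporter)assetImporter;
        importer.maxTextureSize = 2048;
        importer.textureCompression = TextureImporterCompression.Compressed;
    }
}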
At some point, we realized that the build time was growing significantly. So, to determine the cause, we started looking at the editor and build logs. We discovered the following lines in those logs:
Hashing assets (38480 files)… 66.051 seconds
file read: 37.856 seconds (42992.805 MB)
wait for write: 22.787 seconds (I/O thread blocked by consumer, aka CPU bound)
wait for read: 9.976 seconds (CPU thread waiting for I/O thread, aka disk bound)
hash: 53.298 seconds
These lines don't mean anything bad by themselves; they just show that the system has started hashing assets in order to perform further actions on them. In our case, the problem was that a message like this appeared in the log after every fifth to tenth imported asset, and each full rehash took about a minute, so import time skyrocketed. This usually happens when something in the code calls AssetDatabase.Refresh(): it causes other event handlers to fire again, and Unity has to rehash the entire asset database.
There were a lot of such call sites, since many plugins needed this functionality, so we instrumented them with traces. That's how we found the source of the refreshes: a subscription to AssetPostprocessor.OnPostprocessAllAssets. It turned out that the combination of our own script and GPGSUpgrader (a third-party Google Play file) was what had increased the build time.
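The tracing itself can be as simple as routing refresh calls through a wrapper that logs the call stack (a hypothetical sketch, not our actual instrumentation):
using System;
using UnityEditor;
using UnityEngine;
// Hypothetical wrapper: log who triggers a refresh, so looping callers
// show up in the editor log.
public static class TracedAssetDatabase
{
    public static void Refresh()
    {
        Debug.Log("AssetDatabase.Refresh() called from:\n" + Environment.StackTrace);
        AssetDatabase.Refresh();
    }
}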
The thrill of that discovery didn't last long, though: we quickly found out that the problem hadn't been completely solved. In some cases imports got faster; in others they barely changed. So we began to study our cache servers, because the logs showed that frozen builds often spent a very long time working with assets instead of quickly downloading them and putting them into the Library.
Our next step was to optimize the current Unity cache-server.
A small number of cache servers = heavy loads
When we add an asset to a Unity project, it isn't used as-is: Unity converts it into its own internal formats, separately for iOS and Android. A cache server stores these converted files, so already-processed assets can be requested from it instead of being recalculated. If an asset isn't on the server, it has to be recalculated, and that takes time.
For reference, here are some numbers of our project configuration:
The repository size is more than 150 GB
There are more than 1000 branches total in the repository
The number of files in the project is more than 170,000 (most of which are assets)
When we started working on War Robots Remastered, we used three cache servers for the entire project: one was used by the developers, and the other two served our CI and Dev/Release environments.
The problem was that the project had a very large number of small files; the server couldn't withstand the load and simply couldn't serve them in time. Network errors were rare; instead, errors like “Disk I/O is overloaded” occurred far more often because of the huge number of disk accesses. This is where the 5–7-hour freezes came from: when the cache server was unable to give files to clients, Unity fell back to a full import of the project.
As a result, we decided to increase the number of cache servers for CI to redistribute the load on the disks. We spread the 22 agents across six servers, three per environment: Android-Dev, iOS-Dev, Win-Dev and Android-Release, iOS-Release, Win-Release. This significantly improved the situation. We also ran a 10-gigabit network to the agents. Network problems stopped occurring, and the number of I/O errors dropped significantly.
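On the CI side, pointing each agent's build at the right server comes down to a command-line flag (a sketch: the hostname and the Builder.Build method are hypothetical, while -CacheServerIPAddress is the standard Cache Server V1 flag):
Unity.exe -batchmode -quit -projectPath C:\ci\wr -buildTarget Android -CacheServerIPAddress android-dev-cache:8126 -executeMethod Builder.Build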
But the import time still left much to be desired: on average, importing from the cache server took about an hour and a half. So we decided to experiment with the new version of the cache server, Unity Accelerator.
At the time of our experiments we were using Unity 2018.4, which didn't support Asset Database V2, and we needed it for Unity Accelerator, since the Accelerator's asset storage format is incompatible with Unity Cache Server V1. This meant updating Unity to 2019.4.22f1; only then could we make the transition to the new cache server, which benefited the project. Here's what importing on the iOS platform with the Library folder cleared looked like afterwards:
Unity Accelerator turned out to be quite a viable option: its import time is acceptable, but it can't warm up in a single iteration; several iterations are required to fill the caches.
So, having solved the problem with imports, we began to dig further.
What about bundles?
The next step in time optimization concerned the actual build of bundles within the project.
Quality levels vary across device types. Consequently, almost all of our content is divided into quality packs, each containing the necessary bundles of assets.
We identified two problems:
The problem of reusing bundles from previous builds
The problem of single-threaded bundle builds
The first problem comes from the fact that we produce a fairly large number of builds per day: about 60 (and even more on release days). As a rule, their content doesn't change much: developers are mostly checking their work or testing new features. During the bundle assembly process, all builds are uploaded to external storage.
So the solution was the ability to specify, before the build process starts, an existing build in the repository from which to take ready-made bundles, as well as to take bundles from the last successfully built configuration of the same branch almost automatically. Organizing local caching of bundles on the agent nodes then let us give up assembling bundles in a given build altogether and take the ready-made ones immediately.
Unfortunately, with this approach a human factor comes into play: developers decide whether the content has changed or not. This imposes significant limitations in the form of build errors where content may turn out to be missing or stale.
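In hypothetical form, the donor-resolution rule looks something like this (BundleReuse and FindLastSuccessfulBuild are invented names for illustration; a real implementation would query the build storage or the TeamCity API):
// Hypothetical sketch: pick the build whose ready-made bundles we reuse.
static class BundleReuse
{
    public static string ResolveDonor(string branch, string configuration, string explicitBuildId)
    {
        // A donor explicitly specified by the developer wins; otherwise fall back
        // to the last successful build of the same branch and configuration.
        return explicitBuildId ?? FindLastSuccessfulBuild(branch, configuration);
    }

    static string FindLastSuccessfulBuild(string branch, string configuration)
    {
        // Placeholder: query the external build storage here.
        return null;
    }
}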
The second problem is that for each quality preset, Unity assembles the bundles sequentially, one after another, so it takes more than an hour to assemble all four quality presets.
To avoid this (and reduce build time) we used this scheme:
After importing the project, we generate an additional Unity project for each required quality and, using symlinks, link the Library folder into each generated project (see the sketch after this list). But be careful: there are dynamic folders inside the Library that are created during the build process. For example, the shader cache or the package folder may be recalculated, which can lead to problems in parallel builds. Accordingly, only static content and the asset database should be linked.
We launch several instances of Unity simultaneously in separate processes and build a separate quality in each.
We transfer the result from each build to the parent project.
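A rough sketch of this scheme on a Windows node (all paths, folder names, and the Builder.BuildBundles method are hypothetical; directory junctions created with mklink /J stand in for symlinks):
using System.Diagnostics;
using System.IO;
// Hypothetical driver: one cloned project and one Unity instance per quality.
static class ParallelBundleBuild
{
    // Library subfolders assumed safe to share read-only between instances;
    // dynamic folders such as the shader cache are deliberately left out.
    static readonly string[] LinkedFolders = { "ArtifactDB", "SourceAssetDB", "metadata" };

    static void Main()
    {
        foreach (var quality in new[] { "ULD", "LD", "HD" })
        {
            string clone = $@"C:\ci\wr-{quality}";
            // Assumes Assets and ProjectSettings were already copied into the clone.
            Directory.CreateDirectory(Path.Combine(clone, "Library"));
            foreach (var folder in LinkedFolders)
            {
                // mklink /J <link> <target> creates a directory junction.
                Process.Start("cmd.exe",
                    $@"/c mklink /J {clone}\Library\{folder} C:\ci\wr\Library\{folder}")
                    .WaitForExit();
            }
            // Each instance builds bundles for one quality preset.
            // (A real driver would wait for all instances, then copy results back.)
            Process.Start("Unity.exe",
                $@"-batchmode -quit -projectPath {clone} -executeMethod Builder.BuildBundles");
        }
    }
}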
Thus, we were able to parallelize the building of bundles within a single node.
Summing up: the results of our struggle with huge build times
We managed to bring the build time back to its previous level (before the release of War Robots Remastered), and this was done despite the fact that the amount of content in the project had increased by 2.5 times. We used Unity Accelerator to make that happen. We also managed to remove import freezes by minimizing calls to AssetDatabase.Refresh().
Some information in numbers:
In the graph below you can see the distribution of build time by date in more detail:
We stopped rebuilding bundles; instead, we now take ready-made bundles from previously assembled builds. We also parallelized the building of bundles of different qualities, launching several instances of Unity simultaneously and symlinking assets.