# Optimize CI/CD Build Times: A Comprehensive Guide
This guide explores effective strategies for optimizing CI/CD build times, addressing the challenges common to large projects, particularly those leveraging the Substrate/Polkadot SDK. Long build times directly hurt developer productivity and project efficiency. Let's dive into concrete ways to streamline your CI/CD pipelines.
## Understanding the Bottleneck
CI/CD workflows often suffer from extended build times, especially when dealing with extensive dependency trees like those found in Substrate/Polkadot SDK, which can include over 500 crates. Recent analysis reveals significant build durations:
- Feature CI: approximately 50 minutes (e.g., a successful run from 01:55 to 02:44, i.e. 49 minutes).
- Release: can exceed 60 minutes (e.g., 02:45 to 03:47, i.e. 62 minutes).
- Nightly Checks: While lighter in workload, these still consume around 18 minutes.
The primary bottleneck stems from compiling the node/runtime, which necessitates pulling in the entirety of the Polkadot SDK. This highlights the critical need for targeted optimizations.
## Identifying Current Issues
To effectively optimize, it's crucial to pinpoint the specific issues contributing to prolonged build times. Several key areas need attention:
### 1. Redundant Rust Toolchain Installations
Redundant Rust toolchain installations across jobs are a significant time-waster in CI/CD pipelines. Each job often installs the Rust toolchain independently, even when a common setup could be shared. For instance, the `quick-checks`, `security`, `lint-contracts`, `build-and-lint-node`, and `test-node` jobs all require Rust, yet each sets it up separately. The redundancy is evident in their configurations:
- `quick-checks` installs the stable Rust toolchain along with `rustfmt`.
- `security` installs the stable Rust toolchain.
- `lint-contracts` installs the stable Rust toolchain, `clippy`, `rust-src`, and the `wasm32` target.
- `build-and-lint-node` installs the nightly Rust toolchain, `wasm32`, `clippy`, and `rust-src`.
- `test-node` reinstalls the nightly Rust toolchain, `wasm32`, and `rust-src`.
The problem with these independent installations is that they consume valuable build time. Each installation process involves downloading and setting up the necessary components, which can take several minutes. By consolidating these installations, we can reduce the overall build time and streamline the CI/CD process. The solution lies in identifying the common requirements across jobs and setting up a shared toolchain installation that can be cached and reused. This approach not only saves time but also ensures consistency across different jobs, as they all use the same toolchain version.
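As an illustration of such a shared setup, every Rust job could pin the same toolchain and use the same cache key. A minimal sketch (the action versions, component list, and key name below are assumptions, not taken from the actual workflow):

```yaml
# Hypothetical shared setup, repeated verbatim in each Rust job:
# one pinned toolchain plus one shared cache key, so the expensive
# installation and dependency compilation happen once and are reused.
- name: Install Rust toolchain
  uses: dtolnay/rust-toolchain@stable
  with:
    components: clippy, rustfmt
    targets: wasm32-unknown-unknown
- name: Restore shared Rust cache
  uses: Swatinem/rust-cache@v2
  with:
    shared-key: "shared-toolchain"   # identical key in every job
```

Because the snippet is identical in every job, the cache restored by the first job to finish is reusable by all the others, and every job is guaranteed to compile with the same toolchain version.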
### 2. Sequential Job Dependencies
The presence of sequential job dependencies in CI/CD workflows can significantly impede the overall build time. When jobs are structured to run one after another, even if they don't logically depend on each other, the workflow becomes less efficient. In the given scenario, the job sequence looks like this:
```
quick-checks → security → lint-contracts → build-contracts → test-contracts
  → build-and-lint-node → test-node → docker-check
```
This sequence means `security` waits for `quick-checks` to complete, `lint-contracts` waits for both `quick-checks` and `security`, and so on. However, a closer look reveals that some of these jobs could run in parallel without affecting the outcome. For example, `quick-checks` and `security` test different aspects of the codebase, and neither depends on the other's completion. Similarly, `build-and-lint-node` could potentially run concurrently with `lint-contracts`, `build-contracts`, and `test-contracts` up to a certain point, depending on the specific dependencies within the project.
The key to addressing this issue is to identify the jobs that can be safely parallelized and reconfigure the CI/CD workflow accordingly. By running independent jobs concurrently, the overall build time can be reduced significantly. This approach requires a careful analysis of the job dependencies and a restructuring of the workflow to maximize parallelism.
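A minimal sketch of what the restructured dependency graph could look like (job bodies elided; which jobs are truly independent must be verified against the real workflow before removing any `needs:` entry):

```yaml
# Hypothetical restructured graph: independent jobs declare no `needs`,
# so they start simultaneously; only genuine prerequisites are listed.
jobs:
  quick-checks:
    runs-on: ubuntu-latest
    # ... checks steps ...
  security:
    runs-on: ubuntu-latest
    # no `needs: quick-checks` — runs in parallel with quick-checks
  build-and-lint-node:
    runs-on: ubuntu-latest
    # independent of the contract pipeline, so it starts immediately
  docker-check:
    needs: [test-contracts, test-node]   # joins both branches at the end
    runs-on: ubuntu-latest
```

With this shape, the critical path shrinks from the sum of all job durations to the longest single branch.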
### 3. Cargo Rebuilds Across Jobs
One of the frustrating issues in CI/CD pipelines is the Cargo rebuilds across jobs. Despite the presence of caching mechanisms, build artifacts sometimes don't fully transfer between jobs, leading to unnecessary recompilation. This problem often arises because different jobs might use different cache keys or have variations in their build environments, causing Cargo, the Rust package manager, to treat them as distinct build contexts. In the scenario described, the problem is evident in the following steps:
- `build-and-lint-node` compiles the node and its dependencies.
- `test-node` recompiles the node for tests, even though it was already built in the previous job. This happens because the test environment might have different configurations or cache keys.
- Each contract build job compiles ink! dependencies independently, meaning there is no shared compilation context or caching across these jobs.
The consequence of these rebuilds is a significant increase in build time. Compiling a Rust project, especially one with many dependencies, can be time-consuming. When the same components are repeatedly built across different jobs, the overall efficiency of the CI/CD pipeline is compromised.
The solution to this issue lies in optimizing the caching strategy and ensuring that build artifacts are effectively shared between jobs. This can involve standardizing cache keys, using consistent build environments, and leveraging tools like sccache for distributed caching.
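As a hedged sketch of the cache-key half of that strategy, a single cache stanza pasted verbatim into every Rust job keeps the build contexts aligned (the key name is illustrative):

```yaml
# Identical cache configuration in every Rust job, so Cargo sees one
# shared build context instead of several incompatible ones.
- uses: Swatinem/rust-cache@v2
  with:
    shared-key: "arkavo-rust"   # must match across all jobs
    cache-on-failure: true      # keep partial artifacts from failed runs
```

The `cache-on-failure` flag matters here: without it, a failed job discards all compilation work, and the next run starts cold again.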
### 4. Contract Builds Not Parallelized Efficiently
When dealing with multiple contract builds in a CI/CD pipeline, it's crucial to ensure they are parallelized efficiently. Inefficient parallelization can lead to suboptimal resource utilization and longer build times. The described scenario involves four contract builds that run as separate matrix jobs. While running them in parallel is a step in the right direction, the efficiency is hampered by several factors:
- Each job installs `cargo-contract`, the tool for building and testing ink! smart contracts, from scratch.
- Each job compiles ink! dependencies independently, meaning there is no shared compilation context or caching across these jobs.
These inefficiencies arise because each job essentially starts from a clean slate, without leveraging the work done by other jobs. Installing cargo-contract and compiling dependencies are time-consuming processes. When these steps are repeated in each job, the benefits of parallelization are diminished.
The key to improving efficiency is to share the common setup and compilation steps across the contract build jobs. This can be achieved by caching the cargo-contract installation and ensuring that dependencies are compiled only once and then reused across all jobs. This approach not only reduces build time but also ensures consistency in the build environment.
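One possible way to share the `cargo-contract` installation, assuming the version pin used elsewhere in this guide, is to restore the binary from cache and reinstall only on a cache miss:

```yaml
# Sketch: restore the cargo-contract binary from the Actions cache and
# run the slow `cargo install` only when the cache misses.
- name: Cache cargo-contract
  id: cache-cargo-contract
  uses: actions/cache@v4
  with:
    path: ~/.cargo/bin/cargo-contract
    key: cargo-contract-6.0.0-beta.1
- name: Install cargo-contract
  if: steps.cache-cargo-contract.outputs.cache-hit != 'true'
  run: cargo install cargo-contract --version 6.0.0-beta.1 --locked
```

Pinning the version in both the cache key and the install command keeps the cached binary and the requested tool in lockstep.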
### 5. Docker Build Duplicates Work
A common pitfall in CI/CD pipelines is Docker builds that duplicate work, wasting time and resources. This occurs when the Docker build process doesn't leverage previously built artifacts and instead rebuilds components from scratch. In the described scenario, the `release.yaml` workflow's `docker-publish` job rebuilds the Docker image from the Dockerfile, which involves a full build process, even though the `build-release` job has already produced the necessary binary artifact.
The problem with this approach is that it duplicates the build effort. Building a Docker image can be a time-consuming process, especially if it involves compiling code or installing dependencies. When the same components are rebuilt in multiple stages of the CI/CD pipeline, the overall build time increases significantly.
To address this issue, the Docker build process should be optimized to reuse the artifacts produced in earlier stages. In this case, the docker-publish job should utilize the pre-built binary from the build-release job instead of rebuilding it from the Dockerfile. This can be achieved by downloading the artifact and incorporating it into the Docker image.
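Assuming the pre-built artifact is a Linux executable named `arkavo-node` placed under `target/release/` before the image build (the base image and paths below are assumptions), the runtime-only Dockerfile could be as small as:

```dockerfile
# Hypothetical runtime-only image: no compilation, just the prebuilt binary.
FROM debian:bookworm-slim
COPY target/release/arkavo-node /usr/local/bin/arkavo-node
RUN chmod +x /usr/local/bin/arkavo-node
ENTRYPOINT ["/usr/local/bin/arkavo-node"]
```

The multi-minute `cargo build` stage disappears entirely from the Docker build; only the copy and image assembly remain.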
## Proposed Solutions
To tackle these challenges, a multi-faceted approach is necessary. The proposed solutions are categorized into short-term, medium-term, and long-term strategies, each requiring varying levels of effort and offering different degrees of improvement.
### Short-Term Solutions (Easy Wins)
These solutions are designed to provide immediate improvements with minimal effort.
- **Utilize Rust Cache Action**:
  - Replace manual cache configurations with `Swatinem/rust-cache@v2`. This action simplifies the caching process and ensures efficient management of Rust dependencies.
  - Configuration:

    ```yaml
    - uses: Swatinem/rust-cache@v2
      with:
        shared-key: "arkavo-node"
        cache-on-failure: true
    ```
- **Parallelize Quick Checks and Security Jobs**:
  - Remove the `needs: quick-checks` dependency from the `security` job. Since these jobs are independent, running them in parallel will reduce the overall build time.
- **Cache cargo-contract Installation**:
  - Implement caching for the `cargo-contract` installation to avoid redundant setups. This can be achieved using GitHub Actions' caching mechanism.
  - Configuration:

    ```yaml
    - name: Cache cargo-contract
      uses: actions/cache@v4
      with:
        path: ~/.cargo/bin/cargo-contract
        key: cargo-contract-6.0.0-beta.1
    ```
- **Eliminate Redundant Runtime Builds**:
  - The `build-and-lint-node` job currently builds both the node and the runtime, which is redundant since the runtime is a dependency of the node.
  - Modify the job to skip the explicit runtime build step:

    ```yaml
    - name: Build node
      run: cargo +nightly build --package arkavo-node
      # This already builds the runtime as a dependency
    - name: Build runtime
      run: cargo +nightly build --package arkavo-runtime  # Redundant!
    ```
### Medium-Term Solutions (Moderate Effort)
These solutions require a bit more effort but offer substantial improvements in build times.
- **Reuse Artifacts in Docker Build**:
  - Modify the `docker-publish` job to leverage the pre-built binary from the `build-release` job. This avoids rebuilding the application within the Docker build process.
  - Configuration:

    ```yaml
    - name: Download node binary
      uses: actions/download-artifact@v4
      with:
        name: arkavo-node-${{ github.sha }}
        path: ./target/release/
    ```
  - Update the Dockerfile to simply copy the binary, similar to the approach of `Dockerfile.ci` in `feature.yaml`.
- **Implement sccache for Distributed Caching**:
  - Integrate Mozilla's `sccache` with the GitHub Actions cache backend to enable distributed caching. This allows compilation artifacts to be shared across different runners.
  - Configuration:

    ```yaml
    - name: Setup sccache
      uses: mozilla-actions/sccache-action@v0.0.3
    ```
- **Combine Lint and Build Jobs**:
  - Merge the `lint-contracts` and `build-contracts` jobs, since `clippy` already compiles the code. This consolidation reduces overhead and streamlines the workflow.
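A sketch of the merged job's core step, on the assumption that clippy is run over all targets so it both lints and compiles in one pass:

```yaml
# Hypothetical merged step: clippy compiles the crates as part of
# linting, so no separate `cargo build` step is needed afterwards.
- name: Lint and build contracts
  run: cargo clippy --all-targets --all-features -- -D warnings
```

The `-D warnings` flag keeps the lint gate strict, preserving the purpose of the old `lint-contracts` job while the compilation output doubles as the build.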
### Long-Term Solutions (Significant Effort)
These solutions require substantial effort but provide the most significant long-term benefits.
- **Utilize Self-Hosted Runners with Persistent Cache**:
  - Switch to self-hosted runners to maintain a warm Cargo registry cache, incremental compilation artifacts, and pre-installed toolchains. This avoids the overhead of setting up runners from scratch for each job.
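A sketch of what pointing a job at such a runner could look like (the labels and cache paths are illustrative assumptions; the directories would live on the runner's persistent disk):

```yaml
# Hypothetical job on a self-hosted runner whose disk survives between
# runs, so the registry and target directory stay warm.
build-and-lint-node:
  runs-on: [self-hosted, linux, x64]
  env:
    CARGO_HOME: /opt/ci-cache/cargo         # persistent registry cache
    CARGO_TARGET_DIR: /opt/ci-cache/target  # persistent build artifacts
```

Because nothing is downloaded or recompiled from scratch, incremental builds on such a runner are bounded by the size of the change, not the size of the dependency tree.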
- **Consolidate Build Matrix**:
  - Replace the four parallel contract build jobs with a single `build-all-contracts` job. The overhead of managing four separate VMs may outweigh the benefits of parallelization.
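A minimal sketch of such a consolidated job, assuming the contracts live in per-contract subdirectories under a `contracts/` folder (layout and names are assumptions):

```yaml
# Hypothetical single job: build every contract on one VM, sharing a
# single checkout, toolchain, and compilation cache.
build-all-contracts:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: Swatinem/rust-cache@v2
      with:
        shared-key: "contracts"
    - name: Build all contracts
      run: |
        for contract in contracts/*/; do
          cargo contract build --manifest-path "${contract}Cargo.toml"
        done
```

Common ink! dependencies are compiled once for the first contract and reused from the shared `target` directory for the rest, which is where the per-VM matrix loses time.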
- **Optimize Cargo Workspace**:
  - Enable `lto =