Harvester Upgrade Hangs On Fleet Rollout: A Bug Report

by Alex Johnson

Introduction

This article documents a bug encountered while upgrading a Harvester cluster from v1.7.0-rc2 to a custom master-head build that embeds Rancher v2.13.0-rc2. The upgrade stalls during the fleet rollout's apply-manifests step and never completes. The sections below describe the bug, the steps to reproduce it, the expected behavior, the environment, and the troubleshooting information gathered so far. Understanding failure modes like this one matters to anyone operating Harvester clusters, because a stalled upgrade directly translates into downtime and instability.

Bug Description

The core issue is that the upgrade becomes unresponsive while manifests are being applied as part of the fleet rollout. The process hangs indefinitely with no clear indication of progress or failure. The behavior was observed when upgrading from Harvester v1.7.0-rc2 to a custom master-head build that includes Rancher v2.13.0-rc2. The logs show the hvst-upgrade-gdbkn-apply-manifests job in the harvester-system namespace failing; because this job is a critical step of the upgrade, its failure blocks everything that follows. A failure of this kind can mean prolonged downtime and an unstable cluster, so finding the root cause is essential for predictable upgrades.
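
As a first triage step, the failing job and its pod logs can be inspected directly. The following is a minimal sketch using the Python kubernetes client, assuming a kubeconfig with access to the cluster and the job name quoted above; it prints the job's conditions and the tail of each pod's log.

    from kubernetes import client, config

    config.load_kube_config()  # assumes a kubeconfig that can reach the cluster

    NAMESPACE = "harvester-system"
    JOB_NAME = "hvst-upgrade-gdbkn-apply-manifests"

    batch = client.BatchV1Api()
    core = client.CoreV1Api()

    # Report the job's high-level status and its conditions.
    job = batch.read_namespaced_job(JOB_NAME, NAMESPACE)
    status = job.status
    print(f"active={status.active} succeeded={status.succeeded} failed={status.failed}")
    for cond in status.conditions or []:
        print(f"{cond.type}={cond.status} reason={cond.reason}: {cond.message}")

    # Jobs label their pods with job-name=<job>, so their logs are easy to find.
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=f"job-name={JOB_NAME}")
    for pod in pods.items:
        print(f"--- last 50 log lines from {pod.metadata.name} ---")
        print(core.read_namespaced_pod_log(pod.metadata.name, NAMESPACE, tail_lines=50))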

Steps to Reproduce

To replicate this bug, follow these steps:

  1. Set up a Harvester cluster running version v1.7.0-rc2.
  2. Prepare a custom ISO built from master-head that incorporates Rancher v2.13.0-rc2.
  3. Initiate the upgrade from v1.7.0-rc2 to that custom build.
  4. Monitor the upgrade and observe whether it hangs during the fleet rollout apply-manifests stage (one way to watch progress is sketched below).
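
For step 4, the upgrade can be watched from the command line rather than the UI. The snippet below is a minimal sketch, assuming Harvester records upgrade state in Upgrade custom resources under harvesterhci.io/v1beta1 in the harvester-system namespace; it polls those objects and prints their conditions, so a hang shows up as conditions that stop changing.

    import time

    from kubernetes import client, config

    config.load_kube_config()
    crd = client.CustomObjectsApi()

    # Poll the Upgrade objects and print their conditions once a minute.
    while True:
        upgrades = crd.list_namespaced_custom_object(
            group="harvesterhci.io", version="v1beta1",
            namespace="harvester-system", plural="upgrades")
        for item in upgrades.get("items", []):
            name = item["metadata"]["name"]
            for cond in item.get("status", {}).get("conditions", []):
                print(f"{name}: {cond.get('type')}={cond.get('status')} {cond.get('message', '')}")
        time.sleep(60)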

This sequence should reliably reproduce the hang, giving developers a controlled environment in which to diagnose it. Reliable reproduction steps are the first requirement for an efficient investigation: they let anyone hit the same failure and, later, verify a candidate fix.

Expected Behavior

In a successful upgrade, the cluster transitions from v1.7.0-rc2 to the new custom master-head build with Rancher v2.13.0-rc2 with minimal manual intervention. The fleet rollout apply-manifests stage completes without errors, the upgrade reports clear progress along the way, and any problem surfaces as an actionable error message rather than an indefinite hang. After the transition, the cluster is fully operational on the new version, with availability and data integrity preserved.

Environment

The environment in which this bug was observed includes:

  • Harvester version: v1.7.0-rc2 (source) -> custom master-head build (target)
  • Underlying infrastructure: bare metal, Dell PowerEdge R630 (DL360 Gen 9)
  • Rancher version: v2.13.0-alphaN -> v2.13.0-rc2 (embedded Rancher)

These details matter because the hardware and software configuration can change how an upgrade behaves; bare-metal deployments in particular raise considerations that virtualized environments do not. Knowing the exact Harvester and Rancher versions involved also helps developers and support teams narrow down the possible causes.

Detailed Logs and Analysis

The available logs offer useful clues. The Rancher patch file shows several configuration settings, such as disabling the multi-cluster management features and enabling the embedded Cluster API. The Helm upgrade itself appears to complete, but the system then sits in a loop, repeatedly waiting for Rancher to report version v2.13.0-rc2. The recurring message, error: stream error: stream ID 9; INTERNAL_ERROR; received from peer, is an HTTP/2 stream reset sent by the remote side, which points to a communication problem between components rather than a problem with the manifest content itself. The logs from the hvst-upgrade-gdbkn-apply-manifests job confirm that the manifest application is failing, and that failure appears to be what leaves the upgrade hung. Working out why the stream is reset and why the manifests are not applied is the key to resolving the issue.
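
Because the loop is waiting for Rancher to report v2.13.0-rc2, one useful cross-check is what the embedded Rancher actually claims its version is. The sketch below assumes the standard Rancher server-version Setting (management.cattle.io/v3) and the usual rancher deployment in the cattle-system namespace; if the deployed image tag is already v2.13.0-rc2 but the setting still reports the old version, the wait loop is stuck on stale state rather than on a failed rollout.

    from kubernetes import client, config

    config.load_kube_config()
    crd = client.CustomObjectsApi()
    apps = client.AppsV1Api()

    # What version does Rancher itself report?
    setting = crd.get_cluster_custom_object(
        group="management.cattle.io", version="v3",
        plural="settings", name="server-version")
    print("server-version setting:", setting.get("value"))

    # What image tag is actually deployed?
    deploy = apps.read_namespaced_deployment("rancher", "cattle-system")
    for container in deploy.spec.template.spec.containers:
        print("deployed image:", container.image)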

Support Bundle Analysis

The support bundle (supportbundle_8fe1a97a-c96e-4960-a713-4858faeee75c_2025-11-20T19-27-18Z.zip) captures the system's state, configuration, and logs at the time of the failure. Reviewing it can surface additional error messages, resource-utilization patterns, and configuration discrepancies that the upgrade job's output alone does not show. The most useful places to look are the logs of the individual Harvester components, the Kubernetes events, and the status of the deployed resources.
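
When working through the bundle offline, a short script can surface the relevant lines quickly. This sketch assumes the archive has been extracted to a local directory (the path is illustrative) and simply searches every .log file for the stream error and the apply-manifests job.

    import pathlib

    # Path to the extracted support bundle; adjust to wherever the zip was unpacked.
    BUNDLE_DIR = pathlib.Path("./supportbundle")
    PATTERNS = ("INTERNAL_ERROR", "stream error", "apply-manifests")

    for log_file in sorted(BUNDLE_DIR.rglob("*.log")):
        with log_file.open(errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if any(pattern in line for pattern in PATTERNS):
                    print(f"{log_file}:{line_no}: {line.rstrip()}")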

Potential Causes

Based on the available information, several potential causes could be contributing to this bug:

  1. Communication issues: The INTERNAL_ERROR stream reset suggests a problem with communication between components in the cluster, whether from network trouble, misconfigured services, or a failing internal service.
  2. Manifest application failure: The failing hvst-upgrade-gdbkn-apply-manifests job indicates that Kubernetes manifests are not being applied. Possible reasons include errors in the manifests themselves, insufficient permissions, or problems with the Kubernetes API server.
  3. Resource constraints: The system might be short of CPU, memory, or disk space, which can stall the upgrade. Monitoring resource utilization during the upgrade would confirm or rule this out (a quick check is sketched after this list).
  4. Rancher upgrade issues: There may be problems specific to upgrading Rancher from v2.13.0-alphaN to v2.13.0-rc2; the Rancher upgrade logs and release notes are the place to check.
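
For causes 1 and 3, node pressure conditions and recent Warning events are quick things to check. The sketch below uses only standard Kubernetes APIs; it flags any node condition that indicates pressure, then lists Warning events in the harvester-system namespace.

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    # Pressure conditions (MemoryPressure, DiskPressure, PIDPressure) and
    # NetworkUnavailable should be False on a healthy node; print any that are not.
    for node in core.list_node().items:
        for cond in node.status.conditions:
            if cond.type != "Ready" and cond.status == "True":
                print(f"{node.metadata.name}: {cond.type}={cond.status} ({cond.message})")

    # Recent Warning events often name the component that is actually failing.
    events = core.list_namespaced_event("harvester-system", field_selector="type=Warning")
    for event in events.items:
        obj = event.involved_object
        print(f"{event.last_timestamp} {obj.kind}/{obj.name}: {event.reason}: {event.message}")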

Identifying the root cause requires a systematic approach, investigating each of these potential factors. A combination of these factors could also be at play, making the diagnosis more complex.

Workaround and Mitigation

There is currently no confirmed workaround for this issue: the upgrade hangs indefinitely and requires manual intervention. Steps worth trying include:

  • Rolling back to the previous version: If possible, rolling back to v1.7.0-rc2 may restore system functionality, although it does not address the underlying problem that blocks the upgrade.
  • Manually applying the manifests: Applying the failing manifests by hand may bypass the automated step and let the upgrade proceed, but it requires careful review of the manifests and a good understanding of the system.
  • Increasing resource limits: If resource constraints are suspected, raising the limits on the relevant Kubernetes components might help (an illustrative patch is sketched below).
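
If resource limits do turn out to be the bottleneck, they can be raised with a targeted patch. The sketch below is purely illustrative: the deployment name, container name, and values are placeholders rather than recommended Harvester settings, and any real change should be based on observed utilization.

    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    # Strategic-merge patch that raises requests/limits on one container.
    # "some-upgrade-component" is a placeholder; use the real deployment and
    # container names from the cluster.
    patch = {"spec": {"template": {"spec": {"containers": [{
        "name": "some-upgrade-component",
        "resources": {
            "requests": {"cpu": "500m", "memory": "512Mi"},
            "limits": {"cpu": "2", "memory": "2Gi"},
        },
    }]}}}}

    apps.patch_namespaced_deployment(
        name="some-upgrade-component", namespace="harvester-system", body=patch)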

These steps are tentative and do not guarantee a resolution; a more thorough investigation is needed to determine the root cause and implement a proper fix.

Conclusion

This bug illustrates how complex it is to manage and upgrade distributed systems: a failure in the fleet rollout's apply-manifests step is enough to leave the whole upgrade hung, which underscores the importance of robust error handling and thorough testing. Documenting the bug, providing reproduction steps, and analyzing the logs and support bundle lay the groundwork for finding the underlying cause. There is no immediate workaround yet, but the potential causes and mitigation steps above should help guide further investigation and development. Resolving the issue will improve the Harvester upgrade experience and, more broadly, the stability and reliability of the platform. For more background on Harvester and Rancher, the official documentation and resources on the Rancher website are the best starting point.