Kubernetes RGD: ExternalRef With ReadyWhen Not Blocking?

by Alex Johnson

Understanding the Issue with externalRef and readyWhen in Kubernetes RGD

When working with Kubernetes Resource Graph Definitions (RGDs) in kro, externalRef and readyWhen are the main tools for depending on resources the graph does not create itself and for making sure those resources are in the desired state before anything that relies on them is created. Combining the two exposes a peculiar issue: the checks defined in readyWhen are evaluated for an externalRef, but a failing check does not stop the RGD from moving forward, so dependent resources can be created prematurely. This article explains the problem in detail and discusses potential solutions.

The Expected Behavior of externalRef and readyWhen

In a typical Kubernetes workflow managed by RGD, externalRef is used to monitor the existence and state of external resources. This is particularly useful when dealing with resources that take time to provision or become ready, such as databases or other services. The readyWhen condition adds an extra layer of validation, ensuring that specific criteria are met before considering the resource ready. For instance, you might want to check if a StatefulSet has the desired number of replicas available before proceeding with a migration job.
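As a rough illustration of "the desired number of replicas available", one way to phrase such a check is to compare the desired count in the StatefulSet spec with the available count in its status. This is a minimal sketch, not a recommended configuration: the id appStatefulSetRef, the name my-app, and the default namespace are placeholders chosen for the example.

```yaml
resources:
- id: appStatefulSetRef              # hypothetical id
  readyWhen:
    # Ready only when every desired replica is reported as available.
    - ${appStatefulSetRef.spec.replicas == appStatefulSetRef.status.availableReplicas}
  externalRef:
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: my-app                   # hypothetical name of an existing StatefulSet
      namespace: default
```

The expression refers to the resource by its own id, so the same id has to appear on both sides of the comparison.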

Ideally, when an externalRef is defined with a readyWhen condition, the RGD should wait for the external resource to appear and satisfy the readyWhen condition before moving on to the next resource in the dependency graph. This ensures that all prerequisites are met, preventing issues caused by dependent resources being initiated too early.

The Problem: readyWhen Checks but Doesn't Block

The core issue is that while the readyWhen checks are indeed performed, the RGD doesn't seem to respect their outcome in blocking the execution flow. The RGD proceeds to the next resources even if the readyWhen condition evaluates to false. This behavior can lead to problems, especially when subsequent resources depend on the external resource being fully ready.

Consider a scenario where a database migration job depends on a MySQL StatefulSet being fully operational. The RGD uses externalRef to monitor the StatefulSet and readyWhen to check that the required number of replicas is available. If the readyWhen condition doesn't block, the migration job might start before the database is ready, leading to migration failures or data corruption. For the setup to be safe, readyWhen has to hold back RGD progress until its conditions are met.

A Concrete Example: MySQL StatefulSet and Migration Job

Let's examine a specific example to illustrate this issue. Suppose you have a MySQL database deployed as a StatefulSet, and you want to run a migration job after the database is fully up and running. The RGD configuration might look like this:

apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: migratedb.kro.run
spec:
  schema:
    apiVersion: v1alpha1
    kind: MigrateDB
    spec:
      namespace: string | default=default

  resources:
  - id: mySQLStatefulSetRef
    readyWhen:
      - ${mySQLStatefulSetRef.status.replicas == mySQLStatefulSetRef.status.availableReplicas}
    externalRef:
      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        name: mysql
        namespace: ${schema.spec.namespace}

  - id: migrateDBJob
    template:
      apiVersion: batch/v1
      kind: Job
      ...

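In this configuration, mySQLStatefulSetRef uses externalRef to point at the existing MySQL StatefulSet. The readyWhen condition compares status.replicas with status.availableReplicas, so the reference should count as ready only when every replica is available. The migrateDBJob is meant to depend on mySQLStatefulSetRef being ready; its template is elided here, and the dependency is presumably established by an expression in that template that refers to mySQLStatefulSetRef.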
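Because the Job template is elided, here is a minimal sketch of how such a dependency might be expressed. It assumes kro derives ordering from expressions that mention another resource's id. The Job name, container image, and DB_HOST variable are hypothetical and not part of the original configuration.

```yaml
  - id: migrateDBJob
    template:
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: migrate-db                        # hypothetical name
        namespace: ${schema.spec.namespace}
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: migrate
                image: example.com/migrate:latest   # hypothetical image
                env:
                  - name: DB_HOST                   # hypothetical variable
                    # Mentioning mySQLStatefulSetRef here is what ties the Job
                    # to the external StatefulSet in the resource graph.
                    value: ${mySQLStatefulSetRef.spec.serviceName}
```

With a reference like this in place, the Job would be expected to wait for mySQLStatefulSetRef and its readyWhen condition, and that expected behavior is what the rest of this article examines.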

The problem arises when migrateDBJob is created even though the readyWhen condition is not met. This can happen if the RGD interprets the condition incorrectly, or if it evaluates the condition but does not use the result to hold back the resources that follow. The migration job can then fail because the database isn't yet ready to accept connections or run migrations.

Demonstrating the Issue with a False Condition

To further demonstrate this behavior, you can create a condition that will always evaluate to false, such as `${