Troubleshooting

Background

Cluster provisioning can succeed while the blueprint sync fails on day 0. The blueprint sync can also fail later, when you update software add-ons, policies, or services. This can occur, for example, when:

software add-ons in the blueprint are not defined correctly
the version of a software add-on or service (for example, OPA Gatekeeper) is incompatible with the cluster version
dependencies between software add-ons are not defined, resulting in a failure

Viewing Details Of a Blueprint Sync Failure

When a blueprint sync failure occurs, the user can view the details in the Web Console by clicking the expand icon next to Blueprint Sync: Failed, and then take the necessary action(s).

Hover over the red notification to view the reason for the failed status.

Review the details of the failed add-on(s) and the reason for the failure.

Users can make the required changes to the blueprint configuration and retry the blueprint sync.

What happens when a blueprint sync fails

Generally, when a blueprint sync fails, workload deployments on the cluster are blocked. This is because the blueprint may carry important policies, for example OPA Gatekeeper policies that define rules for application deployments. If those policies are not installed on the cluster, security and compliance issues can result.

Therefore, act promptly to make the cluster usable again. The next section describes the steps to take.

Steps To Take Upon a Blueprint Sync Failure

Diagnose the general error: Sometimes the error is obvious and requires only a simple tweak to the blueprint and/or the underlying software add-ons to get it working again. For example, in the picture below, the error indicates an incompatibility between the OPA Gatekeeper version defined in the blueprint (3.7.1) and the cluster version (v1.25).

Triage the specific add-on deployments that failed: Sometimes only specific add-on deployments fail. Clicking the status of the add-on reveals more details.

Roll back to a previous version: If you are blocked and need an immediate fix, you can roll back by initiating the same process and selecting the previous version of the blueprint.

Scenarios

The following scenarios illustrate when blueprint sync failures can occur.

Scenario 1: Incompatibility between managed add-on/service and the cluster version

The error below occurs when provisioning a cluster running Kubernetes version 1.25 with a custom blueprint that contains OPA Gatekeeper 3.7.1, a version that is incompatible with Kubernetes 1.25.

In this case, the platform throws a validation error indicating that the OPA Gatekeeper installation profile in the blueprint must be updated to version 3.11.0 or higher.
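
The validation described above amounts to a minimum-version check. The following is a minimal sketch of that idea, not the platform's implementation: `MIN_GATEKEEPER`, `parse_version`, and `check_gatekeeper` are hypothetical names, and the only table entry is the one stated in this scenario (Kubernetes 1.25 requires OPA Gatekeeper 3.11.0 or higher).

```python
# Hypothetical table: minimum OPA Gatekeeper version per Kubernetes minor version.
# Only the 1.25 entry is taken from this scenario; other entries are not known here.
MIN_GATEKEEPER = {(1, 25): (3, 11, 0)}


def parse_version(text):
    """Turn 'v1.25' or '3.7.1' into a three-part tuple of integers."""
    parts = [int(part) for part in text.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))  # pad '3.11' to (3, 11, 0)
    return tuple(parts)


def check_gatekeeper(k8s_version, gatekeeper_version):
    """Return None if compatible, otherwise a message describing the mismatch."""
    minimum = MIN_GATEKEEPER.get(parse_version(k8s_version)[:2])
    if minimum is None:
        return None  # no known constraint for this Kubernetes version
    if parse_version(gatekeeper_version) < minimum:
        needed = ".".join(str(n) for n in minimum)
        return (f"OPA Gatekeeper {gatekeeper_version} is incompatible with "
                f"Kubernetes {k8s_version}; use {needed} or higher")
    return None


print(check_gatekeeper("1.25", "3.7.1"))
```

Running a check like this against the blueprint's installation profile before publishing can catch the mismatch earlier than the sync does.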

Scenario 2: Cluster is unreachable

Your cluster may be down or unreachable. In that case, the blueprint sync fails with a timeout error (specifically, kubeapi-proxy reports a timeout).
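
Before retrying the sync, it can help to confirm that the cluster's API endpoint accepts connections at all. The following is a minimal sketch using only the Python standard library. It assumes you have direct network access to the endpoint, which may not hold in every setup, and `is_reachable` is a hypothetical helper name. A successful TCP connection shows only that the endpoint is reachable from where the script runs, not that kubeapi-proxy or the API server is healthy.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, and DNS failures
        return False


# Example call (placeholder address; substitute your cluster's API endpoint):
# is_reachable("203.0.113.10", 6443)
```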

Scenario 3: Using the wrong default blueprint based on the cluster type

The system comes with blueprints for specific cluster types that can be used by default, or used as the base in a custom or golden blueprint. However, using the wrong default can lead to errors.

For example, for MKS clusters, the blueprint must be default, minimal, or default-upstream. Using any other default blueprint leads to an error.
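
The rule above can be expressed as a simple lookup. This is an illustrative sketch only: `ALLOWED_DEFAULTS` and `validate_base_blueprint` are hypothetical names, and only the MKS entry comes from the text above. Allowed defaults for other cluster types are not listed here and are deliberately left out.

```python
# Allowed default blueprints per cluster type. Only the MKS entry is taken
# from the documentation above; other cluster types are intentionally absent.
ALLOWED_DEFAULTS = {"MKS": {"default", "minimal", "default-upstream"}}


def validate_base_blueprint(cluster_type, blueprint):
    """Return True if the blueprint is a permitted default for the cluster type."""
    allowed = ALLOWED_DEFAULTS.get(cluster_type)
    if allowed is None:
        raise ValueError(f"no allowed-defaults entry for cluster type {cluster_type!r}")
    return blueprint in allowed


print(validate_base_blueprint("MKS", "default-upstream"))
```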

Scenario 4: Attempting To Override Blueprint Fleet Configuration

A blueprint fleet can be used to update multiple clusters at once with a given blueprint. In this case, each cluster is assigned a fleet label. However, a manual one-time update of a single cluster fails because the cluster is part of the fleet.

To work around this, remove the fleet label from the cluster and try again. See the blueprint fleet documentation for more details.

Scenario 5: The cluster runs out of space/memory for new add-on deployments

In some cases, your cluster may not have enough space or memory for an add-on deployment that comes with the new version of the blueprint.
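
To reason about whether an add-on can fit, compare its memory request with the free memory on each node. The sketch below is a rough aid under stated assumptions: `to_bytes` and `fits_on_any_node` are hypothetical helpers, the per-node figures are values you would gather yourself (for example from the output of `kubectl describe nodes`), and only memory is considered. The real scheduler also weighs CPU, storage, taints, and affinity rules.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities (subset).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
         "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def to_bytes(quantity):
    """Convert a quantity such as '512Mi' or '2G' to a number of bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain number: already in bytes


def fits_on_any_node(request, free_per_node):
    """Return True if at least one node has enough free memory for the request."""
    needed = to_bytes(request)
    return any(to_bytes(free) >= needed for free in free_per_node)


# Hypothetical figures: two nodes with 256Mi and 1Gi of unreserved memory.
print(fits_on_any_node("512Mi", ["256Mi", "1Gi"]))
```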

Scenario 6: OPA Version Conflict After Upgrade

If a blueprint was associated with the default OPA profile before the 2.10 SaaS release, it uses the old default version (3.14). After the upgrade to the 2.10 SaaS release, fetching the default OPA installation profile retrieves the new version (3.16.3) instead of the old one. This causes a conflict because the blueprint was originally designed to work with version 3.14.

To avoid this conflict, if no changes to the OPA version are expected, a custom installation profile should be created using version 3.14 (or whichever version the blueprint was working with). After creating the custom installation profile, it must be associated with the blueprint version. Only then will the blueprint use the custom OPA profile, preventing future conflicts.

Scenario 7: OPA Gatekeeper Deployment Fails on AKS Cluster with Azure Add-ons

If a blueprint is associated with the OPA profile on an AKS cluster where the azurePolicy add-on is enabled, the sync will fail due to a conflict with existing Gatekeeper components.

Error Message

Install failed : Unable to continue with install: ClusterRole "gatekeeper-manager-role" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "opa-gatekeeper-3.16.3"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "rafay-system"

The azurePolicy add-on installs Gatekeeper components in the gatekeeper-system namespace. Since the required ClusterRole and ClusterRoleBinding already exist, the deployment of opa-gatekeeper in the rafay-system namespace fails.

To avoid this conflict, disable the azurePolicy add-on before publishing the blueprint with the OPA profile. After disabling the azurePolicy addonProfile, the blueprint syncs successfully without conflicts.
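
The ownership check behind the error message above can be sketched as follows. This is for diagnosis only and is not Helm's actual code: `missing_helm_ownership` is a hypothetical helper, and the sample metadata is made up to resemble a ClusterRole created outside of Helm. The label and annotation keys are the ones named in the error message.

```python
def missing_helm_ownership(metadata, release_name, release_namespace):
    """Return the ownership keys that are absent or set to a different value."""
    # Keys and expected values as listed in the error message above.
    expected_labels = {"app.kubernetes.io/managed-by": "Helm"}
    expected_annotations = {
        "meta.helm.sh/release-name": release_name,
        "meta.helm.sh/release-namespace": release_namespace,
    }
    labels = metadata.get("labels") or {}
    annotations = metadata.get("annotations") or {}
    missing = [k for k, v in expected_labels.items() if labels.get(k) != v]
    missing += [k for k, v in expected_annotations.items() if annotations.get(k) != v]
    return missing


# Hypothetical metadata for a ClusterRole that was not created by Helm.
existing = {"name": "gatekeeper-manager-role", "labels": {}, "annotations": {}}
print(missing_helm_ownership(existing, "opa-gatekeeper-3.16.3", "rafay-system"))
```

A non-empty result means the resource is owned by something other than the Helm release being installed, which is the situation the azurePolicy add-on creates.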