Troubleshooting
This section describes errors that frequently occur during cluster provisioning.
Resource Provisioning Failures
Scenario 1: Instance Type Not Supported
The error below is an example of what might occur when provisioning a cluster or adding a new nodegroup to an existing cluster.
Validation
To resolve this issue, perform the following validations for instance types in a region:
- Check that your cloud credentials (role-based, or access key ID and secret) have the required permissions to call the AWS EC2 APIs. If the cloud credentials are role-based, ensure that all the required IAM policies are attached
- Check whether the configuration includes an instance type that is not available in the selected region
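The second check can be scripted. The following is a minimal sketch, assuming a boto3 EC2 client created for the target region is passed in; the function name `unavailable_instance_types` is hypothetical and not part of any product API. It relies on the EC2 `DescribeInstanceTypeOfferings` call, which also confirms that the credentials can reach the EC2 APIs at all.

```python
def unavailable_instance_types(ec2_client, instance_types):
    """Return the requested instance types that the client's region does not offer.

    ec2_client is assumed to be a boto3 EC2 client, for example
    boto3.client("ec2", region_name="us-west-2").
    """
    requested = list(instance_types)
    offered = set()
    kwargs = {
        "LocationType": "region",
        "Filters": [{"Name": "instance-type", "Values": requested}],
    }
    # Follow NextToken until every page of offerings has been read.
    while True:
        page = ec2_client.describe_instance_type_offerings(**kwargs)
        for offering in page.get("InstanceTypeOfferings", []):
            offered.add(offering["InstanceType"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return sorted(set(requested) - offered)
```

An empty result means every requested type is offered in the region. An access-denied exception from the call points to the permission problem described in the first check instead.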
Scenario 2: Availability Zones
The error below is an example of what might occur during EKS cluster provisioning when the cloud credentials do not have permission to create resources in the selected region.
Validation
Verify that the cloud credentials used for cluster provisioning have permission to create resources in the configured region.
Scenario 3: Instance Type Permission
The error below is an example of what might occur when the cloud credentials do not have permission to use a particular instance type specified in the EKS cluster configuration.
Validation
- Check the permissions of the cloud credentials and use an instance type that they are allowed to launch
- Alternatively, correct the permissions in AWS so that the configured instance type can be used
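The permission checks in Scenarios 2 and 3 can be approximated with the IAM policy simulator. This is a hedged sketch, not the product's own validation: it assumes a boto3 IAM client is passed in, the function name `denied_actions` is hypothetical, and the action list in the usage note is illustrative rather than the full set the provisioner needs. The simulator only approximates what AWS evaluates at launch time.

```python
def denied_actions(iam_client, principal_arn, actions, instance_type=None, region=None):
    """Return {action: decision} for every simulated action that is not allowed.

    iam_client is assumed to be a boto3 IAM client; principal_arn is the ARN of
    the role or user behind the cloud credentials.
    """
    context_entries = []
    if instance_type:
        # Lets policies conditioned on ec2:InstanceType be evaluated.
        context_entries.append({
            "ContextKeyName": "ec2:InstanceType",
            "ContextKeyValues": [instance_type],
            "ContextKeyType": "string",
        })
    if region:
        # Lets policies conditioned on aws:RequestedRegion be evaluated.
        context_entries.append({
            "ContextKeyName": "aws:RequestedRegion",
            "ContextKeyValues": [region],
            "ContextKeyType": "string",
        })
    request = {"PolicySourceArn": principal_arn, "ActionNames": list(actions)}
    if context_entries:
        request["ContextEntries"] = context_entries
    # Only the first page of results is read; very long action lists may need
    # the Marker parameter to fetch the rest.
    response = iam_client.simulate_principal_policy(**request)
    return {
        result["EvalActionName"]: result["EvalDecision"]
        for result in response["EvaluationResults"]
        if result["EvalDecision"] != "allowed"
    }
```

For example, calling `denied_actions(iam, role_arn, ["ec2:RunInstances"], instance_type="m5.xlarge", region="us-west-2")` returns an empty dictionary when the simulated launch is allowed. The caller itself needs permission to run the simulation.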
Scenario 4: K8s Version Upgrade
When upgrading the k8s version to 1.25, the error below occurs if the aws-load-balancer-controller version is 2.4.6. The preflight check fails and the upgrade is halted.
Validation
Update the aws-load-balancer-controller to version v2.4.7 and then upgrade the k8s version to 1.25
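The version check can be expressed as a small comparison. This sketch assumes that v2.4.7 is the minimum acceptable controller version for the 1.25 upgrade; the text above only states that 2.4.6 fails and that v2.4.7 resolves it, so treating later versions as acceptable is an assumption. The function names are hypothetical.

```python
import re

# Assumption: taken from the validation step above and treated as a minimum.
MIN_CONTROLLER_FOR_K8S_1_25 = (2, 4, 7)


def parse_version(text):
    """Parse 'v2.4.7' or '2.4.7' into a tuple of integers."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def controller_ready_for_1_25(controller_version):
    """Return True if the controller version meets the assumed minimum."""
    return parse_version(controller_version) >= MIN_CONTROLLER_FOR_K8S_1_25
```

Comparing integer tuples avoids the string-comparison pitfall where "2.4.10" would sort before "2.4.7".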
Scenario 5: Removal of PSPs
The error below is an example of what might occur when Pod Security Policies (PSPs) are found during the k8s version upgrade to 1.25.
Validation
PSPs are no longer supported in k8s v1.25. Remove the PSPs from the cluster and retry the upgrade.
AWS Cloud Errors
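Before retrying, it can help to list which PSPs still exist. The sketch below assumes the output of `kubectl get psp -o json` (run against the pre-1.25 cluster) has been captured as a string; the function name `psp_names` and the policy name in the test data are hypothetical.

```python
import json


def psp_names(kubectl_json):
    """Return the names of PodSecurityPolicy objects found in kubectl JSON output.

    Accepts either a List document (the usual result of `kubectl get psp -o json`)
    or a single object.
    """
    document = json.loads(kubectl_json)
    items = document.get("items", [document])
    return sorted(
        item["metadata"]["name"]
        for item in items
        if item.get("kind") == "PodSecurityPolicy"
    )
```

An empty list suggests no PSPs remain. Workloads that depended on a removed PSP may need an alternative admission mechanism before the upgrade.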
Provisioning an EKS cluster might fail because of AWS Cloud errors. Common causes include resource limits, network connectivity issues, misconfigurations in the provisioning process, insufficient permissions, outages in required AWS services, software bugs, and region-specific constraints.
To see the failure and its underlying cause, click the Provision Status of the failed cluster.
Click the Errors tab and expand the Cloud Error(s) section to view details of the AWS CloudFormation errors encountered during cluster provisioning. Use these details to identify the root cause and take corrective action.
Logs & Events
If cluster provisioning fails on Day 0, diagnose and resolve the underlying issue promptly. In addition to the Errors tab, click the Logs & Events tab to view the CloudFormation stack events from AWS. These logs go back to the creation of the stacks, so you can examine the events leading up to the provisioning failure.
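The same stack events can be read directly from AWS. The following is a minimal sketch, assuming a boto3 CloudFormation client is passed in and that the stack name is already known; `failed_stack_events` and the stack name used in any call are hypothetical. It uses the CloudFormation `DescribeStackEvents` call, which returns events newest first.

```python
def failed_stack_events(cfn_client, stack_name):
    """Return failed events for a stack, oldest first.

    cfn_client is assumed to be a boto3 CloudFormation client, for example
    boto3.client("cloudformation", region_name="us-west-2").
    """
    failures = []
    kwargs = {"StackName": stack_name}
    # Follow NextToken until every page of events has been read.
    while True:
        page = cfn_client.describe_stack_events(**kwargs)
        for event in page.get("StackEvents", []):
            # Matches CREATE_FAILED, UPDATE_FAILED, DELETE_FAILED, and similar.
            if event.get("ResourceStatus", "").endswith("_FAILED"):
                failures.append({
                    "resource": event.get("LogicalResourceId"),
                    "type": event.get("ResourceType"),
                    "status": event.get("ResourceStatus"),
                    "reason": event.get("ResourceStatusReason"),
                })
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    # Events arrive newest first; reverse so the earliest failure comes first.
    failures.reverse()
    return failures
```

The earliest failure is usually the most useful one, because later failures are often cascading rollbacks.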
Logs & Events are available for failures during infra creation and deletion, bootstrap node creation and deletion, and while bootstrap creation is in progress.
Logs & Events are also available when cluster provisioning is complete but the operation status is not ready, even though nodes are being created. This discrepancy might occur when incorrect images are applied and the blueprint status is not initiated.
If nodegroup creation fails on Day 2, users can pull the Logs & Events of that specific nodegroup. To access them, click the corresponding 'nodegroup creation failed' link and review the details.
In addition to retrieving logs and events through the user interface, users can also pull cloud logs and events using the API and CLI.
Important
The IAM role policy ec2:GetConsoleOutput is required to pull bootstrap cloud-init logs.
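As an illustration, a policy statement granting this permission could be built as follows. This is a sketch only: the `Sid` value and the function name are hypothetical, and `"Resource": "*"` is an assumption that you may want to narrow to specific instances.

```python
import json


def console_output_policy():
    """Build an IAM policy document that allows ec2:GetConsoleOutput."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowBootstrapConsoleOutput",  # hypothetical identifier
                "Effect": "Allow",
                "Action": "ec2:GetConsoleOutput",
                "Resource": "*",  # assumption: scope this down where possible
            }
        ],
    }


# Render the document as JSON, ready to attach to the role behind the cloud credentials.
policy_json = json.dumps(console_output_policy(), indent=2)
```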