Provisioning
Cloud Credentials¶
The controller needs to be configured with GKE Credentials in order to programmatically create and configure the required GCP infrastructure. These credentials are securely managed as part of a cloud credential in the Controller.
The creation of a cloud credential is a "One Time" task. It can then be reused to create clusters in the future when required. Refer to GKE Credentials for additional instructions on how to configure this.
Important
To guarantee complete isolation across Projects (e.g. BUs, teams, environments, etc.), cloud credentials are associated with a specific project. They can be shared with other projects if necessary.
Prerequisites¶
Users must have the below setup in the GCP Console (a scripted sketch of these prerequisites follows this list):

- Create a Service Account with the below roles:
    - Compute Admin
    - Kubernetes Engine Admin
    - Service Account User
- APIs on Google Cloud Platform: enable the following APIs on your Google Cloud platform to provision a GKE cluster
    - Cloud Resource Manager API: Used for validating the user's GCP project
    - Compute Engine API: Used for validating and accessing various resources such as zones and regions on GCP that are used by the GKE cluster
    - Kubernetes Engine API
- Cluster in a VPC network
    - Ensure the firewall allows HTTP and HTTPS traffic
    - Create the subnet that you want to use before you create the cluster
    - A GCP VPC is global, but the subnet must be in the same region as your target cluster
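The prerequisites above can also be prepared from the command line. The following is a minimal sketch that shells out to the gcloud CLI; the project ID, service account name, region, network, subnet, and CIDR range are placeholder assumptions, not values required by the controller.

```python
# Minimal sketch of the GCP prerequisites using the gcloud CLI.
# All values below are placeholder assumptions; replace them with your own.
import subprocess

PROJECT_ID = "my-gcp-project"     # assumption: target GCP project ID
SA_NAME = "gke-provisioner"       # assumption: service account name
REGION = "us-central1"            # assumption: region of the target cluster
NETWORK = "gke-vpc"               # assumption: existing VPC network name
SUBNET = "gke-subnet"             # assumption: node subnet to create
SUBNET_RANGE = "10.0.0.0/20"      # assumption: node subnet CIDR

def run(*args: str) -> None:
    """Run a gcloud command and fail fast on errors."""
    subprocess.run(["gcloud", *args], check=True)

# Enable the APIs required to provision a GKE cluster
run("services", "enable",
    "cloudresourcemanager.googleapis.com",
    "compute.googleapis.com",
    "container.googleapis.com",
    f"--project={PROJECT_ID}")

# Create the service account referenced by the cloud credential
run("iam", "service-accounts", "create", SA_NAME, f"--project={PROJECT_ID}")

# Grant the three required roles to the service account
sa_email = f"{SA_NAME}@{PROJECT_ID}.iam.gserviceaccount.com"
for role in ("roles/compute.admin", "roles/container.admin", "roles/iam.serviceAccountUser"):
    run("projects", "add-iam-policy-binding", PROJECT_ID,
        f"--member=serviceAccount:{sa_email}", f"--role={role}")

# Create the node subnet in the same region as the target cluster
run("compute", "networks", "subnets", "create", SUBNET,
    f"--network={NETWORK}", f"--region={REGION}", f"--range={SUBNET_RANGE}",
    f"--project={PROJECT_ID}")

# Allow HTTP and HTTPS traffic into the VPC
run("compute", "firewall-rules", "create", "allow-http-https",
    f"--network={NETWORK}", "--allow=tcp:80,tcp:443", f"--project={PROJECT_ID}")
```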
High Level Steps¶
The image below describes the high level steps to provision and manage GKE clusters using the controller.
sequenceDiagram
autonumber
participant user as User/Pipeline
participant rafay as Controller
participant boot as Bootstrap Node
participant gke as GKE Cluster
user->>rafay: Provision GKE Cluster (UI, CLI)
note over boot, gke: GCP Project
rect rgb(191, 223, 255)
note right of rafay: For Every New GKE Cluster
rafay->>boot: Provision Bootstrap VM in GCP Project
rafay->>boot: Apply GKE cluster spec
boot->>gke: Provision GKE Cluster
boot->>gke: Pivot CAPI mgmt resources
boot->>gke: Apply Cluster Blueprint
gke->>rafay: Establish Control Channel with Controller
rafay->>boot: Deprovision Bootstrap Node
gke->>rafay: GKE Cluster Ready
end
rafay->>user: GKE Cluster Provisioned
Self Service UI¶
The controller provides users with a "UI Wizard" type experience to configure, provision and manage GKE clusters. The wizard prompts the user to provide critical cluster configuration details organized into logical sections:
- General
- Network Settings
- NodePools
- Security
- Feature
- Advanced
Create Cluster¶
- Click Clusters on the left panel and the Clusters page appears
- Click New Cluster
- Select Create a New Cluster and click Continue
- Select the Environment Public Cloud
- Select the Cloud Provider GCP and Kubernetes Distribution GCP GKE
- Provide a cluster name and click Continue
Constraints
- a. The cluster name should not exceed 40 characters
- b. The name must always begin with a letter; it cannot start with a number or any other character
- c. The cluster name should not end with a hyphen ("-")
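As a quick illustration of these constraints, a hypothetical pre-flight check might look like the sketch below; the regular expression is an assumption derived from the three rules listed, not a validation rule published by the controller.

```python
import re

def is_valid_cluster_name(name: str) -> bool:
    """Check the three naming constraints listed above (illustrative only)."""
    return (
        len(name) <= 40                         # a. at most 40 characters
        and bool(re.match(r"^[A-Za-z]", name))  # b. must begin with a letter
        and not name.endswith("-")              # c. must not end with a hyphen
    )

print(is_valid_cluster_name("gke-dev-cluster-01"))  # True
print(is_valid_cluster_name("1st-cluster"))         # False: starts with a digit
```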
General (Mandatory)¶
The General section is mandatory to create a cluster
- Select the Cloud Credential created with GCP credentials from the drop-down
- Enter the required GCP Project ID
- Select a Location Type, either Zonal or Regional
- On selecting Zonal, select a zone
- On selecting Regional, select a Region and Zone
- Select a Control plane version
- Select a Blueprint Type and version
Important
Use the GCP Project ID and not the Project Name.
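If you are unsure of the Project ID, it can be looked up with the gcloud CLI. The snippet below is a small illustrative helper and is not part of the controller.

```python
import subprocess

# List project IDs (not display names) visible to the active gcloud account
result = subprocess.run(
    ["gcloud", "projects", "list", "--format=value(projectId)"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```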
Network (Mandatory)¶
This section allows you to customize the network settings
- Provide a Network Name and Node Subnet.
Field Name | Field Description |
---|---|
Network Name | The name of the Google Cloud network that the cluster will be created in. |
Node Subnet Name | The name of the subnet in the network that the nodes in the cluster will be created in. |
Note: Use the name for the network and node subnet. Do not use the CIDR.
If using the VPC Network of the same project, the values are set to default.
To utilize the shared VPC Network (shared by host projects via GKE Console), provide the shared network paths for Network Name and Node Subnet Name as illustrated below.
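For example, shared VPC resources are commonly referenced by their full GCP resource paths, along the lines of `projects/<host-project-id>/global/networks/<network-name>` for the Network Name and `projects/<host-project-id>/regions/<region>/subnetworks/<subnet-name>` for the Node Subnet Name. The placeholders are illustrative and follow the standard GCP resource-path convention; confirm the exact format expected by the wizard for your environment.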
Field Name | Field Description |
---|---|
IPv4 network access | Choose the type of network to allow access to your cluster's workloads. Learn more about public and private clusters in Google Kubernetes Engine. |
Public Cluster | Choose a public cluster to allow access from public networks to the cluster's workloads. Routes aren't created automatically. This setting is permanent and cannot be changed after the cluster is created. Learn more about public and private clusters in Google Kubernetes Engine. |
Private Cluster | Choose a private cluster to assign internal IP addresses to Pods and nodes, isolating the cluster's workloads from public networks. This setting is permanent and cannot be changed after the cluster is created. Learn more about public and private clusters in Google Kubernetes Engine. |
Access control plane using its external IP address | Disabling this option locks down external access to the cluster control plane. Google still uses an external IP address for cluster management purposes, but it's not accessible to anyone. This setting is permanent. |
Enable Control plane global access | With control plane global access, access the control plane's private endpoint from any GCP region or on-premises environment, regardless of the cluster's region. Learn more |
Disable Default SNAT | To use Privately Used Public IPs (PUPI) ranges, disable the default source NAT used for IP masquerading. Learn more |
Cluster default Pod address range | Define the IP address range for all pods in the cluster. Use CIDR notation, leave blank for the default range. This setting is permanent. |
Maximum Pods per node | Determine the size of IP address ranges assigned to nodes on GKE. Pods on a node are allocated IP addresses from its assigned CIDR range. Optimize the partitioning of the cluster's IP address range at the node level. This setting is permanent. Learn more |
Service address range | Define the IP address range for Kubernetes services in the cluster's VPC network. Use CIDR notation, leave blank for the default range. This setting is permanent. |
Pod Secondary CIDR Range (Name) | Provide the Pod name reserved in the VPC network. Mandatory if shared VPC Network used |
Service Secondary CIDR Range (Name) | Provide the Service name reserved in the VPC network. Mandatory if shared VPC Network used |
- Select a Cluster Privacy option, Private or Public, and provide the relevant details
Important
When choosing the "Private" cluster privacy option, it is required to have at least one (1) Cloud NAT present in the project where the GKE cluster is being created (see the sketch below).
- Optionally, users can Add Firewall Rules.
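If the project does not already have a Cloud NAT in the cluster's region, one can be created with the gcloud CLI. The sketch below is illustrative only; the router and NAT names, project, network, and region are placeholder assumptions.

```python
import subprocess

PROJECT_ID = "my-gcp-project"   # assumption: project hosting the GKE cluster
NETWORK = "gke-vpc"             # assumption: VPC network used by the cluster
REGION = "us-central1"          # assumption: region of the cluster's subnet

def run(*args: str) -> None:
    subprocess.run(["gcloud", *args, f"--project={PROJECT_ID}"], check=True)

# A Cloud NAT gateway is attached to a Cloud Router in the same region
run("compute", "routers", "create", "gke-nat-router",
    f"--network={NETWORK}", f"--region={REGION}")

# NAT all subnet ranges in the region using auto-allocated external IPs
run("compute", "routers", "nats", "create", "gke-nat",
    "--router=gke-nat-router", f"--region={REGION}",
    "--auto-allocate-nat-external-ips", "--nat-all-subnet-ip-ranges")
```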
Firewall Rules¶
Adding Firewall Rules for GKE Private Cluster allows users to create and manage firewall rules tailored to their specific needs. This capability enhances network security by providing granular control over inbound and outbound traffic within GKE private clusters. Users can define custom firewall policies, ensuring only authorized connections are permitted while blocking unauthorized access attempts. This feature empowers users with greater control and flexibility in managing network traffic within GKE private cluster environments.
- Click Add Firewall and the Firewall Rule page appears
- Provide a name and the required details
Field Name | Field Description |
---|---|
Name* | Specifies a name for the Firewall rule |
Description | Provides an optional description for this resource. Include this field during resource creation |
Priority* | Determines the priority of this rule. It accepts integers between 0 and 65535, inclusively, with a default value of 1000. Relative priorities govern which rule takes precedence in case of conflicts |
Action* | Defines the action taken when traffic matches specified parameters: ALLOW or DENY |
Direction* | Indicates the traffic direction to which this firewall applies: INGRESS or EGRESS. The default is INGRESS. For EGRESS traffic, sourceTags fields cannot be specified |
Source Ranges | Specifies source IP ranges for which the firewall rule applies. Ranges must be in CIDR format and apply only to IPv4 traffic |
Destination Ranges | If destination ranges are specified, the firewall rule applies exclusively to traffic with destination IP addresses within these ranges, which must be expressed in CIDR format. Only IPv4 is supported |
Rules* | Lists rules defined by this firewall. Each rule describes a permitted connection using a protocol and port-range tuple. Ports: optional list of ports to which this rule applies, relevant only for the UDP or TCP protocol. Each entry must be an integer or a range; if unspecified, the rule applies to connections through any port. Example inputs include: ["22"], ["80","443"], and ["12345-12349"]. Protocol: specifies the IP protocol to which this rule applies. The protocol type is mandatory when creating a firewall rule and can be a well-known protocol string (tcp, udp, icmp, esp, ah, ipip, sctp) or the IP protocol number |
Target Tags | Enumerates tags controlling which instances the firewall rule applies to. If specified, the rule applies exclusively to instances in the VPC network tagged accordingly |

Note: Depending on the chosen Direction (Ingress or Egress), users are required to add either Source Ranges or Destination Ranges. If Ingress is selected, Source Ranges must be added, while for Egress, Destination Ranges are necessary. Adding either Source Ranges or Destination Ranges is mandatory, but not both simultaneously.
- Click Save
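For reference, the fields in the table above map closely to a standard GCP firewall rule. The sketch below shows how the same fields appear in a plain gcloud command; the rule name, project, network, ranges, and tags are placeholder assumptions used purely for illustration.

```python
import subprocess

# Illustrative only: an ingress rule allowing TCP 443 from a private range
# to instances tagged "gke-private-nodes" (all values are placeholders).
subprocess.run([
    "gcloud", "compute", "firewall-rules", "create", "allow-webhook-ingress",
    "--project=my-gcp-project",
    "--network=gke-vpc",
    "--direction=INGRESS",
    "--priority=1000",
    "--action=ALLOW",
    "--rules=tcp:443",
    "--source-ranges=10.0.0.0/20",
    "--target-tags=gke-private-nodes",
], check=True)
```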
- Optionally, enter the Pod Address Range and Service Address Range
- If no value is provided for Pod Address Range, each node in GKE receives a /24 alias IP range of 256 addresses for hosting the Pods that run on it
- If no value is provided for Service Address Range, service (cluster IP) addresses are taken from the secondary IP address range for Services in the cluster's subnet. This range must be large enough to provide an address for all the Kubernetes Services you host in your cluster
- Enter the count for Max Pods Per Node (see the sizing sketch below)
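Max Pods Per Node drives how large a Pod range each node receives: GKE assigns each node a range with at least twice as many addresses as the configured maximum, which is why the default of 110 pods maps to a /24. The helper below is a small illustrative calculation, not controller logic.

```python
import math

def node_pod_cidr_size(max_pods_per_node: int) -> int:
    """Smallest CIDR prefix whose range holds at least 2x the max pods per node."""
    needed = 2 * max_pods_per_node
    host_bits = math.ceil(math.log2(needed))
    return 32 - host_bits

print(node_pod_cidr_size(110))  # 24 -> /24 (256 addresses for up to 110 pods)
print(node_pod_cidr_size(32))   # 26 -> /26 (64 addresses for up to 32 pods)
```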
Dataplane V2 and Network Policy Config¶
- Enabling DataPlane V2 in GKE activates Kubernetes network policy and enhances network security, visibility, and scalability using eBPF technology. This option is recommended for optimal performance and is particularly beneficial for large-scale applications. If DataPlane V2 is enabled, users are allowed to enable DataPlane V2 Metrics and DataPlane V2 Observability
- Enabling DataPlane V2 Metrics allows users to optimize network performance, troubleshoot connectivity issues, and ensure the overall reliability of their Kubernetes environment with greater ease.
- Enabling DataPlane V2 Observability provides a Managed Hubble CLI solution for real-time observation of network flows between Kubernetes workloads. This feature offers valuable insights into communication patterns, facilitating efficient monitoring and troubleshooting to ensure network reliability and security.
Refer to this page to learn about the limitations of GKE DataPlane V2.
- Enabling the Calico Kubernetes Network Policy option in GKE clusters enhances security, visibility, and scalability by activating the Network Policy API. With fine-grained control and efficient eBPF technology, it ensures robust security and high performance for large-scale deployments, simplifying network policy management for secure microservices and multi-tenancy environments.
Refer to this page to learn about the limitations of GKE Network Policy.
Important
Users are allowed to enable either DataPlane V2 or Calico Kubernetes Network Policy during GKE cluster creation, but not both simultaneously.
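For context, these two options correspond to mutually exclusive GKE cluster creation flags. The gcloud command below is only an illustration of the underlying GKE settings; when provisioning through the controller, the wizard applies the equivalent configuration for you, and the cluster name and region are placeholders.

```python
import subprocess

# Illustrative only: the two options map to mutually exclusive GKE flags.
# Cluster name and region are placeholders; choose exactly one of the flags.
USE_DATAPLANE_V2 = True

flag = "--enable-dataplane-v2" if USE_DATAPLANE_V2 else "--enable-network-policy"

subprocess.run([
    "gcloud", "container", "clusters", "create", "demo-cluster",
    "--region=us-central1", flag,
], check=True)
```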
- If Private Cluster is selected, there are two options: setting the Access Control Plane External IP or configuring an Authorized Network. In the former, only a specific external IP assigned by Google can access the cluster, while in the latter, a CIDR range is provided, permitting all IPs within that range to access the private cluster
NodePools¶
By default, a new cluster will be created with at least one node pool
- To add more node pools, click Add Node Pool
- Enter a name and select the required Node K8s version
- Enter the number of nodes
- Enable/disable Node Zone. On enabling, add one or more zone(s)
- Enable/disable cluster autoscaler to automatically create or delete nodes based on the workload
- Enable/disable Automatically upgrade nodes to the next available version. Enabling this option will automatically upgrade the nodes within a cluster to the latest available version. Ensure that the Node K8s version matches the control plane version exactly or is within one minor version lower when auto upgrade nodes version is enabled
- To implement a node pool upgrade strategy, activate the Configure Upgrade Settings. Enabling this option will display two strategies to choose from. Opt for either the Surge Upgrade or the Blue Green Upgrade based on the requirement
Surge Upgrade
On selecting surge upgrade, nodes are upgraded one by one or in small batches with controlled disruption. This type of strategy for upgrading node pools includes two (2) important settings:
- Max Surge: This determines how many new nodes can be added at most to the node pool while upgrading. It ensures a controlled and gradual increase in capacity.
- Max Unavailable: This sets the maximum number of nodes that can be offline simultaneously (not in Ready state) during the upgrade. It's about managing node downtime carefully to prevent service disruptions.
Important
- Sum of Max Surge and Max Unavailable should not exceed 20
- Max Surge value cannot be zero '0' if Max Unavailable is set to zero '0'
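For reference, these two settings correspond to GKE's surge upgrade parameters on the node pool. The command below is a hedged illustration of the underlying gcloud flags with placeholder cluster, node pool, and region names; in practice the controller applies these values from the wizard.

```python
import subprocess

# Illustrative: surge upgrade with at most 1 extra node and 0 unavailable nodes
subprocess.run([
    "gcloud", "container", "node-pools", "update", "pool-1",
    "--cluster=demo-gke", "--region=us-central1",
    "--max-surge-upgrade=1", "--max-unavailable-upgrade=0",
], check=True)
```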
Blue Green Upgrade
On selecting Blue Green Upgrade, a new set of nodes with updates is created, validated, and can be switched to while keeping the old nodes as a backup, allowing for easy rollback if needed. This type of strategy for upgrading node pools includes three(3) specific pieces of information:
- Batch Node Count: This is the fixed number of nodes to be gradually drained in each batch. If this number is set to zero, this step will be skipped entirely.
- Batch Soak Duration: This indicates the amount of time, measured in seconds, to pause after every batch of nodes has been drained. During this pause, you can assess your workload to make sure everything is functioning as expected after the nodes have been upgraded.
- Nodepool Soak Duration: After all batches have been completely drained, this duration in seconds is the waiting time before proceeding. It provides an opportunity for you to double-check your workload's health before proceeding further.
Important
The max duration for Batch Soak Duration is 604800 seconds (7 days)
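These three values map to GKE's blue-green rollout settings on the node pool. The command below sketches the underlying gcloud flags with placeholder names and durations; the controller sets the equivalent values from the wizard.

```python
import subprocess

# Illustrative: blue-green upgrade draining 1 node per batch, soaking 5 minutes
# per batch and 30 minutes for the whole node pool before completion.
subprocess.run([
    "gcloud", "container", "node-pools", "update", "pool-1",
    "--cluster=demo-gke", "--region=us-central1",
    "--enable-blue-green-upgrade",
    "--standard-rollout-policy=batch-node-count=1,batch-soak-duration=300s",
    "--node-pool-soak-duration=1800s",
], check=True)
```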
- Optionally, provide the Configure Node Settings, GPU Settings, and Reservation Affinity details
Configure node settings
Configure node settings ensures that any new nodes created within this node pool will adhere to the specified configurations
GPU Configuration
Optionally, enable Configure GPU Settings and provide the required details
Reservation Affinity
Provide the required Reservation Affinity settings config details
- Optionally, provide Node networking, Node security, and Node metadata
- Click Save
Security (Optional)¶
This section allows you to customize the Security Settings
- Enable Enable Workload Identity to connect securely to Google APIs from Kubernetes Engine workloads (a usage sketch follows the table below)
- Enable Enable Google Groups for RBAC to grant roles to all members of a Google Workspace group. On enabling this option, enter the required group name
- Enable Enable Legacy Authorization to support in-cluster permissions for existing clusters or workflows; note that this does not provide full RBAC support
- Provide Client Certificate to authenticate to the cluster endpoint
Field Name | Field Description |
---|---|
Workload Identity | Workload Identity is a feature of Google Kubernetes Engine (GKE) that allows workloads running on GKE to securely access Google Cloud services. It enables you to assign distinct, fine-grained identities and authorization for each application in your cluster. |
Google Groups for RBAC | Google Groups for RBAC lets you assign RBAC permissions to members of Google Groups in Google Workspace. Learn more |
Legacy Authorization | Legacy Authorization enables in-cluster permissions for existing clusters or workflows. It does not support full RBAC. Learn more |
Issue a client certificate | The "Issue a Client Certificate" setting controls whether a client certificate will be issued for the cluster. Client certificates provide an additional layer of security when authenticating to the cluster endpoint (Kubernetes API server). Note that certificates don't rotate automatically and revoking them can be difficult. You can still authenticate to the cluster using Identity and Access Management (IAM) or basic authentication, although it is not recommended. |
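As context for the Workload Identity option, once the cluster is provisioned a Kubernetes service account is typically bound to a Google service account so pods can call Google APIs without keys. The sketch below is a generic GKE example with placeholder project, namespace, and account names; it is not a step performed by the wizard.

```python
import subprocess

PROJECT_ID = "my-gcp-project"   # assumption: project hosting the cluster
NAMESPACE = "default"           # assumption: workload namespace
KSA = "app-ksa"                 # assumption: Kubernetes service account
GSA = f"app-gsa@{PROJECT_ID}.iam.gserviceaccount.com"  # assumption: Google SA

# Allow the Kubernetes service account to impersonate the Google service account
subprocess.run([
    "gcloud", "iam", "service-accounts", "add-iam-policy-binding", GSA,
    "--role=roles/iam.workloadIdentityUser",
    f"--member=serviceAccount:{PROJECT_ID}.svc.id.goog[{NAMESPACE}/{KSA}]",
], check=True)

# Annotate the Kubernetes service account with the Google service account
subprocess.run([
    "kubectl", "annotate", "serviceaccount", KSA,
    f"iam.gke.io/gcp-service-account={GSA}", f"--namespace={NAMESPACE}",
], check=True)
```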
Feature Setting (Optional)¶
Enable the required features
Field Name | Field Description |
---|---|
Cloud Logging | Collect logs emitted by your applications and GKE infrastructure. Learn more |
Enable Cloud Monitoring | Monitor metrics emitted by your applications and GKE infrastructure. Learn more |
Enable Managed Service for Prometheus | Deploy managed collectors for Prometheus metrics within this cluster. These collectors must be configured using PodMonitoring resources. Learn more |
Enable Backup for GKE | Enable backup and restore for GKE workloads. Costs are based on the data size and the number of protected pods. Learn more |
Enable Filestore CSI Driver | Automatically deploy and manage the Filestore CSI Driver in this cluster. Learn more |
Enable Image Streaming | Allow workloads to initialize without waiting for the entire image to download. Learn more |
Enable Compute Engine Persistent Disk CSI Driver | Automatically deploy and manage the Compute Engine Persistent Disk CSI Driver. This feature is an alternative to using the gcePersistentDisk in-tree volume plugin. Learn more |
Advanced Settings (Optional)¶
Proxy Configuration
Optionally, users can provide Proxy Configuration details.
- Select Enable Proxy if the cluster is behind a forward proxy.
- Configure the http proxy with the proxy information (ex: http://proxy.example.com:8080)
- Configure the https proxy with the proxy information (ex: http://proxy.example.com:8080)
- Configure No Proxy with a comma-separated list of hosts that need connectivity without the proxy. Include the node subnet range used by the cluster (ex: 10.108.10.0/24)
- Configure the Root CA certificate of the proxy if the proxy is terminating non-MTLS traffic
- Enable TLS Termination Proxy if the proxy is terminating non-MTLS traffic and it is not possible to provide the Root CA certificate of the proxy.
Once all the required config details are provided, perform the below steps
- Click Save Changes and proceed to cluster provisioning
- The cluster is ready to provision. Click Provision
Provision Progress¶
Once the user clicks Provision, the system steps through a list of conditions required for successful provisioning as shown below
Successful Provisioning¶
Once all the steps are complete, the cluster is successfully provisioned as per the specified configuration. Users can now view and manage the GKE Cluster in the specified Project in the Controller. On successful provisioning, the user can view the cluster dashboards
Download Config¶
Administrators can download the GKE Cluster's configuration either from the console or using the RCTL CLI
Failed Provisioning¶
Cluster provisioning can fail if the user has misconfigured the cluster configuration (e.g. wrong cloud credentials) or has encountered soft limits for resources in their GCP account. When this occurs, the user is presented with an intuitive error message. Users are allowed to edit the configuration and retry provisioning
Refer to Troubleshooting section to learn about potential failure scenarios.
Pause/Resume Provisioning¶
During cluster provision, if an error occurs or provisioning fails due to any configuration issues, users can pause provisioning, rectify the issues and resume the cluster provisioning
- On receiving any error as shown below, click Pause Provision
- Once the configuration details are rectified, click Resume Provision as shown below
Note: This process cleans up the resources that are not required
Refer to Troubleshooting section to learn more about potential failure scenarios.