
Multi Tenancy

Every end customer (e.g. an enterprise) gets access to their own Org. Within an Org, multiple options for multi-tenancy are supported for user workspaces. These options are powered and enforced using the capabilities of Rafay's market-leading Kubernetes Management platform. The image below describes the various options for multi-tenancy supported by the platform.

Multi Tenancy Options

This means that different workspaces in the same customer Org can leverage different tenancy models, allowing administrators to ensure that the most appropriate tenancy approach is used for the use case at hand.

The supported approaches are:

1. Namespace
2. Virtual Cluster
3. Dedicated Cluster

Note

For GPU Cloud Providers, Rafay provides the means to support multiple customers, with each customer assigned its own Org. Review the following documentation for details and to understand the options for white labeling.


Namespace

This option is well suited for users who need access to a few GPUs, but need them on demand with near-instant access to the resources. A Kubernetes namespace allows the organization to partition an existing cluster into logical mini-clusters and assign them to users.
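
As a minimal sketch of this partitioning (names and labels below are illustrative, not Rafay defaults), a per-user namespace is simply a Namespace object to which quotas, RBAC, and network policies (covered later on this page) are attached:

apiVersion: v1
kind: Namespace
metadata:
  name: workspace-alice        # illustrative per-user namespace
  labels:
    workspace-owner: alice     # illustrative label for ownership and cost attribution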

Users allocated a Kubernetes namespace will not have privileged access to the cluster. For example, they will not have the cluster-wide privileges required to deploy applications that are packaged as CRDs.

Note

Users that require support for cluster-wide privileges are recommended to use the "Virtual Cluster" or "Dedicated Cluster" options described below.


Virtual Clusters

Virtual clusters (aka vClusters) are essentially full Kubernetes clusters that operate inside a namespace. Virtual clusters have their own API server, which provides better isolation for use cases where namespaces are not practical.

Tenant Autonomy

A data scientist may be unable to install software packaged as Kubernetes CRDs into a namespace because of the lack of cluster-wide privileges. With a virtual cluster, the platform team can provide the user with full autonomy.
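
For instance, CustomResourceDefinitions are cluster-scoped objects, so a tenant confined to a namespace cannot create them on the host cluster; inside a virtual cluster, the same manifest applies without platform team involvement. A minimal, illustrative CRD (the group and kind names are hypothetical):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: notebooks.example.com          # hypothetical resource installed by the tenant
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: notebooks
    singular: notebook
    kind: Notebook
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true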

Separation of Duties

Instead of a complex, shared responsibility and support model, platform teams can focus on supporting and maintaining the underlying "host cluster" and the "namespace" in which the virtual cluster operates. They can delegate the administrative responsibilities of the virtual cluster to the end user.


Dedicated Clusters

When users require very large compute profiles (i.e. clusters with a large number of GPUs, e.g. 100 GPUs), it may be practical to provide the user with a dedicated Kubernetes cluster. Common examples of use cases that require this scale are massive, full-cluster distributed training and model fine-tuning.

Although these tasks are generally not long running (i.e. they operate for a few days or weeks), they typically require extremely high levels of concurrency with gang scheduling (i.e. no resource contention) and demanding latency requirements (i.e. network bandwidth and locality). It is common for these clusters to leverage high-end Nvidia Quantum-2 InfiniBand with support for GPUDirect RDMA.
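
Rafay does not prescribe a specific gang scheduler, but as one illustration of the concept, assuming a scheduler such as Volcano is installed on the dedicated cluster, a PodGroup expresses the "all-or-nothing" placement these jobs need:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-finetune       # illustrative name
spec:
  minMember: 16                    # do not start any worker pod until all 16 can be placed together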

Here is an example of a dedicated Kubernetes cluster managed by Rafay. As you can see, this cluster has 128 GPUs spanning 19 nodes.

Large Dedicated GPU Cluster

However, the cluster is architected so that only a subset of the 19 nodes have GPUs attached to them. In this example, the node shown below has 8 available GPUs.

Node with GPUs
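
As a sketch of how a workload consumes the GPUs on such a node, pods request them through the nvidia.com/gpu extended resource (this assumes the NVIDIA device plugin is deployed on the cluster; the image and command below are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # illustrative training image
      command: ["python", "train.py"]           # illustrative entrypoint
      resources:
        limits:
          nvidia.com/gpu: 8                     # schedules onto a node with 8 free GPUs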


Controls

Administrators can configure Rafay GPU PaaS to automatically enforce the following controls for the various multi-tenancy options supported.

1. Resource Quotas
2. RBAC
3. Network Policy
4. Cluster Policy
5. Identity-based Access
6. Audit Logging
7. Cost Visibility

Resource Quotas

For namespace- and virtual cluster-based multi-tenancy, when the namespace is created for the user, resource quotas and limits are automatically implemented and enforced for all resources, including GPUs. This ensures fair and efficient resource utilization among multiple users on the same Kubernetes cluster.

The image below shows the administrative experience in the web console for GPU requests and limits. For example, if the limit for the GPU resource is configured as "8", at any given time a maximum of 8 GPUs can be used by resources in this namespace.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
spec:
  hard:
    pods: "5"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "4"
    limits.memory: 2Gi
    # GPUs are an extended resource; ResourceQuota supports only the requests. prefix for them
    requests.nvidia.com/gpu: "8"

Namespace Quotas for GPUs

Note

Learn more about Resource Quotas/Limits.

Dedicated clusters are provisioned into projects, where resource quotas can be specified as well. In addition, the environment blueprint responsible for lifecycle management of the Kubernetes cluster will also be configured with limits for resources such as "max nodes".


RBAC

Kubernetes RBAC is a critical security control to ensure that users and workloads only have access to the resources required to execute their roles. By default, with Rafay GPU PaaS, users that are allocated a namespace are automatically locked down with permissions at the namespace level only (i.e. using RoleBindings). This control ensures that the users have rights only within the specific namespace and do not have the ability to perform any cluster-level commands.

Namespace RBAC
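
Conceptually, the lockdown boils down to a namespaced RoleBinding like the sketch below (the user and namespace names are illustrative); because there is no cluster-scoped binding, cluster-level commands are denied:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-namespace-admin
  namespace: workspace-alice            # the user's namespace
subjects:
  - kind: User
    name: alice@example.com             # identity asserted by the IdP
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin                           # built-in role, granted only within this namespace
  apiGroup: rbac.authorization.k8s.io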

For Virtual and Dedicated Clusters, the user is automatically mapped to a cluster-scoped RBAC role (i.e. a "ClusterRole").

Note

Learn more about Roles and SSO via an Identity Provider.


Network Policy

Network Policies are a mechanism to control network traffic flow within and from/to Kubernetes clusters. With GPU PaaS, all namespaces are locked down with a default network policy (which an admin can override) that blocks all resources in the namespace from exchanging network traffic with other namespaces and with endpoints outside the cluster.

Namespace Network Policy
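
A default policy of this kind can be sketched as follows (the namespace name is illustrative); it permits traffic between pods in the same namespace and denies everything else, and in practice an additional allowance for DNS is often added:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-cross-namespace
  namespace: workspace-alice
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector: {}          # only pods in this same namespace may connect in
  egress:
    - to:
        - podSelector: {}          # pods may only connect to peers in this namespace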

Note

Learn more about Network Policy Enforcement in the Rafay platform.


Cluster Policy

For all clusters under management, a cluster-level policy (based on OPA Gatekeeper) is automatically implemented and enforced to strengthen governance. This provides the means to control what users can and cannot do on the cluster. It also ensures that the clusters are always in compliance with centralized policies such as:

  • All images must be from approved repositories
  • All pods must have resource limits
  • All pods must have a label that lists a point-of-contact (email address)

Cluster Policy
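
As a sketch of the third policy above, a Gatekeeper constraint could look like the following; this assumes the K8sRequiredLabels ConstraintTemplate is installed, and the exact parameter schema depends on the template version in use:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: pods-require-point-of-contact
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    labels: ["point-of-contact"]    # every pod must carry this label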

Note

Learn more about Cluster Policy Enforcement in the Rafay platform.


Remote Access

To ensure the highest levels of security, all users are required to centrally authenticate using the configured Identity Provider (IdP). Once successfully authenticated, an ephemeral service account for the user is federated on the remote cluster in a Just-in-Time (JIT) manner.

Users are provided with the means to remotely access their namespace and perform kubectl operations using the kubectl CLI or an integrated browser-based shell.

ZTKA Web Shell

Note

Learn more about Zero Trust Kubectl in the Rafay platform.


Audit Logging

A centralized and immutable audit trail is generated for all activity performed by the users via all supported interfaces. Administrators are provided with centralized access to the audit logs. The audit logs can also be configured to be streamed in real time to a configured SIEM.

Audit Logs

Note

Learn more about how Audit Logs are centrally aggregated in the Rafay platform.


Cost Visibility & Allocation

Administrators who configure and enable cost profiles for their Kubernetes clusters will benefit from the integrated cost visibility and allocation/governance capabilities in the platform. Enabling this is considered an industry best practice because it provides the organization with a view into total spend, spend by workspace, spend by user, etc. This data can then be used for internal billing or chargeback workflows.