Token Factory Upgrade Instructions
Existing compute clusters created before v3.1.39 must be updated using the latest bootstrap YAML after upgrading the controller to v3.1.39.
This update is required to enable:
- GPU type visibility
- GPU selection during model deployment
- Custom monitoring alerts
These steps apply only to existing compute clusters created before v3.1.39.
Newly created compute clusters do not require these steps.
Step 1: Upgrade the Compute Cluster¶
Download the latest bootstrap YAML generated from the controller and re-apply it on the existing compute cluster.
kubectl apply -f <compute-name-compute>-bootstrap.yaml
During the upgrade process, services in the gaap-controller namespace are restarted, including:
- Endpoint pods
- Model deployment pods
Temporary service interruption may occur until all components become operational.
Step 2: Scale Down Existing Model Deployments (If Required)¶
This step is required only when the compute cluster does not have sufficient resources to bring up new model deployment pods during rolling upgrades.
In such scenarios, the new model deployment pods may remain in the Pending state because the existing pods are already consuming the available resources.
Scale down the existing model deployment to 0 replicas.
kubectl scale deployment -n gaap-controller <model-deployment-name> --replicas=0
The gaap-syncer automatically reconciles and brings up the model deployments again.
Post Upgrade Verification¶
After the upgrade is completed:
- GPU type information appears in the GPU Inventory by Type section
- GPU selection options are available during model deployment
- Custom monitoring alerts function as expected