Get Started with the Rafay MCP ServerΒΆ
OverviewΒΆ
This guide walks you through using the Rafay MCP server in Claude to perform two key workflows:
- Fleet Reporting β Generate executive-level reports covering cluster health, utilization, and cost consolidation across your entire fleet without opening the Rafay console
- Cluster Troubleshooting β Identify, diagnose, and remediate Kubernetes issues using natural language powered by Rafay platform visibility
AssumptionsΒΆ
This guide assumes the following are already in place:
- You have an active Rafay Platform account with operator-level access
- The Rafay MCP server is installed and configured in Claude Desktop
- An active managed Kubernetes cluster
If you have not yet installed and configured the Rafay MCP server, refer to the Rafay MCP Server Setup Guide before proceeding.
What You Will DoΒΆ
| Step | Action |
|---|---|
| 1 | Run fleet inventory and health reporting |
| 2 | Analyze cluster utilization and consolidation opportunities |
| 3 | Calculate cost based on actual cluster resource specs |
| 4 | Create demo issues on a cluster for troubleshooting |
| 5 | Diagnose and remediate issues using the MCP server |
Part 1 β Fleet ReportingΒΆ
Step 1 β Fleet InventoryΒΆ
Start a new Claude conversation. Use the Rafay MCP server to pull a complete inventory of your fleet:
"I need an executive-level infrastructure report. Start by giving me a complete inventory of all clusters across all projects. For each cluster include the name, project, status, blueprint, region, and when it was last updated."
Step 2 β Health & Status AnalysisΒΆ
"Now give me a health breakdown. How many clusters are healthy, degraded, or offline? For any cluster that is not fully healthy, tell me what is wrong β which addons have failed, which workloads are in error, and how long they have been in that state."
"Which clusters have had no workloads deployed to them in the last 30 days? These are candidates for immediate decommission."
"Are there any clusters running outdated blueprints? List them and show me how far behind they are from the current version."
Step 3 β Utilization AnalysisΒΆ
"For each cluster, how many workloads are currently deployed? Give me a workload count per cluster sorted from highest to lowest."
"Which clusters are running fewer than 3 workloads? Flag these as underutilized."
"Are there clusters in the same project and same region running similar workloads? These are the strongest consolidation candidates."
"Show me the distribution of workloads across projects β which projects are spreading workloads across many clusters versus consolidating onto fewer?"
Step 4 β Consolidation AnalysisΒΆ
"Based on the utilization data, identify which clusters could be consolidated. Specifically: which underutilized clusters are in the same region and project as a healthier, higher-utilization cluster that could absorb their workloads?"
"How many clusters could realistically be decommissioned if we consolidated underutilized workloads? Give me a conservative estimate and an aggressive estimate."
"Are there any clusters that exist only in dev or staging projects with zero production workloads? These should be prioritized for consolidation or scheduled shutdown."
Step 5 β Cost AnalysisΒΆ
Pull actual resource specificationsΒΆ
"For each cluster in the fleet, pull the actual resource specifications β number of nodes, CPU per node, memory per node, and storage."
"For any cluster where node specs are not directly available, use the cluster type and blueprint to infer the likely node shape β for example, a GPU cluster on H100s versus a standard CPU cluster on E5 Flex instances."
Apply cost ratesΒΆ
"Now apply the following cost rates to the resource data you collected:
- CPU: $0.030 per OCPU per hour (or $0.015 per vCPU per hour) - Memory: $0.002 per GB per hour - GPU (H100): $2.00 per GPU per hour - GPU (A100): $1.50 per GPU per hour - Block Storage: $0.025 per GB per month
For each cluster calculate: hourly cost, monthly cost (730 hours), and annual cost. Sum to a fleet total."
"Which clusters are the most expensive? Rank them by monthly cost from highest to lowest. Flag any cluster whose cost is disproportionate to its workload count."
Calculate consolidation savingsΒΆ
"For the underutilized clusters identified earlier, add up their total monthly resource cost. This is the potential monthly savings if those clusters are decommissioned."
"If we consolidated those workloads onto the surviving clusters, would the surviving clusters need to be right-sized up? Estimate the incremental cost increase and subtract it from the savings to get the net monthly saving."
"Which three clusters, if decommissioned today, would deliver the highest cost reduction with the lowest migration risk? Show me the cost of each and the net annual saving."
Build the cost summary tableΒΆ
"Give me a cost summary table with columns: Cluster Name, Project, Node Count, Total vCPU, Total Memory (GB), Total GPU, Monthly Cost, Workload Count, Cost per Workload. Sort by Monthly Cost descending."
Part 2 β Cluster TroubleshootingΒΆ
Step 1 β Create Demo IssuesΒΆ
Before running the troubleshooting prompts, deploy a set of broken resources onto a cluster to simulate a real-world incident.
Download the setup script and run it on the target cluster:
chmod +x create-cluster-issues.sh
./create-cluster-issues.sh
This creates the following broken resources in a demo-issues namespace:
| # | Resource | Issue |
|---|---|---|
| 1 | bad-image-pod |
ImagePullBackOff β invalid image tag |
| 2 | resource-hog-pod |
Pending β impossible resource requests |
| 3 | crashloop-pod |
CrashLoopBackOff β container exits with code 1 |
| 4 | broken-deployment |
0/3 replicas β bad image on all pods |
| 5 | no-endpoints-service |
No endpoints β selector matches no pods |
| 6 | missing-configmap-pod |
CreateContainerConfigError β ConfigMap not found |
| 7 | unbound-pvc |
Pending β storage class does not exist |
Step 2 β Fleet-Level DiscoveryΒΆ
Use the Rafay MCP server to identify which cluster has problems:
"Give me a health overview of all my clusters. Which clusters have resources in a non-running or degraded state? I want to know which clusters need attention right now."
"Which cluster is showing the most workload or addon failures? Give me the cluster name, project, and a summary of what is failing."
Step 3 β Cluster-Level DiscoveryΒΆ
Drill into the problem cluster to surface all affected resources:
"Look at the demo-issues namespace on the cluster. List every resource β pods, deployments, services, and PVCs β and tell me the status of each one. Flag anything that is not healthy."
"Give me a numbered list of every issue you can find in the demo-issues namespace. For each issue include: the resource name, resource type, current status, and a one-line description of what is wrong."
Step 4 β Diagnose and RemediateΒΆ
Work through each issue in the order surfaced by the server.
Issue 1 β ImagePullBackOffΒΆ
"The pod bad-image-pod is in ImagePullBackOff. What does this error mean, what is causing it, and what is the exact fix I need to apply?"
Fix:
kubectl delete pod bad-image-pod -n demo-issues
kubectl run bad-image-pod --image=nginx:latest -n demo-issues
Issue 2 β Pending PodΒΆ
"The pod resource-hog-pod has been Pending. What is preventing it from being scheduled, and what are my options to fix it?"
Fix:
kubectl delete pod resource-hog-pod -n demo-issues
kubectl run resource-hog-pod --image=nginx:latest \
--requests='cpu=100m,memory=256Mi' \
--limits='cpu=500m,memory=512Mi' \
-n demo-issues
Issue 3 β CrashLoopBackOffΒΆ
"The pod crashloop-pod is in CrashLoopBackOff. What am I looking for in the logs, and what is the fix?"
Fix:
kubectl logs crashloop-pod -n demo-issues --previous
kubectl delete pod crashloop-pod -n demo-issues
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: crashloop-pod
namespace: demo-issues
spec:
containers:
- name: app
image: busybox
command: ["sh", "-c", "sleep 3600"]
restartPolicy: Always
EOF
Issue 4 β Deployment with 0 Available ReplicasΒΆ
"The deployment broken-deployment shows 0/3 replicas available. What is wrong and how do I fix it without deleting the deployment?"
Fix:
kubectl set image deployment/broken-deployment app=nginx:latest -n demo-issues
kubectl rollout status deployment/broken-deployment -n demo-issues
Issue 5 β Service with No EndpointsΒΆ
"The service no-endpoints-service has no endpoints. What does this mean for any application trying to reach it, and how do I fix it?"
Fix:
kubectl get pods -n demo-issues --show-labels
kubectl patch service no-endpoints-service -n demo-issues \
--type='json' \
-p='[{"op":"replace","path":"/spec/selector/app","value":"broken-app"}]'
kubectl get endpoints no-endpoints-service -n demo-issues
Issue 6 β Missing ConfigMapΒΆ
"The pod missing-configmap-pod is in CreateContainerConfigError. What is causing this and what is the fastest way to fix it?"
Fix:
kubectl create configmap this-configmap-does-not-exist \
-n demo-issues \
--from-literal=APP_ENV=production \
--from-literal=LOG_LEVEL=info
kubectl get pod missing-configmap-pod -n demo-issues -w
Issue 7 β Unbound PVCΒΆ
"The PVC unbound-pvc has been in Pending state. What is causing this and how do I resolve it?"
Fix:
kubectl get storageclasses
kubectl delete pvc unbound-pvc -n demo-issues
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: unbound-pvc
namespace: demo-issues
spec:
accessModes:
- ReadWriteOnce
storageClassName: standard
resources:
requests:
storage: 10Gi
EOF
ConclusionΒΆ
In this guide you used the Rafay MCP server to perform work that would typically require an operator to navigate multiple console screens, pull data from different sources, and manually assemble reports.
In Part 1 β Fleet Reporting, the MCP server queried the Rafay platform API across all projects and clusters in a single conversation to:
- Produce a complete fleet inventory with health status, blueprint versions, and last-updated timestamps
- Surface underutilized and unhealthy clusters without opening the Rafay console
- Identify consolidation candidates by correlating workload counts, regions, and project boundaries
- Calculate real infrastructure cost by pulling actual node specs and applying per-resource cost rates
- Generate a ranked cost summary with cost-per-workload efficiency ratios ready to share with leadership
In Part 2 β Cluster Troubleshooting, the MCP server provided platform-level visibility to identify which clusters had problems and surfaced the affected workloads, allowing you to:
- Identify degraded clusters across the fleet with a single natural language query
- Drill into a specific namespace and retrieve a structured list of every broken resource
- Diagnose the root cause of seven distinct failure types through conversational prompts
- Apply targeted remediations for each issue without switching between tools
The Rafay MCP server compresses workflows that typically span multiple console sessions, CLI commands, and spreadsheets into a single Claude conversation β giving platform engineers, SREs, and infrastructure leads faster answers and more time to focus on higher-value work.









