Skills
Rafay Skills are a set of Agent Skills that let an AI assistant operate the Rafay Platform from natural language. Instead of remembering which API to call in what order, you describe the problem ("is this cluster healthy?", "the workload won't publish", "give me an org overview") and the assistant follows a pre-authored, opinionated workflow against the Rafay MCP server.
Info
The Rafay Skills are published at RafaySystems/rafay-skills and currently cover one reporting skill and three diagnostic skills.
What is a skill?
A skill is a single SKILL.md file containing YAML front-matter (name, description, argument hints) and a Markdown body that encodes the procedure the assistant should follow. The assistant reads the front-matter to decide when a skill applies, then executes the body's steps using whatever Rafay MCP tools are available.
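To make the split between front-matter and body concrete, here is a minimal Python sketch that separates the two parts of a SKILL.md file. It assumes the front-matter sits between two `---` lines and holds only flat `key: value` pairs. The `split_skill` helper, the `argument-hint` key, and the sample text are hypothetical; a real loader would more likely use a full YAML parser.

```python
from typing import Dict, Tuple


def split_skill(text: str) -> Tuple[Dict[str, str], str]:
    """Split SKILL.md text into (front-matter dict, Markdown body).

    Assumption: front-matter sits between two '---' lines at the top of
    the file and contains only flat 'key: value' pairs.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front-matter at all
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return {}, text  # unterminated front-matter: treat everything as body
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Hypothetical skill file; keys and wording are illustrative only.
SAMPLE = """---
name: diagnose-workload
description: Diagnose workload publish and sync failures
argument-hint: workload_name project_name
---
1. Fetch the workload.
2. Infer the target cluster.
"""

meta, body = split_skill(SAMPLE)
print(meta["name"])
print(body.splitlines()[0])
```

The assistant would match a request against fields like `description` before following the steps in the body.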
Every skill in this repo is built on the same four MCP tools:
| Tool | Role |
|---|---|
| `rafay_describe` | Confirm supported resource types, field names, and pagination defaults before listing. |
| `rafay_list` | Enumerate resources (clusters, workloads, namespaces, blueprints, users, addons). |
| `rafay_get` | Fetch a single resource's full payload (cluster, workload, blueprint). |
| `rafay_execute` | Run an action such as `kubectl` against a cluster through Rafay's Zero Trust Kubectl channel. |
The exact tool surface depends on your Rafay MCP server version, so the skills are written to read schemas from the host's tool descriptors rather than hard-coding field names.
Prerequisites
Before you run any skill
- Rafay MCP connected: Configure the Rafay MCP server in your assistant (for example, in Cursor's MCP settings or Claude's connector settings) and sign in if prompted.
- Project context: The MCP needs to resolve a Rafay project, either via the `RAFAY_PROJECT` environment variable on the MCP process or via a `project-name` argument on each tool call. Without it, list and describe calls can be ambiguous or fail.
The diagnostic skills go a step further and treat `project_name` as a required, explicit input: they will not silently rely on `RAFAY_PROJECT`, so that lookups always hit the intended project.
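The two resolution modes described above can be sketched in a few lines of Python. This is an illustration only: `resolve_project` and its `require_explicit` flag are hypothetical names, and the actual lookup happens inside the MCP server and the skill instructions rather than in user code.

```python
import os
from typing import Optional


def resolve_project(explicit: Optional[str] = None, *,
                    require_explicit: bool = False) -> str:
    """Return the Rafay project name to use for a tool call.

    explicit         -- project passed by the caller, if any
    require_explicit -- True models the diagnostic skills, which do not
                        fall back to the RAFAY_PROJECT environment variable
    """
    if explicit:
        return explicit
    if require_explicit:
        raise ValueError("project_name is required for diagnostic skills")
    from_env = os.environ.get("RAFAY_PROJECT")
    if from_env:
        return from_env
    raise ValueError("no project: set RAFAY_PROJECT or pass a project name")
```

Failing loudly in the strict mode mirrors the intent of the diagnostic skills: a lookup against the wrong project is worse than no lookup.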
The Skills Catalog
| Skill | Reach for it when… | Writes anything? |
|---|---|---|
| general-dashboard | You want an org- or project-scoped overview: cluster health, breakdowns by type/project/blueprint, and (org-wide) user counts. | No: strictly read-only |
| diagnose-cluster-health | A cluster looks unhealthy, nodes are down, or pods/events look wrong. | No: read + kubectl reads |
| diagnose-workload | A workload won't publish, publish is stuck, or sync status looks wrong. | No: read + kubectl reads |
| diagnose-blueprint-sync | Blueprint or addon sync to a cluster is wrong, versions drift, or a change won't apply. | No: read + optional kubectl |
Important
All four are read-only by design: they observe and explain, and they do not mutate cluster or platform state.
1. general-dashboard skill
This skill produces a read-only Rafay overview rendered as consecutive Markdown tables. It is the right tool for "summarize my environment" rather than "fix this one thing."
Inputs: optional project name. Omit it for an org-wide view; supply it to scope to a single project.
How it works:
- Uses `rafay_list` (with `rafay_describe` to confirm field shapes first) and deliberately avoids `rafay_get` and `kubectl` unless you ask.
- Lists clusters with `summary: true`, paginating with `limit`/`offset` until counts reconcile, and records page completeness honestly in a Context table rather than guessing totals.
- Org-wide scope additionally tallies local users and SSO users.
- Project scope additionally tallies workloads, namespaces, and the blueprint catalog (conditions, published state, versions, and per-blueprint cluster counts).
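The "paginate until counts reconcile" step can be sketched as a small Python loop. This is a sketch under stated assumptions, not the skill's actual logic: `fetch_page` stands in for a `rafay_list` call, and the page shape (`items`, `total`) is hypothetical, since the real response fields depend on the MCP server version.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical page shape: {"items": [...], "total": <int or None>}.
Page = Dict[str, Any]


def list_all(fetch_page: Callable[[int, int], Page], limit: int = 50,
             max_pages: int = 100) -> Tuple[List[Any], bool]:
    """Collect items from a limit/offset API and report completeness.

    Returns (items, complete). `complete` is True only if paging reached
    the end and the item count matches any total the server reported.
    """
    items: List[Any] = []
    total = None
    reached_end = False
    for _ in range(max_pages):
        page = fetch_page(limit, len(items))  # offset = items collected so far
        batch = page.get("items", [])
        if page.get("total") is not None:
            total = page["total"]
        items.extend(batch)
        if len(batch) < limit or (total is not None and len(items) >= total):
            reached_end = True
            break
    complete = reached_end and (total is None or len(items) == total)
    return items, complete


def fake_rafay_list(limit: int, offset: int) -> Page:
    """Stand-in backend with 120 clusters, for demonstration only."""
    data = [f"cluster-{i}" for i in range(120)]
    return {"items": data[offset:offset + limit], "total": len(data)}


clusters, complete = list_all(fake_rafay_list, limit=50)
print(len(clusters), complete)
```

Returning the `complete` flag alongside the items is what lets a caller record completeness in something like the Context table instead of presenting a partial count as a total.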
Notable guardrails:
- Health is reported only from the `health` field, never from cluster `status`, and is collapsed to a binary Healthy vs Unhealthy (no "Unknown" column).
- The user-visible deliverable is tables only, in a fixed order, with no extra prose, headings, or footnotes.
- Usernames and emails are omitted unless explicitly requested.
- For "is cluster X healthy?", the skill hands off to diagnose-cluster-health rather than expanding inline.
2. diagnose-cluster-health skill
This skill decides overall cluster health, flags nodes that aren't Ready, surfaces pods in bad states, and calls out events that signal real problems.
Inputs (both required): cluster_name, project_name.
How it works:
- Control plane first: `rafay_get` on the cluster is the source of truth for Rafay-reported status and health. The skill parses the full `status` block using the API's verbatim field names.
- Connectivity gate: The first `kubectl` call is always `kubectl version`. This confirms the cluster's Kubernetes API is reachable through Rafay before any heavier queries.
- Data plane: Only if the gate succeeds does it check nodes (`get nodes`, then `describe` for any NotReady), pods in non-Running/non-Succeeded states, and Warning events sorted by time.
- Synthesis: A short verdict (healthy / degraded / unhealthy) that leads with the Rafay view, then nodes and critical pods, and explicitly flags mismatches (e.g. Rafay says healthy but many nodes are NotReady, or `kubectl version` fails).
Notable behaviors:
- Output-volume discipline: It follows a narrow → widen pattern, avoids unbounded dumps like `get pods -A -o wide`, caps `describe` to a handful of already-flagged objects, and tightens queries when output is truncated or times out.
- Graceful failure: If the connectivity gate fails, it reports a connectivity issue, summarizes only the control-plane findings, and clearly labels the data-plane assessment as not performed rather than pretending to have checked.
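The connectivity gate, the capped `describe` calls, and the graceful-failure path can be illustrated with a short Python sketch. It covers only the node check, not pods or events. Everything named here is hypothetical: `run_kubectl` stands in for a `rafay_execute` call, `MAX_DESCRIBE` is an arbitrary cap, and the `"healthy"` string is an assumed value rather than the API's actual health field.

```python
from typing import Any, Callable, Dict, List, Tuple

# run_kubectl(args) -> (ok, output); a stand-in for a rafay_execute call.
KubectlRunner = Callable[[List[str]], Tuple[bool, str]]

MAX_DESCRIBE = 3  # assumed cap on describe calls (volume discipline)


def diagnose(rafay_health: str, run_kubectl: KubectlRunner) -> Dict[str, Any]:
    report: Dict[str, Any] = {"rafay_health": rafay_health,
                              "data_plane": "not performed",
                              "not_ready_nodes": [], "findings": []}

    # Connectivity gate: nothing heavier runs until this succeeds.
    ok, _ = run_kubectl(["version"])
    if not ok:
        report["findings"].append("connectivity issue: kubectl version failed")
        return report

    ok, out = run_kubectl(["get", "nodes", "--no-headers"])
    if not ok:
        report["data_plane"] = "incomplete"
        report["findings"].append("get nodes failed after the gate passed")
        return report
    report["data_plane"] = "performed"

    # Columns: NAME STATUS ROLES AGE VERSION; STATUS may be comma-separated.
    for line in out.splitlines():
        cols = line.split()
        if len(cols) >= 2 and "Ready" not in cols[1].split(","):
            report["not_ready_nodes"].append(cols[0])

    # Describe only a handful of already-flagged nodes.
    for name in report["not_ready_nodes"][:MAX_DESCRIBE]:
        run_kubectl(["describe", "node", name])

    if report["not_ready_nodes"] and rafay_health.lower() == "healthy":
        report["findings"].append(
            "mismatch: Rafay reports healthy but nodes are NotReady")
    return report


def fake_kubectl(args: List[str]) -> Tuple[bool, str]:
    """Canned responses for demonstration only."""
    if args == ["version"]:
        return True, "Client Version: v1.30.0"
    if args[:2] == ["get", "nodes"]:
        return True, ("node-a Ready control-plane 10d v1.30.0\n"
                      "node-b NotReady <none> 10d v1.30.0")
    return True, ""


report = diagnose("healthy", fake_kubectl)
print(report["not_ready_nodes"], report["findings"])
```

When the gate fails, the report keeps `data_plane` at `"not performed"`, so the summary cannot imply that nodes or pods were checked.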
3. diagnose-workload skill
Diagnoses workload publish, sync, and deployment failures.
Inputs: workload_name (required); project_name (required unless RAFAY_PROJECT already matches the workload's project).
How it works:
- Fetch the workload: `rafay_get` with `resource_type=workload`; parse status, conditions, publish/sync fields, errors, and last-transition times.
- Infer the target cluster from the workload response rather than assuming the user knows it, then `rafay_get` that cluster to compare its readiness/connectivity against the workload state.
- Drop to `kubectl` when pods, deployments, or events matter, starting narrow (`get pods -n <ns>`, `describe deployment`, `get events --sort-by`) and widening only if inconclusive.
- Synthesize: Reconcile what the Rafay API says about publish/sync, whether the cluster record agrees, and what `kubectl` shows; call out mismatches (e.g. "API says published but pods are failing") and the next concrete check.
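The synthesis step compares three views of the same workload. As a rough sketch, assuming each view has already been reduced to a single value, the reconciliation might look like the Python below. The `WorkloadViews` fields and the message wording are hypothetical; the real skill works from the verbatim API payloads and `kubectl` output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkloadViews:
    api_published: bool      # what the workload record claims
    cluster_reachable: bool  # what the cluster record claims
    failing_pods: int        # pods kubectl showed as not Running/Succeeded


def reconcile(v: WorkloadViews) -> List[str]:
    """Return mismatch notes, each ending with a next concrete check."""
    notes: List[str] = []
    if not v.cluster_reachable:
        notes.append("cluster record shows it unreachable; "
                     "treat kubectl evidence as unavailable")
    if v.api_published and v.failing_pods > 0:
        notes.append("API says published but pods are failing; "
                     "next: describe the failing pods")
    if not v.api_published:
        notes.append("workload is not published; "
                     "next: read its conditions and last-transition times")
    return notes or ["no mismatch: API, cluster record, and kubectl agree"]


for note in reconcile(WorkloadViews(api_published=True,
                                    cluster_reachable=True,
                                    failing_pods=2)):
    print(note)
```

Keeping each note paired with a follow-up check reflects the skill's aim of ending on an actionable next step rather than a bare status.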
4. diagnose-blueprint-sync skill
Diagnoses blueprint and addon sync drift on a cluster. A blueprint is an addon stack (its dependencies form a graph) that Rafay syncs to Kubernetes.
Inputs (both required): cluster_name, project_name. Notably, the user supplies the cluster, not the blueprint: the blueprint name and version are read off the cluster.
How it works:
- Cluster: `rafay_get` the cluster, then read the attached blueprint name and version from its payload (using the returned field names verbatim).
- Blueprint + versions: `rafay_get` the blueprint and `rafay_list` its `blueprint_version` entries; relate the cluster's pinned version to the catalog.
- Addons on the cluster: `rafay_list` `cluster_addon` for the cluster. This list is the primary place to see which addons failed or are stuck, using status/conditions/messages exactly as the API returns them.
- Optional `kubectl`: Only when the `cluster_addon` data isn't enough to explain the drift.
- Summarize: What the cluster reports for blueprint binding vs. what the catalog shows vs. the actual addon sync state, plus drift, blockers, and next steps.
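The comparison behind these steps can be sketched in Python once the three payloads are in hand. All field names below (`blueprint`, `blueprint_version`, `status`, `message`), the `"ready"` status value, and the oldest-to-newest ordering of the version list are assumptions for illustration; the skill itself reads whatever field names the API returns.

```python
from typing import Any, Dict, List


def blueprint_drift(cluster: Dict[str, Any], catalog_versions: List[str],
                    addons: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Relate the cluster's pinned blueprint version to the catalog and
    list addons that are not in the assumed 'ready' state."""
    pinned = cluster.get("blueprint_version")
    # Assumption: catalog_versions is ordered oldest to newest.
    latest = catalog_versions[-1] if catalog_versions else None
    stuck = [a for a in addons if a.get("status") != "ready"]
    return {
        "blueprint": cluster.get("blueprint"),
        "pinned_version": pinned,
        "pinned_in_catalog": pinned in catalog_versions,
        "behind_latest": latest is not None and pinned != latest,
        "stuck_addons": [a["name"] for a in stuck],
        # kubectl is only worth running if a stuck addon has no message.
        "needs_kubectl": any(not a.get("message") for a in stuck),
    }


# Hypothetical payloads, reduced to the fields the sketch uses.
cluster = {"blueprint": "base-stack", "blueprint_version": "v2"}
versions = ["v1", "v2", "v3"]
addons = [
    {"name": "ingress", "status": "ready"},
    {"name": "monitoring", "status": "failed", "message": "image pull error"},
]

result = blueprint_drift(cluster, versions, addons)
print(result["stuck_addons"], result["behind_latest"], result["needs_kubectl"])
```

The `needs_kubectl` flag models the "optional kubectl" step: it stays false while the addon list already carries an explanatory message.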
Choosing the Right Skill
- Start broad with general-dashboard when you don't yet know where the problem is; it's the read-only "what's the state of everything" view.
- Narrow to a single cluster's health with diagnose-cluster-health.
- Use diagnose-workload when a specific application won't deploy or publish.
- Use diagnose-blueprint-sync when the platform/addon layer (not the app) is the suspect: version drift, stuck addons, a blueprint change that won't apply.
Info
The diagnostic skills overlap intentionally: cluster-health can point at addon drift and will defer to blueprint-sync for the full narrative, and workload diagnosis pulls in the cluster record to separate app problems from infrastructure problems.
Installation and Repository Layout
The canonical source for every skill is:
skills/<skill-name>/SKILL.md
Claude Code loads project skills from .claude/skills/. There is no single symlink for the whole folder; each skill is its own symlinked directory, for example:
.claude/skills/diagnose-workload → ../../skills/diagnose-workload
.claude/skills/diagnose-blueprint-sync → ../../skills/diagnose-blueprint-sync
Other assistants can reference skills/ directly or follow their host's documented skills path.
Editing
Edit only under skills/. The .claude/skills entries are symlinks, so updating the linked SKILL.md updates Claude Code's view automatically. Don't duplicate skill bodies into .claude/ unless you are intentionally replacing a symlink with a copy.
Windows clones
If symlinks aren't created on clone, either copy each folder from skills/ into .claude/skills/, or enable `git config core.symlinks true` where supported.
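The per-skill layout, including the copy fallback for platforms without symlink support, can be reproduced with a short Python script. This is a sketch, not a tool shipped with the repository: `link_skills` is a hypothetical helper, and it assumes a skill is any directory under skills/ that contains a SKILL.md.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def link_skills(repo_root: Path) -> Dict[str, str]:
    """Expose each skills/<name>/ directory under .claude/skills/<name>.

    Tries a relative symlink first; if the platform refuses (for example
    a Windows clone without symlink support), falls back to copying.
    Returns {skill name: "symlink" | "copy" | "exists"}.
    """
    source_dir = repo_root / "skills"
    target_dir = repo_root / ".claude" / "skills"
    target_dir.mkdir(parents=True, exist_ok=True)
    results: Dict[str, str] = {}
    for skill in sorted(p for p in source_dir.iterdir()
                        if (p / "SKILL.md").is_file()):
        target = target_dir / skill.name
        if target.exists() or target.is_symlink():
            results[skill.name] = "exists"  # never overwrite an existing entry
            continue
        try:
            # Relative target, matching ../../skills/<name> in the layout above.
            os.symlink(os.path.join("..", "..", "skills", skill.name),
                       target, target_is_directory=True)
            results[skill.name] = "symlink"
        except (OSError, NotImplementedError):
            shutil.copytree(skill, target)
            results[skill.name] = "copy"
    return results


# Demonstration in a throwaway directory rather than a real clone.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    skill_dir = root / "skills" / "diagnose-workload"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text("---\nname: diagnose-workload\n---\n")
    outcome = link_skills(root)
    linked_ok = (root / ".claude" / "skills" / "diagnose-workload"
                 / "SKILL.md").is_file()
    print(outcome, linked_ok)
```

Where the fallback produces copies, edits under skills/ no longer propagate automatically, so the copies need refreshing after each change.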
Design Principles
A few conventions run through all four skills and are useful to keep in mind when extending them:
- Read-only by default. Skills observe and explain; none of them mutate platform or cluster state. Guardrails explicitly forbid implying that `rafay_get` or `kubectl` ran when they didn't.
- Control plane before data plane. The Rafay API payload is the source of truth and is parsed first; `kubectl` is correlation, gated behind a connectivity check.
- Honest about gaps. Pagination completeness, ambiguous values, and failed lookups are recorded (e.g. in the dashboard's Context table) rather than papered over with invented totals.
- Verbatim API fields. Because the MCP surface varies by server version, skills use the field names the API actually returns instead of assuming a fixed schema.
- Volume discipline. Diagnostics start narrow and widen only when needed, so they stay usable on busy clusters and within tool output limits.