Skills

Rafay Skills are a set of Agent Skills that let an AI assistant operate the Rafay Platform from natural language. Instead of remembering which API to call in what order, you describe the problem ("is this cluster healthy?", "the workload won't publish", "give me an org overview") and the assistant follows a pre-authored, opinionated workflow against the Rafay MCP server.

Info

The Rafay Skills are published at RafaySystems/rafay-skills and currently cover one reporting skill and three diagnostic skills.


What is a skill?

A skill is a single SKILL.md file containing YAML front-matter (name, description, argument hints) and a Markdown body that encodes the procedure the assistant should follow. The assistant reads the front-matter to decide when a skill applies, then executes the body's steps using whatever Rafay MCP tools are available.
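
As a rough illustration of that layout, the sketch below splits a SKILL.md-style document into its front-matter and body. It handles only flat `key: value` pairs, and the sample values (including the `argument-hint` key spelling and the description text) are assumptions made for the example, not copied from the repository.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md-style document into (front_matter, body).

    Only flat `key: value` pairs are parsed; a real loader would use a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the Markdown body.
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole file as body.
    return {}, text


# Hypothetical skill file, for illustration only.
EXAMPLE = """---
name: diagnose-workload
description: Diagnose workload publish and sync failures
argument-hint: workload_name project_name
---
# Steps
1. Fetch the workload.
"""

meta, body = split_front_matter(EXAMPLE)
```

Here `meta` is what an assistant would consult to decide whether the skill applies, and `body` is the procedure it would then follow.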

Every skill in this repo is built on the same four MCP tools:

| Tool | Role |
| --- | --- |
| rafay_describe | Confirm supported resource types, field names, and pagination defaults before listing. |
| rafay_list | Enumerate resources (clusters, workloads, namespaces, blueprints, users, addons). |
| rafay_get | Fetch a single resource's full payload (cluster, workload, blueprint). |
| rafay_execute | Run an action such as kubectl against a cluster through Rafay's Zero Trust Kubectl channel. |

The exact tool surface depends on your Rafay MCP server version, so the skills are written to read schemas from the host's tool descriptors rather than hard-coding field names.


Prerequisites

Before you run any skill

  1. Rafay MCP connected: Configure the Rafay MCP server in your assistant (for example, in Cursor's MCP settings or Claude's connector settings) and sign in if prompted.
  2. Project context: The MCP needs to resolve a Rafay project, either via the RAFAY_PROJECT environment variable on the MCP process or via a project-name argument on each tool call. Without it, list and describe calls can be ambiguous or fail.

The diagnostic skills go a step further and treat project_name as a required, explicit input: they will not silently rely on RAFAY_PROJECT, so lookups always hit the intended project.
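
The difference between the two resolution modes can be sketched as follows. This is a minimal illustration, not the MCP server's actual logic: `resolve_project` and its `require_explicit` flag are hypothetical names, and only the RAFAY_PROJECT variable comes from the text above.

```python
import os
from typing import Mapping, Optional


def resolve_project(explicit: Optional[str], *, require_explicit: bool = False,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the project for a tool call (hypothetical helper)."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    if require_explicit:
        # Stricter mode, as described for the diagnostic skills: no silent fallback.
        raise ValueError("project_name must be passed explicitly")
    fallback = env.get("RAFAY_PROJECT")
    if not fallback:
        raise ValueError("no project: set RAFAY_PROJECT or pass a project name")
    return fallback
```

With `require_explicit=True`, a missing project name fails loudly even when RAFAY_PROJECT is set, which is the behavior that keeps a diagnostic lookup from landing in the wrong project.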


The Skills Catalog

| Skill | Reach for it when… | Writes anything? |
| --- | --- | --- |
| general-dashboard | You want an org- or project-scoped overview: cluster health, breakdowns by type/project/blueprint, and (org-wide) user counts. | No (strictly read-only) |
| diagnose-cluster-health | A cluster looks unhealthy, nodes are down, or pods/events look wrong. | No (read + kubectl reads) |
| diagnose-workload | A workload won't publish, publish is stuck, or sync status looks wrong. | No (read + kubectl reads) |
| diagnose-blueprint-sync | Blueprint or addon sync to a cluster is wrong, versions drift, or a change won't apply. | No (read + optional kubectl) |

Important

All four are read-only by design: they observe and explain, they do not mutate cluster or platform state.


1. general-dashboard skill

This skill produces a read-only Rafay overview rendered as consecutive Markdown tables. It is the right tool for "summarize my environment" rather than "fix this one thing."

Inputs: optional project name. Omit it for an org-wide view; supply it to scope to a single project.

How it works:

  • Uses rafay_list (with rafay_describe to confirm field shapes first) and deliberately avoids rafay_get and kubectl unless you ask.
  • Lists clusters with summary: true, paginating with limit/offset until counts reconcile, and records page completeness honestly in a Context table rather than guessing totals.
  • Org-wide scope additionally tallies local users and SSO users.
  • Project scope additionally tallies workloads, namespaces, and the blueprint catalog (conditions, published state, versions, and per-blueprint cluster counts).
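
The pagination behavior in the list above can be sketched roughly as below. The `fetch_page` callable stands in for a rafay_list call, and the `items` and `count` keys are assumed response fields chosen for the example; the real field names should be read from the tool descriptor.

```python
from typing import Callable


def list_all(fetch_page: Callable[[int, int], dict], limit: int = 50,
             max_pages: int = 100) -> dict:
    """Page with limit/offset and report whether the totals reconcile."""
    items: list = []
    offset = 0
    reported_total = None
    for _ in range(max_pages):
        page = fetch_page(limit, offset)
        batch = page.get("items", [])
        reported_total = page.get("count", reported_total)
        items.extend(batch)
        offset += len(batch)
        # Stop on an empty page or once the reported total has been collected.
        if not batch or (reported_total is not None and len(items) >= reported_total):
            break
    # Completeness is recorded, not assumed: it holds only if the counts match.
    complete = reported_total is not None and len(items) == reported_total
    return {"items": items, "reported_total": reported_total, "complete": complete}


def fake_fetch(limit: int, offset: int) -> dict:
    """Stand-in for a list call over seven hypothetical clusters."""
    data = [f"cluster-{i}" for i in range(7)]
    return {"items": data[offset:offset + limit], "count": len(data)}


result = list_all(fake_fetch, limit=3)
```

When `complete` is false, a dashboard in this style would say so in its Context table instead of presenting the partial count as a total.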

Notable guardrails:

  • Health is reported only from the health field, never from cluster status, and is collapsed to a binary Healthy vs Unhealthy (no "Unknown" column).
  • The user-visible deliverable is tables only, in a fixed order, with no extra prose, headings, or footnotes.
  • Usernames and emails are omitted unless explicitly requested.
  • For "is cluster X healthy?", the skill hands off to diagnose-cluster-health rather than expanding inline.
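
A minimal sketch of the binary health guardrail, under stated assumptions: the value "HEALTHY" and the choice to fold missing or unrecognized values into Unhealthy are illustrative guesses, not documented API behavior.

```python
def tally_health(clusters: list[dict], healthy_values: tuple[str, ...] = ("HEALTHY",)) -> dict:
    """Count clusters as Healthy or Unhealthy using only the health field."""
    counts = {"Healthy": 0, "Unhealthy": 0}
    for cluster in clusters:
        # The status field is deliberately ignored, per the guardrail above.
        value = str(cluster.get("health", "")).upper()
        bucket = "Healthy" if value in healthy_values else "Unhealthy"
        counts[bucket] += 1
    return counts


# Hypothetical records: note that status never influences the tally.
sample = [
    {"name": "a", "health": "HEALTHY", "status": "NOT_READY"},
    {"name": "b", "health": "UNHEALTHY", "status": "READY"},
    {"name": "c", "status": "READY"},
]
summary = tally_health(sample)
```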

2. diagnose-cluster-health skill

This skill decides overall cluster health, flags nodes that aren't Ready, surfaces pods in bad states, and calls out events that signal real problems.

Inputs (both required): cluster_name, project_name.

How it works:

  1. Control plane first: rafay_get on the cluster is the source of truth for Rafay-reported status and health. The skill parses the full status block using the API's verbatim field names.
  2. Connectivity gate: The first kubectl call is always kubectl version. This confirms the cluster's Kubernetes API is reachable through Rafay before any heavier queries.
  3. Data plane: Only if the gate succeeds does it check nodes (get nodes, then describe for any NotReady), pods in non-Running/non-Succeeded states, and Warning events sorted by time.
  4. Synthesis: A short verdict (healthy / degraded / unhealthy) that leads with the Rafay view, then nodes and critical pods, and explicitly flags mismatches (e.g. Rafay says healthy but many nodes are NotReady, or kubectl version fails).
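
The ordering of the steps above, and in particular the connectivity gate, might be sketched like this. Both callables are hypothetical stand-ins: `get_cluster` for a rafay_get call and `run_kubectl` for a kubectl invocation routed through rafay_execute.

```python
from typing import Callable


def diagnose_cluster(get_cluster: Callable[[], dict],
                     run_kubectl: Callable[[str], str]) -> dict:
    """Control plane first, then a connectivity gate, then data-plane reads."""
    report = {
        "control_plane": get_cluster(),
        "data_plane": "not performed",
        "notes": [],
    }
    try:
        # Gate: the cheapest possible call proves the API server is reachable.
        run_kubectl("version")
    except Exception as exc:
        report["notes"].append(f"connectivity issue: {exc}")
        return report
    report["data_plane"] = {
        "nodes": run_kubectl("get nodes"),
        "warnings": run_kubectl(
            "get events -A --field-selector type=Warning --sort-by=.lastTimestamp"
        ),
    }
    return report
```

If the gate raises, the report still carries the control-plane payload, and the data-plane section stays explicitly labeled as not performed.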

Notable behaviors:

  • Output-volume discipline: It follows a narrow → widen pattern, avoids unbounded dumps like get pods -A -o wide, caps describe to a handful of already-flagged objects, and tightens queries when output is truncated or times out.
  • Graceful failure: If the connectivity gate fails, it reports a connectivity issue, summarizes only the control-plane findings, and clearly labels the data-plane assessment as not performed rather than pretending to have checked.
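
One way to picture the narrow → widen pattern and the describe cap is the sketch below. The helper names, the default cap of three, and the idea of a caller-supplied `is_conclusive` check are all assumptions made for illustration.

```python
from typing import Callable, Optional


def narrow_then_widen(queries: list[str], run: Callable[[str], str],
                      is_conclusive: Callable[[str], bool]) -> dict:
    """Run queries from narrowest to widest, stopping at the first useful answer."""
    tried: list[str] = []
    answer: Optional[str] = None
    for query in queries:
        tried.append(query)
        output = run(query)
        if is_conclusive(output):
            answer = output
            break
    return {"answer": answer, "tried": tried}


def cap_describes(flagged: list[str], kind: str = "node", cap: int = 3) -> list[str]:
    """Build describe commands for only a handful of already-flagged objects."""
    return [f"describe {kind} {name}" for name in flagged[:cap]]
```

The wider queries are never issued once a narrow one explains the problem, which is what keeps output small on busy clusters.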

3. diagnose-workload skill

Diagnoses workload publish, sync, and deployment failures.

Inputs: workload_name (required); project_name (required unless RAFAY_PROJECT already matches the workload's project).

How it works:

  1. Fetch the workload: rafay_get with resource_type=workload; parse status, conditions, publish/sync fields, errors, and last-transition times.
  2. Infer the target cluster from the workload response rather than assuming the user knows it, then rafay_get that cluster to compare its readiness/connectivity against the workload state.
  3. Drop to kubectl when pods, deployments, or events matter, starting narrow (get pods -n <ns>, describe deployment, get events --sort-by) and widening only if inconclusive.
  4. Synthesize: Reconcile what the Rafay API says about publish/sync, whether the cluster record agrees, and what kubectl shows; call out mismatches (e.g. "API says published but pods are failing") and the next concrete check.
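
The synthesis step above amounts to cross-checking three views of the same workload. A minimal sketch, assuming the three inputs have already been extracted from the workload payload, the cluster record, and kubectl output (the function name and the boolean simplification are hypothetical):

```python
def find_mismatches(workload_published: bool, cluster_ready: bool,
                    failing_pods: list[str]) -> list[str]:
    """Flag disagreements between the API view, the cluster record, and kubectl."""
    findings: list[str] = []
    if workload_published and failing_pods:
        findings.append(
            "API says published but pods are failing: " + ", ".join(failing_pods)
        )
    if workload_published and not cluster_ready:
        findings.append("workload is published but the cluster record is not ready")
    if not workload_published and cluster_ready and not failing_pods:
        findings.append(
            "cluster looks fine; inspect the workload's publish conditions next"
        )
    return findings
```

An empty result means the three views agree; each finding names the disagreement and points at where to look next.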

4. diagnose-blueprint-sync skill

Diagnoses blueprint and addon sync drift on a cluster. A blueprint is an addon stack (its dependencies form a graph) that Rafay syncs to Kubernetes.

Inputs (both required): cluster_name, project_name. Notably, the user supplies the cluster, not the blueprint; the blueprint name and version are read off the cluster.

How it works:

  1. Cluster: rafay_get the cluster, then read the attached blueprint name and version from its payload (using the returned field names verbatim).
  2. Blueprint + versions: rafay_get the blueprint and rafay_list its blueprint_version entries; relate the cluster's pinned version to the catalog.
  3. Addons on the cluster: rafay_list cluster_addon for the cluster. This list is the primary place to see which addons failed or are stuck, using status/conditions/messages exactly as the API returns them.
  4. Optional kubectl: Only when the cluster_addon data isn't enough to explain the drift.
  5. Summarize: What the cluster reports for blueprint binding vs. what the catalog shows vs. the actual addon sync state, plus drift, blockers, and next steps.
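
The comparison in the steps above can be sketched as a small pure function. Everything about the data shapes here is assumed for illustration: the catalog list is taken to be ordered oldest to newest, addons are dicts with `name` and `status` keys, and "SUCCESS" is a placeholder for whatever status value the API actually returns for a synced addon.

```python
def summarize_drift(pinned_version: str, catalog_versions: list[str],
                    addons: list[dict], ok_statuses: tuple[str, ...] = ("SUCCESS",)) -> dict:
    """Relate a cluster's pinned blueprint version to the catalog and its addons."""
    in_catalog = pinned_version in catalog_versions
    newer = (
        catalog_versions[catalog_versions.index(pinned_version) + 1:]
        if in_catalog else []
    )
    # Any addon whose status is not in the accepted set is treated as a blocker.
    stuck = [a["name"] for a in addons if a.get("status") not in ok_statuses]
    return {
        "pinned_in_catalog": in_catalog,
        "newer_versions": newer,
        "stuck_addons": stuck,
    }


# Hypothetical inputs, as if read from the cluster, catalog, and addon list.
drift = summarize_drift(
    "v2",
    ["v1", "v2", "v3"],
    [{"name": "ingress", "status": "SUCCESS"}, {"name": "monitoring", "status": "FAILED"}],
)
```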

Choosing the Right Skill

  • Start broad with general-dashboard when you don't yet know where the problem is; it's the read-only "what's the state of everything" view.
  • Narrow to a single cluster's health with diagnose-cluster-health.
  • Use diagnose-workload when a specific application won't deploy or publish.
  • Use diagnose-blueprint-sync when the platform/addon layer (not the app) is the suspect: version drift, stuck addons, a blueprint change that won't apply.

Info

The diagnostic skills overlap intentionally: cluster-health can point at addon drift and will defer to blueprint-sync for the full narrative, and workload diagnosis pulls in the cluster record to separate app problems from infrastructure problems.


Installation and Repository Layout

The canonical source for every skill is:

skills/<skill-name>/SKILL.md

Claude Code loads project skills from .claude/skills/. There is no single symlink for the whole folder; each skill is its own symlinked directory, for example:

.claude/skills/diagnose-workload       → ../../skills/diagnose-workload
.claude/skills/diagnose-blueprint-sync → ../../skills/diagnose-blueprint-sync

Other assistants can reference skills/ directly or follow their host's documented skills path.

Editing

Edit only under skills/. The .claude/skills entries are symlinks, so updating the linked SKILL.md updates Claude Code's view automatically. Don't duplicate skill bodies into .claude/ unless you are intentionally replacing a symlink with a copy.

Windows clones

If symlinks aren't created on clone, either copy each folder from skills/ into .claude/skills/, or enable git config core.symlinks true where supported.
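
For a clone where the links are missing, the layout described in this section could be recreated with a short script along these lines. This is a sketch under assumptions, not tooling shipped with the repository: `link_skills` is a hypothetical helper, and it falls back to copying when the operating system refuses to create a symlink.

```python
import shutil
from pathlib import Path


def link_skills(repo_root: Path) -> dict[str, str]:
    """Create one .claude/skills/<name> entry per skill; copy if symlinks fail."""
    source = repo_root / "skills"
    target = repo_root / ".claude" / "skills"
    target.mkdir(parents=True, exist_ok=True)
    outcome: dict[str, str] = {}
    for skill_dir in sorted(p for p in source.iterdir() if (p / "SKILL.md").is_file()):
        link = target / skill_dir.name
        if link.is_symlink() or link.exists():
            outcome[skill_dir.name] = "exists"
            continue
        try:
            # Relative target, matching the ../../skills/<name> form shown above.
            link.symlink_to(Path("..") / ".." / "skills" / skill_dir.name,
                            target_is_directory=True)
            outcome[skill_dir.name] = "symlink"
        except OSError:
            # Typical on Windows without symlink privileges: fall back to a copy.
            shutil.copytree(skill_dir, link)
            outcome[skill_dir.name] = "copy"
    return outcome
```

Calling `link_skills(Path("."))` from the repository root returns a per-skill outcome. Any entry reported as "copy" will not track later edits under skills/ and needs to be refreshed by hand.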


Design Principles

A few conventions run through all four skills and are useful to keep in mind when extending them:

  • Read-only by default. Skills observe and explain; none of them mutate platform or cluster state. Guardrails explicitly forbid implying that rafay_get or kubectl ran when they didn't.
  • Control plane before data plane. The Rafay API payload is the source of truth and is parsed first; kubectl is correlation, gated behind a connectivity check.
  • Honest about gaps. Pagination completeness, ambiguous values, and failed lookups are recorded (e.g. in the dashboard's Context table) rather than papered over with invented totals.
  • Verbatim API fields. Because the MCP surface varies by server version, skills use the field names the API actually returns instead of assuming a fixed schema.
  • Volume discipline. Diagnostics start narrow and widen only when needed, so they stay usable on busy clusters and within tool output limits.