
Skill Creator

Create new skills from scratch, improve existing skills, run evals to measure performance, or optimize skill descriptions for triggering accuracy. Does not cover skill installation, plugin packaging, or skill marketplace publishing. Produces a tested SKILL.md with eval results, benchmarks, and an optimized description.


A skill for creating new skills and iteratively improving them.

At a high level, the process of creating a skill goes like this:

  • Decide what you want the skill to do and roughly how it should do it
  • Write a draft of the skill
  • Create a few test prompts and run claude-with-access-to-the-skill on them
  • Help the user evaluate the results both qualitatively and quantitatively
    • While the runs happen in the background, draft quantitative evals if none exist (if some already do, use them as-is or modify them where needed). Either way, explain the evals to the user
    • Use the eval-viewer/generate_review.py script to show the user the qualitative results alongside the quantitative metrics
  • Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
  • Repeat until you're satisfied
  • Expand the test set and try again at larger scale

Your job when using this skill is to figure out where the user is in this process, then jump in and help them progress through these stages. Stay flexible, though: if the user says "I don't need to run a bunch of evaluations, just vibe with me", do that instead. Once the skill is done, run the skill description improver to optimize triggering.

Communicating with the user

The skill creator is likely to be used by people with a wide range of familiarity with coding jargon. Pay attention to context cues to understand how to phrase your communication. It's OK to briefly explain terms if you're in doubt.


Creating a skill

Capture Intent

Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. Either way, make sure you can answer:

  1. What should this skill enable Claude to do?
  2. When should this skill trigger? (what user phrases/contexts)
  3. What's the expected output format?
  4. Should we set up test cases to verify the skill works?

Interview and Research

Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out. Check available MCPs for research if useful.

Write the SKILL.md

Based on the user interview, fill in these components:

  • name: Skill identifier
  • description: When to trigger, what it does. This is the primary triggering mechanism — include both what the skill does AND specific contexts for when to use it. Make descriptions a little bit "pushy" to combat undertriggering.
  • compatibility: Required tools, dependencies (optional, rarely needed)
  • the rest of the skill :)
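
As a sketch, the frontmatter might look like this (the skill name and description here are invented for illustration; only name and description are required):

```yaml
---
name: pdf-form-filler
description: >
  Fills out PDF forms from structured data. Use when the user asks to
  complete, populate, or fill in a PDF form, or mentions form fields
  in a PDF.
---
```

Note how the description covers both what the skill does and the contexts that should trigger it.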

Skill Writing Guide

Anatomy of a Skill

skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/    - Executable code for deterministic/repetitive tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/     - Files used in output (templates, icons, fonts)

Progressive Disclosure

Skills use a three-level loading system:

  1. Metadata (name + description) - Always in context (~100 words)
  2. SKILL.md body - In context whenever skill triggers (200 target / 300 ceiling lines)
  3. Bundled resources - As needed (unlimited, scripts can execute without loading)

Key patterns:

  • Keep SKILL.md at 200 lines target, 300 lines ceiling; if approaching the limit, extract detailed content to references/ files
  • Reference those files clearly from SKILL.md, with guidance on when to read them
  • For large reference files (>300 lines), include a table of contents
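
The length check above is easy to script. A minimal sketch (the 200/300 thresholds come from the guidance above; the function name is ours):

```python
TARGET_LINES = 200   # aim for this
CEILING_LINES = 300  # hard limit: extract detail to references/ beyond this

def check_skill_length(skill_md_text: str) -> str:
    """Classify a SKILL.md body against the target/ceiling guidance."""
    n = len(skill_md_text.splitlines())
    if n > CEILING_LINES:
        return f"{n} lines: over ceiling, extract detail to references/"
    if n > TARGET_LINES:
        return f"{n} lines: over target, consider trimming"
    return f"{n} lines: ok"

print(check_skill_length("line\n" * 250))  # → 250 lines: over target, consider trimming
```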

Domain organization: When a skill supports multiple domains/frameworks, organize by variant:

cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md

Claude reads only the relevant reference file.

Writing Patterns

Prefer using the imperative form in instructions.

Pre-flight gate pattern — Use when a step must not proceed until a prerequisite condition is confirmed:

**STOP. Do not [action] until [prerequisite] is complete. The only exception is [escape hatch if any].**

Writing Style

Try to explain to the model why things are important in lieu of heavy-handed MUSTs. Use theory of mind, and keep the skill general rather than narrowly tailored to specific examples. Start by writing a draft, then look at it with fresh eyes and improve it.

Test Cases

After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user. Then run them.

Save test cases to evals/evals.json. Don't write assertions yet — just the prompts. You'll draft assertions in the next step while the runs are in progress.

{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}

See references/schemas.md for the full schema (including the assertions field, which you'll add later).
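A minimal sketch of sanity-checking an evals.json before running (the field names follow the example above; the function and its error messages are our own):

```python
import json

REQUIRED_FIELDS = ("id", "prompt", "expected_output", "files")

def validate_evals(raw: str) -> list[dict]:
    """Parse evals.json text and check the minimal required fields."""
    data = json.loads(raw)
    if "skill_name" not in data:
        raise ValueError("missing skill_name")
    evals = data.get("evals", [])
    for case in evals:
        missing = [f for f in REQUIRED_FIELDS if f not in case]
        if missing:
            raise ValueError(f"eval {case.get('id')} missing {missing}")
    return evals
```

Catching a malformed case here is cheaper than discovering it mid-run.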

Reference: Read references/eval-workflow.md for the complete eval running process: spawning runs, drafting assertions, capturing timing, grading, aggregating benchmarks, and launching the viewer.

Reference: Read references/improvement-guide.md for how to think about skill improvements, the iteration loop, and blind comparison.

Reference: Read references/description-optimization.md for the full description optimization workflow: generating trigger eval queries, reviewing with the user, running the optimization loop, and applying results.

Reference: Read references/environment-guides.md for Claude.ai-specific and Cowork-specific adaptations of this workflow.


Package and Present (only if present_files tool is available)

Check whether you have access to the present_files tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:

python -m scripts.package_skill <path/to/skill-folder>

After packaging, direct the user to the resulting .skill file path so they can install it.


Reference files

The agents/ directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.

  • agents/grader.md — How to evaluate assertions against outputs
  • agents/comparator.md — How to do blind A/B comparison between two outputs
  • agents/analyzer.md — How to analyze why one version beat another

The references/ directory has additional documentation:

  • references/schemas.md — JSON structures for evals.json, grading.json, etc.

Repeating one more time the core loop here for emphasis:

  • Figure out what the skill is about
  • Draft or edit the skill
  • Run claude-with-access-to-the-skill on test prompts
  • With the user, evaluate the outputs:
    • Create benchmark.json and run eval-viewer/generate_review.py to help the user review them
    • Run quantitative evals
  • Repeat until you and the user are satisfied
  • Package the final skill and return it to the user.

Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run eval-viewer/generate_review.py so human can review test cases" in your TodoList to make sure it happens.

Good luck!
