A weak article about this repository says Red Hat has a ready evaluator for Claude Code. That is too thin.

The stronger article is about triage. Teams accumulate skills, slash commands, hooks, agents, and CLAUDE.md rules. Some are broken. Some repeat each other. Some are too broad. Some look weak but contain a method worth rescuing. The hard question is not 'is this file long enough?' The hard question is 'does this asset still change how the agent works?'

The Red Hat evaluator is worth covering because it gives that question a structure: deterministic checks first, rubric judgment second, and optional marginal-value testing when one skill is disputed. That is the same discipline a serious article pipeline needs. Do not confuse weak form with no value. Put weak form into REVIEW, then decide whether the idea deserves a rewrite.

Do not make deletion the first verdict

The useful mental model is KEEP, REVIEW, REMOVE.

KEEP means the asset is structurally sound, specific, non-duplicative, and useful in real work. REVIEW means the asset has a recoverable idea but fails in form: vague trigger, broken reference, overlap, token bloat, or weak examples. REMOVE means the asset is unsafe, stale, redundant, or unable to prove value after review.

That middle bucket matters. In a Claude Code skill library, many bad files started as good operational memory. Deleting them because the first draft is weak throws away learning. Keeping them without repair pollutes context. REVIEW gives the team a place to rescue the idea without pretending the current artifact is ready.

Setup evaluation and skill evaluation are different jobs

The repository separates two commands because the questions are different.

/evaluate-setup reviews the whole workspace: skills, commands, CLAUDE.md, hooks, and agents. It asks whether the system is coherent. Are commands duplicating skills? Is CLAUDE.md repeating what belongs in a skill? Do hooks contain risky shell patterns? Are agent constraints backed by tool restrictions?

/evaluate-skill narrows the lens to one skill. That matters when the team disagrees about a single item. The question becomes: does this skill add marginal value beyond the rest of the setup, or is it just extra context?

Layer 1 catches mechanical failure

Layer 1 is the part to trust most because it is deterministic. The scanner parses real files and applies rules. Same input, same result.

The default run looks like this inside the Claude command:

uv run --project "$PROJECT_DIR" evaluate-setup scan .

The scan covers missing SKILL.md, invalid YAML frontmatter, absent descriptions, low-quality descriptions, token budget, broken references, near-duplicate skill bodies, prompt injection phrases, credential references, command scripts that do not exist, CLAUDE.md duplication, hook structure, and agent constraints.

This is where a quality pipeline should be strict. A broken reference is not an editorial opinion. A command pointing at a missing script is not a matter of taste. A skill that tells the agent to read ~/.ssh or uses sudo deserves a hard stop.

The presets reveal the intended policy

The presets show how the project expects teams to use the tool.

recommended treats structural breakage and security risk as errors. Missing skill files, missing required descriptions, broken references, prompt injection, credential access, missing agent descriptions, and missing referenced skills fail the run.

security turns off most style and structure rules and keeps only injection and credential checks. That is a fast safety pass, not a publication gate.

strict promotes more warnings to errors: description quality, frontmatter format, token budget, missing CLAUDE.md, disallowedTools parsing, and agent constraint mismatch. Use strict before sharing a skill pack across a team. Use recommended for routine maintenance. Use security when a risky third-party skill or command needs a quick quarantine pass.

Description quality is not cosmetic

The description-quality rule is small but important. It checks for third-person phrasing, use-case language such as use when, a 1,024-character ceiling, and a 20-character minimum.

That is not prose policing. In Claude Code, the description is part of routing. A skill that says 'I can help with tests' is weak because it does not say when it should load. A better description says the skill applies when the user is diagnosing flaky tests, changing shared fixtures, or adding coverage around a failing path.

For an article pipeline, the equivalent is the reader promise. A title and lede have to tell a Codex or Claude Code practitioner what decision the article helps them make. If the promise is vague, the article may still have a useful source, but it belongs in REVIEW.

Token budget is a context policy

The token-budget rule uses a 20,000-token context budget, assumes five concurrent skills, caps a single skill at 4,000 tokens, and also flags SKILL.md files over 500 lines.

The specific numbers are less important than the policy. A skill does not get unlimited context just because it is correct. If it is large, the content has to move into references that load only when needed. If the main SKILL.md is long, it competes with other project memory.

That is the practical reason to evaluate article seeds by substance rather than length. Length can be necessary when it adds source-backed analysis, commands, caveats, and examples. Length is not a quality measure by itself. A long article that repeats the README is still weak.

Layer 2 judges whether context is earned

Layer 2 is the editorial layer. Claude reads the actual files and scores skills on specificity, redundancy, trigger quality, token efficiency, and content quality. The setup report also includes cross-type review: should a skill become a hook, should a command become a skill, is CLAUDE.md duplicating skill content, are broad triggers colliding, is mandate stacking reducing autonomy?

This is where REVIEW becomes productive. A duplicated skill is not always garbage. It may contain the better checklist and the worse trigger. A bloated CLAUDE.md section may belong in a skill. A command may be too procedural to load automatically. The evaluator makes those placement questions visible.

Layer 3 tests marginal value, not trigger accuracy

Layer 3 is the most interesting part and the easiest to overread. It requires GOOGLE_API_KEY, asks Gemini to generate tasks from real repositories, runs Claude with all skills except the tested one, runs Claude again with the tested skill included, and then judges the difference.

That tells you whether the skill text improves output on generated tasks. It does not prove Claude Code will load the skill at the right moment in normal use. Trigger quality still has to be read in Layer 2 and tested by real workflow observation.

So the right interpretation is narrow: Layer 3 can support KEEP when the skill changes output in the right direction. It can support REMOVE when the skill repeatedly adds nothing. A mixed result should send the skill to REVIEW, not automatic deletion.

Run it on a real workspace

A practical trial starts with the repo-local workspace model from the README:

git clone https://github.com/redhat-community-ai-tools/claude-code-setup-evaluator.git
cd claude-code-setup-evaluator
uv sync
uv run .ai-workspace/scripts/setup.py

cd repositories
git clone <your-repo-url>
cd ..
claude

Then run setup review first:

/evaluate-setup

Save the full report to a file if the setup has more than a handful of assets. The report format is designed for this: inventory, Layer 1 checklist, rubric, cross-type review, and numbered suggestions. The final summary is short enough for a maintainer to act on.

Use REVIEW as the repair queue

The most useful workflow is not 'run evaluator, delete failures.' It is:

KEEP: passes structure, earns context, has a clear trigger, and is not redundant.
REVIEW: useful idea, bad shape. Rewrite trigger, split references, merge duplicates, fix commands, or move content between CLAUDE.md, skill, hook, and command.
REMOVE: unsafe, stale, duplicate with no unique checklist, or repeatedly no marginal value.

For a magazine pipeline, the mapping is similar. Published means KEEP. Rewrite means REVIEW. Archive means REMOVE. A weak article with a high-value source should not disappear; it should be rebuilt around the decision a practitioner actually faces.

A repair runbook for one disputed skill

When a skill lands in REVIEW, use /evaluate-skill instead of arguing from taste.

/evaluate-skill

Then inspect three things:

1. Layer 1: are there hard failures such as broken references, prompt injection, credential access, or invalid frontmatter?
2. Layer 2: does the skill have a narrow trigger and content that Claude would not already know?
3. Layer 3: on real repo tasks, does the output improve when the skill is present?

If the answer is structural failure plus useful idea, repair it. If the answer is clean structure but no distinct behavior, merge or remove it. If Layer 3 is unavailable because no GOOGLE_API_KEY exists, do not invent certainty. Mark the result as REVIEW until a human has read the content and tried it in a real workflow.

Security checks are necessary but not enough

The security rules are practical. Prompt injection patterns include instruction override phrases, hidden instruction tags, prompt leak requests, base64 evasion, and markdown-image exfiltration. Credential rules flag paths like ~/.ssh/, ~/.aws/credentials, ~/.kube/config, and environment variables such as API keys, database passwords, GitHub tokens, and JWT secrets.

Those checks should be hard gates for third-party skills. But passing them does not mean the asset is good. It only means the obvious safety failure was not found. The content still has to earn context, avoid duplication, and give the agent a decision rule it did not already have.

Cross-type review is the hidden value

The report's cross-type section is where mature Claude Code setups get value. Many teams put everything in the wrong place.

A repeatable shell action may belong in a slash command. A safety rule that must run before commit may belong in a hook. A broad project norm may belong in CLAUDE.md. A domain-specific workflow with examples may belong in a skill. An agent constraint must be backed by actual tool restrictions, not just a sentence saying the agent should not push code.

That placement review is more useful than another pass/fail score because it shows how to preserve the idea while changing the container.

What to add to AGENTS.md

A team using this evaluator should write the operating rule down:

Run /evaluate-setup before publishing shared Claude Code skills, commands, hooks, agents, or CLAUDE.md changes. Treat Layer 1 errors as blockers. Treat Layer 2 REVIEW results as repair work, not deletion. Use /evaluate-skill for disputed skills. Use Layer 3 only as marginal-value evidence; it does not prove trigger accuracy. Remove an asset only when it is unsafe, stale, redundant, or unable to show distinct value after review.

That rule keeps quality review from becoming a purge.

Where it fits in a Codex article pipeline

The same model fixes weak AI-written articles.

A deterministic gate should catch banned titles, missing sources, thin implementation signal, no practical takeaway, and obvious AI tone. An editorial layer should ask whether the article teaches a Codex, Claude Code, MCP, skill, or agent workflow decision. A rescue layer should preserve high-value sources and rebuild the article with source-backed analysis, runbooks, caveats, and a verdict.

The Red Hat evaluator is not about word count. It is about evidence before status change. That is the principle worth importing.

When not to use it

Skip the full evaluator for a tiny personal setup with one CLAUDE.md and no shared skills. The ceremony is not free. A short manual read is faster.

Do use it when a team has multiple skills, repeated commands, hooks with shell behavior, agents with tool restrictions, or a CLAUDE.md that has become a dumping ground. Do use it before publishing a skill library to other developers. Do use it when a useful idea looks weak and the team needs a fair REVIEW path before removal.

The boundary is simple: if the setup is large enough that no one remembers why each instruction exists, run the evaluator.

Save the Red Hat evaluator article as a skill-triage field guide. The value is the layered decision process: deterministic failures, rubric judgment, marginal-value tests, and a REVIEW bucket that preserves useful ideas without keeping weak artifacts in production.

Practical takeaway

Run /evaluate-setup before publishing shared Claude Code assets. Treat Layer 1 errors as blockers. Put vague, bloated, duplicated, or misplaced assets into REVIEW. Use /evaluate-skill for disputed skills, and read Layer 3 as marginal-value evidence only. Delete last, after the source idea has failed repair or shown no distinct value.