AI Research · Tools · 9 min read

Runtime Contracts in Codex Skill Catalogs

The catalog is useful for Codex readers because it shows how a skill pack can use gates, manifests, schemas, runtime checks, and delegation boundaries instead of loose prompts.

AI Research Skills deserves a closer look than a typical tool roundup would give it. It is not just another collection of academic prompts. It is a catalog that tries to answer a harder question for agent builders: how do you make a skill pack produce artifacts that later agents can trust?

The useful object for Codex readers is the implementation pattern across specific files: .claude-plugin/marketplace.json defines the packaged plugins, catalog/skills.yml records skill purposes and verification metadata, docs/runtime-contract.md separates prompt instructions from executable runtime, docs/install.md names smoke tests, and tests/test_catalog.py plus tests/test_release_hygiene.py keep the index from drifting.

That is what makes the repo worth covering. The value is not that AI can help with research. The value is a concrete example of a skill catalog with gates, handoffs, provenance, runtime boundaries, and tests.

Start with the gates

The README's strongest claim is the three-gate research dossier: is the gap open, is closing it a contribution, and is it feasible? That is more interesting than a list of skills. It turns a vague research idea into a decision artifact before literature review, design, code, or manuscript work starts.

For Codex and Claude Code practitioners, the transferable lesson is simple. If a workflow can trigger expensive downstream work, create an early gate that emits a structured artifact. In this repo, the example is docs/example-topic-dossier.gaps.yml, which records verdicts (verdict: no-go, verdict: conditional-go), recall metadata, open questions, and downstream_consumer: research-design-helper. A skill catalog becomes safer when its first job is allowed to say no.
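A downstream skill could enforce the gate in a few lines before it does any work. The following is a minimal Python sketch, not code from the repo. It assumes the dossier YAML has already been parsed into a dict (for example with a YAML library) and that verdict and downstream_consumer sit at the top level of one record. The function name may_proceed and the rule that any verdict other than no-go may continue are assumptions for illustration.

```python
from typing import Any, Mapping, Tuple


def may_proceed(dossier: Mapping[str, Any], consumer: str) -> Tuple[bool, str]:
    """Decide whether `consumer` may start work from a parsed gate artifact.

    Assumed layout: a flat record with `verdict` and `downstream_consumer`.
    """
    verdict = dossier.get("verdict")
    if verdict is None:
        # No verdict means the gate never ran, so downstream work is blocked.
        return False, "dossier has no verdict"
    if verdict == "no-go":
        return False, "gate verdict is no-go"
    addressed_to = dossier.get("downstream_consumer")
    if addressed_to != consumer:
        # The artifact names its own consumer; refuse to pick up someone else's handoff.
        return False, f"dossier is addressed to {addressed_to!r}, not {consumer!r}"
    return True, f"gate verdict is {verdict}"
```

The point of returning a reason alongside the boolean is that the agent can report why it stopped instead of silently skipping work.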

The catalog is not the runtime

docs/runtime-contract.md is the document that makes this source worth preserving. It says the catalog provides marketplace registry, install docs, and links to canonical SKILL.md files. The SKILL.md files provide portable workflow instructions and output contracts. The executable research layer lives in research-hub-pipeline, which can provide a CLI, MCP or REST server, Zotero, Obsidian, NotebookLM automation, and dashboard behavior.

That split prevents a common skill-pack mistake. Installing a plugin does not prove tools exist. Copying a SKILL.md file does not prove Zotero, NotebookLM, or the Python runtime is available. Authors of Codex-facing skill packs should copy this contract style into their own catalogs.

The failure it prevents

Here is the concrete failure mode. A team publishes a skill called research-hub, Codex reads the SKILL.md, and the agent tries to run a literature search. The machine has the marketplace entry installed, but not research-hub-pipeline; the Zotero local API is closed; NotebookLM browser auth is missing. Without a runtime contract, the agent either invents success or turns the user into the debugger.

AI Research Skills handles that failure by making the preflight part of the documentation: research-hub describe --json, research-hub doctor, and a small research-hub auto ... --no-nlm run. That is the core lesson for engineering teams. Every skill that touches an external runtime should carry a preflight command and a failure branch.
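A preflight with an explicit failure branch might look like the Python sketch below. It uses only the standard library and the two commands named above. Two things are assumptions rather than documented behavior: that a non-zero exit code from these commands signals an unhealthy runtime, and the helper names preflight and run_preflight, which are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional

# Commands taken from the install guide; the order is an assumption.
PREFLIGHT_COMMANDS = [
    ["research-hub", "describe", "--json"],
    ["research-hub", "doctor"],
]


def preflight(command: List[str], timeout: float = 60.0) -> Optional[str]:
    """Return None when the command succeeds, else a message for the user."""
    if shutil.which(command[0]) is None:
        # Failure branch 1: the runtime is not installed at all.
        return f"{command[0]} is not on PATH; install the runtime before using this skill"
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"{' '.join(command)} timed out after {timeout} seconds"
    if result.returncode != 0:
        # Failure branch 2: the runtime exists but reports a problem.
        return f"{' '.join(command)} exited with {result.returncode}: {result.stderr.strip()}"
    return None


def run_preflight(commands: List[List[str]]) -> List[str]:
    """Run checks in order and stop at the first failure."""
    for command in commands:
        message = preflight(command)
        if message is not None:
            return [message]
    return []
```

An agent wrapper would call run_preflight(PREFLIGHT_COMMANDS) and, if the list is non-empty, show the message to the user instead of attempting the literature search.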

Install checks have two layers

The Claude Code path is concrete:

claude plugin marketplace add WenyuChiou/ai-research-skills
claude plugin install research-workspace@ai-research-skills
claude plugin list

But claude plugin list only proves the Claude Code marketplace layer. The runtime contract and install guide add the actual executable preflight:

pip install research-hub-pipeline
research-hub describe --json
research-hub doctor
research-hub auto "agent-based modeling" --max-papers 3 --no-nlm

That distinction is a direct Codex lesson. A serious agent article should not stop at install commands. It should name the smoke test that proves the workflow can execute.

Copy the file layout before the prose

The practical part to copy is not a paragraph from the README. It is the control surface around the catalog:

.claude-plugin/marketplace.json
catalog/skills.yml
docs/runtime-contract.md
docs/install.md
docs/verification.md
docs/skill-directory.md
tests/test_catalog.py
tests/test_release_hygiene.py

For a Codex-facing skill pack, this is the minimum shape I would accept in review. marketplace.json tells Claude Code what can be installed. catalog/skills.yml tells humans and tests what each skill is for, when it runs, what it outputs, and where its canonical source lives. runtime-contract.md tells agents what install does not prove. The tests prevent a broken index from becoming a broken reader workflow.

Turn tests into editorial promises

tests/test_catalog.py is not product code, but it is editorial infrastructure. It asserts that every family has a canonical repo; that every skill has a purpose, repo URL, skill URL, use cases, and outputs; that the skill directory includes every catalog skill; and that the marketplace JSON has the expected five plugin names.

tests/test_release_hygiene.py protects release trust: changelog headings require footer links, versions must descend, marketplace metadata.version must match the newest changelog entry, plugin versions must be semver, and sibling plugin versions are checked when those repos are cloned. This is worth calling out because it is the difference between a skill pack that reads well and a skill pack that can be maintained.
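Two of those release rules, descending versions and a marketplace version that matches the newest changelog entry, are easy to sketch in Python. This is an illustration of the idea, not the repo's test code. It assumes the changelog versions have already been extracted into a list ordered newest first, and it accepts only plain MAJOR.MINOR.PATCH strings, which is a simplification of full semver (no pre-release or build metadata). The names parse_version and release_problems are hypothetical.

```python
import re
from typing import List, Tuple

# Simplified semver: three dot-separated integers, nothing else.
SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(text: str) -> Tuple[int, ...]:
    match = SEMVER.match(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return tuple(int(part) for part in match.groups())


def release_problems(changelog_versions: List[str], marketplace_version: str) -> List[str]:
    """Check a newest-first changelog list against the marketplace version."""
    problems = []
    parse_version(marketplace_version)  # raises if the marketplace version is malformed
    parsed = [parse_version(version) for version in changelog_versions]
    for newer, older in zip(parsed, parsed[1:]):
        if newer <= older:
            problems.append(f"changelog versions do not descend: {newer} is listed before {older}")
    if changelog_versions and marketplace_version != changelog_versions[0]:
        problems.append(
            f"marketplace version {marketplace_version} does not match newest changelog entry {changelog_versions[0]}"
        )
    return problems
```

Comparing parsed integer tuples rather than raw strings avoids the classic mistake where "1.10.0" sorts before "1.9.0".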

The eight-stage pipeline is a handoff map

docs/pipeline.md lays out the workflow from literature discovery through reviewer response. The interesting part is not the number of stages. It is that each stage produces the next stage's input: .bib and paper notes, topic_dossier.gaps.yml, design_brief.md, project_manifest.yml, code or figures, .paper/claims.yml, and reviewer-response artifacts.

That is how a skill catalog escapes chat history. A later agent should not reread every paper or ask the researcher to retell the project. It reads the manifest or claim file. Codex users should recognize the same pattern from engineering work: the durable artifact matters more than the conversation that produced it.

Manifests are the memory layer

The README names manifests as one of the three design principles. Research state lives in checked-in YAML and Markdown files under .research/ and .paper/. The example project manifest carries provenance from an upstream gap. The paper claims example records evidence artifacts, status, risk, and the sentence in the manuscript.

That is useful beyond academic research. In a software repo, the equivalent could be a design decision file, migration manifest, evaluation matrix, or release checklist. The pattern is the same: if future agents will resume work, store the state in files with a known shape instead of relying on a long chat transcript.

Anti-leakage is the real writing lesson

The paper claims example is the best editorial evidence in the repo. It demonstrates an anti-leakage rule: a claim with empty evidence_artifacts must carry status: gap and a non-empty gap_reason. It cannot be quietly treated as supported.

That is exactly the kind of rule Codex workflows require. If an agent cannot prove a claim from source, the output should carry an explicit gap marker. In engineering articles, that might mean unsupported benchmark numbers, vague security claims, or unverified install promises. The catalog's research schema gives a clean pattern: uncertainty is not hidden. It is typed.
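The anti-leakage rule can be written as a small validator. The Python sketch below assumes the claims file has already been parsed into a list of dicts. The field names evidence_artifacts, status, and gap_reason come from the example described above; the id field and the function name leakage_problems are hypothetical, added only so error messages can point at a claim.

```python
from typing import Any, List, Mapping


def leakage_problems(claims: List[Mapping[str, Any]]) -> List[str]:
    """Flag claims that lack evidence but are not explicitly marked as gaps."""
    problems = []
    for index, claim in enumerate(claims):
        if claim.get("evidence_artifacts"):
            continue  # the rule only constrains claims with no evidence
        label = claim.get("id", f"claim #{index}")
        if claim.get("status") != "gap":
            problems.append(f"{label}: no evidence, so status must be 'gap'")
        if not str(claim.get("gap_reason") or "").strip():
            problems.append(f"{label}: no evidence, so gap_reason must be non-empty")
    return problems
```

A check like this belongs in the test suite or a pre-commit hook, so an unsupported claim fails loudly before it reaches a manuscript or an article.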

Delegation is by task character

The cross-cutting tools are not stage labels. codex-delegate is for token-heavy mechanical work such as scaffolding, batch edits, tests, or plotting scripts. gemini-delegate is for long-context reading, bilingual rewriting, second-opinion review, or CJK-heavy output. research-hub-multi-ai is the router when one round needs multiple delegates and a reconciliation plan.

This is a good correction to sloppy multi-agent writing. Do not say 'use many agents.' Say which task character deserves which delegate, what file contains the handoff, and how outputs are reconciled. The catalog's .coord/multi_ai_plan.md pattern is the part to borrow.
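Routing by task character can be made explicit as a lookup table. The Python sketch below is an illustration only: the task-character labels are taken from the descriptions above, but the table itself, the route function, and the rule that any mix of delegates goes to research-hub-multi-ai are assumptions about how one might encode the idea, not the catalog's implementation.

```python
from typing import Dict, List

# Task character -> delegate, using the examples named in the catalog description.
DELEGATE_BY_CHARACTER: Dict[str, str] = {
    "scaffolding": "codex-delegate",
    "batch edits": "codex-delegate",
    "tests": "codex-delegate",
    "plotting scripts": "codex-delegate",
    "long-context reading": "gemini-delegate",
    "bilingual rewriting": "gemini-delegate",
    "second-opinion review": "gemini-delegate",
    "CJK-heavy output": "gemini-delegate",
}


def route(task_characters: List[str]) -> str:
    """Pick one delegate, or the multi-delegate router when a round needs both."""
    if not task_characters:
        raise ValueError("at least one task character is required")
    delegates = set()
    for character in task_characters:
        if character not in DELEGATE_BY_CHARACTER:
            # Unknown work should be classified by a human, not guessed at.
            raise KeyError(f"no delegate registered for task character {character!r}")
        delegates.add(DELEGATE_BY_CHARACTER[character])
    if len(delegates) > 1:
        return "research-hub-multi-ai"
    return delegates.pop()
```

The table does not replace the written handoff: the routing decision and the reconciliation plan still belong in a file such as .coord/multi_ai_plan.md.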

The marketplace file is an index, not proof

.claude-plugin/marketplace.json defines five plugins: research-workspace, academic-writing-skills, zotero-skills, codex-delegate, and gemini-delegate. The YAML catalog then records skill purposes, use cases, outputs, repository URLs, verification status, and notes. That is strong catalog hygiene.

It is still not proof that every host behaves the same way. The README explicitly says Codex CLI, Cursor, Windsurf, Gemini CLI, Hermes, OpenClaw, and generic API clients can consume the same SKILL.md layer, but each host needs its own loading or prompt smoke test. A good article should keep that boundary visible.

Verification evidence is useful but bounded

docs/verification.md gives concrete evidence: real research-hub doctor checks, search results, test corpus outputs, NotebookLM brief verification, manifest parsing, orientation memo output, Codex and Gemini delegate checks, and Zotero local API probes. The catalog tests also assert the current YAML records 15 pass statuses.

The caveat is equally important. The verification report is point-in-time and maintainer-operated. The limitations section says the catalog is assembled and tested by one graduate-student researcher, has domain bias toward water resources and agent-based modeling, and does not assert CI-level Claude plugin install round trips. That makes the evidence useful, not absolute.

Where a Codex user should borrow from it

The best use of AI Research Skills for a Codex reader may be as a design pattern, not a research tool. Borrow the runtime contract. Borrow the artifact handoffs. Borrow the anti-leakage schema. Borrow the distinction between prompt-only skills, workspace-file skills, runtime-backed automation, deep API CRUD, and delegation skills.

If you are publishing your own skill pack, add docs/runtime-contract.md before you add more prompts. Then add one catalog test that walks catalog/skills.yml and fails if a skill lacks purpose, use_when, outputs, or a canonical SKILL.md URL. Say what install proves, what it does not prove, which smoke test matters, where outputs live, and which downstream skill consumes them. That is the difference between a prompt collection and an operating system for agent work.
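A catalog test of that kind can be very small. The Python sketch below assumes catalog/skills.yml has been loaded into a mapping of skill name to entry. The keys purpose, use_when, and outputs follow the field names mentioned above; skill_url is a hypothetical key name for the canonical SKILL.md URL, and catalog_problems is a hypothetical helper.

```python
from typing import Any, List, Mapping

# `skill_url` is an assumed key name; adjust it to match your own catalog schema.
REQUIRED_FIELDS = ("purpose", "use_when", "outputs", "skill_url")


def catalog_problems(skills: Mapping[str, Mapping[str, Any]]) -> List[str]:
    """Return one message per missing or malformed field in the catalog."""
    problems = []
    for name, entry in skills.items():
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"{name}: missing {field}")
        url = entry.get("skill_url") or ""
        if url and not url.endswith("SKILL.md"):
            problems.append(f"{name}: skill_url does not point at a SKILL.md file")
    return problems
```

Inside tests/test_catalog.py, a single test function would load the YAML and assert that catalog_problems(skills) == [], so every missing field shows up by name in the failure output.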

What I would reject in review

I would reject a Codex skill catalog PR that only adds more prompts. A catalog entry should answer five review questions: where is the canonical SKILL.md, what files can the skill write, what external runtime does it require, what command proves the runtime is healthy, and what artifact lets the next agent continue?

In this repo, those answers are distributed across catalog/skills.yml, docs/runtime-contract.md, docs/install.md, examples under docs/, and tests. That is not accidental bureaucracy. It is how the catalog keeps a future agent from mistaking a readable instruction file for a verified operating environment.

A migration path for an existing prompt folder

If you already have a folder of prompts, do not rewrite everything at once. Pick one prompt that creates downstream work and convert only that path.

1. Create catalog/skills.yml with one skill entry.
2. Add docs/runtime-contract.md and name the runtime assumptions.
3. Define one output artifact, such as .research/topic_dossier.gaps.yml or .coord/multi_ai_plan.md.
4. Add tests/test_catalog.py assertions for purpose, use_when, outputs, and SKILL.md URL.
5. Add one smoke command in docs/install.md.
6. Run the skill once and commit the example output.

This is the implementation detail that a summary of the repo tends to miss. The repo is useful because it shows the scaffolding around skills, not only the skill prose.

Two checklists for readers

If you are evaluating this catalog as a user, run this checklist before trusting a research workflow:

1. claude plugin list shows the expected plugin.
2. research-hub describe --json returns a runtime description.
3. research-hub doctor passes or names the missing dependency.
4. A tiny research-hub auto run completes with --no-nlm.
5. The skill produces a file artifact, not only chat text.

If you are publishing your own catalog, use the publisher-side checklist: define the install layer, define the runtime layer, define the artifact, define the downstream consumer, then add one test that fails when the catalog and docs drift. Together, the two lists are the step-by-step value of the repo for readers.

A rollout I would accept

Do not install all five plugins and call the workflow adopted. Start with one use case. If the team only needs paper comparison, install research-workspace, run research-hub doctor, and produce one literature_matrix.md. If the team writes manuscripts, add academic-writing-skills and test one claim-evidence audit. If the team delegates code-heavy research tooling, install codex-delegate and require a written handoff file.

The acceptance test is artifact quality. Did the skill produce a file the next agent can consume? Did unsupported claims carry status: gap? Did research-hub auto "agent-based modeling" --max-papers 3 --no-nlm run in the target environment? Did the user know whether Claude Code marketplace, raw SKILL.md, or Python runtime was installed? If those answers are no, the catalog is being used as prompt theater rather than a research workflow.

Treat AI Research Skills as a skill-catalog field guide. It is valuable when readers study its gates, manifests, runtime contract, verification artifacts, and Codex/Gemini delegation boundaries; it is weak if reduced to a generic 'AI helps research' article.

Practical takeaway

Use the repo as a checklist for serious skill-pack design. Before publishing a Codex or Claude Code skill catalog, define the runtime contract, add one install smoke test, state what the marketplace or SKILL.md layer does not prove, require structured handoff files, mark unsupported claims as gaps, and document which delegate handles which task character. Then test one complete artifact handoff before adding more skills.
