도구2026년 6월 1일 · 11 min read

CodeBurn Cost Review for Codex and Claude Logs

The useful lesson is not a prettier token counter. CodeBurn gives teams a runbook for turning local agent logs into budget, retry, context, and model-efficiency review.

𝕏 in

EDITOR'S NOTEThe useful story is not terminal token telemetry by itself. CodeBurn is a local evidence ledger for agent work, useful when teams separate exact usage, estimated usage, derived productivity signals, and policy decisions.

The bill rarely arrives where the bad habit happened. A Codex session retries the same edit three times. Claude rereads files that should have been in context. A hidden Cursor model looks cheap until the estimate is mixed with exact provider logs. By the time the team sees a monthly total, the useful evidence has already been flattened.

That is why the weak article angle was wrong. 'A TUI tracks AI token spending in real time' is not enough for a serious Codex ecosystem article. A token dashboard is easy to describe and easy to overvalue.

The stronger story is that CodeBurn treats local agent logs as evidence. It reads Codex rollouts, Claude Code JSONL sessions, Cursor databases, OpenCode stores, Warp records, and many adjacent tool histories, then turns those artifacts into cost, model, task, retry, cache, and project views. Because it does not proxy requests or require API keys, it sits after the work instead of in the request path.

That distinction matters. CodeBurn is not a billing system, and it is not proof that a session produced valuable software. It is a ledger for asking better questions: which provider is exact, which provider is estimated, which sessions burned money without edits, which models cost more per successful edit, and which repo rules should change next week?

Start with the trust boundary

The first thing to say about CodeBurn is not that it supports many tools. The first thing is how it observes them. The README says it does not wrap AI tools, proxy requests, or ask for API keys. It reads local session data from disk and prices calls with LiteLLM pricing data, cached pricing, and model fallbacks.

That is a strong design choice for Codex and Claude Code users. It reduces request-path risk because prompts and completions do not pass through another live service. It also creates a different risk: the tool is only as correct as the local logs, provider parsers, and pricing assumptions it can see. A serious team should therefore label every number by confidence before acting on it.

Compare it to the three simpler options

CodeBurn is not the only way to reason about AI coding cost. The choice depends on what question the team is asking.

A provider billing dashboard is best when the question is 'what will we be charged?' It is the wrong tool when the question is 'which repo, model, retry loop, or tool habit caused the spend?' It usually cannot see project-level agent behavior.

A request proxy is best when the team controls the request path and wants live metering. It is the wrong tool when developers use multiple local agents, desktop tools, or vendor CLIs that should not route through another service.

A manual spreadsheet is best for subscription ownership and budget approvals. It is the wrong tool for detecting no-edit sessions, hidden model aliases, repeated file reads, context bloat, or cost per edit.

CodeBurn fits the fourth case: the work already happened in local agent tools, and the team needs evidence from those local records before changing repo policy.

Verify the runtime before the article promise

There is already a small but useful caveat in the source. The README says Node.js 20 or later. The package metadata and CLI launcher enforce Node.js 22.13.0 or later. That mismatch is exactly the kind of detail an article should surface, because it separates repository evidence from polished onboarding copy.

A practical first run should be boring:

node --version
npx codeburn --help
npx codeburn report --provider codex --format json
npx codeburn report --provider claude --format json

If Node is below 22.13.0, fix that before blaming session discovery. If reports are empty, inspect the provider paths before judging the cost math.

Use a baseline runbook, not a tour

The first week should produce a baseline, not a pile of screenshots. Run each provider separately and save raw JSON so the team can compare the same shape of evidence later:

mkdir -p .agent-cost-review/week-1
CODEX_HOME=~/.codex npx codeburn report --provider codex --format json > .agent-cost-review/week-1/codex.json
CLAUDE_CONFIG_DIR=~/.claude npx codeburn report --provider claude --format json > .agent-cost-review/week-1/claude.json
npx codeburn compare --provider codex -p 30days > .agent-cost-review/week-1/codex-compare.txt
npx codeburn compare --provider claude -p 30days > .agent-cost-review/week-1/claude-compare.txt

The baseline is done only when the reviewer can fill this out:

- Codex discovery path checked: yes/no
- Claude discovery path checked: yes/no
- Exact providers this week:
- Estimated providers this week:
- Top expensive session inspected manually:
- One AGENTS.md or CLAUDE.md rule to change:
- One thing we will not change because the evidence is estimated:

That small review loop is more valuable than browsing every panel in the TUI.

A 30-minute audit that produces a decision

A useful audit has a clock and an output. Give CodeBurn 30 minutes and end with one written decision.

# 0-5 minutes: prove discovery works
CODEX_HOME=~/.codex npx codeburn status --provider codex
CLAUDE_CONFIG_DIR=~/.claude npx codeburn status --provider claude

# 5-15 minutes: find cost concentration
npx codeburn report -p 7days --provider codex --format json > codex-7d.json
npx codeburn report -p 7days --provider claude --format json > claude-7d.json

# 15-25 minutes: inspect model behavior
npx codeburn compare --provider codex -p 30days
npx codeburn yield -p 30days

# 25-30 minutes: write the policy change
printf '%s\n' '- This week we will change one AGENTS.md rule based on CodeBurn evidence.' >> COST_REVIEW.md

The output should be one sentence, not a dashboard recap. Example: 'Codex sessions with no edits and more than two retries must stop and ask for a narrower task.' Or: 'Do not compare Cursor Auto totals with Codex totals without marking Cursor as estimated.' That is how a cost tool becomes an operating tool.

Configure budgets and aliases explicitly

The configuration path is part of the adoption boundary. CodeBurn stores persistent settings at ~/.config/codeburn/config.json. Do not leave model aliases or budget scope as tribal knowledge.

# Provider-specific budget boundaries
npx codeburn plan set custom --monthly-usd 200 --provider codex
npx codeburn plan set claude-max --provider claude

# Pricing aliases for proxy or renamed models
npx codeburn model-alias "company-sonnet-proxy" "claude-sonnet-4-6"
npx codeburn model-alias --list

# Display currency if the team reviews outside USD
npx codeburn currency USD

A reviewed config should look structurally like this:

{
  "currency": { "code": "USD" },
  "plans": {
    "codex": {
      "id": "custom",
      "monthlyUsd": 200,
      "provider": "codex",
      "resetDay": 1
    },
    "claude": {
      "id": "claude-max",
      "monthlyUsd": 200,
      "provider": "claude",
      "resetDay": 1
    }
  },
  "modelAliases": {
    "company-sonnet-proxy": "claude-sonnet-4-6"
  }
}

That snippet is not a magic setup file to paste blindly; it is the shape the reviewer should understand. If a provider is missing, that provider is outside the budget review. If an alias is wrong, pricing evidence can drift even when token parsing is correct.

Codex support is not a generic JSONL reader

The Codex provider is a good reason to keep this article. CodeBurn reads CODEX_HOME or ~/.codex, walks sessions/YYYY/MM/DD/rollout-*.jsonl, and rejects files whose first line is not a session_meta entry with an originator that starts with Codex. The provider notes also call out a 1 MB first-line cap because modern Codex session metadata can carry a large system prompt.

The token math is more important than the path. Codex can emit last_token_usage, cumulative total_token_usage, or neither. CodeBurn uses per-turn usage when present, computes deltas from cumulative totals when needed, and falls back to character estimates when logs do not carry usage. It also subtracts cached input tokens because OpenAI counts cached tokens inside input tokens, while the rest of CodeBurn wants cached usage separated.

That is useful analysis material. It tells readers that a Codex number in CodeBurn can be exact, delta-derived, or estimated, depending on what the session recorded. Those categories should not be collapsed into one confidence level.

Claude has a different parser boundary

Claude support is organized differently. The provider notes say Claude Code and Claude Desktop local-agent-mode sessions are JSONL, but Claude parsing lives in src/parser.ts rather than the provider shim. That parser handles full turn grouping, duplicated streaming message IDs, MCP tool inventory, and project-path details.

The cache-pricing detail is also worth keeping. The Claude notes describe newer cache creation fields for five-minute and one-hour ephemeral cache writes; CodeBurn prices the one-hour portion differently when the split is present and falls back to legacy aggregate pricing when it is not.

For a team using both Codex and Claude Code, the right comparison is not 'which dashboard looks nicer.' It is whether both parsers are telling you the same kind of truth. Codex cached-token semantics, Claude cache-write semantics, Desktop local-agent sessions, and provider-specific project paths all change what a number means.

Estimated providers need labels

The README is unusually explicit about some weak spots. Cursor's Auto model is hidden, so CodeBurn estimates it with Sonnet pricing. Kiro token counts are estimated from content length and costed at Sonnet rates because Kiro does not expose model information. Those are reasonable engineering fallbacks, but they are not invoice-grade facts.

This is where the article should become editorial. Mixing exact Codex token reports, Claude usage records, Cursor Auto estimates, and Kiro content-length estimates in one total can mislead a manager. The total may still be useful for spotting a runaway habit, but it should not decide vendor spend, employee performance, or model policy without confidence labels.

A good operating rule is simple: exact logs can trigger budget investigation; estimated logs can trigger workflow review. Do not reverse those.

Use a failure-mode checklist

A weekly protocol needs failure handling. Otherwise the first empty report becomes a false conclusion.

Discovery failure:
- Codex: check CODEX_HOME and ~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl.
- Claude: check CLAUDE_CONFIG_DIR, CLAUDE_CONFIG_DIRS, and ~/.claude/projects.
- Empty output: confirm the first Codex line is session_meta with a codex originator.

Cost failure:
- Hidden model: add a model-alias before comparing cost per edit.
- Estimated provider: label the report as workflow evidence, not billing evidence.
- Price surprise: compare against provider billing before changing budgets.

Workflow failure:
- Expensive no-edit session: inspect transcript before blaming the model.
- Retry-heavy edits: add a stop condition or narrower task rule.
- Context bloat: remove stale MCP servers, dead skills, or oversized instructions.

This is implementation detail that matters more than another chart. It tells the reader what to check, where to check it, and which kind of decision the evidence can support.

The waste scanner is the strongest operating feature

The waste scanner is stronger than the command name suggests because it is deterministic. It scans visible local records for patterns: repeated file reads, poor Read/Edit ratio, uncapped bash output, unused MCP servers, ghost agents, ghost skills, bloated CLAUDE.md, context-heavy sessions, junk directory reads, expensive retries, and costly sessions that produced no edits.

That is more useful than a generic 'save tokens' tip list. Each finding can be tied back to logs or local files. The thresholds are inspectable: healthy CLAUDE.md length, minimum edit counts for read/edit ratio, context-bloat limits, MCP usage thresholds, and cache multipliers all live in source.

The right way to use it is not to accept every suggestion. Run the waste scan weekly, inspect the source evidence, then convert only repeated findings into local rules.

Model efficiency should be read as workflow evidence

CodeBurn's model-efficiency module calculates edit turns, one-shot turns, retries, cost per edit, one-shot rate, and retries per edit for real non-synthetic primary calls. That is the kind of metric Codex readers actually need, because raw token totals do not tell whether a model is expensive because it solved hard work or because it kept circling.

A useful review might ask: did GPT-5.3 Codex cost more because it handled architecture changes, or because it retried simple edits? Did a cheaper model produce more retries per edit? Did cache hits reduce repeated context cost? Did a tool-heavy session end with a commit or just another plan?

These are not model benchmarks. They are local workflow questions. CodeBurn is most valuable when the reader keeps that distinction clear.

Plan budgets are for habits, not invoices

The plan-usage code supports provider-scoped plans, reset days, an 80 percent near-threshold, and projected month-end spend based on a trailing seven-day median daily cost. That is a good way to turn subscription anxiety into a weekly habit check.

A team can write a small policy:

npx codeburn plan set claude-max --provider claude
npx codeburn today --provider claude
npx codeburn report -p 30days --provider claude --format json

The output should answer whether the team is approaching its own budget boundary. It should not replace provider billing. Subscription plans, cached pricing, hidden models, local schema drift, and estimates all make the number operational rather than legal.

Exports make review possible

The export module produces summary, daily, activity, model, tool, bash, project, and session rows. It also guards CSV output from spreadsheet formula injection by prefixing dangerous leading characters. That small detail matters because cost reports often end up in spreadsheets, Slack, or internal docs.

For Codex teams, JSON export is the better first format:

npx codeburn export -f json

Keep the weekly review narrow. Look at the top five sessions by cost, the top models by cost per edit, retry-heavy edit turns, and the most expensive no-edit sessions. Then decide one policy change. A dashboard that produces ten new rules every week will be ignored.

Yield is useful but easy to overread

The README describes a yield view that correlates AI sessions with git commits by timestamp and marks outcomes such as productive, reverted, or abandoned. That is a useful triage lens. It is not causal proof.

A session can be expensive and still valuable because it found a dead end. A cheap session can be harmful if it landed a subtle bug. A commit timestamp can be close to an agent session without meaning the agent produced the commit. The metric should guide follow-up review, not become a scoreboard.

The better use is to ask why expensive sessions did not land, why reverted work passed the agent's own checks, and whether local instructions need stronger stop conditions.

The menubar app does not change the evidence model

The macOS menubar app is useful as a surface, not a separate source of truth. Its README says it shells out to the CodeBurn status command directly as argv every 60 seconds, with manual refresh asking the CLI for the slower waste findings. The release installer verifies checksums and stores a persistent CLI path.

That is a sensible design. It keeps the CLI as the data path and gives users quick visibility. But a serious article should not overplay it as live cost control. It is a convenient window into the same local ledger.

Privacy still needs an export policy

No proxy does not mean no risk. CodeBurn reads local session records. Those records may include project paths, tool names, prompts, command patterns, session identifiers, and sometimes sensitive context. The export pipeline is careful about CSV formulas, but the user still owns the data handling policy.

For a shared team, exports should go to a private location, strip project paths when they leave the engineering group, and avoid uploading raw session-level data to public trackers. Cost observability can become surveillance if the team starts ranking people by session spend without reading the work.

A weekly Codex review loop

A practical loop can be short and repeatable:

CODEX_HOME=~/.codex npx codeburn report --provider codex --format json
npx codeburn compare --provider codex -p 30days
npx codeburn yield -p 30days
npx codeburn export -f json

Review five items: expensive sessions, no-edit sessions, retry-heavy edit turns, context bloat, and model cost per edit. For each one, name the next action:

- Expensive session: inspect transcript before changing model policy.
- No-edit session: add a stop condition for planning-only loops.
- Retry-heavy edit turn: check whether AGENTS.md is missing a command, test, or file boundary.
- Context bloat: remove stale MCP servers, dead skills, or oversized repo memory.
- Model cost per edit: compare result quality before moving work to a cheaper model.

Then make one local change. That might be a narrower AGENTS.md instruction, a rule to cap bash output, a model-alias correction, a cleanup of unused MCP servers, or a stronger stop condition before the agent keeps retrying.

The key is that CodeBurn should change how the team runs agents. If it only creates a pretty number, it is trivia.

What to write into AGENTS.md

The article should give the reader a policy they can copy and adapt:

Every Friday, run CodeBurn for Codex and Claude separately. Treat Codex and Claude token usage as higher-confidence when session logs include real usage. Label Cursor Auto, Kiro, and other estimated providers before aggregating totals. Review top sessions by cost, retry-heavy edit turns, expensive no-edit sessions, context-bloat findings, and model cost per edit. Convert only repeated findings into repo rules. Do not use CodeBurn exports as vendor billing truth or individual performance ranking.

That is the difference between an article about a tool and an article that helps readers operate Codex better.

When not to use CodeBurn

Do not add CodeBurn to a team that is not ready to interpret estimates. Do not use it as a billing source when vendor dashboards are available. Do not use it to compare developers without reading commits, reviews, defects, and product outcomes. Do not act on provider totals if the underlying parser has not seen the local storage format before.

It is also overkill for occasional AI coding. The tool becomes worthwhile when agent work is frequent enough that retries, model choices, context size, and MCP inventory have real budget and quality impact.

Save the CodeBurn article, but not as a terminal-dashboard profile. It belongs as a field guide for turning Codex and Claude Code logs into cost evidence, with explicit labels for exact, derived, estimated, and policy-level signals.

Practical takeaway

Install only after verifying Node >=22.13.0. Run provider-specific reports for Codex and Claude before aggregating anything. Label estimated providers separately. Use compare, yield, export, and the waste scanner as review inputs, not automatic truths. The best outcome is a small weekly policy change in AGENTS.md or CLAUDE.md: fewer blind retries, cleaner context, safer MCP inventory, clearer model aliases, and a better stop condition before expensive sessions drift.

SOURCES

[1] Primary sourcegithub.com

[2] package metadata and CLI launchergithub.com

[3] provider registrygithub.com

[4] Codex provider notesgithub.com

[5] Claude provider notesgithub.com

[6] waste scannergithub.com

[7] model efficiencygithub.com

[8] plan usagegithub.com

[9] configuration modulegithub.com

[10] export pipelinegithub.com

[11] menubar appgithub.com

[12] security policygithub.com

[13] CLI launchergithub.com

[14] main command routergithub.com

[15] Codex parsergithub.com

[16] Claude session parsergithub.com

[17] context budgetgithub.com

codexclaude codecost observabilityai agentsworkflow

Claude Code 생태계를 앞서가세요

MCP 서버, 스킬, 에이전트 패턴, 바이브코딩 인사이트를 매주 전해드립니다.

무료 구독

OpenCode 전환 후 /code-review · /security-review 공백: opencode-power-pack의 SKILL.md 포팅 구조와 도입 조건

Anthropic 공식 Claude Code 플러그인의 code-review, security-review, feature-dev는 OpenCode에서 그대로 동작하지 않는다. opencode-power-pack은 이 워크플로우들을 OpenCode 네이티브 SKILL.md 포맷으로 번역하고, ~/.config/opencode/opencode.json 한 줄 설정으로 11개 스킬을 적재한다.

2026년 6월 11일

opencodeskillscode-review

도구8분 읽기

guard-skills: Claude Code diff에 catch-all 오류·환각 import·HPOS 패턴을 잡는 5개 리뷰 Skill

guard-skills는 Claude Code, Codex, Cursor가 생성한 코드·테스트·문서에 second-pass 리뷰를 수행하는 5개 Skill 패키지입니다. clean-code-guard는 GitClear·USENIX 연구에 근거한 14가지 AI 실패 패턴을 검사하고, woo-guard는 AI가 반복 생성하는 pre-HPOS WooCommerce 코드 패턴을 직접 겨냥합니다.

2026년 6월 11일

skillsclaude-codecode-review

도구7 min read

baoyu-design: What Changes When You Run claude.ai/design Locally in Claude Code

baoyu-design ports the claude.ai/design methodology to Claude Code by detecting the agent environment at runtime and loading a tool substitution table from `references/claude.md`. The design system pipeline — `compile-design-system.mjs` produces static lint config, `import-design-system.mjs` generates a token allowlist as `_ds_prompt.md` — enforces the same constraint at two layers. Neither is optional if you want session-to-session consistency.

2026년 6월 10일

skillsclaude-codedesign-systems