MEGA Code

Why “hi” ships 8,400 tokens of skill metadata

The skill catalog inside Codex, Claude Code, and Gemini CLI is a one-shot system-prompt injection, decided before the host has ever seen your prompt, isolated to one host, and outcome-blind. MEGA Tron rebuilds the layer above all three so every property flips.

Published: May 20, 2026 · Engineering
Read time: ~8 min

Open Gemini CLI with 150 skills enabled and type hi. Roughly 8,400 tokens of skill metadata leave with your one syllable. Multiply that by every turn of every session. You are paying for context the model never reads, every time the cursor blinks.

Codex and Claude Code cap their catalogs so the bill does not grow unbounded (min(2% × ctx, 8,000 chars) on Codex, a configurable fraction on Claude). But they still inject the cap-full every turn (Codex) or every session (Claude). The contents are filled by alphabet or by past-use frequency. Never by what you actually typed.

Once you accept that framing, three apparently separate complaints collapse into one architectural mistake. Let’s walk through them.

Token leak
  • What you typed: 1 tok
  • What got sent to the model: 8,400 tok
Same model, same session, same one-word prompt. The catalog ships every turn. The prompt is a rounding error against it.

1. Even a “hi” drags the entire catalog along

This is the token-leak problem. The hosts have not seen your current prompt when they decide what to inject, so a one-word greeting ships the same payload as a serious request. Codex sends ~1,200 tokens regardless and Claude ~2,000, while Gemini scales linearly with pool size and routinely exceeds 10,000 tokens once a few hundred skills are installed.

A quick gut check: open your host CLI and count what’s loaded. Most users believe they have “maybe 20 skills.” Once you add the host bundles + everything you installed, it is typically 2–5× that. All of it ships, regardless of relevance, on every turn.

The bug is structural, not configurable. Caps reduce the dollar number but not the failure mode: the model is still asked to pick from a pile of metadata it did not request, and which the host chose without consulting the user’s message.
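To put a rough number on the leak, here is a back-of-the-envelope sketch. The per-turn figures are the approximate ones quoted above for the hosts that re-send the catalog every turn; the 40-turn session length is purely an assumption for illustration, not a measurement.

```python
# Approximate per-turn catalog overhead in tokens, as quoted in this post.
PER_TURN_CATALOG_TOKENS = {
    "Codex": 1_200,
    "Gemini CLI (150 skills)": 8_400,
}

def session_overhead(tokens_per_turn: int, turns: int) -> int:
    """Catalog tokens shipped over a session when the catalog is re-sent every turn."""
    return tokens_per_turn * turns

TURNS = 40  # hypothetical session length

for host, per_turn in PER_TURN_CATALOG_TOKENS.items():
    total = session_overhead(per_turn, TURNS)
    print(f"{host}: {total:,} catalog tokens over {TURNS} turns")
```

None of those tokens depend on what was typed in any of the 40 turns, which is the structural part of the problem.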


2. The host you used yesterday doesn’t talk to the one you use today

You spent a week tuning webhook-signer in Codex. It works perfectly there. Tomorrow you open Claude Code on the same project. webhook-signer isn’t there, or it’s an older copy you forgot to update. You fix a bug in the Codex copy; the Claude copy and the Gemini copy keep the bug.

Every host maintains its own pool, under its own directory layout, with its own version of every file. Editing a skill is a per-host chore, and forgetting one host means that host quietly runs a stale version for weeks. The three CLIs are three islands: same skills in name, drifting in content. (Gemini CLI is merging into Antigravity CLI with the same architecture, so the host count keeps going up, not down.)

This is the host-isolation problem.

Host isolation
  • Codex (fresh): ~/.codex/skills · webhook-signer · version v1.3.0 · modified Mon
  • Claude Code (stale): ~/.claude/skills · webhook-signer · version v1.2.4 · modified 3 weeks ago
  • Gemini CLI (broken): ~/.gemini/skills · webhook-signer · version v1.0.1 · modified 2 months ago
One skill, three hosts. Same name, three different copies, and no host is aware the other two exist.
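You can check for this kind of drift by hand. The sketch below hashes each host's copy of a skill and reports whether the copies differ. The three directories are the ones named in this post; the `<dir>/<skill>/SKILL.md` layout and both function names are assumptions made for illustration, so adjust them to match what is on your disk.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Host skill directories named in this post.
HOST_DIRS = {
    "codex": Path.home() / ".codex" / "skills",
    "claude": Path.home() / ".claude" / "skills",
    "gemini": Path.home() / ".gemini" / "skills",
}

def skill_digests(host_dirs: dict[str, Path], skill: str) -> dict[str, Optional[str]]:
    """SHA-256 of each host's copy of a skill, or None where the copy is missing.

    Assumes (hypothetically) that a skill lives at <host_dir>/<skill>/SKILL.md.
    """
    digests: dict[str, Optional[str]] = {}
    for host, root in host_dirs.items():
        path = root / skill / "SKILL.md"
        digests[host] = (
            hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None
        )
    return digests

def has_drifted(digests: dict[str, Optional[str]]) -> bool:
    """True when the hosts do not all hold the same copy (a missing copy counts)."""
    return len(set(digests.values())) > 1
```

Calling `has_drifted(skill_digests(HOST_DIRS, "webhook-signer"))` answers the question for one skill. Doing this for every skill on every edit is exactly the per-host chore described above.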

3. And nobody actually knows which skills are helping

You have a few dozen skills loaded. Quick: which 5 actually shifted an answer for the better last month? Which 3 are silently broken against a library update from last week? Which one is your model trying every time and failing because the API it documents has been deprecated?

You don’t know. Neither do the hosts. None of the three records whether a skill actually helped when it was loaded. Claude tracks invocation frequency, but frequency is not quality, and “least-invoked-first” eviction protects exactly the harmful-but-frequent skills you would want to drop. The model picks a broken skill, the skill fails silently, next turn it tries the same broken skill again. You see “the answer is weird” without knowing a stale skill is behind it.

This is the evidence-blind problem.


Same root cause behind all three

Each host’s skill catalog is a one-shot system-prompt injection that:

  1. Ignores your current prompt when deciding what to ship → token leak.
  2. Lives inside one host with no cross-host channel → island problem.
  3. Records nothing about outcomes → evidence-blind.

MEGA Tron rebuilds the catalog layer above each host so all three properties flip. The architecture maps to the three problems one-for-one: Unify, Optimize, Evolve.

Architecture
Hosts: ~/.codex/skills · ~/.claude/skills · ~/.gemini/skills
MEGA Tron layers:
Unify (pool.py)
  • Master pool + cross-host symlinks
  • Verdict economy across hosts
Optimize (router + dynamic_k)
  • Per-turn semantic top-K against your actual prompt
  • Dynamic K from score distribution → flat token cost
Evolve (verdicts/)
  • Stop-hook self-evaluation, session-end
  • Adapts ranking from past results, retires failing skills
Three composable layers: Unify, Optimize, Evolve. Each works on its own; together they form an Autonomous AgentOpt substrate any host can plug into.


Does it actually work? The benchmarks.

A 200-query benchmark on a pool of third-party skills sampled deterministically from the open-source ecosystem (full report). No API calls; every number below is reproducible from the repo.

Coverage = fraction of the “gold” skills the model can actually see for each prompt, averaged over 150 in-distribution + 50 null prompts. Native hosts can never abstain on null prompts, which is why they cap at 0.750 regardless of pool size.
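The 0.750 ceiling falls straight out of the prompt mix. Here is a minimal sketch of the metric as described; the function names are hypothetical, and the null-prompt scoring (full credit only for an empty catalog) is an assumption inferred from the ceiling rather than taken from the benchmark code.

```python
def coverage(visible: set[str], gold: set[str]) -> float:
    """Coverage for one prompt.

    In-distribution prompt: fraction of gold skills present in the catalog.
    Null prompt (no gold skills): 1.0 only if the policy abstained, i.e. the
    catalog is empty. That null-prompt rule is an assumption.
    """
    if not gold:
        return 1.0 if not visible else 0.0
    return len(visible & gold) / len(gold)

def mean_coverage(results: list[tuple[set[str], set[str]]]) -> float:
    """Average coverage over (visible, gold) pairs."""
    return sum(coverage(v, g) for v, g in results) / len(results)

# A policy that always shows every gold skill but can never abstain:
in_dist = [({"a", "b"}, {"a"})] * 150  # perfect on in-distribution prompts
null = [({"a", "b"}, set())] * 50      # non-empty catalog on null prompts
print(mean_coverage(in_dist + null))   # 0.75
```

Even a host that never misses a gold skill tops out at 150 / 200 = 0.75, because it has no way to ship an empty catalog.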

Pool size  Policy          Coverage  Tokens / turn  MEGA Tron savings
59         vanilla Codex   0.708     1,193          11.3×
59         vanilla Claude  0.750     1,972          18.6×
59         vanilla Gemini  0.750     3,562          33.6×
59         MEGA Tron       0.955     106            1.0×
183        vanilla Codex   0.185     1,157          9.3×
183        vanilla Claude  0.750     2,000          16.1×
183        vanilla Gemini  0.750     10,554         85.1×
183        MEGA Tron       0.935     124            1.0×
500        vanilla Codex   0.029     1,191          7.6×
500        vanilla Claude  0.750     3,400          21.7×
500        vanilla Gemini  0.750     29,295         186.6×
500        MEGA Tron       0.892     157            1.0×

The last column reads as “that row uses this many times more tokens than MEGA Tron.” And MEGA Tron still scores higher on coverage than every vanilla policy at every pool size.
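If you want to check the last column yourself, it is one division per row. The snippet below copies the tokens-per-turn figures from the table; the `savings` helper is just an illustrative name.

```python
# Tokens per turn, copied from the table above.
TOKENS = {
    59: {"vanilla Codex": 1193, "vanilla Claude": 1972, "vanilla Gemini": 3562, "MEGA Tron": 106},
    183: {"vanilla Codex": 1157, "vanilla Claude": 2000, "vanilla Gemini": 10554, "MEGA Tron": 124},
    500: {"vanilla Codex": 1191, "vanilla Claude": 3400, "vanilla Gemini": 29295, "MEGA Tron": 157},
}

def savings(pool: int, policy: str) -> float:
    """How many times more tokens `policy` uses than MEGA Tron at this pool size."""
    row = TOKENS[pool]
    return round(row[policy] / row["MEGA Tron"], 1)

for pool, row in TOKENS.items():
    for policy in row:
        print(f"pool={pool:<4} {policy:<15} {savings(pool, policy)}×")
```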

Tokens / turn · pool = 500
  • vanilla Gemini: 29,295 tok (overflows)
  • vanilla Claude: 3,400 tok
  • vanilla Codex: 1,191 tok
  • MEGA Tron: 157 tok
Same prompt, same 500-skill pool, four different policies. MEGA Tron uses 1/186th the tokens of vanilla Gemini while still scoring higher coverage.

The same story plays out on the coverage axis. As the pool grows, vanilla Codex’s alphabetical char-budget cannibalises itself and Claude/Gemini sit pinned at the 0.75 ceiling. MEGA Tron stays flat near 0.9 because it rebuilds the catalog per turn against what you actually typed.

Coverage vs pool size
Axes: coverage (0.25 to 1.00) against pool size (59, 183, 500). Series: MEGA Tron, vanilla Claude, vanilla Gemini, vanilla Codex.
Vanilla Codex's alphabetical char-budget loses about 96% of its coverage by 500 skills (0.708 → 0.029). Claude + Gemini hit a 0.75 ceiling regardless of pool. MEGA Tron stays flat near 0.9 because the catalog is rebuilt per turn against your actual prompt.

You don’t need 500 skills for this to matter. At 59 skills (the size most users actually run), MEGA Tron already lifts coverage from 0.71–0.75 to 0.955 while using ~11× fewer tokens than Codex, ~19× fewer than Claude, and ~34× fewer than Gemini. As the pool grows, the gap widens on both axes: vanilla Codex’s alphabetical char-budget loses about 96% of its coverage by 500 skills (0.708 → 0.029), vanilla Gemini’s catalog grows 8× in tokens, and MEGA Tron stays flat near 0.9 coverage at ~150 tokens.

Cap ≠ fix. When the host caps its catalog (Codex’s min(2% × ctx, 8,000 chars) or Claude’s skillListingBudgetFraction), the content of what survives is decided by alphabet or by invocation frequency, never by what you actually typed.
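A toy version makes the point concrete. This is a simplified model of an alphabetical character budget, not Codex's actual implementation: the function name, the entry-length accounting, and the three sample skills are all assumptions for illustration.

```python
def alphabetical_catalog(skills: dict[str, str], char_budget: int) -> list[str]:
    """Fill a catalog in name order until the character budget is spent.

    `skills` maps skill name -> one-line description. Note that the user's
    prompt is not an input at all.
    """
    chosen: list[str] = []
    used = 0
    for name in sorted(skills):
        entry_len = len(name) + len(skills[name])
        if used + entry_len > char_budget:
            break
        chosen.append(name)
        used += entry_len
    return chosen

skills = {
    "api-linter": "Lint OpenAPI specs",
    "changelog": "Draft release notes",
    "webhook-signer": "Sign and verify webhook payloads",
}

# With a tight budget, only the alphabetically first skill survives.
print(alphabetical_catalog(skills, char_budget=50))  # ['api-linter']
```

If the prompt was about webhooks, the cap kept the bill down and still shipped the wrong skill. A tighter budget changes how much is sent, not how it is chosen.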


How the three layers work

① Unify: one pool, three hosts

A skill is the same SKILL.md regardless of which CLI invokes it. MEGA Tron treats the three native locations as a single logical pool, with one master copy under $XDG_DATA_HOME/mega-tron/pool/skills/ and symlinks fanning out to each host.

Two consequences follow. First, edit once, apply everywhere: fix a bug in webhook-signer and Codex, Claude, and Gemini all see the fix on the next turn. Second, a cross-host verdict economy: because MEGA Tron is the layer that records the verdicts in the first place, a HELPFUL in Claude lifts the same skill’s rank in Codex on the next prompt.
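The symlink idea is small enough to sketch. What follows is not MEGA Tron's pool.py; it is a minimal illustration that assumes one directory per skill, and `master_pool` and `link_skill` are hypothetical names. The pool path is the one given above, with the XDG default of ~/.local/share when $XDG_DATA_HOME is unset.

```python
import os
from pathlib import Path

def master_pool() -> Path:
    """Master pool location: $XDG_DATA_HOME/mega-tron/pool/skills."""
    base = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(base) / "mega-tron" / "pool" / "skills"

def link_skill(pool: Path, host_dirs: list[Path], skill: str) -> None:
    """Point each host's entry for `skill` at the single master copy."""
    target = pool / skill
    for host_dir in host_dirs:
        host_dir.mkdir(parents=True, exist_ok=True)
        link = host_dir / skill
        if link.is_symlink():
            link.unlink()  # refresh an existing link
        elif link.exists():
            continue       # a real directory: leave it alone in this sketch
        link.symlink_to(target, target_is_directory=True)
```

Once every host entry is a link, an edit to the master SKILL.md is visible through all of them immediately, with nothing to sync.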

② Optimize: per-turn context engineering

A task-specific retrieval embedder (BGE-M3 by default; SkillRet-Embedding-0.6B, Qwen3-Embedding, Voyage, OpenAI all swappable) ranks every skill against the query. The ranking itself is one fast matmul, a (1, dim) × (dim, N) cosine: ~15 ms on cached vectors for 150 skills, sub-50 ms even at 500.
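Here is the ranking step in miniature. To keep the sketch dependency-free, plain Python lists stand in for the matmul, and the 3-dimensional vectors are made up; a real embedder returns far larger vectors, and `rank_skills` is a hypothetical name rather than MEGA Tron's router API.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_skills(query: list[float], skills: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank skills by cosine similarity to the query, best first."""
    q = normalize(query)
    scores = {
        name: sum(a * b for a, b in zip(q, normalize(vec)))
        for name, vec in skills.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy embeddings for illustration only.
skill_vectors = {
    "webhook-signer": [0.9, 0.1, 0.0],
    "changelog": [0.0, 0.2, 0.9],
}
ranked = rank_skills([1.0, 0.0, 0.0], skill_vectors)
print(ranked[0][0])  # webhook-signer
```

With the skill vectors normalized once and cached as a (dim, N) matrix, the per-skill loop becomes a single matrix product, which is where the millisecond figures above come from.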

K is decided from the shape of the score distribution itself, not as a fixed top-5. Unambiguous prompts get a tight K=1; genuinely null prompts get K=0 (the host abstains); ambiguous clusters widen the window. The result is a flat ~150 tokens of catalog regardless of how the pool grows.
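The post does not spell out the exact rule, so treat the following as one plausible way to read K off the score distribution, not MEGA Tron's dynamic_k. The three thresholds (`floor`, `margin`, `max_k`) are hypothetical placeholders.

```python
def dynamic_k(scores: list[float], floor: float = 0.35,
              margin: float = 0.05, max_k: int = 5) -> int:
    """Pick K from a list of similarity scores sorted best-first.

    Assumed rule, for illustration only:
    - K = 0 when even the best score is below `floor` (null prompt: abstain).
    - Otherwise keep every skill within `margin` of the top score, up to `max_k`.
    """
    if not scores or scores[0] < floor:
        return 0
    top = scores[0]
    k = sum(1 for s in scores if top - s <= margin)
    return min(k, max_k)

print(dynamic_k([0.82, 0.41, 0.40]))        # 1: one clear winner
print(dynamic_k([0.61, 0.60, 0.59, 0.30]))  # 3: an ambiguous cluster
print(dynamic_k([0.20, 0.18]))              # 0: nothing relevant, abstain
```

Whatever the real thresholds are, the useful property is the same: K depends on how the scores are spread, not on how many skills are installed, so the catalog does not grow with the pool.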

③ Evolve: Autonomous AgentOpt via verdict feedback

When the session ends, a hook scans the transcript for the model’s own <skill-used name="..." verdict="..."/> tags plus any script invocations from the tool-use log. The ranking formula then blends pure cosine with three forms of evidence: a Beta-smoothed count bonus, a per-context match against past helpful/harmful natural-language contexts, and a related-verdict polarity pulled from the embedding store.
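A rough sketch of the first half of that loop: pull the verdict tags out of a transcript and turn the counts into a smoothed bonus. The tag shape is the one shown above. Everything else is assumed for illustration: the attribute order, the Beta(1, 1) prior, the 0.2 weight, and the function names. The per-context match and related-verdict polarity terms are left out entirely.

```python
import re

# Matches <skill-used name="..." verdict="..."/> with attributes in this order.
TAG = re.compile(r'<skill-used\s+name="([^"]+)"\s+verdict="([^"]+)"\s*/>')

def extract_verdicts(transcript: str) -> list[tuple[str, str]]:
    """Return (skill name, verdict) pairs in the order they appear."""
    return TAG.findall(transcript)

def beta_bonus(helpful: int, harmful: int,
               alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean under a Beta(alpha, beta) prior, recentred so that
    zero evidence yields a bonus of exactly 0."""
    mean = (helpful + alpha) / (helpful + harmful + alpha + beta)
    return mean - alpha / (alpha + beta)

def blended_score(cosine: float, helpful: int, harmful: int,
                  weight: float = 0.2) -> float:
    """Cosine plus a weighted evidence bonus; `weight` is a hypothetical value."""
    return cosine + weight * beta_bonus(helpful, harmful)

transcript = (
    'Signed the payload. <skill-used name="webhook-signer" verdict="HELPFUL"/> '
    'Notes were off. <skill-used name="changelog" verdict="HARMFUL"/>'
)
print(extract_verdicts(transcript))
print(blended_score(0.70, helpful=0, harmful=0))  # 0.7: no evidence, pure cosine
```

The smoothing keeps a skill with one lucky HELPFUL from leapfrogging one with a long, mostly good record.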

Two retirement rules close the loop: three consecutive HARMFUL verdicts archive a skill (it drops out of candidate sets entirely); after five total verdicts, a high harmful ratio downgrades it to suspect with its rank halved. Cold-start skills with zero evidence are unaffected, since the blend collapses cleanly back to pure cosine when there is nothing to blend.
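The two rules translate into a few lines of logic. In this sketch the three-in-a-row and five-verdict numbers come from the text above, while the 0.5 cut-off for a “high” harmful ratio, the sticky archiving, and the function names are assumptions.

```python
def skill_status(verdicts: list[str], harmful_ratio_limit: float = 0.5) -> str:
    """Classify a skill from its verdict history, oldest first.

    Returns "archived", "suspect", or "active". `harmful_ratio_limit` is a
    hypothetical threshold; the post only says the ratio must be "high".
    """
    streak = 0
    for verdict in verdicts:
        streak = streak + 1 if verdict == "HARMFUL" else 0
        if streak >= 3:
            return "archived"  # three consecutive HARMFUL verdicts
    if len(verdicts) >= 5:
        ratio = verdicts.count("HARMFUL") / len(verdicts)
        if ratio > harmful_ratio_limit:
            return "suspect"
    return "active"

def rank_multiplier(status: str) -> float:
    """Suspect skills have their rank halved. Archived skills would be filtered
    out of the candidate set before ranking; 0.0 is only a stand-in here."""
    return {"active": 1.0, "suspect": 0.5, "archived": 0.0}[status]

print(skill_status([]))                                   # active (cold start)
print(skill_status(["HELPFUL", "HARMFUL", "HARMFUL", "HARMFUL"]))  # archived
```

A skill with no verdicts stays active with a multiplier of 1.0, which matches the cold-start behaviour described above.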


Try it

MEGA Tron runs entirely on your machine. No API keys to manage, no prompts leaving the box, no per-call cost. The embedder runs on CPU/MPS/CUDA depending on what you have; nothing crosses the network. (Agentic re-rank is the one optional path that calls an LLM, and you bring your own key.)

# 1. Install the Python package: drops the binary at ~/.local/bin/mega-tron.
uv tool install mega-tron

# 2. Wire your host CLIs and add ~/.local/bin to your shell PATH.
~/.local/bin/mega-tron setup

Then open a new terminal. The next turn in any host ships with the right skills in context, and never with skills that have silently broken on you.

First-run cost is ~1–3 minutes for the embedder model download (BGE-M3, ~570 MB, multilingual) plus one-time embedding of every discovered skill. Subsequent runs reuse the cache and finish in seconds. mega-tron setup is idempotent, safe to re-run any time you add a new host, and --uninstall reverses everything cleanly.


Repository, installation guide, and the full routing algorithm spec: github.com/mega-edo/mega-tron. Apache-2.0. Contributions and issue reports welcome.