Why “hi” ships 8,400 tokens of skill metadata
The skill catalog inside Codex, Claude Code, and Gemini CLI is a one-shot system-prompt injection, decided before the host has ever seen your prompt, isolated to one host, and outcome-blind. MEGA Tron rebuilds the layer above all three so every property flips.
Open Gemini CLI with 150 skills enabled and type hi. Roughly 8,400 tokens of skill metadata leave with your one syllable. Multiply that by every turn of every session. You are paying for context the model never reads, every time the cursor blinks.
Codex and Claude Code cap their catalogs so the bill does not grow unbounded (min(2% × ctx, 8,000 chars) on Codex, a configurable fraction on Claude). But they still inject the cap-full every turn (Codex) or every session (Claude). The contents are filled by alphabet or by past-use frequency. Never by what you actually typed.
Once you accept that framing, three apparently separate complaints collapse into one architectural mistake. Let’s walk through them.
1. Even a “hi” drags the entire catalog along
This is the token-leak problem. The hosts have never seen your current prompt when they decide what to inject. A one-word greeting ships the same payload as a serious request. Codex sends ~1,200 tokens regardless and Claude ~2,000, while Gemini scales linearly with pool size and routinely ships 10,000+ tokens once the pool reaches a couple hundred installed skills.
A quick gut check: open your host CLI and count what’s loaded. Most users believe they have “maybe 20 skills.” Once you add the host bundles + everything you installed, it is typically 2–5× that. All of it ships, regardless of relevance, on every turn.
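One way to run that gut check is to count the SKILL.md files on disk. The sketch below is illustrative only: the three directory paths are assumptions about where each host keeps user-installed skills, not something this article specifies, and skills bundled inside a host binary will not show up in the count.

```python
from pathlib import Path

# ASSUMPTION: hypothetical per-host skill directories. Adjust these to
# wherever your CLIs actually store skills.
HOST_SKILL_DIRS = {
    "codex": Path.home() / ".codex" / "skills",
    "claude": Path.home() / ".claude" / "skills",
    "gemini": Path.home() / ".gemini" / "skills",
}


def count_skills(root: Path) -> int:
    """Count SKILL.md files anywhere under root (0 if root is missing)."""
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.rglob("SKILL.md"))


if __name__ == "__main__":
    for host, root in HOST_SKILL_DIRS.items():
        print(f"{host:>7}: {count_skills(root)} skills under {root}")
```

If the totals are well above the number you had in mind, that difference is the payload you did not know you were shipping.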
The bug is structural, not configurable. Caps reduce the dollar number but not the failure mode: the model is still asked to pick from a pile of metadata it did not request, and which the host chose without consulting the user’s message.
2. The host you used yesterday doesn’t talk to the one you use today
You spent a week tuning webhook-signer in Codex. It works perfectly there. Tomorrow you open Claude Code on the same project. webhook-signer isn’t there, or it’s an older copy you forgot to update. You fix a bug in the Codex copy; the Claude copy and the Gemini copy keep the bug.
Every host maintains its own pool, under its own directory layout, with its own version of every file. Editing a skill is a per-host chore, and forgetting one host means that host quietly runs a stale version for weeks. The three CLIs are three islands: same skills in name, drifting in content. (Gemini CLI is merging into Antigravity CLI with the same architecture, so the host count keeps going up, not down.)
This is the host-isolation problem.
3. And nobody actually knows which skills are helping
You have a few dozen skills loaded. Quick: which 5 actually shifted an answer for the better last month? Which 3 are silently broken against a library update from last week? Which one is your model trying every time and failing because the API it documents has been deprecated?
You don’t know. Neither do the hosts. None of the three records whether a skill actually helped when it was loaded. Claude tracks invocation frequency, but frequency is not quality, and “least-invoked-first” eviction protects exactly the harmful-but-frequent skills you would want to drop. The model picks a broken skill, the skill fails silently, next turn it tries the same broken skill again. You see “the answer is weird” without knowing a stale skill is behind it.
This is the evidence-blind problem.
Same root cause behind all three
Each host’s skill catalog is a one-shot system-prompt injection that:
- Ignores your current prompt when deciding what to ship → token leak.
- Lives inside one host with no cross-host channel → island problem.
- Records nothing about outcomes → evidence-blind.
MEGA Tron rebuilds the catalog layer above each host so all three properties flip. The architecture maps to the three problems one-for-one: Unify, Optimize, Evolve.
- Unify
  - Master pool + cross-host symlinks
  - Verdict economy across hosts
- Optimize
  - Per-turn semantic top-K against your actual prompt
  - Dynamic K from score distribution → flat token cost
- Evolve
  - Stop-hook self-evaluation, session-end
  - Adapts ranking from past results, retires failing skills
Three composable layers. Each works on its own; together they form an Autonomous AgentOpt substrate that any of the three hosts can plug into.
Does it actually work? The benchmarks.
A 200-query benchmark on a pool of third-party skills sampled deterministically from the open-source ecosystem (full report). No API calls; every number below is reproducible from the repo.
Coverage = fraction of the “gold” skills the model can actually see for each prompt, averaged over 150 in-distribution + 50 null prompts. Native hosts can never abstain on null prompts, which is why they cap at 0.750 (150 of 200 prompts) regardless of pool size.
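To see where the 0.750 ceiling comes from, here is a minimal sketch of the metric as described. The scoring of null prompts (1.0 only when the catalog is empty, otherwise 0.0) is an assumption inferred from the sentence above, and the function names are illustrative rather than taken from the benchmark code.

```python
def coverage(visible: set[str], gold: set[str]) -> float:
    """Score one prompt.

    In-distribution prompt (gold is non-empty): fraction of gold skills
    that appear in the visible catalog.
    Null prompt (gold is empty): ASSUMPTION, scored 1.0 only if the policy
    abstained (empty catalog), else 0.0.
    """
    if not gold:
        return 1.0 if not visible else 0.0
    return len(visible & gold) / len(gold)


def mean_coverage(runs: list[tuple[set[str], set[str]]]) -> float:
    """Average per-prompt coverage over (visible, gold) pairs."""
    return sum(coverage(visible, gold) for visible, gold in runs) / len(runs)


# A policy that always ships its whole pool: perfect on the 150
# in-distribution prompts, but it can never abstain on the 50 null prompts.
pool = {"webhook-signer", "pdf-export", "hmac-utils"}
runs = [(pool, {"webhook-signer"})] * 150 + [(pool, set())] * 50
print(mean_coverage(runs))  # 0.75
```

Under this scoring, even a host that surfaces every gold skill tops out at 150/200, so getting past 0.75 requires the ability to ship nothing.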
| Pool size | Policy | Coverage | Tokens / turn | MEGA Tron savings |
|---|---|---|---|---|
| 59 | vanilla Codex | 0.708 | 1,193 | 11.3× |
| 59 | vanilla Claude | 0.750 | 1,972 | 18.6× |
| 59 | vanilla Gemini | 0.750 | 3,562 | 33.6× |
| 59 | MEGA Tron | 0.955 | 106 | 1.0× |
| 183 | vanilla Codex | 0.185 | 1,157 | 9.3× |
| 183 | vanilla Claude | 0.750 | 2,000 | 16.1× |
| 183 | vanilla Gemini | 0.750 | 10,554 | 85.1× |
| 183 | MEGA Tron | 0.935 | 124 | 1.0× |
| 500 | vanilla Codex | 0.029 | 1,191 | 7.6× |
| 500 | vanilla Claude | 0.750 | 3,400 | 21.7× |
| 500 | vanilla Gemini | 0.750 | 29,295 | 186.6× |
| 500 | MEGA Tron | 0.892 | 157 | 1.0× |
The last column reads as “that row uses this many times more tokens than MEGA Tron.” And MEGA Tron still scores higher on coverage at every row.
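The ratio column can be recomputed from the tokens-per-turn column alone. This short check uses only the numbers in the table; the dictionary layout and function name are just for illustration.

```python
# Tokens per turn, copied from the table above, keyed by pool size.
TOKENS = {
    59: {"Codex": 1193, "Claude": 1972, "Gemini": 3562, "MEGA Tron": 106},
    183: {"Codex": 1157, "Claude": 2000, "Gemini": 10554, "MEGA Tron": 124},
    500: {"Codex": 1191, "Claude": 3400, "Gemini": 29295, "MEGA Tron": 157},
}


def ratios(row: dict[str, int]) -> dict[str, float]:
    """How many times more tokens each policy uses than MEGA Tron."""
    base = row["MEGA Tron"]
    return {policy: round(tokens / base, 1) for policy, tokens in row.items()}


for pool_size, row in TOKENS.items():
    print(pool_size, ratios(row))
```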
The same story plays out on the coverage axis. As the pool grows, vanilla Codex’s alphabetical char-budget cannibalises itself and Claude/Gemini sit pinned at the 0.75 ceiling. MEGA Tron stays flat near 0.9 because it rebuilds the catalog per turn against what you actually typed.
You don’t need 500 skills for this to matter. At 59 skills (the size most users actually run), MEGA Tron already lifts coverage from 0.71–0.75 to 0.955 while using ~11× fewer tokens than Codex and ~34× fewer than Gemini. As the pool grows, the gap widens on both axes: vanilla Codex’s alphabetical char-budget loses about 96% of its coverage by 500 skills (0.708 to 0.029), vanilla Gemini’s catalog grows 8× in tokens, and MEGA Tron stays near 0.9 coverage at ~150 tokens.
Cap ≠ fix. When the host caps its catalog (Codex’s min(2% × ctx, 8,000 chars) or Claude’s skillListingBudgetFraction), the content of what survives is decided by alphabet or by invocation frequency, never by what you actually typed.
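A toy model makes the point concrete. The sketch below assumes the cap is applied by packing entries in name order until the character budget runs out; the unit of `ctx`, the per-entry cost, and both function names are assumptions for illustration, not Codex’s actual implementation.

```python
def char_budget(ctx: int, percent: int = 2, hard_cap: int = 8_000) -> int:
    """min(2% x ctx, 8,000 chars). ASSUMPTION: ctx is given in characters."""
    return min(ctx * percent // 100, hard_cap)


def fill_alphabetically(skills: dict[str, str], budget: int) -> list[str]:
    """Pack "name: description" entries in name order until the budget is spent.

    There is no prompt parameter here, which is the whole problem: what
    survives the cap depends on spelling, not on relevance.
    """
    kept: list[str] = []
    used = 0
    for name in sorted(skills):
        cost = len(name) + len(skills[name]) + 2  # "name: description"
        if used + cost > budget:
            break
        kept.append(name)
        used += cost
    return kept


skills = {
    "alpha": "x" * 40,
    "beta": "x" * 40,
    "webhook-signer": "x" * 40,
}
# With a 100-char budget, the one skill a webhook question needs is cut.
print(fill_alphabetically(skills, budget=100))  # ['alpha', 'beta']
```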
How the three layers work
① Unify: one pool, three hosts
A skill is the same SKILL.md regardless of which CLI invokes it. MEGA Tron treats the three native locations as a single logical pool, with one master copy under $XDG_DATA_HOME/mega-tron/pool/skills/ and symlinks fanning out to each host.
Two consequences follow. First, edit once and it applies everywhere: fix a bug in webhook-signer and Codex, Claude, and Gemini all see the fix on the next turn. Second, verdicts cross hosts: because MEGA Tron is the layer that records verdicts in the first place, a HELPFUL in Claude lifts the same skill’s rank in Codex on the next prompt.
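The master-copy-plus-symlinks layout can be sketched in a few lines. This is a minimal illustration, not MEGA Tron’s setup code: the function names are hypothetical, the host directories are whatever you pass in, and the only path taken from the article is the pool location under `$XDG_DATA_HOME` (which falls back to `~/.local/share` when unset, per the XDG convention).

```python
import os
from pathlib import Path


def master_pool() -> Path:
    """Master pool location named in the article, with the XDG fallback."""
    base = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(base) / "mega-tron" / "pool" / "skills"


def link_skill(master_skill: Path, host_skills_dir: Path) -> Path:
    """Expose one master-pool skill directory inside a host's skills dir."""
    host_skills_dir.mkdir(parents=True, exist_ok=True)
    link = host_skills_dir / master_skill.name
    if link.is_symlink():
        link.unlink()  # refresh an existing link so re-runs are idempotent
    elif link.exists():
        # A real per-host copy: refuse to clobber it silently.
        raise FileExistsError(f"{link} is a real copy; migrate it first")
    link.symlink_to(master_skill, target_is_directory=True)
    return link


def fan_out(pool: Path, host_dirs: list[Path]) -> None:
    """Symlink every skill directory in the pool into every host directory."""
    for skill in sorted(p for p in pool.iterdir() if p.is_dir()):
        for host_dir in host_dirs:
            link_skill(skill, host_dir)
```

Because every host path resolves to the same file, an edit to the master SKILL.md is visible through all of them without a sync step.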
② Optimize: per-turn context engineering
A task-specific retrieval embedder (BGE-M3 by default; SkillRet-Embedding-0.6B, Qwen3-Embedding, Voyage, OpenAI all swappable) ranks every skill against the query. Scoring is then a single matrix multiply: a (1, dim) × (dim, N) cosine product that takes ~15 ms on cached vectors for 150 skills and stays under 50 ms even at 500.
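The ranking step itself is small. The sketch below uses plain Python lists so it runs without dependencies, and toy 3-dimensional vectors stand in for real embeddings (both are assumptions for illustration). With cached unit vectors stacked into an (N, dim) array, the per-skill loop becomes the single matrix product described above.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def rank(query: list[float], skills: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Cosine similarity of the query against every skill, best first."""
    q = normalize(query)
    scores = {
        name: sum(a * b for a, b in zip(q, normalize(vec)))
        for name, vec in skills.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings; real ones would come from the embedder and be cached.
skills = {
    "webhook-signer": [1.0, 0.0, 0.0],
    "pdf-export": [0.0, 1.0, 0.0],
    "hmac-utils": [0.8, 0.6, 0.0],
}
for name, score in rank([1.0, 0.1, 0.0], skills):
    print(f"{score:.3f}  {name}")
```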
K is decided from the shape of the score distribution itself, not as a fixed top-5. Unambiguous prompts get a tight K=1; genuinely null prompts get K=0 (the host abstains); ambiguous clusters widen the window. The result is a flat ~150 tokens of catalog regardless of how the pool grows.
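The article does not spell out the exact rule, so the following is one plausible sketch rather than MEGA Tron’s algorithm: abstain when even the best score is weak, otherwise keep every skill whose score sits within a small margin of the best. The thresholds `abstain_floor`, `margin`, and `max_k` are hypothetical values chosen for the example.

```python
def dynamic_k(
    scores: list[float],
    abstain_floor: float = 0.30,  # ASSUMPTION: below this, nothing is relevant
    margin: float = 0.05,         # ASSUMPTION: width of the "near the top" cluster
    max_k: int = 5,               # ASSUMPTION: upper bound on catalog size
) -> int:
    """Pick K from the shape of the score distribution."""
    ranked = sorted(scores, reverse=True)
    if not ranked or ranked[0] < abstain_floor:
        return 0  # null prompt: ship no catalog at all
    near_top = sum(1 for s in ranked if ranked[0] - s <= margin)
    return min(near_top, max_k)


print(dynamic_k([0.82, 0.41, 0.38]))        # unambiguous prompt -> 1
print(dynamic_k([0.18, 0.15, 0.11]))        # null prompt -> 0
print(dynamic_k([0.71, 0.69, 0.68, 0.40]))  # ambiguous cluster -> 3
```

Whatever the real rule is, the property that matters is that K depends on the scores for this prompt and not on the size of the pool, which is what keeps the token cost flat.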
③ Evolve: Autonomous AgentOpt via verdict feedback
When the session ends, a hook scans the transcript for the model’s own <skill-used name="..." verdict="..."/> tags plus any script invocations from the tool-use log. The ranking formula then blends pure cosine with three forms of evidence: a Beta-smoothed count bonus, a per-context match against past helpful/harmful natural-language contexts, and a related-verdict polarity pulled from the embedding store.
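As a rough illustration of the first evidence term, the sketch below pulls the verdict tags out of a transcript and turns the counts into a Beta-smoothed bonus. The tag format comes from the text above; the attribute order in the regex, the Beta(1, 1) prior, the blend weight, and all function names are assumptions, and the per-context and related-verdict terms are left out entirely.

```python
import re
from collections import Counter

# ASSUMPTION: attributes appear in this order, as in the example tag.
TAG = re.compile(r'<skill-used\s+name="([^"]+)"\s+verdict="([^"]+)"\s*/>')


def tally(transcript: str) -> dict[str, Counter]:
    """Count verdicts per skill from the model's own tags."""
    counts: dict[str, Counter] = {}
    for name, verdict in TAG.findall(transcript):
        counts.setdefault(name, Counter())[verdict.upper()] += 1
    return counts


def beta_bonus(helpful: int, harmful: int) -> float:
    """Posterior mean under a Beta(1, 1) prior, recentred to (-0.5, 0.5).

    With no evidence the bonus is exactly 0.
    """
    return (helpful + 1) / (helpful + harmful + 2) - 0.5


def blended(cosine: float, helpful: int, harmful: int, weight: float = 0.2) -> float:
    """Cosine plus a weighted evidence bonus (weight is a hypothetical value)."""
    return cosine + weight * beta_bonus(helpful, harmful)


transcript = (
    'ok <skill-used name="webhook-signer" verdict="HELPFUL"/> '
    'then <skill-used name="pdf-export" verdict="HARMFUL"/> '
    'and <skill-used name="webhook-signer" verdict="HELPFUL"/>'
)
print(tally(transcript))
```

Smoothing keeps a single verdict from swinging the rank: one HELPFUL moves the bonus to about 0.17, not all the way to 0.5.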
Two retirement rules close the loop: three consecutive HARMFUL verdicts archive a skill (it drops out of candidate sets entirely); after five total verdicts, a high harmful ratio downgrades it to suspect with its rank halved. Cold-start skills with zero evidence are unaffected, since the blend collapses cleanly back to pure cosine when there is nothing to blend.
Try it
MEGA Tron runs entirely on your machine. No API keys to manage, no prompts leaving the box, no per-call cost. The embedder runs on CPU/MPS/CUDA depending on what you have; nothing crosses the network. (Agentic re-rank is the one optional path that calls an LLM, and you bring your own key.)
    # 1. Install the Python package: drops the binary at ~/.local/bin/mega-tron.
    uv tool install mega-tron

    # 2. Wire your host CLIs and add ~/.local/bin to your shell PATH.
    ~/.local/bin/mega-tron setup
Then open a new terminal. The next turn in any host ships with the right skills in context, and never with skills that have silently broken on you.
First-run cost is ~1–3 minutes for the embedder model download (BGE-M3, ~570 MB, multilingual) plus one-time embedding of every discovered skill. Subsequent runs reuse the cache and finish in seconds. mega-tron setup is idempotent, safe to re-run any time you add a new host, and --uninstall reverses everything cleanly.
Repository, installation guide, and the full routing algorithm spec: github.com/mega-edo/mega-tron. Apache-2.0. Contributions and issue reports welcome.

