MEGA Code

MEGA Tech Report · Experiments & Evaluation

Measured, not marketed.

Workflow optimization on the GEPA suite, and wisdom curation on SkillsBench.

Same datasets, same seeds, same grading as the published baselines.

Optimization · 4 Benchmarks · GPT-4.1 Mini

Workflow optimization on the GEPA suite

MEGA's workflow optimizer is benchmarked against MIPROv2, TextGrad, GEPA, and Feedback Descent on four compound AI systems from GEPA's released eval code: multi-hop QA, instruction following, retrieval-based claim verification, and privacy-aware delegation. All methods use identical train/val/test splits and seeds and are measured under the same grading criteria.

+7.03 pts

Over the strongest prior optimizer (GEPA)

Aggregate 76.55 vs 69.52

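| System | HotpotQA | IFBench | HoVer | PUPA | Agg. |
| --- | --- | --- | --- | --- | --- |
| Baseline | 38.00 | 47.79 | 46.33 | 78.57 | 52.67 |
| MIPROv2 | 58.00 | 49.15 | 48.33 | 83.37 | 59.71 |
| TextGrad | 62.33 | 48.64 | 47.67 | 85.68 | 61.08 |
| Feedback Descent | 68.33 | 54.59 | 57.67 | 85.66 | 66.56 |
| GEPA | 69.00 | 55.95 | 56.67 | 96.46 | 69.52 |
| MEGA | 72.67 | 61.05 | 74.67 | 97.81 | 76.55 |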

Scores are test-split accuracy on identical splits, listed as train/val/test (HotpotQA 150/100/300, IFBench 150/100/294, HoVer 150/100/300, PUPA 111/111/221). Baseline, MIPROv2, TextGrad, and GEPA scores are cited from the GEPA paper; Feedback Descent scores are cited from its authors. MEGA used smaller validation sets than GEPA's original splits (100 vs 300 for HotpotQA) to reduce optimization cost.
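The page does not state how the Agg. column is computed. As a quick sanity check, the sketch below assumes it is the unweighted mean of the four benchmark scores; under that assumption the reported aggregates and the +7.03 gap over GEPA are reproduced from the per-benchmark numbers. The names `SCORES`, `aggregate`, and `AGG` are illustrative only.

```python
from statistics import mean

# Per-benchmark test accuracy copied from the results table
# (order: HotpotQA, IFBench, HoVer, PUPA).
SCORES = {
    "Baseline": (38.00, 47.79, 46.33, 78.57),
    "MIPROv2": (58.00, 49.15, 48.33, 83.37),
    "TextGrad": (62.33, 48.64, 47.67, 85.68),
    "Feedback Descent": (68.33, 54.59, 57.67, 85.66),
    "GEPA": (69.00, 55.95, 56.67, 96.46),
    "MEGA": (72.67, 61.05, 74.67, 97.81),
}


def aggregate(scores):
    """Assumed aggregation: unweighted mean over the four benchmarks."""
    return round(mean(scores), 2)


AGG = {name: aggregate(vals) for name, vals in SCORES.items()}
gap_over_gepa = round(AGG["MEGA"] - AGG["GEPA"], 2)

if __name__ == "__main__":
    for name, agg in AGG.items():
        print(f"{name:<18}{agg:6.2f}")
    print(f"MEGA over GEPA: +{gap_over_gepa:.2f} pts")
```

An unweighted mean gives each benchmark equal influence regardless of test-set size, so the large HoVer gain counts as much as the smaller PUPA gain.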

Curation · 84 Tasks · 11 Domains

Wisdom curation quality on SkillsBench

This benchmark tests whether compositional curation translates into downstream task performance. It covers 84 tasks across 11 domains, each verified by deterministic pytest assertions in isolated Docker containers. Four conditions share the same agent (Gemini 3 Flash), the same 4,207-asset skill pool, the same tasks, and the same verifier; the only variable is the skill-discovery and orchestration method.

46.5%

Highest pass rate, best efficiency, lowest latency

0.566 score/Mtok vs SkillNet 0.424

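| System | Pass Rate (%) | Avg Tokens/Task (k) | Curation Latency (s/task) | Efficiency (score/Mtok) |
| --- | --- | --- | --- | --- |
| No Skills | 31.5 | 894 | - | 0.353 |
| AgentSkillOS | 41.1 | 1,189 | 403.4 | 0.345 |
| SkillNet | 41.7 | 983 | 37.8 | 0.424 |
| MEGA (WG) | 46.5 | 822 | 11.8 | 0.566 |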

Efficiency is pass rate per megatoken consumed (score/Mtok). Curation latency captures the time to discover and orchestrate skills before execution: AgentSkillOS makes sequential LLM calls across a capability tree (up to six levels) plus a DAG plan call, while MEGA's PCST retrieval runs over a pre-indexed graph with no LLM calls in the retrieval phase.
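The efficiency metric can be recomputed from the table's own columns. The sketch below assumes "score" means the pass rate expressed as a fraction (46.5% becomes 0.465) and that tokens are per task; the function and variable names are illustrative. Because the inputs are already rounded, the recomputed values agree with the reported ones only to within about 0.001.

```python
def efficiency(pass_rate_pct, avg_tokens_k):
    """Pass rate (as a fraction) divided by average tokens per task in millions."""
    return (pass_rate_pct / 100.0) / (avg_tokens_k / 1000.0)


# (pass rate %, average tokens per task in thousands), copied from the table.
ROWS = {
    "No Skills": (31.5, 894),
    "AgentSkillOS": (41.1, 1189),
    "SkillNet": (41.7, 983),
    "MEGA (WG)": (46.5, 822),
}

EFF = {name: efficiency(rate, tokens) for name, (rate, tokens) in ROWS.items()}

if __name__ == "__main__":
    for name, value in EFF.items():
        print(f"{name:<14}{value:.3f} score/Mtok")
```

Dividing by tokens rewards a method both for passing more tasks and for spending less context to do so, which is why AgentSkillOS scores below the no-skill condition despite a higher pass rate.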

Real-World Analysis · 7 Systems · Performance Comparison

Real work. Real results.

MEGA Code vs. 7 leading systems — measured on tasks developers actually ship.

Compared head-to-head in A/B performance and across 4 structural dimensions. Every claim on this page is evidence-backed.

Skill Quality Performance · 5 Systems

Token Usage by System

Each system generated skills from a 10-round full-stack development session (FastAPI + React + Gemini chat app). Four skills were extracted and evaluated using HF Upskill's eval harness, with 5 test cases per skill, on both Claude Sonnet and Claude Haiku. Competitors received only 1–2 sentence prompts, with no detailed traces. The baseline (no skill) is shown as a reference.

1/5 the tokens, same tasks

169K vs 897K

[Bar chart: total token usage per system relative to the no-skill baseline, in chart order: MEGA Code, HF Upskill, anthropic-skill-creator, Baseline (No Skill), claude-code-skill-factory, skill-builder.]

Vertical line marks baseline (no skill). Bars exceeding baseline mean the system used more tokens than having no skill at all.
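The comparison against baseline follows directly from the per-system token totals reported on this page. The sketch below is illustrative (the names are not from the source): it computes each system's change relative to the 897K baseline and flags the systems whose bars would exceed it.

```python
BASELINE_K = 897  # tokens (thousands) with no skill loaded

# Total tokens (thousands) per system, as reported on this page.
TOKENS_K = {
    "MEGA Code": 169,
    "HF Upskill": 763,
    "anthropic-skill-creator": 826,
    "claude-code-skill-factory": 1448,
    "skill-builder": 2024,
}


def change_vs_baseline_pct(tokens_k, baseline_k=BASELINE_K):
    """Signed percentage change in tokens relative to the baseline."""
    return 100.0 * (tokens_k - baseline_k) / baseline_k


CHANGE = {name: change_vs_baseline_pct(t) for name, t in TOKENS_K.items()}
EXCEEDS_BASELINE = [name for name, pct in CHANGE.items() if pct > 0]
MEGA_REDUCTION_FACTOR = BASELINE_K / TOKENS_K["MEGA Code"]

if __name__ == "__main__":
    for name, pct in CHANGE.items():
        print(f"{name:<28}{pct:+7.1f}% vs baseline")
    print(f"MEGA Code uses 1/{MEGA_REDUCTION_FACTOR:.1f} of baseline tokens")
```

For MEGA Code this works out to roughly an 81% reduction, or a little over 5× fewer tokens, matching the figures quoted elsewhere on the page.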

Combined Average Score

Mean score across all 8 runs per system (4 skills × 2 models). Each skill was scored on 5 test cases, measuring whether the generated skill correctly guided the AI agent to produce the expected output.
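A minimal sketch of that averaging, assuming each run's score is simply the fraction of its 5 test cases that pass (the page does not spell out the per-case scoring). The skill names and pass counts below are placeholders for illustration, not measured results.

```python
from statistics import mean

CASES_PER_SKILL = 5  # from the text: 5 test cases per skill


def combined_score(passed_by_run):
    """Mean per-run pass fraction over all (skill, model) runs, as a percentage."""
    return 100.0 * mean(p / CASES_PER_SKILL for p in passed_by_run.values())


# Placeholder pass counts, NOT measured results: (skill, model) -> cases passed.
example_runs = {
    ("skill_a", "sonnet"): 5, ("skill_a", "haiku"): 4,
    ("skill_b", "sonnet"): 4, ("skill_b", "haiku"): 3,
    ("skill_c", "sonnet"): 5, ("skill_c", "haiku"): 4,
    ("skill_d", "sonnet"): 3, ("skill_d", "haiku"): 4,
}

if __name__ == "__main__":
    print(f"runs: {len(example_runs)}")
    print(f"combined score: {combined_score(example_runs):.1f}%")
```

Filtering the dictionary by model before calling `combined_score` would give the per-model (Sonnet-only or Haiku-only) averages over 4 runs each.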

Combined (Sonnet + Haiku)

| System | Total tokens |
| --- | --- |
| MEGA Code | 169K |
| HF Upskill | 763K |
| anthropic-skill-creator | 826K |
| Baseline (No Skill) | 897K |
| skill-builder | 2,024K |
| claude-code-skill-factory | 1,448K |

[Bar charts: per-model average scores, Sonnet Only and Haiku Only, for MEGA Code, HF Upskill, anthropic-skill-creator, Baseline, skill-builder, and claude-code-skill-factory.]

Structural Quality Comparison

Each cell marks circle (full), triangle (partial), or cross (absent) across 8 structural dimensions of generated skill files.

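| Structural Element | MEGA Code | HF Upskill | skill-factory | skill-builder |
| --- | --- | --- | --- | --- |
| Frontmatter completeness | | | | |
| Trigger precision | | | | |
| Preconditions | | | | |
| Workflow specificity | | | | |
| Rule reasoning (Why/Effect) | | | | |
| Anti-pattern coverage | | | | |
| Common Mistakes (why-it-happens) | | | | |
| Success Indicators | | | | |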

Key Findings

Token Efficiency Winner

MEGA Code achieves the lowest total token usage: 169K vs 763K to 2,024K for competitors, a roughly 5× reduction from the 897K baseline.

Highest Combined Score

78% combined average score vs 65% baseline. The Why/Effect rule structure and preconditions enable correct application in edge cases.

Perfect Structural Quality

16/16 structural score — the only system with explicit preconditions, Why/Effect reasoning on every rule, and verifiable success indicators.

Privacy-First Pipeline

The only system with automated privacy filtering (8 pattern categories) before any data leaves the local machine.

Methodology

Test Harness: HF Upskill eval (identical conditions)

Models: Claude Sonnet & Claude Haiku

Test Cases: 5 per skill, auto-generated by Opus via HF Upskill

Source Material: 10-round full-stack dev session; competitors given minimal prompts only

Skills Evaluated: 4 full-stack skills from a FastAPI + React + Gemini project

Systems: MEGA Code vs. HF Upskill, anthropic-skill-creator, claude-code-skill-factory, skill-builder, Baseline (no skill)

Technical Capability Comparison · 7 Systems

How MEGA Code differs architecturally

7 skill-generation systems compared across 4 structural dimensions.

MEGA Code

Figures it out from your real work

Silently captures your coding sessions

Generates skills AND strategies autonomously

Learns from your entire project history

7 Other Systems

User tells the system what to build

Requires a task description as seed

Generates one skill per prompt

No cross-session learning

1. Input Source

MEGA Code: Auto-captures real coding sessions via lifecycle hooks. No prompt, no trace, no interaction needed.

Other systems: Require a user-written task description. HF Upskill truncates traces to 4K chars. Most need a manual seed.

2. Automation Level

MEGA Code: Fully autonomous, zero-touch from capture to quality-gated output. Run once, forget.

Other systems: Semi-automatic at best (HF Upskill). Most are interactive, with human gates at every step.

3. Strategy Extraction

MEGA Code: Dual output, with task-specific Skills and cross-domain Strategies as distinct artifact types.

Other systems: Task-specific skills only. No system separates strategy-level patterns from skill-level instructions.

4. Quality Control

MEGA Code: LLM judging, multi-metric gating, threshold filtering, and automated privacy masking (8 categories).

Other systems: 3 of 7 have some QC. 4 have none documented. No system has privacy filtering.
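To make the privacy-masking step concrete, here is a minimal sketch of pattern-based masking applied locally before any text is uploaded. The page says MEGA Code uses 8 pattern categories but does not list them, so the three categories and regular expressions below are hypothetical stand-ins, not MEGA Code's actual rules.

```python
import re

# Hypothetical categories for illustration only; the source does not
# enumerate its 8 pattern categories.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def mask(text):
    """Replace every match of each pattern with a [CATEGORY] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    session_line = "contact bob@example.com from 10.0.0.1"
    print(mask(session_line))
```

Running the masking on the local machine, before serialization or upload, is what keeps raw values from ever leaving it; a production filter would need broader patterns and tests for false negatives.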

System Summary

| System | Input Source | Automation | Strategy | Quality Control |
| --- | --- | --- | --- | --- |
| MEGA Code | Auto-captured sessions, multi-session corpus | Fully autonomous | | LLM judging, gating, privacy masking |
| HF Upskill | Task prompt + optional traces | Semi-automatic | | Automated tests, threshold gate |
| SkillWeaver | Web exploration (self-generated) | Fully autonomous | Partial | Iterative self-practice only |
| skill-creator (Anthropic) | Task prompt + subagent transcripts | Interactive | Partial | LLM judging, A/B testing, human gate |
| skill-builder | Task prompt | Interactive | | Manual checklist only |
| Claude-Skill-Builder | Task prompt + 40+ pre-built skills + marketplace | Interactive / Semi-automatic | | Structural conventions only |
| claude-code-skill-factory | Task prompt via guided Q&A + templates | Semi-automatic | | Structural validation only |
| MakeSkill | Natural-language spec | Semi-automatic | | None documented |

Conclusion

MEGA Code is the only system that achieves fully autonomous skill and strategy generation directly from your real coding sessions.

The A/B performance comparison backs this up: MEGA Code achieves the highest combined score (78%) with the lowest token usage (169K, an 81% reduction from baseline) while maintaining perfect structural quality (16/16). It also applies automated privacy filtering before any data leaves the user's machine.

See It In Action

Watch MEGA optimize an agent.

Step through an optimization run end-to-end.
