MEGA Code ⏐ Agents that improve autonomously

MEGA Tech Report · Experiments & Evaluation

Measured, not marketed.

Workflow optimization on the GEPA suite, and wisdom curation on SkillsBench.

Same datasets, same seeds, same grading as the published baselines.

Optimization · 4 Benchmarks · GPT-4.1 Mini

Workflow optimization on the GEPA suite

MEGA's workflow optimizer is benchmarked against MIPROv2, TextGrad, GEPA, and Feedback Descent on four compound AI systems from GEPA's released eval code: multi-hop QA, instruction following, retrieval-based claim verification, and privacy-aware delegation. Splits and seeds match the published setup except for the smaller validation sets noted under the results table; all methods are measured under the same grading criteria.

+7.03pts

Over the strongest prior optimizer (GEPA)

Aggregate 76.55 vs 69.52

System	HotpotQA	IFBench	HoVer	PUPA	Agg.
Baseline	38.00	47.79	46.33	78.57	52.67
MIPROv2	58.00	49.15	48.33	83.37	59.71
TextGrad	62.33	48.64	47.67	85.68	61.08
Feedback Descent	68.33	54.59	57.67	85.66	66.56
GEPA	69.00	55.95	56.67	96.46	69.52
MEGA	72.67	61.05	74.67	97.81	76.55

Scores are test-split accuracy; Agg. is the unweighted mean of the four benchmark scores. Split sizes are train/val/test: HotpotQA 150/100/300, IFBench 150/100/294, HoVer 150/100/300, PUPA 111/111/221. Baseline, MIPROv2, TextGrad, and GEPA scores are cited from the GEPA paper; Feedback Descent scores are cited from its authors. To reduce optimization cost, MEGA used smaller validation sets than GEPA's original splits (for example, 100 vs 300 for HotpotQA); the test splits are unchanged.
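As a quick check on the table, the short sketch below recomputes the Agg. column as an unweighted mean of the four benchmark scores and the gap between MEGA and GEPA. The dictionary and function names are illustrative only and are not part of any released MEGA or GEPA code.

```python
from statistics import mean

# Per-benchmark test scores copied from the table above,
# in the order HotpotQA, IFBench, HoVer, PUPA.
SCORES = {
    "Baseline": [38.00, 47.79, 46.33, 78.57],
    "MIPROv2": [58.00, 49.15, 48.33, 83.37],
    "TextGrad": [62.33, 48.64, 47.67, 85.68],
    "Feedback Descent": [68.33, 54.59, 57.67, 85.66],
    "GEPA": [69.00, 55.95, 56.67, 96.46],
    "MEGA": [72.67, 61.05, 74.67, 97.81],
}


def aggregate(scores):
    """Unweighted mean over the four benchmarks."""
    return mean(scores)


agg = {name: aggregate(vals) for name, vals in SCORES.items()}
gain_over_gepa = agg["MEGA"] - agg["GEPA"]

for name, value in agg.items():
    print(f"{name}: {value:.2f}")
print(f"MEGA over GEPA: +{gain_over_gepa:.2f} pts")
```

An unweighted mean treats each benchmark equally regardless of test-set size, which is why PUPA (221 test items) counts as much as HotpotQA (300).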

Curation · 84 Tasks · 11 Domains

Wisdom curation quality on SkillsBench

This experiment tests whether compositional curation translates into downstream task performance. It covers 84 tasks across 11 domains, each verified by deterministic pytest assertions in isolated Docker containers. Four conditions share the same agent (Gemini 3 Flash), the same 4,207-asset skill pool, the same tasks, and the same verifier; the only variable is the skill-discovery and orchestration method.

46.5%

Highest pass rate, best efficiency, lowest latency

0.566 score/Mtok vs SkillNet 0.424

System	Pass Rate (%)	Avg Tokens/Task (k)	Curation Latency (s/task)	Efficiency (score/Mtok)
No Skills	31.5	894	—	0.353
AgentSkillOS	41.1	1,189	403.4	0.345
SkillNet	41.7	983	37.8	0.424
MEGA (WG)	46.5	822	11.8	0.566

Efficiency is pass rate per megatoken consumed (score/Mtok). Curation latency captures the time to discover and orchestrate skills before execution — AgentSkillOS does sequential LLM calls across a capability tree (up to six levels) plus a DAG plan call; MEGA's PCST retrieval runs over a pre-indexed graph with no LLM calls in the retrieval phase.
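The efficiency column can be approximated from the other two columns. The sketch below assumes efficiency is the pass rate expressed as a fraction, divided by average tokens per task in megatokens; that reading reproduces the MEGA and SkillNet values, while the other two rows differ in the last digit, presumably because the published figures were computed from unrounded inputs. All names here are illustrative.

```python
# (pass rate in %, average tokens per task in thousands), from the table above.
RESULTS = {
    "No Skills": (31.5, 894),
    "AgentSkillOS": (41.1, 1189),
    "SkillNet": (41.7, 983),
    "MEGA (WG)": (46.5, 822),
}


def efficiency(pass_rate_pct, avg_tokens_k):
    """Pass rate as a fraction, divided by average tokens per task in Mtok.

    This definition is an assumption inferred from the table, not a
    formula stated in the report.
    """
    score = pass_rate_pct / 100.0
    mtok = avg_tokens_k / 1000.0
    return score / mtok


for name, (rate, tokens_k) in RESULTS.items():
    print(f"{name}: {efficiency(rate, tokens_k):.3f} score/Mtok")
```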

Real-World Analysis · 7 Systems · Performance Comparison

Real work. Real results.

MEGA Code vs. 7 leading systems — measured on tasks developers actually ship.

Compared head-to-head in A/B performance and across 4 structural dimensions. Every claim on this page is evidence-backed.


Skill Quality Performance · 5 Systems

Token Usage by System

Each system generated skills from a 10-round full-stack development session (FastAPI + React + Gemini chat app). Four skills were extracted and evaluated with HF Upskill's eval harness: 5 test cases per skill, run on both Claude Sonnet and Claude Haiku. Competitors received only 1–2 sentence prompts and no detailed traces. The baseline (no skill) is shown as a reference.

1/5

the tokens, same tasks

169K vs 897K

System	Total Tokens	vs. Baseline
MEGA Code	169K	-81%
HF Upskill	763K	-15%
anthropic-skill-creator	826K	-8%
Baseline (No Skill)	897K	reference
claude-code-skill-factory	1,448K	+61%
skill-builder	2,024K	+126%

Systems listed below the baseline row used more tokens than having no skill at all.

Combined Average Score

Mean score across all 8 runs per system (4 skills × 2 models). Each skill was scored on 5 test cases, measuring whether the generated skill correctly guided the AI agent to produce the expected output.
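A minimal sketch of how such a combined average could be computed is shown below. It assumes each test case is graded pass/fail and that a run's score is the fraction of its 5 cases passed; the skill names and pass counts are hypothetical placeholders, not measured results.

```python
from statistics import mean

CASES_PER_SKILL = 5  # from the methodology: 5 test cases per skill


def run_score(passed):
    """Score for one (skill, model) run: fraction of test cases passed."""
    if not 0 <= passed <= CASES_PER_SKILL:
        raise ValueError("passed must be between 0 and CASES_PER_SKILL")
    return passed / CASES_PER_SKILL


def combined_average(passes_by_run):
    """Mean run score over every (skill, model) pair for one system."""
    return mean(run_score(p) for p in passes_by_run.values())


# Hypothetical pass counts for illustration only (4 skills x 2 models = 8 runs).
example = {
    ("skill_a", "sonnet"): 5, ("skill_a", "haiku"): 4,
    ("skill_b", "sonnet"): 4, ("skill_b", "haiku"): 3,
    ("skill_c", "sonnet"): 5, ("skill_c", "haiku"): 4,
    ("skill_d", "sonnet"): 3, ("skill_d", "haiku"): 4,
}

print(f"combined average: {combined_average(example):.0%}")
```

Averaging per run rather than per test case gives the same result here because every run has the same number of cases.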

[Bar charts: mean score per system, listed in displayed order]

Combined (Sonnet + Haiku): MEGA Code (169K tokens), HF Upskill (763K tokens), anthropic-skill-creator (826K tokens), Baseline (No Skill) (897K tokens), skill-builder (2,024K tokens), claude-code-skill-factory (1,448K tokens)

Sonnet Only: MEGA Code, HF Upskill, Baseline, anthropic-skill-creator, skill-builder, claude-code-skill-factory

Haiku Only: MEGA Code, HF Upskill, anthropic-skill-creator, Baseline, skill-builder, claude-code-skill-factory

Structural Quality Comparison

Each cell is marked with a circle (full), triangle (partial), or cross (absent) for one of 8 structural dimensions of the generated skill files.

Structural Element	MEGA Code	HF Upskill	skill-factory	skill-builder
Frontmatter completeness
Trigger precision
Preconditions
Workflow specificity
Rule reasoning (Why/Effect)
Anti-pattern coverage
Common Mistakes (why-it-happens)
Success Indicators

Key Findings

Token Efficiency Winner

MEGA Code has the lowest total token usage: 169K, vs 763K–2,024K for competitors and 897K for the baseline. That is roughly a 5× (81%) reduction from baseline.
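The reduction figures follow directly from the token totals reported on this page. The sketch below works through the arithmetic; the variable and function names are illustrative.

```python
BASELINE_TOKENS_K = 897  # Baseline (no skill), in thousands of tokens

# Total tokens per system (thousands), as reported on this page.
TOKENS_K = {
    "MEGA Code": 169,
    "HF Upskill": 763,
    "anthropic-skill-creator": 826,
    "claude-code-skill-factory": 1448,
    "skill-builder": 2024,
}


def change_vs_baseline(tokens_k, baseline_k=BASELINE_TOKENS_K):
    """Signed fractional change in tokens relative to the baseline."""
    return (tokens_k - baseline_k) / baseline_k


for name, tokens in sorted(TOKENS_K.items(), key=lambda kv: kv[1]):
    ratio = BASELINE_TOKENS_K / tokens
    print(
        f"{name}: {tokens}K tokens, "
        f"{change_vs_baseline(tokens):+.0%} vs baseline, "
        f"baseline/system = {ratio:.1f}x"
    )
```

For MEGA Code this gives a change of about -81% and a baseline-to-system ratio of about 5.3, which is where the "1/5 the tokens" headline comes from.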

Highest Combined Score

78% combined average score vs 65% baseline. The Why/Effect rule structure and preconditions enable correct application in edge cases.

Perfect Structural Quality

16/16 structural score — the only system with explicit preconditions, Why/Effect reasoning on every rule, and verifiable success indicators.
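One plausible way to arrive at a 16-point maximum is to award 2 points for a full mark, 1 for partial, and 0 for absent across the 8 structural dimensions. The report does not state the point values, so the mapping below is an assumption made for illustration.

```python
# Assumed point values; the report gives only the 16-point maximum.
POINTS = {"full": 2, "partial": 1, "absent": 0}

# The 8 structural dimensions named in the comparison table.
DIMENSIONS = [
    "Frontmatter completeness",
    "Trigger precision",
    "Preconditions",
    "Workflow specificity",
    "Rule reasoning (Why/Effect)",
    "Anti-pattern coverage",
    "Common Mistakes (why-it-happens)",
    "Success Indicators",
]


def structural_score(marks):
    """Sum the points for one system; every dimension must be marked."""
    missing = [d for d in DIMENSIONS if d not in marks]
    if missing:
        raise ValueError(f"unmarked dimensions: {missing}")
    return sum(POINTS[marks[d]] for d in DIMENSIONS)


max_score = POINTS["full"] * len(DIMENSIONS)
all_full = {d: "full" for d in DIMENSIONS}
print(f"{structural_score(all_full)}/{max_score}")
```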

Privacy-First Pipeline

The only system with automated privacy filtering (8 pattern categories) before any data leaves the local machine.
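To make the idea of pattern-based privacy filtering concrete, here is a minimal sketch of local masking with regular expressions. The three categories and their patterns are hypothetical examples; the report states that there are 8 categories but does not list them, and this is not MEGA Code's actual filter.

```python
import re

# Hypothetical pattern categories, for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SECRET": re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+"),
}


def mask(text):
    """Replace every match with a category placeholder and count the hits."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


masked, counts = mask("password=hunter2 sent from dev@example.com at 10.0.0.7")
print(masked)
print(counts)
```

Running the masking step before any upload means only placeholders, never the matched values, leave the machine. A production filter would need broader patterns and tests against false negatives.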

Methodology

Test Harness

HF Upskill eval (identical conditions)

Models

Claude Sonnet & Claude Haiku

Test Cases

5 per skill, auto-generated by Opus via HF Upskill

Source Material

10-round full-stack dev session; competitors given minimal prompts only

Skills Evaluated

4 full-stack skills from a FastAPI + React + Gemini project

Systems

MEGA Code vs. HF Upskill, anthropic-skill-creator, claude-code-skill-factory, skill-builder, Baseline (no skill)

Technical Capability Comparison · 7 Systems

How MEGA Code differs architecturally

MEGA Code compared with 7 other skill-generation systems across 4 structural dimensions.

MEGA Code: Figures it out from your real work

✓ Silently captures your coding sessions
✓ Generates skills AND strategies autonomously
✓ Learns from your entire project history

7 Other Systems: User tells the system what to build

✗ Requires a task description as seed
✗ Generates one skill per prompt
✗ No cross-session learning

Input Source

✓ MEGA Code: Auto-captures real coding sessions via lifecycle hooks. No prompt, no trace, no interaction needed.

✗ Other systems: Require a user-written task description. HF Upskill truncates traces to 4K chars. Most need a manual seed.

Automation Level

✓ MEGA Code: Fully autonomous, zero-touch from capture to quality-gated output. Run once, forget.

✗ Other systems: Semi-automatic at best (HF Upskill). Most are interactive, with human gates at every step.

Strategy Extraction

✓ MEGA Code: Dual output, with task-specific Skills and cross-domain Strategies as distinct artifact types.

✗ Other systems: Task-specific skills only. No system separates strategy-level patterns from skill-level instructions.

Quality Control

✓ MEGA Code: LLM judging, multi-metric gating, threshold filtering, and automated privacy masking (8 categories).

✗ Other systems: 3 of 7 have some automated QC. The other 4 rely on manual checklists or structural checks only, or document none. No system has privacy filtering.
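Multi-metric gating with threshold filtering can be sketched as follows. The metric names, threshold values, and the `Candidate` type are all hypothetical; the report does not specify which metrics MEGA Code gates on or how its LLM judge produces scores.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A generated skill or strategy plus its judge scores in [0, 1]."""
    name: str
    metrics: dict = field(default_factory=dict)


# Hypothetical metrics and thresholds, for illustration only.
THRESHOLDS = {"correctness": 0.8, "specificity": 0.7, "safety": 0.9}


def passes_gate(candidate, thresholds=THRESHOLDS):
    """Pass only if every gated metric is present and meets its threshold."""
    return all(
        candidate.metrics.get(metric, 0.0) >= minimum
        for metric, minimum in thresholds.items()
    )


def filter_candidates(candidates, thresholds=THRESHOLDS):
    """Keep the candidates that clear every threshold."""
    return [c for c in candidates if passes_gate(c, thresholds)]


candidates = [
    Candidate("skill_ok", {"correctness": 0.9, "specificity": 0.8, "safety": 0.95}),
    Candidate("skill_vague", {"correctness": 0.9, "specificity": 0.5, "safety": 0.95}),
    Candidate("skill_unscored", {"correctness": 0.9}),
]

kept = filter_candidates(candidates)
print([c.name for c in kept])
```

Treating a missing metric as a failure keeps unscored artifacts from slipping through the gate.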

System Summary

System	Input Source	Automation	Strategy	Quality Control
MEGA Code	Auto-captured sessions, multi-session corpus	Fully autonomous	✓	LLM judging, gating, privacy masking
HF Upskill	Task prompt + optional traces	Semi-automatic	—	Automated tests, threshold gate
SkillWeaver	Web exploration (self-generated)	Fully autonomous	Partial	Iterative self-practice only
skill-creator (Anthropic)	Task prompt + subagent transcripts	Interactive	Partial	LLM judging, A/B testing, human gate
skill-builder	Task prompt	Interactive	—	Manual checklist only
Claude-Skill-Builder	Task prompt + 40+ pre-built skills + marketplace	Interactive / Semi-automatic	—	Structural conventions only
claude-code-skill-factory	Task prompt via guided Q&A + templates	Semi-automatic	—	Structural validation only
MakeSkill	Natural-language spec	Semi-automatic	—	None documented

Conclusion

MEGA Code is the only system that achieves fully autonomous skill and strategy generation directly from your real coding sessions.

The A/B performance comparison confirms this: MEGA Code achieves the highest combined score (78%) with the lowest token usage (169K) — an 81% reduction from baseline — while maintaining perfect structural quality (16/16). It also applies automated privacy filtering before any data leaves the user's machine.
