
MEGA: Self-Evolving Agent Optimization Infrastructure via Wisdom Graph

A unified infrastructure where each optimization cycle produces durable assets, compositional reasoning over those assets guides the next cycle, and operational evidence refines both the accumulated wisdom and the reasoning that governs it.

Published May 5, 2026 · Tech Report
Read time ~24 min

SkillsBench Pass Rate
46.5% (+4.8 vs SkillNet)
84 tasks · 11 domains · Gemini 3 Flash

Curation Latency
11.8s (−3.2× vs SkillNet)
No LLM calls during retrieval

Aggregate · GPT-4.1 Mini
75.02 (+5.50 vs GEPA)
HotpotQA · IFBench · HoVer · PUPA

Wisdom Graph Pool
4,207 PCR-decomposed
Skills · strategies · curation patterns · trajectories

00 · Abstract

As coding agents increasingly handle implementation, the central challenge shifts from building individual agents to building an infrastructure that systematically improves them.

Current approaches optimize agent systems without accumulating transferable knowledge, accumulate knowledge without compositional reasoning over it, and lack a mechanism for that knowledge to self-evolve through operational evidence. MEGA (Meta Evaluation-Grounded Adaptation) addresses these gaps as a self-evolving infrastructure: each optimization cycle produces durable assets, compositional reasoning over those assets guides subsequent optimization, and operational evidence refines both the accumulated wisdom and the reasoning that governs it.

Layer 1

Session-based Wisdom Generation

Distills reusable wisdom from agent sessions through behavioral pattern clustering and empirical A/B validation, transforming each process into a durable asset.

BIRCH · Quality Gate · A/B Validation

Layer 2

Wisdom Reasoning & Curation

Decomposes assets into atomic PCR units within a typed Wisdom Graph and performs deductive, abductive and inductive reasoning to expand implicit relations.

PCR · WG-DB · PCST · ROI

Layer 3

Evaluation-Driven Optimization

Multi-agent collaborative optimization over heterogeneous workflows, attributing improvement to specific changes through controlled, fixed-seed evaluation.

Seed-Epoch · Evidence · Meta-learning

01 · The Infrastructure Bottleneck

Agent-based software development is entering a new paradigm. Coding agents are increasingly performing implementation, while the human role shifts toward defining optimization objectives, designing evaluation data, and interpreting results. In this transition, the central question is no longer whether an agent can solve a single task well, but whether there exists an infrastructure that systematically and repeatedly improves agent systems based on data.

Tools for automating agent development are proliferating. Stateless automation loops such as the Ralph Loop repeat execution cycles, and test-driven development provides verification at small scale. Prompt optimizers (DSPy, GEPA) and workflow optimizers (AFlow, Flow) automatically improve agent systems. Skill libraries (Voyager, Memento-Skills) store reusable procedures for recurring tasks, while adaptive memories (Reflexion, ACE, ReasoningBank) distill reasoning strategies and reflective insights from past attempts to guide future behavior.

Each of these advances addresses an important aspect of the problem. What remains open is an infrastructure that, beyond merely connecting these approaches, unifies them through a self-evolving cycle in which agent sessions are distilled into verified knowledge, compositional reasoning over that knowledge guides optimization, and optimization evidence refines both the knowledge and the reasoning that governs it.

1.1 · Three limitations of current automation

Limit 01

Optimization without accumulation

Stateless loops automate execution without learning from prior cycles. Prompt and workflow optimizers improve a single project, but the rationale—which strategies worked under which conditions—remains unstructured and non-transferable.

Limit 02

Knowledge without composition

Skill libraries store procedural knowledge but cannot reason about which skills to compose, in what order, under what conditions. The compositionality gap does not shrink with scale.

Limit 03

Knowledge that cannot self-evolve

Without a feedback loop in which optimization evidence verifies whether a composition was actually effective, unverified knowledge accumulates alongside valid knowledge and retrieval quality gradually degrades.

1.2 · MEGA: Evaluation-Driven Agent Development with Self-Evolving Curation

MEGA is an agent development infrastructure that transforms each optimization process into a durable asset, reasons over accumulated assets to compose context-specific execution plans, and evolves the composition logic itself through operational evidence. The key mechanism is a typed Wisdom Graph in which wisdom is decomposed into atomic PCR (Primary-Context-Resultant) units; logical reasoning discovers implicit relations among them; and compositional retrieval assembles plans that include bridging knowledge unreachable by embedding similarity alone. This reasoning substrate is not static—it self-evolves through attributed evidence produced by a multi-agent optimization loop.

Layer 1

Session-based Wisdom Generation

distills reusable wisdom from agent sessions through two-stage clustering and multi-stage quality gating including behavioral A/B validation, transforming each process into structured, reasoning-ready material.

Layer 2

Wisdom Reasoning & Curation

decomposes wisdom into atomic PCR triplets, performs deductive, abductive, and inductive reasoning to expand implicit relations, and assembles context-specific execution plans through compositional retrieval based on the Prize-Collecting Steiner Tree (PCST) formulation, with ROI-driven self-evolution.

Layer 3

Evaluation-Driven Optimization

performs multi-agent collaborative optimization, attributes improvement effects to specific strategy changes through controlled evaluation, and feeds evidence back to Layer 2—driving the self-evolution of both curation strategies and optimization trajectories.

The architecture is designed so that, as optimization cycles accumulate across projects, the Wisdom Graph matures toward delivering effective optimization guidance from the first query alone.

02 · Related Work

Recent literature has rapidly advanced along three axes relevant to agent development automation: agent memory and procedural reuse, structured knowledge and skill composition, and automatic optimization of agent systems. Each axis has achieved meaningful progress in isolation; this section reviews them in turn and identifies the gaps that remain when they operate independently.

2.1 · Agent memory and procedural reuse

Recent work has rapidly expanded persistent memory from simple storage toward reflective organization, procedural reuse, and cross-agent transfer. In the skill-reuse line, Voyager built a library in which agents generate and accumulate code-based skills in a Minecraft environment. Memento-Skills generalized this idea to general-purpose agents, automatically generating markdown skill files on the SRDP framework, rewriting skills upon failure, and selecting them via an InfoNCE and offline-RL behavioral router. ProcMEM formalizes reusable procedural Skills as natural-language units with activation, execution, and termination conditions, converting episodic experience into executable procedural memory.

Reflective and adaptive memory is also active. Reflexion proposed verbal reinforcement learning that feeds linguistic self-reflection into subsequent attempts. Self-Refine demonstrated an iterative critique-and-improve loop driven entirely by the model’s own feedback. MemGPT formalized long-term memory architecture through tiered memory and virtual context management. Dynamic Cheatsheet builds self-curated memory at test time to adapt problem-solving strategies; ACE addresses its context-collapse problem—where monolithic rewriting progressively loses accumulated knowledge—by decomposing memory into atomic bullet units with incremental delta updates (ICLR 2026).

A growing body of work extracts reusable experience from trajectories and transfers it across tasks or agents. DS-Agent reuses expert cases through a case-based reasoning pipeline (ICML 2024). Agent KB proposes a shared memory infrastructure across heterogeneous agent frameworks, foregrounding the foundation for collective agent intelligence. Agent Workflow Memory extracts multi-step routines from successful trajectories for reuse. SWE-Exp accumulates both successful and failed trajectories as a multi-layered experience bank for reusing repair expertise in software issue resolution. ReasoningBank distills generalizable reasoning strategies from both successful and failed trajectories into structured memory items, demonstrating agent self-evolution at test time (ICLR 2026).

This line of work has made meaningful progress in agent memory. However, most systems treat accumulated artifacts as independent units retrieved by similarity, without modeling the relations among them—prerequisite dependencies, mutual exclusions, or conditional applicability. Whether the unit is a procedural skill, a reflective strategy, or a cross-agent experience bank, stored items remain structurally isolated: the system knows what it has, but not how those items relate to one another. The authors of ReasoningBank explicitly identify compositional memory as an open direction. The next section examines work that imposes such relational structure.

2.2 · Structured knowledge and skill composition

A parallel line of work imposes structure on knowledge—ranging from commonsense facts to operational skills—and attempts reasoning or composition over it. ConceptNet and ATOMIC organize commonsense relations in fixed-type node ontologies, while Pearl formalizes the notions of sufficiency and necessity in causal inference, providing theoretical tools for reasoning beyond mere correlation. On the retrieval side, GraphRAG improved answer quality through entity graphs and community-level summaries, LightRAG introduced graph-aware indexing with incremental updates, and HippoRAG 2 unified factual, sense-making, and associative memory through non-parametric continual learning (ICML 2025). These systems demonstrate that imposing structure on factual knowledge enables better retrieval, but they do not address the composition of operational skills and strategies.

More recent work moves beyond storing skills toward orchestrating and curating them. AgentSkillOS organizes skills into a hierarchical capability tree and composes them through task-specific DAG orchestration with three composition strategies (quality-first, efficiency-first, simplicity-first), demonstrating that structured composition substantially outperforms flat invocation. SkillNet connects skills through a typed multi-relational graph with four relation categories (similar_to, belong_to, compose_with, depend_on) and provides a multi-dimensional quality evaluation framework, achieving consistent reward improvements across multiple agent architectures. Together, these systems validate a key premise: the bottleneck in skill-based agents is not skill availability but orchestration—how skills are organized, selected, and composed for a given task.

However, composition still operates at the whole-skill level—neither system decomposes a compound skill into atomic sub-units whose internal dependencies can be individually reasoned about. Relational structures also remain relatively coarse: a hierarchical tree without cross-branch edges, or a small set of discrete relation types that do not distinguish the strength of association. Nor do these systems perform logical inference to discover relations that were never explicitly recorded, or self-refine through operational evidence—once a skill is registered or a relation labeled, it remains static regardless of whether downstream execution confirms or contradicts it. The next section examines whether optimization systems address these gaps from the other direction.

2.3 · Automatic optimization of agent systems

Research on automatically improving agent system performance is growing rapidly and bifurcating into two tiers defined by the optimization object. The first tier optimizes prompts, instructions, demonstrations, and related settings over fixed LM programs. DSPy frames LM calls as declarative programs whose prompts and examples are automatically optimized by a compiler. MIPROv2 jointly searches instructions and few-shot demonstrations for multi-stage LM programs. GEPA reflectively analyzes trajectories—including tool calls and intermediate outputs—to evolve prompt updates. Feedback Descent generalizes text-level optimization through pairwise comparison with cumulative textual rationale. Optimas optimizes heterogeneous configurations including prompts, hyperparameters, and model parameters while maintaining component-level local rewards for globally aligned optimization of compound AI systems.

The second tier treats the workflow as a mutable optimization target at the code-node level, rather than refining prompts within an LM program. AFlow formalizes agentic workflow optimization as a search problem over code-represented workflows; the core of optimization lies not in better-phrasing prompt chains but in restructuring LLM-invoking nodes, their connections, task decomposition, execution order, and control flow at the code level. Flow represents workflows as activity-on-vertex graphs and performs dynamic subtask allocation and continuous workflow refinement. The defining characteristic of this tier is that it treats agent systems as mutable code/graphs and optimizes node composition, edge structure, task decomposition, and execution policies as first-class optimization variables.

The two tiers share two structural limitations. First, neither accumulates why a particular strategy worked under particular conditions as structured, cross-project transferable knowledge. Second, both tiers assume that evaluation data is externally provided and treat it as fixed input, even though a growing body of work demonstrates that synthetic data meaningfully improves agent capabilities—SWE-Next for coding agents, ProgSearch for web research agents, and SandMLE for ML engineering agents—leaving the evaluation dataset itself outside the optimization loop.

Recent empirical studies independently corroborate the first limitation. Yu et al. demonstrate that test-driven verification is often insufficient—345 erroneous patches passed SWE-bench tests due to inadequate coverage—and Joshi et al. characterize existing SWE-bench agents as stateless, treating each issue independently without cross-task experience accumulation. Stateless automation loops such as the Ralph Loop exhibit the same pattern: each cycle does not learn from prior cycles, so productive work and waste are automated equally.

The preceding review reveals a consistent pattern. Memory systems accumulate experience but do not decompose it into units amenable to compositional reasoning (Section 2.1). Graph-based systems and skill orchestration frameworks impose structure but do not decompose skills into atomic reasoning units, discover implicit relations through logical inference, or self-refine through execution evidence (Section 2.2). Optimization systems improve agent performance but do not structure transferable knowledge about why a strategy succeeded, nor do they generate the evaluation data on which optimization depends (Section 2.3).

03 · Architecture & Design Principles

MEGA’s three layers form a cycle in which optimization produces evidence, evidence refines the Wisdom Graph, and the refined graph guides subsequent optimization:

𝒲⁽ᵗ⁾ = ℒ₁(𝒯⁽ᵗ⁾ ∪ 𝒯₃⁽ᵗ⁻¹⁾),  Π⁽ᵗ⁾ = ℒ₂(𝒲⁽ᵗ⁾, ℰ⁽ᵗ⁻¹⁾),  ⟨ℰ⁽ᵗ⁾, 𝒯₃⁽ᵗ⁾⟩ = ℒ₃(Π⁽ᵗ⁾)    (1)

Here, t indexes the outer self-evolution cycle of MEGA. 𝒯⁽ᵗ⁾ denotes general agent session traces, while 𝒯₃⁽ᵗ⁾ denotes execution sessions produced by Layer 3’s optimization loop. The union 𝒯⁽ᵗ⁾ ∪ 𝒯₃⁽ᵗ⁻¹⁾ indicates that Layer 1 consumes both trace sources through a single PCR extraction pipeline. We initialize ℰ⁽⁰⁾ = ∅ and 𝒯₃⁽⁰⁾ = ∅, so the first cycle operates without prior evidence. The three layers form a self-reinforcing cycle. Layer 3 produces two outputs that flow backward: evaluation evidence ℰ⁽ᵗ⁾ refines Layer 2’s curation and planning, while execution sessions 𝒯₃⁽ᵗ⁾ become new input to Layer 1 for fresh wisdom distillation.

MEGA three-layer cycle: Sessions feed Layer 1 which produces wisdom into Layer 2, which produces plans into Layer 3, which feeds evidence and sessions back.
Figure 1 MEGA architecture. 𝒲 comprises four wisdom types (skill, strategy, curation pattern, optimization trajectory). Π(q) is a context-specific execution plan. Evidence includes verdicts, curation feedback, and optimization trajectories. Execution sessions 𝒯₃ feed into Layer 1 as new extraction input.
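Equation (1) can be read as a simple loop. The sketch below uses stub functions standing in for ℒ₁–ℒ₃ (all names hypothetical, not MEGA's actual API) to show how evidence and Layer-3 sessions flow backward into the next cycle:

```python
def layer1(traces):             # stub ℒ₁: distill traces into "wisdom"
    return {f"w({x})" for x in traces}

def layer2(wisdom, evidence):   # stub ℒ₂: compose a plan from wisdom + evidence
    return sorted(wisdom | evidence)

def layer3(plan):               # stub ℒ₃: produce evidence and new sessions
    return {f"e({len(plan)})"}, {f"s({len(plan)})"}

def run_cycles(trace_batches):
    evidence, l3_sessions = set(), set()          # ℰ⁽⁰⁾ = 𝒯₃⁽⁰⁾ = ∅
    for batch in trace_batches:                   # t = 1 … n
        wisdom = layer1(set(batch) | l3_sessions)   # 𝒲⁽ᵗ⁾ = ℒ₁(𝒯⁽ᵗ⁾ ∪ 𝒯₃⁽ᵗ⁻¹⁾)
        plan = layer2(wisdom, evidence)             # Π⁽ᵗ⁾ = ℒ₂(𝒲⁽ᵗ⁾, ℰ⁽ᵗ⁻¹⁾)
        evidence, l3_sessions = layer3(plan)        # ⟨ℰ⁽ᵗ⁾, 𝒯₃⁽ᵗ⁾⟩ = ℒ₃(Π⁽ᵗ⁾)
    return evidence, l3_sessions
```

The structural point is that Layer 3's two outputs are the only state carried across cycles: one feeds Layer 2's next curation, the other re-enters Layer 1 as raw material.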

3.1 · Inter-layer artifact contract

Specifying each layer’s inputs and outputs makes the system boundaries precise. Table 1 summarizes this contract.

Table 1 · Inter-layer artifact contract. The input, output, and core transformation of each layer are specified.

Layer | Input | Output | Core transformation
1 | Session traces 𝒯 ∪ 𝒯₃ | Wisdom assets 𝒲 | Clustering + quality gating
2 | 𝒲 + evidence ℰ | Execution plan Π(q) | PCR reasoning + PCST + ROI
3 | Π(q) + execution results | Evidence ℰ, sessions 𝒯₃ | Fixed-seed attribution

3.2 · Four types of wisdom

MEGA accumulates not only skills but also strategies, curation patterns, and optimization trajectories as causal judgments. All four types share PCR semantics within WG-DB and are subject to the same reasoning and curation mechanisms.

Skill

Reusable procedural knowledge produced by Layer 1 and consumed by Layer 2 retrieval.

Producer: Layer 1
Consumer: Layer 2 (Retrieval)

Strategy

Decision rules and judgment criteria produced by Layer 1 and consumed by Layer 2 curation.

Producer: Layer 1
Consumer: Layer 2 (Curation)

Curation pattern

Which skills and strategies worked in what order and under what conditions—produced by Layer 2/3, consumed by Layer 2 planning.

Producer: Layer 2/3
Consumer: Layer 2 (Planning)

Optimization trajectory

Which optimization trajectories succeeded or failed—produced by Layer 3, consumed by Layer 2/3.

Producer: Layer 3
Consumer: Layer 2/3

Curation patterns, in particular, create substantial value in practice. Many failures stem not from the absence of individual skills but from invoking the right skills in the wrong order, under the wrong conditions, or with premature confidence. Knowing which orderings were stable, which evidence thresholds must be met before proceeding to the next step, and which failure signals should trigger a return to exploration mode—these are Wisdom-level judgments that cannot be captured at the Knowledge level. Optimization trajectories are likewise not mere logs. They encode operational wisdom about “what changes were attempted, what improved, and what broke,” serving as higher-order wisdom that Layer 2 references when determining exploration/exploitation strategies for future tasks.

3.3 · Wisdom Graph as a reasoning substrate

In the DIKW pyramid of information science, the transition from Knowledge to Wisdom has traditionally been discussed in terms of value judgment and effectiveness. We reinterpret that transition computationally. A Knowledge Graph is a network representation of information. It structures factual relations as nodes and edges — “A is related to B,” “X is a type of Y.” ConceptNet can store “a car runs on a road,” but it cannot derive on its own that “when the road is wet, the car’s braking distance increases.” This is structured storage, not reasoning.

MEGA’s WG-DB operates beyond this representational level. We use the term wisdom to denote a reusable causal judgment of the form “under context v_C, action v_P produces resultant v_R, realized via method m.” Each wisdom asset decomposes into atomic PCR (Primary–Context–Resultant) triplets with typed dependencies — prerequisite, exclusion, conditional branching, and fallback — among sub-units, and four disjoint subtypes (skills, strategies, curation patterns, optimization trajectories) share these PCR semantics.

The Wisdom Graph (WG-DB) is then the directed typed multi-graph G = (V, E) whose nodes are these PCR components and whose edges carry Sufficiency/Necessity scores σ(e) = (S(e), N(e)) ∈ [0, 1]² inspired by Pearl’s causal calculus, with a role-fluid node pool that lets the same concept serve as Primary in one triplet and as Context in another. Over this structure, MEGA performs deduction, abduction, and induction to derive relations that were never explicitly recorded, and operational evidence feeds back to enable the graph to self-refine. Furthermore, WG-DB does not remain a reasoning substrate for a single project. Skills, strategies, curation patterns, and optimization trajectories accumulated across multiple projects mitigate the cold-start problem of new projects, enabling them to begin from previously validated structures.
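The triplet-and-edge shape described above can be made concrete in a short sketch. The class and field names below are illustrative, not MEGA's actual schema; the point is the role-fluid node pool and the dual-axis (S, N) edge score:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    concept: str                 # role-fluid: the same node may serve as P, C, or R

@dataclass(frozen=True)
class Edge:
    src: Node
    dst: Node
    kind: str                    # prerequisite | exclusion | conditional | fallback
    sufficiency: float           # S(e) ∈ [0, 1]: target tends to follow source
    necessity: float             # N(e) ∈ [0, 1]: target unlikely without source

@dataclass
class Triplet:
    primary: Node                # v_P: core action
    context: Node                # v_C: application condition
    resultant: Node              # v_R: expected outcome
    method: str                  # m: implementation details

# Role fluidity, using the document's own example concept:
branching = Node("Git branching strategy")
t1 = Triplet(primary=branching,                     # acts as an action here...
             context=Node("starting new feature work"),
             resultant=Node("isolated feature branch"),
             method="create a branch per feature")
t2 = Triplet(primary=Node("safe release rollback"),
             context=branching,                     # ...and as a precondition here
             resultant=Node("recoverable main history"),
             method="revert the merge commit")
assert t1.primary is t2.context  # one pooled node, two roles, no duplication
```

Because nodes carry no fixed type, cross-skill edges arise wherever two triplets share a concept, which is what keeps the node count below that of fixed-type graphs.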

04 · Layer 1 · Session-based Wisdom Generation

Layer 1 distills raw conversation sessions into reusable wisdom assets. This is not simple log storage but knowledge distillation: it extracts structured rules, workflows, and strategies from sessions spanning hundreds to thousands of turns, and passes only those that survive multi-stage quality verification—structural filtering, self-evaluation scoring, and behavioral A/B testing—to Layer 2. As a prerequisite, privacy-preserving filtering—including secret masking and path anonymization—is applied client-side before any data leaves the local environment, ensuring that the entire pipeline operates on sanitized traces.

Raw Traces 𝒯
→ Privacy-Preserving Filtering (client-side, before network)
→ Task Labeling & Evidence Grounding
→ Domain Pre-Clustering
→ BIRCH Behavioral Pattern Discovery
→ LLM-based Knowledge Synthesis
→ Conflict Resolution & Structural Filtering (hash dedup + LLM conflict)
→ Behavioral A/B Validation (with-skill vs. baseline)
→ Cross-Run Deduplication
Validated Wisdom Assets 𝒲
Figure 2 Layer 1 extraction pipeline. Client-side privacy filtering precedes all processing. BIRCH discovers behavioral patterns in a single O(n) pass, and multi-stage quality gating admits only verified wisdom.

4.1 · The extraction pipeline

Many memory systems either feed entire sessions back as long context or vectorize session fragments for retrieval. This approach is useful for short FAQ-style problems but reveals three limitations in complex agent workflows. First, raw traces mix signal and noise, making the unit of reuse unclear. Second, when sessions that solved the same problem with different strategies are mixed, retrieval may return conflicting instructions. Third, returning entire sessions blurs the boundaries of privacy filtering and evidence attribution.

Distilled wisdom, by contrast, first determines “what is the essential pattern” before storing it. Layer 1’s task is not text compression but the separation of only those structures from a session that have reuse value. Only by explicitly structuring which error signal was the starting point, which action was the key fix, which result was verified, and which exception conditions apply can Layer 2 accept the output as a reasoning substrate.

4.2 · Two-stage clustering: domain separation + behavioral pattern discovery

The technical core of Layer 1 is two-stage clustering.

Stage 1

Domain Pre-Clustering

Sessions are classified by domain via greedy assignment based on embedding similarity, where domains emerge from the data rather than being predefined. This stage narrows the search space for subsequent BIRCH clustering, improving efficiency.
Stage 2

Behavioral Pattern Discovery

BIRCH is applied within each domain cluster. Each sub-cluster is represented by a Clustering Feature (CF) tuple CF = (N, LS, SS), where N is the count, LS = Σᵢ₌₁ᴺ xᵢ is the linear sum of session embeddings, and SS = Σᵢ₌₁ᴺ ‖xᵢ‖² is the squared sum. CFs are additive — merging two disjoint sub-clusters requires only:
CF₁ + CF₂ = (N₁ + N₂, LS₁ + LS₂, SS₁ + SS₂)    (2)

This O(1) merge operation enables single-pass O(n) processing over the full session set. Using the radius-based threshold criterion, a new session embedding x is assigned to the closest existing sub-cluster when the resulting radius remains within the threshold T:

R(CF′) = √( SS′/N′ − ‖LS′‖²/N′² ) ≤ T    (3)

where CF′ = CF + (1, x, ‖x‖²). If the condition is violated, a new sub-cluster is created. This mechanism produces an adaptive number of sub-clusters without requiring a pre-specified k — a critical property since the number of behavioral patterns in agent sessions cannot be known a priori. The CF-Tree structure further ensures memory efficiency by bounding tree width through a branching factor.
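The CF arithmetic of Eqs. (2)–(3) is compact enough to sketch directly. The toy version below works on 1-D embeddings for readability (a real implementation uses embedding vectors and a CF-Tree); the threshold and data are illustrative:

```python
import math

def merge(cf1, cf2):
    """Eq. (2): CFs are additive, so merging is O(1)."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

def radius(cf):
    """Eq. (3): R = sqrt(SS/N − ‖LS‖²/N²) for a CF = (N, LS, SS)."""
    n, ls, ss = cf
    return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def assign(clusters, x, threshold):
    """Add point x to the nearest CF if the merged radius stays ≤ T,
    else open a new sub-cluster — no pre-specified k."""
    point_cf = (1, x, x * x)
    best = min(range(len(clusters)),
               key=lambda i: abs(clusters[i][1] / clusters[i][0] - x),
               default=None)
    if best is not None and radius(merge(clusters[best], point_cf)) <= threshold:
        clusters[best] = merge(clusters[best], point_cf)
    else:
        clusters.append(point_cf)

clusters = []
for x in [1.0, 1.1, 0.9, 5.0, 5.2]:
    assign(clusters, x, threshold=0.5)
# Two behavioral patterns emerge adaptively: one near 1.0, one near 5.1.
```

Each point is processed once against a bounded set of CFs, which is where the single-pass O(n) claim comes from.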

This two-stage structure transforms Layer 1 from “conversation storage” into behavioral-pattern-based knowledge distillation. Sessions that are semantically similar but behaviorally distinct — sessions that solved the same error with different strategies — are separated and synthesized into independent wisdom assets.

4.3 · Synthesis and structural filtering

Behavioral patterns extracted from clusters are synthesized by an LLM into structured wisdom. This synthesis is not simple summarization but a process of generalizing common patterns within a cluster and specifying exception conditions. Synthesized candidates undergo structural filtering: (1) hash-based exact deduplication (O(n)), (2) LLM-based conflict detection — identifying and resolving contradictory instructions and mutually exclusive candidates, (3) cross-run embedding-similarity deduplication — merging candidates extracted from different sessions that are semantically identical. This structural filtering prevents knowledge-base contamination. For sessions where extraction could not complete, a pending management mechanism preserves partial results, enabling subsequent retries to skip already-completed stages and resume processing.
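Stages (1) and (3) of the structural filter can be sketched as below; the LLM-based conflict detection of stage (2) is omitted, and embedding similarity is replaced by a toy token-overlap measure. Both stand-ins are illustrative, not MEGA's actual components:

```python
import hashlib

def exact_dedup(candidates):
    """Stage (1): hash-based exact deduplication, O(n)."""
    seen, out = set(), []
    for text in candidates:
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(text)
    return out

def similarity(a, b):
    """Toy stand-in for embedding cosine similarity (Jaccard on tokens)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def cross_run_dedup(candidates, threshold=0.8):
    """Stage (3): merge near-identical candidates from different runs."""
    kept = []
    for c in candidates:
        if all(similarity(c, k) < threshold for k in kept):
            kept.append(c)
    return kept

batch = ["Retry with backoff", "Retry with backoff", "Always pin dependency versions"]
assert len(exact_dedup(batch)) == 2
```

The two stages are deliberately ordered cheapest-first: exact hashing removes verbatim copies before any pairwise similarity work is done.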

4.4 · Behavioral A/B validation

Evaluating extracted knowledge typically relies on proxy signals — metadata quality, LLM-generated ratings, or structural completeness. These signals assess the artifact itself rather than its effect on downstream agent behavior. MEGA Layer 1 introduces behavioral A/B validation, which directly measures the causal impact of each wisdom candidate on task performance. Test cases are generated alongside each wisdom candidate using an automated teacher-student evaluation protocol: a high-capability model produces both the wisdom document and its associated test suite; the same or a stronger model then executes the test suite under treatment and control conditions. For each wisdom candidate that passes synthesis and structural filtering, these test cases are executed in parallel under two conditions:

Group 1

Treatment group

execution with the extracted wisdom injected into the system prompt

Group 2

Control group

execution with the same input under the baseline condition without wisdom

The outputs of the two runs are compared to empirically measure performance lift (accuracy improvement) and token efficiency (cost change). These two ROI axes form the basis for accept/reject decisions. Three key differentiators define this design. First, the evaluation criterion is observable behavioral change, not subjective quality judgment. Second, the two axes of performance lift and token efficiency are independently valuable — wisdom that significantly reduces cost even at equivalent accuracy can still pass. Third, this validation runs automatically for all candidates, maintaining wisdom quality at the operational stage without human review.
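The accept/reject rule over the two ROI axes can be sketched as follows. The threshold values and the exact shape of the rule are illustrative assumptions (the report does not specify them); what is faithful to the text is that accuracy lift and token efficiency are independently sufficient grounds for acceptance:

```python
def ab_verdict(treat_acc, ctrl_acc, treat_tokens, ctrl_tokens,
               min_lift=0.02, min_saving=0.10):
    """Compare treatment (wisdom injected) vs. control (baseline) runs.
    min_lift / min_saving are hypothetical thresholds."""
    lift = treat_acc - ctrl_acc                          # performance lift axis
    saving = (ctrl_tokens - treat_tokens) / ctrl_tokens  # token efficiency axis
    if lift >= min_lift:
        return "accept"                # clear accuracy improvement
    if lift >= 0.0 and saving >= min_saving:
        return "accept"                # equivalent accuracy at lower cost
    return "reject"

assert ab_verdict(0.70, 0.65, 1000, 1000) == "accept"   # accuracy lift
assert ab_verdict(0.65, 0.65, 800, 1000) == "accept"    # same accuracy, −20% tokens
assert ab_verdict(0.60, 0.65, 500, 1000) == "reject"    # cheaper but worse
```

The third case is the important one: cost savings alone never rescue an accuracy regression.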

Consequently, Layer 1’s quality assurance comprises three tiers: structural filtering (deduplication and conflict removal), self-evaluation scoring, and behavioral A/B validation (empirical ROI measurement). Self-evaluation scoring collects factual, countable properties of the generated document — such as the ratio of rules backed by concrete commands or code examples — and aggregates them via a weighted harmonic mean that naturally penalizes any single weak dimension. Only wisdom that passes this entire pipeline enters Layer 2’s reasoning substrate.
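The weighted harmonic mean used in self-evaluation scoring can be illustrated directly. The dimension names and weights below are hypothetical; the property that matters is that one weak dimension drags the aggregate down far more than an arithmetic mean would:

```python
def weighted_harmonic_mean(scores, weights):
    """Aggregate per-dimension scores in (0, 1]; low outliers dominate."""
    assert all(s > 0 for s in scores.values())
    num = sum(weights[k] for k in scores)
    den = sum(weights[k] / scores[k] for k in scores)
    return num / den

weights = {"command_backed": 0.5, "example_ratio": 0.3, "structure": 0.2}
balanced = weighted_harmonic_mean(
    {"command_backed": 0.8, "example_ratio": 0.8, "structure": 0.8}, weights)
lopsided = weighted_harmonic_mean(
    {"command_backed": 0.8, "example_ratio": 0.8, "structure": 0.1}, weights)
assert abs(balanced - 0.8) < 1e-9   # uniform scores pass through unchanged
assert lopsided < 0.4               # one weak dimension collapses the aggregate
```

An arithmetic mean of the lopsided case would still be 0.66; the harmonic mean yields about 0.33, which is why it "naturally penalizes any single weak dimension."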

05 · Layer 2 · Wisdom Reasoning & Curation

Layer 2 connects wisdom distillation (Layer 1) with optimization (Layer 3). Before any asset enters the graph, a security screening gate inspects the full skill package—instruction documents, prompt templates, and attached scripts—for two classes of contamination: instruction-level threats such as prompt injection or workflow hijacking embedded in text, and execution-level threats such as exfiltration logic or destructive commands in code. Only packages that pass screening become eligible for graph insertion. At build-time, Layer 2 decomposes incoming wisdom into atomic PCR units, normalizes nodes, and expands the graph through logical inference; at query-time, it assembles context-specific execution plans through compositional retrieval and evidence-driven curation.

5.1 · Compositional retrieval, not skill selection

On the surface, MEGA may appear similar to a skill combination system in that it bundles multiple skills into a plan. However, the essence of a combination engine is scoring and combining a given candidate pool. In MEGA, by contrast, Layer 2 extends the graph through reasoning before combination. The system does not stop at selecting already-stored items; it infers bridge, prerequisite, exclusion, and fallback relations among stored items that have not yet been made explicit.

A combination engine typically selects “three similar skills” and lists them in a prompt. MEGA, however, first identifies which types of wisdom are jointly needed, then decomposes the role each wisdom plays within the plan. Some items become direct execution targets, others become conditionals, and still others become fallback rules activated only upon failure. Plan generation in MEGA is therefore not the construction of a candidate set but the placement of role-differentiated artifacts into an execution graph. Through trajectories and verdict feedback, MEGA produces conditional wisdom about “when not to invoke,” learning the boundaries of success and failure simultaneously.

5.2 · PCR atomic decomposition

Each wisdom asset undergoes hierarchical PCR decomposition: a compound skill is factored into atomic sub-skill units, each represented as a PCR triplet. The logical dependencies among sub-skills — prerequisite ordering, mutual exclusion, and conditional branching — are captured as typed edges within the PCR graph. This decomposition enables Layer 2 to reason about the internal structure of complex skills, not merely their external interfaces, thereby supporting compositional retrieval at sub-skill granularity.

Definition 1

Wisdom Triplet

ω = (v_P, v_C, v_R, m) ∈ V³ × M. v_P (Primary): the core action, v_C (Context): the application condition, v_R (Resultant): the expected outcome, m (Method): implementation details. Each edge e = (u, v) carries a dual-axis score inspired by Pearl’s notions of sufficiency and necessity:

σ(e) = (S(e), N(e)) ∈ [0, 1]²    (4)

S(e) denotes the degree to which the target node follows when the source node is present (sufficiency), and N(e) denotes the degree to which the target node is unlikely to occur without the source node (necessity).

WG-DB uses a Role-Fluid node pool: the same node can serve as P in one triplet and as C in another. Unlike structures such as ConceptNet or ATOMIC that assign fixed types to nodes, this design allows the same concept (e.g., “Git branching strategy”) to function naturally as an action to perform (Primary) in one context and as a precondition (Context) in another. This role fluidity naturally generates cross-skill connections and forms a richer relational network while reducing the node count compared to fixed-type graphs.

5.3 · PCR reasoning: three classical modes

PCR triplets are not merely a storage format. They are what transforms Layer 2 from “a knowledge graph with retrieval preprocessing” into a logical inference layer. The directional relations among P, C, and R provide the structural basis for three classical modes of reasoning.

Deduction

Transitive causal inference

When P —σ₁→ C and C —σ₂→ R relations exist in the PCR structure, P —σ′→ R is inferred transitively. Multiplicative score decay is applied at each hop so that longer-range inferences receive exponentially more conservative scores, suppressing false positives. This inference applies not only within a single triplet but equally to cross-wisdom causal chains formed when one wisdom's result (R) matches another wisdom's action (P).
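The multiplicative decay can be sketched directly. The per-hop decay factor λ below is an assumed hyperparameter, not a value from the report; what matters is the shape, where each extra hop makes the inferred score strictly more conservative:

```python
# Deductive chaining sketch: hop scores multiply, and an extra per-hop
# decay factor (assumed value) makes longer chains exponentially more
# conservative, suppressing false positives.

def chain_score(hop_scores, decay=0.8):
    s = 1.0
    for hop in hop_scores:
        s *= hop * decay
    return s

# two-hop P -> C -> R chain: 0.9 * 0.8 * 0.8**2 = 0.4608
assert abs(chain_score([0.9, 0.8]) - 0.4608) < 1e-9
# adding a hop always lowers the score below the two-hop sub-chain
assert chain_score([0.9, 0.8, 0.9]) < chain_score([0.9, 0.8])
```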

Abduction

Reverse causal inference

When P → R and C → R edges exist — that is, when action P and condition C share the same result R — the system back-traces "why the same result?" to discover a hidden P → C relation. The key premise is that the Necessity score of C → R must be sufficiently high. A Bayesian posterior estimate combining observed result coverage, the causal strength of C → R, and semantic relatedness yields the confidence of candidate relations.
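A toy version of the confidence calculation follows. The combination rule, the necessity threshold, and the uniform prior are all assumptions for illustration; the report specifies only that coverage, the causal strength of C → R, and semantic relatedness are combined into a Bayesian posterior:

```python
# Abduction confidence sketch (assumed combination rule): the key
# premise is gated first (necessity of C -> R must be high), then the
# three signals are combined against a coin-flip prior.

def abduction_confidence(coverage, necessity_cr, relatedness, n_min=0.7):
    if necessity_cr < n_min:          # premise fails: no hidden P -> C
        return 0.0
    evidence = coverage * necessity_cr * relatedness
    prior = 0.5
    # posterior for "hidden P -> C relation exists"
    return (evidence * prior) / (evidence * prior + (1 - evidence) * (1 - prior))

assert abduction_confidence(0.9, 0.5, 0.9) == 0.0   # necessity too low
strong = abduction_confidence(0.9, 0.9, 0.9)
weak = abduction_confidence(0.4, 0.8, 0.5)
assert strong > weak > 0.0
```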

Induction

Statistical pattern discovery

When multiple P nodes are simultaneously connected to the same C and the same R — P₁, …, Pₙ → C and P₁, …, Pₙ → R — the system examines the possibility of a direct relation C → R. A hypergeometric test verifies whether this co-occurrence exceeds chance, and the C → R edge is inferred only if statistically significant.
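The hypergeometric test is standard and can be implemented from the stdlib. The framing below (population of P-nodes, overlap between those pointing at C and those pointing at R) and the significance level are assumptions for illustration:

```python
from math import comb

# Inductive edge proposal sketch: of N P-nodes, K point at C and n
# point at R; k point at both. A small upper-tail hypergeometric
# p-value means the overlap exceeds chance, so C -> R is proposed.

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

def infer_cr_edge(N, K, n, k, alpha=0.05):
    return hypergeom_pvalue(N, K, n, k) < alpha

# 10 of 40 P-nodes point at C, 10 at R; expected overlap is 2.5.
assert infer_cr_edge(40, 10, 10, 8)       # 8 shared: far above chance
assert not infer_cr_edge(40, 10, 10, 2)   # 2 shared: unremarkable
```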

Figure 3 · PCR Reasoning. The directional P→C→R structure of PCR provides the structural basis for three classical modes of reasoning. Solid lines denote observed edges; dashed lines denote inferred edges.

5.4 · PCST-based compositional retrieval

Embedding similarity search returns only knowledge directly similar to the query. It cannot discover “bridging knowledge” — knowledge essential for connecting multiple skills but exhibiting low similarity to the query — which is critical for complex tasks.

WG-DB formalizes this as a Prize-Collecting Steiner Tree (PCST) problem over the undirected projection Ḡ of WG-DB, obtained by retaining the maximum causal strength for each undirected node pair:

T* = arg max_{T ⊆ Ḡ, T connected} [ Σ_{v ∈ V(T)} π(v; q) − Σ_{e ∈ E(T)} c(e) ]  (5)

π(v; q) is a prize proportional to query relevance, and c(e) is a cost proportional to causal weakness. Since c(e) ≥ 0, any optimal connected solution is a tree. Non-seed nodes in the PCST solution — those with low query similarity but essential for connecting seeds — constitute the bridging wisdom.

Crucially, PCST does not retrieve only skill-type wisdom. Since all wisdom types coexist in the same graph, PCST performs mixed-type subgraph assembly that cross-composes heterogeneous wisdom types including strategies, curation patterns, and trajectories.
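As a toy illustration of Eq. (5), the objective can be solved by brute force on a tiny graph. The node names, prizes, and costs below are invented; a production system would use an approximation algorithm such as Goemans–Williamson rather than enumeration:

```python
from itertools import combinations

# Toy PCST: two seed nodes (high prize), one bridge (low prize, cheap
# edges), one off-topic node (moderate prize, expensive edges).
prize = {"s1": 1.0, "s2": 1.0, "b1": 0.1, "o1": 0.2}   # pi(v; q)
edges = {("s1", "b1"): 0.2, ("b1", "s2"): 0.2,          # c(e)
         ("s1", "o1"): 0.9, ("o1", "s2"): 0.9}

def mst_cost(nodes, sub_edges):
    """Kruskal's MST cost, or None if the node set is disconnected."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    cost, used = 0.0, 0
    for (u, v), c in sorted(sub_edges.items(), key=lambda kv: kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            cost += c
            used += 1
    return cost if used == len(nodes) - 1 else None

def best_tree():
    """Enumerate connected subsets, maximizing prizes minus tree cost."""
    best, best_score = None, float("-inf")
    for r in range(1, len(prize) + 1):
        for sub in combinations(prize, r):
            sub_edges = {e: c for e, c in edges.items()
                         if e[0] in sub and e[1] in sub}
            cost = mst_cost(set(sub), sub_edges)
            if cost is None:
                continue
            score = sum(prize[v] for v in sub) - cost
            if score > best_score:
                best, best_score = set(sub), score
    return best, best_score

tree, score = best_tree()
# The bridge b1 is kept despite its low prize because it cheaply
# connects the two seeds; o1 is excluded despite a higher prize.
assert tree == {"s1", "b1", "s2"}
assert abs(score - 1.7) < 1e-9
```

The optimum keeps the low-similarity bridge node, which is exactly the behavior embedding-similarity top-k retrieval cannot reproduce.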

Figure 4 · PCST-based retrieval. Seed nodes (e.g. s₁, s₂: high query similarity) anchor the tree; Bridge nodes (b₁, b₂: low similarity, but essential connectors between seed clusters) are included in the Steiner tree solution; Excluded nodes (o₁, o₂) fall outside the tree.

5.5 · From subgraph to execution

The output of PCST is not an immediately executable script. It is a reasoning subgraph representing “what should be considered together.” Here, Layer 2 performs a second transformation: converting the mixed-type subgraph into a plan artifact with execution ordering. Skill nodes become execution candidates; strategy nodes become gating rules and priority constraints; curation patterns provide topological constraints and dependency ordering; and trajectories provide initial bias in conservative/exploratory branches. The output Π(q) of Layer 2 is therefore not a simple top-k list but the linearization of a plan graph with assigned execution rules.

5.6 · ROI-based curation: self-evolution, not ranking

The second core role of Layer 2 is to continuously evaluate and refine the effectiveness of accumulated wisdom. This is not simple ranking — it is a self-evolution mechanism in which the system calibrates its own judgments against historical performance.

Predicted Utility Estimation

For each wisdom ωᵢ:

pᵢ = τ(simᵢ, stageᵢ) × η(ωᵢ)  (6)

τ(simᵢ, stageᵢ) is the historical transfer rate. η(ωᵢ) is a saturation-based evidence confidence that scales with feedback accumulation, guaranteeing a minimum exploration opportunity even for items with no evidence.
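One simple saturation curve with the stated properties is n/(n+k) with a floor. The functional form, half-saturation constant k, and floor η_min are assumptions; the report specifies only that η scales with feedback accumulation and guarantees a minimum exploration opportunity:

```python
# Saturation-based evidence confidence sketch (assumed form): eta grows
# toward 1 as feedback accumulates, with a floor so zero-evidence items
# still get explored.

def evidence_confidence(n_feedback, k=5, eta_min=0.2):
    return max(eta_min, n_feedback / (n_feedback + k))

def predicted_utility(tau, n_feedback):
    return tau * evidence_confidence(n_feedback)  # Eq. (6)

assert evidence_confidence(0) == 0.2    # floor: never zero
assert evidence_confidence(5) == 0.5    # half-saturation at n = k
assert predicted_utility(0.8, 20) == 0.8 * (20 / 25)
```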

Adaptive Gating

The system adaptively switches between a cold-start gate (permissive, exploration-first) and a warm gate (conservative, LCB-based) according to system confidence:

g_roi(ωᵢ) = 𝟙[pᵢ ≥ θ_cold] if cold start;  𝟙[lcbᵢ > θ_warm] if warm  (7)
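A sketch of the two-regime gate, with an assumed Hoeffding-style lower confidence bound and illustrative thresholds (the report does not specify the LCB construction):

```python
from math import sqrt

# Adaptive gate sketch (Eq. 7): the cold-start branch admits anything
# whose predicted utility clears theta_cold; the warm branch admits only
# items whose lower confidence bound clears theta_warm.

def lcb(mean, n, z=1.0):
    return mean - z * sqrt(1.0 / max(n, 1))  # assumed Hoeffding-style bound

def gate(p, mean, n, system_confident, theta_cold=0.3, theta_warm=0.5):
    if not system_confident:
        return p >= theta_cold          # permissive, exploration-first
    return lcb(mean, n) > theta_warm    # conservative, evidence-driven

assert gate(0.35, mean=0.0, n=0, system_confident=False)      # cold: p clears
assert not gate(0.35, mean=0.6, n=4, system_confident=True)   # LCB = 0.1
assert gate(0.35, mean=0.6, n=400, system_confident=True)     # LCB = 0.55
```

The same observed mean of 0.6 is rejected at n = 4 but accepted at n = 400, which is the intended behavior: the warm gate trusts evidence volume, not point estimates.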

Four-Axis Composite Scoring & Portfolio ROI

For items that pass the gate, a composite score is computed across four axes: embedding similarity (sim), evidence strength (evi), stage fitness (stage_fit), and benchmark (bench). Expected performance is estimated as a score-weighted average, and a Correction Factor CF = 𝔼[δ_actual / δ_predicted] is applied to calibrate against past prediction–outcome discrepancies. This CF constitutes the system’s “self-awareness”: it learns how accurate its own predictions have been.

5.7 · Wisdom self-evolution: a system that improves with use

After each curation session completes, per-wisdom attributions are extracted from the session outcome, decomposing the session-level result into individual contribution levels, observed effects, and optional revision suggestions. These attributions drive self-evolution along three axes: individual-level calibration of the transfer function τ, compositional-level reinforcement that promotes frequently successful combinations to curation strategies, and content-level evolution that flags wisdom items accumulating revision suggestions for the maintenance pipeline.

τ⁽ⁿ⁾ = τ_prior if n = 0;  (τ̄_obs + τ_prior)/2 if 0 < n ≤ k;  τ̄_obs if n > k  (8)
promote(c) ⟺ |S(c)| ≥ n_min ∧ r̄(c) ≥ r_min  (9)
evolve(ω) ⟺ |U(ω)| ≥ θ_evo  (10)
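Eq. (8) can be sketched directly; the prior value and blend horizon k below are illustrative, not the system's actual settings:

```python
# Transfer-rate calibration sketch (Eq. 8): prior-only before any
# observations, a prior/observed average while evidence is thin, and
# observed-only once n > k. tau_prior and k are assumed values.

def transfer_rate(observed, tau_prior=0.5, k=3):
    n = len(observed)
    if n == 0:
        return tau_prior                      # no evidence yet
    tau_obs = sum(observed) / n
    if n <= k:
        return (tau_obs + tau_prior) / 2      # blend while evidence is thin
    return tau_obs                            # evidence dominates

assert transfer_rate([]) == 0.5
assert transfer_rate([1.0, 0.0]) == 0.5           # blend of 0.5 and 0.5
assert transfer_rate([1.0, 1.0, 0.0, 1.0]) == 0.75  # n > k: observed only
```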

When a pattern group accumulates sufficient members, an offline synthesis step — executed in batch, outside the live curation loop — compresses them into a single best-practice pattern capturing the common successful structure and weighted average performance. All learned state feeds into the next curation cycle.

5.8 · Wisdom maintenance: auto graph hygiene

Self-evolution improves wisdom over time, but without active maintenance the graph can degrade through duplication, contradiction, and staleness. Layer 2 performs three maintenance operations directly on the PCR structure.

Hygiene

PCR-based deduplication

Wisdom pairs whose P, C, R nodes are semantically equivalent are merged; evidence counts and edge scores consolidate into the surviving variant.

Hygiene

Contradiction resolution

Same P, C with opposing R is detected by polarity analysis. Stronger evidence wins; the weaker variant is annotated with exclusion conditions—a global contradiction becomes a context-bounded judgment.

Hygiene

Evidence-triggered update

Wisdom that accumulates negative feedback is cross-referenced with domain-similar items and externally validated; targeted patches preserve the PCR structure and evidence history.

06Layer 3 · Evaluation-Driven Optimization

As discussed in Section 2.3, existing automation approaches share a common limitation: they do not learn optimization strategies themselves. TDD-based verification causes test infrastructure to become technical debt during workflow redesign; stateless retry loops such as Ralph Loop fail to distinguish productive work from waste; and prompt/parameter optimization retains only the results while leaving “how the optimization was performed” unstructured. Layer 3 targets this gap.

There is, however, a more fundamental limitation that has received less attention. Current prompt and workflow optimizers define each LLM invocation as a module Mᵢ = (πᵢ, θᵢ, Xᵢ, Yᵢ) and optimize primarily at the prompt level. A multi-hop QA system, for example, is decomposed into a chain of such modules — a query generator, a document summarizer, an answer extractor — each improved by refining its prompt πᵢ. This prompt-centric paradigm scales linearly in the number of modules: as task complexity grows, more modules are introduced, each duplicating preamble context in its prompt.

We therefore take a broader view of the optimization target. Following prior formalizations of agentic workflows, we model an agent workflow as a directed graph 𝒜 = (𝒩, ℰ) whose nodes are of three heterogeneous types — deterministic code nodes, single LLM calls parameterized by prompt and decoding settings, and tool-using agent nodes performing multi-turn reasoning with tool invocations — and whose edges define execution order. Under this formulation, a prompt-centric optimizer restricts optimization to the π-component of LLM-call nodes alone, whereas Layer 3 targets the full tuple (prompt, tool configuration, code logic, orchestration structure) across all node types.

Real-world agent systems, built on frameworks such as OpenAI/Claude Agent SDK, Google ADK, and Vercel AI SDK, exhibit exactly this heterogeneous structure. Not every LLM call warrants an agent node — agent nodes incur substantially higher cost due to multi-turn reasoning and tool invocation — but where a task requires tool use, context management, and conditional branching, a single agent node can subsume what would otherwise be multiple prompt-only modules. Jointly optimizing prompt, tool configuration, code logic, and orchestration structure across all node types is the scope that Layer 3 targets.

6.1 · The Adapt pipeline

The optimization target of Layer 3 is not a single prompt but an entire heterogeneous workflow encompassing code nodes, LLM calls, tool-using agents, orchestration logic, and evaluation pipelines. Given an existing agent system, Layer 3 inventories its assets, reverse-engineers a formal specification, generates evaluation data where absent, and initiates iterative optimization against the resulting baseline. The principle throughout is to respect the existing code structure while making targeted improvements, with mandatory user interaction gates at critical decisions.

  1. Asset Scan: inventories source code, LLM call structure, evaluation metrics and any available data; characterizes the optimization baseline.

  2. Adaptive Entry: chooses the starting point (code only / PRD only / data only / any combination) and skips stages whose outputs are already present.

  3. Research Agents (conditional): when evaluation data or code is missing, parallel agents survey SOTA techniques and identify public datasets to ground synthesis decisions.

  4. Reverse PRD / Data Synthesis: reconstructs a Pipeline Requirement Document; when no dataset exists, derives a category–difficulty matrix and generates stratified train/val sets (with mandatory human approval at the seed scenario gate).

  5. Shared Optimization Loop: Seed-Epoch iterative optimization; the central optimization agent coordinates Scientist, Code Reviewer, and Redesign sub-agents through diagnosis → hypothesis → execution → measurement.

  6. Meta-learning Agent: extracts which strategies worked under which conditions as structured wisdom and registers them in Layer 2; surfaced capability gaps become new atomic skills.

6.2 · Multi-agent collaboration inside the optimization loop

Layer 3 employs over a dozen specialized agents, each with strictly bounded responsibilities. Table 3 summarizes the key agents by phase.

Table 3 · Key agents in the Layer 3 pipeline.

Phase | Agent | Role
Research | SOTA Research | Surveys state-of-the-art techniques and benchmarks
Research | Dataset Search | Searches for existing open datasets
PRD | Reverse Engineer | Reverse-engineers the existing codebase into a Pipeline Requirement Document
Data | Data Analysis | Designs the synthetic data strategy from the PRD
Data | Sample Workers | 3 parallel workers by difficulty (easy / medium / hard)
Optimization | Scientist | Error pattern analysis with root-cause vs. victim distinction and ROI prioritization
Optimization | Code Reviewer | Static analysis + dry-run validation + anti-heuristic enforcement
Optimization | Optimization Agent | Central deliberation: diagnosis → hypothesis → fix → measurement
Optimization | Redesign Agent | Architectural restructuring when incremental optimization stagnates
Post-opt. | Meta-learning Agent | Extracts optimization trajectories and registers them in Layer 2

The optimization loop proceeds in two stages. In Iteration 0 (Baseline Analysis), the initial system is evaluated and the scientist agent performs multi-faceted diagnostics: error pattern classification by fixability (prompt-fixable, code-fixable, architecture-required), per-node root-cause analysis that distinguishes genuine failure sources from downstream victims, and ROI-based prioritization. A five-signal architectural verdict — error concentration, context loss ratio, error type diversity, uniform failure rate across LLM nodes, and user goal gap — determines whether incremental improvement or full architectural redesign is warranted.

Iteration 1+ (Delta Mode): The central optimization agent directly performs delta analysis and repeats fixed-seed evaluation under the Seed-Epoch structure. At each iteration, the agent applies up to three targeted fixes (prompt improvement, structural code fix, or parameter tuning), submits them to the code reviewer for validation, and measures the result. The architectural principle is centralized deliberation with specialized execution: the central optimization agent directly decides “what is the problem, what should be tried, and how should the results be interpreted,” while specialized agents perform only well-bounded tasks.

6.3 · Data-aware optimization

Most agent optimization systems treat training data as fixed and optimize only the workflow. Layer 3 extends the optimization scope to include the data itself: before entering the workflow optimization loop, the system analyzes whether observed failures stem from workflow limitations or from gaps in training data coverage, and intervenes accordingly.

Layer 3 addresses this through a conditional data augmentation step that executes between baseline evaluation and the optimization loop. The system first classifies the evaluation structure into one of three types: discrete checkers, rubric-based evaluation, or continuous metrics. Each discovered dimension is assessed along three axes: density (how many training examples cover it), diversity (how varied those examples are), and baseline performance.

The critical insight is the distinction between data gaps and workflow problems: a dimension with sufficient, diverse training data but low performance indicates a workflow problem that augmentation cannot solve, while a dimension with absent or sparse data indicates a genuine coverage gap. Augmentation targets only the latter.
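The triage logic above can be sketched as a small decision function. The thresholds and category names are invented for illustration; only the decision order (coverage first, then performance) comes from the text:

```python
# Data-gap vs. workflow-problem triage sketch (assumed thresholds):
# sparse or narrow training coverage is a data gap that augmentation can
# fix; dense, diverse data with low performance is a workflow problem.

def triage(density, diversity, score, d_min=20, v_min=0.5, s_min=0.6):
    if density < d_min:
        return "data-gap"           # too few examples: augment
    if diversity < v_min:
        return "data-gap"           # examples too narrow: augment
    if score < s_min:
        return "workflow-problem"   # data is fine: optimize the workflow
    return "healthy"

assert triage(density=3,  diversity=0.9, score=0.4) == "data-gap"
assert triage(density=50, diversity=0.8, score=0.4) == "workflow-problem"
assert triage(density=50, diversity=0.8, score=0.9) == "healthy"
```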

6.4 · Seed-Epoch: overfitting regulation

Even when multiple agents iteratively improve a system, whether those improvements are genuine is a separate question. If evaluation data is resampled at every iteration, it becomes impossible to determine whether observed performance gains result from strategy changes or simply from drawing easier samples. This “meaningless delta” problem structurally undermines the reliability of optimization.

Definition 2

Seed-Epoch

An evaluation dataset 𝒟ₑ = Sample(𝒟_train, n, ξₑ) fixed by a deterministic seed ξₑ at epoch e. All iterations i within an epoch are evaluated on the same 𝒟ₑ:

Δᵢ = Acc(𝒟ₑ, vᵢ) − Acc(𝒟ₑ, vᵢ₋₁)  (11)

Since the evaluation dataset is identical for all iterations within epoch e, sampling variance is eliminated, and the primary source of variation in Δᵢ is the strategy change itself.

The core idea is “changing only the strategy while keeping the same test problems.” This discipline ensures that instead of merely an impression that performance improved, there is a record of which changes produced which differences under identical conditions. When |Δᵢ| < ε for m consecutive iterations, optimization within the current epoch is deemed saturated, and the epoch advances to a new seed ξₑ₊₁. The new epoch evaluates on fresh data, thereby naturally validating whether strategies overfitted to the previous epoch.
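The mechanics reduce to two small pieces: deterministic seeded sampling and a saturation check. The pool, sample size, and thresholds ε and m below are illustrative:

```python
import random

# Seed-Epoch sketch (Def. 2, Eq. 11): a deterministic seed fixes the
# eval set for every iteration in an epoch, so deltas reflect strategy
# changes rather than sampling luck.

def sample_eval_set(train_pool, n, seed):
    return random.Random(seed).sample(train_pool, n)

pool = list(range(100))
s_a = sample_eval_set(pool, 10, seed=7)
s_b = sample_eval_set(pool, 10, seed=7)
assert s_a == s_b                         # identical data within an epoch
assert set(s_a) <= set(pool) and len(s_a) == 10

def epoch_saturated(deltas, eps=0.005, m=3):
    """Advance the epoch (new seed) when the last m deltas are all < eps."""
    return len(deltas) >= m and all(abs(d) < eps for d in deltas[-m:])

assert not epoch_saturated([0.05, 0.02, 0.001, 0.002])
assert epoch_saturated([0.05, 0.001, 0.002, 0.003])
```

When `epoch_saturated` fires, the next epoch resamples with a new seed, so any strategy that overfitted the previous epoch's samples is exposed on fresh data.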

6.5 · Optimization safeguards

Seed-Epoch ensures that performance deltas are attributable. Three additional safeguards ensure that the improvements themselves are genuine.

Safeguard

Anti-heuristic enforcement

Code modifications are restricted to structural transforms — JSON parsing, encoding normalization, null handling, type coercion, error recovery. Content-aware regexes and lookup tables are prohibited.

Safeguard

Overfitting detection

The system tracks the train–val accuracy gap across epochs. When the gap exceeds a threshold, the next epoch restricts optimization to prompt improvements only.

Safeguard

Fixability-based goals

Targets are computed from the baseline error distribution — prompt-fixable, code-fixable, architecture-required — reflecting what the current structure can realistically achieve rather than arbitrary numbers.

6.6 · Cross-layer compounding cycle

Most optimization systems improve performance on the current project and stop. When the same type of problem is encountered in a subsequent project, there is no structural mechanism to reuse insights. Layer 3 closes this gap by feeding structured evidence back to Layer 2 after each optimization cycle.

Definition 3

Verdict

A verdict is a multi-dimensional judgment tuple (Δacc, Δlat, Δtok) recording the accuracy delta, latency change, and token usage change observed after applying a wisdom-informed strategy change under fixed-seed evaluation. Every verdict generated during optimization is fed back to Layer 2 to update η(ω) (evidence confidence) and τ(·) (historical transfer rate).

Furthermore, the meta-learning agent extracts patterns discovered during the optimization process itself — which strategy combinations worked under which conditions, which trajectories reduced failures — as structured optimization trajectories and registers them in Layer 2. These trajectories enable the system, when encountering a similar situation in a future project, to start from a validated path rather than exploring from scratch. This is what makes Layer 3 not a one-shot optimizer but a self-reinforcing optimization engine that drives the compounding cycle of the entire MEGA framework.

07Experiments & Evaluation

The evaluation of MEGA is organized along two axes: (1) whether the curation quality of accumulated wisdom translates into improved downstream performance, and (2) whether MEGA’s optimization surpasses existing LM-centric optimizers. Experiment configurations, optimization, and per-benchmark results for both axes are available at github.com/mega-edo/mega_benchmark.

7.1 · Wisdom curation quality on SkillsBench

SkillsBench — 84 tasks across 11 domains, deterministic pytest assertions in isolated Docker containers. All conditions share the same agent (Gemini 3 Flash), 4,207-asset skill pool, tasks, verifiers, and Docker environment; only the skill discovery and orchestration method varies.

Table 4 · Pass rate, token usage and curation latency on SkillsBench. Efficiency = pass rate per megatoken.

Method | Pass rate % | Avg tokens / task (k) | Curation latency (s) | Efficiency (score / Mtok)
No Skills | 31.5 | 894 | — | 0.353
AgentSkillOS | 41.1 | 1,189 | 403.4 | 0.345
SkillNet | 41.7 | 983 | 37.8 | 0.424
MEGA (WG) | 46.5 | 822 | 11.8 | 0.566

MEGA (WG) achieves the highest pass rate (46.5%) with the best efficiency (0.566 score/Mtok). PCST retrieval requires no LLM calls during the retrieval phase itself, yielding the lowest curation latency (11.8s vs. 37.8s and 403.4s).

7.2 · Optimization performance on GPT-4.1 Mini

Compared against MIPROv2, TextGrad, GEPA, and Feedback Descent on four GEPA benchmarks. All baselines and MEGA share identical evaluation checkers and grading criteria.

Per-benchmark bar charts: (a) HotpotQA, (b) IFBench, (c) HoVer, (d) PUPA, each comparing Baseline, MIPROv2, TextGrad, Feedback Descent, GEPA, and MEGA. The underlying values are listed in Table 5.
Table 5 · Benchmark results on GPT-4.1 Mini. Baselines cited from GEPA and Feedback Descent.

Method | HotpotQA | IFBench | HoVer | PUPA | Aggregate
Baseline | 38.00 | 47.79 | 46.33 | 78.57 | 52.67
MIPROv2 | 58.00 | 49.15 | 48.33 | 83.37 | 59.71
TextGrad | 62.33 | 48.64 | 47.67 | 85.68 | 61.08
Feedback Descent | 68.33 | 54.59 | 57.67 | 85.66 | 66.56
GEPA (best) | 69.00 | 55.95 | 56.67 | 96.46 | 69.52
MEGA | 72.67 | 61.05 | 74.67 | 97.81 | 76.55

MEGA outperforms GEPA by +7.03 aggregate and Feedback Descent by +9.99. The largest absolute gain is on HoVer (+18.00 over GEPA), where multi-hop retrieval benefits most from compositional wisdom that chains retrieval strategies across hops.

08Conclusion

This report presents MEGA, an infrastructure that unifies evaluation-driven agent optimization with self-evolving wisdom curation. Layer 1 distills agent sessions into verified knowledge through behavioral A/B validation, ensuring that only empirically confirmed wisdom enters the system. Layer 2 decomposes this knowledge into atomic PCR units within a typed Wisdom Graph, performs logical reasoning to discover implicit compositional relations, and assembles context-specific execution plans through compositional retrieval. Layer 3 jointly optimizes heterogeneous agent workflows—code nodes, LLM calls, and tool-using agents—and attributes performance changes to specific strategy modifications, producing operational evidence that continuously refines both the Wisdom Graph and the curation logic that governs it.

Empirical results support this design along two axes. On SkillsBench (84 tasks, 11 domains), MEGA achieves a 46.5% pass rate with an efficiency of 0.566 score/Mtok, surpassing SkillNet (41.7%, 0.424) and AgentSkillOS (41.1%, 0.345) while consuming fewer tokens per task. On four established workflow benchmarks, MEGA attains an aggregate score of 76.55 on GPT-4.1 Mini—7.03 points above the strongest reported baseline. These gains reflect the architecture as a whole: behavioral validation gating graph entries, compositional reasoning shaping execution plans, and attributable measurement isolating strategy-driven effects.

These results are not attributable to any single layer in isolation. Layer 1’s behavioral validation controls the precision of graph entries; Layer 2’s compositional reasoning determines the structural quality of execution plans; Layer 3’s attributable measurement provides the empirical signal that calibrates both. The three layers form a closed loop—distillation feeds reasoning, reasoning guides optimization, and optimization produces evidence that refines both the knowledge and the curation logic—and the observed gains reflect this loop operating end to end.

09References

[1] R. L. Ackoff. From data to wisdom. Journal of Applied Systems Analysis, 16:3–9, 1989.
[2] L. A. Agrawal et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. ICLR 2026 (Oral).
[3] Z. Chen et al. Agent KB: cross-domain experience for agentic problem solving. arXiv:2507.06229, 2025.
[4] D. Edge et al. From local to global: a graph RAG approach. arXiv:2404.16130, 2024.
[5] M. X. Goemans & D. P. Williamson. A general approximation technique for constrained forest problems. SIAM J. Computing, 1995.
[6] B. Gu et al. Flow: modularized agentic workflow automation. ICLR 2025.
[7] S. Guo et al. DS-Agent: automated data science via case-based reasoning. ICML 2024.
[8] Z. Guo et al. LightRAG: simple and fast retrieval-augmented generation. arXiv:2410.05779, 2024.
[9] B. J. Gutierrez et al. From RAG to memory: non-parametric continual learning. ICML 2025.
[10] Hugging Face. Upskill: agent skill generation and evaluation. github.com/huggingface/upskill, 2025.
[11] G. Huntley. Everything is a Ralph Loop. ghuntley.com/loop, 2025.
[12] T. Joshi, S. Chowdhury, F. Uysal. SWE-Bench-CL: continual learning for coding agents. arXiv:2507.00014, 2025.
[13] O. Khattab et al. DSPy: compiling declarative LM calls into pipelines. ICLR 2024.
[14] Y. Lee, J. Boen, C. Finn. Feedback descent: open-ended text optimization. arXiv:2511.07919, 2025.
[15] H. Li et al. AgentSkillOS: organizing, orchestrating & benchmarking agent skills. arXiv:2603.02176, 2026.
[16] H. Li et al. SkillNet: create, evaluate, and connect AI skills. arXiv:2603.04448, 2026.
[17] X. Li et al. SkillsBench: benchmarking agent skills across diverse tasks. arXiv:2602.12670, 2026.
[18] J. Liang et al. SWE-Next: scalable real-world software engineering tasks. arXiv:2603.20691, 2026.
[19] A. Madaan et al. Self-refine: iterative refinement with self-feedback. NeurIPS 2023.
[20] H. Mi et al. ProcMEM: reusable procedural memory via non-parametric PPO. arXiv:2602.01869, 2026.
[21] K. Opsahl-Ong et al. Optimizing instructions and demonstrations for multi-stage LM programs. EMNLP 2024.
[22] S. Ouyang et al. ReasoningBank: scaling agent self-evolving with reasoning memory. ICLR 2026.
[23] C. Packer et al. MemGPT: towards LLMs as operating systems. arXiv:2310.08560, 2023.
[24] S. Pandit et al. Synthesizing agentic data for web agents. arXiv:2510.13913, 2025.
[25] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge UP, 2nd ed., 2009.
[26] O. Press et al. Measuring and narrowing the compositionality gap. EMNLP Findings 2023.
[27] M. Sap et al. ATOMIC: an atlas of machine commonsense. AAAI 2019.
[28] N. Shinn et al. Reflexion: language agents with verbal reinforcement learning. NeurIPS 2023.
[29] R. Speer, J. Chin, C. Havasi. ConceptNet 5.5: an open multilingual graph. AAAI 2017.
[30] M. Suzgun et al. Dynamic cheatsheet: adaptive memory for test-time learning. arXiv preprint, 2025.
[31] G. Wang et al. Voyager: open-ended embodied agent with LLMs. arXiv:2305.16291, 2023.
[32] J. Wang. Stateful reflective developer platforms (SRDP). arXiv:2512.22716, 2025.
[33] Z. Z. Wang et al. Agent workflow memory. arXiv:2409.07429, 2024.
[34] Y. Wu et al. Optimas: optimizing compound AI systems with locally aligned rewards. arXiv preprint, 2025.
[35] X. Xiang et al. SWE-Exp: experience-driven software issue resolution. arXiv:2507.23361, 2025.
[36] B. Yu et al. UTBoost: rigorous evaluation of coding agents on SWE-Bench. ACL 2025.
[37] M. Yuksekgonul et al. TextGrad: automatic differentiation via text. Nature 2025.
[38] J. Zhang et al. AFlow: automating agentic workflow generation. ICLR 2025 (Oral).
[39] Q. Zhang, C. Hu, et al. Agentic context engineering (ACE). ICLR 2026.
[40] T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: efficient data clustering. SIGMOD 1996.
[41] H. Zhou et al. Memento-Skills: let agents design agents. arXiv:2603.18743, 2026.
[42] Y. Zhou et al. Synthetic sandbox for training ML engineering agents. arXiv:2604.04872, 2026.