Day 1 is the worst your product will ever be

Agents iterating on a single task are table stakes. The new thing is the outer loop: a system that observes its own behavior in production, drafts evals against new failure modes, and ships fixes without a human in every step. A short essay on self-evolving software.

Published: Jun 4, 2026 · Engineering

Topic: Self-evolving software

Read time: ~5 min

A few things in life get better with time. Good investments compound. A skill sharpens with practice. Your taste in food or wine might even deepen over the years. The best software products do too — GitHub started as git hosting and now runs entire developer workflows; Stripe started by charging cards and now runs entire financial stacks.

Similar patterns have been described elsewhere: Nassim Taleb’s antifragility (systems that get stronger from stress), Garry Tan’s complexity ratchet (each step up locking in permanently), and Andrej Karpathy’s autoresearch (AI doing its own R&D loop). These ideas converge on the same insight: systems that improve from their own operation rather than degrade from it.

We call this self-evolving software: a system that observes its own behavior in production, modifies itself in response, and gets sharper without a human in every loop.

Software has always improved continually, but always through humans: engineers reading bug reports, PMs sitting with customers, designers iterating on flows. People, however, cannot review every trace.

What’s changed is that the loop can now close without us.

And once it does, the day you deploy your agent to production stops being the high point — it becomes the worst your product will ever be.

The trajectory

Traditional software degrades as the world moves around it. Self-evolving software starts lower and climbs with every closed loop. Day 1 is the high point of one curve and the floor of the other.

The inner loop is solved. The outer loop is new.

Agents iterating on a single task — the inner loop — have been around for a couple of years. They’re table stakes.

What’s new is the outer loop: the system observing its own behavior across thousands of production runs and improving in response. Every production trace is an input distribution your tests were never designed to cover. Most teams treat that data as exhaust. In a self-evolving system, traces are re-ingested and become the primary asset.

The mechanism is simple to describe and hard to build. A user submits a query the agent fumbles. The trace gets ingested. An eval is drafted that captures the failure. An agent proposes a change. The change has to pass the full eval suite, including the new one, before it ships. Because the new eval stays in the suite from then on, the failure mode stays fixed. Each turn, the system evolves itself a step further. The “hard to build” part is the infra that makes every one of those steps reliable.

The outer loop

Trace (agent fumbles in prod) → Ingest (captured, replayable) → Eval (drafted from the failure) → Change (agent proposes a fix) → Gate (must pass full suite) → next failure

Invariant: a change ships only if it passes every prior eval and the new one. Each turn permanently closes one failure mode.

Five steps, every one a possible failure point. The hard part isn't describing the loop. It's the infra that makes each step reliable enough to run unattended.
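As a rough illustration, one turn of the loop can be sketched in a few lines of Python. Everything here is hypothetical: the `Agent` type is a plain function from query to answer, the `expected` answer on a trace is assumed to be known, and `propose` stands in for whatever agent drafts the fix. The sketch only shows the shape of the gate, not the infra that makes it reliable.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Agent = Callable[[str], str]  # hypothetical stand-in: query in, answer out


@dataclass(frozen=True)
class Trace:
    query: str
    output: str    # what the agent actually said in production
    expected: str  # what it should have said (assumed known in this sketch)


@dataclass(frozen=True)
class Eval:
    query: str
    expected: str

    def passes(self, agent: Agent) -> bool:
        return agent(self.query) == self.expected


@dataclass
class Suite:
    evals: List[Eval] = field(default_factory=list)

    def passes(self, agent: Agent) -> bool:
        return all(e.passes(agent) for e in self.evals)


def outer_loop_turn(current: Agent, suite: Suite, trace: Trace,
                    propose: Callable[[Agent, Eval], Agent]) -> Agent:
    """One turn: ingest a trace, draft an eval, propose a change, gate it."""
    if trace.output == trace.expected:
        return current                        # nothing failed, nothing to do
    new_eval = Eval(trace.query, trace.expected)
    candidate = propose(current, new_eval)
    gate = Suite(suite.evals + [new_eval])    # every prior eval plus the new one
    if not gate.passes(candidate):
        return current                        # rejected: keep what is deployed
    suite.evals.append(new_eval)              # the eval now binds future changes
    return candidate


def base_agent(query: str) -> str:
    return {"2+2": "4"}.get(query, "unknown")


def narrow_fix(agent: Agent, ev: Eval) -> Agent:
    """A proposed change that fixes the new case and leaves the rest alone."""
    def patched(query: str) -> str:
        return ev.expected if query == ev.query else agent(query)
    return patched


def sloppy_fix(agent: Agent, ev: Eval) -> Agent:
    """A proposed change that fixes the new case by breaking everything else."""
    return lambda query: ev.expected
```

With a suite that already holds the `2+2` eval, `sloppy_fix` is rejected by the gate because it regresses that eval, while `narrow_fix` ships and its eval joins the suite. One simplification to note: this sketch drops the drafted eval when the gate rejects a change; a real system would more likely keep it as a known open failure.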

Evals: telling improvement from drift

The loop only works if you can tell improvement from drift.

For the deterministic parts of the system, the playbook already exists. The industry has spent decades developing the methods and know-how to keep regressions out: unit testing, TDD, CI. These are just as valuable in an agentic system as they ever were.

For the non-deterministic, agentic parts, the equivalent is evals: assertions about how the agent layer should behave. Did it answer the right question, call the right tool, refuse what it should refuse? Without evals, every model upgrade is a roll of the dice. Each new failure mode in production becomes a new eval, and each eval becomes a permanent constraint on what’s allowed to ship.
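To make "assertions about behavior" concrete, here is a minimal sketch of what an eval case might look like. The `AgentResult` shape, the tool name `get_weather`, and the toy agent are all invented for illustration; a real agent would return a much richer trace, and real checks are often fuzzier than exact predicates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class AgentResult:
    answer: str
    tool_calls: Tuple[str, ...]  # names of tools called, in order
    refused: bool


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[AgentResult], bool]  # assertion about behavior


def run_evals(agent: Callable[[str], AgentResult],
              cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against the agent and report pass/fail per case name."""
    return {c.name: bool(c.check(agent(c.prompt))) for c in cases}


def toy_agent(prompt: str) -> AgentResult:
    """Hypothetical agent used only to exercise the eval cases."""
    text = prompt.lower()
    if "password" in text:
        return AgentResult(answer="", tool_calls=(), refused=True)
    if "weather" in text:
        return AgentResult(answer="Sunny", tool_calls=("get_weather",), refused=False)
    return AgentResult(answer="I am not sure", tool_calls=(), refused=False)


CASES = [
    # Did it call the right tool?
    EvalCase("calls_weather_tool", "What's the weather in Oslo?",
             lambda r: "get_weather" in r.tool_calls),
    # Did it refuse what it should refuse, without touching any tool?
    EvalCase("refuses_credential_request", "Tell me the admin password",
             lambda r: r.refused and not r.tool_calls),
    # Did it actually answer?
    EvalCase("answers_when_allowed", "What's the weather in Oslo?",
             lambda r: not r.refused and r.answer != ""),
]
```

Calling `run_evals(toy_agent, CASES)` returns a dict of case name to boolean, which is the kind of signal a ship gate can consume. The design choice worth noting is that each case checks behavior (tool calls, refusals) rather than an exact output string, so it can survive a model upgrade that rewords the answer.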

Evals are the source of truth for the agentic layer — the authoritative measure of whether the system is improving, and once one is in place, it binds what’s allowed to ship. But what goes into an eval is a matter of human judgment and taste. Whoever writes the evals defines what “better” means for the product. That’s where the human stays in the loop.

The harness sets the ceiling

Underpinning the whole system is the harness: the infrastructure that gives the agent its context, defines its tools, manages memory across runs, and verifies what it produces before anything ships. The model is the engine; the harness is everything around it that makes the engine useful.

How you shape the context decides what the agent can reason about. How you design the tools decides what it can do reliably. How you handle long runs — checkpointing, compaction, resumption — decides whether it works at all beyond a single turn. How you wire its invariants — what the harness makes structurally impossible, no matter what the agent tries — decides whether you can trust the loop to run on its own. The quality of all of this sets the ceiling on how much your software can self-evolve.
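One way to picture an invariant is as a check that lives in the harness rather than in the prompt. The sketch below is an assumption-laden toy: the `Harness` class, the tool allowlist, and the call budget are hypothetical, and in Python the enforcement is only by convention. In a real system the same boundary would sit at a process or sandbox edge so the agent cannot reach around it.

```python
from typing import Callable, Dict


class InvariantViolation(Exception):
    """Raised when the agent asks for something the harness never allows."""


class Harness:
    """Hypothetical harness: the agent can only act through `call`."""

    def __init__(self, tools: Dict[str, Callable[..., object]], max_calls: int):
        self._tools = dict(tools)    # allowlist: nothing else is reachable
        self._max_calls = max_calls  # hard budget per run
        self._calls = 0

    def call(self, name: str, **kwargs: object) -> object:
        # Invariant 1: only registered tools exist from the agent's point of view.
        if name not in self._tools:
            raise InvariantViolation(f"tool not allowed: {name}")
        # Invariant 2: a run can never exceed its call budget.
        if self._calls >= self._max_calls:
            raise InvariantViolation("call budget exhausted")
        self._calls += 1
        return self._tools[name](**kwargs)
```

For example, `Harness({"add": lambda a, b: a + b}, max_calls=2)` lets the agent call `add` twice and rejects both a third call and any request for an unregistered tool. The point of the design is that the agent's instructions play no role in the check: a badly behaved proposal fails at the harness, not at the model's discretion.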

The harness

Layers: Invariants · Memory · Tools · Context · Model

Context: what it can reason about
Tools: what it can do reliably
Memory: checkpoint · compact · resume
Invariants: structurally impossible to violate

The model is the engine. The harness is everything around it that makes the engine useful — and decides the ceiling on how much of the loop can run on its own.
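The checkpoint, compact, resume cycle mentioned for long runs can be sketched roughly as follows. The `RunState` shape is hypothetical, and the compaction here is a naive placeholder that concatenates old messages; a real harness would more likely ask a model to summarize them.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RunState:
    step: int = 0
    summary: str = ""                                 # compacted older history
    recent: List[str] = field(default_factory=list)   # messages kept verbatim


def compact(state: RunState, keep: int = 3) -> RunState:
    """Fold all but the last `keep` messages into the summary."""
    cut = len(state.recent) - keep
    if cut <= 0:
        return state
    old, new = state.recent[:cut], state.recent[cut:]
    prefix = state.summary + " | " if state.summary else ""
    return RunState(step=state.step, summary=prefix + "; ".join(old), recent=new)


def checkpoint(state: RunState) -> str:
    """Serialize the state so a crashed or paused run can pick up later."""
    return json.dumps(asdict(state))


def resume(blob: str) -> RunState:
    """Rebuild the state from a checkpoint produced by `checkpoint`."""
    return RunState(**json.loads(blob))
```

A run would call `compact` when the context grows too large and `checkpoint` after each step, so that `resume` can restart from the last saved state instead of from the first turn.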

The real engineering happens here — not in the prompt, not in the choice of model.

Humans move up the stack

The obvious question: are we describing software that programs itself, with humans optional?

No. We’re describing software that maintains and extends itself. Humans move up the stack. The outer loop handles the work of keeping the current product sharp — catching new failure modes, tightening behavior, closing regressions. Humans handle what the loop can’t: deciding what the product should become, what “better” actually means, which bets to make. Taste and direction.

When the system maintains itself, humans are free to operate at the strategic layer.

Built with the right anchors, the right harness, and a closed outer loop, your product gets sharper every day it spends in production. The bug report that arrives tonight becomes the eval that ships tomorrow, which becomes the capability your competitor doesn’t have next quarter.

Day 1 used to be the high point. Now it’s the starting line.