Two Roads to Durable Agents: Replay vs. Snapshot — Eric Allam, Trigger.dev

Descriptions:

Eric Allam, co-founder of Trigger.dev, delivers a conference talk examining why the stateless “shared nothing” backend architecture that has dominated web development for 30 years fundamentally breaks down when applied to long-running AI agents. Drawing a line from CGI in 1993 through PHP, Ruby on Rails, and serverless, he argues that agents are sessions, not transactions — they need to stay alive and recoverable across failures, code deployments, and indefinite time horizons.

The talk contrasts two architectural approaches to durability: the replay model and the snapshot model. The replay approach, borrowed from durable execution engines like Temporal, wraps every side effect in a cached step so the system can fast-forward past completed work on retry. It provides clean audit trails and human-in-the-loop pause points, but it forces developers to write deterministic code in a rigid structure and creates versioning headaches when the underlying logic changes. The snapshot model instead periodically saves the full state of the compute environment. It is more flexible for complex agent workloads, but it carries its own tradeoffs around storage size and recovery fidelity.
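The replay idea can be illustrated with a minimal Python sketch. It is not Temporal's or Trigger.dev's API: `ReplayRun`, `step`, `fetch_plan`, `run_task`, and the in-memory `journal` dict are hypothetical names chosen for illustration, and a real engine would persist the journal in a database. Each side effect runs inside a keyed step; on retry, steps whose results are already in the journal are skipped and their cached results returned.

```python
import json
from typing import Any, Callable, Dict, List, Optional, Tuple


class ReplayRun:
    """Hypothetical run handle: caches each step result under its key."""

    def __init__(self, journal: Optional[Dict[str, str]] = None) -> None:
        # A dict stands in for durable storage (assumption for this sketch).
        self.journal: Dict[str, str] = journal if journal is not None else {}

    def step(self, key: str, fn: Callable[[], Any]) -> Any:
        if key in self.journal:
            # Completed on an earlier attempt: replay the cached result.
            return json.loads(self.journal[key])
        result = fn()  # the side effect runs only when there is no cached result
        self.journal[key] = json.dumps(result)
        return result


calls: List[str] = []          # records which side effects actually ran
crash = {"enabled": True}      # toggles a simulated failure


def fetch_plan() -> Dict[str, List[str]]:
    calls.append("fetch_plan")
    return {"tasks": ["a", "b"]}


def run_task(name: str) -> str:
    calls.append(f"run_task:{name}")
    if name == "b" and crash["enabled"]:
        raise RuntimeError("simulated crash")
    return name.upper()


def workflow(run: ReplayRun) -> List[str]:
    # Step keys must be stable across attempts, which is why the
    # surrounding code has to be deterministic.
    plan = run.step("plan", fetch_plan)
    return [run.step(f"task:{t}", lambda t=t: run_task(t)) for t in plan["tasks"]]


def demo() -> Tuple[List[str], List[str]]:
    calls.clear()
    crash["enabled"] = True
    journal: Dict[str, str] = {}
    try:
        workflow(ReplayRun(journal))   # first attempt fails at task "b"
    except RuntimeError:
        pass
    crash["enabled"] = False
    result = workflow(ReplayRun(journal))  # retry fast-forwards past cached steps
    return result, list(calls)


if __name__ == "__main__":
    print(demo())
```

On the retry, `fetch_plan` and the task `a` step are served from the journal and only task `b` runs again. The sketch also hints at the versioning problem: if a deployment renames or reorders step keys, the old journal no longer lines up with the new code.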

Allam proposes treating agent state as two separable halves: the context (an append-only log of all LLM inputs, outputs, tool calls, and results — durable via any standard database or object store) and the execution environment (file system contents, running subprocesses, cloned repos, in-memory datasets). He cites research suggesting agent working duration is doubling every four to seven months, making production-grade durability infrastructure an urgent engineering priority. Trigger.dev is building tooling specifically targeting these deployment challenges.
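The context half of that split can be sketched as a small append-only log. This is an illustrative assumption rather than anything shown in the talk: the `ContextLog` class, the JSON Lines file format, and the entry kinds (`llm_input`, `tool_call`, `tool_result`) are hypothetical, and a database or object store could play the same role as the local file used here.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


class ContextLog:
    """Hypothetical append-only log of agent context, one JSON object per line."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, kind: str, payload: Dict[str, Any]) -> None:
        # Entries are only ever appended, never rewritten in place.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"kind": kind, "payload": payload}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the entry to disk

    def load(self) -> List[Dict[str, Any]]:
        # Rebuild the full context after a restart by reading every entry.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def demo() -> List[Dict[str, Any]]:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "context.jsonl")
        log = ContextLog(path)
        log.append("llm_input", {"prompt": "list files"})
        log.append("tool_call", {"name": "shell", "args": ["ls"]})
        del log  # simulate the process going away

        resumed = ContextLog(path)  # a fresh process reopens the same log
        resumed.append("tool_result", {"stdout": "ok"})
        return resumed.load()


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

This covers only the context half. The execution environment (file system contents, running subprocesses, in-memory datasets) is not captured by a log like this, which is the gap the snapshot approach is meant to address.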


📺 Source: AI Engineer · Published May 10, 2026
🏷️ Format: Deep Dive
