Stateful Agent Workflows and Recovery
The Real Problem in Agent Automation
A lot of agent systems look impressive on the happy path.
They can:
- search for context
- call tools
- write code
- update tickets
- move through a multistep workflow
That is fine. The real question is what happens after the system has already done some work and something goes wrong in the middle.
That is where most agent workflows become fragile.
The problem is not initial intelligence. The problem is state and recovery.
Why Stateless Demos Mislead People
If every run starts from scratch and either succeeds or disappears, the system can look simpler than it really is.
Real workflows are rarely that clean.
They usually involve state across time:
- a branch already exists
- a ticket was already updated
- a deploy request was already submitted
- a PR comment was already posted
- a cleanup step partially ran
- a tool call succeeded but the confirmation was lost
Now the system needs to answer a much harder question:
What is true right now, and what should I do next without making things worse?
The Failure Mode I See Constantly
The most common bad pattern is a workflow that only knows how to advance, not how to reconcile.
It assumes that:
- previous steps definitely completed
- external systems definitely match expected state
- retries are harmless
- duplicate side effects are acceptable
In real systems, those assumptions do not hold consistently.
That is why idempotence, checkpoints, and reconciliation matter so much.
A Workflow Is Really a State Machine
Once an agent workflow has side effects, it should be modeled more like a state machine than a conversational script.
For example:
```
drafted -> approved -> implemented -> validated -> documented -> done
```

That is more useful than a vague sequence of prompts because each state can carry rules.
For each state, I want to know:
- what must already be true
- what action transitions out of the state
- what evidence confirms the transition
- what rollback or reconciliation options exist
That structure prevents a lot of accidental chaos.
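The per-state rules above can be sketched as a small transition table. This is a minimal sketch: the state names follow the example flow, while the `precondition`, `action`, and `evidence` strings are illustrative placeholders, not a real API.

```typescript
// Illustrative state machine for the example workflow.
// State names come from the flow above; the rule fields are assumptions.
type State = "drafted" | "approved" | "implemented" | "validated" | "documented" | "done";

interface TransitionRule {
  from: State;
  to: State;
  precondition: string; // what must already be true
  action: string;       // what action transitions out of the state
  evidence: string;     // what evidence confirms the transition
}

const rules: TransitionRule[] = [
  { from: "drafted",     to: "approved",    precondition: "draft reviewed",   action: "request approval", evidence: "approval recorded" },
  { from: "approved",    to: "implemented", precondition: "approval on file", action: "apply changes",    evidence: "commit on branch" },
  { from: "implemented", to: "validated",   precondition: "commit exists",    action: "run tests",        evidence: "test run passed" },
  { from: "validated",   to: "documented",  precondition: "tests passed",     action: "update docs",      evidence: "docs commit exists" },
  { from: "documented",  to: "done",        precondition: "docs updated",     action: "close out",        evidence: "ticket closed" },
];

// Only advance when a rule explicitly allows it; terminal states return null.
function nextState(current: State): State | null {
  const rule = rules.find((r) => r.from === current);
  return rule ? rule.to : null;
}
```

Because every transition is data, the workflow can refuse to advance when no rule matches, instead of improvising.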
Checkpoints Matter More Than People Think
If a workflow has more than a couple of steps, I want explicit checkpoints.
A checkpoint is not just a log line. It is a persisted record of:
- current state
- key outputs from the last step
- identifiers for created resources
- whether the step is safe to retry
- what validation proved the step finished
That gives the system something concrete to recover from later.
Without that, every retry becomes a guess.
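A checkpoint record with the fields listed above can be captured in a small store. This sketch persists to an in-memory `Map` purely for illustration; a real system would write to a database or durable log, and the field names here are assumptions.

```typescript
// Minimal checkpoint record matching the fields described above.
// In a real system this would be persisted durably, not kept in a Map.
interface Checkpoint {
  runId: string;
  state: string;
  outputs: Record<string, unknown>;    // key outputs from the last step
  resourceIds: Record<string, string>; // identifiers for created resources
  retrySafe: boolean;                  // whether the step is safe to retry
  validatedBy: string;                 // what validation proved the step finished
  updatedAt: string;
}

const store = new Map<string, Checkpoint>();

function saveCheckpoint(cp: Checkpoint): void {
  store.set(cp.runId, cp);
}

function loadCheckpoint(runId: string): Checkpoint | undefined {
  return store.get(runId);
}
```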
Idempotence Is Not Optional
If the workflow can be retried, then the side effects need to be designed around repeated execution.
That often means:
- create-or-update semantics instead of create-only
- stable resource keys
- deduplication tokens
- conditional writes
- explicit existence checks before creation
This principle shows up everywhere.
Bad pattern:

```ts
await createPullRequest({ branch, title, body });
```

Better pattern:

```ts
const existing = await findPullRequestByBranch(branch);
if (existing) {
  await updatePullRequest(existing.id, { title, body });
} else {
  await createPullRequest({ branch, title, body });
}
```

The second version is less dramatic. It is also what survives retries.
Reconciliation Beats Blind Retry
When a step fails, the right next action is often not retry. It is reconcile.
That means asking:
- did the side effect already happen
- is the external system in the expected state
- is the local checkpoint stale
- do we need to resume, repair, or abort
This is especially important when the workflow spans multiple systems.
A naive retry policy against distributed state is just a duplication engine.
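The reconciliation questions above can be turned into an explicit decision function. This is a sketch under stated assumptions: `LocalCheckpoint` and `RemoteState` are illustrative shapes standing in for a persisted checkpoint and a fresh query against the external system.

```typescript
// Reconcile before retrying: compare external truth with the local checkpoint
// and decide whether to resume, repair, retry, or abort.
type Decision = "resume" | "retry" | "repair" | "abort";

interface LocalCheckpoint { state: string; resourceId: string | null; }
interface RemoteState { resourceExists: boolean; resourceId: string | null; }

function reconcile(local: LocalCheckpoint, remote: RemoteState): Decision {
  if (remote.resourceExists && local.resourceId === remote.resourceId) {
    return "resume"; // the side effect already happened; continue from the next step
  }
  if (remote.resourceExists && local.resourceId !== remote.resourceId) {
    return "repair"; // the local checkpoint is stale; adopt the remote identifier first
  }
  if (!remote.resourceExists && local.resourceId !== null) {
    return "abort";  // we recorded a resource that no longer exists; escalate
  }
  return "retry";    // nothing happened yet; a retry is safe
}
```

Note that a plain retry is only the answer in one of the four branches.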
The Systems I Care About Most Are the Ones That Can Resume
I trust a workflow more when it can:
- stop mid-run
- inspect current truth
- continue from the correct state
That sounds obvious, but plenty of systems cannot actually do it.
They can only restart from zero or require manual cleanup first.
That is a sign the workflow is not really engineered yet.
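Resuming from the correct state mostly comes down to knowing which steps remain. A minimal sketch, assuming an ordered step list and a `lastSuccessfulStep` read from a persisted checkpoint; the step names are illustrative.

```typescript
// Resume from a checkpoint instead of restarting from zero.
// `steps` is the workflow's ordered step list; names are illustrative.
const steps = ["create_branch", "apply_changes", "run_tests", "open_pr", "update_ticket"];

function remainingSteps(lastSuccessfulStep: string | null): string[] {
  if (lastSuccessfulStep === null) return steps; // fresh run: everything remains
  const idx = steps.indexOf(lastSuccessfulStep);
  if (idx === -1) throw new Error(`unknown step: ${lastSuccessfulStep}`);
  return steps.slice(idx + 1); // continue after the last confirmed success
}
```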
External Truth Beats Internal Memory
Agent systems often keep a lot of progress in memory or prompt context. That is convenient right up until the process restarts or the session changes.
For any workflow with meaningful side effects, I want the source of truth to live outside ephemeral context.
Usually that means:
- persisted workflow state
- durable step outputs
- resource IDs tied to the run
- timestamps and last-known result
Prompt context can help the agent think. It should not be the only place the workflow remembers what happened.
What a Checkpoint Record Can Look Like
It does not have to be complicated.
```json
{
  "runId": "wf_1024",
  "state": "validated",
  "branch": "feature/pagination",
  "pullRequestId": 418,
  "lastSuccessfulStep": "run_tests",
  "lastUpdatedAt": "2026-03-08T14:18:00Z",
  "retrySafe": true
}
```

That is enough to answer important questions later.
Recovery Policies Should Be Explicit
Different failures deserve different responses.
| Failure | Good response |
|---|---|
| network timeout after create | reconcile before retry |
| validation failure | stay in state and request correction |
| permission failure | block and escalate |
| partial cleanup | resume cleanup using stored identifiers |
| stale branch or conflict | re-evaluate plan with current repo state |
If all failures trigger the same generic retry loop, the workflow is underdesigned.
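The table above is small enough to encode directly as a policy map, which makes "same generic retry loop for everything" impossible by construction. The failure and response labels are illustrative, not a real API.

```typescript
// Map failure kinds to distinct recovery responses, mirroring the table above.
// The labels are illustrative names, not a real library's API.
type Failure =
  | "network_timeout_after_create"
  | "validation_failure"
  | "permission_failure"
  | "partial_cleanup"
  | "stale_branch_conflict";

type Response =
  | "reconcile_then_retry"
  | "request_correction"
  | "block_and_escalate"
  | "resume_cleanup"
  | "replan_with_current_state";

const policy: Record<Failure, Response> = {
  network_timeout_after_create: "reconcile_then_retry",
  validation_failure: "request_correction",
  permission_failure: "block_and_escalate",
  partial_cleanup: "resume_cleanup",
  stale_branch_conflict: "replan_with_current_state",
};

function respond(failure: Failure): Response {
  return policy[failure];
}
```

The `Record<Failure, Response>` type also forces every new failure kind to get an explicit response, so the compiler catches an underdesigned policy.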
The Human Handoff Still Matters
Good recovery design does not eliminate humans. It makes human intervention precise.
When the workflow cannot recover safely, the handoff should be explicit:
- what state the workflow reached
- what succeeded already
- what failed
- what resource identifiers are involved
- what the recommended next action is
That is far better than dumping the user into a vague error message after six hidden side effects have already happened.
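A precise handoff can be as simple as a structured payload carrying the fields listed above. A minimal sketch; the field names and message format are assumptions.

```typescript
// An explicit human-handoff payload carrying the fields listed above.
// Field names are illustrative; the point is that nothing is left implicit.
interface Handoff {
  reachedState: string;
  succeededSteps: string[];
  failedStep: string;
  resourceIds: Record<string, string>;
  recommendedAction: string;
}

function formatHandoff(h: Handoff): string {
  return [
    `Workflow stopped in state: ${h.reachedState}`,
    `Completed: ${h.succeededSteps.join(", ") || "none"}`,
    `Failed at: ${h.failedStep}`,
    `Resources: ${Object.entries(h.resourceIds).map(([k, v]) => `${k}=${v}`).join(", ")}`,
    `Recommended next action: ${h.recommendedAction}`,
  ].join("\n");
}
```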
Where This Shows Up in Practice
This is not just an agent problem. It is a systems problem that agents make more visible.
You see it in:
- code generation pipelines
- issue and PR automation
- deployment workflows
- test orchestration
- document processing pipelines
- long-running research agents
The moment the system spans time and side effects, state and recovery become first-class design concerns.
My Practical Design Rules
When I build a multistep agent workflow, I want these properties.
- every side-effecting step has a named state
- each state has a persisted checkpoint
- important operations are idempotent or deduplicated
- recovery starts with reconciliation, not blind retry
- external truth beats conversational memory
- unrecoverable states produce a precise human handoff
If several of those are missing, the workflow probably looks better in demos than it will behave in production.
The Main Takeaway
The hard part of agent automation is not making something happen once.
It is making the system reliable when:
- side effects already occurred
- state is partially updated
- retries are dangerous
- sessions get interrupted
- external systems drift away from your assumptions
That is why I think the future of serious agent systems will look less like giant prompt chains and more like well-designed state machines with strong recovery behavior.