Pattern: Fallback and Recovery

Category: Tool Use
Source: FOR-0012
Status: Documented

When to Use

Use this pattern when operations can fail and the system needs to degrade gracefully rather than crash. It is essential for production systems where reliability matters: tool calls may fail, APIs may be down, and LLM responses may be malformed. The agent needs a backup plan for each of these cases.

How It Works

1. Attempt the primary operation (tool call, API request, generation).
2. If it fails, analyze the failure type (timeout, bad input, service down, malformed response).
3. Apply the appropriate fallback strategy:
   - Retry: Same operation, possibly with adjusted parameters
   - Alternative tool: Use a different tool that achieves the same goal
   - Simpler method: Fall back to a less sophisticated but more reliable approach
   - Cached/default response: Use saved data or default answers
   - Human escalation: Alert a human if no automated fallback works
4. Log the failure and recovery for later analysis.
5. Continue processing with the fallback result.
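
The flow above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the names `run_with_fallbacks`, `TransientError`, and the `escalate` callback are hypothetical, and the split between retryable and non-retryable failures is an assumption about how failure types might be classified.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger("fallback")


class TransientError(Exception):
    """Hypothetical failure type that is worth retrying (e.g. a timeout)."""


def run_with_fallbacks(
    primary: Callable[[], str],
    fallbacks: Sequence[Tuple[str, Callable[[], str]]],
    max_retries: int = 1,
    escalate: Optional[Callable[[str], None]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (strategy_name, result); result is None if every strategy failed."""
    # Attempt the primary operation; retry only failures classified as transient.
    for attempt in range(max_retries + 1):
        try:
            return "primary", primary()
        except TransientError as exc:
            logger.warning("primary attempt %d failed (transient): %s", attempt + 1, exc)
        except Exception as exc:
            # Bad input, malformed response, etc.: retrying is unlikely to help.
            logger.warning("primary failed (non-retryable): %s", exc)
            break

    # Walk the ordered fallbacks: alternative tool, simpler method, cache, ...
    for name, strategy in fallbacks:
        try:
            result = strategy()
            logger.info("recovered via %s", name)
            return name, result
        except Exception as exc:
            logger.warning("fallback %s failed: %s", name, exc)

    # No automated fallback worked: alert a human if a channel was provided.
    if escalate is not None:
        escalate("all automated fallbacks failed")
    return "escalated", None
```

Returning the strategy name alongside the result lets the caller see that it is working with a fallback result, which may be of lower quality than the primary one.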

Example

Consider a digital talent that pulls real-time pricing data from an API. If the API times out, it retries once. If the retry also fails, it falls back to cached pricing from the last successful fetch. If no cache exists, it flags the report as "pricing data unavailable: manual update required" and escalates to the human operator.
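
The pricing scenario might look something like the sketch below. The function `get_pricing`, the `fetch` and `notify_operator` callables, the `last_good` cache key, and the shape of the returned dictionary are all assumptions made for illustration; the flag text is adapted from the example. Only timeouts are handled here, so any other exception propagates to the caller.

```python
from typing import Callable, Dict

MANUAL_FLAG = "pricing data unavailable: manual update required"


def get_pricing(
    fetch: Callable[[], Dict[str, float]],
    cache: Dict[str, Dict[str, float]],
    notify_operator: Callable[[str], None],
) -> Dict[str, object]:
    """Try the live API (with one retry), then the cache, then escalate."""
    # One initial attempt plus one retry, both against the live API.
    for _ in range(2):
        try:
            prices = fetch()
        except TimeoutError:
            continue
        cache["last_good"] = prices  # remember the last successful fetch
        return {"prices": prices, "source": "live", "flag": None}

    # Both live attempts timed out: fall back to the last successful fetch.
    if "last_good" in cache:
        return {"prices": cache["last_good"], "source": "cache", "flag": None}

    # No cache either: flag the report and escalate to the human operator.
    notify_operator(MANUAL_FLAG)
    return {"prices": None, "source": "none", "flag": MANUAL_FLAG}
```

Recording the `source` in the result makes it possible for the report to state whether its prices are live or cached.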

Tradeoffs

Pro	Con
System stays operational despite failures	Each fallback layer adds complexity
Builds user trust through reliability	Fallback responses may be lower quality
Failures are logged for systematic improvement	Over-engineering fallbacks for rare failures wastes effort
Graceful degradation over hard crashes	Must test each fallback path independently

Factory Usage

Role Factory auto-improve: If a modification does not improve the score, the change is reverted (fallback to previous version).
Agent trigger system: Non-trigger phrases redirect to the correct agent rather than failing silently — a form of routing fallback.
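
The revert behaviour described for Role Factory auto-improve can be illustrated with a short sketch. This is not the Role Factory implementation, which is not shown here; `apply_if_improved` and the `score` callable are hypothetical names, and treating a tie as "no improvement" is an assumption based on the wording above.

```python
from typing import Callable, Tuple


def apply_if_improved(
    current: str,
    candidate: str,
    score: Callable[[str], float],
) -> Tuple[str, bool]:
    """Keep the candidate only if it scores strictly higher than the current version."""
    if score(candidate) > score(current):
        return candidate, True
    # No improvement: revert, i.e. fall back to the previous version.
    return current, False
```

The boolean in the return value tells the caller whether the modification was kept, which is useful when logging reverted attempts.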