16. Reliability and Resilience
This section defines how the architecture behaves under failure conditions and how reliability is achieved without excessive coupling or complexity. Resilience is designed in, not retrofitted.
16.1 Failure Modes
Failure modes are explicitly recognised and handled.
Common failure modes include:
- Capability execution failure
- Integration endpoint unavailability
- AI-assisted executor timeout or rejection
- Partial data availability
Principles:
- Failures are contained within their execution boundary
- Failures are explicit and observable
- Silent or ambiguous failure is avoided
Each capability declares its expected failure modes and recovery strategies.
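As an illustration only, a declaration of this kind could be sketched as below. The names `FailureMode`, `Recovery`, `CapabilityContract`, and `recovery_for` are hypothetical and not part of the architecture; the sketch also assumes that a failure mode the capability did not declare should fail explicitly rather than be handled silently.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class FailureMode(Enum):
    # Mirrors the common failure modes listed above (names are illustrative).
    EXECUTION_FAILURE = "execution_failure"
    ENDPOINT_UNAVAILABLE = "endpoint_unavailable"
    AI_TIMEOUT_OR_REJECTION = "ai_timeout_or_rejection"
    PARTIAL_DATA = "partial_data"


class Recovery(Enum):
    # Hypothetical recovery strategies a capability might declare.
    RETRY = "retry"
    FALLBACK_DETERMINISTIC = "fallback_deterministic"
    RETURN_PARTIAL = "return_partial"
    DEFER = "defer"
    FAIL = "fail"


@dataclass(frozen=True)
class CapabilityContract:
    name: str
    failure_modes: Dict[FailureMode, Recovery] = field(default_factory=dict)

    def recovery_for(self, mode: FailureMode) -> Recovery:
        # Assumption: an undeclared failure mode surfaces as an explicit
        # failure, so nothing is handled silently or ambiguously.
        return self.failure_modes.get(mode, Recovery.FAIL)


summarise = CapabilityContract(
    name="summarise_document",  # hypothetical capability name
    failure_modes={
        FailureMode.AI_TIMEOUT_OR_REJECTION: Recovery.FALLBACK_DETERMINISTIC,
        FailureMode.ENDPOINT_UNAVAILABLE: Recovery.RETRY,
    },
)
```

Keeping the mapping inside the contract lets an orchestrator look up the declared recovery strategy without knowing anything about the capability's internals.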
16.2 Graceful Degradation
When a capability cannot complete at full fidelity, it degrades gracefully rather than failing outright.
Degradation strategies include:
- Switching from AI-assisted to deterministic execution
- Returning partial results with reduced confidence
- Deferring execution for asynchronous completion
Degradation:
- Is capability-specific
- Is declared in capability contracts
- Preserves user progress and intent
This keeps a capability usable, at reduced fidelity or with delayed completion, when a dependency fails or is slow.
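The first two strategies can be sketched as follows, under stated assumptions: `Result`, `run_with_degradation`, `flaky_ai`, and `keyword_fallback` are hypothetical names, the confidence values are placeholders, and an AI-assisted timeout is assumed to surface as Python's built-in `TimeoutError`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Result:
    value: str
    confidence: float  # placeholder scale from 0.0 to 1.0
    path: str          # "ai_assisted" or "deterministic"
    degraded: bool = False


def run_with_degradation(
    ai_executor: Callable[[str], str],
    deterministic_executor: Callable[[str], str],
    request: str,
    degraded_confidence: float = 0.5,  # assumed value, set by the contract
) -> Result:
    try:
        return Result(ai_executor(request), confidence=1.0, path="ai_assisted")
    except TimeoutError:
        # Degrade explicitly: the caller can see which path produced the
        # result and that confidence has been reduced.
        value = deterministic_executor(request)
        return Result(value, degraded_confidence, "deterministic", degraded=True)


def flaky_ai(request: str) -> str:
    # Stand-in for an AI-assisted executor that times out.
    raise TimeoutError("AI-assisted executor timed out")


def keyword_fallback(request: str) -> str:
    # Stand-in for a deterministic executor: returns the first three words.
    return " ".join(request.split()[:3])


degraded = run_with_degradation(flaky_ai, keyword_fallback, "renew the annual policy today")
```

Because the degraded result carries its path and confidence, the user's request is still answered and the degradation remains observable rather than silent.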
16.3 Retries and Idempotency
Retries are controlled and predictable.
Rules:
- Idempotent operations are safe to retry automatically
- Non-idempotent operations are retried only under explicitly declared retry semantics
- Retry limits and backoff are policy-driven
Idempotency:
- Is declared per capability
- Uses explicit idempotency keys where applicable
- Prevents duplicate side effects
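Bounded retries with backoff limit the load placed on a failing dependency, and idempotency keys prevent a retried operation from applying its side effects twice.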
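A minimal sketch of these rules, assuming an exponential backoff policy and an in-memory record of completed keys. `RetryPolicy` and `IdempotentExecutor` are hypothetical names, transient failures are assumed to surface as `ConnectionError`, and a real deployment would need a durable, shared key store rather than a dictionary.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class RetryPolicy:
    # Policy-driven limits; the defaults here are illustrative only.
    max_attempts: int = 3
    base_delay_s: float = 0.1
    multiplier: float = 2.0


class IdempotentExecutor:
    def __init__(self, policy: RetryPolicy,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._policy = policy
        self._sleep = sleep  # injectable so tests do not have to wait
        self._completed: Dict[str, object] = {}

    def execute(self, idempotency_key: str,
                operation: Callable[[], object]) -> object:
        # A key that has already completed returns the stored result, so
        # the side effect is not applied a second time.
        if idempotency_key in self._completed:
            return self._completed[idempotency_key]
        delay = self._policy.base_delay_s
        for attempt in range(1, self._policy.max_attempts + 1):
            try:
                result = operation()
            except ConnectionError:
                if attempt == self._policy.max_attempts:
                    raise  # retry limit reached: fail explicitly
                self._sleep(delay)
                delay *= self._policy.multiplier
            else:
                self._completed[idempotency_key] = result
                return result
        raise ValueError("max_attempts must be at least 1")
```

The key only guards against repeating an operation that completed. An operation that applies a side effect and then raises is still re-run, so automatic retry remains safe only when the operation itself is idempotent or the downstream system honours the same key.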
16.4 Observability
Observability is mandatory and end-to-end.
Observability includes:
- Structured logs
- Metrics for latency, success rate, and cost
- Distributed tracing across capability invocations
Observability data:
- Is correlated by capability and invocation
- Distinguishes deterministic and AI-assisted paths
- Supports diagnosis, optimisation, and governance
This provides the feedback loops necessary for continuous improvement.
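One way to produce correlated, structured records is sketched below. The `observe` helper and its field names are assumptions for illustration, not a prescribed schema; it emits one JSON log line per invocation, and a real system would typically feed metrics and a tracing library as well.

```python
import json
import time
import uuid
from contextlib import contextmanager


@contextmanager
def observe(capability: str, path: str, emit=print):
    # One record per invocation, correlated by capability and invocation id
    # and tagged with the execution path ("deterministic" or "ai_assisted").
    record = {
        "capability": capability,
        "invocation_id": str(uuid.uuid4()),
        "path": path,
        "outcome": "success",
    }
    start = time.perf_counter()
    try:
        yield record  # the caller may attach extra fields such as cost
    except Exception as exc:
        record["outcome"] = "failure"
        record["error_type"] = type(exc).__name__
        raise  # the failure stays explicit; it is recorded, not swallowed
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(record, sort_keys=True))
```

A caller would wrap each invocation, for example `with observe("summarise_document", "ai_assisted") as rec: rec["cost_units"] = 12`, where the capability name and the `cost_units` field are again hypothetical.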