16. Reliability and Resilience

This section defines how the architecture behaves under failure conditions and how reliability is achieved without excessive coupling or complexity. Resilience is designed in, not retrofitted.

16.1 Failure Modes

Failure modes are explicitly recognised and handled.

Common failure modes include:

  • Capability execution failure
  • Integration endpoint unavailability
  • AI-assisted executor timeout or rejection
  • Partial data availability

Principles:

  • Failures are contained within their execution boundary
  • Failures are explicit and observable
  • Silent or ambiguous failure is avoided

Each capability declares its expected failure modes and recovery strategies.
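
One way such a declaration could look is sketched below. This is a minimal illustration, not the architecture's actual contract format: the names `FailureMode`, `Recovery`, `CapabilityContract`, and `recovery_for` are hypothetical, and the enum members simply mirror the failure modes listed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class FailureMode(Enum):
    # Mirrors the common failure modes listed in this section.
    EXECUTION_FAILURE = "execution_failure"
    ENDPOINT_UNAVAILABLE = "endpoint_unavailable"
    AI_TIMEOUT_OR_REJECTION = "ai_timeout_or_rejection"
    PARTIAL_DATA = "partial_data"


class Recovery(Enum):
    # Hypothetical recovery strategies, chosen for illustration only.
    RETRY = "retry"
    FALLBACK_DETERMINISTIC = "fallback_deterministic"
    RETURN_PARTIAL = "return_partial"
    DEFER = "defer"
    FAIL = "fail"


@dataclass
class CapabilityContract:
    """Hypothetical contract: maps each expected failure mode to a recovery."""

    name: str
    recoveries: Dict[FailureMode, Recovery] = field(default_factory=dict)

    def recovery_for(self, mode: FailureMode) -> Recovery:
        # An undeclared failure mode fails explicitly instead of being
        # swallowed, in line with "silent or ambiguous failure is avoided".
        return self.recoveries.get(mode, Recovery.FAIL)


summarise = CapabilityContract(
    name="summarise_document",
    recoveries={
        FailureMode.AI_TIMEOUT_OR_REJECTION: Recovery.FALLBACK_DETERMINISTIC,
        FailureMode.ENDPOINT_UNAVAILABLE: Recovery.RETRY,
    },
)
```

Defaulting undeclared modes to `Recovery.FAIL` is a design choice of this sketch: it keeps an unexpected failure visible to the caller rather than guessing at a recovery.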

16.2 Graceful Degradation

When a capability cannot complete at full fidelity, the architecture degrades gracefully rather than failing outright.

Degradation strategies include:

  • Switching from AI-assisted to deterministic execution
  • Returning partial results with reduced confidence
  • Deferring execution for asynchronous completion

Degradation:

  • Is capability-specific
  • Is declared in capability contracts
  • Preserves user progress and intent

This keeps the system operating, possibly at reduced fidelity, when a dependency fails or slows down.
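
The first two strategies above (switching to deterministic execution, and returning a result with reduced confidence) can be sketched as follows. All names here (`Result`, `run_with_fallback`, `timing_out_ai`, `keyword_rules`) are hypothetical, and the confidence values are placeholder assumptions rather than anything the architecture prescribes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    value: str
    confidence: float  # placeholder scale: 1.0 = full, lower = degraded
    degraded: bool
    path: str  # "ai_assisted" or "deterministic"


def run_with_fallback(
    request: str,
    ai_executor: Callable[[str], str],
    deterministic_executor: Callable[[str], str],
    fallback_confidence: float = 0.5,  # assumed value for illustration
) -> Result:
    """Try the AI-assisted path; degrade to the deterministic path on failure."""
    try:
        return Result(ai_executor(request), 1.0, False, "ai_assisted")
    except (TimeoutError, ValueError):
        # Timeout or rejection: degrade explicitly and mark the result so
        # the caller can see that a reduced-confidence path was used.
        value = deterministic_executor(request)
        return Result(value, fallback_confidence, True, "deterministic")


def timing_out_ai(request: str) -> str:
    # Stand-in for an AI-assisted executor that exceeds its time budget.
    raise TimeoutError("AI-assisted executor timed out")


def keyword_rules(request: str) -> str:
    # Stand-in deterministic executor: a trivial keyword classifier.
    return "billing" if "invoice" in request.lower() else "general"


result = run_with_fallback("Where is my invoice?", timing_out_ai, keyword_rules)
```

Because the result carries `degraded` and `path`, the degradation stays explicit and observable instead of being hidden from the caller.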

16.3 Retries and Idempotency

Retries are controlled and predictable.

Rules:

  • Idempotent operations are safe to retry automatically
  • Non-idempotent operations require explicit retry semantics
  • Retry limits and backoff are policy-driven

Idempotency:

  • Is declared per capability
  • Uses explicit idempotency keys where applicable
  • Prevents duplicate side effects

Bounded retries with backoff limit the risk of cascading failures, and idempotency keys prevent unintended duplication of side effects.
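
The rules above can be illustrated with a small sketch combining a bounded retry with exponential backoff and an idempotency-key cache. The names `TransientError`, `retry`, and `IdempotentExecutor` are hypothetical, the in-memory dictionary stands in for whatever durable store a real deployment would use, and the default limits are assumptions rather than recommended policy values.

```python
import time
from typing import Any, Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry(
    operation: Callable[[], Any],
    max_attempts: int = 3,      # assumed policy value
    base_delay: float = 0.1,    # assumed policy value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry an idempotent operation with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))


class IdempotentExecutor:
    """Runs an operation at most once per idempotency key."""

    def __init__(self) -> None:
        # In-memory only; a real system would need a shared, durable store.
        self._results: Dict[str, Any] = {}

    def execute(self, key: str, operation: Callable[[], Any]) -> Any:
        if key in self._results:
            # Duplicate request: return the recorded result and skip
            # the side effect.
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result
```

Passing `sleep` as a parameter keeps the backoff testable without real delays. Note that this sketch records a result only after the operation succeeds, so it does not guard against two concurrent calls with the same key.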

16.4 Observability

Observability is mandatory and end-to-end.

Observability includes:

  • Structured logs
  • Metrics for latency, success rate, and cost
  • Distributed tracing across capability invocations

Observability data:

  • Is correlated by capability and invocation
  • Distinguishes deterministic and AI-assisted paths
  • Supports diagnosis, optimisation, and governance

This provides the feedback loops necessary for continuous improvement.
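
As a rough illustration of correlated, structured observability data, the sketch below wraps a capability invocation and emits one JSON record containing the capability name, an invocation ID, the execution path, the outcome, and the latency. The function `invoke_observed` and the field names are hypothetical. The sketch covers only log-style records; metrics aggregation, cost tracking, and distributed tracing would need dedicated tooling.

```python
import json
import time
import uuid
from typing import Any, Callable


def invoke_observed(
    capability: str,
    path: str,  # "deterministic" or "ai_assisted"
    operation: Callable[[], Any],
    emit: Callable[[str], None] = print,
) -> Any:
    """Run an operation and emit one structured record describing it."""
    record = {
        "capability": capability,
        "invocation_id": str(uuid.uuid4()),  # correlates all data for this call
        "path": path,
    }
    start = time.perf_counter()
    try:
        result = operation()
        record["success"] = True
        return result
    except Exception as exc:
        record["success"] = False
        record["error"] = type(exc).__name__
        raise  # the failure stays explicit; the record is still emitted
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(record))
```

For example, `invoke_observed("summarise_document", "ai_assisted", lambda: "summary")` prints a single JSON line. Because `path` is part of every record, success rate and latency can be compared between deterministic and AI-assisted execution.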