18. Operational Model

This section defines how the architecture is operated day-to-day. The operational model prioritises clarity, predictability, and fast recovery without embedding operational concerns into delivery teams.

18.1 Monitoring and Metrics

Monitoring is capability-centric.

Metrics are collected for:

  • Capability invocation volume
  • Success and failure rates
  • Latency by execution mode
  • Cost indicators for AI-assisted execution

Monitoring principles:

  • Metrics are aligned to capability contracts
  • Deterministic and AI-assisted paths are distinguishable
  • Thresholds and alerts are policy-driven

This provides operational visibility without excessive instrumentation.
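As an illustration only, the monitoring principles above could be sketched as a small in-memory collector. All names here (`CapabilityMetrics`, `ModeStats`, `AlertPolicy`, `ExecutionMode`) are hypothetical and not part of the architecture; the sketch assumes metrics are keyed by capability and execution mode, and that thresholds arrive as policy data rather than being hard-coded.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ExecutionMode(Enum):
    DETERMINISTIC = "deterministic"
    AI_ASSISTED = "ai_assisted"


@dataclass
class ModeStats:
    """Running totals for one (capability, execution mode) pair."""
    invocations: int = 0
    failures: int = 0
    total_latency_ms: float = 0.0
    total_cost: float = 0.0  # cost indicator; expected to stay 0 on deterministic paths

    @property
    def failure_rate(self) -> float:
        return self.failures / self.invocations if self.invocations else 0.0

    @property
    def mean_latency_ms(self) -> float:
        return self.total_latency_ms / self.invocations if self.invocations else 0.0


@dataclass(frozen=True)
class AlertPolicy:
    """Thresholds supplied by policy (assumed shape, for illustration)."""
    max_failure_rate: float
    max_mean_latency_ms: float


class CapabilityMetrics:
    def __init__(self) -> None:
        # Keyed by (capability name, execution mode) so the two paths stay distinguishable.
        self._stats = defaultdict(ModeStats)

    def record(self, capability: str, mode: ExecutionMode, latency_ms: float,
               success: bool, cost: float = 0.0) -> None:
        stats = self._stats[(capability, mode)]
        stats.invocations += 1
        stats.failures += 0 if success else 1
        stats.total_latency_ms += latency_ms
        stats.total_cost += cost

    def stats(self, capability: str, mode: ExecutionMode) -> ModeStats:
        return self._stats[(capability, mode)]

    def breaches(self, capability: str, mode: ExecutionMode,
                 policy: AlertPolicy) -> list[str]:
        """Return the names of the policy thresholds this path currently exceeds."""
        stats = self.stats(capability, mode)
        found = []
        if stats.failure_rate > policy.max_failure_rate:
            found.append("failure_rate")
        if stats.mean_latency_ms > policy.max_mean_latency_ms:
            found.append("mean_latency_ms")
        return found
```

Keeping the execution mode in the key means a rising failure rate on the AI-assisted path can trigger an alert without being averaged away by a healthy deterministic path.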

18.2 Incident Management

Incidents are managed at the capability level.

Incident handling includes:

  • Rapid identification of affected capabilities
  • Isolation of failing execution paths
  • Clear escalation paths to capability owners

Principles:

  • Incidents do not require system-wide shutdowns
  • Degradation strategies are preferred to outages
  • Communication is outcome-focused, not system-focused

This limits blast radius and shortens recovery time.
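One way to picture path isolation and degradation is the minimal sketch below. It is not the architecture's actual mechanism: the `Capability` class, its fields, and the automatic isolate-on-exception behaviour are assumptions made for illustration, loosely in the spirit of a circuit breaker but without timers or automatic recovery.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Capability:
    name: str
    owner: str  # escalation contact for this capability (hypothetical field)
    # Execution paths in order of preference; dicts preserve insertion order.
    paths: dict[str, Callable[[str], str]]
    isolated: set[str] = field(default_factory=set)

    def isolate(self, path: str) -> None:
        """Manually take one execution path out of rotation."""
        if path not in self.paths:
            raise KeyError(f"unknown execution path: {path}")
        self.isolated.add(path)

    def restore(self, path: str) -> None:
        """Put a previously isolated path back into rotation."""
        self.isolated.discard(path)

    def invoke(self, request: str) -> tuple[str, str]:
        """Return (path used, result), skipping isolated paths and isolating failing ones."""
        for path, handler in self.paths.items():
            if path in self.isolated:
                continue
            try:
                return path, handler(request)
            except Exception:
                # Isolate only this path; other paths and capabilities keep running.
                self.isolated.add(path)
        # Every path is isolated: degrade instead of raising to the caller.
        return "degraded", f"{self.name} is temporarily limited; escalated to {self.owner}"
```

In this sketch a failure on one path affects only that path of that capability, which is the property the principles above rely on; a production version would also need a recovery probe and a record of why each path was isolated.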

18.3 Change Management

Change management is lightweight and continuous.

Changes include:

  • Capability version updates
  • Policy and configuration changes
  • Executor substitution or rollback

Rules:

  • Changes are reversible
  • Impact is assessed per capability
  • Changes do not require coordinated releases across teams

This supports safe evolution without central bottlenecks.
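The reversibility and per-capability rules can be sketched as a small versioned change log. `ChangeLog` and its methods are hypothetical names, and the sketch assumes each change is a partial update merged onto the capability's previous state.

```python
class ChangeLog:
    """Keeps an ordered history of configuration states per capability."""

    def __init__(self) -> None:
        # capability name -> list of states, oldest first; index 0 is the empty baseline.
        self._history: dict[str, list[dict]] = {}

    def apply(self, capability: str, change: dict) -> dict:
        """Merge a change onto the current state and record the result as a new state."""
        states = self._history.setdefault(capability, [{}])
        new_state = {**states[-1], **change}
        states.append(new_state)
        return new_state

    def rollback(self, capability: str) -> dict:
        """Discard the latest state for one capability and return the state now in effect."""
        states = self._history.get(capability, [{}])
        if len(states) > 1:
            states.pop()
        return states[-1]

    def current(self, capability: str) -> dict:
        return self._history.get(capability, [{}])[-1]
```

Because history is held per capability, rolling back an executor substitution for one capability leaves every other capability untouched, so no coordinated release is needed.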

18.4 Operational Runbooks

Runbooks provide actionable operational guidance.

Runbooks include:

  • Common failure scenarios
  • Diagnostic steps
  • Recovery and rollback procedures

Runbooks are:

  • Capability-specific
  • Maintained alongside capability definitions
  • Accessible to both operations and delivery teams

This ensures consistent response under pressure.
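If runbooks live alongside capability definitions, their completeness can be checked mechanically. The sketch below assumes a hypothetical in-code representation (`CapabilityDefinition`, `Scenario`, `runbook_gaps`); the real storage format is not specified in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    diagnostics: list[str]  # diagnostic steps for this failure scenario
    recovery: list[str]     # recovery and rollback procedures


@dataclass
class CapabilityDefinition:
    name: str
    owner: str
    # Runbook entries keyed by failure scenario name.
    runbook: dict[str, Scenario] = field(default_factory=dict)


def runbook_gaps(capabilities: list[CapabilityDefinition]) -> list[str]:
    """List capabilities with no runbook and scenarios missing diagnostics or recovery."""
    gaps = []
    for cap in capabilities:
        if not cap.runbook:
            gaps.append(f"{cap.name}: no runbook")
        for scenario, steps in cap.runbook.items():
            if not steps.diagnostics or not steps.recovery:
                gaps.append(f"{cap.name}: incomplete scenario '{scenario}'")
    return gaps
```

A check like this could run whenever a capability definition changes, so that a missing or incomplete runbook is caught before an incident rather than during one.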