18. Operational Model
This section defines how the architecture is operated day to day. The operational model prioritises clarity, predictability, and fast recovery, without embedding operational concerns into delivery teams.
18.1 Monitoring and Metrics
Monitoring is capability-centric.
Metrics are collected for:
- Capability invocation volume
- Success and failure rates
- Latency by execution mode
- Cost indicators for AI-assisted execution
Monitoring principles:
- Metrics are aligned to capability contracts
- Deterministic and AI-assisted paths are distinguishable
- Thresholds and alerts are policy-driven
This provides operational visibility without excessive instrumentation.
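As an illustration only, the metrics and principles above could be sketched as a small in-memory recorder keyed by capability and execution mode. The class names, the mode labels ("deterministic", "ai_assisted"), the capability name, and the shape of the policy mapping are all assumptions made for this sketch; they are not defined by the architecture.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModeStats:
    """Counters for one (capability, execution mode) pair."""
    invocations: int = 0
    failures: int = 0
    total_latency_ms: float = 0.0
    total_cost: float = 0.0  # cost indicator; expected to stay 0 for deterministic paths

    @property
    def failure_rate(self) -> float:
        return self.failures / self.invocations if self.invocations else 0.0

    @property
    def mean_latency_ms(self) -> float:
        return self.total_latency_ms / self.invocations if self.invocations else 0.0


class CapabilityMetrics:
    """Hypothetical recorder: one set of counters per capability and mode."""

    def __init__(self):
        self._stats = defaultdict(ModeStats)

    def record(self, capability, mode, ok, latency_ms, cost=0.0):
        s = self._stats[(capability, mode)]
        s.invocations += 1
        s.failures += 0 if ok else 1
        s.total_latency_ms += latency_ms
        s.total_cost += cost

    def stats(self, capability, mode):
        return self._stats[(capability, mode)]

    def breaches(self, policy):
        """Return the (capability, mode) pairs whose failure rate exceeds
        the threshold that the policy mapping assigns to them."""
        over = []
        for key, s in self._stats.items():
            limit = policy.get(key)
            if limit is not None and s.failure_rate > limit:
                over.append(key)
        return sorted(over)


metrics = CapabilityMetrics()
metrics.record("classify-document", "deterministic", ok=True, latency_ms=12.0)
metrics.record("classify-document", "ai_assisted", ok=False, latency_ms=840.0, cost=0.004)
metrics.record("classify-document", "ai_assisted", ok=True, latency_ms=760.0, cost=0.004)

# Thresholds live in policy data, not in the recording code.
policy = {
    ("classify-document", "deterministic"): 0.05,
    ("classify-document", "ai_assisted"): 0.25,
}
print(metrics.breaches(policy))  # [('classify-document', 'ai_assisted')]
```

Keeping the execution mode in the key is what makes the two paths distinguishable, and passing thresholds in as data is one simple way to keep alerting policy-driven.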
18.2 Incident Management
Incidents are managed at the capability level.
Incident handling includes:
- Rapid identification of affected capabilities
- Isolation of failing execution paths
- Clear escalation paths to capability owners
Principles:
- Incidents do not require system-wide shutdowns
- Degradation strategies are preferred to outages
- Communication is outcome-focused, not system-focused
This limits the blast radius of an incident and shortens recovery time.
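One way to picture isolation and degradation at the capability level is the sketch below. It assumes each capability has an ordered list of execution paths and that a path is isolated after a fixed number of consecutive failures; the names ExecutionPath and Capability, the max_failures threshold, and the owner field are hypothetical, chosen for illustration rather than taken from the architecture.

```python
from typing import Callable, List, Tuple


class ExecutionPath:
    """One way of executing a capability (for example AI-assisted or deterministic)."""

    def __init__(self, name: str, run: Callable[[str], str], max_failures: int = 3):
        self.name = name
        self.run = run
        self.max_failures = max_failures  # assumed isolation threshold
        self.consecutive_failures = 0

    @property
    def isolated(self) -> bool:
        # In this sketch an isolated path stays isolated until someone
        # resets the counter; automatic recovery is out of scope.
        return self.consecutive_failures >= self.max_failures


class Capability:
    def __init__(self, name: str, owner: str, paths: List[ExecutionPath]):
        self.name = name
        self.owner = owner  # escalation target for this capability
        self.paths = paths  # preferred path first

    def invoke(self, request: str) -> Tuple[str, str]:
        """Try each non-isolated path in order and return (path name, result)."""
        for path in self.paths:
            if path.isolated:
                continue
            try:
                result = path.run(request)
            except Exception:
                path.consecutive_failures += 1
                continue  # degrade to the next path instead of failing outright
            path.consecutive_failures = 0
            return path.name, result
        raise RuntimeError(
            f"{self.name}: all execution paths unavailable; escalate to {self.owner}"
        )
```

Only the failing path of the affected capability is taken out of service; other paths and other capabilities keep running, and the error raised when no path remains names the owner to escalate to.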
18.3 Change Management
Change management is lightweight and continuous.
Changes include:
- Capability version updates
- Policy and configuration changes
- Executor substitution or rollback
Rules:
- Changes are reversible
- Impact is assessed per capability
- Changes do not require coordinated releases across teams
This supports safe evolution without central bottlenecks.
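The reversibility rule can be illustrated with a minimal sketch in which each capability keeps its own history of executors, so a substitution for one capability can be rolled back without touching any other. ExecutorRegistry, its method names, and the executor identifiers are assumptions for this example only.

```python
class ExecutorRegistry:
    """Hypothetical per-capability record of which executor is active."""

    def __init__(self):
        # capability name -> list of executor identifiers, newest last
        self._history = {}

    def substitute(self, capability: str, executor: str) -> None:
        """Make `executor` the active executor for one capability."""
        self._history.setdefault(capability, []).append(executor)

    def current(self, capability: str) -> str:
        return self._history[capability][-1]

    def rollback(self, capability: str) -> str:
        """Revert one capability to its previous executor and return it."""
        history = self._history[capability]
        if len(history) < 2:
            raise ValueError(f"{capability}: no earlier executor to roll back to")
        history.pop()
        return history[-1]


registry = ExecutorRegistry()
registry.substitute("extract-invoice", "rules-v1")
registry.substitute("extract-invoice", "llm-v2")
registry.substitute("route-ticket", "rules-v3")

registry.rollback("extract-invoice")
print(registry.current("extract-invoice"))  # rules-v1
print(registry.current("route-ticket"))     # rules-v3
```

Because history is held per capability, the rollback above leaves "route-ticket" unchanged, which mirrors the rule that changes do not require coordinated releases across teams.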
18.4 Operational Runbooks
Runbooks provide actionable operational guidance.
Runbooks include:
- Common failure scenarios
- Diagnostic steps
- Recovery and rollback procedures
Runbooks:
- Are capability-specific
- Are maintained alongside capability definitions
- Are accessible to both operations and delivery teams
This supports a consistent response under pressure.
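A runbook kept alongside a capability definition could be represented as structured data rather than free text. The sketch below is one possible shape; the Scenario and Runbook types, their field names, and the example scenario are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Scenario:
    """One common failure scenario with its diagnostic and recovery steps."""
    symptom: str
    diagnostics: Tuple[str, ...]
    recovery: Tuple[str, ...]


@dataclass
class Runbook:
    """Runbook scoped to a single capability."""
    capability: str
    scenarios: Dict[str, Scenario] = field(default_factory=dict)

    def add(self, key: str, scenario: Scenario) -> None:
        self.scenarios[key] = scenario

    def steps(self, key: str) -> List[str]:
        """Return diagnostic steps followed by recovery steps, in order."""
        s = self.scenarios[key]
        return [*s.diagnostics, *s.recovery]


runbook = Runbook(capability="extract-invoice")
runbook.add(
    "ai-path-timeouts",
    Scenario(
        symptom="AI-assisted path exceeds its latency threshold",
        diagnostics=("Check latency by execution mode", "Check failure rate for the AI-assisted path"),
        recovery=("Isolate the AI-assisted path", "Roll back to the previous executor"),
    ),
)
for step in runbook.steps("ai-path-timeouts"):
    print(step)
```

Storing runbooks in a structured form like this makes it straightforward to version them with the capability definition and to present the same steps to operations and delivery teams.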