15. Performance and Cost Management
This section defines how performance and cost are actively managed across the architecture. Performance and cost are treated as first-class design constraints, not operational afterthoughts.
15.1 Latency Budgets
Latency is managed through explicit latency budgets.
Latency budgets:
- Are defined per capability
- Are aligned to user interaction patterns
- Are enforced by policy where possible
Budget classes (examples):
- Class A: < 100 ms (real-time UI interactions)
- Class B: 100-500 ms (interactive experiences)
- Class C: 500 ms to 2 s (non-critical synchronous actions)
- Class D: asynchronous / background (minutes or longer)
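The budget classes above can be expressed as data so that policy can enforce them. A minimal sketch, assuming the example thresholds above; the names and the dictionary shape are illustrative, not a defined schema:

```python
from enum import Enum

class BudgetClass(Enum):
    """Illustrative latency budget classes from the examples above."""
    A = "A"  # sub-100 ms, real-time UI interactions
    B = "B"  # 100-500 ms, interactive experiences
    C = "C"  # 500 ms to 2 s, non-critical synchronous actions
    D = "D"  # asynchronous / background

# Upper bound in milliseconds; None means no synchronous bound (background).
BUDGET_LIMIT_MS = {
    BudgetClass.A: 100,
    BudgetClass.B: 500,
    BudgetClass.C: 2_000,
    BudgetClass.D: None,
}

def within_budget(budget: BudgetClass, observed_ms: float) -> bool:
    """Check an observed latency against its class's upper bound."""
    limit = BUDGET_LIMIT_MS[budget]
    return limit is None or observed_ms <= limit
```

Declaring the limits as data keeps the budget per capability and lets a gateway or monitor enforce it uniformly.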
Budgets distinguish between:
- User-facing synchronous interactions
- Background or asynchronous execution
- AI-assisted execution paths
When latency budgets cannot be met:
- Degradation strategies are applied
- Execution mode may be constrained
- Users are informed explicitly where relevant
This ensures predictable user experience under varying conditions.
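The fallback rules above can be wired into path selection. The sketch below assumes per-mode latency estimates are available and that the AI path is preferred when both fit; both assumptions are illustrative, not part of the architecture:

```python
from typing import NamedTuple, Optional

class ExecutionPlan(NamedTuple):
    mode: str         # "ai" | "deterministic" | "async"
    degraded: bool    # a fallback strategy was applied
    notify_user: bool # degradation is surfaced to the user

def plan_execution(budget_ms: Optional[float],
                   est_ai_ms: float,
                   est_det_ms: float) -> ExecutionPlan:
    """Pick an execution mode that fits the latency budget.

    A budget of None means background execution (no synchronous bound).
    If the preferred path exceeds the budget, execution mode is
    constrained to a cheaper path; if nothing fits, fall back to
    asynchronous execution and inform the user explicitly.
    """
    if budget_ms is None:
        return ExecutionPlan("async", degraded=False, notify_user=False)
    if est_ai_ms <= budget_ms:
        return ExecutionPlan("ai", degraded=False, notify_user=False)
    if est_det_ms <= budget_ms:
        return ExecutionPlan("deterministic", degraded=True, notify_user=False)
    return ExecutionPlan("async", degraded=True, notify_user=True)
```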
15.2 Deterministic vs AI Cost Profiles
Deterministic and AI-assisted execution have distinct cost profiles.
Deterministic execution:
- Predictable and low variance
- Scales linearly with usage
- Preferred for high-volume workloads
AI-assisted execution:
- Higher and more variable cost
- Sensitive to input size and complexity
- Constrained by quotas and policies
Cost profiles:
- Are declared per capability
- Influence execution path selection
- Are visible to governance and operations teams
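A per-capability cost profile declaration might look like the following sketch. The field names, capability names, and cost unit are all illustrative assumptions, not a defined schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostProfile:
    """Illustrative per-capability cost declaration."""
    capability: str
    execution_mode: str       # "deterministic" or "ai"
    est_cost_per_call: float  # expected cost in an internal cost unit
    cost_variance: str        # "low" | "high"; AI paths are typically high
    cost_sensitive: bool      # hint for cost-aware path selection

PROFILES = [
    CostProfile("invoice.parse", "deterministic", 0.001, "low", cost_sensitive=True),
    CostProfile("invoice.summarise", "ai", 0.12, "high", cost_sensitive=False),
]

def cost_report(profiles):
    """Export profiles for governance and operations visibility."""
    return {p.capability: (p.execution_mode, p.est_cost_per_call)
            for p in profiles}
```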
Cost controls (examples):
- Per-capability token/call budgets with enforcement at the gateway
- Priority-based execution where high-priority consumers can exceed soft quotas
- Cost-aware executor selection (prefer deterministic when cost sensitivity flag set)
- Sampling and canarying: limit AI usage initially and measure ROI
This enables informed trade-offs between flexibility and cost.
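The first two controls above can be sketched as a gateway-side quota check. The soft/hard limit semantics, where high-priority consumers may exceed the soft quota up to a hard ceiling, are an assumed policy, not a defined one:

```python
class TokenBudget:
    """Per-capability token budget enforced at the gateway (sketch)."""

    def __init__(self, soft_limit: int, hard_limit: int):
        assert soft_limit <= hard_limit
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.used = 0

    def try_consume(self, tokens: int, high_priority: bool = False) -> bool:
        """Admit an invocation if it fits within the applicable limit.

        Normal consumers are bounded by the soft quota; high-priority
        consumers may exceed it, up to the hard ceiling.
        """
        limit = self.hard_limit if high_priority else self.soft_limit
        if self.used + tokens > limit:
            return False  # gateway rejects or reroutes the invocation
        self.used += tokens
        return True
```

A rejected invocation is where cost-aware executor selection comes in: the gateway can fall back to a deterministic path instead of failing outright.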
15.3 Caching and Reuse
Caching is used to reduce latency and cost where appropriate.
Caching strategies include:
- Result caching for deterministic capabilities
- Partial caching of intermediate results
- Reuse of parsed or normalised artefacts
Rules:
- Caching respects data sensitivity and classification
- Cache invalidation is explicit
- AI-assisted outputs are cached cautiously and selectively
- Cache policies should specify TTL, invalidation triggers, and allowed storage zones
Caching improves efficiency without compromising correctness or trust.
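The rules above (explicit invalidation, TTL, allowed storage zones, sensitivity classification) can be carried in a cache policy object. Classification labels and zone names in this sketch are illustrative assumptions:

```python
import time
from dataclasses import dataclass, field

SENSITIVITY_ORDER = ["public", "internal", "confidential"]  # illustrative

@dataclass(frozen=True)
class CachePolicy:
    ttl_seconds: float
    allowed_zones: frozenset  # e.g. frozenset({"internal"}); illustrative
    max_sensitivity: str      # highest classification allowed in this cache

@dataclass
class Cache:
    policy: CachePolicy
    zone: str
    _store: dict = field(default_factory=dict)

    def put(self, key, value, sensitivity: str) -> bool:
        # Caching respects storage zone and data classification.
        if self.zone not in self.policy.allowed_zones:
            return False
        if (SENSITIVITY_ORDER.index(sensitivity)
                > SENSITIVITY_ORDER.index(self.policy.max_sensitivity)):
            return False
        self._store[key] = (value, time.monotonic() + self.policy.ttl_seconds)
        return True

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:  # TTL expiry
            del self._store[key]
            return None
        return value

    def invalidate(self, key):
        """Invalidation is explicit, never implicit."""
        self._store.pop(key, None)
```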
15.4 Scaling Characteristics
The architecture is designed so that each layer scales independently along its own dimension.
Scaling characteristics:
- DXP scales with user demand
- Capability gateway scales with invocation volume
- Executors scale independently based on workload type
AI-assisted execution:
- Scales more slowly and is capacity-constrained
- Is protected by quotas and prioritisation
- Is isolated from deterministic execution paths
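The isolation above can be sketched as separate bounded admission pools, so exhausting AI capacity rejects (or queues) AI invocations rather than blocking the deterministic path. Pool sizes here are illustrative:

```python
import threading

class IsolatedPool:
    """Bounded, non-blocking admission to a capacity-constrained path."""

    def __init__(self, capacity: int):
        self._slots = threading.BoundedSemaphore(capacity)

    def try_admit(self) -> bool:
        # Non-blocking: over-capacity requests are rejected rather than
        # waiting, so one path cannot degrade another.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

# Separate pools: AI exhaustion cannot starve deterministic execution.
ai_pool = IsolatedPool(capacity=2)            # illustrative sizes
deterministic_pool = IsolatedPool(capacity=64)
```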
Monitoring metrics to track:
- Invocation volume per capability and execution mode
- Latency P50/P95/P99 by capability and path
- Cost per invocation and token usage (AI)
- Cache hit rates and reuse benefits
- Human review time saved (where applicable)
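The latency percentiles listed above can be computed per capability and execution mode. A minimal nearest-rank sketch; a production system would more likely use a streaming estimator or a metrics backend:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (P50/P95/P99) over recorded samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

class LatencyTracker:
    """Latency samples keyed by (capability, execution mode)."""

    def __init__(self):
        self._samples = {}

    def record(self, capability: str, mode: str, latency_ms: float):
        self._samples.setdefault((capability, mode), []).append(latency_ms)

    def snapshot(self, capability: str, mode: str):
        s = self._samples[(capability, mode)]
        return {p: percentile(s, p) for p in (50, 95, 99)}
```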
This separation ensures that high-cost execution does not degrade baseline system performance.