15. Performance and Cost Management
This section defines how performance and cost are actively managed across the architecture. Performance and cost are treated as first-class design constraints, not operational afterthoughts.
15.1 Latency Budgets
Latency is managed through explicit latency budgets.
Latency budgets:
- Are defined per capability
- Are aligned to user interaction patterns
- Are enforced by policy where possible
Budget classes (examples):
- Class A: < 100 ms (real-time UI interactions)
- Class B: 100-500 ms (interactive experiences)
- Class C: 500 ms to 2 s (non-critical synchronous actions)
- Class D: asynchronous / background (minutes or longer)
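The budget classes above can be expressed as data so that policy can enforce them. A minimal sketch, assuming the example thresholds above; the names and the dictionary shape are illustrative, not a defined schema:

```python
from enum import Enum

class BudgetClass(Enum):
    """Illustrative latency budget classes from the examples above."""
    A = "A"  # sub-100 ms, real-time UI interactions
    B = "B"  # 100-500 ms, interactive experiences
    C = "C"  # 500 ms to 2 s, non-critical synchronous actions
    D = "D"  # asynchronous / background

# Upper bound in milliseconds; None means no synchronous bound (background).
BUDGET_LIMIT_MS = {
    BudgetClass.A: 100,
    BudgetClass.B: 500,
    BudgetClass.C: 2_000,
    BudgetClass.D: None,
}

def within_budget(budget: BudgetClass, observed_ms: float) -> bool:
    """Check an observed latency against its class's upper bound."""
    limit = BUDGET_LIMIT_MS[budget]
    return limit is None or observed_ms <= limit
```

Declaring the limits as data keeps the budget per capability and lets a gateway or monitor enforce it uniformly.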
Budgets distinguish between:
- User-facing synchronous interactions
- Background or asynchronous execution
- AI-assisted execution paths
When latency budgets cannot be met:
- Degradation strategies are applied
- Execution mode may be constrained
- Users are informed explicitly where relevant
This ensures predictable user experience under varying conditions.
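The fallback rules above can be wired into path selection. The sketch below assumes per-mode latency estimates are available and that the AI path is preferred when both fit; both assumptions are illustrative, not part of the architecture:

```python
from typing import NamedTuple, Optional

class ExecutionPlan(NamedTuple):
    mode: str         # "ai" | "deterministic" | "async"
    degraded: bool    # a fallback strategy was applied
    notify_user: bool # degradation is surfaced to the user

def plan_execution(budget_ms: Optional[float],
                   est_ai_ms: float,
                   est_det_ms: float) -> ExecutionPlan:
    """Pick an execution mode that fits the latency budget.

    A budget of None means background execution (no synchronous bound).
    If the preferred path exceeds the budget, execution mode is
    constrained to a cheaper path; if nothing fits, fall back to
    asynchronous execution and inform the user explicitly.
    """
    if budget_ms is None:
        return ExecutionPlan("async", degraded=False, notify_user=False)
    if est_ai_ms <= budget_ms:
        return ExecutionPlan("ai", degraded=False, notify_user=False)
    if est_det_ms <= budget_ms:
        return ExecutionPlan("deterministic", degraded=True, notify_user=False)
    return ExecutionPlan("async", degraded=True, notify_user=True)
```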
15.2 Deterministic vs AI Cost Profiles
Deterministic and AI-assisted execution have distinct cost profiles.
Deterministic execution:
- Predictable and low variance
- Scales linearly with usage
- Preferred for high-volume workloads
AI-assisted execution:
- Higher and more variable cost
- Sensitive to input size and complexity
- Constrained by quotas and policies
Cost profiles:
- Are declared per capability
- Influence execution path selection
- Are visible to governance and operations teams
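A per-capability cost profile declaration might look like the following sketch. The field names, capability names, and cost unit are all illustrative assumptions, not a defined schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostProfile:
    """Illustrative per-capability cost declaration."""
    capability: str
    execution_mode: str       # "deterministic" or "ai"
    est_cost_per_call: float  # expected cost in an internal cost unit
    cost_variance: str        # "low" | "high"; AI paths are typically high
    cost_sensitive: bool      # hint for cost-aware path selection

PROFILES = [
    CostProfile("invoice.parse", "deterministic", 0.001, "low", cost_sensitive=True),
    CostProfile("invoice.summarise", "ai", 0.12, "high", cost_sensitive=False),
]

def cost_report(profiles):
    """Export profiles for governance and operations visibility."""
    return {p.capability: (p.execution_mode, p.est_cost_per_call)
            for p in profiles}
```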
Cost controls (examples):
- Per-capability token/call budgets with enforcement at the gateway
- Priority-based execution where high-priority consumers can exceed soft quotas
- Cost-aware executor selection (prefer deterministic when cost sensitivity flag set)
- Sampling and canarying: limit AI usage initially and measure ROI
This enables informed trade-offs between flexibility and cost.
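The first two controls above can be sketched as a gateway-side quota check. The soft/hard limit semantics, where high-priority consumers may exceed the soft quota up to a hard ceiling, are an assumed policy, not a defined one:

```python
class TokenBudget:
    """Per-capability token budget enforced at the gateway (sketch)."""

    def __init__(self, soft_limit: int, hard_limit: int):
        assert soft_limit <= hard_limit
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.used = 0

    def try_consume(self, tokens: int, high_priority: bool = False) -> bool:
        """Admit an invocation if it fits within the applicable limit.

        Normal consumers are bounded by the soft quota; high-priority
        consumers may exceed it, up to the hard ceiling.
        """
        limit = self.hard_limit if high_priority else self.soft_limit
        if self.used + tokens > limit:
            return False  # gateway rejects or reroutes the invocation
        self.used += tokens
        return True
```

A rejected invocation is where cost-aware executor selection comes in: the gateway can fall back to a deterministic path instead of failing outright.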
15.3 Caching and Reuse
Caching is used to reduce latency and cost where appropriate.
Caching strategies include:
- Result caching for deterministic capabilities
- Partial caching of intermediate results
- Reuse of parsed or normalised artefacts
Rules:
- Caching respects data sensitivity and classification
- Cache invalidation is explicit
- AI-assisted outputs are cached cautiously and selectively
- Cache policies should specify TTL, invalidation triggers, and allowed storage zones
Caching improves efficiency without compromising correctness or trust.
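The rules above (explicit invalidation, TTL, allowed storage zones, sensitivity classification) can be carried in a cache policy object. Classification labels and zone names in this sketch are illustrative assumptions:

```python
import time
from dataclasses import dataclass, field

SENSITIVITY_ORDER = ["public", "internal", "confidential"]  # illustrative

@dataclass(frozen=True)
class CachePolicy:
    ttl_seconds: float
    allowed_zones: frozenset  # e.g. frozenset({"internal"}); illustrative
    max_sensitivity: str      # highest classification allowed in this cache

@dataclass
class Cache:
    policy: CachePolicy
    zone: str
    _store: dict = field(default_factory=dict)

    def put(self, key, value, sensitivity: str) -> bool:
        # Caching respects storage zone and data classification.
        if self.zone not in self.policy.allowed_zones:
            return False
        if (SENSITIVITY_ORDER.index(sensitivity)
                > SENSITIVITY_ORDER.index(self.policy.max_sensitivity)):
            return False
        self._store[key] = (value, time.monotonic() + self.policy.ttl_seconds)
        return True

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:  # TTL expiry
            del self._store[key]
            return None
        return value

    def invalidate(self, key):
        """Invalidation is explicit, never implicit."""
        self._store.pop(key, None)
```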
15.4 Scaling Characteristics
The architecture is designed so that each layer scales independently along its own dimension.
Scaling characteristics:
- DXP scales with user demand
- Capability gateway scales with invocation volume
- Executors scale independently based on workload type
AI-assisted execution:
- Scales more slowly and is capacity-constrained
- Is protected by quotas and prioritisation
- Is isolated from deterministic execution paths
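The isolation above can be sketched as separate bounded admission pools, so exhausting AI capacity rejects (or queues) AI invocations rather than blocking the deterministic path. Pool sizes here are illustrative:

```python
import threading

class IsolatedPool:
    """Bounded, non-blocking admission to a capacity-constrained path."""

    def __init__(self, capacity: int):
        self._slots = threading.BoundedSemaphore(capacity)

    def try_admit(self) -> bool:
        # Non-blocking: over-capacity requests are rejected rather than
        # waiting, so one path cannot degrade another.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

# Separate pools: AI exhaustion cannot starve deterministic execution.
ai_pool = IsolatedPool(capacity=2)            # illustrative sizes
deterministic_pool = IsolatedPool(capacity=64)
```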
Monitoring metrics to track:
- Invocation volume per capability and execution mode
- Latency P50/P95/P99 by capability and path
- Cost per invocation and token usage (AI)
- Cache hit rates and reuse benefits
- Human review time saved (where applicable)
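The latency percentiles listed above can be computed per capability and execution mode. A minimal nearest-rank sketch; a production system would more likely use a streaming estimator or a metrics backend:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (P50/P95/P99) over recorded samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

class LatencyTracker:
    """Latency samples keyed by (capability, execution mode)."""

    def __init__(self):
        self._samples = {}

    def record(self, capability: str, mode: str, latency_ms: float):
        self._samples.setdefault((capability, mode), []).append(latency_ms)

    def snapshot(self, capability: str, mode: str):
        s = self._samples[(capability, mode)]
        return {p: percentile(s, p) for p in (50, 95, 99)}
```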
This separation ensures that high-cost execution does not degrade baseline system performance.