15. Performance and Cost Management

This section defines how performance and cost are actively managed across the architecture. Performance and cost are treated as first-class design constraints, not operational afterthoughts.

15.1 Latency Budgets

Latency is managed through explicit latency budgets.

Latency budgets:

  • Are defined per capability
  • Are aligned to user interaction patterns
  • Are enforced by policy where possible

Budget classes (examples):

  • Class A: sub-100ms (real-time UI interactions)
  • Class B: 100-500ms (interactive experiences)
  • Class C: 500ms-2s (non-critical synchronous actions)
  • Class D: asynchronous / background (minutes+)

Budgets distinguish between:

  • User-facing synchronous interactions
  • Background or asynchronous execution
  • AI-assisted execution paths

When latency budgets cannot be met:

  • Degradation strategies are applied
  • Execution mode may be constrained
  • Users are informed explicitly where relevant

This ensures a predictable user experience under varying conditions.
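The budget classes and degradation behaviour above can be sketched as follows. This is an illustrative sketch only: the budget table mirrors the example classes, while the capability names and the "degrade" action are hypothetical placeholders for whatever degradation strategy a capability declares.

```python
# Latency budgets in seconds, per the example classes above.
LATENCY_BUDGETS_S = {
    "A": 0.100,  # real-time UI interactions
    "B": 0.500,  # interactive experiences
    "C": 2.000,  # non-critical synchronous actions
    "D": None,   # asynchronous / background: no synchronous budget
}

def enforce_budget(capability: str, budget_class: str, observed_s: float) -> str:
    """Decide the action when an invocation's latency is observed."""
    budget = LATENCY_BUDGETS_S[budget_class]
    if budget is None or observed_s <= budget:
        return "ok"
    # Budget exceeded: apply the capability's degradation strategy,
    # e.g. fall back to a cached or deterministic path, and inform
    # the user explicitly where relevant.
    return "degrade"

print(enforce_budget("search-suggest", "A", 0.250))  # 250 ms exceeds the Class A budget
```

In practice enforcement would sit in the invocation path (e.g. at a gateway), but the decision logic is the same: compare observed latency against the declared class and trigger degradation rather than silently violating the budget.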

15.2 Deterministic vs AI Cost Profiles

Deterministic and AI-assisted execution have distinct cost profiles.

Deterministic execution:

  • Predictable and low variance
  • Scales linearly with usage
  • Preferred for high-volume workloads

AI-assisted execution:

  • Higher and more variable cost
  • Sensitive to input size and complexity
  • Constrained by quotas and policies

Cost profiles:

  • Are declared per capability
  • Influence execution path selection
  • Are visible to governance and operations teams

Cost controls (examples):

  • Per-capability token/call budgets with enforcement at the gateway
  • Priority-based execution where high-priority consumers can exceed soft quotas
  • Cost-aware executor selection (prefer deterministic when cost sensitivity flag set)
  • Sampling and canarying: limit AI usage initially and measure ROI

This enables informed trade-offs between flexibility and cost.
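Cost-aware executor selection, the third control above, can be sketched as a small decision function. The capability fields (`cost_sensitive`, `deterministic_available`) and the token-quota model are assumptions for illustration, not a defined interface.

```python
def select_executor(capability: dict, tokens_remaining: int, est_tokens: int) -> str:
    """Choose an execution path from a capability's declared cost profile.

    Prefers the deterministic path when the cost-sensitivity flag is set
    or the remaining AI token budget cannot cover the estimated usage.
    """
    if capability.get("cost_sensitive") or est_tokens > tokens_remaining:
        if capability.get("deterministic_available", True):
            return "deterministic"
    if est_tokens <= tokens_remaining:
        return "ai"
    # Over quota with no deterministic fallback: reject (or queue) the call.
    return "reject"
```

For example, `select_executor({"cost_sensitive": True}, 10_000, 500)` returns `"deterministic"` even though the AI budget would cover the call, because the declared cost profile steers path selection.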

15.3 Caching and Reuse

Caching is used to reduce latency and cost where appropriate.

Caching strategies include:

  • Result caching for deterministic capabilities
  • Partial caching of intermediate results
  • Reuse of parsed or normalised artefacts

Rules:

  • Caching respects data sensitivity and classification
  • Cache invalidation is explicit
  • AI-assisted outputs are cached cautiously and selectively
  • Cache policies should specify TTL, invalidation triggers, and allowed storage zones

Caching improves efficiency without compromising correctness or trust.
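The rules above (sensitivity-aware caching, explicit invalidation, declared TTLs) can be sketched as a minimal result cache. The numeric classification levels and policy fields are illustrative assumptions; a real deployment would bind these to its own data-classification scheme and storage zones.

```python
import time

class ResultCache:
    """Minimal TTL cache that refuses entries above a sensitivity ceiling."""

    def __init__(self, ttl_s: float, max_classification: int):
        self.ttl_s = ttl_s
        self.max_classification = max_classification  # policy ceiling (assumed numeric)
        self._store = {}

    def put(self, key, value, classification: int) -> bool:
        # Respect data sensitivity: refuse to cache above the policy ceiling.
        if classification > self.max_classification:
            return False
        self._store[key] = (value, time.monotonic() + self.ttl_s)
        return True

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # TTL expiry is one invalidation trigger
            return None
        return value

    def invalidate(self, key) -> None:
        # Explicit invalidation, e.g. when a source artefact changes.
        self._store.pop(key, None)
```

AI-assisted outputs would use the same mechanics but stricter policy: shorter TTLs, narrower key scopes, and caching only where outputs are stable for identical inputs.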

15.4 Scaling Characteristics

The architecture scales by design.

Scaling characteristics:

  • DXP scales with user demand
  • Capability gateway scales with invocation volume
  • Executors scale independently based on workload type

AI-assisted execution:

  • Scales more slowly and is capacity-constrained
  • Is protected by quotas and prioritisation
  • Is isolated from deterministic execution paths
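The quota protection described above can be sketched as a soft/hard limit, tying back to the priority-based execution control in 15.2: high-priority consumers may exceed the soft quota, but nothing exceeds the hard cap. The class and its parameters are hypothetical.

```python
class AiQuota:
    """Soft/hard admission control for AI-assisted execution capacity."""

    def __init__(self, soft_limit: int, hard_limit: int):
        self.soft_limit = soft_limit  # normal consumers stop here
        self.hard_limit = hard_limit  # absolute capacity ceiling
        self.used = 0

    def admit(self, cost: int, high_priority: bool = False) -> bool:
        # High-priority consumers may burst past the soft quota,
        # but the hard limit protects overall capacity.
        limit = self.hard_limit if high_priority else self.soft_limit
        if self.used + cost > limit:
            return False
        self.used += cost
        return True
```

Because admission is checked before dispatch, rejected AI calls never consume executor capacity, preserving the isolation between AI-assisted and deterministic paths.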

Monitoring metrics to track:

  • Invocation volume per capability and execution mode
  • Latency P50/P95/P99 by capability and path
  • Cost per invocation and token usage (AI)
  • Cache hit rates and reuse benefits
  • Human review time saved (where applicable)

This separation of execution paths, made observable through the metrics above, ensures that high-cost execution does not degrade baseline system performance.
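The P50/P95/P99 metrics above can be computed from raw latency samples with a nearest-rank percentile, sketched below. Production systems would typically use streaming histograms rather than sorting raw samples; this is a minimal illustration of what the metric means.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Example: per-capability, per-path latency summary.
def latency_summary(samples: list[float]) -> dict[str, float]:
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
```

Tracking these per capability and per execution mode (deterministic vs AI-assisted) is what makes budget violations and cost regressions attributable to a specific path.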