
AISRE: It is time for AI Site Reliability Engineering

Definition of AISRE

AISRE (AI Site Reliability Engineering) applies SRE principles to ensure the predictable, safe, cost‑efficient, and governable operation of AI-powered systems whose runtime behavior depends on probabilistic models, dynamic retrieval pipelines, external vendors, foundation models, and agentic tool chains.

Reliability targets expand from service availability to semantic fidelity (grounded answers), safety/policy adherence, economic efficiency (token & GPU spend), and controlled behavioral evolution (drift resistance & rollback).

From SRE to AISRE
SRE assures that a request is served fast, accurately, and within capacity for largely deterministic systems. AISRE must, in addition, govern a probabilistic, data‑conditioned cognition chain whose intermediate semantics, freshness, and tool actions shape trust, safety, and cost. New surfaces (prompts, indices, embeddings, planners, tool schemas) silently mutate outcomes without tripping infrastructure alarms. Here are some examples:

[Table: examples of silent behavioral changes across these new surfaces]

If any terms in this table or elsewhere in the article are unfamiliar, I suggest referring to the book “AI Engineering: Building Applications with Foundation Models” by Chip Huyen for deeper conceptual background.

Core focus areas of AISRE (layered view)

To illustrate the AISRE objective further, I continue with a layered decomposition, as it reduces metric blindness. Each layer has distinct failure modes, rollback units, and observability primitives. Treat them as separable reliability surfaces, but correlate signals (e.g., retrieval freshness drift preceding a semantic accuracy drop) to shorten MTTD. While many teams don’t implement the foundation model themselves and rely on model APIs, I maintain the logical sequence from foundation to product layer for conceptual clarity. Feel free to focus on the layers most relevant to your implementation.

A. Foundation model layer

The foundation model layer comprises the core generative and embedding models (the base models) plus their adapters; these establish the system’s baseline semantics, latency envelope, and safety guarantees.

  • Model & variant reliability (pin & diff versions)
  • Decoding performance: TTFT, TPOT, and goodput (useful output tokens delivered to the user per second); see the sketch after this list
  • Safety & structured output enforcement
  • Controlled sampling & evaluator determinism
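
As a minimal sketch of how TTFT, TPOT, and goodput could be measured around a streaming completion call (the `stream_completion` generator is a hypothetical stand-in for your model client’s streaming API):

```python
import time

def measure_decoding(stream_completion, prompt):
    """Wrap a streaming completion call and compute TTFT, TPOT, and goodput.

    `stream_completion(prompt)` is a hypothetical generator yielding output
    tokens; swap in whichever streaming client your stack uses.
    """
    start = time.monotonic()
    first_token_at = None
    tokens = []

    for token in stream_completion(prompt):
        now = time.monotonic()
        if first_token_at is None:
            first_token_at = now          # time to first token
        tokens.append(token)

    end = time.monotonic()
    n = len(tokens)
    ttft = (first_token_at - start) if first_token_at else None
    # TPOT: average time per output token after the first one
    tpot = (end - first_token_at) / max(n - 1, 1) if first_token_at else None
    # Goodput here is simply useful output tokens per second of wall-clock time;
    # stricter definitions only count tokens from requests that met their SLO.
    goodput = n / (end - start) if end > start else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot,
            "goodput_tok_per_s": goodput, "output_tokens": n}
```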

B. Retrieval (RAG) layer

RAG is the layer that injects external corpus context into prompts to improve grounding and freshness. Its reliability hinges on ingestion freshness, relevance quality, context efficiency, and index rollback safety. These are the core reliability dimensions of the RAG layer:

  • Corpus ingestion & freshness SLIs
  • Chunking & overlap policy validation
  • Hybrid retrieval precision/recall, rerank_gain, semantic vs lexical blend efficacy
  • Index health
  • Context efficiency

Instrument hybrid retrieval & basic efficiency
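
As a minimal sketch of such instrumentation, assuming you log retrieved chunk IDs per stage and keep a small labelled sample of relevant documents (all function and variable names below are illustrative):

```python
def context_precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

def rerank_gain(pre_rerank_ids, post_rerank_ids, relevant_ids, k=5):
    """Improvement in precision@k attributable to the reranker stage."""
    return (context_precision_at_k(post_rerank_ids, relevant_ids, k)
            - context_precision_at_k(pre_rerank_ids, relevant_ids, k))

def unused_context_ratio(context_chunk_ids, cited_chunk_ids):
    """Share of injected chunks the answer never grounded on (context waste)."""
    if not context_chunk_ids:
        return 0.0
    unused = [c for c in context_chunk_ids if c not in cited_chunk_ids]
    return len(unused) / len(context_chunk_ids)

# Example: one sampled query through the hybrid cascade
bm25_ids     = ["d7", "d2", "d9", "d4", "d1"]   # lexical stage
reranked_ids = ["d2", "d1", "d7", "d3", "d9"]   # after embedding + reranker
relevant_ids = {"d1", "d2", "d3"}                # labelled ground truth
print(context_precision_at_k(reranked_ids, relevant_ids, k=5))  # 0.6
print(rerank_gain(bm25_ids, reranked_ids, relevant_ids, k=5))   # 0.2
```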

C. Agentic / Tooling layer

Planner and tool execution chains extend model capability with iterative reasoning and external side‑effects. This layer’s reliability focuses on plan convergence, action correctness, loop prevention, and safe writes. Essential reliability aspects of this layer are:

  • Plan convergence (plan_steps_p95, loop_abort_rate)
  • Action success & tool error taxonomy (tool_timeout_rate, tool_schema_mismatch_rate)
  • Write-action governance (unsafe_action_block_rate, human_escalation_latency)
  • Multi-tool latency budget (tool_chain_latency_p95)
  • Planner drift (planner_success_delta)

Loop guard / duplicate action breaker
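
As a minimal sketch of such a guard, assuming the planner emits (tool_name, arguments) pairs and the agent loop consults the guard before executing each action (names and thresholds are illustrative):

```python
import hashlib
import json

class LoopGuard:
    """Abort agent runs that exceed a step budget or repeat the same tool call."""

    def __init__(self, max_steps=8, max_duplicate_calls=2):
        self.max_steps = max_steps
        self.max_duplicate_calls = max_duplicate_calls
        self.steps = 0
        self.call_counts = {}

    def _fingerprint(self, tool_name, arguments):
        payload = json.dumps({"tool": tool_name, "args": arguments}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def check(self, tool_name, arguments):
        """Return (allowed, reason). Call before executing each planned action."""
        self.steps += 1
        if self.steps > self.max_steps:
            return False, "max_steps_exceeded"        # feeds loop_abort_rate
        fp = self._fingerprint(tool_name, arguments)
        self.call_counts[fp] = self.call_counts.get(fp, 0) + 1
        if self.call_counts[fp] > self.max_duplicate_calls:
            return False, "duplicate_action_detected"
        return True, "ok"

# Usage inside a (hypothetical) agent loop:
guard = LoopGuard(max_steps=8)
allowed, reason = guard.check("search_orders", {"customer_id": "c-42"})
if not allowed:
    ...  # abort the run, emit the loop_abort metric, fall back or escalate
```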

D. Product / Experience layer

User interaction surfaces translate model outputs into value. Here, reliability watches correction signals, regeneration dynamics, safety UX balance, personalisation, and billing integrity. Key reliability factors include (see the sketch after this list):

  • User corrections (regeneration_rate, early_abort_rate, edit_distance_mean)
  • Session semantic satisfaction proxy (avg_grounded_turns_per_session)
  • Personalisation effectiveness (personalisation_uplift_semantic_accuracy)
  • Billing integrity (unattributed_token_ratio, orphan_tool_cost_ratio)
  • Safety UX balance (false_refusal_rate vs unsafe_output_rate tradeoff curve)
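
As a small sketch of how correction signals might be derived from session event logs (event names here are hypothetical; adapt them to your own telemetry schema):

```python
from collections import Counter

def correction_signals(events):
    """Compute regeneration_rate and early_abort_rate from a list of UI events.

    `events` is a list of dicts such as {"type": "response_shown"},
    {"type": "regenerate_clicked"}, or {"type": "aborted_before_complete"}.
    """
    counts = Counter(e["type"] for e in events)
    responses = counts.get("response_shown", 0)
    if responses == 0:
        return {"regeneration_rate": 0.0, "early_abort_rate": 0.0}
    return {
        "regeneration_rate": counts.get("regenerate_clicked", 0) / responses,
        "early_abort_rate": counts.get("aborted_before_complete", 0) / responses,
    }
```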

E. Cross-cutting

These are shared concerns that cut across every layer of your AI system: monitoring for changes, tracking data flow and lineage, managing costs, and ensuring compliance. They help you control how your AI evolves while keeping costs reasonable. Important areas to monitor include:

  • Drift & change management (semantic_accuracy_delta)
  • Economic optimisation (cost_per_successful_task, token_waste_ratio)
  • Observability & lineage (prompt_template_hash, retrieval_doc_ids)
  • Governance & compliance (policy_violation_rate, pii_leak_rate)
  • Feedback flywheel quality (feedback_capture_rate, label_latency)

Semantic drift detector paging on regression
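
As a minimal sketch of such a detector, assuming a nearline job that scores a pinned evaluation set and compares against a stored baseline; `score_eval_set` and `page_oncall` are placeholders for your own evaluator and alerting hook:

```python
def check_semantic_drift(score_eval_set, page_oncall,
                         baseline_accuracy, regression_threshold=0.03):
    """Page when semantic accuracy on a pinned eval set regresses beyond budget.

    `score_eval_set()` returns the current semantic_accuracy in [0, 1];
    `page_oncall(payload)` forwards an alert to your paging system.
    """
    current = score_eval_set()
    delta = current - baseline_accuracy          # semantic_accuracy_delta
    if delta < -regression_threshold:
        page_oncall({
            "alert": "semantic_accuracy_regression",
            "baseline": baseline_accuracy,
            "current": current,
            "delta": round(delta, 4),
        })
    return delta
```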

Failure taxonomy

Now, let us look at some failure scenarios. Different failure classes often co‑occur (e.g., retrieval staleness elevating hallucination rate). Detection relies on layered SLIs plus semantic/structural evaluators; infrastructure metrics alone rarely surface these early.

[Table: failure classes with detection signals]

Not every class or signal applies to all AI products; tailor the set to user value, risk surface, and architectural depth. Start minimal and expand with observed failure patterns.

Representative SLIs / SLO examples

Selecting SLIs for AISRE means balancing semantic quality, retrieval fidelity, safety, orchestration control, cost, and user experience without creating monitoring and metric noise. These indicators should tie directly to business value. For instance, goodput (useful output tokens delivered per second) correlates with user satisfaction, while retrieval precision directly affects task completion rates and builds trust. The most effective AI SLIs bridge technical reliability with measurable business impact. This is a rich subdomain, substantial enough to evolve into a specialized field of its own.

That said, it is generally advisable to treat semantic + retrieval + safety as the core triad, and to layer in agent/tool and economics metrics only once you have clear ownership and action playbooks. Separate SLIs into the following tiers (a minimal registry sketch follows the list):

  • Online SLIs (real‑time counters or streaming evals: latency, cost, unsafe_output_rate)
  • Nearline SLIs (batched semantic / grounding evals, retrieval precision samples)
  • Offline SLIs (periodic corpus freshness audits, planner success benchmarks)
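
One way to make that tiering explicit is a small declarative registry; the names, targets, and cadences below are illustrative, not recommendations:

```python
SLI_REGISTRY = {
    "online": {            # real-time counters / streaming evals
        "p95_ttft_ms":              {"slo": "<= 800",   "window": "5m"},
        "unsafe_output_rate":       {"slo": "<= 0.1%",  "window": "5m"},
        "cost_per_successful_task": {"slo": "<= $0.05", "window": "1h"},
    },
    "nearline": {          # batched semantic / grounding evals
        "semantic_accuracy":   {"slo": ">= 92%",  "window": "6h"},
        "context_precision@5": {"slo": ">= 0.75", "window": "6h"},
    },
    "offline": {           # periodic audits and benchmarks
        "corpus_freshness_hours": {"slo": "<= 24",  "window": "1d"},
        "planner_success_rate":   {"slo": ">= 95%", "window": "1w"},
    },
}
```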

AI system reliability metrics

Effective monitoring requires thoughtful selection of metrics that provide visibility into your AI system’s performance across critical dimensions. While technical metrics track system health, business-aligned AISRE metrics connect system performance to organisational outcomes. By monitoring both, teams can prioritise improvements that directly impact user satisfaction, revenue, and operational efficiency. This section offers a structured approach to metric selection across key reliability dimensions, helping you identify potential issues before they impact users and translate technical performance into business value. Representative categories:

  • Semantic (hallucination_rate)
  • Retrieval/RAG (context_precision@5)
  • Inference (p95_ttft_ms)
  • Agent/Tools (action_success_rate)
  • Safety (unsafe_output_rate)
  • Economics (cost_per_successful_task)
  • Drift (semantic_accuracy_delta_vs_baseline)
  • Product UX (regeneration_rate)
  • Ops (mttr_model_rollback)
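
As a sketch, several of these categories could be exposed as conventional monitoring metrics, here using the prometheus_client library; metric names, labels, and buckets are illustrative:

```python
from prometheus_client import Counter, Gauge, Histogram

# Semantic / safety outcomes, labelled by model and prompt template version
hallucination_events = Counter(
    "hallucination_events_total", "Responses flagged as ungrounded",
    ["model_version", "prompt_hash"])
unsafe_outputs = Counter(
    "unsafe_outputs_total", "Responses blocked or flagged by safety filters",
    ["model_version"])

# Inference latency
ttft_seconds = Histogram(
    "ttft_seconds", "Time to first token",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0))

# Retrieval and economics, sampled nearline
context_precision_at_5 = Gauge(
    "context_precision_at_5", "Sampled precision of top-5 retrieved chunks")
cost_per_successful_task = Gauge(
    "cost_per_successful_task_usd", "Blended token + tool cost per successful task")

# Example: record one request
ttft_seconds.observe(0.42)
hallucination_events.labels(model_version="m-2024-06", prompt_hash="a1b2c3").inc()
```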

AISRE maturity ladder

AISRE capability compounds as it grows: instrumentation precedes automation, which precedes adaptive optimisation. I put this ladder together to help with the transition from SRE to AISRE. Use it to pick the smallest next investment that shortens detection or rollback time. Avoid jumping levels before semantic evaluation baselines are stable.

[Image: AISRE maturity ladder]

Control points & guard patterns

These are engineered control points where you enforce policies, collect end‑to‑end lineage, and bound blast radius. Each control point should emit minimally sufficient structured telemetry, have a rollback/disable path, and fail safe (contain rather than propagate risk). In short: instrument first, then enforce. You can start with these control points to uphold constraints, preserve traceability, and contain impact:

  1. Gateway: Unified interface, model & retriever fingerprinting, fallback chain, per-intent quotas.
  2. Retrieval: Hybrid cascade (BM25 → embedding → reranker), freshness schedule, differential retriever canary.
  3. Router: Intent & complexity → smallest capable model / retriever pair; escalate on semantic score < threshold (see the sketch after this list).
  4. Safety sandwich: Pre-filter (PII/injection) → generation/agent loop → post-filter (toxicity/format/tool output) → repair or fallback.
  5. Agent supervisor: Max steps, loop and duplicate-action heuristics, tool allowlist, high-risk write-action approval workflow.
  6. Caching layer: Exact + semantic (guarded), KV & prompt caches, retrieval result cache, cost-aware eviction.
  7. Index Ops: Snapshot + rollback for vector index, embedding model version pin.
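
As a minimal sketch of the router pattern above (model names, thresholds, and the `generate` and `score_response` callables are illustrative placeholders for your own backends and evaluator):

```python
ESCALATION_THRESHOLD = 0.8   # illustrative semantic-quality floor

def route_request(intent, complexity, generate, score_response):
    """Route to the smallest capable model and escalate on weak semantic quality.

    `generate(model, intent)` calls the chosen model backend and returns text;
    `score_response(text)` is your semantic/grounding evaluator in [0, 1].
    """
    # Pick the smallest capable model for the intent/complexity pair.
    if intent == "faq" and complexity < 0.3:
        model = "small-model"
    elif complexity < 0.7:
        model = "medium-model"
    else:
        model = "large-model"

    response = generate(model, intent)

    # Escalate once to the largest model if quality falls below threshold.
    if model != "large-model" and score_response(response) < ESCALATION_THRESHOLD:
        response = generate("large-model", intent)
    return response
```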

Incident playbook skeleton

This playbook provides a high-level example of how to respond to reliability incidents in AI systems. It’s not a step-by-step guide, but a template that shows how different signals (like semantic quality, retrieval freshness, and version history) can be combined to detect issues. It also emphasizes the importance of freezing key components (like prompts, models, and indexes) early in the process to preserve evidence for investigation. You should customise thresholds, alerting rules, and recovery criteria based on your system’s risk profile and traffic volume.

Incident type: Hallucination + Retrieval Staleness Composite
Detection: hallucination_rate > 3% & context_precision@5 drop > 10% vs baseline for 10 min.
Immediate actions:

  • Freeze prompt + model + retriever version
  • Switch to last good vector index snapshot (if drift localised)
  • Run differential semantic + grounding eval (shadow) vs prior bundle
  • Inspect embedding_lag_minutes & ingestion job health
Root cause data: retrieval freshness metrics, index build diff, model fingerprint diff, semantic & grounding score deltas.
Exit criteria: hallucination_rate < 1.5% & context_precision@5 within 2% of baseline for 60 min.
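
The composite detection condition above (hallucination_rate > 3% together with a > 10% drop in context_precision@5, sustained for 10 minutes) could be expressed as a simple check over recent SLI samples; thresholds and window handling below are illustrative:

```python
def composite_breach(hallucination_rates, precision_at_5, baseline_precision,
                     hallucination_threshold=0.03, precision_drop_threshold=0.10):
    """True when both signals breach for the entire observation window.

    `hallucination_rates` and `precision_at_5` are per-minute samples covering
    the last 10 minutes; `baseline_precision` is the pinned baseline value.
    """
    hallucination_breach = all(r > hallucination_threshold
                               for r in hallucination_rates)
    precision_breach = all(
        (baseline_precision - p) / baseline_precision > precision_drop_threshold
        for p in precision_at_5)
    return hallucination_breach and precision_breach
```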

Starter checklist

This is a high‑level AISRE bootstrap checklist: a minimal cross‑layer starting point to establish semantic + retrieval + safety visibility, controlled change surfaces, and fast rollback. Treat it as a seed: prune items you cannot action within an incident, and only add new checks after a real failure has revealed blind spots.

☐ Define cross-layer SLIs (semantic_accuracy, context_precision@5, loop_abort_rate, cost_per_successful_task, unsafe_output_rate)

☐ Version & log (prompt, model, embedding model, index snapshot, reranker, planner)

☐ Daily retrieval / planner canary vs baseline

☐ Shadow + differential eval before promoting model OR retriever OR planner

☐ Agent loop & unsafe action guards deployed

☐ Retrieval freshness & unused_context dashboards

☐ Rollback scripts (model/prompt/index/reranker/planner) & composite runbook (a minimal sketch follows below)
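
As a minimal sketch of what a composite rollback entry point might look like (component names, versions, and the `deploy_pinned_version` helper are placeholders for your own deployment tooling):

```python
LAST_GOOD_BUNDLE = {            # captured whenever a composite canary passes
    "model":    "chat-model:2024-06-01",
    "prompt":   "prompt-template:v41",
    "index":    "vector-index-snapshot:2024-06-01T02:00Z",
    "reranker": "reranker:v7",
    "planner":  "planner-policy:v12",
}

def rollback(components, deploy_pinned_version, bundle=LAST_GOOD_BUNDLE):
    """Roll the named components back to the last known-good bundle.

    `deploy_pinned_version(component, version)` is whatever your deployment
    tooling exposes for pinning a versioned artifact.
    """
    for component in components:
        deploy_pinned_version(component, bundle[component])

# Composite rollback during an incident: model + prompt stay frozen,
# retrieval stack reverts to the last good snapshot.
# rollback(["index", "reranker"], deploy_pinned_version=my_deploy_fn)
```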

Conclusion: What makes AISRE special

AISRE elevates reliability from infrastructure correctness to semantic, retrieval, orchestration, and product behavior guarantees. It unifies observability (data + semantics + retrieval + agent actions + infra), embeds drift & feedback into ops loops, and treats model, retrieval, planner, and policy evolution as first-class deployables with error budgets. Traditional SRE foundations remain necessary, but AISRE adds retrieval science, agent governance, and product telemetry literacy to keep AI systems trustworthy at scale.
