Production AI Evaluation & Guardrails
Most organisations first experience generative AI as a polished internal demo. Real users bring messy phrasing, adversarial curiosity, incomplete context, and expectations shaped by human support—not by a model card. The gap between “works in the boardroom” and “works on Tuesday at 2 a.m.” is not model size; it is evaluation discipline, guardrails, and operational ownership. This playbook describes how we close that gap before scaling, including for teams that must answer to risk, legal, and security stakeholders.
Why demos mislead
Demos usually hide three production realities. First, distribution shift: training examples, RAG corpora, and eval sets rarely match live phrasing, languages, or edge cases. Second, compound failure: once an agent can call tools, read email, or move money, a small language error becomes a large business error. Third, cost and latency: unconstrained context, chain-of-thought verbosity, and retries that were fine for ten testers break budgets and SLAs at ten thousand.
Production readiness means naming acceptable error rates per task class, measuring them continuously, and having an escalation path when they move. Anything else is a prototype wearing a blazer.
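Concretely, those per-class thresholds should live in version-controlled configuration, not tribal knowledge. A minimal sketch in Python (task classes, rates, and owners are illustrative, not recommendations):

```python
# error_budgets.py -- illustrative per-task-class thresholds; all values are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ErrorBudget:
    task_class: str
    max_error_rate: float       # acceptable fraction of failed tasks in a window
    max_escalation_rate: float  # acceptable fraction handed off to humans
    owner: str                  # who answers when the budget is blown

BUDGETS = [
    ErrorBudget("faq_answer",   max_error_rate=0.05, max_escalation_rate=0.15, owner="support-ml"),
    ErrorBudget("refund_draft", max_error_rate=0.01, max_escalation_rate=0.05, owner="payments"),
]

def check(task_class: str, error_rate: float, escalation_rate: float) -> list[str]:
    """Return budget violations for one task class over a measurement window."""
    budget = next((b for b in BUDGETS if b.task_class == task_class), None)
    if budget is None:
        raise KeyError(f"no error budget defined for {task_class!r}")
    violations = []
    if error_rate > budget.max_error_rate:
        violations.append(f"{task_class}: error rate {error_rate:.3f} > {budget.max_error_rate}")
    if escalation_rate > budget.max_escalation_rate:
        violations.append(f"{task_class}: escalation rate {escalation_rate:.3f} > {budget.max_escalation_rate}")
    return violations
```

The point is not the code; it is that "acceptable error" has a named number and a named owner before launch, so alerting and escalation have something to bind to.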
Three layers of evaluation
We treat evaluation as a stack—not a single accuracy number—so improvements in one layer do not mask regressions elsewhere.
Offline & held-out suites
Fixed prompts and labelled outputs for core workflows; retrieval-quality tests when RAG is involved; bilingual or domain-specific subsets; red-team transcripts from prior incidents. Runs in CI before every prompt, tool schema, or model version change.
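A minimal shape for such a gate, assuming a JSONL file of labelled cases and a `generate()` entry point for the system under test (both are placeholders for whatever your stack provides):

```python
# test_offline_suite.py -- run by pytest in CI; a failing case blocks the change.
import json
import pathlib

import pytest

SUITE = [json.loads(line)
         for line in pathlib.Path("evals/core_suite.jsonl").read_text().splitlines()
         if line.strip()]

def generate(prompt: str) -> str:
    raise NotImplementedError("wire this to your model or agent entry point")

@pytest.mark.parametrize("case", SUITE, ids=lambda c: c["id"])
def test_case(case):
    output = generate(case["prompt"])
    # Deterministic string checks run on every change; slower judge-model
    # scoring can run on a schedule rather than in the blocking path.
    for required in case.get("must_contain", []):
        assert required in output, f"{case['id']}: missing {required!r}"
    for banned in case.get("must_not_contain", []):
        assert banned not in output, f"{case['id']}: contains banned {banned!r}"
```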
Online & shadow traffic
Compare candidate models or prompts against production on sampled traffic without user impact. Track task completion rate, escalation rate, grounding violations, toxicity flags, and tool-call success. Shadow mode catches regressions that offline benchmarks cannot see.
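One way to wire this, sketched below; `production_model`, `candidate_model`, and `metrics` are stand-ins for your serving and telemetry layers:

```python
# shadow.py -- mirror a sample of live traffic to a candidate without user impact.
import random

SHADOW_SAMPLE_RATE = 0.05  # fraction of production requests mirrored

def handle_request(request, production_model, candidate_model, metrics):
    # Production path: the user only ever sees this answer.
    answer = production_model(request)
    # Shadow path: score the candidate on the same input, then discard its output.
    if random.random() < SHADOW_SAMPLE_RATE:
        try:
            shadow_answer = candidate_model(request)
            metrics.record("shadow.divergence", float(shadow_answer != answer))
        except Exception:
            metrics.record("shadow.error", 1.0)  # candidate failures must never reach users
    return answer
```

In practice the shadow call runs asynchronously or against a log replay so candidate latency never touches the user path; divergent cases feed the human-rated sampling described next.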
Human-rated samples
Stratified sampling (new users, refunds, outages, multilingual turns). Human raters judge correctness, tone, policy adherence—not just thumbs up/down. The goal is calibrated agreement with SMEs, which becomes the ground truth for automated metrics.
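A reproducible way to draw that sample, with hypothetical strata and quotas:

```python
# sample_for_raters.py -- reproducible stratified sampling; strata and quotas are illustrative.
import random
from collections import defaultdict

STRATA_QUOTA = {"new_user": 50, "refund": 50, "outage": 25, "multilingual": 25}

def stratified_sample(conversations, seed=42):
    """conversations: dicts with a 'stratum' key; returns one rating batch."""
    rng = random.Random(seed)  # fixed seed makes the batch auditable and repeatable
    by_stratum = defaultdict(list)
    for convo in conversations:
        by_stratum[convo["stratum"]].append(convo)
    batch = []
    for stratum, quota in STRATA_QUOTA.items():
        pool = by_stratum.get(stratum, [])
        batch.extend(rng.sample(pool, min(quota, len(pool))))
    return batch
```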
Guardrails that survive contact with users
Guardrails are not a bolt-on moderation API. Effective systems combine:
- Input controls: schema validation for structured requests, intent classifiers or routers for unstructured chat, attachment scanning, rate limits per user cohort, and PII stripping where appropriate.
- Output controls: policy classifiers on drafts, citations required for factual claims when using RAG, and format validators for downstream systems (JSON contracts, numeric ranges; see the first sketch after this list).
- Tool boundaries: allow-lists per role, deterministic idempotency keys for payments or writes, and mandatory confirmation steps for actions crossing spend or legal thresholds, even if the model “insists” (see the second sketch below).
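Output-side validation, as referenced above, is deterministic code rather than another model call. A sketch of a JSON-contract check for a hypothetical refund-draft workflow (field names and ranges are invented):

```python
# output_guard.py -- deterministic JSON-contract check; field names and ranges are invented.
import json

def validate_refund_draft(raw_output: str) -> dict:
    """Reject any draft that does not meet the downstream contract."""
    try:
        draft = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    amount = draft.get("amount")
    if not isinstance(amount, (int, float)) or not 0 < amount <= 500:
        raise ValueError(f"amount {amount!r} outside permitted range (0, 500]")
    if not draft.get("citation_ids"):
        raise ValueError("factual claims must cite at least one retrieved source")
    return draft
```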
Design principle: the model proposes; policies dispose. Agents should never be the sole authority on irreversible actions. Human-in-the-loop, dual control, or on-chain governance hooks (where Xenqube designs Web3-bound workflows) reinforce the same separation of duties.
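A sketch of that separation of duties: a deterministic gate between the agent and its tools that the model cannot talk its way past. Roles, tool names, and the threshold are illustrative:

```python
# tool_gate.py -- a deterministic policy gate the model cannot talk its way past.
import uuid

ALLOWED_TOOLS = {
    "support_agent": {"lookup_order", "draft_refund"},
    "finance_agent": {"lookup_order", "draft_refund", "execute_refund"},
}
CONFIRMATION_THRESHOLD = 100.0  # spend above this always routes to a human

def gate_tool_call(role: str, tool: str, args: dict):
    """Approve, augment, or refuse a proposed tool call; the agent never bypasses this."""
    if tool not in ALLOWED_TOOLS.get(role, set()):
        raise PermissionError(f"{role} may not call {tool}")
    if tool == "execute_refund":
        if args.get("amount", 0) > CONFIRMATION_THRESHOLD:
            raise PermissionError("refund above threshold: requires human approval")
        # Deterministic idempotency key: retries re-send the same key, so the
        # payment system deduplicates and the write happens at most once.
        key = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{args.get('order_id')}:{args.get('amount')}"))
        args = {**args, "idempotency_key": key}
    return tool, args
```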
Monitoring without drowning in dashboards
Start with a small set of KPIs wired to alerting. Typical production signals:
- Latency p95/p99 and token economics per workflow (detect runaway completions).
- Grounding violations in RAG (answers not supported by retrieved chunks).
- Escalation and handoff rates to humans (rising rate often precedes churn).
- Refusal specificity (blanket refusals often indicate prompt fragility).
- Error taxonomy on tool calls (timeouts, authentication, malformed arguments).
Alert on trends and distribution shifts, not single weird traces—then reserve deep dives for incidents with full prompt, retrieval context, and decision log.
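For rate-style signals, a two-proportion z-score between a baseline window and the current window is often enough to start; the alert threshold below is illustrative:

```python
# drift_alert.py -- alert on a shifted rate, not a single trace; threshold is illustrative.
import math

def rate_shift_zscore(baseline_hits: int, baseline_total: int,
                      window_hits: int, window_total: int) -> float:
    """Two-proportion z-score between a baseline window and the current window."""
    p1 = baseline_hits / baseline_total
    p2 = window_hits / window_total
    pooled = (baseline_hits + window_hits) / (baseline_total + window_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / window_total))
    return (p2 - p1) / se if se else 0.0

# Example: grounding violations moved from 1.2% to 2.1% week over week.
if abs(rate_shift_zscore(120, 10_000, 42, 2_000)) > 3:
    print("ALERT: grounding-violation rate shifted; open an incident")
```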
Retraining, RAG refresh, and prompt evolution
Models do not age like traditional software—they age like inventory. Documents change, regulation changes, product SKUs change. A maintainable pipeline includes: versioned corpora with deletion and provenance; scheduled refresh jobs; evaluation gates before promoting a new index; and clear ownership for who approves “knowledge” updates. For fine-tuned weights, the same discipline applies: data lineage, bias checks on new slices, and rollback to the last known-good artifact.
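The promotion gate itself can be a few lines of glue; `run_retrieval_evals` and `alias_store` below are stand-ins for your eval harness and index-alias mechanism, and the thresholds are illustrative:

```python
# promote_index.py -- eval gate before a refreshed RAG index serves traffic.

def promote(candidate_index, run_retrieval_evals, alias_store):
    """Flip the production alias only if the candidate clears the gate."""
    scores = run_retrieval_evals(candidate_index)  # e.g. recall@5 and grounding on a fixed suite
    if scores["recall_at_5"] < 0.90 or scores["grounding_rate"] < 0.97:
        raise RuntimeError(f"index failed eval gate: {scores}; production alias unchanged")
    previous = alias_store.get("prod")
    alias_store.set("prod", candidate_index.version)  # atomic alias flip, no client redeploy
    alias_store.set("rollback", previous)             # last known-good stays one command away
```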
This mirrors what strong AI development partners publish as standard: data preprocessing, labelling, training, and ongoing monitoring—not a one-time model handoff.
Governance and audit readiness
Regulated teams ask one question repeatedly: Why did the system do that on that day? Answerability requires append-only decision logs (inputs, retrieved sources, model version, policy version, tool results), model and prompt versioning tied to releases, and export formats legal can consume. Our AI governance use case expands on traceability patterns for enterprise and hybrid on-chain controls.
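A minimal shape for one such record, written as JSON lines with a hash chain so silent edits are detectable (the field set is illustrative, a real implementation would persist the chain head, and the chain also gives you something cheap to anchor on-chain where your architecture calls for it):

```python
# decision_log.py -- one record per decision, hash-chained so silent edits are detectable.
import hashlib
import json
import time

def append_decision(log_file, record: dict):
    """Append one JSON line; each line commits to the previous via prev_hash."""
    record = {
        **record,
        "ts": time.time(),
        "prev_hash": getattr(append_decision, "_prev_hash", "genesis"),
    }
    line = json.dumps(record, sort_keys=True)
    append_decision._prev_hash = hashlib.sha256(line.encode()).hexdigest()
    log_file.write(line + "\n")

# Usage: capture every field legal may ask about, at decision time.
with open("decisions.log", "a") as f:
    append_decision(f, {
        "request": "refund order 8841",
        "retrieved_ids": ["policy_doc_17", "order_8841"],
        "model_version": "m-2025-06-02",
        "prompt_version": "p-14",
        "policy_version": "pol-3",
        "tool_results": [{"tool": "lookup_order", "status": "ok"}],
        "output": "refund approved draft",
    })
```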
Rollout checklist (condensed)
- Task taxonomy with per-class acceptable error and escalation rules.
- Offline eval suite in CI; shadow traffic for risky changes.
- Guardrails on inputs, outputs, and tools; human paths for high-risk actions.
- Monitoring SLOs, incident runbooks, and rollback including prompt/model pins.
- Data and RAG lifecycle owners; scheduled refresh with eval gates.
- Red-team schedule (prompt injection, jailbreaks, data exfiltration attempts).
If you are shipping agents that touch customer funds, identity, or governance votes, the same checklist applies—only the stakes and the speed of incident response change. We build those systems with explicit policy bounds and optional on-chain audit anchors where your architecture demands it.
