Observability for AI in Production: Logging, Metrics, and Alerts That Actually Matter
Traditional application monitoring tells you if your system is running. AI observability tells you if it's working. The gap between those two things is where production AI problems live — and where most teams have blind spots.
Casey R. Taylor
OpsGenius
A traditional application monitoring stack will tell you whether your AI system is up. It will tell you if the container is running, if the API is returning 200s, if CPU and memory are within normal bounds.
What it won't tell you is whether your AI system is actually working.
An AI inference endpoint can return 200s while producing outputs that are wrong, drifted, or subtly degraded in ways that only become visible across thousands of requests. A voice agent can handle calls without errors while routing them incorrectly. A document processing pipeline can run without exceptions while extracting the wrong data from every tenth document.
This is the observability gap for AI systems. Filling it requires a layer of instrumentation that goes beyond what DevOps teams typically build for traditional software — and most teams don't build it until something has already gone wrong.
The Three Layers of AI Observability
Layer 1: Infrastructure Metrics (Table Stakes)
This is the layer traditional monitoring handles well. CPU, memory, disk I/O, network throughput, container health, pod restart counts. For Kubernetes-based deployments, this means Prometheus for metric collection and Grafana (or CloudWatch, Azure Monitor) for dashboards and alerting.
Infrastructure metrics matter for AI systems, but they're insufficient on their own. High CPU utilization tells you something about load. It doesn't tell you anything about output quality.
What infrastructure monitoring should cover for AI workloads:
- Request latency by percentile (p50, p95, p99): AI inference is typically slower than a conventional API call and its latency distribution is wide, so the tail tells you more about user experience than the median does
- Queue depth for async AI workers: a growing queue is an early warning sign before latency or errors become visible
- GPU utilization if you're running local inference: underutilization means wasted spend; sustained near-100% utilization means there is no headroom left and you will need to scale soon
- Container restart frequency: restarts indicate instability, and a container restarting once a week is a different problem than one restarting hourly
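As a rough illustration of the first two items, here is a minimal sketch that summarizes latency by percentile and flags a queue that keeps growing. The function names and the sample data are hypothetical; in a real deployment a metrics system such as Prometheus would compute these from histograms and gauges rather than from in-memory lists.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return the p50, p95, and p99 latency for a window of requests."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


def queue_is_growing(depth_samples):
    """True if every queue-depth sample is larger than the one before it."""
    pairs = zip(depth_samples, depth_samples[1:])
    return len(depth_samples) >= 2 and all(later > earlier for earlier, later in pairs)


# Hypothetical window of 100 requests: mostly fast, with a slow tail.
latencies_ms = [800] * 90 + [2500] * 8 + [9000] * 2
summary = latency_summary(latencies_ms)
print(summary)
print(queue_is_growing([12, 19, 31, 48]))
```

In this made-up window the median looks healthy while p99 is more than ten times higher, which is why the percentiles are worth tracking separately.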
Layer 2: Application and Pipeline Traces
Distributed tracing gives you visibility into what happens inside a request — which components were called, in what order, how long each took, and where failures occurred.
For AI systems, which are typically orchestrations of multiple calls (an LLM call, a vector database lookup, a CRM API write, a webhook trigger), tracing is how you answer questions like:
- Which step in the pipeline is causing latency?
- When this request failed, what was the last successful step?
- How much of total request time is spent waiting on the LLM vs the database vs downstream APIs?
Tools like OpenTelemetry provide a standard instrumentation layer that works across cloud providers. The data flows to whatever backend you're using — Datadog, Grafana Tempo, AWS X-Ray, Azure Application Insights.
The critical thing to capture in AI pipeline traces:
- Input tokens and output tokens per LLM call — token counts directly drive cost and latency
- Model version used — so you can correlate behavior changes with model updates
- Tool call chains for agentic systems — which tools were called, in what order, with what arguments
- Retry counts — a call that succeeds on the third attempt is not the same as a call that succeeds on the first
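To make the idea concrete, the sketch below records one span per pipeline step with the attributes listed above, then asks which step was slowest. It is a stdlib-only stand-in, not the OpenTelemetry API: the `Trace` and `Span` classes, the step names, and the attribute values are all hypothetical, and a production system would use an OpenTelemetry SDK to create and export spans instead.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


class Trace:
    """Collects the spans recorded while handling a single request."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        record = Span(name=name, attributes=dict(attributes))
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the duration even if the step raised an exception.
            record.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(record)

    def slowest(self):
        return max(self.spans, key=lambda s: s.duration_ms)


trace = Trace()
with trace.span("vector_lookup", index="docs"):
    time.sleep(0.002)  # stand-in for the vector database call
with trace.span("llm_call", model_version="example-model-v1") as span:
    time.sleep(0.05)  # stand-in for the LLM call
    span.attributes.update(input_tokens=412, output_tokens=96, retries=2)

print(trace.slowest().name, trace.slowest().attributes)
```

Keeping token counts, model version, and retry count on the span itself means a latency question and a cost question can be answered from the same record.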
Layer 3: Output Quality Signals
This is the layer most teams are missing. It's also the most important one for AI systems.
Output quality monitoring means continuously sampling what your AI system is actually producing and evaluating whether it meets your quality criteria. The implementation varies by use case:
For structured output tasks (data extraction, classification, routing decisions): output quality can be evaluated programmatically. Does the extracted field match the expected format? Does the classification fall within the expected distribution? Is the routing decision consistent with business rules? These checks can run on every output and alert on deviations.
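A minimal sketch of what such programmatic checks might look like is shown here. Everything specific in it is an assumption made for illustration: the `INV-` identifier format, the queue names, the refund rule, and the baseline distribution are hypothetical and would come from your own schema and business rules.

```python
import re
from collections import Counter

INVOICE_ID = re.compile(r"INV-\d{6}")             # hypothetical field format
ALLOWED_QUEUES = {"billing", "support", "sales"}  # hypothetical routing targets


def check_output(output):
    """Return a list of problems found in one structured output."""
    problems = []
    if not INVOICE_ID.fullmatch(str(output.get("invoice_id", ""))):
        problems.append("invoice_id format")
    if output.get("queue") not in ALLOWED_QUEUES:
        problems.append("unknown queue")
    # Hypothetical business rule: refund requests must be routed to billing.
    if output.get("intent") == "refund" and output.get("queue") != "billing":
        problems.append("refund not routed to billing")
    return problems


def distribution_shift(baseline_shares, recent_labels):
    """Largest absolute gap between a label's recent share and its baseline share."""
    if not recent_labels:
        raise ValueError("no recent labels")
    counts = Counter(recent_labels)
    labels = set(baseline_shares) | set(counts)
    total = len(recent_labels)
    return max(abs(counts[label] / total - baseline_shares.get(label, 0.0))
               for label in labels)


good = {"invoice_id": "INV-004217", "queue": "billing", "intent": "refund"}
bad = {"invoice_id": "4217", "queue": "support", "intent": "refund"}
print(check_output(good), check_output(bad))
print(distribution_shift({"billing": 0.5, "support": 0.5},
                         ["billing"] * 8 + ["support"] * 2))
```

The per-output check is cheap enough to run on every request, while the distribution check only makes sense over a window of recent outputs.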
For natural language output tasks (summarization, drafting, customer-facing responses): programmatic evaluation is harder. Options include:
- Using an LLM-as-judge — a second, lightweight model call that evaluates the primary output against defined criteria (tone, accuracy, completeness, appropriate length)
- Human review of sampled outputs — slower but more reliable for high-stakes applications
- Proxy metrics — thumbs down rates, escalation rates, re-ask rates from end users, where those signals are available
The goal isn't to evaluate every output exhaustively. It's to maintain enough visibility that quality degradation is detected in hours, not weeks.
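One way to keep evaluation cost bounded is to sample a fixed fraction of requests by hashing the trace ID, so that the same request is always either in or out of the sample. The sketch below assumes this approach; the 5% rate, the record shape, and the `judge` callable (a stand-in for an LLM-as-judge call or a human review score between 0 and 1) are hypothetical.

```python
import hashlib


def should_sample(trace_id, rate):
    """Deterministically select roughly `rate` (0.0 to 1.0) of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over 0 .. 2**64 - 1
    return bucket < rate * 2**64


def evaluate_sample(records, judge, rate=0.05):
    """Score the sampled records with `judge` and return the mean score,
    or None if no record fell into the sample."""
    scores = [judge(r["output"]) for r in records
              if should_sample(r["trace_id"], rate)]
    return sum(scores) / len(scores) if scores else None


# Hypothetical records and a trivial judge that only checks for non-empty output.
records = [{"trace_id": f"trace-{i}", "output": "summary text"} for i in range(1000)]
mean_score = evaluate_sample(records, judge=lambda output: 1.0 if output else 0.0)
print(mean_score)
```

Because the selection depends only on the trace ID, a sampled output can be re-evaluated later under a new judge and compared against its earlier score.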
What to Alert On
Most AI systems are over-alerted on infrastructure (every CPU spike becomes a page) and under-alerted on what actually matters. A calibrated alert setup for a production AI system looks something like this:
Page immediately:
- Error rate exceeds threshold for more than 5 minutes
- P99 latency exceeds SLA for more than 5 minutes
- Container crash loop (more than 3 restarts in 10 minutes)
- Queue depth growing continuously for more than 15 minutes without drainage
Alert for business-hours review:
- Output quality score drops more than 15% from baseline
- Token usage spikes unexpectedly (may indicate prompt injection or runaway loops)
- Unusual distribution in classification outputs (may indicate input drift)
- External API error rate from a dependency rises above 5%
Weekly review metrics:
- Cost per request trend over time
- Output quality trend over time
- Latency trend over time
- Distribution of input types and edge case frequency
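Several of the rules above share a shape: a condition must hold for a sustained period before anyone is notified. The sketch below expresses three of them as plain functions, using the numbers from the lists above (5 minutes, more than 3 restarts in 10 minutes, a 15% quality drop). The function names, the 5% error threshold, and the sample data are hypothetical, and in practice these rules would usually live in your alerting system's own rule language.

```python
def sustained_breach(samples, threshold, window_s):
    """samples: (timestamp_s, value) pairs in time order.
    True if the value has stayed above `threshold` for more than `window_s`
    seconds up to the most recent sample."""
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # any recovery resets the clock
    return breach_start is not None and samples[-1][0] - breach_start > window_s


def crash_looping(restart_times, now, window_s=600, max_restarts=3):
    """True if more than `max_restarts` restarts occurred in the last `window_s` seconds."""
    recent = [t for t in restart_times if now - window_s <= t <= now]
    return len(recent) > max_restarts


def quality_dropped(baseline, current, max_drop=0.15):
    """True if the quality score fell more than `max_drop` relative to baseline."""
    return baseline > 0 and (baseline - current) / baseline > max_drop


# Hypothetical error-rate samples, one per minute, all above a 5% threshold.
error_rate = [(minute * 60, 0.08) for minute in range(7)]
print("page:", sustained_breach(error_rate, threshold=0.05, window_s=300))
print("page:", crash_looping([100, 200, 300, 400], now=500))
print("review:", quality_dropped(baseline=0.90, current=0.75))
```

Requiring the breach to persist is what keeps a single spike from becoming a page.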
Structured Logging for AI Systems
Logs are how you debug production incidents after they happen. For AI systems, the standard application log (timestamp, level, message) isn't enough.
Every AI inference call should log — in structured JSON — at minimum:
- A unique trace ID that ties the log entry to the full request trace
- The input passed to the model (or a hash of it, if inputs are sensitive)
- The model and version used
- Input and output token counts
- Latency in milliseconds
- The raw output or a hash of it
- Any metadata relevant to your use case (customer ID, workflow ID, classification result)
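A minimal sketch of such a log record, using only the Python standard library, might look like this. The field names, the `sensitive` flag, and the example values (model name, workflow ID, token counts) are hypothetical; the point is one JSON object per inference call, with hashes in place of raw text when inputs are sensitive.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def inference_log_line(*, trace_id, model, model_version, model_input,
                       model_output, input_tokens, output_tokens, latency_ms,
                       metadata=None, sensitive=True):
    """Build one structured JSON log line for a single inference call."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "model": model,
        "model_version": model_version,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "metadata": metadata or {},
    }
    if sensitive:
        # Store hashes only, so the entry can be matched without exposing content.
        record["input_sha256"] = sha256_hex(model_input)
        record["output_sha256"] = sha256_hex(model_output)
    else:
        record["input"] = model_input
        record["output"] = model_output
    return json.dumps(record, sort_keys=True)


line = inference_log_line(
    trace_id=uuid.uuid4().hex,
    model="example-model",
    model_version="2024-01",
    model_input="Route this call: customer asks about a refund",
    model_output='{"queue": "billing"}',
    input_tokens=14,
    output_tokens=6,
    latency_ms=840,
    metadata={"workflow_id": "wf-123", "classification": "billing"},
)
print(line)
```

Emitting the record as a single JSON line keeps it easy to index and query by trace ID in whatever log backend you use.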
This level of logging feels like overkill until you're debugging an incident at 11 PM and trying to reconstruct what the system did for a specific request three hours ago.
Log retention matters too. AI quality issues often become visible only in aggregate, across weeks of data. Retaining structured logs for 30–90 days is a reasonable default for production AI systems; shorter retention can hide slow-moving degradation.
Connecting Observability to Operations
Observability infrastructure is only valuable if it's connected to an operational process that acts on what it surfaces.
This means:
- Alert thresholds are reviewed and tuned after the first few weeks in production — not set once and forgotten
- Output quality dashboards are reviewed on a defined cadence, not just when something breaks
- Runbooks exist for the most common alert types: who gets paged, what they check first, what the resolution steps are
- Post-incident reviews capture what observability gaps allowed the issue to persist and what instrumentation would have caught it earlier
Teams that build observability infrastructure but don't operationalize it tend to end up with dashboards nobody looks at and alerts that get muted because they fire too often. The infrastructure and the process have to be built together.
OpsGenius builds and operates observability stacks for production AI systems — instrumentation, dashboards, alert design, and the operational processes that make them useful. If you're running AI in production without full-stack observability, let's talk about what you're missing.
