Application performance monitoring was built for a specific assumption about software behavior: that a given set of inputs, processed by a deterministic system, will produce a predictable set of outputs. APM tools measure latency, error rates, memory utilization, and request throughput. They tell you whether your system is doing what it's supposed to do, where it's slow, and when it fails. This is enormously useful for deterministic systems. It is largely insufficient for probabilistic ones.
LLMs are probabilistic systems. The same prompt submitted twice can produce different outputs. A failure may not surface as an error code; it may surface as an output that is technically valid JSON but semantically wrong in ways that only matter two steps later in the pipeline. An LLM that was performing well on your test set may start degrading when a model provider updates the underlying weights without notice. None of these failure patterns is visible in a traditional APM dashboard.
What LLM observability needs to capture
Prompt provenance. When an LLM produces a bad output, debugging starts with understanding exactly what prompt was submitted — not the template, but the rendered prompt including the context, the system instructions, and the user turn. Observability systems that log only the user input (and not the full assembled prompt) make this reconstruction expensive or impossible.
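As a rough illustration of what logging the full assembled prompt could look like, here is a minimal Python sketch. The `PromptRecord` class and all of its field names are hypothetical and not taken from any particular product; the point is only that the log entry stores the rendered prompt (system instructions, context, and user turn) alongside a hash, rather than the user input alone.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class PromptRecord:
    """One log entry for a single LLM call (hypothetical schema)."""
    template_id: str
    system_instructions: str
    context: list[str]
    user_turn: str
    model: str
    timestamp: float = field(default_factory=time.time)

    def rendered(self) -> str:
        # The exact text sent to the model, not the template or the bare user input.
        return "\n\n".join([self.system_instructions, *self.context, self.user_turn])

    def fingerprint(self) -> str:
        # Stable hash of the rendered prompt, so identical prompts can be grouped.
        return hashlib.sha256(self.rendered().encode("utf-8")).hexdigest()

    def to_log_line(self) -> str:
        # Serialize the parts, the rendered prompt, and its hash as one JSON line.
        payload = asdict(self)
        payload["rendered_prompt"] = self.rendered()
        payload["prompt_sha256"] = self.fingerprint()
        return json.dumps(payload, sort_keys=True)


record = PromptRecord(
    template_id="support-answer-v3",  # hypothetical template name
    system_instructions="You are a support assistant.",
    context=["Doc A: refunds take 5 days."],
    user_turn="How long do refunds take?",
    model="example-model",  # placeholder, not a real model ID
)
print(record.to_log_line())
```

With a record like this, a bad output can be traced back to the exact prompt that produced it, and the hash makes it cheap to check whether two calls really received the same input.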
Output semantics. The fact that an LLM returned a 200 status code and valid JSON is not a signal that the output was correct. LLM observability needs to capture the output in a form that allows semantic evaluation: did the model stay within the guardrails? Did it answer the question it was asked? Did it cite sources accurately? This requires evaluation logic that sits above the transport layer — not something traditional APM tools were designed to support.
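To make the idea of evaluation above the transport layer concrete, here is a minimal Python sketch. It assumes a hypothetical output schema of `{"answer": ..., "sources": [...]}`, and the function and check names are illustrative. It covers only rule-based checks (guardrail phrases and citation validity); judging whether the model actually answered the question usually needs a model-based or human grader, which this sketch does not attempt.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalResult:
    check: str
    passed: bool
    detail: str = ""


def evaluate_output(raw_output: str, allowed_sources: set[str],
                    banned_phrases: list[str]) -> list[EvalResult]:
    """Run semantic checks on a response that already passed transport checks."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [EvalResult("valid_json", False, str(exc))]
    if not isinstance(data, dict):
        return [EvalResult("valid_json", False, "expected a JSON object")]

    answer = str(data.get("answer", ""))
    sources = data.get("sources")
    cited = [str(s) for s in sources] if isinstance(sources, list) else []

    # Guardrail check: the answer must not contain any banned phrase.
    hits = [p for p in banned_phrases if p.lower() in answer.lower()]
    # Citation check: every cited source must be one that was actually supplied.
    unknown = [s for s in cited if s not in allowed_sources]

    return [
        EvalResult("valid_json", True),
        EvalResult("answer_present", bool(answer.strip())),
        EvalResult("within_guardrails", not hits, ", ".join(hits)),
        EvalResult("citations_known", not unknown, ", ".join(unknown)),
    ]


# Valid JSON, HTTP-level success, yet one citation was never in the context.
output = json.dumps({"answer": "Use plan B.", "sources": ["doc-1", "doc-9"]})
for result in evaluate_output(output, {"doc-1", "doc-2"}, ["guaranteed"]):
    print(result)
```

In this example the response would look healthy to an APM tool, but the `citations_known` check fails because `doc-9` was not among the supplied sources.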
Model drift over time. LLM providers update their models regularly. Sometimes these updates improve performance on your task. Sometimes they degrade it. Without semantic observability that tracks output quality over time, model drift is invisible until a customer reports a problem, and by then the degradation may have been happening for weeks. A well-designed observability system runs continuous evaluations against a golden set (a fixed collection of prompts with known-good answers), surfaces degradation before it becomes visible to users, and gives you the data to decide whether switching providers or pinning to a specific model version is warranted.
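A golden-set evaluation loop can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `call_model` stands in for whatever client function sends a prompt and returns text, the substring match is a deliberately crude scoring rule, and the `window` and `tolerance` defaults are arbitrary. A real system would likely use richer scoring and a more careful statistical test.

```python
from typing import Callable

# Each golden case pairs a prompt with a substring the answer should contain.
GoldenCase = tuple[str, str]


def score_golden_set(call_model: Callable[[str], str],
                     golden_set: list[GoldenCase]) -> float:
    """Return the fraction of golden cases whose output contains the expected text."""
    if not golden_set:
        return 0.0
    hits = [expected.lower() in call_model(prompt).lower()
            for prompt, expected in golden_set]
    return sum(hits) / len(hits)


def detect_drift(history: list[float], latest: float,
                 window: int = 7, tolerance: float = 0.05) -> bool:
    """Flag drift when the latest pass rate falls below the recent average
    by more than `tolerance`."""
    recent = history[-window:]
    if not recent:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(recent) / len(recent)
    return (baseline - latest) > tolerance


golden = [("What is the capital of France?", "Paris"), ("What is 2 + 2?", "4")]


def fake_model(prompt: str) -> str:
    # Stand-in for a real provider call; it gets the arithmetic case wrong.
    return "Paris" if "France" in prompt else "five"


latest_score = score_golden_set(fake_model, golden)
print(latest_score, detect_drift([1.0, 1.0, 1.0], latest_score))
```

Run on a schedule, with each score appended to the history, a check like this can raise an alert when quality drops after a provider-side update, well before the first customer report.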
We backed Portkey.ai because they understood this problem set with unusual depth. The product design reflects a genuine understanding of what enterprise teams need when they're running LLM-powered features at scale: not just traffic routing and cost analytics, but also the semantic observation layer that makes AI applications debuggable. That distinction is what separates infrastructure from tooling, and it's the distinction that matters at enterprise scale.