Category Guide
Understand what LLM observability is, how it differs from monitoring and analytics, and how to add cost visibility to your observability stack.
LLM observability is the practice of instrumenting your AI applications so you can understand their behavior in production. Every call to OpenAI, Anthropic, Google, or any other LLM provider generates data — token counts, latency, error codes, model identifiers, and costs. Observability tools capture this data and make it queryable, so you can answer questions about your system that you did not anticipate in advance.
The concept borrows from traditional software observability (logs, metrics, traces) but adapts it for the unique characteristics of LLM applications. LLM calls are nondeterministic — the same input can produce different outputs. They are expensive — a single request to a frontier model can cost cents, not fractions of a cent. And they are opaque — a 200 status code tells you nothing about whether the response was actually good. These properties mean that standard APM tools miss most of what matters.
A well-instrumented LLM application lets you trace a request from the user through every model call in the chain, see how many tokens each step consumed, know what each step cost, and evaluate whether the output met quality standards. Without this instrumentation, debugging a production issue means reading logs and guessing. With it, you can see exactly what happened, why it was slow, and what it cost.
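The idea can be sketched as a thin wrapper around any provider call that emits one structured event per request. This is a minimal illustration, not any particular tool's SDK: the `PRICING` numbers, the `events` list, and the `fake_llm` stub are all assumptions standing in for real pricing data, an observability backend, and a provider client.

```python
import time

# Illustrative pricing in USD per million tokens; real prices vary and change.
PRICING = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

events = []  # in production this would be shipped to your observability backend

def traced_call(model, call_fn, *args, **kwargs):
    """Wrap an LLM call so every request records latency, tokens, and cost."""
    start = time.monotonic()
    response = call_fn(*args, **kwargs)
    latency = time.monotonic() - start
    price = PRICING[model]
    cost = (response["input_tokens"] * price["input"]
            + response["output_tokens"] * price["output"]) / 1_000_000
    events.append({
        "model": model,
        "latency_s": round(latency, 3),
        "input_tokens": response["input_tokens"],
        "output_tokens": response["output_tokens"],
        "cost_usd": cost,
    })
    return response

# Stand-in for a real provider SDK call.
def fake_llm(prompt):
    return {"text": "ok", "input_tokens": 120, "output_tokens": 40}

traced_call("gpt-4o-mini", fake_llm, "hello")
```

Because the wrapper takes the call as a function, the same instrumentation covers every provider your application uses.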
The scope of LLM observability is broad. It includes prompt tracing and debugging (tools like Langfuse and LangSmith), cost tracking and margin analysis (tools like MarginDash and Helicone), and quality evaluation. Most teams end up using a combination of tools because no single tool covers every dimension well.
Latency (time to first token and total response time) directly affects user experience. LLM calls are slow compared to traditional API calls — response times of 1-10 seconds are normal. Tracking latency per model, per feature, and over time helps you identify regressions and choose models that meet your performance requirements. Streaming responses make time-to-first-token the more meaningful metric for user-facing features.
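Measuring time to first token only requires noting when the first streamed chunk arrives. A minimal sketch, with `fake_stream` standing in for a real streaming API response:

```python
import time

def measure_streaming(chunks):
    """Consume a streamed response, recording time to first token and total time."""
    start = time.monotonic()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.monotonic() - start  # first chunk: time to first token
        parts.append(chunk)
    total = time.monotonic() - start
    return "".join(parts), ttft, total

# Stand-in generator simulating network delay between streamed chunks.
def fake_stream():
    for piece in ["Hel", "lo ", "world"]:
        time.sleep(0.01)
        yield piece

text, ttft, total = measure_streaming(fake_stream())
```

For user-facing features, alert on `ttft`; for batch pipelines, `total` is usually the metric that matters.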
Token usage (input and output) is the fundamental unit of LLM cost. Input tokens are what you send to the model (prompts, context, system instructions). Output tokens are what the model generates. Output tokens are typically 3x to 5x more expensive than input tokens. Tracking token usage per request, per feature, and per customer reveals where your budget is going and where optimization is possible.
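The cost asymmetry is easy to see with the arithmetic written out. The prices below are illustrative placeholders, chosen so that output tokens cost 5x input tokens:

```python
INPUT_PRICE = 3.00    # USD per 1M input tokens (illustrative)
OUTPUT_PRICE = 15.00  # USD per 1M output tokens, 5x the input price

def request_cost(input_tokens, output_tokens):
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# A long prompt with a short answer costs far less than the reverse.
long_prompt = request_cost(input_tokens=4000, output_tokens=200)
long_answer = request_cost(input_tokens=200, output_tokens=4000)
```

With these numbers, the long-prompt request costs $0.015 while the long-answer request costs $0.0606, four times as much for the same total token count.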
Error rates and failure modes in LLM applications go beyond HTTP status codes. A 200 response can contain a refusal, a hallucination, or an off-topic answer. Observability means tracking both hard failures (rate limits, timeouts, 500 errors) and soft failures (model refusals, malformed outputs, safety filter triggers). The ratio of retries to successful completions is another signal that standard monitoring misses.
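One way to track soft failures is to classify every response after the transport layer succeeds. The refusal phrases below are a crude heuristic for illustration; a production classifier would be more robust:

```python
def classify_response(status_code, text):
    """Separate hard failures (transport level) from soft failures (content level)."""
    if status_code != 200:
        return "hard_failure"
    lowered = text.lower()
    # Heuristic refusal detection; illustrative phrases only.
    if any(p in lowered for p in ("i can't help", "i cannot assist", "i'm unable to")):
        return "soft_failure_refusal"
    if not text.strip():
        return "soft_failure_empty"
    return "ok"

a = classify_response(429, "")                            # hard failure: rate limit
b = classify_response(200, "I cannot assist with that.")  # soft failure: refusal
c = classify_response(200, "Here is your summary.")       # ok
```

Counting these categories per model and per feature surfaces the failures a status-code dashboard never shows.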
Cost per request and cost per customer connects technical metrics to business outcomes. A single metric — cost per customer — can tell you whether your pricing model works. If a customer paying $49/month is consuming $60/month in API calls, that is a problem no amount of latency optimization will fix. Cost observability requires knowing the price of every model you use, which changes frequently across LLM providers.
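The per-customer margin check from the paragraph above is a small aggregation. The plan prices and usage events here are made-up illustrations:

```python
from collections import defaultdict

# Illustrative monthly revenue per customer (USD).
PLAN_PRICE = {"acme": 49.00, "globex": 49.00}

def margin_per_customer(usage_events):
    """Sum API cost per customer and compare it to what that customer pays."""
    cost = defaultdict(float)
    for event in usage_events:
        cost[event["customer_id"]] += event["cost_usd"]
    return {cid: PLAN_PRICE[cid] - spent for cid, spent in cost.items()}

events = [
    {"customer_id": "acme", "cost_usd": 60.0},   # heavy user on a $49 plan
    {"customer_id": "globex", "cost_usd": 4.2},
]
margins = margin_per_customer(events)
```

Here "acme" shows a negative margin: the $49/month plan loses $11/month in API spend, which is exactly the situation aggregate billing hides.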
Prompt and response quality is the hardest metric to measure but arguably the most important. Some teams use LLM-as-judge evaluations, where a second model scores the output of the first. Others use human evaluation on a sample. The key insight from an observability perspective is that quality must be tracked alongside cost — optimizing for cost alone leads to models that are cheap but produce poor results.
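An LLM-as-judge evaluation reduces to prompting a second model with a rubric and parsing its score. A minimal sketch, with a lambda standing in for the real judge-model call:

```python
def judge_output(question, answer, call_judge_model):
    """Score an answer 1-5 by asking a second model; rubric wording is illustrative."""
    rubric = (
        "Rate the answer from 1 (unusable) to 5 (excellent) for correctness "
        f"and relevance.\nQuestion: {question}\nAnswer: {answer}\nScore:"
    )
    raw = call_judge_model(rubric)
    return int(raw.strip())

# Stub judge standing in for a real provider call.
score = judge_output("What is 2+2?", "4", lambda prompt: " 5 ")
```

Logging this score next to the request's cost is what makes the cost-versus-quality trade-off visible.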
| Capability | MarginDash | Helicone | Langfuse | LangSmith | Datadog LLM |
|---|---|---|---|---|---|
| Cost tracking | Yes | Yes | Yes | Yes | Basic |
| Per-customer cost | Yes | No | No | No | No |
| Revenue/margin tracking | Yes | No | No | No | No |
| Prompt tracing | No | Yes | Yes | Yes | Yes |
| Evaluation / scoring | No | No | Yes | Yes | No |
| Cost simulator | Yes | No | No | No | No |
| Budget alerts | Yes | Yes | No | No | Yes |
| Stripe integration | Yes | No | No | No | No |
| Open source | No | No | Yes | No | No |
| Pricing | Free and paid tiers | Free and paid tiers | Free (OSS) | Free / $39 per seat | From $15/host/mo |
These three terms are often used interchangeably, but they describe different things. Understanding the differences helps you pick the right tools and avoid gaps in your instrumentation.
LLM monitoring is about watching for known problems. You define thresholds — error rate above 5%, latency above 3 seconds, daily cost above $500 — and get alerted when they are breached. Monitoring is reactive. It tells you that something went wrong, but not why. It works well for problems you have already encountered and can define in advance.
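The thresholds named above translate directly into a check that runs against current metrics. The limits and metric names here are the article's example values, not a real tool's configuration:

```python
# Thresholds from the examples above: error rate 5%, latency 3s, daily cost $500.
THRESHOLDS = {"error_rate": 0.05, "p95_latency_s": 3.0, "daily_cost_usd": 500.0}

def check_thresholds(metrics):
    """Return the names of any metrics that breached their threshold."""
    return [name for name, limit in THRESHOLDS.items() if metrics.get(name, 0) > limit]

alerts = check_thresholds({
    "error_rate": 0.08,      # breached
    "p95_latency_s": 2.1,    # fine
    "daily_cost_usd": 512.0, # breached
})
```

This is the whole of monitoring: known limits, reactive alerts. Everything it cannot explain is what observability data is for.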
LLM observability is about having enough data to investigate problems you did not anticipate. It is the instrumentation layer that makes monitoring possible. Observability tools collect traces, metrics, and logs at a granularity that lets you drill into any request and understand what happened. When monitoring fires an alert, observability gives you the data to diagnose the cause.
LLM analytics is about understanding patterns and making business decisions. It sits on top of the same data but asks different questions: Which customers are most expensive? Which features consume the most tokens? What would it cost if we swapped this model for a cheaper one? Analytics is forward-looking — it helps you optimize pricing, plan capacity, and forecast costs.
In practice, most production LLM applications need all three. Observability provides the data foundation. Monitoring watches for problems. Analytics turns the data into business insights. Some tools cover multiple categories — Helicone spans monitoring and basic analytics, Langfuse covers observability and evaluation, MarginDash focuses on analytics and cost optimization — but no single tool covers everything.
Traditional observability focuses on performance and reliability. LLM observability adds a third dimension: cost. Every API call to a language model has a direct dollar cost that varies by model, by token count, and by provider. A request that takes 2 seconds and returns a 200 status code might cost $0.002 or $0.08 depending on which model served it. Standard observability tools have no concept of this cost layer.
Cost observability becomes critical when you are reselling AI features to customers. If your product makes LLM calls on behalf of users, you need to know the cost per customer — not just the aggregate monthly bill. A customer on a $49/month plan who consumes $60/month in API calls is losing you money, and you will not know it without per-customer cost tracking.
Multi-provider environments make cost tracking harder. Most production applications use models from multiple providers — OpenAI for chat, Anthropic for analysis, Google for embeddings. Each provider has its own pricing structure, its own usage dashboard, and its own billing cycle. Without a unified cost layer, you are switching between three or four dashboards and reconciling the numbers manually. A cost-aware observability stack normalizes pricing data across all providers into a single view, so you can compare cost per request regardless of which model served it.
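Normalizing pricing means keying every model by provider and expressing all prices in one unit, regardless of how each provider quotes them. The table below uses USD per million tokens with illustrative numbers:

```python
# Hypothetical normalized pricing table: every entry in USD per 1M tokens,
# whatever unit the provider itself quotes. Numbers are illustrative.
PRICING = {
    ("openai", "gpt-4o"):              {"input": 2.50, "output": 10.00},
    ("anthropic", "claude-3-5-haiku"): {"input": 0.80, "output": 4.00},
    ("google", "gemini-1.5-flash"):    {"input": 0.075, "output": 0.30},
}

def normalized_cost(provider, model, input_tokens, output_tokens):
    """Cost of one request in USD, comparable across all providers."""
    p = PRICING[(provider, model)]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = normalized_cost("openai", "gpt-4o", input_tokens=1000, output_tokens=500)
```

With one function answering "what did this request cost?", comparing a request served by OpenAI against one served by Anthropic becomes a single query instead of a dashboard reconciliation exercise.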
Privacy is a design decision in observability tooling. Debugging tools need to see your prompts and responses — that is how they provide tracing and evaluation. Cost tracking tools do not. The minimum data needed for accurate cost calculation is model name, token counts, and a customer identifier. No prompts, no responses, no end-user content. This distinction matters for teams with data residency requirements, regulated industries, or customers who are sensitive about their data flowing through third-party services. You can run a full cost observability layer without exposing any content.
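The minimum payload described above is small enough to write out in full. The field names and the `cus_123` identifier are illustrative:

```python
def cost_event(model, input_tokens, output_tokens, customer_id):
    """The minimum payload for cost attribution: no prompt or response content."""
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "customer_id": customer_id,
    }

event = cost_event("gpt-4o-mini", 312, 87, "cus_123")
assert "prompt" not in event and "response" not in event  # no content leaves your system
```

Everything needed for per-model, per-customer cost analysis is in those four fields; prompts and responses never need to leave your infrastructure.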
The cost dimension also enables a form of optimization unique to LLM applications: model swapping. A model that costs 1/40th the price and scores within 5% on public benchmarks might be perfectly adequate for a specific feature. But you cannot make that decision without cost data at the feature level and benchmark data at the model level.
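The swap decision combines exactly those two inputs: cost at the feature level and benchmark score at the model level. A sketch with assumed numbers, where the candidate costs 1/40th the price and scores within the 5% tolerance:

```python
def swap_candidate(current, candidate, max_quality_drop=0.05):
    """Flag a cheaper model whose benchmark score is within the tolerated drop.

    Models are dicts with illustrative fields: cost_per_1k_usd, benchmark_score.
    """
    cheaper = candidate["cost_per_1k_usd"] < current["cost_per_1k_usd"]
    quality_ok = (candidate["benchmark_score"]
                  >= current["benchmark_score"] * (1 - max_quality_drop))
    return cheaper and quality_ok

frontier = {"cost_per_1k_usd": 0.040, "benchmark_score": 0.88}
small = {"cost_per_1k_usd": 0.001, "benchmark_score": 0.85}  # 1/40th the price
ok_to_swap = swap_candidate(frontier, small)
```

The tolerance belongs in the function signature on purpose: a customer-facing chat feature and an internal summarizer will accept very different quality drops.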
Pricing changes silently and frequently. AI providers update model pricing without advance notice, and new models launch with different cost structures every few weeks. An observability stack that relies on hardcoded prices drifts out of date quickly. A maintained pricing database that syncs daily — covering 100+ models across OpenAI, Anthropic, Google, AWS Bedrock, Azure, and Groq — keeps cost calculations accurate without manual maintenance. For a deeper look at AI cost management frameworks, see our dedicated guide.
MarginDash adds cost visibility to your observability stack. Track costs across 100+ models from OpenAI, Anthropic, Google, and more. Connect to Stripe to see margin per customer. Use the cost simulator to find cheaper models without sacrificing quality. Set up in 5 minutes.
Start Tracking LLM Costs → No credit card required.
Create an account, install the SDK, and see your first margin data in minutes.