Category Guide
AI agents chain dozens of API calls per request. Monitor total workflow cost, attribute it to the customer who triggered it, and set budget limits before a single run burns through your margin.
AI agents are systems that use large language models to complete multi-step tasks autonomously. Unlike a single API call that takes a prompt and returns a response, an agent plans a sequence of actions, executes them, evaluates the results, and decides what to do next. A customer support agent might read a ticket, search a knowledge base, draft a response, check it against policy, and send it — each step involving one or more LLM calls. A coding agent might plan changes, write code, run tests, read errors, and retry — looping until the task passes or a limit is reached.
This multi-step nature is what makes agent monitoring fundamentally different from standard LLM monitoring. With a single API call, the cost is predictable — you know the model, you can estimate the token count, and the price is fixed. With an agent, the number of steps varies per request. One customer query might resolve in two calls. The next might require twelve, including retries, tool use, and context window expansion. The cost of a single agent workflow can vary by 10x or more depending on the complexity of the input.
Traditional monitoring tools were built for single request-response cycles. They track latency, error rates, and throughput per endpoint. Agent workflows break this model because a single user action triggers a cascade of internal API calls, each with its own cost, latency, and failure mode. Without agent-specific monitoring, you see a flat list of API calls with no connection to the workflow or customer that generated them.
The cost implications compound quickly. If you are running agents on behalf of customers — AI-powered search, automated reporting, code generation as a service — a handful of customers running complex workflows can consume more API budget than the rest of your user base combined. Agent monitoring gives you the visibility to see this happening before it erodes your margin.
| Capability | LangSmith | Langfuse | Helicone | MarginDash |
|---|---|---|---|---|
| Multi-step trace visualization | Yes | Yes | Limited | No |
| Prompt / response logging | Yes | Yes | Yes | No |
| Total cost per workflow | Basic | Basic | Yes | Yes |
| Per-customer cost attribution | No | No | No | Yes |
| Revenue / margin tracking | No | No | No | Yes |
| Cost simulator | No | No | No | Yes |
| Budget alerts | No | No | Yes | Yes |
| Open source | No | Yes | Yes | No |
LangSmith and Langfuse excel at debugging agent behavior. MarginDash focuses on the cost and revenue side — per-customer P&L, model swap simulation, and budget enforcement. Most teams use a debugging tool alongside a cost tool.
Retries and error recovery. When an agent step fails — a malformed tool call, a rate limit, an unexpected response format — the agent retries. Each retry is another API call with full context. A workflow that normally costs $0.05 can cost $0.50 if the agent hits three retries on a frontier model, because each retry includes the entire conversation history in the prompt.
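The compounding effect of retries can be sketched with a bit of arithmetic. The token counts and per-1K prices below are illustrative assumptions, not real provider quotes — the point is that every retry resends the full context, so input cost is paid again on each attempt:

```python
# Illustrative prices only — assumed figures, not a real provider's rates.
PRICE_PER_1K_INPUT = 0.01   # USD per 1K input tokens (hypothetical)
PRICE_PER_1K_OUTPUT = 0.03  # USD per 1K output tokens (hypothetical)

def step_cost(input_tokens, output_tokens):
    """Cost of a single LLM call at the assumed prices."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def cost_with_retries(input_tokens, output_tokens, retries):
    """Each retry resends the entire conversation history, so the
    full input cost is incurred again on every attempt."""
    attempts = 1 + retries
    return attempts * step_cost(input_tokens, output_tokens)
```

At these assumed prices, a 4,000-token prompt with a 500-token response costs about $0.055 per attempt; three retries quadruple that, before accounting for the history growing between attempts.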
Tool use and function calling. Agents that use tools generate additional token overhead. The tool definitions, the function call arguments, and the tool responses all consume tokens. An agent with access to ten tools pays the token cost of describing all ten tools in every planning step, even if it only uses two. As the number of available tools grows, the per-step cost grows with it.
Long and growing context windows. Agent workflows accumulate context with every step. The first step might send 500 tokens. By step eight, the conversation history — including all previous tool calls and results — might push the prompt to 50,000 tokens. Input tokens are cheaper than output tokens, but at that volume the cost is material. Some agent frameworks truncate context to manage this, but many do not.
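The accumulation is easy to model. A minimal sketch, assuming the prompt grows by a roughly constant amount of history per step:

```python
def chain_input_tokens(first_step_tokens, added_per_step, steps):
    """Total input tokens across a workflow where each step's prompt
    includes all prior history (linear growth per step)."""
    total = 0
    prompt = first_step_tokens
    for _ in range(steps):
        total += prompt            # this step's full prompt is billed
        prompt += added_per_step   # history grows before the next step
    return total
```

Starting at 500 tokens and adding 1,500 tokens of history per step, an eight-step workflow bills 46,000 input tokens in total — most of it re-sent history, which is exactly what context pruning or summarization targets.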
Recursive reasoning loops. Some agents are designed to self-critique and iterate. A planning agent might generate a plan, evaluate it, revise it, evaluate again, and repeat until a quality threshold is met. If the threshold is poorly calibrated or the task is ambiguous, the agent can loop indefinitely. Without cost monitoring and budget limits, a recursive loop can burn through hundreds of dollars before anyone notices.
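The standard defense is a hard cost cap enforced inside the loop. A minimal sketch — the `(result, cost)` step interface is hypothetical, standing in for whatever your agent framework exposes:

```python
def run_with_budget(steps, max_cost):
    """Execute agent steps until the task completes or the budget is hit.
    `steps` yields (result, cost) pairs — a hypothetical interface."""
    spent = 0.0
    results = []
    for result, cost in steps:
        if spent + cost > max_cost:
            # Stop the loop before the overspend happens, not after.
            raise RuntimeError(
                f"budget exceeded: ${spent + cost:.2f} > ${max_cost:.2f}")
        spent += cost
        results.append(result)
    return results, spent
```

The key design choice is checking the budget *before* each call, so a recursive loop fails fast instead of burning through the cap and reporting the damage afterward.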
These factors combine to make agent costs highly variable. Two customers sending similar-looking requests can generate workflows that differ in cost by an order of magnitude. Per-customer cost management becomes essential — not optional — the moment you deploy agents in production.
Total cost per workflow is the sum of all API call costs across every step in a single agent run. This is the number that tells you whether a given workflow is economically viable at your current pricing. If a customer pays $0.10 per query and the agent workflow costs $0.30 to complete, you are losing $0.20 per request — and scaling makes it worse, not better.
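The viability check is a one-line sum. A sketch, using the figures from the example above:

```python
def workflow_margin(step_costs, price_per_query):
    """Margin on one agent run: revenue minus the sum of every step's cost."""
    return price_per_query - sum(step_costs)
```

A workflow whose steps cost $0.12, $0.10, and $0.08 against a $0.10 query price yields a margin of -$0.20 — the losing-money-per-request scenario described above.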
Steps per completion measures how many API calls the agent needed to finish the task. A well-tuned agent completes most tasks in 2-4 steps. If your average is climbing — 6, 8, 12 steps — that signals prompt quality issues, poor tool definitions, or tasks that are too ambiguous for the model. More steps means more cost, more latency, and more failure surface area.
Token chain length is the cumulative token count across all steps of a workflow. Unlike single-call token tracking, the chain length captures the compounding effect of growing context. A workflow that starts with 1,000 input tokens per step might end at 40,000 per step as conversation history accumulates. Tracking chain length helps you identify workflows where context pruning or summarization would save money.
Failure and retry rates tell you how often agent steps fail and how many retries are needed. High retry rates on specific tools or models indicate integration issues that are silently inflating costs. A 20% retry rate on a step that costs $0.03 raises its average cost to $0.036 — and at scale, those hidden costs add up.
Cost per customer ties everything together. When you attribute workflow costs to the customer who triggered them, you can build a per-customer P&L that accounts for multi-step agent usage. This is the metric that tells you whether your pricing model works — or whether a subset of customers is consuming all your margin through expensive agent workflows.
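The aggregation behind a per-customer P&L is straightforward: sum step costs by customer, then join with revenue. A minimal sketch — the event and revenue shapes are illustrative, not a specific tool's data model:

```python
def per_customer_pnl(events, revenue):
    """Aggregate per-step costs by customer and net them against revenue.
    `events`: iterable of (customer_id, cost) pairs, one per agent step.
    `revenue`: dict mapping customer_id -> USD earned from that customer."""
    costs = {}
    for customer_id, cost in events:
        costs[customer_id] = costs.get(customer_id, 0.0) + cost
    # Customers with cost but no recorded revenue show as pure loss.
    return {c: revenue.get(c, 0.0) - costs[c] for c in costs}
```

Because every agent step carries a customer ID, a twelve-step workflow contributes twelve cost events to the same customer's line — which is what makes the P&L reflect true agent usage rather than a single call.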
Trace-based monitoring is what tools like LangSmith provide. Every agent step is captured as a span in a trace, with full prompt and response content. This is powerful for debugging — you can see exactly what the agent did at each step, why it chose a particular tool, and where it went wrong. The tradeoff is data volume and privacy: you are sending all prompt content to a third-party service, and storage costs grow with usage. Trace-based tools answer the question "why did the agent do this?" They are less focused on "what did it cost?"
Log-based monitoring is the approach taken by tools like Langfuse. Agent steps are logged as events with metadata, and you can reconstruct traces from the logs. This is more flexible than opinionated tracing — you can instrument any framework — but it still centers on observability rather than cost tracking. Log-based tools are good at showing you what happened. They require additional work to answer per-customer cost questions.
Cost-focused monitoring is what MarginDash provides. Instead of capturing full prompts and traces, the SDK logs only model name, token counts, and customer ID at each step. Cost calculation happens server-side against a pricing database covering 100+ models across OpenAI, Anthropic, Google, AWS Bedrock, Azure, and Groq. The result is per-customer cost attribution for multi-step workflows without the privacy overhead of full prompt logging. The tradeoff is that you cannot debug prompts with this data alone — but that is a separate concern from cost monitoring.
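The shape of that metadata-only event, and the server-side pricing lookup, can be sketched as follows. The event fields, model names, and prices here are illustrative assumptions — not the actual MarginDash SDK or its pricing data:

```python
# Assumed per-1K-token prices (input, output) in USD — illustrative only.
PRICING = {
    "gpt-4o-mini": (0.00015, 0.0006),
    "claude-sonnet": (0.003, 0.015),
}

def make_event(customer_id, model, input_tokens, output_tokens):
    """A cost event carries only what attribution needs — no prompt text."""
    return {"customer_id": customer_id, "model": model,
            "input_tokens": input_tokens, "output_tokens": output_tokens}

def price_event(event):
    """Server-side cost calculation against the pricing table."""
    in_price, out_price = PRICING[event["model"]]
    return (event["input_tokens"] / 1000) * in_price + \
           (event["output_tokens"] / 1000) * out_price
```

The privacy property falls out of the event shape: since prompt content never leaves your infrastructure, the monitoring service only ever sees counts and identifiers.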
These approaches are not mutually exclusive. Many teams run a trace-based tool for debugging alongside a cost-focused tool for unit economics. The key is to separate the two concerns: debugging why an agent failed is a different problem from understanding whether it is profitable. Trying to solve both with one tool usually means doing neither well.
The fundamental challenge with agent cost tracking is attribution. A single customer request can trigger a cascade of API calls across different models, and the total cost needs to be attributed to the customer who initiated it. Without this attribution, you can see your aggregate API bill going up, but you cannot tell which customers are driving the increase.
With MarginDash, you tag each event with a customer ID. When an agent runs a multi-step workflow — planning, tool calls, retries, final response — every step is associated with that customer. The dashboard aggregates all steps and shows you the total cost per customer alongside their revenue (via Stripe sync or revenue passed through the API). The result is a per-customer P&L that accounts for the true cost of agent workflows, not just the cost of a single API call.
This visibility changes how you think about pricing. If your agent workflows average $0.08 per request but the top 5% of customers average $0.40 per request, flat pricing loses money on your heaviest users. Per-customer cost data lets you design tiered pricing, usage caps, or model routing strategies that protect your margins without penalizing the majority of customers who are profitable.
The cost optimization opportunity is also significant. The cost simulator lets you pick a feature, swap the underlying model, and see projected savings based on your actual token usage. For agent workflows, this means you can test whether a cheaper model handles the planning step adequately, or whether moving tool-use steps to a faster model reduces costs without degrading completion rates. You are making decisions with real data instead of guessing.
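The simulation itself is a replay of recorded token usage against a different price row. A sketch of the idea, with assumed model names and prices:

```python
def simulate_swap(events, pricing, new_model):
    """Project spend if every recorded call had run on `new_model`.
    `events`: (model, input_tokens, output_tokens) tuples of actual usage.
    `pricing`: dict model -> (input, output) USD per 1K tokens (assumed)."""
    def cost(model, tin, tout):
        pin, pout = pricing[model]
        return tin / 1000 * pin + tout / 1000 * pout
    current = sum(cost(m, tin, tout) for m, tin, tout in events)
    projected = sum(cost(new_model, tin, tout) for _, tin, tout in events)
    return current, projected, current - projected
```

The projection only answers the cost half of the question — whether the cheaper model still completes the workflow is something you verify separately, e.g. by watching steps-per-completion after the swap.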
Budget alerts are important for any AI workload, but they are critical for agent workflows. A single agent run can consume more tokens than hundreds of simple API calls. A runaway loop — an agent stuck retrying a failing tool call, or recursively refining an output that never meets the quality threshold — can burn through your daily budget in minutes. By the time you notice it on your provider dashboard, the damage is done.
For agent workflows, the per-customer threshold is the most important. It catches the scenario where a single customer's usage spikes — whether due to a complex request, a bug in the agent logic, or simple heavy usage — before it erodes your margin on that account.
Per-feature alerts are useful when different agent workflows have different cost profiles. A document analysis agent that processes long PDFs will naturally cost more per run than a classification agent that handles short text. Setting per-feature thresholds lets you monitor each workflow against its own baseline rather than a single global number.
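Both alert types reduce to the same check: compare accumulated spend per key (customer or feature) against that key's threshold. A minimal sketch, with hypothetical feature names:

```python
def check_thresholds(spend, thresholds):
    """Return the customers or features whose accumulated spend has
    crossed their configured budget threshold."""
    return [key for key, amount in spend.items()
            if key in thresholds and amount >= thresholds[key]]
```

Because each workflow is checked against its own baseline, a $12 day for a document-analysis agent can trigger an alert while the same figure would be unremarkable globally.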
MarginDash tracks multi-step agent workflow costs and attributes them to the customer who triggered them. Connect to Stripe to see margin per customer. Set budget alerts before a single agent run burns through your margin.
Start Monitoring Agent Costs →
No credit card required
Create an account, install the SDK, and see your first margin data in minutes.
See My Margin Data →
No credit card required