Provider Guide
A comprehensive comparison of every major LLM API provider — what they offer, what they cost, and how their models perform on public benchmarks.
Pricing and benchmark data updated daily.
| Provider | Models | Cheapest Input ($/1M) | Most Expensive Input ($/1M) | Top Model | Best Score |
|---|---|---|---|---|---|
| Google | 18 | $0.02 | $2.00 | Gemini 3.1 Pro Preview | 57.2 |
| OpenAI | 44 | $0.03 | $150.00 | GPT-5.3 Codex (xhigh) | 54.0 |
| Anthropic | 25 | $0.25 | $15.00 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | 53.0 |
| Z AI | 14 | $0.07 | $1.00 | GLM-5 (Reasoning) | 49.8 |
| Kimi | 5 | $0.39 | $0.60 | Kimi K2.5 (Reasoning) | 46.8 |
| Alibaba | 52 | $0.05 | $1.60 | Qwen3.5 397B A17B (Reasoning) | 45.0 |
| MiniMax | 4 | $0.30 | $0.40 | MiniMax-M2.5 | 41.9 |
| DeepSeek | 14 | $0.27 | $1.35 | DeepSeek V3.2 (Reasoning) | 41.7 |
| xAI | 9 | $0.20 | $3.00 | Grok 4 | 41.5 |
| Xiaomi | 3 | $0.10 | $0.10 | MiMo-V2-Flash | 41.5 |
| Amazon | 13 | $0.04 | $2.50 | Nova 2.0 Pro Preview (medium) | 35.7 |
| Mistral | 23 | $0.10 | $4.00 | Magistral Medium 1.2 | 27.1 |
| NVIDIA | 10 | $0.04 | $1.20 | NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) | 24.3 |
| Nous Research | 5 | $0.13 | $1.00 | Hermes 4 - Llama-3.1 405B (Reasoning) | 18.6 |
| Meta | 13 | $0.03 | $2.50 | Llama 4 Maverick | 18.4 |
| Allen Institute for AI | 3 | $0.10 | $0.20 | Olmo 3 7B Think | 16.8 |
| InclusionAI | 3 | $0.07 | $0.14 | Ling-flash-2.0 | 15.7 |
| Cohere | 3 | $0.50 | $3.00 | Command A | 13.5 |
| AI21 Labs | 5 | $0.20 | $2.00 | Jamba 1.7 Large | 12.5 |
Showing providers with 3 or more priced models. Prices in USD per 1M input tokens.
The cheapest model in each quality tier, ranked by input price. Useful for picking a default model at each capability level.
| Tier | Model | Provider | Input (per 1M) | Output (per 1M) | Intelligence |
|---|---|---|---|---|---|
| Frontier | MiMo-V2-Flash | Xiaomi | $0.10 | $0.30 | 41.5 |
| Mid-tier | gpt-oss-120B (high) | OpenAI | $0.04 | $0.19 | 33.3 |
| Budget | Gemma 3n E4B Instruct | Google | $0.02 | $0.04 | 6.4 |
Intelligence Index is a composite benchmark score. Higher is better.
OpenAI, Anthropic, and Google remain the dominant frontier players, but DeepSeek, Mistral, and xAI have closed the gap on many benchmarks at substantially lower pricing. The decision is no longer "which provider is best" — it is "which provider is best for this specific task at this price point."
Start with benchmarks that match your use case. MMLU-Pro and GPQA measure general reasoning. AIME measures mathematical ability. Coding benchmarks like HumanEval and SWE-bench matter for developer tools. A model that scores a few points lower on benchmarks peripheral to your use case but costs a tenth as much is usually the better choice.
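One way to operationalize this tradeoff is to rank candidates by benchmark score per dollar of blended token cost. The sketch below is illustrative: the model names, scores, and prices are hypothetical, and the assumed output-to-input token ratio will differ for your workload.

```python
# Toy comparison: benchmark score per dollar of blended token cost.
# Model names, scores, and prices below are hypothetical.
def score_per_dollar(score: float, input_price: float, output_price: float,
                     output_ratio: float = 0.25) -> float:
    """Score divided by blended price per 1M tokens, assuming
    `output_ratio` output tokens for every input token."""
    blended = input_price + output_ratio * output_price
    return score / blended

models = [
    # (name, benchmark score, $ input/1M, $ output/1M)
    ("frontier-a", 54.0, 1.25, 10.00),
    ("mid-b",      49.0, 0.30,  1.20),
    ("budget-c",   42.0, 0.10,  0.30),
]
ranked = sorted(models, key=lambda m: score_per_dollar(m[1], m[2], m[3]),
                reverse=True)
```

With these numbers, the budget model wins on value despite the lower score, which is exactly the point: weight the score by what you actually pay.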
Beyond benchmarks, consider operational factors: rate limits, uptime, geographic availability, and support. A model that is 2% better but has frequent outages can cost more in engineering time than the performance difference is worth. For production workloads, reliability often matters more than marginal benchmark gains.
Public benchmarks are useful for narrowing the field, but run your own prompts against two or three candidates and measure what matters to your users. A model that excels at academic math might underperform on your specific extraction or summarization task. Build with a clean abstraction layer so you can swap providers as pricing and capabilities change.
No single provider offers the best model at every price point for every task. A common pattern is routing complex reasoning to a frontier model (GPT-5, Claude 4, Gemini 2.5 Pro) and simpler tasks like classification or extraction to a budget model (DeepSeek V3.2, Mistral Small, Gemini Flash). This can reduce costs by 50-80% on simpler workloads without affecting quality where it matters.
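The routing pattern above can be sketched in a few lines. The task categories, heuristic, and model identifiers here are assumptions for illustration; in practice you would route on your own task taxonomy or a lightweight classifier.

```python
# Minimal model-routing sketch: simple tasks go to a budget model,
# everything else to a frontier model. Identifiers are hypothetical.
FRONTIER_MODEL = "frontier-large"   # placeholder model name
BUDGET_MODEL = "budget-small"       # placeholder model name

SIMPLE_TASKS = {"classification", "extraction", "summarization"}

def pick_model(task_type: str) -> str:
    """Route by task type; unknown or complex tasks default to frontier."""
    return BUDGET_MODEL if task_type in SIMPLE_TASKS else FRONTIER_MODEL
```

Defaulting unknown task types to the frontier model is the safe choice: misrouting a hard task to a weak model hurts quality, while misrouting an easy task to a strong model only costs money.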
The OpenAI-compatible API format has become a de facto standard. Most providers offer endpoints that accept the same request format, making it straightforward to swap models without rewriting application code. Anthropic and Google diverge on some features (tool use schemas, multimodal inputs), but frameworks like LiteLLM or the Vercel AI SDK normalize these differences.
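In practice, "OpenAI-compatible" means provider selection often reduces to a base URL and an API key. A minimal sketch, with endpoint URLs that are believed current but should be checked against each provider's documentation:

```python
import os

# Provider registry for OpenAI-compatible endpoints. URLs are
# illustrative; verify them against each provider's docs.
PROVIDERS = {
    "openai":   {"base_url": "https://api.openai.com/v1", "env_key": "OPENAI_API_KEY"},
    "deepseek": {"base_url": "https://api.deepseek.com",  "env_key": "DEEPSEEK_API_KEY"},
    "mistral":  {"base_url": "https://api.mistral.ai/v1", "env_key": "MISTRAL_API_KEY"},
}

def client_config(provider: str) -> dict:
    """Return the kwargs an OpenAI-compatible client constructor expects."""
    cfg = PROVIDERS[provider]
    return {"base_url": cfg["base_url"],
            "api_key": os.environ.get(cfg["env_key"], "")}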
The operational challenge is cost visibility. Each provider has its own billing dashboard, usage metrics, and token counting. When you use three or four providers simultaneously, understanding total AI cost per customer requires aggregating data from all of them. MarginDash handles this with one SDK that tracks usage across every provider and connects it to revenue.
Vendor lock-in risk is often overstated. The real lock-in is not in the API format but in provider-specific features: fine-tuned models, cached context windows, or batch processing APIs. Keep your core request/response layer portable and accept that some provider-specific features will create soft dependencies.
The dominant pricing model is per-token, with separate rates for input and output. Output tokens are typically 2-4x more expensive. For reasoning models like o3 and DeepSeek R1, the gap is wider because chain-of-thought processing generates large volumes of intermediate tokens. Enterprise volume commitments can reduce per-token costs by 20-40%, but require predictable usage.
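The arithmetic behind per-token billing is simple but worth making explicit, since the input/output asymmetry dominates costs for chatty workloads. The rates in the example are hypothetical:

```python
# Per-request cost under per-token pricing with separate input and
# output rates, both expressed in dollars per 1M tokens.
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_1m: float, output_per_1m: float) -> float:
    return (input_tokens / 1_000_000) * input_per_1m \
         + (output_tokens / 1_000_000) * output_per_1m

# e.g. a 4K-token prompt and 1K-token reply at $0.50 / $2.00 per 1M:
cost = request_cost(4_000, 1_000, 0.50, 2.00)  # → $0.004
```

Note that at a 4x output premium, the 1K-token reply here costs as much as the 4K-token prompt, which is why reasoning models with long hidden outputs can surprise you.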
For teams building products on top of LLM APIs, per-token pricing creates variable costs that scale with usage. This becomes unpredictable when your customers control the input — a customer who sends 10x more context costs 10x more to serve, even on the same plan. Understanding per-customer cost variance is essential for maintaining healthy margins.
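A per-customer cost rollup makes that variance visible. The sketch below aggregates hypothetical usage records; the record shape and prices are assumptions for illustration:

```python
from collections import defaultdict

# Aggregate per-request token usage into dollar cost per customer.
# Record shape and rates are illustrative.
def cost_per_customer(usage_records, input_per_1m, output_per_1m):
    totals = defaultdict(float)
    for r in usage_records:
        totals[r["customer_id"]] += (
            r["input_tokens"] / 1_000_000 * input_per_1m
            + r["output_tokens"] / 1_000_000 * output_per_1m
        )
    return dict(totals)

records = [
    {"customer_id": "a", "input_tokens": 100_000,   "output_tokens": 20_000},
    {"customer_id": "b", "input_tokens": 1_000_000, "output_tokens": 200_000},
]
# Customer "b" sends 10x the tokens of "a" and costs 10x as much,
# even though both may be on the same plan.
costs = cost_per_customer(records, 0.50, 2.00)
```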
Prompt caching is an often-overlooked factor. OpenAI, Anthropic, and Google all offer cached input pricing for repeated prompts (like system prompts), which can cut input costs significantly. But caching behavior differs by provider — some cache automatically, some require explicit API parameters, and cache lifetimes vary. This makes apples-to-apples comparisons harder than headline per-token rates suggest.
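The effect of caching on headline rates can be estimated with a blended price. The 90% cached-token discount below is a stand-in, not any specific provider's rate; actual discounts, cache lifetimes, and hit behavior vary.

```python
# Effective input price given a cache hit rate and a cached-token
# discount. The 90% discount is a hypothetical stand-in; real
# discounts and cache semantics vary by provider.
def effective_input_price(price_per_1m: float, cache_hit_rate: float,
                          cached_discount: float = 0.90) -> float:
    cached_price = price_per_1m * (1 - cached_discount)
    return cache_hit_rate * cached_price \
         + (1 - cache_hit_rate) * price_per_1m

# 70% of input tokens served from cache at a 90% discount:
eff = effective_input_price(1.00, 0.70)  # → $0.37 per 1M
```

A nominally more expensive provider with a deeper cache discount can beat a cheaper one on workloads dominated by a large, stable system prompt.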
Each provider has a distinct philosophy around pricing, model variety, and target use cases. Here is what you need to know about each one.
OpenAI remains the default choice for most teams. Their lineup spans from GPT-4o Mini for cost-sensitive workloads to o3 for deep reasoning. Pricing sits in the mid-to-premium range — rarely the cheapest, but the most mature ecosystem. Virtually every LLM tool and tutorial assumes the OpenAI API format, which means less friction during development. For enterprises, Azure-hosted models provide SOC 2 compliance, HIPAA eligibility, and regional data residency.
One caveat: the o-series reasoning models generate internal chain-of-thought tokens that you pay for but do not see in the response, making cost-per-request harder to predict. If you use o3 in production, tracking actual token consumption per request is essential.
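A sketch of that tracking, assuming a usage object shaped like OpenAI's (with a `completion_tokens_details.reasoning_tokens` field); treat the exact field names as an assumption and check your provider's response documentation:

```python
# Split billed output tokens into visible and hidden reasoning tokens.
# The usage dict shape mirrors OpenAI's response format as an
# assumption; verify field names against the provider's docs.
def billed_output_tokens(usage: dict) -> dict:
    details = usage.get("completion_tokens_details", {})
    reasoning = details.get("reasoning_tokens", 0)
    visible = usage["completion_tokens"] - reasoning
    return {"visible": visible, "reasoning": reasoning,
            "billed": usage["completion_tokens"]}

usage = {"prompt_tokens": 500, "completion_tokens": 2_200,
         "completion_tokens_details": {"reasoning_tokens": 1_800}}
breakdown = billed_output_tokens(usage)
# Here most of what you pay for (1,800 of 2,200 output tokens)
# never appears in the response text.
```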
Anthropic's Claude family is known for strong instruction-following, long-context performance, and safety. Claude 4 and Claude 3.5 Sonnet are frequently cited as top models for coding and nuanced reasoning. Anthropic was the first provider to offer 200K context windows as standard, making it the natural choice for applications that process long documents or codebases.
Pricing is competitive at the mid-tier with Sonnet and Haiku, though frontier models sit at the premium end. Anthropic uses its own Messages API rather than the OpenAI-compatible format, so switching requires some integration work. Prompt caching significantly reduces costs for applications with repeated system prompts.
Google's Gemini family competes across all tiers. Gemini 2.5 Pro matches GPT-5 and Claude 4 on benchmarks, while Gemini Flash offers aggressive pricing for high-volume workloads. Google's unique advantage is native multimodal support — text, images, video, and audio in a single model. Pricing is often the most competitive among the three major frontier providers, and a generous free tier through Google AI Studio is useful for prototyping.
The main consideration is pace of change. Google iterates rapidly on the Gemini lineup, sometimes deprecating versions on shorter timelines. Build with version pinning and test against new releases before switching.
DeepSeek has disrupted pricing by offering competitive benchmark scores at a fraction of the cost. DeepSeek V3.2 and R1 demonstrate that high-quality models do not require premium pricing. R1 is an open-weights reasoning model that competes with OpenAI's o1 on math and coding benchmarks — and can be self-hosted to eliminate per-token costs entirely.
The trade-offs are operational: the company is based in China (raising data residency questions for some organizations), the API has experienced capacity constraints during peak demand, and models may behave differently on safety-sensitive tasks. For internal tools, batch processing, and cost-sensitive workloads, these trade-offs are often acceptable.
Mistral (Paris) offers efficient models from the affordable Mistral Small to frontier-capable Mistral Large 3. EU-hosted infrastructure makes it the default for GDPR-conscious organizations. OpenAI-compatible API makes it easy to add as a secondary provider.
xAI's Grok 4 competes at the frontier level with large context windows and competitive pricing. The ecosystem is still maturing — documentation, SDK support, and community resources lag behind OpenAI and Anthropic.
Meta's Llama models are open-weights — you self-host or access through third-party providers like AWS Bedrock, Azure, or Groq. No per-token costs to Meta, but self-hosting requires GPU infrastructure and ML operations expertise.
Cohere specializes in enterprise search and RAG. Their Command models are capable generalists, but the differentiator is tight integration between language models, embeddings, and reranking. Strong for retrieval workflows, not the first choice for complex reasoning.
Per-token pricing is only part of the cost equation. A provider that is 30% cheaper but experiences regular outages or aggressive rate limits can cost more in engineering time than the savings are worth. Rate limits vary significantly across providers and tiers — a free-tier account may be limited to a handful of requests per minute, while a high-volume production account gets thousands. If your application has bursty traffic, rate limit headroom matters more than average throughput.
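The standard mitigation for rate limits is retry with jittered exponential backoff. The sketch below uses a placeholder exception type and request callable; real client libraries raise their own rate-limit exceptions (and often return a `Retry-After` header worth honoring).

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for a provider's 429 / rate-limit exception."""

def with_backoff(send_request, max_retries: int = 5,
                 base_delay: float = 0.5):
    """Retry `send_request` on rate limits with jittered exponential
    backoff: ~0.5s, 1s, 2s, ... plus random noise to avoid thundering
    herds. Re-raises after the final attempt."""
    for attempt in range(max_retries):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.random() * 0.1)
```

Backoff protects you from transient 429s, but it is not a substitute for provisioning enough rate-limit headroom for your traffic peaks.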
The OpenAI chat completions format has become the lingua franca of LLM APIs. DeepSeek, Mistral, Together AI, and Groq all accept the same request format, so switching providers can be as simple as changing the base URL and API key. Anthropic and Google diverge for some features, but most teams use an abstraction layer to normalize the differences. Where compatibility breaks down is in advanced features: function calling schemas, structured output guarantees, vision input encoding, and streaming chunk formats all vary between providers. Test thoroughly when switching rather than assuming compatibility.
Geographic availability also matters. If your servers are in Europe but your LLM endpoint is in the US, every call incurs transatlantic latency. OpenAI and Google offer multi-region endpoints. Anthropic is available through AWS Bedrock and Google Cloud for regional availability. For latency-sensitive applications like real-time chat, choosing a provider with a nearby endpoint can matter as much as generation speed.
Knowing the price per token is the first step. Knowing how much each customer costs you — across OpenAI, Anthropic, Google, DeepSeek, and every other provider — is the step most teams skip. MarginDash connects usage to Stripe revenue and shows you margin per customer.
See My Margin Data (no credit card required).
Create an account, install the SDK, and see your first margin data in minutes.