Model Comparison
Every OpenAI model ranked by intelligence score and priced per million tokens.
Pricing and benchmark scores updated daily.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Intelligence Index | Tier |
|---|---|---|---|---|
| GPT-5.3 Codex (xhigh) | $1.75 | $14.00 | 54.0 | Frontier |
| GPT-5.2 (xhigh) | $1.75 | $14.00 | 51.3 | Frontier |
| GPT-5.2 Codex (xhigh) | $1.75 | $14.00 | 49.0 | Frontier |
| GPT-5.1 (high) | $1.25 | $10.00 | 47.7 | Frontier |
| GPT-5.2 (medium) | $1.75 | $14.00 | 46.6 | Frontier |
| GPT-5 (high) | $1.25 | $10.00 | 44.6 | Frontier |
| GPT-5 Codex (high) | $1.25 | $10.00 | 44.6 | Frontier |
| GPT-5.1 Codex (high) | $1.25 | $10.00 | 43.1 | Frontier |
| GPT-5 (medium) | $1.25 | $10.00 | 42.0 | Frontier |
| GPT-5 mini (high) | $0.25 | $2.00 | 41.2 | Frontier |
| o3-pro | $20.00 | $80.00 | 40.7 | Frontier |
| GPT-5 (low) | $1.25 | $10.00 | 39.2 | Mid-tier |
| GPT-5 mini (medium) | $0.25 | $2.00 | 38.9 | Mid-tier |
| GPT-5.1 Codex mini (high) | $0.25 | $2.00 | 38.6 | Mid-tier |
| o3 | $2.00 | $8.00 | 38.4 | Mid-tier |
| GPT-5.2 (Non-reasoning) | $1.75 | $14.00 | 33.6 | Mid-tier |
| gpt-oss-120B (high) | $0.04 | $0.19 | 33.3 | Mid-tier |
| o4-mini (high) | $1.10 | $4.40 | 33.1 | Mid-tier |
| o1 | $15.00 | $60.00 | 30.8 | Mid-tier |
| GPT-5.1 (Non-reasoning) | $1.25 | $10.00 | 27.4 | Mid-tier |
| GPT-5 nano (high) | $0.05 | $0.40 | 26.8 | Mid-tier |
| GPT-4.1 | $2.00 | $8.00 | 26.3 | Mid-tier |
| GPT-5 nano (medium) | $0.05 | $0.40 | 25.9 | Mid-tier |
| o3-mini | $1.10 | $4.40 | 25.9 | Mid-tier |
| o1-pro | $150.00 | $600.00 | 25.8 | Mid-tier |
| o3-mini (high) | $1.10 | $4.40 | 25.2 | Mid-tier |
| gpt-oss-20B (high) | $0.03 | $0.14 | 24.5 | Budget |
| gpt-oss-120B (low) | $0.04 | $0.19 | 24.5 | Budget |
| GPT-5 (minimal) | $1.25 | $10.00 | 23.9 | Budget |
| o1-preview | $15.00 | $60.00 | 23.7 | Budget |
| GPT-4.1 mini | $0.40 | $1.60 | 22.9 | Budget |
| GPT-5 (ChatGPT) | $1.25 | $10.00 | 21.8 | Budget |
| gpt-oss-20B (low) | $0.03 | $0.14 | 20.8 | Budget |
| GPT-5 mini (minimal) | $0.25 | $2.00 | 20.7 | Budget |
| o1-mini | $1.10 | $4.40 | 20.4 | Budget |
| GPT-4o | $2.50 | $10.00 | 18.6 | Budget |
| GPT-4o | $2.50 | $10.00 | 17.3 | Budget |
| GPT-5 nano (minimal) | $0.05 | $0.40 | 15.6 | Budget |
| GPT-4.1 nano | $0.10 | $0.40 | 14.9 | Budget |
| GPT-4o | $5.00 | $15.00 | 14.5 | Budget |
| GPT-4 Turbo | $10.00 | $30.00 | 13.7 | Budget |
| GPT-4 | $30.00 | $60.00 | 12.8 | Budget |
| GPT-4o mini | $0.15 | $0.60 | 12.6 | Budget |
| GPT-3.5 Turbo | $0.50 | $1.50 | 9.0 | Budget |
Prices in USD. Updated daily. 44 OpenAI models with pricing and benchmark data.
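Per-million-token prices translate to per-request cost by multiplying each rate by the tokens actually used. A minimal sketch, using rates from the table above (the token counts are illustrative assumptions):

```python
# Per-request cost from per-1M-token prices (USD).
# Rates taken from the table above; token counts are illustrative.
PRICES = {
    "gpt-5": (1.25, 10.00),      # (input, output) per 1M tokens
    "gpt-5-nano": (0.05, 0.40),
    "o3": (2.00, 8.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request."""
    in_rate, out_rate = PRICES[model]
    return input_tokens * in_rate / 1e6 + output_tokens * out_rate / 1e6

# Example: a 1,000-token prompt with a 500-token reply on GPT-5
cost = request_cost("gpt-5", 1_000, 500)  # 0.00125 + 0.00500 = $0.00625
```

The same request on GPT-5 nano costs $0.00025, a 25x difference, which is where the "10-40x cheaper" framing for budget models comes from.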
OpenAI's model catalog has grown significantly. The GPT-5 family is the current flagship line, spanning from GPT-5 nano (high-volume, cost-sensitive workloads) through GPT-5 (general-purpose default) up to GPT-5.2 (highest benchmark scores). The GPT-4.1 series remains available and still widely used in production, particularly GPT-4.1 mini for teams that validated on it and prefer stability over switching. The o-series models (o3, o4-mini) are reasoning models that use chain-of-thought processing — they excel at math, logic, and coding but consume more output tokens per request. Codex models are optimized for code generation and software engineering tasks.
What makes this catalog difficult to navigate is that OpenAI now offers multiple compute tiers for the same base model. GPT-5 can appear as GPT-5 (minimal), (low), (medium), (high), and (xhigh) — the per-token price is the same across tiers (see the table above), but each tier spends a different amount of reasoning effort, so per-request cost and quality differ. The tier you select can change costs by an order of magnitude.
The practical challenge is no longer just "which model" but "which model at which tier for which feature." A customer-facing chatbot might use GPT-5 (medium), a background summarization pipeline GPT-5 nano (low), and a complex analysis feature o3. Each combination has a different cost profile, and the aggregate bill hides the per-feature and per-customer breakdown that actually matters for margin.
For budget-sensitive, high-volume tasks — classification, extraction, simple summarization — GPT-5 nano or GPT-4.1 mini are the right starting points. They can cut costs by 10-40x compared to GPT-5 or o3 with minimal quality loss. For general production workloads — customer-facing chat, content generation, document analysis — GPT-5 offers the best quality-to-cost ratio. For complex reasoning — multi-step math, code generation with debugging, scientific analysis — the o-series models (o3, o4-mini) are purpose-built and worth the higher per-request cost.
A common production pattern is to start with GPT-5, validate it works, then experiment with cheaper models. Many teams find GPT-5 nano or GPT-4.1 mini can handle 60-80% of request volume with no noticeable quality drop. The remaining requests stay on the more capable model. This model routing is where the biggest cost savings come from.
The mistake most teams make is choosing a model once and never revisiting it. OpenAI releases new models regularly, and the price-performance ratio shifts with each release. Reviewing your model choices quarterly — with per-feature cost data to inform decisions — is what separates teams that control AI costs from teams that hope for the best.
OpenAI's reasoning models — o3 and o4-mini — work fundamentally differently from the GPT series. When you send a prompt to o3, it generates a chain of intermediate reasoning steps before producing the final answer. This "thinking" process happens in the output, which means reasoning models produce significantly more output tokens per request than a standard GPT model for the same prompt.
This has direct cost implications. Even though o3's per-token pricing ($2.00/$8.00 per million tokens) is comparable to GPT-5's ($1.25/$10.00), the actual cost per request can be 3-5x higher because o3 generates more output tokens during reasoning. On math benchmarks (AIME), logic puzzles, and complex coding tasks, reasoning models score substantially higher. On simple tasks like classification or extraction, the reasoning overhead adds cost without improving results.
A factor teams overlook is that chain-of-thought tokens are not user-facing. The final answer may be short, but the model generated hundreds or thousands of reasoning tokens to get there — all billed at the output token rate. For tasks where a standard GPT model already produces correct answers, those reasoning tokens are pure waste.
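The arithmetic behind the 3-5x figure is straightforward once hidden reasoning tokens are counted as billed output. A sketch under assumed token counts (the ~2,000 reasoning tokens are an illustration, not a measured value), using the rates from the table:

```python
# Illustrative: why o3 can cost far more per request despite similar rates.
# Token counts here are assumptions for the example, not measured values.
def cost(in_tok: int, out_tok: int, in_rate: float, out_rate: float) -> float:
    return in_tok * in_rate / 1e6 + out_tok * out_rate / 1e6

prompt = 800  # the same prompt sent to both models

gpt5 = cost(prompt, 300, 1.25, 10.00)        # 300-token answer, no reasoning trace
o3 = cost(prompt, 300 + 2_000, 2.00, 8.00)   # same answer plus ~2,000 reasoning tokens

# gpt5 = $0.0040, o3 = $0.0200 — 5x more, driven almost entirely
# by reasoning tokens billed at the output rate
```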
The practical rule: if a competent human could answer the question in under five seconds without writing anything down, a reasoning model is overkill. If they would need a scratch pad or several minutes of focused thought, a reasoning model will produce a better answer. o4-mini offers a budget-friendly entry point — less capable than o3 on the hardest problems, but significantly cheaper and still better than standard models on logic-heavy tasks.
GPT-5 nano sits at the bottom of the family — the cheapest option, built for high-volume workloads where per-request cost matters more than peak quality. Classification, tagging, simple extraction, and routing tasks are natural fits. GPT-5 mini occupies the middle ground: meaningfully more capable than nano on complex prompts but still priced well below the full GPT-5. Many production applications route the majority of traffic through mini and only escalate to the full GPT-5 for requests that need it.
The standard GPT-5 model is the general-purpose workhorse, scoring highest on intelligence benchmarks among non-reasoning models. For teams that want one model for most things without complex routing logic, GPT-5 (medium or high tier) is the default choice. GPT-5.1 and GPT-5.2 push quality further, with GPT-5.2 (xhigh) representing the highest benchmark scores in the OpenAI catalog — at correspondingly higher prices.
GPT-5 Codex is purpose-built for code generation and software engineering tasks — understanding codebases, generating functions, writing tests, and debugging. If your product includes AI-powered code features, Codex is likely to outperform the general-purpose GPT-5 on those tasks while potentially costing less per equivalent-quality output, since it needs fewer tokens to produce correct code.
OpenAI now offers many models at multiple compute tiers: minimal, low, medium, high, and xhigh. The same underlying model is available at each tier at the same per-token rate — higher tiers apply more reasoning effort per request, producing better responses but generating more billed output tokens. This gives engineering teams a lever to optimize costs without switching models entirely. For many production workloads with structured prompts and well-defined output formats, a lower tier performs just as well as a higher one.
The cost differences between tiers can be substantial. Moving from GPT-5 (high) to GPT-5 (minimal) keeps the same per-token rate ($1.25/$10.00 in the table above) but sharply reduces the output tokens generated per request, and the model architecture stays the same. This is lower-risk than switching models entirely, which may change behavior in ways that break your prompts.
The practical approach: default to a middle tier (medium or high) during development, then A/B test lower tiers in production. Track accuracy, user satisfaction, and task completion rate alongside cost per request. If quality holds, drop the tier. If not, you have the data to justify the higher spend.
OpenAI regularly deprecates older models. The typical pattern is an announcement followed by a grace period before the old model stops accepting requests. Teams with hardcoded model names and no migration plan can find themselves scrambling to update, test, and deploy on a tight timeline.
Migrations are not search-and-replace. A newer model may produce longer or shorter outputs, interpret ambiguous instructions differently, or change structured response formats. Testing prompts against the new model and validating output quality is essential — skipping this step is how teams end up with subtle quality regressions in production.
Deprecation also affects costs. Per-token pricing almost always changes with a newer model — sometimes cheaper, sometimes more expensive. OpenAI has historically dropped prices over time, but newer, more capable models sometimes cost more. Teams that track cost per model and per feature can predict the financial impact of a migration before it happens, rather than discovering it on the next invoice.
The single biggest optimization is model routing — sending different request types to different models based on complexity. A request classifier (which can itself be a cheap model like GPT-5 nano) routes simple tasks to budget models and complex tasks to capable ones. Teams that implement model routing typically see 40-70% cost reductions compared to sending everything to a single high-end model, with minimal quality impact.
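A routing layer can start as a simple heuristic before graduating to a classifier model. This sketch uses placeholder keywords and model names — in practice the classifier is often itself a cheap model call, not a keyword match:

```python
# Sketch of complexity-based model routing. Keywords and model names are
# illustrative placeholders, not a production-ready classifier.
SIMPLE_KEYWORDS = {"classify", "extract", "tag", "summarize"}

def route(prompt: str) -> str:
    words = prompt.lower().split()
    # Short prompts asking for classification/extraction go to the budget model.
    if len(words) < 50 and SIMPLE_KEYWORDS & set(words):
        return "gpt-5-nano"
    # Logic- and code-heavy requests escalate to the reasoning model.
    if "prove" in words or "debug" in words:
        return "o3"
    return "gpt-5"  # default general-purpose workhorse
```

A misroute downward costs a retry on the bigger model; a misroute upward costs only money, which is why teams usually tune the heuristic to be conservative at first.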
Prompt optimization is the second lever. Shorter prompts cost less because they use fewer input tokens. A well-written 500-token prompt often produces better results than a sloppy 2,000-token prompt and costs 75% less on input. Caching is the third — if your application sends the same or similar prompts repeatedly, caching responses eliminates the API call entirely. OpenAI also offers prompt caching at the API level for long system prompts, reducing input token cost for requests that share a common prefix.
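Application-level response caching can be as simple as keying on a hash of the model and prompt, so an identical request never pays for a second API call. A minimal sketch — `call_model` here is a stand-in for the real API call, not an actual SDK function:

```python
import hashlib

# Minimal response cache in front of the API. The cache key covers both
# model and prompt, so changing either one is a cache miss.
_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str, call_model) -> str:
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model, prompt)  # only misses cost money
    return _cache[key]

# Demo with a fake API call that counts invocations
calls = 0
def fake_call(model, prompt):
    global calls
    calls += 1
    return f"reply to: {prompt}"

cached_completion("gpt-5-nano", "tag this", fake_call)
cached_completion("gpt-5-nano", "tag this", fake_call)  # served from cache
```

Real deployments add an expiry policy and a shared store such as Redis, but the cost mechanics are the same: every cache hit is an API call you did not pay for.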
Finally, per-customer and per-feature budgets prevent cost surprises. Without them, a single customer with unusual usage patterns can blow through your expected costs. Per-feature budgets also identify which features are the most expensive to operate, giving you the data to decide whether to optimize, reprice, or deprecate them.
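The enforcement side of a per-customer budget is a running spend counter checked on every request. A sketch — the dollar limits and the block-on-overflow policy are illustrative assumptions (downgrading to a cheaper model is an equally valid response):

```python
from collections import defaultdict

# Per-customer monthly budget guard. Limits are illustrative; the overflow
# policy (block vs. downgrade to a cheaper model) is a product decision.
BUDGETS = {"acme": 50.00}   # USD per month for specific customers
DEFAULT_BUDGET = 10.00

spend = defaultdict(float)  # accumulated spend this month, per customer

def record_and_check(customer: str, request_cost: float) -> bool:
    """Record spend; return False once the customer exceeds their budget."""
    spend[customer] += request_cost
    return spend[customer] <= BUDGETS.get(customer, DEFAULT_BUDGET)

record_and_check("acme", 30.00)       # within the $50 cap
ok = record_and_check("acme", 25.00)  # $55 total exceeds the cap
```

The same counter, grouped by feature instead of customer, is what surfaces which features are the most expensive to operate.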
Knowing the price per token is the first step. Knowing how much each customer costs you — and whether they are profitable — is the step most teams skip. MarginDash connects OpenAI usage to Stripe revenue and shows you margin per customer.
See My Margin Data. No credit card required.
Create an account, install the SDK, and see your first margin data in minutes.