Blog · February 15, 2026
The Cheapest AI Models That Are Actually Good
Most teams pick from a shortlist of three or four well-known models. GPT-4.1, Claude 4.5 Sonnet, maybe Gemini. These are good models. They are not good value.
We ranked 272 models from our pricing database (397 models across 41 vendors) by intelligence per dollar. The results surprised us: the best value models deliver flagship-level benchmark scores at a fraction of the cost.
How we ranked them
For each model, we calculated:
- Cost per 1,000 requests based on each vendor's list price
- Intelligence Index — a composite benchmark score from MMLU-Pro, GPQA, and AIME
- Value score — Intelligence Index divided by cost per 1,000 requests. Higher means more intelligence per dollar.
We filtered to 272 models that have both published pricing and benchmark scores. Models with zero or missing pricing were excluded.
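Here's a minimal sketch of that ranking in Python. The three example records and the field names are illustrative, not the actual MarginDash schema; the point is just to show how the value score falls out of two numbers.

```python
# Toy records: published price per 1,000 requests and Intelligence Index.
# Values come from the tables below; field names are illustrative.
models = [
    {"name": "MiMo-V2-Flash", "cost_per_1k": 0.25, "intelligence_index": 41.4},
    {"name": "DeepSeek V3.2 (Reasoning)", "cost_per_1k": 0.49, "intelligence_index": 41.6},
    {"name": "o3-pro", "cost_per_1k": 60.00, "intelligence_index": 40.7},
]

# Exclude models with zero or missing pricing or benchmark scores,
# then rank by intelligence per dollar.
ranked = sorted(
    (m for m in models if m.get("cost_per_1k") and m.get("intelligence_index")),
    key=lambda m: m["intelligence_index"] / m["cost_per_1k"],
    reverse=True,
)

for m in ranked:
    print(f'{m["name"]}: value score {m["intelligence_index"] / m["cost_per_1k"]:.1f}')
```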
One caveat: list price doesn't tell the full story for model swaps. Different models burn different token counts for the same task, so a model with a low sticker price can cost more per task. When we compare swap savings below, we normalize for token efficiency using data from the Artificial Analysis Intelligence Index (AAII).
The top 10 value picks
These models score at least 25 on the Intelligence Index (solid production quality) and cost under $2.50 per 1,000 requests. Sorted by value score — the most intelligence per dollar first.
| # | Model | Vendor | Intelligence Index | $/1K requests | Value score |
|---|---|---|---|---|---|
| 1 | MiMo-V2-Flash | Xiaomi | 41.4 | $0.25 | 165.6 |
| 2 | GLM-4.7-Flash (Reasoning) | Z AI | 30.1 | $0.27 | 111.5 |
| 3 | GPT-5 nano (high) | OpenAI | 26.7 | $0.25 | 106.8 |
| 4 | Grok 4.1 Fast (Reasoning) | xAI | 38.5 | $0.45 | 85.6 |
| 5 | DeepSeek V3.2 (Reasoning) | DeepSeek | 41.6 | $0.49 | 84.9 |
| 6 | Grok 4 Fast (Reasoning) | xAI | 34.9 | $0.45 | 77.6 |
| 7 | gpt-oss-120B (high) | OpenAI | 33.3 | $0.45 | 74.0 |
| 8 | MiniMax-M2.1 | MiniMax | 39.5 | $0.90 | 43.9 |
| 9 | GPT-5 mini (high) | OpenAI | 41.0 | $1.25 | 32.8 |
| 10 | Gemini 3 Flash Preview (Reasoning) | Google | 46.4 | $2.00 | 23.2 |
The #1 spot goes to Xiaomi's MiMo-V2-Flash: an Intelligence Index of 41.4 — comparable to Claude 4.5 Sonnet (42.9) — at $0.25 per 1,000 requests.
Seven of the ten models in this list aren't from OpenAI or Anthropic. The best value in AI right now is coming from DeepSeek, Xiaomi, xAI, Z AI, MiniMax, and Google.
Flagship intelligence doesn't have to cost flagship prices
Here are models in our database scoring 40+ on the Intelligence Index, sorted by cost. The cheapest costs $0.25. The most expensive costs $60.00. Same benchmark tier.
| Model | Vendor | Intelligence Index | $/1K requests |
|---|---|---|---|
| MiMo-V2-Flash | Xiaomi | 41.4 | $0.25 |
| DeepSeek V3.2 (Reasoning) | DeepSeek | 41.6 | $0.49 |
| GPT-5 mini (high) | OpenAI | 41.0 | $1.25 |
| GLM-4.7 (Reasoning) | Z AI | 42.0 | $1.55 |
| Kimi K2 Thinking | Kimi | 40.7 | $1.85 |
| Gemini 3 Flash Preview (Reasoning) | Google | 46.4 | $2.00 |
| Kimi K2.5 (Reasoning) | Kimi | 46.7 | $2.10 |
| GLM-5 (Reasoning) | Z AI | 49.6 | $2.60 |
| GPT-5 (medium) | OpenAI | 41.8 | $6.25 |
| GPT-5.1 (high) | OpenAI | 47.6 | $6.25 |
| Gemini 3 Pro Preview (high) | Google | 48.4 | $8.00 |
| Grok 4 | xAI | 41.4 | $10.50 |
| Claude 4.5 Sonnet (Reasoning) | Anthropic | 42.9 | $10.50 |
| Claude Opus 4.5 (Reasoning) | Anthropic | 49.7 | $17.50 |
| Claude Opus 4.6 (Adaptive Reasoning) | Anthropic | 53.0 | $17.50 |
| o3-pro | OpenAI | 40.7 | $60.00 |
The first eight rows are the value picks (under $3 per 1,000 requests); the rest are the common defaults ($6 and up).
Eight models clear the 40+ Intelligence Index bar for under $3 per 1,000 requests. Eight more — from OpenAI, Google, xAI, and Anthropic — deliver similar scores between $6 and $60.
The price difference between the cheapest (MiMo at $0.25) and most expensive (o3-pro at $60.00) flagship model is 240x. The Intelligence Index difference is 0.7 points — o3-pro actually scores lower than MiMo.
Where the defaults land
These are the models most teams use without thinking twice. Here's how they compare to the value leaders — adjusted for token efficiency using AAII normalization.
| Default model | Intelligence Index | List $/1K* | Better value alternative | Intelligence Index | List $/1K* | Adj. savings |
|---|---|---|---|---|---|---|
| o3-pro | 40.7 | $60.00 | DeepSeek V3.2 (Reasoning) | 41.6 | $0.49 | 50x |
| Claude 4.5 Haiku (Non-reasoning) | 31.0 | $3.50 | Grok 4 Fast (Reasoning) | 34.9 | $0.45 | 21x |
| GPT-4.1 | 25.6 | $6.00 | gpt-oss-120B (high) | 33.3 | $0.45 | 16x |
| Claude 4.5 Sonnet (Non-reasoning) | 37.1 | $10.50 | MiMo-V2-Flash | 41.4 | $0.25 | 13x |
*List price per 1,000 requests. Adjusted savings account for token efficiency using AAII normalization — see methodology below.
In every case, the alternative scores higher on benchmarks and costs less — even after adjusting for token efficiency. The o3-pro to DeepSeek V3.2 swap is the most dramatic: higher intelligence score, 50x cheaper in normalized cost.
Many enterprise teams stay on these models for compliance, security, or API stability reasons — not because they've compared the alternatives. That's the legacy tax.
Before you swap everything
Benchmarks aren't the full picture. There are real reasons teams choose higher-priced models:
- API reliability and uptime. OpenAI and Anthropic have years of production API infrastructure. Newer providers may have less mature SLAs.
- Latency and reasoning overhead. Many of the value leaders in this list are reasoning models. They may have fast time-to-first-token, but total response time can be significantly longer because the model "thinks" before answering. For latency-sensitive applications like real-time chat, test actual end-to-end response times — not just benchmarks. A quick way to do that is sketched after this list.
- Task-specific performance. Intelligence Index measures general reasoning. Your customer support chatbot might perform differently than the benchmarks predict. Always test on your own data.
- Ecosystem and tooling. SDK support, function calling, structured outputs, and documentation vary by vendor.
- Data residency and compliance. Some vendors may not meet your regulatory requirements.
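To sanity-check the latency point above, here's a minimal sketch that times time-to-first-token against total response time. It assumes an OpenAI-compatible chat endpoint; the base URL, API key, and model name are placeholders for whichever provider you're evaluating.

```python
import time
from openai import OpenAI

# Placeholder endpoint and key -- swap in the provider you're testing.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

def measure_latency(model: str, prompt: str) -> dict:
    start = time.perf_counter()
    first_token_at = None
    chunks = []

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content or ""
        if delta and first_token_at is None:
            first_token_at = time.perf_counter()  # first visible token
        chunks.append(delta)

    end = time.perf_counter()
    return {
        "ttft_s": (first_token_at or end) - start,  # feels fast
        "total_s": end - start,                     # what users actually wait for
        "chars": len("".join(chunks)),
    }

print(measure_latency("your-reasoning-model", "Summarize our refund policy in one paragraph."))
```

Run it against your current default and the candidate swap with the same prompts; a reasoning model can post a competitive time-to-first-token while the total response time tells a very different story.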
The point isn't that you should switch to MiMo tomorrow. It's that you should know what you're paying for — and whether the premium is justified by your actual requirements.
How we got these numbers
All pricing and Intelligence Index scores come from the MarginDash model database: 397 models across 41 vendors, synced daily from vendor pricing pages.
The “Where the defaults land” comparison table uses AAII-normalized costs. Different models consume different numbers of tokens for the same task — a model with a low list price can cost more per task if it burns significantly more tokens. We normalize using token consumption data from the Artificial Analysis Intelligence Index (AAII) benchmark to estimate what a model swap would actually cost in production. This is the same methodology our cost simulator uses.
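As a rough sketch of what that normalization does, assume you have each model's blended per-token price and the average tokens it consumed per AAII task. The numbers below are invented for illustration and are not the figures behind the table above, nor the exact formula MarginDash applies.

```python
def cost_per_task(price_per_1m_tokens: float, avg_tokens_per_task: float) -> float:
    """Estimated cost of a single task at a model's blended per-token price."""
    return price_per_1m_tokens * avg_tokens_per_task / 1_000_000

# Hypothetical inputs: an expensive default vs. a cheap reasoning model
# that burns more tokens per task because it "thinks" before answering.
default_cost = cost_per_task(price_per_1m_tokens=60.0, avg_tokens_per_task=2_000)
alternative_cost = cost_per_task(price_per_1m_tokens=1.0, avg_tokens_per_task=5_000)

# Adjusted savings: how much cheaper the swap is per task once token
# consumption is factored in, not just the sticker price.
print(f"Adjusted savings: {default_cost / alternative_cost:.0f}x")  # -> 24x
```

Note how the token-hungry alternative still comes out far cheaper per task, but by a smaller multiple than the raw price gap would suggest. That is why the adjusted savings in the table differ from a naive list-price ratio.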
All prices reflect standard real-time inference. Batch pricing, cached-input discounts, and volume agreements will shift the numbers — in some cases significantly.
You can explore all 397 models, filter by vendor, and run your own cost comparisons inside MarginDash — sign up free to access the model database and cost simulator.