LLM Inference Cost Calculator 2026: Cost Per Million Tokens by GPU & Provider

Input your model size and monthly token volume. Get instant cost-per-million-token breakdown across 32 GPU cloud providers — with direct comparison against OpenAI and Anthropic API pricing.

OpenAI GPT-4o Mini API $0.75/1M tokens
Claude Haiku API $0.80/1M tokens
⚠ Throughput estimates only. Token throughput varies significantly by inference framework (vLLM, TGI, llama.cpp, SGLang), batch size, context length, quantization (FP16, INT8, INT4), and hardware utilization. Values shown assume FP16 precision with vLLM at moderate batch sizes. Actual throughput may be 30–300% higher or lower. Always benchmark on your specific workload before making procurement decisions.
GPU pricing sourced from the GridStackHub live database, Apr 2026.

How This Calculator Works

Cost per token for self-hosted inference is a function of GPU hourly rate and token throughput. Here's the math:

1. Token Throughput Estimate

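Each GPU has an approximate baseline throughput for a 7B model, measured in tokens/second. Throughput scales roughly inversely with parameter count: a 70B model runs ~10× slower than a 7B model on the same hardware. The reference table below uses scale factors of 1.00, 0.54, 0.21, and 0.10 for 7B, 13B, 34B, and 70B models respectively.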

throughput = gpu_base_tps × model_scale_factor
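
As a rough sketch, the scaling rule can be written in Python. The function name and the `7 / params` scale factor are assumptions for illustration: that ratio closely matches the reference table on this page (13B ≈ 0.54, 34B ≈ 0.21, 70B = 0.10), but it is not confirmed as the calculator's exact implementation.

```python
def estimate_throughput(gpu_base_tps: float, model_params_b: float,
                        baseline_params_b: float = 7.0) -> float:
    """Estimate tokens/sec from a GPU's throughput on a 7B baseline model."""
    if model_params_b <= 0:
        raise ValueError("model_params_b must be positive")
    # Assumed scale factor: throughput is inversely proportional to parameter count.
    model_scale_factor = baseline_params_b / model_params_b
    return gpu_base_tps * model_scale_factor


# H100 baseline (8,000 tok/s on 7B) taken from the reference table.
for size_b in (7, 13, 34, 70):
    print(f"{size_b}B on H100: ~{estimate_throughput(8_000, size_b):.0f} tok/s")
```

The 13B and 34B results differ from the table by under 2% because the table rounds its scale factors to two decimals.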

2. Cost Per Million Tokens

Multiply the GPU hourly rate by the number of hours required to generate 1 million tokens. A cheaper-but-slower GPU can beat an expensive-but-fast one on cost-per-token.

cost/1M = ($/hr) × (1,000,000 / throughput / 3,600)
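
A minimal Python sketch of this formula, under stated assumptions: the function name is hypothetical, the H100 spot rate ($1.35/hr) and RTX 4090 range ($0.35–$0.45/hr) come from the FAQ on this page, and $0.40/hr is simply an illustrative point inside that range.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, throughput_tps: float) -> float:
    """Cost in USD to generate 1M tokens at a given hourly rate and throughput."""
    if throughput_tps <= 0:
        raise ValueError("throughput_tps must be positive")
    # Seconds for 1M tokens, converted to hours.
    hours_per_million = 1_000_000 / throughput_tps / 3_600
    return gpu_hourly_usd * hours_per_million


h100_spot = cost_per_million_tokens(1.35, 8_000)  # fast, expensive
rtx_4090 = cost_per_million_tokens(0.40, 2_500)   # slow, cheap (illustrative rate)
print(f"H100 spot: ${h100_spot:.4f}/1M, RTX 4090: ${rtx_4090:.4f}/1M")
```

With these inputs the slower card comes out slightly cheaper per token, which is the effect the paragraph above describes.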

3. Monthly Total Cost

Based on your monthly token volume, we calculate how many GPU-hours you'd need and multiply by the hourly rate. This is your compute cost only; storage and egress are extra.

monthly = (tokens/month / throughput / 3600) × $/hr
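
The monthly formula can be sketched as follows. The names `MonthlyEstimate` and `monthly_cost` are hypothetical, and the example inputs (100M tokens, H100 at 8,000 tok/s and $1.35/hr) are taken from figures elsewhere on this page.

```python
from typing import NamedTuple


class MonthlyEstimate(NamedTuple):
    gpu_hours: float
    cost_usd: float


def monthly_cost(tokens_per_month: float, throughput_tps: float,
                 gpu_hourly_usd: float) -> MonthlyEstimate:
    """GPU-hours and compute cost needed to generate a monthly token volume."""
    if throughput_tps <= 0:
        raise ValueError("throughput_tps must be positive")
    # Total generation time in seconds, converted to hours.
    gpu_hours = tokens_per_month / throughput_tps / 3_600
    return MonthlyEstimate(gpu_hours, gpu_hours * gpu_hourly_usd)


estimate = monthly_cost(100_000_000, 8_000, 1.35)
print(f"{estimate.gpu_hours:.2f} GPU-hours, ${estimate.cost_usd:.2f}/month")
```

Note that this formula counts only the hours the GPU spends generating at the estimated throughput. An instance left running while idle accrues charges for those hours too, so the result is best read as a lower bound.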

4. vs. API Comparison

Self-hosted cost per million is compared directly against OpenAI GPT-4o Mini ($0.75/1M) and Claude Haiku ($0.80/1M). The savings percentage shows how much cheaper self-hosting is at your volume.

savings = (api_rate − self_hosted_rate) / api_rate × 100%
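
A short Python sketch of the savings formula; the function name is an assumption, and the example rates reuse the GPT-4o Mini price and the H100 worked example quoted on this page.

```python
def savings_percent(api_rate: float, self_hosted_rate: float) -> float:
    """Percentage saved by self-hosting relative to an API rate (both in $/1M tokens)."""
    if api_rate <= 0:
        raise ValueError("api_rate must be positive")
    return (api_rate - self_hosted_rate) / api_rate * 100


print(f"{savings_percent(0.75, 0.069):.1f}% cheaper than the API")
```

A negative result means the self-hosted option costs more per million tokens than the API.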

GPU Token Throughput Reference (2026)

Approximate tokens/second by model size at FP16 precision with vLLM. Throughput scales inversely with parameter count.

GPU                  7B Model (tok/s)   13B Model (tok/s)   34B Model (tok/s)   70B Model (tok/s)   Relative Speed
NVIDIA H200          12,000             6,480               2,520               1,200               Fastest
NVIDIA H100          8,000              4,320               1,680               800
NVIDIA A100 (80GB)   4,500              2,430               945                 450
NVIDIA L40S          3,500              1,890               735                 350
NVIDIA RTX 4090      2,500              1,350               525                 250
NVIDIA A10G          2,200              1,188               462                 220
NVIDIA L4            2,000              1,080               420                 200
NVIDIA T4            1,200              648                 252                 120

* Estimates assume vLLM with FP16 precision, moderate batch sizes (8–32), and no quantization. INT4 quantization can increase throughput 2–4×. Model size vs GPU VRAM: a 7B model requires ~14GB (fits on a 24GB RTX 4090 or larger); a 70B model requires ~140GB (requires an H200 or a multi-GPU setup such as 2× H100 80GB).
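
The VRAM figures in the footnote follow from parameter count times bytes per parameter. The sketch below is a weights-only estimate under that assumption; the function name and lookup table are hypothetical, and it ignores KV cache and activation memory, which add to the real requirement.

```python
# Bytes per parameter for common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    try:
        bytes_per_param = BYTES_PER_PARAM[precision.lower()]
    except KeyError:
        raise ValueError(f"unknown precision: {precision!r}") from None
    # (params_billions * 1e9 params) * bytes / 1e9 bytes-per-GB reduces to this product.
    return params_billions * bytes_per_param


for size_b in (7, 13, 34, 70):
    print(f"{size_b}B @ FP16: ~{weights_vram_gb(size_b):.0f} GB")
```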

Frequently Asked Questions

How is cost per million tokens calculated for self-hosted LLMs?
Cost per million tokens = (GPU hourly rate) × (time to generate 1 million tokens). Time to generate 1M tokens = 1,000,000 ÷ (tokens per second throughput). For example: an H100 at $2/hr running a 7B model at 8,000 tokens/sec needs 1,000,000 ÷ 8,000 = 125 seconds = 0.0347 hours, costing just $0.069 per million tokens — compared to $0.75/1M on the OpenAI API.
Is self-hosting LLMs always cheaper than using the OpenAI API?
At volume, yes. At low volumes, no. The break-even point depends on your model quality requirements, GPU tier, and pricing model. For 7B–13B models on H100 spot (RunPod ~$1.35/hr), you break even vs the OpenAI API at roughly 5–20M tokens/month. Below that, the operational overhead of managing GPU instances may not be worth the savings. Reserved or on-demand pricing shifts break-even to 20–100M tokens/month.
What's the cheapest GPU for running a 7B model in 2026?
For raw cost-per-token on a 7B model, the RTX 4090 on spot marketplaces (Vast.ai, RunPod) often wins at $0.35–$0.45/hr, delivering ~2,500 tokens/sec for approximately $0.040–0.050/1M tokens. The A100 on Vultr at $0.125/hr on-demand is the best on-demand pick (very low cost, no spot risk). H100 spot at $1.35/hr is faster but the cost-per-token is similar given proportionally higher throughput.
Do I need multiple GPUs for 70B models?
On most GPUs, yes. A 70B model at FP16 requires ~140GB VRAM for the weights alone. You'll need either an H200 (141GB), 2× A100 80GB, 2× H100 80GB, 4× L40S (48GB each), or at least 6× RTX 4090 (24GB each; four cards total only 96GB). This calculator shows per-GPU cost, but for 70B+ models your actual cost is typically 2× or more the listed per-GPU rate. Use the model size selector: when you select 70B, the calculator adjusts throughput estimates to reflect this constraint.
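
As a rough check, the minimum GPU count is the weights footprint divided by per-GPU VRAM, rounded up. This sketch assumes the ~140GB figure above plus standard card capacities (H200 141GB, A100/H100 80GB, L40S 48GB, RTX 4090 24GB). The function name is hypothetical, and the result is a weights-only lower bound with no headroom for KV cache.

```python
import math


def gpus_needed(model_vram_gb: float, gpu_vram_gb: float) -> int:
    """Minimum number of GPUs whose combined VRAM holds the model weights."""
    if gpu_vram_gb <= 0:
        raise ValueError("gpu_vram_gb must be positive")
    return math.ceil(model_vram_gb / gpu_vram_gb)


GPU_VRAM_GB = {"H200": 141, "H100 80GB": 80, "A100 80GB": 80, "L40S": 48, "RTX 4090": 24}

for name, vram_gb in GPU_VRAM_GB.items():
    count = gpus_needed(140, vram_gb)
    print(f"{name}: at least {count} GPU(s) for 70B at FP16")
```

Multiply the listed per-GPU hourly rate by this count to approximate the real hourly cost. Practical deployments often round up further, for example to a power-of-two GPU count for tensor parallelism.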
What inference framework should I use for best token throughput?
vLLM is the current gold standard for throughput on NVIDIA GPUs, especially at high batch sizes. TGI (Hugging Face Text Generation Inference) offers good compatibility. SGLang is excellent for structured outputs. llama.cpp is best for CPU inference or low-VRAM GPU inference with quantization. The throughput estimates in this calculator assume vLLM — other frameworks may be 20–50% slower at similar batch sizes.