Is self-hosting LLMs cheaper than using the OpenAI API?

Yes, at scale. Self-hosted LLMs on commodity GPU clouds (RunPod, Vast.ai, Lambda) typically cost $0.05–$0.50 per million tokens vs $0.75–$0.80 per million tokens for OpenAI and Anthropic APIs. Break-even typically occurs around 5–50M tokens/month depending on GPU tier.

What token throughput can I expect from an H100 running Llama 70B?

An H100 running Llama 70B with vLLM produces approximately 800–1,200 tokens/second at batch size 1. Throughput scales with batch size but also with available HBM memory bandwidth. The H100 SXM5 with 80GB HBM3 and 3.35 TB/s bandwidth leads the pack for 70B inference.

What GPU is cheapest for running a 7B model?

For 7B models, the RTX 4090 on spot marketplaces (Vast.ai, RunPod) often delivers the lowest cost-per-token below $0.05/1M. The A100 on Vultr at $0.125/hr on-demand is a top pick for on-demand reliability. H100 spot on RunPod at $1.35/hr gives higher throughput but similar cost-per-token.

LLM Cost Per Token 2026 — API vs Self-Host

Model Size

Monthly Token Volume

In millions of tokens/month (enter 100 for 100M tokens)

Pricing Model

Region

#	Provider & GPU	Pricing	$/GPU/hr	Tokens/sec [EST]	Cost / 1M Tokens	Monthly Cost	vs GPT-4o Mini	Region

⚠ Throughput estimates only. Token throughput varies significantly by inference framework (vLLM, TGI, llama.cpp, SGLang), batch size, context length, quantization (FP16, INT8, INT4), and hardware utilization. Values shown assume FP16 precision with vLLM at moderate batch sizes. Actual throughput may be 30–300% higher or lower. Always benchmark on your specific workload before making procurement decisions.

GPU pricing sourced from GridStackHub live database — Apr 2026

Showing 0 providers

Get weekly LLM cost intelligence

GPU price moves, token cost analysis, and provider comparisons — every Monday in your inbox.

How This Calculator Works

Cost per token for self-hosted inference is a function of GPU hourly rate and token throughput. Here's the math:

1. Token Throughput Estimate

Each GPU has a known approximate throughput for a given model size, measured in tokens/second. Throughput scales inversely with parameter count — a 70B model runs ~10× slower than a 7B model on the same hardware.

throughput = gpu_base_tps × model_scale_factor

2. Cost Per Million Tokens

Divide the GPU hourly rate by the time required to generate 1 million tokens. A cheaper-but-slower GPU can beat an expensive-but-fast one on cost-per-token.

cost/1M = ($/hr) × (1,000,000 / throughput / 3,600)

3. Monthly Total Cost

Based on your monthly token volume, we calculate how many GPU-hours you'd need and multiply by the hourly rate. This is your all-in compute cost before storage/egress.

monthly = (tokens/month / throughput / 3600) × $/hr

4. vs. API Comparison

Self-hosted cost per million is compared directly against OpenAI GPT-4o Mini ($0.75/1M) and Claude Haiku ($0.80/1M). The savings percentage shows how much cheaper self-hosting is at your volume.

savings = (api_rate − self_hosted_rate) / api_rate × 100%

GPU Token Throughput Reference (2026)

Approximate tokens/second for a 7B model at FP16 precision with vLLM. Throughput scales inversely with parameter count.

GPU	7B Model (tok/s)	13B Model (tok/s)	34B Model (tok/s)	70B Model (tok/s)	Relative Speed
NVIDIA H200	12,000	6,480	2,520	1,200	Fastest
NVIDIA H100	8,000	4,320	1,680	800
NVIDIA A100 (80GB)	4,500	2,430	945	450
NVIDIA L40S	3,500	1,890	735	350
NVIDIA RTX 4090	2,500	1,350	525	250
NVIDIA A10G	2,200	1,188	462	220
NVIDIA L4	2,000	1,080	420	200
NVIDIA T4	1,200	648	252	120

* Estimates assume vLLM with FP16 precision, moderate batch sizes (8–32), and no quantization. INT4 quantization can increase throughput 2–4×. Model size vs GPU VRAM: 7B requires ~14GB (fits on RTX 4090+), 70B requires ~140GB VRAM (requires H100/H200/multi-GPU setup). See full model cost analysis →

Frequently Asked Questions

How is cost per million tokens calculated for self-hosted LLMs?

Cost per million tokens = (GPU hourly rate) × (time to generate 1 million tokens). Time to generate 1M tokens = 1,000,000 ÷ (tokens per second throughput). For example: an H100 at $2/hr running a 7B model at 8,000 tokens/sec needs 1,000,000 ÷ 8,000 = 125 seconds = 0.0347 hours, costing just $0.069 per million tokens — compared to $0.75/1M on the OpenAI API.

Is self-hosting LLMs always cheaper than using the OpenAI API?

At volume, yes. At low volumes, no. The break-even point depends on your model quality requirements, GPU tier, and pricing model. For 7B–13B models on H100 spot (RunPod ~$1.35/hr), you break even vs the OpenAI API at roughly 5–20M tokens/month. Below that, the operational overhead of managing GPU instances may not be worth the savings. Reserved or on-demand pricing shifts break-even to 20–100M tokens/month.

What's the cheapest GPU for running a 7B model in 2026?

For raw cost-per-token on a 7B model, the RTX 4090 on spot marketplaces (Vast.ai, RunPod) often wins at $0.35–$0.45/hr, delivering ~2,500 tokens/sec for approximately $0.040–0.050/1M tokens. The A100 on Vultr at $0.125/hr on-demand is the best on-demand pick (very low cost, no spot risk). H100 spot at $1.35/hr is faster but the cost-per-token is similar given proportionally higher throughput.

Do I need multiple GPUs for 70B models?

Yes. A 70B model at FP16 requires ~140GB VRAM. You'll need either an H200 (141GB), 2× A100 80GB, 2× H100 80GB, or 4× L40S/RTX 4090. This calculator shows per-GPU cost but for 70B+ models, your actual cost is 2× or more the listed per-GPU rate. Use the model size selector — when you select 70B, the calculator adjusts throughput estimates to reflect this constraint.

What inference framework should I use for best token throughput?

vLLM is the current gold standard for throughput on NVIDIA GPUs, especially at high batch sizes. TGI (Hugging Face Text Generation Inference) offers good compatibility. SGLang is excellent for structured outputs. llama.cpp is best for CPU inference or low-VRAM GPU inference with quantization. The throughput estimates in this calculator assume vLLM — other frameworks may be 20–50% slower at similar batch sizes.

LLM Inference Cost Calculator 2026:Cost Per Million Tokens by GPU & Provider