
GPU Cost Per Token Calculator
Self-Hosting vs API in 2026

Pick your GPU. Enter your tokens/second throughput. Get the exact $/1M token cost from live market pricing — compared against OpenAI and Anthropic API rates.

OpenAI GPT-4o: $5.00/1M tokens
Claude Haiku: $0.25/1M tokens
H100 self-hosted: computed live from current GPU rates

Every dollar you spend on the OpenAI or Anthropic API is a pure per-token cost: you pay for exactly the tokens you consume, with no fixed overhead. Self-hosting an LLM on rented GPU hardware flips the model: you pay a fixed hourly rate and generate as many tokens as your hardware allows. The math that determines whether self-hosting wins is straightforward:

cost per 1M tokens = (GPU $/hr) ÷ (tokens/sec × 3,600) × 1,000,000

An H100 SXM5 running Llama 3 70B with vLLM achieves approximately 800 tokens/second. At the cheapest on-demand rate ($1.79/hr from Shadeform), that's $0.62 per million tokens — 8× cheaper than OpenAI GPT-4o at $5.00/1M. The break-even isn't about raw cost-per-token, though: it's about utilization. The GPU runs whether you're generating tokens or not. At $1.79/hr, you're spending $43/day just to have the GPU on standby. You need enough daily request volume to justify that fixed cost before savings materialize.
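That formula is simple enough to check yourself. A minimal Python sketch using the H100 numbers above (the function name is illustrative):

```python
def cost_per_million_tokens(gpu_usd_per_hr: float, tokens_per_sec: float) -> float:
    """$/1M tokens = hourly rate divided by tokens generated per hour, scaled to 1M."""
    return gpu_usd_per_hr / (tokens_per_sec * 3_600) * 1_000_000

# H100 SXM5 at the cheapest on-demand rate quoted above
h100 = cost_per_million_tokens(1.79, 800)
print(f"${h100:.2f}/1M tokens")        # $0.62/1M tokens
print(f"{5.00 / h100:.1f}x cheaper")   # ~8.0x cheaper than GPT-4o at $5/1M
```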

That threshold differs dramatically by GPU tier. A B200 at $5.29/hr needs far higher throughput to justify the daily fixed cost, but delivers it at 2,500+ tok/sec — making per-token economics compelling at scale. An A100 40GB at $0.89/hr has a much lower minimum utilization bar but processes tokens more slowly. This calculator pulls live pricing from the GridStackHub database across 32 providers and shows you exactly where the math lands for your specific GPU and throughput. Enter your throughput, see your cost, compare it to the API — and find your break-even.
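As a rough illustration of how the tiers compare, here is the same formula applied to the reference rates and throughputs quoted on this page (static figures, not the live database):

```python
# Reference rates and throughputs from this page (not live data)
gpus = {
    "H100 SXM5": (1.79, 800),
    "B200":      (5.29, 2_500),
    "A100 40GB": (0.89, 180),
}

for name, (usd_per_hr, tok_s) in gpus.items():
    per_million = usd_per_hr / (tok_s * 3_600) * 1_000_000
    daily_fixed = usd_per_hr * 24
    print(f"{name}: ${per_million:.2f}/1M tokens, ${daily_fixed:.2f}/day on standby")
```

Note that the B200's per-token cost lands close to the H100's even at nearly 3× the hourly rate; what changes is the size of the daily standby bill you have to amortize.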

⚙️ Calculator Inputs
- GPU hourly rate ($/hr): pre-filled with the cheapest on-demand rate from the live database
- Throughput (tok/s): reference is H100 SXM5 + Llama 3 70B (vLLM, FP16) ≈ 800 tok/s
- Tokens per request: average input + output tokens per API call
- Pricing type: spot is cheapest but can be interrupted mid-job

📊 Results — H100 SXM5 at 800 tok/s
- $/Hour (cheapest provider)
- Self-Hosted $/1M Tokens
- vs GPT-4o API ($5/1M)
- Break-Even vs GPT-4o
Live results table (populated from current pricing): Source | Provider / Model | $/hr | Throughput | $/1M Tokens | vs GPT-4o | vs Claude Haiku
📌 Pricing sourced live from the GridStackHub database (32+ providers, updated daily). Throughput estimates assume vLLM at FP16 precision, batch size 1–4. Actual results vary with model size, quantization, and batch strategy. Always benchmark your specific workload before committing to long-term GPU reservations.

GPU Throughput Reference — Llama 3 70B (vLLM, FP16)

Reference tokens/second for running Llama 3 70B at FP16 with vLLM at batch size 1–4. Use the "Use this" buttons to auto-fill the calculator above.

GPU | VRAM | Memory BW | Llama 3 70B tok/s | Notes
H100 SXM5 | 80GB HBM3 | 3.35 TB/s | ~800 | Best single-GPU 70B inference; flagship choice
H100 PCIe | 80GB HBM2e | 2.0 TB/s | ~500 | ~35% slower than SXM5; lower hourly cost
B200 | 192GB HBM3e | 8.0 TB/s | ~2,500 | Blackwell flagship; 3× H100 throughput on 70B
A100 80GB | 80GB HBM2e | 2.0 TB/s | ~450 | Cost-effective for 70B; needs 2× for comfort
A100 40GB | 40GB HBM2e | 1.6 TB/s | ~180 (2× GPU needed) | Too small for 70B at FP16; use INT4 or 2× GPUs

* Throughput values for Llama 3 70B at FP16, vLLM, batch size 1–4. INT8 quantization increases throughput 1.5–2×. For 7B models, multiply by ~9–10×. Use the model-size calculator for 7B/13B/34B →
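Those multipliers fold directly into a throughput estimate. A quick sketch, where the ranges are the approximate factors from the footnote above, not measured benchmarks:

```python
# Approximate scaling factors from the footnote (not benchmarks)
h100_70b_fp16 = 800                   # tok/s, vLLM, batch 1-4
int8_range = (1.5, 2.0)               # INT8 (AWQ/GPTQ) speedup
small_model_range = (9, 10)           # 7B vs 70B speedup

print("70B INT8 tok/s:", [round(h100_70b_fp16 * m) for m in int8_range])        # [1200, 1600]
print("7B FP16 tok/s:", [round(h100_70b_fp16 * m) for m in small_model_range])  # [7200, 8000]
```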

Frequently Asked Questions

How do I calculate cost per token when renting a GPU?
The formula is: cost per 1M tokens = (GPU $/hr) ÷ (tokens/sec × 3,600) × 1,000,000. For example, an H100 SXM5 at $1.79/hr achieving 800 tok/s on Llama 3 70B: 1,000,000 ÷ (800 × 3,600) = 0.347 GPU-hours needed per 1M tokens. At $1.79/hr, that's $0.62 per million tokens. Compare that to OpenAI GPT-4o at $5.00/1M — self-hosting delivers 8× cheaper tokens, before factoring in GPU idle time.
Is renting an H100 cheaper than using the OpenAI API?
Yes, per token — but only if you have enough daily volume to cover the fixed GPU cost. At $1.79/hr on-demand, an H100 costs $42.96/day whether you generate 1 token or 100 billion. At OpenAI GPT-4o pricing ($5/1M tokens), you'd spend $42.96 on 8.6M tokens. So self-hosting breaks even when you need more than 8.6M tokens/day from your GPU. At 800 tok/s, the H100 can theoretically generate 69.1M tokens/day at 100% utilization — so at any utilization above ~12%, you're ahead financially.
What's the break-even point for self-hosting LLMs on H100?
For an H100 at $1.79/hr compared against GPT-4o ($5/1M): break-even = ($1.79/hr × 24h) ÷ $5.00 × 1,000,000 = 8.6M tokens/day, or roughly 8,600 requests/day at 1,000 tokens per request. That's 360 requests/hour, or 6 per minute — a modest throughput for a production API. Compared against Claude Haiku ($0.25/1M), the break-even is 20× higher: 172M tokens/day. Self-hosting rarely beats Haiku except at very high scale.
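The break-even math in these two answers, as a short Python sketch (GPU rate, throughput, and API prices are the figures used above; function names are illustrative):

```python
def break_even_tokens_per_day(gpu_usd_per_hr: float, api_usd_per_1m: float) -> float:
    """Daily token volume at which the GPU's fixed cost equals the API bill."""
    return gpu_usd_per_hr * 24 / api_usd_per_1m * 1_000_000

def min_utilization(gpu_usd_per_hr: float, api_usd_per_1m: float, tok_s: float) -> float:
    """Fraction of a day the GPU must spend generating tokens to beat the API."""
    daily_capacity = tok_s * 86_400   # tokens/day at 100% utilization
    return break_even_tokens_per_day(gpu_usd_per_hr, api_usd_per_1m) / daily_capacity

print(f"{break_even_tokens_per_day(1.79, 5.00) / 1e6:.1f}M tok/day vs GPT-4o")   # 8.6M
print(f"{break_even_tokens_per_day(1.79, 0.25) / 1e6:.0f}M tok/day vs Haiku")    # 172M
print(f"{min_utilization(1.79, 5.00, 800):.0%} minimum utilization")             # 12%
```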
What throughput should I expect from an H100 running Llama 3 70B?
An H100 SXM5 (80GB HBM3) running Llama 3 70B at FP16 with vLLM achieves approximately 700–900 tokens/second at batch size 1–4. At batch size 16–32, throughput can reach 1,200–1,800 tok/s but latency per request increases proportionally. H100 PCIe runs 30–40% slower due to lower HBM bandwidth (2.0 TB/s vs 3.35 TB/s on SXM5). INT8 quantization (AWQ or GPTQ) typically boosts throughput 1.5–2× with minimal quality loss on instruction-following tasks.
Which GPU gives the best cost per token for LLM inference in 2026?
For 70B models, the H100 SXM5 on spot (RunPod ~$1.35/hr) delivers the best cost-per-token: approximately $0.47/1M at 800 tok/s. The B200 is competitive at scale — higher hourly cost but 3× throughput yields similar $/token economics at $5.29/hr and 2,500 tok/s ($0.59/1M). For smaller models (7B–13B), an A100 40GB or RTX 4090 on spot marketplaces (Vast.ai, RunPod) often wins — use the model-size calculator for that analysis.

Need the full picture?

Compare all 32 providers, all GPU models, all pricing types — with hidden cost breakdowns and 30-day price forecasts.