Cheapest verified H200 SXM cloud price (Lambda, 141GB HBM3e, on-demand) — 76% more VRAM than H100 at a lower cost per GB of VRAM. Same CUDA code runs unchanged. More memory per dollar for large-model inference.
NVIDIA H200 Cloud Pricing — Live Table (April 2026)
GridStackHub tracks NVIDIA H200 pricing across 10 cloud providers. The H200 SXM is available on-demand from specialist providers starting at $2.99/hr, while hyperscalers (AWS, Google Cloud, Azure) primarily offer H200 via reserved capacity. Here is every provider we track:
| Provider | Instance / Config | GPU VRAM | Pricing Type | Price | Status |
|---|---|---|---|---|---|
| Lambda | 1x H200 SXM | 141 GB HBM3e | On-demand | $2.99/hr | VERIFIED |
| Crusoe Energy | H200 SXM | 141 GB HBM3e | On-demand | $3.15/hr | VERIFIED |
| CoreWeave | H200 SXM | 141 GB HBM3e | On-demand | $3.49/hr | VERIFIED |
| RunPod | NVIDIA H200 SXM | 141 GB HBM3e | On-demand | $3.59/hr | VERIFIED |
| Nebius | H200 SXM | 141 GB HBM3e | On-demand | $3.60/hr | VERIFIED |
| RunPod | H200 NVL | 143 GB HBM3e | Community / Spot | $0.50/hr* | LIMITED |
| *8-GPU nodes below — price shown is total/hr (per-GPU in parentheses)* | | | | | |
| Google Cloud | a3-megagpu-8g (8x H200) | 8× 141 GB | On-demand | $40.32/hr ($5.04/GPU) | VERIFIED |
| AWS | p5e.48xlarge (8x H200) | 8× 141 GB | On-demand | $42.08/hr ($5.26/GPU) | VERIFIED |
| Azure | ND H200 v5 (8x H200) | 8× 141 GB | On-demand | $44.52/hr ($5.57/GPU) | VERIFIED |
| CoreWeave | H200_141GB_SXM5_x8 | 8× 141 GB | On-demand | $48.00/hr ($6.00/GPU) | VERIFIED |
*RunPod H200 NVL community cloud pricing reflects spot/shared-node availability and may not be consistently available. SXM5 instances are the standard for production workloads. Data sourced from GridStackHub's live pricing database, April 21, 2026. VERIFIED = confirmed via live provider API. Prices subject to change — check provider directly before committing.
H200 availability is constrained through mid-2026. Demand from AI labs and hyperscalers keeps H200 SXM supply tight. Lambda and Crusoe offer the most consistent on-demand access. For reserved capacity or multi-node clusters, contact providers directly — most offer negotiated rates for 90-day+ commitments below on-demand pricing.
NVIDIA H200 vs H100: Full Specification Comparison
The H200 is not a new architecture — it shares the Hopper GPU with the H100 but adds a dramatically upgraded memory system. Here is what changed and what stayed the same:
| Spec | NVIDIA H200 SXM | NVIDIA H100 SXM5 | Delta |
|---|---|---|---|
| GPU Memory | 141 GB HBM3e | 80 GB HBM3 | H200 +76% |
| Memory Bandwidth | 4.8 TB/s | 3.35 TB/s | H200 +43% |
| Memory Type | HBM3e | HBM3 | H200 (newer gen) |
| FP8 Throughput | 3,958 TFLOPS | 3,958 TFLOPS | Same |
| BF16 Throughput | 1,979 TFLOPS | 1,979 TFLOPS | Same |
| GPU Architecture | Hopper | Hopper | Same |
| CUDA Compatibility | Full | Full | Same — no code changes |
| Min Cloud Price (1 GPU) | $2.99/hr (Lambda) | ~$1.74/hr | H100 cheaper/hr |
| Cost per GB VRAM | $0.0212/GB | $0.0218/GB | H200 −3% |
| Token/sec (memory-bound inference) | ~43% faster | Baseline | H200 (bandwidth advantage) |
| 70B model on 1 GPU (BF16) | Yes (~140GB) | No (requires 2× H100) | H200 |
| TDP (Power) | 700W | 700W | Same |
| Cloud Availability | Growing (10+ providers) | Broad (15+ providers) | H100 more available |
The story in two sentences: H200 and H100 have identical compute throughput — same TFLOPS, same Hopper architecture, same CUDA code runs unchanged. The difference is entirely in memory: 141GB HBM3e at 4.8 TB/s versus 80GB HBM3 at 3.35 TB/s. That memory delta determines when to choose each GPU.
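The cost-per-GB figures in the comparison table follow from simple division. A quick sketch using the minimum on-demand prices cited in this article (Lambda H200 at $2.99/hr, H100 at ~$1.74/hr; both subject to change):

```python
# Cost per GB of VRAM per hour, from the prices cited in the table above.
def cost_per_gb(price_per_hr: float, vram_gb: float) -> float:
    return price_per_hr / vram_gb

h200 = cost_per_gb(2.99, 141)   # ~0.0212 $/GB/hr
h100 = cost_per_gb(1.74, 80)    # ~0.0218 $/GB/hr
print(f"H200: ${h200:.4f}/GB/hr, H100: ${h100:.4f}/GB/hr")
print(f"H200 is ~{100 * (1 - h200 / h100):.0f}% cheaper per GB of VRAM")
```

The takeaway: despite the higher hourly rate, the H200's larger memory makes it slightly cheaper per GB, which is why memory-bound workloads can favor it on economics alone.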
Why the H200's 43% bandwidth advantage matters for inference: During autoregressive LLM decoding (generating each token), the GPU constantly reads the KV cache and model weights from VRAM. At the small to moderate batch sizes typical of latency-sensitive serving, this phase is memory-bandwidth-limited, not compute-limited. More bandwidth means more tokens per second, and the H200's bandwidth advantage translates roughly 1:1 to inference throughput for memory-bound workloads.
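The bandwidth-to-throughput relationship above can be sketched as a simple roofline estimate: at batch size 1, each generated token must stream the full set of model weights from VRAM, so peak tokens/sec is roughly bandwidth divided by bytes read per token. This is an illustrative upper bound, not a benchmark; real throughput depends on kernels, batching, and KV cache traffic.

```python
# Roofline upper bound for memory-bound batch-1 decoding:
# tokens/sec ~= memory bandwidth / bytes streamed per token (weights at BF16).
def decode_tokens_per_sec(bandwidth_tb_s: float, model_params_b: float,
                          bytes_per_param: float = 2.0) -> float:
    bytes_per_token = model_params_b * 1e9 * bytes_per_param  # weights read once per token
    return bandwidth_tb_s * 1e12 / bytes_per_token

h200 = decode_tokens_per_sec(4.8, 70)    # ~34 tok/s ceiling for a 70B BF16 model
h100 = decode_tokens_per_sec(3.35, 70)   # ~24 tok/s ceiling
print(f"H200/H100 speedup: {h200 / h100:.2f}x")  # 4.8 / 3.35 = ~1.43x
```

The speedup ratio reduces to the bandwidth ratio itself, which is where the "roughly 43% faster" figure comes from.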
When H200 Beats H100: Use Case Guide
Both GPUs run the same CUDA code. The decision is about memory: does your workload benefit from more VRAM, faster bandwidth, or both?
Choose H200 when:
- You're serving Llama 3.1 70B or similar models in BF16 precision. Llama 3.1 70B requires approximately 140GB at BF16, so a single H200 (141GB) fits the weights, though KV cache headroom is tight and needs tuning. On H100, you'd need two GPUs with tensor parallelism, doubling cost and adding 15–25% latency from cross-GPU communication.
- Inference throughput is your primary constraint. For autoregressive decoding workloads that stay memory-bound, the H200's 4.8 TB/s bandwidth versus the H100's 3.35 TB/s delivers roughly 43% faster token generation, directly improving latency and tokens-per-dollar.
- You're running long-context inference (32K+ token windows). KV cache memory scales linearly with context length. At 128K context windows, KV cache alone can exceed 40GB for a 70B model, consuming nearly all of the H100's remaining headroom and making the H200 the practical choice for single-GPU operation.
- You're building RAG pipelines with large retrieval contexts. Each retrieved document added to context increases KV cache memory. H200's 141GB gives meaningful headroom over H100's 80GB for retrieval-augmented workloads at scale.
- You want to deploy future models on existing hardware. Next-generation open models (Llama 4, future Mixtral variants) will continue growing in parameter count. H200's 141GB provides a larger runway before you need multi-GPU inference infrastructure.
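The KV cache scaling behind the long-context point above can be estimated directly. A sketch using assumed Llama-3.1-70B-style dimensions (80 layers, 8 KV heads via grouped-query attention, head dimension 128 — check your model's config for the real values):

```python
# KV cache size grows linearly with context length and batch size.
# Defaults are assumed Llama-3.1-70B-style dimensions; adjust per model.
def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2, batch: int = 1) -> float:
    # 2x for keys and values, stored per layer, per KV head, per head_dim element
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
    return batch * context_len * per_token / 1e9

print(f"{kv_cache_gb(128_000):.1f} GB at 128K context")  # ~41.9 GB
print(f"{kv_cache_gb(32_000):.1f} GB at 32K context")    # ~10.5 GB
```

At 128K context this lands just above 40GB for a single request, consistent with the figure cited above; batching multiplies it further.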
Choose H100 when:
- You're training or fine-tuning at scale. H100 and H200 have identical compute throughput for training. H100 is cheaper per hour and has broader multi-node cluster availability. For long training runs, H100 is often the more cost-effective choice.
- Your model fits in 80GB and you're compute-bound. Models under 40B parameters at BF16 fit comfortably in H100's 80GB. If your bottleneck is FLOPS rather than memory bandwidth, H100's lower cost per hour wins on economics.
- You need the widest provider choice and maximum availability. H100 is available from 15+ providers with better reserved-instance options and more competitive spot pricing. If availability SLAs matter, H100 has a deeper supply pool.
- Cost per hour is your primary constraint and model fits H100. Lambda H200 at $2.99/hr vs H100 at ~$1.74/hr is a 72% premium. If you don't need the extra memory or bandwidth, that premium doesn't pay back.
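The single-GPU-vs-tensor-parallel economics from the first "Choose H200" bullet can be put in monthly terms. A sketch using the on-demand prices cited above (Lambda H200 $2.99/hr, H100 ~$1.74/hr; both subject to change) and a 730-hour month:

```python
# Monthly cost sketch: serving a ~140GB BF16 model (e.g. Llama 3.1 70B)
# on one H200 versus two tensor-parallel H100s.
HOURS_PER_MONTH = 730

def monthly_cost(price_per_gpu_hr: float, n_gpus: int) -> float:
    return price_per_gpu_hr * n_gpus * HOURS_PER_MONTH

h200 = monthly_cost(2.99, 1)   # single H200 holds the weights
h100 = monthly_cost(1.74, 2)   # two H100s needed for the same model
print(f"1x H200: ${h200:,.0f}/mo vs 2x H100: ${h100:,.0f}/mo")
```

Even at a 72% per-GPU premium, the single H200 comes out cheaper than the two-H100 configuration once the model no longer fits in 80GB, before counting the tensor-parallel latency overhead.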
Run the H200 vs H100 numbers for your workload
Enter your model size, hours per month, and precision — get exact monthly cost for H200, H100, and 50+ other GPU configurations side by side.
Open Calculator →

What Models Fit on a Single NVIDIA H200 (141GB)?
141GB HBM3e is a meaningful jump from H100's 80GB. Here's a practical reference for which models fit at various precision levels, including KV cache headroom:
| Model | Parameters | BF16 Weights | 4-bit Weights | Fits on H200? |
|---|---|---|---|---|
| Llama 3.2 3B | 3B | ~6 GB | ~1.5 GB | Yes — massive headroom, run multiple |
| Llama 3.1 8B | 8B | ~16 GB | ~4 GB | Yes — run 4–6 replicas simultaneously |
| Llama 3.1 70B | 70B | ~140 GB | ~35 GB | Yes (BF16, tight) — requires KV cache tuning |
| Mistral 7B / Mixtral 8×7B | 7B / 47B | ~14GB / ~94GB | ~3.5GB / ~24GB | Yes — both fit comfortably at BF16 |
| Mixtral 8×22B (MoE) | 141B total (~39B active) | ~282 GB (needs 2×) | ~71 GB | 4-bit: Yes — BF16 requires 2 H200s |
| Llama 3.1 405B | 405B | ~810 GB (needs 6×) | ~202 GB | 4-bit: needs 2× H200 — BF16 needs 6+ |
| Falcon 180B | 180B | ~360 GB (needs 3×) | ~90 GB | 4-bit: Yes — BF16 requires multi-GPU |
*Weight estimates assume 2 bytes/parameter for BF16 and ~0.5 bytes/parameter for 4-bit. Production inference requires additional memory for KV cache — estimate 20–40% overhead for typical serving configurations. Larger batches or longer context windows require proportionally more KV cache memory.
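The fit test behind the table can be expressed as a one-liner. A sketch using the table's own assumptions (2 bytes/parameter at BF16, ~0.5 bytes/parameter at 4-bit, plus a serving overhead margin for KV cache):

```python
# Rough single-H200 fit check using the table's assumptions:
# weights = params * bytes/param, plus a 20-40% KV-cache/serving margin.
H200_VRAM_GB = 141

def fits_on_h200(params_b: float, bytes_per_param: float,
                 overhead: float = 0.2) -> bool:
    weights_gb = params_b * bytes_per_param  # 1B params at 1 byte = 1 GB
    return weights_gb * (1 + overhead) <= H200_VRAM_GB

print(fits_on_h200(70, 2.0))   # 70B BF16 + 20% margin: ~168 GB -> False
print(fits_on_h200(70, 0.5))   # 70B 4-bit: ~42 GB -> True
print(fits_on_h200(8, 2.0))    # 8B BF16: ~19 GB -> True
```

Note that 70B at BF16 fails once the overhead margin is included, which is exactly why the table flags it as "tight" and requiring KV cache tuning.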
H200 Availability in 2026: What to Expect
H200 supply is constrained by NVIDIA production capacity and aggressive allocation by AI hyperscalers and foundation model labs. Here is the realistic availability picture by provider category:
Independent Clouds
Lambda, Crusoe, CoreWeave, RunPod — best on-demand availability. No waitlists for 1–4 GPU configs. Spot pricing available on some platforms.
Hyperscalers (On-Demand)
AWS p5e, Google a3-mega, Azure ND H200 v5 — sporadic on-demand availability. Reserved instances more accessible. Expect waitlists for immediate access.
Multi-Node Clusters (8×+)
Available via CoreWeave, Crusoe, and hyperscalers. 90-day+ commitments strongly preferred. Contact sales for dedicated allocations below on-demand rack rates.
H200 NVL Variant
143GB NVL configuration available in limited community clouds (RunPod spot). Less common than SXM5. Not recommended for production-critical workloads without SLA.
Availability outlook: H200 supply is expected to improve in H2 2026 as NVIDIA ramps production and the Blackwell (B200) ramp creates secondary-market H200 capacity from labs upgrading. For mission-critical workloads, secure reserved H200 capacity now — spot pricing is attractive but availability windows are unpredictable.
Ask GridStackHub About H200 Pricing
Get answers from live pricing data — compare providers, estimate monthly costs, or evaluate H200 vs alternatives for your specific workload.
Frequently Asked Questions
Track H200 Prices and Get Alerts
H200 pricing is moving in 2026 as production scales and providers compete for AI inference workloads. GridStackHub tracks every change daily — here is how to stay ahead:
Get H200 price alerts
We'll notify you when H200 prices drop, new providers list capacity, or a better deal appears. Free — no credit card required.
Compare H200 vs every alternative for your workload
Set your model size, batch size, hours per month — see exact monthly cost for H200, H100, B200, AMD MI300X, and 50+ more configurations.
Open GPU Cost Calculator →