Live data — H200 pricing updated daily from provider APIs
$2.99/hr

Cheapest verified H200 SXM cloud price (Lambda, 141GB HBM3e, on-demand) — 76% more VRAM than H100 at a per-GPU price competitive with H100 SXM5. Same CUDA code runs unchanged. More memory per dollar for large model inference.

NVIDIA H200 Cloud Pricing — Live Table (April 2026)

GridStackHub tracks NVIDIA H200 pricing across 10 cloud providers. The H200 SXM is available on-demand from specialist providers starting at $2.99/hr, while hyperscalers (AWS, Google Cloud, Azure) primarily offer H200 via reserved capacity. Here is every provider we track:

| Provider | Instance / Config | GPU VRAM | Pricing Type | Price | Status |
|---|---|---|---|---|---|
| Lambda | 1x H200 SXM | 141 GB HBM3e | On-demand | $2.99/hr | VERIFIED |
| Crusoe Energy | H200 SXM | 141 GB HBM3e | On-demand | $3.15/hr | VERIFIED |
| CoreWeave | H200 SXM | 141 GB HBM3e | On-demand | $3.49/hr | VERIFIED |
| RunPod | NVIDIA H200 SXM | 141 GB HBM3e | On-demand | $3.59/hr | VERIFIED |
| Nebius | H200 SXM | 141 GB HBM3e | On-demand | $3.60/hr | VERIFIED |
| RunPod | H200 NVL | 143 GB HBM3e | Community / Spot | $0.50/hr* | LIMITED |

8-GPU nodes below — prices shown are total/hr (per-GPU in parentheses):

| Provider | Instance / Config | GPU VRAM | Pricing Type | Price | Status |
|---|---|---|---|---|---|
| Google Cloud | a3-megagpu-8g (8x H200) | 8× 141 GB | On-demand | $40.32/hr ($5.04/GPU) | VERIFIED |
| AWS | p5e.48xlarge (8x H200) | 8× 141 GB | On-demand | $42.08/hr ($5.26/GPU) | VERIFIED |
| Azure | ND H200 v5 (8x H200) | 8× 141 GB | On-demand | $44.52/hr ($5.57/GPU) | VERIFIED |
| CoreWeave | H200_141GB_SXM5_x8 | 8× 141 GB | On-demand | $48.00/hr ($6.00/GPU) | VERIFIED |

*RunPod H200 NVL community cloud pricing reflects spot/shared-node availability and may not be consistently available. SXM5 instances are the standard for production workloads. Data sourced from GridStackHub's live pricing database, April 21, 2026. VERIFIED = confirmed via live provider API. Prices subject to change — check provider directly before committing.
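The per-GPU figures in parentheses for the 8-GPU nodes follow directly from the node totals. A minimal sketch of that arithmetic, using the table's prices:

```python
# Derive per-GPU rates for the 8-GPU nodes in the table above.
# Node prices are the on-demand totals listed on this page.
nodes = {
    "Google Cloud a3-megagpu-8g": 40.32,
    "AWS p5e.48xlarge": 42.08,
    "Azure ND H200 v5": 44.52,
    "CoreWeave H200_141GB_SXM5_x8": 48.00,
}

for name, total_per_hr in nodes.items():
    per_gpu = total_per_hr / 8  # all four configs are 8x H200
    print(f"{name}: ${per_gpu:.2f}/GPU/hr")
```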

H200 availability is constrained through mid-2026. Demand from AI labs and hyperscalers keeps H200 SXM supply tight. Lambda and Crusoe offer the most consistent on-demand access. For reserved capacity or multi-node clusters, contact providers directly — most offer negotiated rates for 90-day+ commitments below on-demand pricing.

NVIDIA H200 vs H100: Full Specification Comparison

The H200 is not a new architecture — it shares the Hopper GPU with the H100 but adds a dramatically upgraded memory system. Here is what changed and what stayed the same:

| Spec | NVIDIA H200 SXM | NVIDIA H100 SXM5 | Delta |
|---|---|---|---|
| GPU Memory | 141 GB HBM3e | 80 GB HBM3 | H200 +76% |
| Memory Bandwidth | 4.8 TB/s | 3.35 TB/s | H200 +43% |
| Memory Type | HBM3e | HBM3 | H200 (newer gen) |
| FP8 Throughput (with sparsity) | 3,958 TFLOPS | 3,958 TFLOPS | Same |
| BF16 Throughput (with sparsity) | 1,979 TFLOPS | 1,979 TFLOPS | Same |
| GPU Architecture | Hopper | Hopper | Same |
| CUDA Compatibility | Full | Full | Same — no code changes |
| Min Cloud Price (1 GPU) | $2.99/hr (Lambda) | ~$1.74/hr | H100 cheaper/hr |
| Cost per GB VRAM | $0.0212/GB | $0.0218/GB | H200 −3% |
| Tokens/sec (memory-bound inference) | ~43% faster | Baseline | H200 (bandwidth advantage) |
| 70B model on 1 GPU (BF16) | Yes (~140 GB) | No (requires 2× H100) | H200 |
| TDP (Power) | 700W | 700W | Same |
| Cloud Availability | Growing (10+ providers) | Broad (15+ providers) | H100 more available |

The story in brief: H200 and H100 have identical compute throughput — same TFLOPS, same Hopper architecture, same CUDA code runs unchanged. The difference is entirely in memory: 141GB HBM3e at 4.8 TB/s versus 80GB HBM3 at 3.35 TB/s. That memory delta determines when to choose each GPU.

Why the H200's 43% bandwidth advantage matters for inference: During autoregressive LLM decoding (generating each token), the GPU repeatedly reads the model weights and KV cache from VRAM. This phase is memory-bandwidth-limited, not compute-limited, so more bandwidth means more tokens per second until batch sizes grow large enough for decoding to become compute-bound. For memory-bound workloads, the H200's bandwidth advantage translates roughly 1:1 to inference throughput.
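As a rough sanity check on that claim, a back-of-envelope model treats each decoded token as one full read of the weights from HBM. The function and numbers below are illustrative assumptions (batch size 1, no KV traffic or overlap), not measured benchmarks:

```python
# Rough upper bound for memory-bound decode throughput: each token
# requires streaming the full model weights from HBM once.
def decode_tokens_per_sec(bandwidth_tb_s, weight_gb, kv_cache_gb=0.0):
    bytes_per_token = (weight_gb + kv_cache_gb) * 1e9
    return bandwidth_tb_s * 1e12 / bytes_per_token

llama70b_bf16 = 140  # GB of weights at ~2 bytes/param (assumed)
h200 = decode_tokens_per_sec(4.8, llama70b_bf16)   # H200: 4.8 TB/s
h100 = decode_tokens_per_sec(3.35, llama70b_bf16)  # H100: 3.35 TB/s
print(f"H200 ~{h200:.0f} tok/s, H100 ~{h100:.0f} tok/s, "
      f"ratio {h200 / h100:.2f}x")
```

The ratio falls out of the bandwidth numbers alone (4.8 / 3.35 ≈ 1.43), which is where the ~43% figure comes from.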

When H200 Beats H100: Use Case Guide

Both GPUs run the same CUDA code. The decision is about memory: does your workload benefit from more VRAM, faster bandwidth, or both?

Choose H200 when:

  • You're serving Llama 3.1 70B or similar models in BF16 precision. Llama 3.1 70B requires approximately 140GB at BF16. A single H200 (141GB) fits it with room for a KV cache. On H100, you'd need two GPUs with tensor parallelism, doubling cost and adding 15–25% latency from cross-GPU communication.
  • Inference throughput is your primary constraint. For autoregressive decoding workloads, the H200's 4.8 TB/s bandwidth versus H100's 3.35 TB/s delivers roughly 43% faster token generation — directly improving latency and tokens-per-dollar for as long as decoding stays memory-bound.
  • You're running long-context inference (32K+ token windows). KV cache memory scales linearly with context length. At 128K context windows, KV cache alone can exceed 40GB for a 70B model — eliminating the H100's working memory almost entirely and making the H200 the only option for single-GPU operation.
  • You're building RAG pipelines with large retrieval contexts. Each retrieved document added to context increases KV cache memory. H200's 141GB gives meaningful headroom over H100's 80GB for retrieval-augmented workloads at scale.
  • You want to deploy future models on existing hardware. Next-generation open models (Llama 4, future Mixtral variants) will continue growing in parameter count. H200's 141GB provides a larger runway before you need multi-GPU inference infrastructure.
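The KV-cache scaling behind the long-context bullet can be sketched numerically. The configuration below assumes a Llama 3.1 70B-style layout (80 layers, 8 KV heads via GQA, head dimension 128, BF16 cache); actual serving stacks vary:

```python
# KV cache grows linearly with context length:
# per token = 2 (K and V) x layers x kv_heads x head_dim x dtype_bytes.
def kv_cache_gb(context_tokens, layers=80, kv_heads=8,
                head_dim=128, dtype_bytes=2):
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return context_tokens * per_token_bytes / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx):.1f} GB KV cache")
```

At a 128K context this comes out to roughly 43 GB, which matches the "can exceed 40GB" figure above.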

Choose H100 when:

  • You're training or fine-tuning at scale. H100 and H200 have identical compute throughput for training. H100 is cheaper per hour and has broader multi-node cluster availability. For long training runs, H100 is often the more cost-effective choice.
  • Your model fits in 80GB and you're compute-bound. Models under 40B parameters at BF16 fit comfortably in H100's 80GB. If your bottleneck is FLOPS rather than memory bandwidth, H100's lower cost per hour wins on economics.
  • You need the widest provider choice and maximum availability. H100 is available from 15+ providers with better reserved-instance options and more competitive spot pricing. If availability SLAs matter, H100 has a deeper supply pool.
  • Cost per hour is your primary constraint and model fits H100. Lambda H200 at $2.99/hr vs H100 at ~$1.74/hr is a 72% premium. If you don't need the extra memory or bandwidth, that premium doesn't pay back.
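The price-premium trade-off in the last bullet can be put in tokens-per-dollar terms. This sketch uses the table's on-demand prices and assumes the full ~43% memory-bound speedup; real workloads will land somewhere below that:

```python
# Tokens-per-dollar sketch: does the H200's ~43% decode speedup
# pay back its hourly price premium when the model fits both GPUs?
h100_price, h200_price = 1.74, 2.99   # $/GPU/hr, from this page
h100_tps, h200_tps = 1.0, 1.43        # normalized throughput (assumed)

h100_tok_per_dollar = h100_tps / h100_price
h200_tok_per_dollar = h200_tps / h200_price
print(f"H100: {h100_tok_per_dollar:.3f}, H200: {h200_tok_per_dollar:.3f} "
      "(normalized tokens per dollar)")
```

For models that fit in 80GB, the H100 wins on tokens-per-dollar even with the bandwidth handicap; the H200's economic case rests on fitting 70B-class models on a single GPU.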

Run the H200 vs H100 numbers for your workload

Enter your model size, hours per month, and precision — get exact monthly cost for H200, H100, and 50+ other GPU configurations side by side.

Open Calculator →
Track GPU Prices — Free → | AMD MI300X Pricing → | Full GPU Comparison →

What Models Fit on a Single NVIDIA H200 (141GB)?

141GB HBM3e is a meaningful jump from H100's 80GB. Here's a practical reference for which models fit at various precision levels, including KV cache headroom:

| Model | Parameters | BF16 Weights | 4-bit Weights | Fits on H200? |
|---|---|---|---|---|
| Llama 3.2 3B | 3B | ~6 GB | ~1.5 GB | Yes — massive headroom, run multiple |
| Llama 3.1 8B | 8B | ~16 GB | ~4 GB | Yes — run 4–6 replicas simultaneously |
| Llama 3.1 70B | 70B | ~140 GB | ~35 GB | Yes (BF16, tight) — requires KV cache tuning |
| Mistral 7B / Mixtral 8×7B | 7B / 47B | ~14 GB / ~94 GB | ~3.5 GB / ~24 GB | Yes — both fit comfortably at BF16 |
| Mixtral 8×22B (MoE) | 141B total (~39B active) | ~282 GB (needs 2×) | ~71 GB | 4-bit: Yes — BF16 requires 2× H200 |
| Llama 3.1 405B | 405B | ~810 GB (needs 6×) | ~202 GB | 4-bit: needs 2× H200 — BF16 needs 6+ |
| Falcon 180B | 180B | ~360 GB (needs 3×) | ~90 GB | 4-bit: Yes — BF16 requires multi-GPU |

*Weight estimates assume 2 bytes/parameter for BF16 and ~0.5 bytes/parameter for 4-bit. Production inference requires additional memory for KV cache — estimate 20–40% overhead for typical serving configurations. Larger batches or longer context windows require proportionally more KV cache memory.
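The fit column above can be reproduced from the footnote's assumptions. A minimal sketch (the 2 and 0.5 bytes-per-parameter figures and 141 GB of VRAM are from this page; the headroom logic is illustrative):

```python
import math

# Weights at 2 bytes/param (BF16) or 0.5 bytes/param (4-bit);
# whatever VRAM is left over becomes KV-cache headroom.
# Figures are estimates, not measured memory footprints.
def fit_report(name, params_b, bytes_per_param, vram_gb=141):
    weights_gb = params_b * bytes_per_param
    headroom_gb = vram_gb - weights_gb
    if headroom_gb > 0:
        print(f"{name}: ~{weights_gb:.0f} GB weights, "
              f"fits with ~{headroom_gb:.0f} GB KV headroom")
    else:
        gpus = math.ceil(weights_gb / vram_gb)
        print(f"{name}: ~{weights_gb:.0f} GB weights, needs {gpus}x H200")

fit_report("Llama 3.1 70B (BF16)", 70, 2.0)
fit_report("Llama 3.1 405B (4-bit)", 405, 0.5)
fit_report("Falcon 180B (BF16)", 180, 2.0)
```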

H200 Availability in 2026: What to Expect

H200 supply is constrained by NVIDIA production capacity and aggressive allocation by AI hyperscalers and foundation model labs. Here is the realistic availability picture by provider category:

Independent Clouds

Lambda, Crusoe, CoreWeave, RunPod — best on-demand availability. No waitlists for 1–4 GPU configs. Spot pricing available on some platforms.

Hyperscalers (On-Demand)

AWS p5e, Google a3-mega, Azure ND H200 v5 — sporadic on-demand availability. Reserved instances more accessible. Expect waitlists for immediate access.

Multi-Node Clusters (8×+)

Available via CoreWeave, Crusoe, and hyperscalers. 90-day+ commitments strongly preferred. Contact sales for dedicated allocations below on-demand rack rates.

H200 NVL Variant

143GB NVL configuration available in limited community clouds (RunPod spot). Less common than SXM5. Not recommended for production-critical workloads without SLA.

Availability outlook: H200 supply is expected to improve in H2 2026 as NVIDIA ramps production and the Blackwell (B200) ramp creates secondary-market H200 capacity from labs upgrading. For mission-critical workloads, secure reserved H200 capacity now — spot pricing is attractive but availability windows are unpredictable.

Ask GridStackHub About H200 Pricing

Get answers from live pricing data — compare providers, estimate monthly costs, or evaluate H200 vs alternatives for your specific workload.

Frequently Asked Questions

Is H200 better than H100 for inference in 2026?
Yes — for memory-bound inference workloads, the H200 is meaningfully faster than H100. The H200 SXM carries 141GB HBM3e at 4.8 TB/s memory bandwidth, versus the H100 SXM5's 80GB at 3.35 TB/s. That 43% bandwidth increase directly translates to faster autoregressive token generation for models that saturate memory bandwidth — which describes most LLM inference. For models above 70B parameters, the H200 also fits the full model on a single GPU where H100 requires multi-GPU tensor parallelism, adding latency and cost. For compute-bound training workloads, the H200 advantage is smaller — both use identical Hopper GPU cores with the same TFLOPS.
What is the cheapest H200 cloud provider in 2026?
Lambda offers the cheapest verified H200 SXM on-demand pricing at $2.99/hr (141GB HBM3e) as of April 2026. Crusoe Energy is second at $3.15/hr (US, Texas) with consistent on-demand availability. CoreWeave ($3.49/hr) and RunPod ($3.59/hr) are close behind with reliable SXM5 configurations. Nebius offers H200 in Finland (EU) at $3.60/hr for European-region deployments. For 8-GPU nodes, Google Cloud starts at $40.32/hr ($5.04/GPU), the most competitive hyperscaler pricing. GridStackHub tracks all H200 pricing daily — set a price alert to be notified when prices drop or new providers list availability.
Is H200 available on-demand or do I need a reservation?
H200 availability is constrained but improving. Independent cloud providers — Lambda, Crusoe Energy, CoreWeave, RunPod, and Nebius — all offer H200 on-demand capacity as of April 2026, though availability windows can be limited during peak demand. Hyperscalers (AWS, Google Cloud, Azure) offer H200 primarily via reserved instances and committed-use contracts; on-demand access at hyperscalers is sporadic. Availability is expected to remain tight through mid-2026 as NVIDIA ramps H200 production and the Blackwell transition creates allocation uncertainty. Lambda and Crusoe have shown the most consistent H200 on-demand availability in 2026. For multi-node clusters, direct contact with providers is recommended.
H200 vs B200: which should I choose in 2026?
Choose H200 if you need on-demand availability today at predictable, stable pricing — H200 is available from 10+ providers at $2.99–$6.00/GPU/hr with mature tooling and immediate CUDA compatibility. No software changes required from H100 workloads. Choose B200 if you're planning ahead for 2026 workloads requiring maximum throughput for training or large-scale batch inference. The B200 delivers 2–2.5× the inference throughput of H200 through its new Blackwell architecture with 192GB HBM3e, but pricing is still settling ($8–$12+/hr where available) and software maturity is still catching up. For most teams running production inference in H1 2026, H200 is the pragmatic choice. B200 becomes compelling for large-scale training runs and future-proofed inference infrastructure.
What models fit on a single NVIDIA H200 (141GB)?
The H200 SXM's 141GB HBM3e fits a wide range of models: any model up to ~70B parameters at BF16 (Llama 3.1 70B is tight at ~140GB but fits with careful KV cache management), and roughly 250B parameters in 4-bit quantization once KV cache headroom is accounted for (141GB at 0.5 bytes per parameter is ~280B of weights alone). Mixtral 8×22B fits in 4-bit (~71GB). Llama 3.1 405B requires approximately 202GB in 4-bit, which exceeds a single H200 and needs two. The key advantage over H100: Llama 3.1 70B requires two H100s at BF16 (with tensor parallelism overhead), but runs on a single H200 — eliminating cross-GPU communication latency and halving GPU cost for 70B-class serving.

Track H200 Prices and Get Alerts

H200 pricing is moving in 2026 as production scales and providers compete for AI inference workloads. GridStackHub tracks every change daily — here is how to stay ahead:

Compare H200 vs every alternative for your workload

Set your model size, batch size, hours per month — see exact monthly cost for H200, H100, B200, AMD MI300X, and 50+ more configurations.

Open GPU Cost Calculator →
Track GPU Prices — Free → | AMD MI300X Alternative → | Full GPU Cost Comparison →