AI compute costs are the single largest operational expense for most ML teams. The average ML team overpays by 40–60% compared to what's possible with provider selection and pricing model optimization alone. Here are 7 strategies with current price data.
Switch to Spot Instances for Training
Spot GPUs use unused cloud capacity at steep discounts. H100 spot instances are available on Vast.ai for $1.49/hr vs $2.23/hr on-demand at CoreWeave. The key: checkpoint your training job frequently (every 15–30 minutes). When an instance is preempted, your next run resumes from the last checkpoint. Most modern training frameworks (PyTorch Lightning, Hugging Face Accelerate) support automatic checkpointing.
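As a rough sketch, here is what interval-based checkpointing and resume can look like with PyTorch Lightning. The toy model, local checkpoint path, and 20-minute interval are illustrative placeholders; in practice, write checkpoints to storage that survives preemption.

```python
# Minimal sketch of spot-friendly checkpoint/resume with PyTorch Lightning.
# The module, data, paths, and interval are toy placeholders, not a benchmark setup.
import os
from datetime import timedelta

import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint


class ToyRegressor(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


train_loader = DataLoader(
    TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1)), batch_size=64
)

# Write checkpoints somewhere durable (network volume or object store) so they
# outlive the preempted instance.
checkpoint_cb = ModelCheckpoint(
    dirpath="checkpoints/",                      # point at durable storage in practice
    save_last=True,                              # keep a rolling last.ckpt to resume from
    train_time_interval=timedelta(minutes=20),   # inside the 15-30 minute window above
)

trainer = L.Trainer(max_epochs=5, callbacks=[checkpoint_cb])

# On a restarted instance, resume from the last checkpoint if one exists.
last_ckpt = "checkpoints/last.ckpt"
trainer.fit(
    ToyRegressor(),
    train_dataloaders=train_loader,
    ckpt_path=last_ckpt if os.path.exists(last_ckpt) else None,
)
```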
Use Provider Arbitrage
The same GPU model can cost 2–4× more at hyperscalers than at specialized GPU clouds. An H100 on AWS costs ~$3.90/hr; the same GPU on Lambda costs $1.99/hr. For workloads without compliance requirements (HIPAA, FedRAMP, SOC 2), specialized GPU clouds deliver identical compute at dramatically lower prices. Use GridStackHub's comparison table to find the cheapest provider for your GPU model.
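For a sense of scale, here is a back-of-the-envelope comparison at the rates above. The job size (8 GPUs for 72 hours) is an assumption, not a measurement:

```python
# Rough cost comparison for the same fine-tuning job on two providers.
# Rates are from the text above; the job size is an assumed example.
gpus = 8
hours = 72

aws_rate = 3.90     # $/GPU-hr, H100 on AWS (from the text)
lambda_rate = 1.99  # $/GPU-hr, H100 on Lambda (from the text)

aws_cost = gpus * hours * aws_rate
lambda_cost = gpus * hours * lambda_rate

print(f"AWS:    ${aws_cost:,.0f}")
print(f"Lambda: ${lambda_cost:,.0f}")
print(f"Savings: ${aws_cost - lambda_cost:,.0f} ({1 - lambda_cost / aws_cost:.0%})")
```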
Right-Size Your GPU Selection
Not every workload needs an H100. For inference serving, smaller GPUs often deliver better cost-per-token. A single L40S at $1.10/hr can serve the same inference load as an H100 at $2.23/hr for many model sizes (7B–13B parameters). Benchmark your specific model on multiple GPU types using our cost-per-token calculator before committing to a configuration.
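A simple way to frame the comparison is cost per token: divide the hourly rate by measured tokens per hour. The sketch below uses assumed throughput numbers purely for illustration; replace them with benchmarks of your own model on each GPU.

```python
# Cost-per-token comparison sketch. The throughput figures are assumptions;
# measure your own model's tokens/sec on each GPU before deciding.
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

candidates = {
    "L40S @ $1.10/hr": (1.10, 1800.0),  # assumed throughput for a 7B-13B model
    "H100 @ $2.23/hr": (2.23, 3200.0),  # assumed throughput
}

for name, (rate, tps) in candidates.items():
    print(f"{name}: ${cost_per_million_tokens(rate, tps):.3f} per 1M tokens")
```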
Buy Reserved Capacity for Steady Workloads
If your GPU utilization stays consistently high, reserved pricing beats on-demand; the break-even point is the ratio of the reserved rate to the on-demand rate. CoreWeave's 1-year H100 reserved rate is ~$1.79/hr vs $2.23/hr on-demand, a 20% saving that pays off once the GPU is busy more than roughly 80% of the time at those rates. For a single GPU running continuously, that's roughly $3,850/year in savings. Model your actual utilization before reserving; low-utilization workloads do better on on-demand.
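To make the break-even explicit, here is a small sketch using the CoreWeave rates above. The 90% utilization figure is an assumption to replace with your own numbers:

```python
# Back-of-the-envelope reserved vs on-demand comparison.
# Rates are from the text; the utilization figure is an assumption.
HOURS_PER_YEAR = 24 * 365

on_demand_rate = 2.23   # $/hr, CoreWeave H100 on-demand
reserved_rate = 1.79    # $/hr, CoreWeave 1-year H100 reserved
utilization = 0.90      # fraction of the year the GPU is actually busy (assumed)

on_demand_cost = on_demand_rate * HOURS_PER_YEAR * utilization
reserved_cost = reserved_rate * HOURS_PER_YEAR          # reserved is paid whether busy or not
breakeven_utilization = reserved_rate / on_demand_rate  # below this, on-demand is cheaper

print(f"On-demand: ${on_demand_cost:,.0f}/yr  Reserved: ${reserved_cost:,.0f}/yr")
print(f"Break-even utilization: {breakeven_utilization:.0%}")
```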
Split Training and Inference Optimally
Training requires raw throughput (H100/H200/B200 with NVLink). Inference often runs efficiently on cheaper GPUs (L40S, A100, even A10G for smaller models). Run training jobs on the cheapest spot capacity available for your GPU tier. Serve inference from reserved capacity on right-sized GPUs. This split can cut total infrastructure costs by 25–45% vs running everything on a single GPU type.
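As a hedged illustration of that split, the sketch below prices the two workloads independently. The training volume, inference fleet size, and rates are assumptions, not measurements:

```python
# Toy monthly cost model: everything on H100 on-demand vs
# training on H100 spot + inference on L40S.
# All workload sizes are assumptions for illustration only.
HOURS_PER_MONTH = 730

h100_on_demand = 2.23   # $/hr (from the text)
h100_spot = 1.49        # $/hr, Vast.ai spot (from the text)
l40s_rate = 1.10        # $/hr (from the text)

training_gpu_hours = 2000   # assumed monthly training demand
inference_gpus = 4          # assumed fleet serving 24/7

single_tier = (training_gpu_hours + inference_gpus * HOURS_PER_MONTH) * h100_on_demand
split_tier = training_gpu_hours * h100_spot + inference_gpus * HOURS_PER_MONTH * l40s_rate

print(f"All H100 on-demand: ${single_tier:,.0f}/mo")
print(f"Split (spot training + L40S inference): ${split_tier:,.0f}/mo")
print(f"Reduction: {1 - split_tier / single_tier:.0%}")
```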
Optimize Batch Sizes and Memory Usage
Suboptimal batch sizes leave GPU memory underutilized, forcing you to rent more GPUs than you need. Use tools like PyTorch's memory profiler to identify the maximum batch size your model supports. For memory-bound inference workloads, doubling the batch size often comes close to doubling throughput without adding GPUs, effectively cutting your compute cost per inference.
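One way to find the ceiling empirically is to probe increasing batch sizes until you hit an out-of-memory error. The model and input shapes below are toy placeholders; swap in your own model and real batches:

```python
# Rough probe for the largest batch size that fits in GPU memory.
# The model and synthetic inputs are placeholders for illustration.
import torch

def max_batch_size(model, make_batch, candidates=(8, 16, 32, 64, 128, 256)):
    """Try increasing batch sizes; return the largest that completes a forward/backward pass."""
    best = None
    for bs in candidates:
        try:
            torch.cuda.empty_cache()
            x, y = make_batch(bs)
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            model.zero_grad(set_to_none=True)
            best = bs
        except torch.cuda.OutOfMemoryError:
            break   # previous size was the largest that fit
    return best

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1)
).to(device)
probe = lambda bs: (torch.randn(bs, 4096, device=device), torch.randn(bs, 1, device=device))

print("Largest batch that fits:", max_batch_size(model, probe))
```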
Use Multi-Region Spot Markets
Spot availability and pricing vary by region. Vast.ai, RunPod, and other GPU marketplaces let you bid for capacity across multiple regions. Setting up multi-region training infrastructure (with data stored in S3-compatible storage accessible across regions) lets you always capture the cheapest available spot capacity globally.
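The selection logic itself is simple, as the generic sketch below shows. Note that fetch_offers() is a hypothetical placeholder standing in for whatever provider API or CLI you actually query (for example, Vast.ai's or RunPod's offer listings); the listed prices are illustrative.

```python
# Pick the cheapest spot offer across regions.
# fetch_offers() is a hypothetical stand-in for a real offer-listing API,
# and the prices returned here are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Offer:
    provider: str
    region: str
    gpu: str
    price_per_hour: float

def fetch_offers(gpu_model: str) -> list[Offer]:
    # Placeholder: in practice, query each marketplace's API/CLI and normalize the results.
    return [
        Offer("vast.ai", "us-east", gpu_model, 1.49),
        Offer("vast.ai", "eu-west", gpu_model, 1.62),
        Offer("runpod", "us-west", gpu_model, 1.74),
    ]

def cheapest_offer(gpu_model: str) -> Offer:
    return min(fetch_offers(gpu_model), key=lambda o: o.price_per_hour)

best = cheapest_offer("H100")
print(f"Launch in {best.provider}/{best.region} at ${best.price_per_hour}/hr")
```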
Find the cheapest GPU for your workload
Compare spot, on-demand, and reserved pricing across 32+ providers instantly.
Open GPU Calculator →