A $1,999 card on your desk against a $3.70/hr datacenter rental. The H200 has 4.7× the FP16 throughput, 4.4× the VRAM, and 2.7× the memory bandwidth, yet the 5090 wins on $/token served the moment your workload fits in 32 GB.
Datasheet specs only; measured throughput on real workloads follows in §02. The gap there is often smaller than the FP16 number suggests, because most ML workloads are memory-bound; the bandwidth sketch after the table shows why.
| Spec | RTX 5090 | H200 |
|---|---|---|
| Vendor | NVIDIA | NVIDIA |
| Tier | Consumer | Datacenter |
| Generation | Blackwell | Hopper |
| VRAM | 32 GB GDDR7 | 141 GB HBM3e |
| Bandwidth | 1,792 GB/s | 4,800 GB/s |
| FP16 dense | 209.5 TFLOPS | 989 TFLOPS |
| TDP | 575 W | 700 W |
| Released | 2025 | 2024 |
| Status | Available | Available |
| Price | $1,999 MSRP | ~$3.70/hr cloud |
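To preview why bandwidth matters more than TFLOPS here, a back-of-envelope roofline for single-stream LLM decode. This is a sketch under two assumptions: each generated token streams the full weight set from VRAM, and a 4-bit 70B model occupies roughly 35 GB. The bandwidth figures come from the table above.

```python
# Rough decode ceiling for single-stream LLM inference: every generated token
# must stream the full weight set from VRAM, so tok/s <= bandwidth / weight bytes.
# Assumption: ~35 GB of weights for Llama 3.1 70B at 4-bit (70e9 params * 0.5 B).

bandwidth_gbs = {"RTX 5090": 1792, "H200": 4800}  # GB/s, from the spec table
weights_gb = 35.0

for card, bw in bandwidth_gbs.items():
    print(f"{card}: <= {bw / weights_gb:.0f} tok/s ceiling")
# RTX 5090: <= 51 tok/s (38 measured); H200: <= 137 tok/s (78 measured).
# The measured gap (78/38 ≈ 2.05x) tracks the 2.7x bandwidth ratio far more
# closely than the 4.7x FP16 ratio.
```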
Methodology: how we test.
Same model revision, same quantisation, same batch size on both cards. Where one side has no measurement we leave the cell empty rather than extrapolate.
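For illustration, a stripped-down single-stream tok/s harness in the spirit of those rules. This is a sketch, not the actual rig: the Hugging Face checkpoint id and generation lengths are assumptions, and real runs would pin library versions and average many generations.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint id

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")

inputs = tok("Explain memory bandwidth in one paragraph.", return_tensors="pt").to("cuda")

model.generate(**inputs, max_new_tokens=32)  # warm-up: exclude compile/cache effects

torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / (time.perf_counter() - t0):.1f} tok/s")
```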
| Category | Workload | Metric | RTX 5090 | H200 | Δ (H200 ÷ 5090) |
|---|---|---|---|---|---|
| LLM Inference | Llama 3.1 8B | tok/s | 140 | 280 | 2.00× |
| LLM Inference | Llama 3.1 70B · 4-bit | tok/s | 38 | 78 | 2.05× |
| LLM Inference | Qwen 2.5 32B · 4-bit | tok/s | 48 | 95 | 1.98× |
| LLM Inference | Mistral 7B | tok/s | 165 | 320 | 1.94× |
| Image Generation | SDXL 1024×1024 | it/s | 6.5 | 11.2 | 1.72× |
| Image Generation | Flux.1 Dev | it/s | 3.4 | 5.9 | 1.74× |
| Training | Fine-tune Llama 3.1 8B LoRA | samples/s | 12.5 | 26 | 2.08× |
| Training | ResNet-50 · ImageNet | img/s | 2,800 | 5,800 | 2.07× |
| Computer Vision | YOLOv8x · inference | FPS | 320 | 580 | 1.81× |
| Computer Vision | SAM ViT-H | masks/s | 9.2 | 16.5 | 1.79× |
| Audio/Video | Whisper Large v3 | × RT | 28 | 52 | 1.86× |
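As a worked example of the $/token framing from the intro, here is a hedged cost-per-million-tokens comparison on the 70B 4-bit row. The rental rate and tok/s come from the tables; the 5090's lifetime, duty cycle, and electricity price are labeled assumptions.

```python
# $/million generated tokens, Llama 3.1 70B 4-bit, single stream.
H200_RATE = 3.70               # $/hr rental, from the spec table
H200_TPS, RTX_TPS = 78, 38     # tok/s, from the benchmark table

CARD = 1999.00                 # RTX 5090 MSRP
LIFETIME_HRS = 3 * 52 * 20     # assumption: 3-year life at 20 hrs/week
POWER_COST = 0.575 * 0.15      # assumption: 575 W at $0.15/kWh -> $/hr

def usd_per_mtok(rate_hr: float, tps: float) -> float:
    return rate_hr / (tps * 3600) * 1e6

rtx_rate = CARD / LIFETIME_HRS + POWER_COST
print(f"H200 rental: ${usd_per_mtok(H200_RATE, H200_TPS):.2f}/Mtok")  # ~$13.2
print(f"RTX 5090:    ${usd_per_mtok(rtx_rate, RTX_TPS):.2f}/Mtok")    # ~$5.3
# The 5090 serves tokens ~2.5x cheaper here despite being ~2x slower,
# which is the $/token-served point from the intro.
```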
The right card is the one whose envelope covers your worst-case workload — not the one with the bigger TFLOPS number.
When the RTX 5090 wins.
Anything that fits in 32 GB and runs sustained: local 70B INT4 inference, SDXL pipelines, smaller LoRA training. Break-even on the upfront card vs cloud comes in under 12 months at 20 hrs/wk; the arithmetic is sketched below.
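The arithmetic behind that break-even figure, using only the two price cells from the spec table; electricity and the H200's higher per-hour throughput are noted in comments rather than modeled.

```python
CARD_MSRP = 1999.00     # RTX 5090
CLOUD_RATE = 3.70       # H200, $/hr
HOURS_PER_WEEK = 20

hours = CARD_MSRP / CLOUD_RATE   # ~540 GPU-hours of rental
weeks = hours / HOURS_PER_WEEK   # ~27 weeks, about 6 months
print(f"{hours:.0f} hrs -> {weeks:.0f} weeks")
# Adjusting for the H200 doing ~2x the work per hour roughly doubles this,
# which is how the figure lands around the 12-month mark.
```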
When the H200 wins.
Models that don't fit and batches that don't finish: 70B in FP16, 128k-context serving, anything FP8-trained at scale. The H200 also wins outright when you need elasticity, e.g. bursting to a hundred GPUs for an afternoon. A back-of-envelope fit check follows.
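A minimal version of the "doesn't fit" test: weights alone are params × bits ÷ 8, and KV cache, activations, and runtime overhead come on top, so real footprints run higher than these numbers.

```python
def weight_gb(params_billions: float, bits: float) -> float:
    """Nominal weight footprint in GB: params x bits / 8."""
    return params_billions * bits / 8

for name, p, bits in [
    ("Llama 3.1 70B FP16",  70, 16),  # ~140 GB: H200 territory
    ("Llama 3.1 70B 4-bit", 70, 4),   # ~35 GB: tight even before overhead
    ("Qwen 2.5 32B 4-bit",  32, 4),   # ~16 GB: comfortable in 32 GB
]:
    print(f"{name}: ~{weight_gb(p, bits):.0f} GB of weights")
```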
Bottom line.
Buy the 5090 for steady local workloads under 32 GB; rent the H200 for the bursts and the bigger models. The two stack; they don't replace each other.
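The same verdict as a toy decision rule; the 32 GB and 20 hrs/week thresholds come from this article, and a real decision would also weigh power, resale, and multi-GPU bursts.

```python
def pick_gpu(model_footprint_gb: float, hours_per_week: float) -> str:
    """Toy encoding of the verdict above, not a capacity planner."""
    if model_footprint_gb > 32:
        return "rent H200"             # it simply doesn't fit the 5090
    if hours_per_week >= 20:
        return "buy RTX 5090"          # steady use amortises the card
    return "rent H200 for the bursts"  # occasional use: elasticity wins

print(pick_gpu(16, 30))   # e.g. Qwen 32B 4-bit as a daily driver -> buy RTX 5090
print(pick_gpu(140, 5))   # e.g. 70B FP16, occasional use -> rent H200
```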