Same Hopper SM count, same 989 FP16 TFLOPS — but the H200 nearly doubles VRAM (80 → 141 GB) and lifts memory bandwidth from 3.35 TB/s to 4.8 TB/s. The compute ceiling is unchanged; the envelope around it grew.
Datasheet specs only. Throughput on real workloads follows in §02 — the gap there is often smaller than the FP16 number suggests, because most ML workloads are memory-bound.
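A quick back-of-envelope shows why. During single-stream decode every weight is read once per generated token, so bandwidth, not TFLOPS, sets the ceiling. A minimal sketch using the bandwidth figures from the spec table below; the 4-bit 70B weight size is approximate, and KV-cache traffic and quantisation overhead are ignored:

```python
# Decode is memory-bound: each generated token reads every weight once,
# so bandwidth caps single-stream tokens/s no matter how many TFLOPS idle.
# Illustrative numbers, not measurements.

GB = 1e9  # datasheet-style decimal gigabytes

def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode throughput from bandwidth alone."""
    return bandwidth_gb_s / weight_gb

# Llama 3.1 70B at 4-bit: roughly 70e9 params * 0.5 bytes = ~35 GB of weights
# (quantisation scales and KV-cache reads would add more traffic on top).
weights_gb = 70e9 * 0.5 / GB

for card, bw_gb_s in [("H100 SXM", 3350.0), ("H200", 4800.0)]:
    print(f"{card}: <= {decode_ceiling_tok_s(bw_gb_s, weights_gb):.0f} tok/s")
# H100 SXM: <= 96 tok/s
# H200: <= 137 tok/s
```

Both measured 70B numbers in §02 sit under their ceilings, and the gap between the cards follows the bandwidth spec rather than the identical compute spec.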
| Spec | H100 SXM | H200 |
|---|---|---|
| Vendor | NVIDIA | NVIDIA |
| Tier | Datacenter | Datacenter |
| Generation | Hopper | Hopper |
| VRAM | 80 GB HBM3 | 141 GB HBM3e |
| Bandwidth | 3,350 GB/s | 4,800 GB/s |
| FP16 dense | 989 TFLOPS | 989 TFLOPS |
| TDP | 700 W | 700 W |
| Released | 2022 | 2024 |
| Status | Widely available | Available |
| Price | ~$2.50/hr cloud | ~$3.70/hr cloud |
Methodology: how we test.
Same model revision, same quantisation, same batch size on both cards. Where one side has no measurement we leave the cell empty rather than extrapolate.
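The harness itself isn't reproduced here, but the shape of the measurement matters: warm-up runs first, throughput rather than latency, median over repeats. A minimal sketch, with `generate` a hypothetical stand-in for whatever invokes the model on the device under test:

```python
import time
from statistics import median
from typing import Callable

def tokens_per_second(generate: Callable[[int], int],
                      new_tokens: int = 512,
                      warmup: int = 2,
                      repeats: int = 5) -> float:
    """Median decode throughput over several timed runs.

    `generate(n)` is a placeholder: it should run n decode steps on the
    card and return the number of tokens actually produced.
    """
    for _ in range(warmup):
        generate(new_tokens)  # warm-up: load weights, compile kernels
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate(new_tokens)
        rates.append(produced / (time.perf_counter() - start))
    return median(rates)  # median resists a single slow outlier run
```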
| Category | Workload | Metric | H100 SXM | H200 | Δ |
|---|---|---|---|---|---|
| LLM Inference | Llama 3.1 8B | tok/s | 240 | 280 | 1.17× |
| LLM Inference | Llama 3.1 70B · 4-bit | tok/s | 65 | 78 | 1.20× |
| LLM Inference | Qwen 2.5 32B · 4-bit | tok/s | 80 | 95 | 1.19× |
| LLM Inference | Mistral 7B | tok/s | 280 | 320 | 1.14× |
| Image Generation | SDXL 1024×1024 | it/s | 10.5 | 11.2 | 1.07× |
| Image Generation | Flux.1 Dev | it/s | 5.4 | 5.9 | 1.09× |
| Training | Fine-tune Llama 3.1 8B LoRA | samples/s | 22 | 26 | 1.18× |
| Training | ResNet-50 · ImageNet | img/s | 5,400 | 5,800 | 1.07× |
| Computer Vision | YOLOv8x · inference | FPS | 540 | 580 | 1.07× |
| Computer Vision | SAM ViT-H | masks/s | 15 | 16.5 | 1.10× |
| Audio/Video | Whisper Large v3 | × realtime | 48 | 52 | 1.08× |
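One way to read the table: set each speedup against the ~1.48× cloud price premium from the spec table. A sketch under the assumption that those list prices hold for your provider:

```python
# Perf-per-dollar: H200 speedups from the benchmark table vs its price
# premium ($3.70/hr over $2.50/hr, ~1.48x). A ratio under 1.0 means the
# H100 SXM delivers more work per dollar on that workload.

premium = 3.70 / 2.50  # ~1.48x

speedups = {
    "Llama 3.1 70B 4-bit": 78 / 65,      # 1.20x
    "SDXL 1024x1024":      11.2 / 10.5,  # 1.07x
    "LoRA fine-tune 8B":   26 / 22,      # 1.18x
    "ResNet-50 training":  5800 / 5400,  # 1.07x
}

for workload, speedup in speedups.items():
    print(f"{workload}: {speedup:.2f}x faster, {speedup / premium:.2f}x perf/$")
```

On these numbers the H100 SXM wins on cost per unit of work across the board; the H200's case is capability (workloads the 80 GB card cannot fit at all), not price.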
The right card is the one whose envelope covers your worst-case workload — not the one with the bigger TFLOPS number.
When the H100 SXM wins. Compute-bound workloads whose models fit neatly in 80 GB: FP8 training, ResNet-style throughput, anything that already saturates the H100 with the model resident in memory. For those, the extra memory does nothing.
When the H200 wins. Memory-bound workloads. A 70B model's FP16 weights fit on a single H200; a 128k-context KV cache fits with headroom; MoE shards stop spilling across cards. Real-world LLM serving is mostly memory-bandwidth-bound, which is where the H200's 1.4× bandwidth advantage shows up.
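A capacity sketch makes those claims concrete, assuming Llama 3.1 70B's published shape (80 layers, 8 KV heads under grouped-query attention, head dim 128) and FP16 KV entries; activation buffers and allocator fragmentation are ignored, so treat the totals as floors:

```python
GB = 1e9  # decimal GB, matching the datasheet figures

def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_el: int = 2) -> float:
    """KV cache size: K and V entries per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_el * tokens / GB

kv_128k = kv_cache_gb(128 * 1024)   # ~43 GB at FP16
fp16_weights = 70e9 * 2 / GB        # 140 GB
int4_weights = 70e9 * 0.5 / GB      # 35 GB

print(f"128k-context KV cache: {kv_128k:.0f} GB")
print(f"70B FP16 weights: {fp16_weights:.0f} GB (over 80 GB, just under 141 GB)")
print(f"70B 4-bit + 128k KV: {int4_weights + kv_128k:.0f} GB "
      f"(no headroom on 80 GB, ample on 141 GB)")
```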
Bottom line. If your worst-case workload is LLM inference, pick H200. If it’s training a model that already fits and saturates an H100, the upgrade buys nothing — just rent more H100s.
Hopper’s last word against Blackwell’s first. 4.5× the FP16, almost 50% more VRAM bandwidth.
Blackwell vs Ada. 32 GB GDDR7 against 24 GB GDDR6X, at 1.27× the FP16.
The biggest consumer card vs a real datacenter accelerator. When does the 5090 actually catch up?