H200. Specs, benchmarks, $/hr.
Same compute as an H100 SXM, but nearly double the memory: 141 GB of HBM3e at 4.8 TB/s. The first datacenter card whose VRAM holds a 70B model's FP16 weights (140 GB) on a single GPU, and the cheapest path to long-context serving on Hopper.
H200, specified.
FP16 figures are dense throughput from the NVIDIA datasheet. Bandwidth is peak; sustained will be lower. Price is the cheapest verified hourly rate as of the date stamped at the top.
| Spec | H200 |
|---|---|
| Vendor | NVIDIA |
| Tier | Datacenter |
| Generation | Hopper |
| VRAM | 141 GB · HBM3e |
| Memory bandwidth | 4,800 GB/s |
| FP16 dense | 989 TFLOPS |
| TDP | 700 W |
| Released | 2024 |
| Price | ~$3.70/hr cloud |
| Status | Available |
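One ratio worth pulling out of the table: dense FP16 throughput over memory bandwidth gives the card's machine balance, the FLOPs-per-byte threshold below which a kernel is bandwidth-bound rather than compute-bound. A minimal sketch using only the datasheet numbers above; the roofline framing is standard, not a measurement on this card.

```python
# Machine balance for the H200, from the spec table:
# kernels doing fewer FLOPs per byte of HBM traffic than this
# are memory-bandwidth-bound; more, and they are compute-bound.
FP16_TFLOPS = 989   # dense FP16, datasheet
BW_TBPS = 4.8       # peak memory bandwidth, datasheet

machine_balance = (FP16_TFLOPS * 1e12) / (BW_TBPS * 1e12)  # FLOPs per byte
print(round(machine_balance))  # → 206
```

Single-stream LLM decode performs on the order of 1–2 FLOPs per byte of weights read, far below ~206, which is why the inference numbers below track bandwidth rather than TFLOPS.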
Eleven workloads, one card.
Throughput on the same set of repeatable workloads we use across the register: identical quantisation for every card in a given row, with p95 latency reported in the methodology notes.
Numbers without a measurement on this chip are marked "—". Cross-card comparisons live on the head-to-head pages.
| Category | Workload | Metric | H200 | Notes |
|---|---|---|---|---|
| LLM Inference | Llama 3.1 8B | tok/s | 280 | tokens per second · single-stream · FP16 |
| LLM Inference | Llama 3.1 70B · 4-bit | tok/s | 78 | tokens per second · single-stream · INT4 GPTQ |
| LLM Inference | Qwen 2.5 32B · 4-bit | tok/s | 95 | tokens per second · single-stream · INT4 |
| LLM Inference | Mistral 7B | tok/s | 320 | tokens per second · single-stream · FP16 |
| Image Generation | SDXL 1024×1024 | it/s | 11.2 | iterations per second · 30 steps · FP16 |
| Image Generation | Flux.1 Dev | it/s | 5.9 | iterations per second · 28 steps · FP16 |
| Training | Fine-tune Llama 3.1 8B LoRA | samples/s | 26 | samples per second · seq 2k · BF16 |
| Training | ResNet-50 · ImageNet | img/s | 5,800 | images per second · BS=256 · BF16 |
| Computer Vision | YOLOv8x · inference | FPS | 580 | frames per second · BS=1 · FP16 |
| Computer Vision | SAM ViT-H | masks/s | 16.5 | masks per second · 1024×1024 · FP16 |
| Audio/Video | Whisper Large v3 | × RT | 52 | multiples of real-time · CPU offload off |
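A sanity check on the single-stream rows: autoregressive decode is bandwidth-bound, so an upper bound on tok/s is peak bandwidth divided by the bytes streamed per token (roughly the weight footprint). A minimal sketch; the model sizes and the 4.8 TB/s peak come from this page, and the gap between ceiling and measurement is expected, not modelled.

```python
# Upper-bound single-stream decode throughput on H200: each
# generated token must stream (at least) the model weights once
# from HBM, so tok/s <= peak_bandwidth / weight_bytes.
PEAK_BW_GBPS = 4800  # H200 peak memory bandwidth, GB/s

def decode_ceiling(params_b: float, bytes_per_param: float) -> float:
    """Theoretical tok/s ceiling for memory-bound decode."""
    weight_gb = params_b * bytes_per_param
    return PEAK_BW_GBPS / weight_gb

# Llama 3.1 8B in FP16 (2 bytes/param): 4800 / 16 GB
print(round(decode_ceiling(8, 2)))     # → 300 (measured above: 280)
# Llama 3.1 70B in INT4 (~0.5 bytes/param): 4800 / 35 GB
print(round(decode_ceiling(70, 0.5)))  # → 137 (measured above: 78)
```

Measured numbers sit below the ceiling because sustained bandwidth is lower than peak and INT4 adds dequantisation work on top of the weight reads.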
What fits in 141 GB, really.
FP16 weights = 2 bytes × parameters; INT8 halves that and INT4 quarters it, with small quality loss. Fine-tuning needs roughly 3–4× the weight memory for gradients, optimiser state and activations.
| Model | Params | FP16 | INT8 | INT4 | Fits on H200? |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8B | 16 GB | 8 GB | 4 GB | FP16, INT8 and INT4 |
| Qwen 2.5 14B | 14B | 28 GB | 14 GB | 7 GB | FP16, INT8 and INT4 |
| Qwen 2.5 32B | 32B | 64 GB | 32 GB | 16 GB | FP16, INT8 and INT4 |
| Llama 3.1 70B | 70B | 140 GB | 70 GB | 36 GB | FP16 (weights only, ~1 GB headroom), INT8 and INT4 |
| DeepSeek V3 | 671B MoE | 1.3 TB | 671 GB | 336 GB | No |
| Llama 3.1 405B | 405B | 810 GB | 405 GB | 203 GB | No |
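The rows above are bytes-per-parameter arithmetic. A minimal sketch of the same calculation; the ~10% serving overhead for KV cache and activations is an assumption for illustration, not a measured figure.

```python
# Rough VRAM fit check: weight bytes = bytes_per_param * params.
# Fine-tuning needs ~3-4x the weight memory for gradients,
# optimiser state and activations (rule of thumb from the text).
H200_GB = 141

BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_gb(params_b: float, dtype: str) -> float:
    return params_b * BYTES_PER_PARAM[dtype]

def fits_for_inference(params_b: float, dtype: str,
                       vram_gb: float = H200_GB) -> bool:
    # Assumed ~10% headroom for KV cache and activations.
    return weight_gb(params_b, dtype) * 1.1 <= vram_gb

print(fits_for_inference(70, "FP16"))   # → False (weights fit at 140 GB, serving headroom doesn't)
print(fits_for_inference(70, "INT4"))   # → True  (~38.5 GB incl. headroom)
print(fits_for_inference(405, "INT4"))  # → False (~223 GB incl. headroom)
```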
H200 head-to-heads.
H200 vs B200 →
Hopper’s last word against Blackwell’s first: roughly 2.3× the dense FP16, a third more VRAM (192 GB) and two-thirds more bandwidth (8 TB/s).
H100 SXM vs H200 →
Same FP16 ceiling (989 TFLOPS dense), but the H200 nearly doubles VRAM (80 → 141 GB) and delivers 1.4× the bandwidth (3.35 → 4.8 TB/s).
RTX 5090 vs H200 →
The biggest consumer card vs a real datacenter accelerator. When does the 5090 actually catch up?