NVIDIA's volume datacenter card against AMD's flagship CDNA 3 accelerator. The MI300X brings 2.4× the VRAM, 1.6× the memory bandwidth, and 1.3× the dense FP16 TFLOPS, at a cloud price that varies more than the silicon spec ever could.
Datasheet specs only. Throughput on real workloads follows in §02; the gaps there track memory bandwidth more closely than peak FLOPS, because most ML workloads are memory-bound (a back-of-envelope roofline follows the table).
| Spec | H100 SXM | MI300X |
|---|---|---|
| Vendor | NVIDIA | AMD |
| Tier | Datacenter | Datacenter |
| Generation | Hopper | CDNA 3 |
| VRAM | 80 GB HBM3 | 192 GB HBM3 |
| Bandwidth | 3,350 GB/s | 5,300 GB/s |
| FP16 dense | 989 TFLOPS | 1,307 TFLOPS |
| TDP | 700 W | 750 W |
| Released | 2022 | 2023 |
| Status | Widely available | Available |
| Price | ~$2.50/hr cloud | ~$3–5/hr cloud |
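Why bandwidth matters more than the FLOPS headline: in the decode phase of LLM inference, every weight is streamed from HBM once per generated token, so single-stream throughput is capped by bandwidth. A back-of-envelope sketch using the table's numbers (the model size is a nominal assumption):

```python
# Decode-phase roofline: single-stream tok/s is capped by how fast the
# card can stream the full weight set from HBM once per token.
GB = 1e9

bandwidth = {            # from the spec table, bytes/s
    "H100 SXM": 3350 * GB,
    "MI300X": 5300 * GB,
}

weight_bytes = 8e9 * 2   # nominal Llama 3.1 8B in FP16: ~8B params x 2 bytes

for card, bw in bandwidth.items():
    print(f"{card}: ~{bw / weight_bytes:.0f} tok/s single-stream ceiling")

# H100 SXM: ~209 tok/s, MI300X: ~331 tok/s. That 1.58x gap is set by
# bandwidth, not the FP16 spec, and it brackets the measured deltas in
# §02. Batched serving amortises weight reads across requests, which is
# how measured numbers can exceed this single-stream cap.
```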
Methodology: how we test.
Same model revision, same quantisation, same batch size on both cards. Where one side has no measurement we leave the cell empty rather than extrapolate. (A minimal harness sketch follows the results table.)
| Category | Workload | Metric | H100 SXM | MI300X | Δ (MI300X ÷ H100) |
|---|---|---|---|---|---|
| LLM Inference | Llama 3.1 8B | tok/s | 240 | 320 | 1.33× |
| LLM Inference | Llama 3.1 70B · 4-bit | tok/s | 65 | 95 | 1.46× |
| LLM Inference | Qwen 2.5 32B · 4-bit | tok/s | 80 | 115 | 1.44× |
| LLM Inference | Mistral 7B | tok/s | 280 | 370 | 1.32× |
| Image Generation | SDXL 1024×1024 | it/s | 10.5 | 13 | 1.24× |
| Image Generation | Flux.1 Dev | it/s | 5.4 | 6.8 | 1.26× |
| Training | Fine-tune Llama 3.1 8B LoRA | samples/s | 22 | 30 | 1.36× |
| Training | ResNet-50 · ImageNet | img/s | 5,400 | 6,900 | 1.28× |
| Computer Vision | YOLOv8x · inference | FPS | 540 | 680 | 1.26× |
| Computer Vision | SAM ViT-H | masks/s | 15 | 19 | 1.27× |
| Audio/Video | Whisper Large v3 | × RT | 48 | 60 | 1.25× |
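For the tok/s rows, a minimal sketch of the kind of harness the methodology implies. The model ID, prompt, and batch size are illustrative placeholders; a real run would pin exact model revisions and CUDA/ROCm versions. Note that PyTorch's ROCm builds expose the `torch.cuda` namespace, so the same script runs unchanged on both cards.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B"  # placeholder; pin the exact revision

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Identical work on both cards: fixed batch, fixed length, greedy decoding.
prompts = ["Summarise the history of HBM in one paragraph."] * 8
inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)

model.generate(**inputs, max_new_tokens=8, do_sample=False)  # warm-up pass

torch.cuda.synchronize()  # also valid on ROCm builds of PyTorch
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Assumes no sequence stops early at EOS; good enough for a sketch.
new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
print(f"{new_tokens / elapsed:.0f} tok/s")
```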
The right card is the one whose envelope covers your worst-case workload — not the one with the bigger TFLOPS number.
When the H100 SXM wins.
When your stack assumes CUDA: custom kernels, NVLink-tight multi-GPU, FP8 Transformer Engine paths, and any production training pipeline that has been hardened on NVIDIA.
When the MI300X wins.
When VRAM is the constraint and ROCm is fine: 192 GB on a single card holds a 70B model in FP16 with headroom, or a much larger MoE shard than an H100 can. Llama and DeepSeek inference stacks on ROCm 6+ are now competitive. (A quick capacity check follows.)
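The weight-only arithmetic behind that claim. These are floors, not totals: KV cache, activations, and allocator overhead come on top, and the parameter counts are nominal.

```python
# Weight-only footprints; real serving needs headroom beyond these.
GIB = 2**30

def weights_gib(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / GIB

print(f"70B FP16:  {weights_gib(70e9, 2):.0f} GiB")    # ~130 GiB: fits 192 GB, not 80 GB
print(f"70B 4-bit: {weights_gib(70e9, 0.5):.0f} GiB")  # ~33 GiB: fits either card
print(f"8B FP16:   {weights_gib(8e9, 2):.0f} GiB")     # ~15 GiB: fits anywhere
```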
Bottom line. For inference-heavy teams already on ROCm or willing to port, the MI300X price/perf is hard to beat. For training-heavy teams already on CUDA, the H100 stays the safer call.
H200 vs B200. Hopper's last word against Blackwell's first: 4.5× the FP16, almost 50% more VRAM bandwidth.
H100 vs H200. Same FP16 ceiling (989 TFLOPS), but the H200 nearly doubles the VRAM (80 → 141 GB) at 1.4× the bandwidth.
RTX 5090 vs RTX 4090. Blackwell vs Ada: 32 GB GDDR7 against 24 GB GDDR6X, at 1.27× the FP16.