NVIDIA · Datacenter · Blackwell · released 2025
Issue: April 22, 2026

B200. Specs, benchmarks, $/hr.

Blackwell at full datacenter scale. 4,500 FP16 TFLOPS — 4.5× an H200 — paired with 192 GB HBM3e at 8 TB/s. The current top of the leaderboard for any single-card workload, when you can find one to rent.

§ 01 · Specs

B200, specified.

Dense FP16 from the NVIDIA datasheet. Bandwidth is peak; sustained will be lower. Price reflects the cheapest verified hourly rate as of the date stamped at the top.

[Chart: Architectural lineage · FP16 TFLOPS over recent NVIDIA generations.]

Vendor           | NVIDIA
Tier             | Datacenter
Generation       | Blackwell
VRAM             | 192 GB · HBM3e
Memory bandwidth | 8,000 GB/s
FP16 dense       | 4,500 TFLOPS
TDP              | 1,000 W
Released         | 2025
Price            | ~$6+/hr cloud
Status           | Limited

Fig 1 · Single-card spec sheet. FP16 is dense (not sparse). Bandwidth is peak HBM/GDDR.
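One sanity check the spec sheet enables: a memory-bandwidth roofline for single-stream decode. Each generated token must stream the full weight set from HBM once, so peak bandwidth divided by weight bytes gives a hard ceiling. A minimal sketch (our own rough model: it ignores KV-cache traffic and kernel overheads, so measured numbers should land below it):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode: one full weight read per token."""
    return bandwidth_gb_s / weights_gb

# Llama 3.1 70B at INT4 (~36 GB of weights) against the B200's 8,000 GB/s:
print(round(decode_ceiling_tok_s(8000, 36)))  # → 222
```

The measured 165 tok/s in Fig 2 sits comfortably under that ~222 tok/s ceiling, as the model predicts.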
§ 02 · Benchmarks

Eleven workloads, one card.

Throughput on the same set of repeatable workloads we use across the register. Quantisation is matched across cards within a row; p95 latency figures live in the methodology notes.

Numbers without a measurement on this chip are marked "—". Cross-card comparisons live on the head-to-head pages.
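As a sketch of how a single-stream number like those below is produced (this harness is illustrative, not the register's actual benchmark code): time repeated batch-size-1 generations and keep both the median and the worst rate.

```python
import statistics
import time

def single_stream_tok_s(generate, prompt: str, n_tokens: int, runs: int = 5):
    """Time repeated batch-size-1 generations; return (median, worst) tok/s.
    `generate` stands in for any model's generation call (an assumption here)."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt, max_new_tokens=n_tokens)
        rates.append(n_tokens / (time.perf_counter() - start))
    return statistics.median(rates), min(rates)

# Stub generator so the harness runs standalone; a real run would call the model.
stub = lambda prompt, max_new_tokens: time.sleep(0.001)
median_rate, worst_rate = single_stream_tok_s(stub, "hello", n_tokens=128)
```

Reporting the median keeps one slow run (a cold cache, a background process) from skewing the headline number.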

Category         | Workload                     | Metric    | B200   | Notes
LLM Inference    | Llama 3.1 8B                 | tok/s     | 540    | tokens per second · single-stream · FP16
LLM Inference    | Llama 3.1 70B · 4-bit        | tok/s     | 165    | tokens per second · single-stream · INT4 GPTQ
LLM Inference    | Qwen 2.5 32B · 4-bit         | tok/s     | 200    | tokens per second · single-stream · INT4
LLM Inference    | Mistral 7B                   | tok/s     | 620    | tokens per second · single-stream · FP16
Image Generation | SDXL 1024×1024               | it/s      | 22     | iterations per second · 30 steps · FP16
Image Generation | Flux.1 Dev                   | it/s      | 12     | iterations per second · 28 steps · FP16
Training         | Fine-tune Llama 3.1 8B LoRA  | samples/s | 52     | samples per second · seq 2k · BF16
Training         | ResNet-50 · ImageNet         | img/s     | 11,500 | images per second · BS=256 · BF16
Computer Vision  | YOLOv8x · inference          | FPS       | 1,100  | frames per second · BS=1 · FP16
Computer Vision  | SAM ViT-H                    | masks/s   | 32     | masks per second · 1024×1024 · FP16
Audio/Video      | Whisper Large v3             | × RT      | 100    | multiples of real-time · CPU offload off
Fig 2 · Per-workload throughput on a single B200. Higher is better unless the metric is a price.
§ 03 · VRAM fit

What fits in 192 GB, really.

FP16 weights = 2 bytes × parameters. INT4 cuts that 4× with small quality loss. Fine-tuning needs 3–4× more memory for gradients, optimiser, activations.

Model          | Params   | FP16   | INT8   | INT4   | Fits on B200?
Llama 3.1 8B   | 8B       | 16 GB  | 8 GB   | 4 GB   | FP16, INT8 and INT4
Qwen 2.5 14B   | 14B      | 28 GB  | 14 GB  | 7 GB   | FP16, INT8 and INT4
Qwen 2.5 32B   | 32B      | 64 GB  | 32 GB  | 16 GB  | FP16, INT8 and INT4
Llama 3.1 70B  | 70B      | 140 GB | 70 GB  | 36 GB  | FP16, INT8 and INT4
DeepSeek V3    | 671B MoE | 1.3 TB | 671 GB | 336 GB | No
Llama 3.1 405B | 405B     | 810 GB | 405 GB | 203 GB | No
Fig 3 · Memory budget per model at each precision against this card's 192 GB envelope.
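The arithmetic behind Fig 3 fits in a few lines. A minimal sketch (decimal GB; the headroom and fine-tune factors are our assumptions, not measurements):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB: bytes/param × parameters (decimal GB)."""
    return params_billion * BYTES_PER_PARAM[precision]

def fits_for_inference(params_billion: float, precision: str,
                       vram_gb: float = 192, headroom: float = 1.15) -> bool:
    # ~15% headroom for KV cache, activations, runtime buffers (assumption)
    return weights_gb(params_billion, precision) * headroom <= vram_gb

def finetune_gb(params_billion: float, factor: float = 3.5) -> float:
    # Fine-tuning needs roughly 3-4× the FP16 weights for gradients,
    # optimiser state and activations; 3.5 is a midpoint guess.
    return weights_gb(params_billion, "fp16") * factor
```

For example, `weights_gb(70, "fp16")` gives the 140 GB in the table, and `fits_for_inference(405, "int4")` comes back `False`, matching the "No" in the last row.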
§ 04 · Compare

B200 head-to-heads.

Side-by-side spec tables and matched-quantisation throughput numbers for the comparisons people actually search for.
/hardware/h200-vs-b200

H200 vs B200

Hopper’s last word against Blackwell’s first. 4.5× the FP16 throughput and roughly two-thirds more memory bandwidth (8 TB/s vs 4.8 TB/s).

Read next

Three places to go from here.

Hub
Hardware register
Every accelerator on the leaderboard, with FP16 TFLOPS, VRAM, $/hr, and energy cost in one place.
Per-chip page
RTX 5090
First consumer card with 32 GB. The ceiling for a single-PSU workstation.
Per-chip page
RTX 4090
Still the workhorse: 24 GB GDDR6X, $0.29/hr on Vast.ai spot.