Which AI model fits your GPU?

A practical answer to the local-LLM question: what can I run cleanly on a 3060, 3090, 4090, 5090, A100, H100, H200, B200, or MI300X? For each GPU we pick the highest-scoring current open-weight model that fits cleanly at a realistic quantization and context length. Benchmark-first, not parameter-first.

Copper marks the Qwen-family recommendation that fits each card cleanly; charcoal marks the larger frontier MoE class. Bubble size is a relative throughput (TPS) estimate; quality is benchmark-informed, not a single-number ground truth.

01 / Matrix

Card to model fit.

For each GPU, the pick is the highest-scoring current open-weight model that fits cleanly at a realistic quant and context. Picks are optimized for one local user or one small service; multi-user serving changes the answer because batching and KV cache dominate.

GPU	VRAM	Recommended pick	Quant / context	Fit	Benchmark anchor
RTX 3060 12GB	12 GB	Qwen3-8B	Q5/Q6 GGUF, 16k-32k practical	comfortable	Qwen3 family benchmarked as a major step over Qwen2.5; strongest general/reasoning profile per parameter in the small open-weight class.
RTX 4060 Ti 16GB	16 GB	Qwen3-14B	Q4/Q5 GGUF, 16k-32k practical	tight	Qwen3-14B is the stronger current small-mid baseline; clearly ahead of legacy Mistral/Llama 8B-12B rows on reasoning and coding.
RTX 5080 16GB	16 GB	Qwen3-14B	Q4/Q5 GGUF or EXL2	tight	Same model ceiling as the 4060 Ti: Qwen3-14B. More compute does not create VRAM.
RTX 3090 24GB	24 GB	Qwen3.6-35B-A3B Q4	Q4 GGUF, modest context	tight	MMLU-Pro 85.6 BF16 / 85.0 NVFP4 · GPQA Diamond 84.9 / 84.8 · SciCode 40.8 / 40.6 · AIME 2025 89.2 / 88.8 (NVIDIA Qwen3.6-35B-A3B-NVFP4 card).
RTX 4090 24GB	24 GB	Qwen3.6-35B-A3B Q4 / EXL2	Q4 GGUF or EXL2, modest context	tight	MMLU-Pro 85.6 / 85.0 · GPQA Diamond 84.9 / 84.8 · AIME 2025 89.2 / 88.8. Same score class as the 3090, much faster delivery.
RTX 5090 32GB	32 GB	Qwen3.6-35B-A3B (higher quant)	Q5-ish / FP4 where supported, 32k-64k practical	comfortable	Same Qwen3.6-35B-A3B score profile (MMLU-Pro 85.6/85.0, GPQA Diamond 84.9/84.8); NVFP4 loses little vs BF16, which matters for Blackwell-era deployment.
A100 40GB	40 GB	Qwen3.6-35B-A3B (FP8/INT8 or high-quality 4-bit)	FP8, INT8, or high-quality 4-bit; BF16 weights (about 70 GB for 35B parameters) do not fit	comfortable at 4-bit, tight at 8-bit	MMLU-Pro 85.6 BF16 / 85.0 NVFP4 · GPQA Diamond 84.9 / 84.8 · AIME 2025 89.2 / 88.8.
A100 80GB	80 GB	Qwen3.6-35B-A3B serving, or a modern 70B/72B only if it wins your evals	FP8, INT8, or high-quality 4-bit	comfortable	Qwen3.6-35B-A3B: MMLU-Pro 85.6/85.0, GPQA Diamond 84.9/84.8. The AA Intelligence Index is currently led by Kimi K2.6, MiMo-V2.5-Pro, and DeepSeek V4 Pro: generation beats parameter count.
H100 80GB	80 GB	Qwen3.6-35B-A3B high-throughput, or a modern 70B-class model	FP8, INT8, tensor-parallel, or MoE routing	comfortable	Qwen3.6 NVFP4 vs BF16 shows low degradation (MMLU-Pro 85.0 vs 85.6), good for Hopper/Blackwell-style quantized serving.
H200 141GB	141 GB	Kimi K2.6 / GLM-5 / MiniMax-M2-class (quantized or sharded)	FP8, INT8, tensor-parallel, or MoE routing	tight	Kimi K2.6: SWE-bench Verified 80.2 · LiveCodeBench v6 89.6 · AIME 2026 96.4 · HMMT 2026 92.7 (model card).
B200 192GB	192 GB	GLM-5 / Kimi K2.6 / MiniMax-M2/M3-class	FP4/FP8, tensor parallel, or provider-native quantization	comfortable	GLM-5: GPQA-Diamond 86.0 · SWE-bench Verified 77.8 · SWE-bench Multilingual 73.3. Kimi K2.6: LiveCodeBench v6 89.6 · SWE-bench Verified 80.2.
MI300X 192GB	192 GB	GLM-5 / Kimi K2.6 / MiniMax-M2-class	FP8/INT8 where supported, or runtime-specific quantization	comfortable	MiniMax-M2 claims #1 open-source global composite by Artificial Analysis at release; verify against your own target benchmark.
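
The "fits cleanly" judgment in the table can be approximated with simple arithmetic. The sketch below is a rough estimate, not the method used to build the table: the bits-per-weight figure for a Q4-class quant, the 1.5 GB runtime overhead, and the attention shape in `ARCH` are all assumptions chosen for illustration, and `ARCH` is not the real configuration of any model named above.

```python
def weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); parameters in billions."""
    return total_params_b * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, sequences: int = 1,
                bytes_per_value: int = 2) -> float:
    """KV cache for standard attention: one K and one V vector
    per layer, per token, per concurrent sequence."""
    values = 2 * layers * kv_heads * head_dim * context_tokens * sequences
    return values * bytes_per_value / 1e9


def fits(vram_gb: float, weights_gb: float, kv_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if weights, KV cache, and an assumed runtime overhead stay within VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb


# 35B total parameters, as in the 24 GB rows of the table.
bf16 = weight_gb(35, 16)   # 70.0 GB
q4 = weight_gb(35, 4.5)    # 19.6875 GB, assuming ~4.5 bits per weight

# Hypothetical attention shape for illustration only.
ARCH = dict(layers=48, kv_heads=4, head_dim=128)
kv_8k = kv_cache_gb(context_tokens=8_192, **ARCH)
kv_32k = kv_cache_gb(context_tokens=32_768, **ARCH)
kv_8k_x8 = kv_cache_gb(context_tokens=8_192, sequences=8, **ARCH)

print(fits(24, q4, kv_8k))      # one user, 8k context
print(fits(24, q4, kv_32k))     # one user, 32k context
print(fits(24, q4, kv_8k_x8))   # eight concurrent users, 8k context each
```

Under these assumptions only the first case fits on a 24 GB card, which is why the 24 GB rows say "modest context" and why multi-user serving changes the answer.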

02 / Rules

Score, then fit, then freshness.

The decision variable is the best public benchmark score among models that fit cleanly at the target quant and target context, not the largest model that fits.

Benchmarks decide the pick

Benchmarks are weighed in this order: coding/agents (SWE-bench Verified, LiveCodeBench, SciCode, Terminal-Bench, τ²-Bench), then reasoning/math (AIME, HMMT, GPQA Diamond, MATH-500), then general knowledge (MMLU-Pro, HLE, AA Intelligence Index). Use MMLU-Pro, not the original MMLU.

Fits is not wins

A 70B model may fit 80GB. That does not make it the best model for that card. The recommendation is the highest-scoring model that fits cleanly, not the largest one that physically loads.
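
As a minimal sketch of that rule: filter by fit first, then take the best score, never the largest footprint. The `Candidate` type, the `headroom_gb` margin, and every name and number in `CANDIDATES` are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    required_gb: float  # weights + KV cache + overhead at the target quant and context
    score: float        # score on the one benchmark you actually care about


def pick(candidates: Iterable[Candidate], vram_gb: float,
         headroom_gb: float = 1.0) -> Optional[Candidate]:
    """Highest-scoring candidate that fits with headroom, or None if nothing fits."""
    fitting = [c for c in candidates if c.required_gb + headroom_gb <= vram_gb]
    return max(fitting, key=lambda c: c.score, default=None)


# Placeholder entries for illustration only.
CANDIDATES = [
    Candidate("older-dense-70b", required_gb=45.0, score=70.0),
    Candidate("newer-moe-35b", required_gb=24.0, score=80.0),
    Candidate("small-8b", required_gb=8.0, score=60.0),
]

best_80 = pick(CANDIDATES, vram_gb=80)  # both large models fit; the higher score wins
best_12 = pick(CANDIDATES, vram_gb=12)  # only the 8B fits
best_4 = pick(CANDIDATES, vram_gb=4)    # nothing fits
print(best_80, best_12, best_4)
```

The largest model that loads on the 80 GB card is the 70B placeholder, but the pick is the smaller model with the higher score.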

Penalize benchmark age

A 2026 model with strong evidence beats a 2025 model, which in turn beats a 2024 model, unless the older one still wins the exact target benchmark. This retires Llama 3.1 as a default on evidence, not ideology.
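
One way to read this rule as a pairwise comparison is sketched below. It is an interpretation, not a specification: the `Model` type, the benchmark keys, and the example scores are hypothetical, and treating a missing score as "no win on that benchmark" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Model:
    name: str
    year: int
    scores: Dict[str, float] = field(default_factory=dict)


def prefer(a: Model, b: Model, benchmark: str) -> Model:
    """Prefer whichever model wins the exact target benchmark;
    if that comparison is unavailable or tied, prefer the newer release."""
    sa, sb = a.scores.get(benchmark), b.scores.get(benchmark)
    if sa is not None and sb is not None and sa != sb:
        return a if sa > sb else b
    return a if a.year >= b.year else b


# Placeholder models and scores for illustration only.
older = Model("older-70b", 2024, {"target-bench": 71.0})
newer = Model("newer-35b", 2026, {"target-bench": 68.0, "other-bench": 80.0})

print(prefer(older, newer, "target-bench").name)  # older still wins the exact benchmark
print(prefer(older, newer, "other-bench").name)   # no head-to-head score, so newer wins
```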

MoE sizing warning

For MoE models, active parameters estimate compute per token, not total VRAM requirement. The full expert weight set must live in GPU memory, CPU memory, or tensor-parallel shards. Do not size hardware from active-parameter count alone. Kimi K2 is the canonical example: ~1T total parameters with ~32B activated, so "32B active" does not mean it fits like a dense 32B model.
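
The arithmetic behind this warning, using the approximate Kimi K2 figures quoted above (~1T total, ~32B active), can be sketched as follows. The 4-bit quantization level and the 192 GB card size are illustrative assumptions, and the GPU count ignores KV cache and runtime overhead, so treat it as a lower bound.

```python
import math


def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float) -> dict:
    """Split an MoE model into what must be resident and what is read per token.
    Parameter counts are in billions; results are in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return {
        "resident_weights_gb": total_params_b * bytes_per_weight,
        "active_weights_gb": active_params_b * bytes_per_weight,
    }


def min_gpus_for_weights(weights_gb: float, vram_per_gpu_gb: float) -> int:
    """Lower bound on GPU count from weights alone (no KV cache, no overhead)."""
    return math.ceil(weights_gb / vram_per_gpu_gb)


k2 = moe_footprint(total_params_b=1000, active_params_b=32, bits_per_weight=4)
print(k2["resident_weights_gb"])  # 500.0
print(k2["active_weights_gb"])    # 16.0

wrong = min_gpus_for_weights(k2["active_weights_gb"], 192)    # sizing from active params
right = min_gpus_for_weights(k2["resident_weights_gb"], 192)  # sizing from total params
print(wrong, right)  # 1 3
```

Sizing from the active-parameter count suggests a single card; sizing from the full expert set shows that even at 4 bits the weights must be sharded or partly offloaded.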

03 / Community

Edit the card, then leave evidence.

The editable note is local-first for speed. The comment form sends corrections into the existing feedback queue so this can become a moderated public table later.

GPU cards

Current pick

Qwen3-8B

RTX 3060 12GB - Q5/Q6 GGUF, 16k-32k practical - comfortable fit

12 GB

Use it for

chat, coding helper, RAG, long prompts at modest context

Alternates

Qwen3-4B for longer context
Llama 3.1 8B (legacy fallback)
Mistral 7B (very low-resource fallback)

Avoid

Do not make a 12GB card your main 30B+ box. It can limp with heavy CPU offload, but the experience is usually worse than a clean, current 8B that fits.

04 / Evidence notes

This table is benchmark-first, not parameter-first. For each GPU, the recommended model is the highest-scoring current open-weight model that fits cleanly at a realistic quantization and context length. Larger is not automatically better: a 2026 35B MoE or 32B reasoning model can be a better recommendation than a 2024 70B if it wins the relevant benchmarks. Legacy models such as Llama 3.1 remain compatibility baselines, not default recommendations. Hardware rows are tied to the CodeSOTA hardware table where available.

Sources: NVIDIA Qwen3.6-35B-A3B-NVFP4 eval card (MMLU-Pro 85.6/85.0, GPQA Diamond 84.9/84.8, SciCode 40.8/40.6, AIME 2025 89.2/88.8); Qwen3 and Qwen3.6-35B-A3B model cards; Kimi K2.6 model card (SWE-bench Verified 80.2, LiveCodeBench v6 89.6, AIME 2026 96.4, HMMT 2026 92.7); GLM-5 model card (GPQA-Diamond 86.0, SWE-bench Verified 77.8, SWE-bench Multilingual 73.3); MiniMax-M2 release notes; Artificial Analysis Intelligence Index; MMLU-Pro paper (arXiv 2406.01574); NVIDIA GeForce product pages.