Hardware / model fit · Benchmark-first model fit guide · Updated June 3, 2026
Local inference - VRAM first - community notes

Which AI model fits your GPU?

A practical answer to the local-LLM question: what can I run cleanly on a 3060, 3090, 4090, 5090, A100, H100, H200, B200, or MI300X? For each GPU we pick the highest-scoring current open-weight model that fits cleanly at a realistic quantization and context length. Benchmark-first, not parameter-first.

Figure: Benchmark-first model choice by GPU VRAM (Codesota). The highest-scoring open-weight model that fits each card cleanly; quality is benchmark-informed, TPS is a relative estimate. X axis: GPU VRAM (GB), 12 to 192, split into a single-GPU local region and a serving / large MoE region. Y axis: composite quality index (0–100). Bubble size: TPS index. Source: codesota.com/hardware/best-model-by-gpu
Copper marks the Qwen-family recommendation that fits each card cleanly; charcoal marks the larger frontier MoE class. Bubble size is a relative throughput (TPS) estimate; quality is benchmark-informed, not a single-number ground truth.
01 / Matrix

Card to model fit.

For each GPU, the pick is the highest-scoring current open-weight model that fits cleanly at a realistic quant and context. Picks are optimized for one local user or one small service; multi-user serving changes the answer because batching and KV cache dominate.
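To see why batching and KV cache dominate in multi-user serving, a rough KV-cache estimate helps. The sketch below assumes a standard transformer with grouped-query attention and an FP16 cache. The model shape (40 layers, 8 KV heads, head dimension 128) is hypothetical and not taken from any model card.

```python
def kv_cache_gib(layers, kv_heads, head_dim, context_tokens,
                 sequences=1, bytes_per_value=2):
    """Approximate KV-cache size in GiB: keys and values for every layer and token."""
    # Factor 2 = one key vector plus one value vector per layer, per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * sequences / 2**30

# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=40, kv_heads=8, head_dim=128)

single_user = kv_cache_gib(context_tokens=32_768, sequences=1, **shape)
eight_users = kv_cache_gib(context_tokens=32_768, sequences=8, **shape)
print(f"1 sequence: {single_user:.1f} GiB, 8 sequences: {eight_users:.1f} GiB")
# prints "1 sequence: 5.0 GiB, 8 sequences: 40.0 GiB"
```

Under these assumptions the cache grows linearly with concurrent sequences, so a card that is comfortable for one user can run out of memory at a small batch size even though the weights have not changed.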

| GPU | VRAM | Recommended pick | Quant / context | Fit | Benchmark anchor |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | Qwen3-8B | Q5/Q6 GGUF, 16k-32k practical | comfortable | Qwen3 family benchmarked as a major step over Qwen2.5; strongest general/reasoning profile per parameter in the small open-weight class. |
| RTX 4060 Ti 16GB | 16 GB | Qwen3-14B | Q4/Q5 GGUF, 16k-32k practical | tight | Qwen3-14B is the stronger current small-mid baseline; clearly ahead of legacy Mistral/Llama 8B-12B rows on reasoning and coding. |
| RTX 5080 16GB | 16 GB | Qwen3-14B | Q4/Q5 GGUF or EXL2 | tight | Same model ceiling as the 4060 Ti: Qwen3-14B. More compute does not create VRAM. |
| RTX 3090 24GB | 24 GB | Qwen3.6-35B-A3B Q4 | Q4 GGUF, modest context | tight | MMLU-Pro 85.6 BF16 / 85.0 NVFP4 · GPQA Diamond 84.9 / 84.8 · SciCode 40.8 / 40.6 · AIME 2025 89.2 / 88.8 (NVIDIA Qwen3.6-35B-A3B-NVFP4 card). |
| RTX 4090 24GB | 24 GB | Qwen3.6-35B-A3B Q4 / EXL2 | Q4 GGUF or EXL2, modest context | tight | MMLU-Pro 85.6 / 85.0 · GPQA Diamond 84.9 / 84.8 · AIME 2025 89.2 / 88.8. Same score class as the 3090, much faster delivery. |
| RTX 5090 32GB | 32 GB | Qwen3.6-35B-A3B (higher quant) | Q5-ish / FP4 where supported, 32k-64k practical | comfortable | Same Qwen3.6-35B-A3B score profile (MMLU-Pro 85.6/85.0, GPQA Diamond 84.9/84.8); NVFP4 loses little vs BF16, which matters for Blackwell-era deployment. |
| A100 40GB | 40 GB | Qwen3.6-35B-A3B (BF16/FP8/INT8 or high quant) | BF16, FP8, INT8, or high-quality 4-bit | comfortable | MMLU-Pro 85.6 BF16 / 85.0 NVFP4 · GPQA Diamond 84.9 / 84.8 · AIME 2025 89.2 / 88.8. |
| A100 80GB | 80 GB | Qwen3.6-35B-A3B serving, or a modern 70B/72B only if it wins your evals | FP8, INT8, or high-quality 4-bit | comfortable | Qwen3.6-35B-A3B: MMLU-Pro 85.6/85.0, GPQA Diamond 84.9/84.8. AA Intelligence Index currently leads with Kimi K2.6, MiMo-V2.5-Pro, DeepSeek V4 Pro; generation beats parameter count. |
| H100 80GB | 80 GB | Qwen3.6-35B-A3B high-throughput, or a modern 70B-class model | FP8, INT8, tensor-parallel, or MoE routing | comfortable | Qwen3.6 NVFP4 vs BF16 shows low degradation (MMLU-Pro 85.0 vs 85.6), good for Hopper/Blackwell-style quantized serving. |
| H200 141GB | 141 GB | Kimi K2.6 / GLM-5 / MiniMax-M2-class (quantized or sharded) | FP8, INT8, tensor-parallel, or MoE routing | tight | Kimi K2.6: SWE-bench Verified 80.2 · LiveCodeBench v6 89.6 · AIME 2026 96.4 · HMMT 2026 92.7 (model card). |
| B200 | 192 GB | GLM-5 / Kimi K2.6 / MiniMax-M2/M3-class | FP4/FP8, tensor parallel, or provider-native quantization | comfortable | GLM-5: GPQA-Diamond 86.0 · SWE-bench Verified 77.8 · SWE-bench Multilingual 73.3. Kimi K2.6: LiveCodeBench v6 89.6 · SWE-bench Verified 80.2. |
| MI300X 192GB | 192 GB | GLM-5 / Kimi K2.6 / MiniMax-M2-class | FP8/INT8 where supported, or runtime-specific quantization | comfortable | MiniMax-M2 claims #1 open-source global composite by Artificial Analysis at release; verify against your own target benchmark. |
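A back-of-envelope check on the quant column: weight footprint is roughly total parameters times effective bits per weight. The sketch below is not how the table was produced. The bits-per-weight values are assumed averages for illustration (real quantized files vary by recipe), and the parameter counts are read off the model names.

```python
# Assumed effective bits per weight; illustrative only, real files vary by quant recipe.
BITS_PER_WEIGHT = {"BF16": 16.0, "Q8": 8.5, "Q6": 6.6, "Q5": 5.7, "Q4": 4.8}

def weights_gb(params_billions, quant):
    """Approximate weight footprint in GB (1e9 bytes)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

def headroom_gb(params_billions, quant, vram_gb):
    """VRAM left over for KV cache, activations, and runtime overhead."""
    return vram_gb - weights_gb(params_billions, quant)

# (card VRAM in GB, total parameters in billions, quant)
examples = [(12, 8, "Q6"), (16, 14, "Q5"), (24, 35, "Q4")]
for vram, params, quant in examples:
    w = weights_gb(params, quant)
    h = headroom_gb(params, quant, vram)
    print(f"{params}B at {quant} on {vram} GB: ~{w:.1f} GB weights, ~{h:.1f} GB headroom")
```

Whatever headroom remains has to cover the KV cache at the target context, which is why the 24 GB rows pair Q4 with modest context.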
02 / Rules

Score, then fit, then freshness.

The decision variable is the best public benchmark score among models that fit cleanly at the target quant and target context, not the largest model that fits.

1. Benchmarks decide the pick

Priority order: coding and agents (SWE-bench Verified, LiveCodeBench, SciCode, Terminal-Bench, τ²-Bench), then reasoning and math (AIME, HMMT, GPQA Diamond, MATH-500), then general knowledge (MMLU-Pro, HLE, AA Intelligence Index). Use MMLU-Pro, not the older MMLU.

2. Fits is not wins

A 70B model may fit 80GB. That does not make it the best model for that card. The recommendation is the highest-scoring model that fits cleanly, not the largest one that physically loads.

3. Penalize benchmark age

A 2026 model with strong evidence beats a 2025 model beats a 2024 model, unless the older one still wins the exact target benchmark. This retires Llama 3.1 as a default without ideology.
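The three rules can be sketched as a filter followed by a ranking. This is one reading of them, not the page's actual selection code: it treats release year purely as a tie-breaker after the target-benchmark score. The `Candidate` type, the model names, and every number below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    score: float      # score on the exact target benchmark
    needed_gb: float  # weights plus KV cache at the target quant and context
    year: int         # release year, used only to break score ties

def pick(candidates, vram_gb):
    """Keep models that fit, then rank by benchmark score, then by recency."""
    fitting = [c for c in candidates if c.needed_gb <= vram_gb]
    return max(fitting, key=lambda c: (c.score, c.year), default=None)

# Made-up candidates for illustration only.
candidates = [
    Candidate("older-70b-dense", score=71.0, needed_gb=42.0, year=2024),
    Candidate("newer-35b-moe", score=85.0, needed_gb=23.0, year=2026),
    Candidate("frontier-moe", score=90.0, needed_gb=600.0, year=2026),
]

best = pick(candidates, vram_gb=80)
print(best.name)  # prints "newer-35b-moe"
```

On the hypothetical 80 GB card the 70B model fits but loses on score, and the frontier MoE wins on score but does not fit, so the 35B MoE is picked.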

MoE sizing warning

For MoE models, active parameters estimate compute per token, not total VRAM requirement. The full expert weight set must live in GPU memory, CPU memory, or tensor-parallel shards. Do not size hardware from active-parameter count alone. Kimi K2 is the canonical example: ~1T total parameters with ~32B activated, so "32B active" does not mean it fits like a dense 32B model.
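The gap is easy to put in numbers. The sketch below uses the approximate Kimi K2 figures quoted above (about 1T total, about 32B active) and assumes a flat 4 bits per weight with no quantization overhead, which is a simplification.

```python
def weight_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB (1e9 bytes)."""
    return params_billions * bits_per_weight / 8

TOTAL_B, ACTIVE_B = 1000, 32  # approximate figures from the text above
BITS = 4                      # assumed flat 4-bit quantization, overhead ignored

naive = weight_gb(ACTIVE_B, BITS)   # what "32B active" seems to imply
actual = weight_gb(TOTAL_B, BITS)   # what must actually be resident somewhere
print(f"sized from active params: {naive:.0f} GB; sized from total params: {actual:.0f} GB")
# prints "sized from active params: 16 GB; sized from total params: 500 GB"
```

Under these assumptions the weights alone come to roughly 500 GB, so they have to be spread across tensor-parallel shards or partly held in CPU memory.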

03 / Community

Edit the card, then leave evidence.

The editable note is local-first for speed. The comment form sends corrections into the existing feedback queue so this can become a moderated public table later.

GPU cards
Current pick

Qwen3-8B

RTX 3060 12GB - Q5/Q6 GGUF, 16k-32k practical - comfortable fit

12 GB

Use it for
chat, coding helper, RAG, long prompts at modest context
Alternates
  • Qwen3-4B for longer context
  • Llama 3.1 8B (legacy fallback)
  • Mistral 7B (very low-resource fallback)
Avoid

Do not make a 12GB card your main 30B+ box. It can limp with heavy CPU offload, but the experience is usually worse than a clean, current 8B that fits.


04 / Evidence notes

This table is benchmark-first, not parameter-first. For each GPU, the recommended model is the highest-scoring current open-weight model that fits cleanly at a realistic quantization and context length. Larger is not automatically better: a 2026 35B MoE or 32B reasoning model can be a better recommendation than a 2024 70B if it wins the relevant benchmarks. Legacy models such as Llama 3.1 remain compatibility baselines, not default recommendations. Hardware rows are tied to the CodeSOTA hardware table where available.

Sources: NVIDIA Qwen3.6-35B-A3B-NVFP4 eval card (MMLU-Pro 85.6/85.0, GPQA Diamond 84.9/84.8, SciCode 40.8/40.6, AIME 2025 89.2/88.8); Qwen3 and Qwen3.6-35B-A3B model cards; Kimi K2.6 model card (SWE-bench Verified 80.2, LiveCodeBench v6 89.6, AIME 2026 96.4, HMMT 2026 92.7); GLM-5 model card (GPQA-Diamond 86.0, SWE-bench Verified 77.8, SWE-bench Multilingual 73.3); MiniMax-M2 release notes; Artificial Analysis Intelligence Index; MMLU-Pro paper (arXiv 2406.01574); NVIDIA GeForce product pages.
