Benchmarks decide the pick
Benchmark priority: coding and agents first (SWE-bench Verified, LiveCodeBench, SciCode, Terminal-Bench, τ²-Bench), then reasoning and math (AIME, HMMT, GPQA Diamond, MATH-500), then general knowledge (MMLU-Pro, HLE, AA Intelligence Index). General comparisons use MMLU-Pro, not the older MMLU.
A practical answer to the local-LLM question: what can I run cleanly on a 3060, 3090, 4090, 5090, A100, H100, H200, B200, or MI300X? For each GPU we pick the highest-scoring current open-weight model that fits cleanly at a realistic quantization and context length. Benchmark-first, not parameter-first.
Picks are optimized for one local user or one small service. Multi-user serving changes the answer, because batching and KV-cache growth then dominate memory use.
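To see why concurrency changes the answer, here is a rough sketch of KV-cache sizing. It assumes standard grouped-query attention in every layer and an FP16 cache; the layer count, KV-head count, and head dimension are hypothetical placeholders, not the configuration of any model in the table. Models with hybrid or compressed attention will need a different formula.

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_tokens,
                concurrent_seqs=1, bytes_per_elem=2):
    """Approximate KV-cache size in decimal GB.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    bytes_per_elem=2 assumes an FP16/BF16 cache (an assumption, not a given).
    """
    total_bytes = (2 * layers * kv_heads * head_dim * bytes_per_elem
                   * context_tokens * concurrent_seqs)
    return total_bytes / 1e9


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

single_user = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, context_tokens=32_768)
sixteen_users = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, context_tokens=32_768,
                            concurrent_seqs=16)

print(f"1 sequence at 32k tokens:   {single_user:.2f} GB")
print(f"16 sequences at 32k tokens: {sixteen_users:.2f} GB")
```

Under these assumptions the cache grows linearly with both context length and the number of concurrent sequences, so a model that fits comfortably for one user can run out of memory once requests are batched.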
| GPU | VRAM | Recommended pick | Quant / context | Fit | Benchmark anchor |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | Qwen3-8B | Q5/Q6 GGUF, 16k-32k practical | comfortable | Qwen3 family benchmarked as a major step over Qwen2.5; strongest general/reasoning profile per parameter in the small open-weight class. |
| RTX 4060 Ti 16GB | 16 GB | Qwen3-14B | Q4/Q5 GGUF, 16k-32k practical | tight | Qwen3-14B is the stronger current small-mid baseline; clearly ahead of legacy Mistral/Llama 8B-12B rows on reasoning and coding. |
| RTX 5080 16GB | 16 GB | Qwen3-14B | Q4/Q5 GGUF or EXL2 | tight | Same model ceiling as the 4060 Ti: Qwen3-14B. More compute does not create VRAM. |
| RTX 3090 24GB | 24 GB | Qwen3.6-35B-A3B Q4 | Q4 GGUF, modest context | tight | MMLU-Pro 85.6 BF16 / 85.0 NVFP4 · GPQA Diamond 84.9 / 84.8 · SciCode 40.8 / 40.6 · AIME 2025 89.2 / 88.8 (NVIDIA Qwen3.6-35B-A3B-NVFP4 card). |
| RTX 4090 24GB | 24 GB | Qwen3.6-35B-A3B Q4 / EXL2 | Q4 GGUF or EXL2, modest context | tight | MMLU-Pro 85.6 / 85.0 · GPQA Diamond 84.9 / 84.8 · AIME 2025 89.2 / 88.8. Same score class as the 3090, much faster delivery. |
| RTX 5090 32GB | 32 GB | Qwen3.6-35B-A3B (higher quant) | Q5-ish / FP4 where supported, 32k-64k practical | comfortable | Same Qwen3.6-35B-A3B score profile (MMLU-Pro 85.6/85.0, GPQA Diamond 84.9/84.8); NVFP4 loses little vs BF16, which matters for Blackwell-era deployment. |
| A100 40GB | 40 GB | Qwen3.6-35B-A3B (BF16/FP8/INT8 or high quant) | BF16, FP8, INT8, or high-quality 4-bit | comfortable | MMLU-Pro 85.6 BF16 / 85.0 NVFP4 · GPQA Diamond 84.9 / 84.8 · AIME 2025 89.2 / 88.8. |
| A100 80GB | 80 GB | Qwen3.6-35B-A3B serving, or a modern 70B/72B only if it wins your evals | FP8, INT8, or high-quality 4-bit | comfortable | Qwen3.6-35B-A3B: MMLU-Pro 85.6/85.0, GPQA Diamond 84.9/84.8. AA Intelligence Index currently leads with Kimi K2.6, MiMo-V2.5-Pro, DeepSeek V4 Pro — generation beats parameter count. |
| H100 80GB | 80 GB | Qwen3.6-35B-A3B high-throughput, or a modern 70B-class model | FP8, INT8, tensor-parallel, or MoE routing | comfortable | Qwen3.6 NVFP4 vs BF16 shows low degradation (MMLU-Pro 85.0 vs 85.6), good for Hopper/Blackwell-style quantized serving. |
| H200 141GB | 141 GB | Kimi K2.6 / GLM-5 / MiniMax-M2-class (quantized or sharded) | FP8, INT8, tensor-parallel, or MoE routing | tight | Kimi K2.6: SWE-bench Verified 80.2 · LiveCodeBench v6 89.6 · AIME 2026 96.4 · HMMT 2026 92.7 (model card). |
| B200 | 192 GB | GLM-5 / Kimi K2.6 / MiniMax-M2/M3-class | FP4/FP8, tensor parallel, or provider-native quantization | comfortable | GLM-5: GPQA-Diamond 86.0 · SWE-bench Verified 77.8 · SWE-bench Multilingual 73.3. Kimi K2.6: LiveCodeBench v6 89.6 · SWE-bench Verified 80.2. |
| MI300X 192GB | 192 GB | GLM-5 / Kimi K2.6 / MiniMax-M2-class | FP8/INT8 where supported, or runtime-specific quantization | comfortable | MiniMax-M2 claims #1 open-source global composite by Artificial Analysis at release; verify against your own target benchmark. |
The decision variable is the best public benchmark score among models that fit cleanly at the target quant and target context, not the largest model that fits.
A 70B model may fit 80GB. That does not make it the best model for that card. The recommendation is the highest-scoring model that fits cleanly, not the largest one that physically loads.
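The selection rule can be sketched as a filter followed by a maximum. This is a minimal illustration, not the page's actual tooling: the `Candidate` fields, the `headroom_gb` allowance for KV cache and runtime overhead, and the model names and scores below are all hypothetical placeholders, not real benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    params_b: float         # total parameters, in billions
    bits_per_weight: float  # effective quantization width
    score: float            # score on the one benchmark you care about


def weight_gb(c: Candidate) -> float:
    # billions of parameters * bits / 8 = decimal GB of weights
    return c.params_b * c.bits_per_weight / 8


def pick(candidates, vram_gb, headroom_gb):
    """Return the highest-scoring candidate that fits, or None if none fit."""
    fitting = [c for c in candidates
               if weight_gb(c) + headroom_gb <= vram_gb]
    return max(fitting, key=lambda c: c.score, default=None)


# Placeholder entries: both load on an 80 GB card, but size does not decide.
candidates = [
    Candidate("older-70b-dense", params_b=70, bits_per_weight=8, score=70.0),
    Candidate("newer-35b-moe", params_b=35, bits_per_weight=8, score=80.0),
]

best = pick(candidates, vram_gb=80, headroom_gb=8)
print(best.name if best else "nothing fits cleanly")
```

The key design choice is that size only appears in the fit filter. Once a model fits, only the target benchmark score ranks it.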
A 2026 model with strong evidence beats a 2025 model, which in turn beats a 2024 model, unless the older one still wins the exact target benchmark. This retires Llama 3.1 as a default on the evidence, not on ideology.
For MoE models, active parameters estimate compute per token, not total VRAM requirement. The full expert weight set must live in GPU memory, CPU memory, or tensor-parallel shards. Do not size hardware from active-parameter count alone. Kimi K2 is the canonical example: ~1T total parameters with ~32B activated, so "32B active" does not mean it fits like a dense 32B model.
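The arithmetic behind that warning, as a rough sketch. It uses the approximate Kimi K2 figures quoted above (~1T total, ~32B active) and assumes 4-bit weights for illustration; it ignores KV cache, activations, and quantization metadata, so real requirements are higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    # billions of parameters * bits / 8 = decimal GB of weights
    return params_billion * bits_per_weight / 8


TOTAL_B = 1000.0   # ~1T total parameters (approximate, from the text above)
ACTIVE_B = 32.0    # ~32B activated per token (approximate, from the text above)
BITS = 4           # assumed quantization width, for illustration only

resident = weight_memory_gb(TOTAL_B, BITS)      # what must actually be stored
naive = weight_memory_gb(ACTIVE_B, BITS)        # the misleading "32B" estimate

print(f"Weights that must live somewhere: {resident:.0f} GB")
print(f"Estimate from active params only: {naive:.0f} GB")
```

The gap between the two numbers is the memory that has to be found in GPU VRAM, CPU RAM, or tensor-parallel shards, even though only the active experts contribute compute on any given token.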
RTX 3060 12GB - Q5/Q6 GGUF, 16k-32k practical - comfortable fit
Do not make a 12GB card your main 30B+ box. It can limp with heavy CPU offload, but the experience is usually worse than a clean, current 8B that fits.
This table is benchmark-first, not parameter-first. For each GPU, the recommended model is the highest-scoring current open-weight model that fits cleanly at a realistic quantization and context length. Larger is not automatically better: a 2026 35B MoE or 32B reasoning model can be a better recommendation than a 2024 70B if it wins the relevant benchmarks. Legacy models such as Llama 3.1 remain compatibility baselines, not default recommendations. Hardware rows are tied to the CodeSOTA hardware table where available.
Sources: NVIDIA Qwen3.6-35B-A3B-NVFP4 eval card (MMLU-Pro 85.6/85.0, GPQA Diamond 84.9/84.8, SciCode 40.8/40.6, AIME 2025 89.2/88.8); Qwen3 and Qwen3.6-35B-A3B model cards; Kimi K2.6 model card (SWE-bench Verified 80.2, LiveCodeBench v6 89.6, AIME 2026 96.4, HMMT 2026 92.7); GLM-5 model card (GPQA-Diamond 86.0, SWE-bench Verified 77.8, SWE-bench Multilingual 73.3); MiniMax-M2 release notes; Artificial Analysis Intelligence Index; MMLU-Pro paper (arXiv 2406.01574); NVIDIA GeForce product pages.