Which local model should OpenClaw or Hermes run on your hardware?
The MMBT repo is not a clean leaderboard. It is better than that for operators: it shows where local agent models ship, loop, fabricate, stall, and saturate a real workstation. This page turns those receipts into a routing guide for OpenClaw, Hermes Agent, and similar desktop or persistent-agent stacks.
95.8%
27B-no-think done-signal rate across the Phase-B grid.
2.27x
Coder-Next single-stream tok/s over dense 27B.
500 W
Validated LLM serving cap on the measured Blackwell rig.
0/10
Coder-Next market-research ship rate at N=10.
Short answer
If you want one local default
Use Qwen3.6-27B-AWQ no-think. It is the best supported default from this evidence base because it ships the most often and avoids Coder-Next's dangerous fabrication pattern.
If you run many users
Add Qwen3-Coder-Next-AWQ for high-throughput tasks with a bounded output shape. On the measured rig it served about 49 comfortable users per card at 500 W, versus about 26 for dense 27B.
If the task is high-stakes
Do not consume either local model single-shot. Route through a verifier, a human review step, or a cloud model with stronger long-horizon evidence.
Route by workload
For OpenClaw or Hermes, the right unit is not "the best model." It is a router policy: pick the model that matches the failure cost of each tool call.
| Workload | Best local pick | Why | Avoid | MMBT receipt |
|---|---|---|---|---|
| Default local agent work | Qwen3.6-27B-AWQ, no-think | Best overall ship-rate signal in the repo: 113/118 done-signal rate, or 95.8%, across the 12-cell Phase-B grid. | Do not treat done-signal as fully graded PASS. The repo flags the no-think PASS sweep as pending. | microbench-phase-b-2026-05-02 |
| Security review, factual review, hallucination-sensitive tasks | Qwen3.6-27B-AWQ, no-think or thinking | The 27B family is the safer pick where false positives are expensive. No-think shipped 10/10 on adversarial hallucination; thinking-mode shipped fewer runs but kept the same clean accuracy profile when it shipped. | Avoid single-shot Coder-Next for high-stakes verdicts. MMBT documents fabricated technical evidence in PR-audit runs. | p2_hallucination, dreamserver-1-pr-audit |
| Market research with live citations | Qwen3.6-27B-AWQ, thinking | 8/10 ship rate at N=10 on p3_market, while Coder-Next was 0/10. The sampled failure mode for 27B was URL drift, not fabricated facts. | Avoid Coder-Next for autonomous web research. The 0/10 result is a stable failure shape in the MMBT data. | p3_market |
| Bounded memos, document synthesis, support triage | Qwen3-Coder-Next-AWQ | Fastest and cheapest when the output shape is bounded. It shipped 10/10 on p3_business and p3_doc, and was the strongest support-triage classifier in the published microbench. | Use a verifier before consuming factual claims. Coder-Next is good at shaped output, but it is not consistently truthful. | p3_business, p3_doc, p2_triage |
| Long-horizon unattended agents | None of the local arms as a single shot | On the 75-PR audit, 27B produced mostly template stubs and Coder-Next produced no usable deliverable across repeated attempts. | Do not run a persistent desktop agent for hours without checkpoints, validators, and task decomposition. | dreamserver-75-pr-audit |
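The routing table above can be read as a lookup from workload class to model, with a verifier flag for the risky rows. Here is a minimal sketch of that idea. The model names and receipt labels come from the table; the workload keys, the `Route` type, and `pick_route` are hypothetical names for illustration, not part of OpenClaw, Hermes, or MMBT.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Model names as they appear in the routing table.
MODEL_27B_NO_THINK = "Qwen3.6-27B-AWQ (no-think)"
MODEL_27B_THINKING = "Qwen3.6-27B-AWQ (thinking)"
MODEL_CODER_NEXT = "Qwen3-Coder-Next-AWQ"


@dataclass(frozen=True)
class Route:
    model: Optional[str]   # None means: no local model as a single shot
    needs_verifier: bool   # True means: do not consume the output unchecked
    receipt: str           # MMBT receipt named in the table


# Hypothetical workload keys, one per table row.
ROUTES = {
    "default": Route(MODEL_27B_NO_THINK, False, "microbench-phase-b-2026-05-02"),
    "hallucination_sensitive": Route(
        MODEL_27B_NO_THINK, False, "p2_hallucination, dreamserver-1-pr-audit"
    ),
    "market_research": Route(MODEL_27B_THINKING, False, "p3_market"),
    "bounded_output": Route(MODEL_CODER_NEXT, True, "p3_business, p3_doc, p2_triage"),
    "long_horizon": Route(None, True, "dreamserver-75-pr-audit"),
}


def pick_route(workload: str, high_stakes: bool = False) -> Route:
    """Return the table's pick; unknown workloads fall back to the default row."""
    route = ROUTES.get(workload, ROUTES["default"])
    if high_stakes:
        # High-stakes work is never consumed single-shot, and the table
        # says to avoid Coder-Next for high-stakes verdicts.
        model = MODEL_27B_NO_THINK if route.model == MODEL_CODER_NEXT else route.model
        route = replace(route, model=model, needs_verifier=True)
    return route


if __name__ == "__main__":
    print(pick_route("market_research"))
    print(pick_route("bounded_output", high_stakes=True))
```

A `model` of `None` is a signal to decompose the task or route it to a cloud model, not to retry locally.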
Route by hardware
Only the Blackwell workstation row is directly measured by MMBT. The other rows are deployment guidance derived from the repo's own validity boundaries, especially its warning that 24 GB, 48 GB, Mac MLX, and non-AWQ quantizations are not characterized.
Laptop CPU / small GPU
Not recommended: CPU-only, 8-16 GB RAM, low-end iGPU
Use cloud routing for OpenClaw/Hermes. Local models in this evidence set are not appropriate here.
Good control plane, weak local inference plane.
Consumer 24 GB GPU
Inferred: RTX 3090 / 4090 class, single 24 GB card
Start with a dense 27B quant at shorter context, then route hard or long tasks to cloud. Avoid Coder-Next unless you accept CPU offload and much lower throughput.
Usable for local drafts and constrained tools; not enough headroom for the published 262K-context MMBT setup.
Single 48 GB GPU
Inferred: RTX 6000 Ada / RTX PRO 5000-class memory tier
Use 27B-no-think as the default local worker. Add thinking-mode for research/provenance tasks. Keep Coder-Next as an experiment, not the default router.
Good practical floor for a local-first agent with cloud fallback.
Dual 48 GB or 96 GB workstation
Inferred: 2 GPUs with enough combined VRAM for vLLM serving and long context
Run a mixed router: 27B-no-think for safe default work, 27B-thinking for research, Coder-Next for cheap shaped output and high concurrency.
Best current shape for a serious local OpenClaw/Hermes install.
Measured MMBT rig
Measured: 2x RTX PRO 6000 Blackwell, 96 GB each, vLLM, Cyankiwi 4-bit AWQ, 500 W cap
Use task routing, not a single winner. 500 W is enough; raising power does not materially improve LLM serving.
The only fully measured operating point behind these recommendations.
Mac M-series unified memory
Inferred: M2/M3/M4 Pro, Max, Ultra with 32-128 GB unified memory
Treat as a sibling experiment. MMBT notes that MoE Coder-Next may look better on Mac because dense 27B compute can become the bottleneck, but this was not measured.
Promising for quiet local agents, but needs MLX-specific benchmark runs.
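The hardware tiers above can be approximated as a function of total VRAM. The sketch below is one possible reading, not something MMBT publishes: the numeric cutoffs (24, 48, 96 GB), the `TierAdvice` type, and the `advise` function are assumptions made for illustration. Every tier it returns is inferred guidance; the only measured point is the 2x RTX PRO 6000 Blackwell rig, which this function does not try to detect.

```python
from typing import NamedTuple


class TierAdvice(NamedTuple):
    tier: str
    evidence: str   # "not recommended" or "inferred", matching the card labels
    advice: str


def advise(total_vram_gb: float, apple_unified_memory: bool = False) -> TierAdvice:
    """Map total GPU memory to the closest hardware card on this page.

    The cutoffs are assumptions; multi-GPU setups are judged by combined VRAM.
    """
    if apple_unified_memory:
        return TierAdvice(
            "Mac M-series unified memory", "inferred",
            "Treat as a sibling experiment; needs MLX-specific benchmark runs.",
        )
    if total_vram_gb < 24:
        return TierAdvice(
            "Laptop CPU / small GPU", "not recommended",
            "Use cloud routing; keep the machine as the control plane.",
        )
    if total_vram_gb < 48:
        return TierAdvice(
            "Consumer 24 GB GPU", "inferred",
            "Dense 27B quant at shorter context; send hard or long tasks to cloud.",
        )
    if total_vram_gb < 96:
        return TierAdvice(
            "Single 48 GB GPU", "inferred",
            "27B-no-think as default; thinking mode for research; Coder-Next as an experiment.",
        )
    return TierAdvice(
        "Dual 48 GB or 96 GB workstation", "inferred",
        "Mixed router: 27B-no-think, 27B-thinking, and Coder-Next for shaped output.",
    )


if __name__ == "__main__":
    for vram in (16, 24, 48, 96):
        print(vram, advise(vram).tier)
```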
Serving capacity
The hardware sweep makes Coder-Next attractive as a serving-capacity model even though it is not the safest model for truthfulness.
| Model | Notes | N=1 peak | N=32 peak | Comfortable users @ 50 tok/s |
|---|---|---|---|---|
| Qwen3.6-27B-AWQ | Better safety/research behavior, lower serving capacity. | 72.1 tok/s | 1382.1 tok/s | ~26 users |
| Qwen3-Coder-Next-AWQ | Much higher capacity, but needs routing away from high-stakes truthfulness tasks. | 163.3 tok/s | 2472.8 tok/s | ~49 users |
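A quick sanity check on the capacity figures is to divide the batched peak by the 50 tok/s per-user target. The sketch below assumes the N=32 peak is aggregate throughput across all 32 streams, which the table does not state explicitly. It gives an upper bound only. The published ~26 and ~49 sit at or below these ceilings, presumably because they come from measured concurrency points rather than from this division.

```python
import math

TARGET_TOK_S = 50.0  # per-user rate treated as "comfortable" in the table

# N=32 peak tok/s from the serving-capacity table (assumed to be aggregate).
N32_PEAK = {
    "Qwen3.6-27B-AWQ": 1382.1,
    "Qwen3-Coder-Next-AWQ": 2472.8,
}


def per_stream_rate(aggregate_tok_s: float, streams: int = 32) -> float:
    """Average tok/s each stream sees if throughput is split evenly."""
    return aggregate_tok_s / streams


def user_ceiling(aggregate_tok_s: float, target: float = TARGET_TOK_S) -> int:
    """Upper bound on users who could each get `target` tok/s."""
    return math.floor(aggregate_tok_s / target)


if __name__ == "__main__":
    for model, peak in N32_PEAK.items():
        print(
            f"{model}: {per_stream_rate(peak):.1f} tok/s per stream at N=32, "
            f"ceiling {user_ceiling(peak)} users at {TARGET_TOK_S:.0f} tok/s"
        )
```

Under that assumption, dense 27B is already below 50 tok/s per stream at N=32, while Coder-Next still has headroom.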
Power cap read
For LLM serving, 500 W is already on the plateau. Save 600 W for compute-bound diffusion workloads, not Hermes/OpenClaw chat-serving.
| Cap | Dense 27B | Coder-Next | Note |
|---|---|---|---|
| 600 W | Ties plateau; native draw about 511 W single-stream | Ties plateau; native draw about 483 W single-stream | Extra cap is mostly unused for LLM serving. |
| 500 W | Within 3.3% of optimal in every tested scenario | Within 0.6% of batched peak; within 0.1% single-stream | Recommended operating cap in the MMBT findings. |
| 400 W | Still 95%+ of peak across scenarios | Still 95%+ of peak across scenarios | Good efficiency mode if power or thermals matter. |
| 300 W | Noticeable falloff, especially batched | Falloff is milder than dense 27B | Useful for efficiency experiments, not peak serving. |
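To apply one of these caps on an NVIDIA card, the usual tool is `nvidia-smi` with its `-pl` (power limit) option. The sketch below only builds the command and does not run it. Setting a limit normally needs administrator rights, and the accepted range depends on the card. The helper names and the "serving" / "efficiency" labels are hypothetical; the wattages come from the table.

```python
from typing import List

# Caps taken from the power-cap table; the priority labels are hypothetical.
CAPS_W = {
    "serving": 500,     # recommended operating cap in the MMBT findings
    "efficiency": 400,  # still 95%+ of peak, lower power and heat
}


def pick_power_cap_w(priority: str = "serving") -> int:
    """Return the wattage for a priority label, or raise on an unknown label."""
    if priority not in CAPS_W:
        raise ValueError(f"unknown priority: {priority!r}")
    return CAPS_W[priority]


def power_limit_command(gpu_index: int, watts: int) -> List[str]:
    """Build (but do not run) the nvidia-smi call that sets a power limit."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


if __name__ == "__main__":
    # Print the commands for a two-GPU rig instead of executing them.
    for gpu in (0, 1):
        print(" ".join(power_limit_command(gpu, pick_power_cap_w("serving"))))
```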
Validity boundary
The MMBT evidence is strongest for Cyankiwi 4-bit AWQ models on vLLM with 2x RTX PRO 6000 Blackwell at a 500 W operating cap. It does not directly measure official FP8, BF16, Unsloth GGUF, Apple MLX, consumer 24 GB cards, or non-Python coding.
That means this page should be read as an operator's guide, not a universal leaderboard. If you run OpenClaw or Hermes on a different hardware tier, the next useful contribution is a repeatable field report with model, quant, engine, VRAM, context length, throughput, and failure mode.
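A field report with those seven fields could be captured as a small record that serializes to JSON. The sketch below is one possible shape, not an MMBT schema: the `FieldReport` class, its field names, and its units are assumptions. The example record is stitched together from values quoted on this page purely for illustration and is not a real submitted report.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FieldReport:
    model: str
    quant: str
    engine: str
    vram_gb: float
    context_length_tokens: int
    throughput_tok_s: float
    failure_mode: str

    def validate(self) -> None:
        """Reject reports with blank text fields or non-positive numbers."""
        for name in ("model", "quant", "engine", "failure_mode"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        for name in ("vram_gb", "context_length_tokens", "throughput_tok_s"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")

    def to_json(self) -> str:
        self.validate()
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    # Illustrative only: values borrowed from figures quoted on this page.
    report = FieldReport(
        model="Qwen3.6-27B-AWQ",
        quant="Cyankiwi 4-bit AWQ",
        engine="vLLM",
        vram_gb=192,                    # 2 x 96 GB on the measured rig
        context_length_tokens=262_000,  # page says "262K"; exact count not given
        throughput_tok_s=72.1,          # N=1 peak from the capacity table
        failure_mode="URL drift",       # failure shape reported on p3_market
    )
    print(report.to_json())
```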
Sources
- MMBT COMPARISON.md - local-model decision doc for Coder-Next, 27B-thinking, and 27B-no-think.
- MMBT SCORECARD.md - ship rates, cost-per-run, failure modes, and task tables.
- vLLM power sweep - RTX PRO 6000 Blackwell throughput and power-cap measurements.
- MMBT known limitations - caveats around quantization, VRAM tiers, cloud comparison, and platform generalization.