Agentic hardware guide

MMBT snapshot 2026-05-05

Which local model should OpenClaw or Hermes run on your hardware?

The MMBT repo is not a clean leaderboard. It is better than that for operators: it shows where local agent models ship, loop, fabricate, stall, and saturate a real workstation. This page turns those receipts into a routing guide for OpenClaw, Hermes Agent, and similar desktop or persistent-agent stacks.

95.8%

27B-no-think done-signal rate across the Phase-B grid.

2.27x

Coder-Next single-stream tok/s over dense 27B.

500 W

Validated LLM serving cap on the measured Blackwell rig.

0/10

Coder-Next market-research ship rate at N=10.

Short answer

If you want one local default

Use Qwen3.6-27B-AWQ no-think. It is the best supported default from this evidence base because it ships the most often and avoids Coder-Next's dangerous fabrication pattern.

If you run many users

Add Qwen3-Coder-Next-AWQ for high-throughput shaped tasks. On the measured rig at a 500 W cap it served about 49 users per card at 50 tok/s each, versus about 26 for dense 27B.

If the task is high-stakes

Do not consume either local model single-shot. Route through a verifier, a human review step, or a cloud model with stronger long-horizon evidence.
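The high-stakes rule above can be sketched as a small gate in front of whatever consumes the model's output. This is a minimal illustration, not OpenClaw or Hermes code: `Draft`, `gated_answer`, and the `verifier` callback are hypothetical names, and the verifier could stand for a second model, a rule check, or a human review queue.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    model: str  # which local arm produced the text
    text: str   # the single-shot output


def gated_answer(
    draft: Draft,
    high_stakes: bool,
    verifier: Callable[[Draft], bool],
) -> Optional[Draft]:
    """Return the draft if it may be consumed; None means escalate.

    Low-stakes drafts pass through unchanged. High-stakes drafts are only
    released when the (hypothetical) verifier accepts them.
    """
    if not high_stakes:
        return draft
    return draft if verifier(draft) else None
```

A `None` result is the signal to hand the task to human review or a cloud model rather than retrying the same local arm blindly.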

Route by workload

For OpenClaw or Hermes, the right unit is not "the best model." It is a router policy: pick the model that matches the failure cost of each tool call.

Workload	Best local pick	Why	Avoid	MMBT receipt
Default local agent work	Qwen3.6-27B-AWQ, no-think	Best overall ship-rate signal in the repo: 113/118 done-signal rate, or 95.8%, across the 12-cell Phase-B grid.	Do not treat done-signal as fully graded PASS. The repo flags the no-think PASS sweep as pending.	microbench-phase-b-2026-05-02
Security review, factual review, hallucination-sensitive tasks	Qwen3.6-27B-AWQ, no-think or thinking	The 27B family is the safer pick where false positives are expensive. No-think shipped 10/10 on adversarial hallucination; thinking-mode shipped fewer runs but kept the same clean accuracy profile when it shipped.	Avoid single-shot Coder-Next for high-stakes verdicts. MMBT documents fabricated technical evidence in PR-audit runs.	p2_hallucination, dreamserver-1-pr-audit
Market research with live citations	Qwen3.6-27B-AWQ, thinking	8/10 ship rate at N=10 on p3_market, while Coder-Next was 0/10. The sampled failure mode for 27B was URL drift, not fabricated facts.	Avoid Coder-Next for autonomous web research. The 0/10 result is a stable failure shape in the MMBT data.	p3_market
Bounded memos, document synthesis, support triage	Qwen3-Coder-Next-AWQ	Fastest and cheapest when the output shape is bounded. It shipped 10/10 on p3_business and p3_doc, and was the strongest support-triage classifier in the published microbench.	Use a verifier before consuming factual claims. Coder-Next is good at shaped output but is not consistently truthful.	p3_business, p3_doc, p2_triage
Long-horizon unattended agents	None of the local arms as a single shot	On the 75-PR audit, 27B produced mostly template stubs and Coder-Next produced no usable deliverable across repeated attempts.	Do not run a persistent desktop agent for hours without checkpoints, validators, and task decomposition.	dreamserver-75-pr-audit
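The workload table can be read as a lookup from task type to model arm. The sketch below shows one way to encode it, assuming hypothetical workload keys (`"default"`, `"market_research"`, and so on) and a hypothetical `Route` type; none of these names come from MMBT, OpenClaw, or Hermes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    model: str            # local arm to call, or "escalate" when no local single shot fits
    needs_verifier: bool  # True when output should be checked before it is consumed


# Hypothetical workload keys; the picks mirror the table rows above.
ROUTES = {
    "default": Route("Qwen3.6-27B-AWQ no-think", False),
    "security_review": Route("Qwen3.6-27B-AWQ no-think", False),
    "market_research": Route("Qwen3.6-27B-AWQ thinking", False),
    "bounded_memo": Route("Qwen3-Coder-Next-AWQ", True),
    "long_horizon": Route("escalate", True),
}


def pick_route(workload: str, high_stakes: bool = False) -> Route:
    """Look up the route for a workload; unknown workloads get the default arm."""
    route = ROUTES.get(workload, ROUTES["default"])
    # High-stakes work is never consumed single-shot, whichever arm produced it.
    if high_stakes and not route.needs_verifier:
        route = replace(route, needs_verifier=True)
    return route
```

For example, `pick_route("market_research")` selects the 27B thinking arm, while `pick_route("default", high_stakes=True)` keeps the no-think arm but flags the result for verification.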

Route by hardware

Only the Blackwell workstation row is directly measured by MMBT. The other rows are deployment guidance derived from the repo's own validity boundaries, especially its warning that 24 GB, 48 GB, Mac MLX, and non-AWQ quantizations are not characterized.

Hardware tier	Status	Hardware	Recommendation	Read
Laptop CPU / small GPU	Not recommended	CPU-only, 8-16 GB RAM, low-end iGPU	Use cloud routing for OpenClaw/Hermes. Local models in this evidence set are not appropriate here.	Good control plane, weak local inference plane.
Consumer 24 GB GPU	Inferred	RTX 3090 / 4090 class, single 24 GB card	Start with a dense 27B quant at shorter context, then route hard or long tasks to cloud. Avoid Coder-Next unless you accept CPU offload and much lower throughput.	Usable for local drafts and constrained tools; not enough headroom for the published 262K-context MMBT setup.
Single 48 GB GPU	Inferred	RTX 6000 Ada / RTX PRO 5000-class memory tier	Use 27B-no-think as the default local worker. Add thinking-mode for research/provenance tasks. Keep Coder-Next as an experiment, not the default router.	Good practical floor for a local-first agent with cloud fallback.
Dual 48 GB or 96 GB workstation	Inferred	2 GPUs with enough combined VRAM for vLLM serving and long context	Run a mixed router: 27B-no-think for safe default work, 27B-thinking for research, Coder-Next for cheap shaped output and high concurrency.	Best current shape for a serious local OpenClaw/Hermes install.
Measured MMBT rig	Measured	2x RTX PRO 6000 Blackwell, 96 GB each, vLLM, Cyankiwi 4-bit AWQ, 500 W cap	Use task routing, not a single winner. 500 W is enough; raising power does not materially improve LLM serving.	The only fully measured operating point behind these recommendations.
Mac M-series unified memory	Inferred	M2/M3/M4 Pro, Max, Ultra with 32-128 GB unified memory	Treat as a sibling experiment. MMBT notes that MoE Coder-Next may look better on Mac because dense 27B compute can become the bottleneck, but this was not measured.	Promising for quiet local agents, but needs MLX-specific benchmark runs.
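The NVIDIA tiers above can be turned into a first-pass policy selector. This is a rough sketch: the function name, the policy labels, and the exact cut-offs (24 GB and 48 GB per card) are assumptions read off the tier names, and every result except the measured Blackwell rig remains inferred guidance. Mac unified memory is left out on purpose because it was not measured.

```python
def local_policy(per_gpu_vram_gb: float, gpu_count: int = 1) -> str:
    """Map a discrete-GPU setup to one of the routing policies described above.

    The cut-offs are illustrative assumptions taken from the tier names,
    not thresholds validated by MMBT.
    """
    if gpu_count < 1 or per_gpu_vram_gb < 24:
        # Laptop CPU / small GPU tier: keep inference in the cloud.
        return "cloud-only"
    if per_gpu_vram_gb < 48:
        # Consumer 24 GB tier: dense 27B at shorter context, cloud for hard tasks.
        return "dense-27b-short-context+cloud"
    if gpu_count == 1:
        # Single 48 GB-class card: 27B-no-think default with cloud fallback.
        return "27b-no-think-default+cloud-fallback"
    # Two or more 48 GB-or-larger cards: mixed router across all three arms.
    return "mixed-router"
```

A single 96 GB card is treated conservatively here as the single-card tier, since the page only describes the mixed router for two-GPU setups.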

Serving capacity

The hardware sweep makes Coder-Next attractive as a serving-capacity model even when it is not the safest truthfulness model.

Model	Notes	N=1 peak	N=32 peak	Users @50 tok/s
Qwen3.6-27B-AWQ	Better safety/research behavior, lower serving capacity.	72.1 tok/s	1382.1 tok/s	~26 users
Qwen3-Coder-Next-AWQ	Much higher capacity, but needs routing away from high-stakes truthfulness tasks.	163.3 tok/s	2472.8 tok/s	~49 users
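The capacity gap is easiest to see as ratios. The sketch below recomputes them from the rounded table values, so the last digit may differ slightly from headline figures derived from unrounded measurements. The user counts are taken as given from the table; the script does not try to derive them, since the page does not state how they were computed. The dictionary keys and function names are illustrative.

```python
# Figures copied from the serving-capacity table above.
PEAKS = {
    "dense_27b": {"n1_tok_s": 72.1, "n32_tok_s": 1382.1, "users_at_50": 26},
    "coder_next": {"n1_tok_s": 163.3, "n32_tok_s": 2472.8, "users_at_50": 49},
}


def coder_over_dense(metric: str) -> float:
    """Ratio of Coder-Next to dense 27B for one column of the table."""
    return PEAKS["coder_next"][metric] / PEAKS["dense_27b"][metric]


def batching_gain(model: str) -> float:
    """Growth in aggregate throughput from N=1 to N=32 for one model."""
    return PEAKS[model]["n32_tok_s"] / PEAKS[model]["n1_tok_s"]


if __name__ == "__main__":
    for metric in ("n1_tok_s", "n32_tok_s", "users_at_50"):
        print(f"{metric}: Coder-Next / dense 27B = {coder_over_dense(metric):.2f}x")
    for model in PEAKS:
        print(f"{model}: N=32 / N=1 = {batching_gain(model):.1f}x")
```

On these numbers the Coder-Next advantage is larger single-stream than at N=32, because dense 27B gains more from batching. That matters when sizing for many concurrent users rather than one fast session.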


Power cap read

For LLM serving, 500 W is already on the plateau. Save 600 W for compute-bound diffusion workloads, not Hermes/OpenClaw chat-serving.

Cap	Dense 27B	Coder-Next	Note
600 W	Ties plateau; native draw about 511 W single-stream	Ties plateau; native draw about 483 W single-stream	Extra cap is mostly unused for LLM serving.
500 W	Within 3.3% of optimal in every tested scenario	Within 0.6% of batched peak; within 0.1% single-stream	Recommended operating cap in the MMBT findings.
400 W	Still 95%+ of peak across scenarios	Still 95%+ of peak across scenarios	Good efficiency mode if power or thermals matter.
300 W	Noticeable falloff, especially batched	Falloff is milder than dense 27B	Useful for efficiency experiments, not peak serving.
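Applying a cap is usually done with `nvidia-smi`, whose `-i` flag selects a GPU and `-pl` sets its power limit in watts. The helper below only builds the command and does not run it. The function name is hypothetical, and the 300-600 W guard simply mirrors the range swept in the table; it is an assumption, not the limit a given card or driver will accept.

```python
import shlex
from typing import List

# Assumption: bounds taken from the caps listed in the table above,
# not from any card's real minimum or maximum power limit.
SWEPT_MIN_W = 300
SWEPT_MAX_W = 600


def power_cap_command(gpu_index: int, watts: int = 500) -> List[str]:
    """Build (but do not execute) an nvidia-smi power-limit command."""
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    if not SWEPT_MIN_W <= watts <= SWEPT_MAX_W:
        raise ValueError(f"{watts} W is outside the swept range on this page")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


if __name__ == "__main__":
    # Print the commands for a two-GPU rig at the recommended 500 W cap.
    for idx in (0, 1):
        print(shlex.join(power_cap_command(idx)))
```

Changing the power limit typically requires administrator privileges, so printing the command for review before running it is a deliberate choice here.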

Validity boundary

The MMBT evidence is strongest for Cyankiwi 4-bit AWQ models on vLLM with 2x RTX PRO 6000 Blackwell at a 500 W operating cap. It does not directly measure official FP8, BF16, Unsloth GGUF, Apple MLX, consumer 24 GB cards, or non-Python coding.

That means this page should be read as an operator's guide, not a universal leaderboard. If you run OpenClaw or Hermes on a different hardware tier, the next useful contribution is a repeatable field report with model, quant, engine, VRAM, context length, throughput, and failure mode.
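A field report with those seven fields could be captured in a small record so that submissions stay comparable. This is only a suggested shape: the `FieldReport` class, its field names, and the JSON serialization are assumptions, not an MMBT submission format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FieldReport:
    model: str               # e.g. "Qwen3.6-27B-AWQ"
    quant: str               # e.g. "4-bit AWQ"
    engine: str              # e.g. "vLLM"
    vram_gb: float           # total VRAM available to the engine
    context_length: int      # configured max context, in tokens
    throughput_tok_s: float  # measured tokens per second
    failure_mode: str        # short description of what went wrong, or "none"


def to_json(report: FieldReport) -> str:
    """Serialize a report with stable key order so diffs stay readable."""
    return json.dumps(asdict(report), sort_keys=True)


if __name__ == "__main__":
    # Illustrative values only; not a measured result.
    example = FieldReport(
        model="Qwen3.6-27B-AWQ",
        quant="4-bit AWQ",
        engine="vLLM",
        vram_gb=48.0,
        context_length=32768,
        throughput_tok_s=0.0,
        failure_mode="none",
    )
    print(to_json(example))
```

Keeping `failure_mode` as free text is a simplification; a shared vocabulary such as loop, stall, or fabrication would make reports easier to aggregate.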

Sources

MMBT COMPARISON.md - local-model decision doc for Coder-Next, 27B-thinking, and 27B-no-think.
MMBT SCORECARD.md - ship rates, cost-per-run, failure modes, and task tables.
vLLM power sweep - RTX PRO 6000 Blackwell throughput and power-cap measurements.
MMBT known limitations - caveats around quantization, VRAM tiers, cloud comparison, and platform generalization.