Agentic hardware guide
MMBT snapshot 2026-05-05

Which local model should OpenClaw or Hermes run on your hardware?

The MMBT repo is not a clean leaderboard. It is better than that for operators: it shows where local agent models ship, loop, fabricate, stall, and saturate a real workstation. This page turns those receipts into a routing guide for OpenClaw, Hermes Agent, and similar desktop or persistent-agent stacks.

95.8%: 27B-no-think done-signal rate across the Phase-B grid.

2.27x: Coder-Next single-stream tok/s over dense 27B.

500 W: Validated LLM serving cap on the measured Blackwell rig.

0/10: Coder-Next market-research ship rate at N=10.

Short answer

If you want one local default

Use Qwen3.6-27B-AWQ no-think. It is the best supported default from this evidence base because it ships the most often and avoids Coder-Next's dangerous fabrication pattern.

If you run many users

Add Qwen3-Coder-Next-AWQ for high-throughput tasks with a bounded output shape. On the measured rig it served about 49 comfortable users per card at 500 W, versus about 26 for dense 27B.

If the task is high-stakes

Do not consume either local model single-shot. Route through a verifier, a human review step, or a cloud model with stronger long-horizon evidence.

Route by workload

For OpenClaw or Hermes, the right unit is not "the best model." It is a router policy: pick the model that matches the failure cost of each tool call.

Workload | Best local pick | Why | Avoid | MMBT receipt
Default local agent work | Qwen3.6-27B-AWQ, no-think | Best overall ship-rate signal in the repo: 113/118 done-signal rate, or 95.8%, across the 12-cell Phase-B grid. | Do not treat done-signal as fully graded PASS. The repo flags the no-think PASS sweep as pending. | microbench-phase-b-2026-05-02
Security review, factual review, hallucination-sensitive tasks | Qwen3.6-27B-AWQ, no-think or thinking | The 27B family is the safer pick where false positives are expensive. No-think shipped 10/10 on adversarial hallucination; thinking-mode shipped fewer runs but kept the same clean accuracy profile when it shipped. | Avoid single-shot Coder-Next for high-stakes verdicts. MMBT documents fabricated technical evidence in PR-audit runs. | p2_hallucination, dreamserver-1-pr-audit
Market research with live citations | Qwen3.6-27B-AWQ, thinking | 8/10 ship rate at N=10 on p3_market, while Coder-Next was 0/10. The sampled failure mode for 27B was URL drift, not fabricated facts. | Avoid Coder-Next for autonomous web research. The 0/10 result is a stable failure shape in the MMBT data. | p3_market
Bounded memos, document synthesis, support triage | Qwen3-Coder-Next-AWQ | Fastest and cheapest when the output shape is bounded. It shipped 10/10 on p3_business and p3_doc, and was the strongest support-triage classifier in the published microbench. | Use a verifier before consuming factual claims. Coder-Next is good at shaped output, not consistently safe truthfulness. | p3_business, p3_doc, p2_triage
Long-horizon unattended agents | None of the local arms as a single shot | On the 75-PR audit, 27B produced mostly template stubs and Coder-Next produced no usable deliverable across repeated attempts. | Do not run a persistent desktop agent for hours without checkpoints, validators, and task decomposition. | dreamserver-75-pr-audit
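
The workload rows above can be read as a lookup table. The sketch below is one possible encoding, not code from OpenClaw, Hermes, or the MMBT repo: the `Workload`, `Route`, and `route` names are hypothetical, and the `needs_verifier` flag on the market-research row is an assumption drawn from the URL-drift failure mode rather than something the table states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Workload(Enum):
    DEFAULT = "default"
    HIGH_STAKES_REVIEW = "high_stakes_review"
    MARKET_RESEARCH = "market_research"
    BOUNDED_OUTPUT = "bounded_output"
    LONG_HORIZON = "long_horizon"


@dataclass(frozen=True)
class Route:
    model: Optional[str]   # None: no local arm is recommended as a single shot
    thinking: bool         # whether to enable thinking mode
    needs_verifier: bool   # whether output should pass a verifier or human review


DENSE = "Qwen3.6-27B-AWQ"
CODER = "Qwen3-Coder-Next-AWQ"

# Hypothetical policy table transcribed from the workload rows on this page.
POLICY = {
    Workload.DEFAULT: Route(DENSE, thinking=False, needs_verifier=False),
    # High-stakes work should not be consumed single-shot from either model.
    Workload.HIGH_STAKES_REVIEW: Route(DENSE, thinking=False, needs_verifier=True),
    # Assumption: check citations, since URL drift was the sampled failure mode.
    Workload.MARKET_RESEARCH: Route(DENSE, thinking=True, needs_verifier=True),
    # Coder-Next is the bounded-output pick, with a verifier on factual claims.
    Workload.BOUNDED_OUTPUT: Route(CODER, thinking=False, needs_verifier=True),
    # No local arm shipped a usable long-horizon deliverable as a single shot.
    Workload.LONG_HORIZON: Route(None, thinking=False, needs_verifier=True),
}


def route(workload: Workload) -> Route:
    """Return the routing decision for one tool call's workload class."""
    return POLICY[workload]


if __name__ == "__main__":
    for w in Workload:
        print(w.value, route(w))
```

A caller would treat `model is None` as a signal to decompose the task or escalate to a cloud model instead of dispatching it locally.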

Route by hardware

Only the Blackwell workstation row is directly measured by MMBT. The other rows are deployment guidance derived from the repo's own validity boundaries, especially its warning that 24 GB, 48 GB, Mac MLX, and non-AWQ quantizations are not characterized.

Laptop CPU / small GPU

Not recommended

CPU-only, 8-16 GB RAM, low-end iGPU

Use cloud routing for OpenClaw/Hermes. Local models in this evidence set are not appropriate here.

Good control plane, weak local inference plane.

Consumer 24 GB GPU

Inferred

RTX 3090 / 4090 class, single 24 GB card

Start with a dense 27B quant at shorter context, then route hard or long tasks to cloud. Avoid Coder-Next unless you accept CPU offload and much lower throughput.

Usable for local drafts and constrained tools; not enough headroom for the published 262K-context MMBT setup.

Single 48 GB GPU

Inferred

RTX 6000 Ada / RTX PRO 5000-class memory tier

Use 27B-no-think as the default local worker. Add thinking-mode for research/provenance tasks. Keep Coder-Next as an experiment, not the default router.

Good practical floor for a local-first agent with cloud fallback.

Dual 48 GB or 96 GB workstation

Inferred

2 GPUs with enough combined VRAM for vLLM serving and long context

Run a mixed router: 27B-no-think for safe default work, 27B-thinking for research, Coder-Next for cheap shaped output and high concurrency.

Best current shape for a serious local OpenClaw/Hermes install.

Measured MMBT rig

Measured

2x RTX PRO 6000 Blackwell, 96 GB each, vLLM, Cyankiwi 4-bit AWQ, 500 W cap

Use task routing, not a single winner. 500 W is enough; raising power does not materially improve LLM serving.

The only fully measured operating point behind these recommendations.

Mac M-series unified memory

Inferred

M2/M3/M4 Pro, Max, Ultra with 32-128 GB unified memory

Treat as a sibling experiment. MMBT notes that MoE Coder-Next may look better on Mac because dense 27B compute can become the bottleneck, but this was not measured.

Promising for quiet local agents, but needs MLX-specific benchmark runs.
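
For the discrete-GPU tiers above, the guidance could be wired into an installer or startup check roughly as follows. This is a sketch under stated assumptions: the VRAM cut points between tiers (24, 48, 96 GB total) are illustrative boundaries chosen to match the tier names, `advise` and `TierAdvice` are hypothetical names, and Mac unified memory is deliberately left out because it was not measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierAdvice:
    tier: str
    evidence: str        # "measured", "inferred", or "not recommended"
    default_route: str


def advise(total_vram_gb: int, is_mmbt_rig: bool = False) -> TierAdvice:
    """Map total GPU VRAM to the hardware guidance on this page.

    is_mmbt_rig should be True only for the exact measured setup
    (2x RTX PRO 6000 Blackwell, vLLM, Cyankiwi 4-bit AWQ, 500 W cap).
    The numeric thresholds are assumptions, not MMBT results.
    """
    if total_vram_gb < 24:
        return TierAdvice("laptop CPU / small GPU", "not recommended", "cloud")
    if total_vram_gb < 48:
        return TierAdvice("consumer 24 GB GPU", "inferred",
                          "dense 27B quant at shorter context, cloud for hard tasks")
    if total_vram_gb < 96:
        return TierAdvice("single 48 GB GPU", "inferred",
                          "27B-no-think default, thinking mode for research")
    evidence = "measured" if is_mmbt_rig else "inferred"
    return TierAdvice("dual 48 GB or 96 GB workstation", evidence,
                      "mixed router: 27B-no-think, 27B-thinking, Coder-Next")


if __name__ == "__main__":
    for vram in (16, 24, 48, 192):
        print(vram, advise(vram))
```

Keeping the `evidence` field in the return value lets the agent UI show whether a recommendation is backed by a measurement or only derived from it.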

Serving capacity

The hardware sweep makes Coder-Next attractive as a serving-capacity model even when it is not the safest truthfulness model.

Model | N=1 peak | N=32 peak | Users @50 tok/s | Note
Qwen3.6-27B-AWQ | 72.1 tok/s | 1382.1 tok/s | ~26 users | Better safety/research behavior, lower serving capacity.
Qwen3-Coder-Next-AWQ | 163.3 tok/s | 2472.8 tok/s | ~49 users | Much higher capacity, but needs routing away from high-stakes truthfulness tasks.
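
The capacity gap can be recomputed from the table values as a quick sanity check. The snippet below is plain arithmetic on the numbers on this page; the `PEAKS` layout and `speedup` helper are hypothetical names. The single-stream ratio it produces is close to the 2.27x headline figure, and the batched ratio is noticeably smaller, so the advantage narrows under load.

```python
# Peak throughput (tok/s) and comfortable-user counts, copied from the table above.
PEAKS = {
    "Qwen3.6-27B-AWQ": {"n1": 72.1, "n32": 1382.1, "users_at_50": 26},
    "Qwen3-Coder-Next-AWQ": {"n1": 163.3, "n32": 2472.8, "users_at_50": 49},
}


def speedup(metric: str) -> float:
    """Ratio of Coder-Next to dense 27B for one column of the table."""
    coder = PEAKS["Qwen3-Coder-Next-AWQ"][metric]
    dense = PEAKS["Qwen3.6-27B-AWQ"][metric]
    return coder / dense


if __name__ == "__main__":
    for metric in ("n1", "n32", "users_at_50"):
        print(f"{metric}: {speedup(metric):.2f}x")
```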

Power cap read

For LLM serving, 500 W is already on the plateau. Save 600 W for compute-bound diffusion workloads, not Hermes/OpenClaw chat-serving.

Cap | Dense 27B | Coder-Next | Note
600 W | Ties plateau; native draw about 511 W single-stream | Ties plateau; native draw about 483 W single-stream | Extra cap is mostly unused for LLM serving.
500 W | Within 3.3% of optimal in every tested scenario | Within 0.6% of batched peak; within 0.1% single-stream | Recommended operating cap in the MMBT findings.
400 W | Still 95%+ of peak across scenarios | Still 95%+ of peak across scenarios | Good efficiency mode if power or thermals matter.
300 W | Noticeable falloff, especially batched | Falloff is milder than dense 27B | Useful for efficiency experiments, not peak serving.
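
One way to apply the cap table is to pick a wattage by goal and build the matching `nvidia-smi` call. This is a sketch, not the MMBT tooling: `pick_cap` and `power_limit_command` are hypothetical helpers, the goal names are assumptions, and the code only constructs the command rather than running it, since setting a power limit normally needs administrator rights.

```python
from typing import List

# Caps in watts, taken from the table above; the goal labels are assumptions.
CAPS_W = {
    "serving": 500,      # recommended operating cap for LLM serving
    "efficiency": 400,   # still 95%+ of peak across scenarios
    "diffusion": 600,    # reserved for compute-bound diffusion workloads
}


def pick_cap(goal: str) -> int:
    """Return the power cap in watts for a named goal."""
    if goal not in CAPS_W:
        raise ValueError(f"unknown goal: {goal!r}")
    return CAPS_W[goal]


def power_limit_command(gpu_index: int, watts: int) -> List[str]:
    """Build (but do not run) an nvidia-smi call that sets a GPU power limit."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


if __name__ == "__main__":
    # Print the commands for a two-GPU serving box.
    for gpu in (0, 1):
        print(" ".join(power_limit_command(gpu, pick_cap("serving"))))
```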

Validity boundary

The MMBT evidence is strongest for Cyankiwi 4-bit AWQ models on vLLM with 2x RTX PRO 6000 Blackwell at a 500 W operating cap. It does not directly measure official FP8, BF16, Unsloth GGUF, Apple MLX, consumer 24 GB cards, or non-Python coding.

That means this page should be read as an operator's guide, not a universal leaderboard. If you run OpenClaw or Hermes on a different hardware tier, the next useful contribution is a repeatable field report with model, quant, engine, VRAM, context length, throughput, and failure mode.
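
A field report with those fields could be as small as one record per run. The structure below is a suggestion only: `FieldReport` and its field names are hypothetical, not an MMBT schema, and the example values simply restate the measured dense-27B single-stream point from this page (262K context written as 262000 tokens is an approximation).

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FieldReport:
    model: str
    quant: str
    engine: str
    vram_gb: int
    context_length_tokens: int
    throughput_tok_s: float
    failure_mode: str

    def to_json(self) -> str:
        """Serialize the report so it can be attached to an issue or PR."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    # Example values restate the measured rig on this page, not a new result.
    report = FieldReport(
        model="Qwen3.6-27B-AWQ",
        quant="Cyankiwi 4-bit AWQ",
        engine="vLLM",
        vram_gb=192,                    # 2x 96 GB
        context_length_tokens=262000,   # "262K" as stated on this page
        throughput_tok_s=72.1,          # N=1 peak from the serving table
        failure_mode="none recorded in this example",
    )
    print(report.to_json())
```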

Sources