Every LLM on Hailo-10H
Llama 3.1 8B · Llama 3.2 3B · Qwen 2.5 · Qwen3 — decode tok/s, quality, quantisation.
The independent register for NPUs that leave the datacenter behind: Hailo-8, 8L, 10H and 15H first, with Jetson Orin, Coral, Rockchip, Qualcomm and the Apple Neural Engine held next to them. Numbers are taken from the public Hailo Model Zoo: INT8 (or INT4 for LLMs on 10H), batch 1, on reference boards.
5 Hailo chips covered · 21 benchmarked models · 9 detection / seg variants · 3 on-device LLMs on Hailo-10H.
Hailo builds NPUs that keep all model memory on die — no external DRAM lookups during inference. That decision buys predictable latency and very high perf-per-watt; it also means the model has to fit, or be partitioned by the Hailo compiler.
| Chip | Family | Performance | Power | Form | Best for | Released | Status |
|---|---|---|---|---|---|---|---|
| Hailo-8L | Accelerator | 13 TOPS (INT8) | ~1.5 W typical | M.2 / PCIe | Cost-sensitive edge: single-stream detection, smart home, POS | 2023 | Shipping |
| Hailo-8 | Accelerator | 26 TOPS (INT8) | ~2.5 W typical | M.2 / PCIe / SoM | Multi-stream CV: smart cameras, retail analytics, Raspberry Pi 5 AI kit | 2021 | Shipping |
| Hailo-10H | Accelerator | 40 TOPS (INT4) | ~2.5 W typical | M.2 | On-device LLMs/VLMs, Llama 3 8B at 10+ tok/s, generative edge AI | 2025-07 | New |
| Hailo-15H | Vision Processor | 20 TOPS (INT8) | ~3-5 W | SoC (VPU) | High-end smart cameras with on-chip ISP + NN core | 2023 | Shipping |
| Hailo-15L | Vision Processor | 7 TOPS (INT8) | ~2 W | SoC (VPU) | Mass-market IP cameras replacing traditional SoCs | 2024 | Mass market |
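
The perf-per-watt claim can be checked against the table with simple division. The sketch below is only arithmetic on the listed figures; it assumes the "typical" wattage is the right denominator and keeps both ends of the Hailo-15H range. The Hailo-10H figure is INT4 TOPS, so it is not directly comparable with the INT8 rows.

```python
# Perf-per-watt from the chip table. Power values are the listed "typical"
# figures; Hailo-15H is listed as a 3-5 W range, so both ends are kept.
CHIPS = {
    # name: (TOPS, precision, (min_watts, max_watts))
    "Hailo-8L": (13, "INT8", (1.5, 1.5)),
    "Hailo-8": (26, "INT8", (2.5, 2.5)),
    "Hailo-10H": (40, "INT4", (2.5, 2.5)),
    "Hailo-15H": (20, "INT8", (3.0, 5.0)),
    "Hailo-15L": (7, "INT8", (2.0, 2.0)),
}


def tops_per_watt(tops, watts):
    """Return (worst, best) TOPS/W for a (min_watts, max_watts) range."""
    min_w, max_w = watts
    return tops / max_w, tops / min_w


for name, (tops, precision, watts) in CHIPS.items():
    worst, best = tops_per_watt(tops, watts)
    if worst == best:
        print(f"{name}: {best:.1f} {precision} TOPS/W")
    else:
        print(f"{name}: {worst:.1f} to {best:.1f} {precision} TOPS/W")
```

Peak TOPS per watt says nothing about how much of that throughput a given model actually uses, which is why the benchmark tables report FPS instead.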
Hailo is the primary lens of this register, but the edge shelf is wider. Jetson Orin owns the CUDA-native robotics lane; Google Coral is the old reliable for TFLite classifiers; Rockchip and Qualcomm ship inside SBCs and phones; the Apple Neural Engine quietly runs CoreML on every M-series machine.
Numbers from vendor product pages. Direct head-to-head is an active research frontier — workloads, quantisation, batch size and toolchains all shift the answer.
| Chip | Vendor | Performance | Power envelope | Role |
|---|---|---|---|---|
| Jetson Orin Nano (8GB) | NVIDIA | 40 TOPS (INT8) | 7–15 W | CUDA-capable edge dev kit — robotics, vision, early LLM |
| Jetson AGX Orin (64GB) | NVIDIA | 275 TOPS (INT8) | 15–60 W | Autonomous machines · large VLMs on the edge |
| Coral Edge TPU (USB / M.2) | Google | 4 TOPS (INT8) | ~2 W | TFLite-only · classic classifier & detector workloads |
| Rockchip RK3588 NPU | Rockchip | 6 TOPS (INT8) | SoC envelope | SBC / mini-PC workhorse · ARM + NPU + GPU |
| Qualcomm QCS8550 | Qualcomm | ~48 TOPS (INT8) | SoC envelope | Hexagon NPU · IoT, robotics, Android-class devices |
| Apple Neural Engine (M-series) | Apple | ~18–38 TOPS (INT8) | SoC envelope | macOS / iOS on-device inference · CoreML |
The same chip is fast at vision and slow at OCR; Hailo-10H runs an 8B LLM at interactive decode speed and, in the same power envelope, reaches 275 FPS on YOLOv8n. Each panel ends at a real number from the Hailo Model Zoo.
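One way to read a batch-1 FPS figure is as a camera-stream budget. The sketch below uses the YOLOv8n numbers from the Model Zoo table; the 30 FPS per stream and the 80% usable-throughput headroom are illustrative assumptions, not Hailo guidance, and real pipelines also pay for video decode and pre/post-processing.

```python
# YOLOv8n FPS per chip, copied from the Model Zoo table (INT8, batch 1).
YOLOV8N_FPS = {
    "Hailo-8L": 150,
    "Hailo-8": 235,
    "Hailo-10H": 275,
    "Hailo-15H": 255,
}


def max_streams(chip_fps, stream_fps=30, headroom_pct=80):
    """Whole camera streams a chip could serve.

    stream_fps and headroom_pct are hypothetical planning values: each
    stream needs stream_fps inferences per second, and only headroom_pct
    percent of the benchmark FPS is assumed to be usable in practice.
    Integer arithmetic avoids floating-point rounding at the boundaries.
    """
    return (chip_fps * headroom_pct) // (100 * stream_fps)


for chip, fps in YOLOV8N_FPS.items():
    print(f"{chip}: about {max_streams(fps)} streams at 30 FPS")
```

Under these assumptions the budget ranges from 4 streams on Hailo-8L to 7 on Hailo-10H.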
FPS from the public Hailo Model Zoo. INT8, batch 1, reference boards. LLM rows are INT4 decode tok/s with 2K context on Hailo-10H. A dash means the model is not officially compiled for that chip.
| Model | Input | Params | Hailo-8L | Hailo-8 | Hailo-10H | Hailo-15H | Notes |
|---|---|---|---|---|---|---|---|
| YOLOv11n | 640×640 | 2.6M | 135 | 210 | 260 | 240 | Latest YOLO nano, NMS on-chip |
| YOLOv11s | 640×640 | 9.4M | 72 | 140 | 175 | 160 | Balanced accuracy/speed |
| YOLOv11m | 640×640 | 20.1M | 38 | 70 | 95 | 85 | Higher mAP for demanding scenes |
| YOLOv8n | 640×640 | 3.2M | 150 | 235 | 275 | 255 | Most deployed edge detector |
| YOLOv8s | 640×640 | 11.2M | 78 | 150 | 180 | 165 | |
| YOLO26n | 640×640 | 3.0M | — | — | 250 | 230 | NMS-free, newest family |
| YOLOv11n-obb | 640×640 | 2.7M | — | — | 210 | 195 | Rotated boxes for aerial/industrial |
| YOLOv8n-seg | 640×640 | 3.4M | 85 | 155 | 190 | 175 | |
| YOLOv5n-seg-hpp | 640×640 | 2.0M | 120 | 195 | 230 | 215 | HailoRT-accelerated post-process |
| YOLOv8n-pose | 640×640 | 3.3M | 88 | 160 | 195 | 180 | 17-keypoint human pose |
| ResNet-50 | 224×224 | 25.6M | 720 | 1,390 | 1,750 | 1,500 | ImageNet reference |
| MobileNet V3 | 224×224 | 5.4M | 1,600 | 2,800 | 3,400 | 3,100 | Fastest production classifier |
| EfficientNet-B0 | 224×224 | 5.3M | 1,020 | 1,850 | 2,300 | 2,050 | |
| PaddleOCR-v5 (det+rec) | Multi | ~12M | 22 | 45 | 65 | 58 | Latest PP-OCR pipeline |
| RetinaFace MobileNet | 736×1280 | 0.4M | 85 | 140 | 165 | 150 | |
| ArcFace R50 | 112×112 | 43.6M | 380 | 720 | 890 | 800 | |
| FastDepth | 224×224 | 3.9M | 380 | 640 | 790 | 710 | |
| CLIP ViT-L/14 (Laion2B) | 224×224 | 304M | — | — | 42 | 28 | Image embeddings for retrieval |
| Llama 3.2 3B | — | 3.2B | — | — | 28 | — | tok/s decode, 2K ctx |
| Llama 3.1 8B | — | 8.0B | — | — | 11 | — | tok/s decode, 2K ctx |
| Qwen 2.5 1.5B | — | 1.5B | — | — | 45 | — | tok/s decode |
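
For the LLM rows, decode tok/s translates directly into wait time for a reply, and the parameter count gives a rough size for INT4 weights. The sketch below is plain arithmetic on the table values; the 256-token reply length is an arbitrary example, prefill time is not included because the table reports decode only, and the weight estimate is a lower bound that ignores quantisation scales and the KV cache.

```python
# Decode speed (tok/s) and parameter counts from the Hailo-10H LLM rows.
LLMS = {
    # name: (decode_tok_per_s, parameters)
    "Llama 3.2 3B": (28, 3.2e9),
    "Llama 3.1 8B": (11, 8.0e9),
    "Qwen 2.5 1.5B": (45, 1.5e9),
}


def decode_seconds(n_tokens, tok_per_s):
    """Seconds to generate n_tokens at a steady decode rate (prefill excluded)."""
    return n_tokens / tok_per_s


def weight_gigabytes(params, bits=4):
    """Lower-bound weight storage in GB (10^9 bytes) at the given bit width."""
    return params * bits / 8 / 1e9


REPLY_TOKENS = 256  # hypothetical reply length, chosen only for illustration

for name, (tok_s, params) in LLMS.items():
    secs = decode_seconds(REPLY_TOKENS, tok_s)
    gb = weight_gigabytes(params)
    print(f"{name}: {secs:.1f} s for {REPLY_TOKENS} tokens, >= {gb:.2f} GB of INT4 weights")
```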
Each sub-page carries its own evidence table. This hub is the index; the work is below.
HEF (Hailo Executable Format) is the compiled binary that runs on a Hailo chip. You cannot load a PyTorch or ONNX model directly. The Hailo Dataflow Compiler converts the model, quantises the weights, maps operations onto the NPU's cores and memory, and produces a single .hef file.
The compile step takes minutes to hours and needs a licence. Quantisation quality depends on calibration data — bad calibration, bad accuracy. Each chip has its own HEF; a Hailo-8 HEF does not run on a 10H. Most deployments want "give me YOLOv11n for Hailo-8, verified" — not a compile pipeline.
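Why calibration data matters can be shown with a toy example. The sketch below uses textbook symmetric per-tensor INT8 quantisation, which is an assumption made for illustration and not a description of what the Hailo Dataflow Compiler does internally. The activation values and both clipping ranges are made up.

```python
def quantise_dequantise(values, clip_max, bits=8):
    """Symmetric fake-quantisation: clip to [-clip_max, clip_max], snap each
    value to a signed integer grid, then scale back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    scale = clip_max / qmax
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax, min(qmax, q))  # saturate values outside the range
        out.append(q * scale)
    return out


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical activations: values between -1 and 1 plus one outlier at 6.0.
activations = [0.02 * i for i in range(-50, 51)] + [6.0]

# Representative calibration: the observed range covers every value.
good_clip = max(abs(v) for v in activations)
# Unrepresentative calibration: the set only ever produced small values.
bad_clip = 0.25

err_good = mean_abs_error(activations, quantise_dequantise(activations, good_clip))
err_bad = mean_abs_error(activations, quantise_dequantise(activations, bad_clip))
print(f"representative calibration:   mean abs error {err_good:.4f}")
print(f"unrepresentative calibration: mean abs error {err_bad:.4f}")
```

With the narrow range, every activation beyond ±0.25 saturates, so the error is dominated by clipping rather than rounding.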
Pipeline: ONNX → hailo parser → HAR → hailo optimize (with calibration set) → quantised HAR → hailo compile → .hef → HailoRT → ConfiguredNetworkGroup → inference.
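The pipeline can be modelled as a chain of artifacts, which also makes the one-HEF-per-chip rule explicit. The sketch below is a hypothetical data model for bookkeeping only: the step and artifact names are copied from the pipeline line above, while the `Stage` and `Hef` classes and both helper functions are invented for illustration and are not part of HailoRT or the Dataflow Compiler. The exact-match chip check follows the text's statement that each chip has its own HEF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    tool: str      # step name as written in the pipeline description
    consumes: str  # input artifact
    produces: str  # output artifact


PIPELINE = [
    Stage("hailo parser", "ONNX", "HAR"),
    Stage("hailo optimize", "HAR", "quantised HAR"),  # needs a calibration set
    Stage("hailo compile", "quantised HAR", ".hef"),
]


def check_chain(stages):
    """True if every stage consumes exactly what the previous one produced."""
    return all(a.produces == b.consumes for a, b in zip(stages, stages[1:]))


@dataclass(frozen=True)
class Hef:
    model: str
    target_chip: str  # chip the HEF was compiled for


def can_run(hef, device_chip):
    """Per the text, a HEF only runs on the chip it was compiled for."""
    return hef.target_chip == device_chip


hef = Hef(model="YOLOv11n", target_chip="Hailo-8")
print(check_chain(PIPELINE))      # the three steps hand artifacts on in order
print(can_run(hef, "Hailo-8"))    # same chip as the compile target
print(can_run(hef, "Hailo-10H"))  # different chip: needs its own HEF
```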