Hands-On: Convert a Large Language Model to Run Efficiently on Raspberry Pi 5 + AI HAT+ 2
Step-by-step guide to convert, quantize and run a local LLM on Raspberry Pi 5 + AI HAT+ 2 with commands, tradeoffs and performance tips.
Get a real, usable local LLM on Raspberry Pi 5 with AI HAT+ 2 — without the cloud
Keeping LLM inference local is now practical for many edge use cases, but the engineering path is full of choices: how aggressive should you be with quantization? Is pruning worth the accuracy hit? Which runtime will actually use the AI HAT+ 2's NPU? This hands-on guide (2026) walks you through the full conversion and tuning pipeline — commands, tradeoffs, and benchmarks — so you can run a usable local LLM on Raspberry Pi 5 hardware.
Quick TL;DR
- Target: 7B–13B class models are workable on Pi 5 + AI HAT+ 2 when you combine 4-bit quantization and modest pruning.
- Toolchain: llama.cpp / GGML (GGUF) for CPU-only local inference, and ONNX Runtime with a vendor NPU provider (or the vendor SDK) for acceleration.
- Conversion path: HF model → quantize (GPTQ / AWQ / llama.cpp q4) → export GGUF/ONNX → run with llama.cpp or ORT-NPU.
- Tradeoffs: Aggressive int4 quantization and structured pruning reduce memory but hurt rare-token accuracy; AWQ / GPTQ are better than naive quantizers.
Why this matters in 2026
Edge generative AI moved fast in 2024–2026: better quantizers (AWQ, improved GPTQ variants), standardization of the GGUF format for GGML runtimes, and affordable NPUs like the AI HAT+ 2 (late-2025 releases) have turned single-board computers into credible inference devices. Teams who need private, low-latency, or offline-capable assistants can now run models locally — but only if they make the right conversion and runtime decisions.
Planned workflow
- Assess hardware and memory budget
- Pick a model and baseline (7B / 13B)
- Quantize (GPTQ / AWQ / llama.cpp q4 variants)
- Optionally prune (structured or magnitude pruning + finetune)
- Export to target runtime (GGUF for llama.cpp, ONNX for NPU)
- Benchmark and tune runtime parameters
1) Hardware & memory budgeting (practical numbers)
Before conversion, estimate working memory. Use these approximations (2026 best-practice estimates):
- Float32 model size ~= parameters * 4 bytes; a 7B model ≈ 28 GB (raw)
- int8 quantized size ≈ model_float_size * 0.25 (so 7B -> ~7 GB)
- int4 (true 4-bit) roughly halves int8 -> ~3.5 GB for 7B
- Overheads (runtime buffers, context window) add 20–40%
Implication: a Pi 5 with 8 GB RAM + AI HAT+ 2 NPU can run a 7B model comfortably if you quantize to int4 (and use swap/zram carefully). A 13B model requires 12–16 GB effective memory unless you rely heavily on NPU offload or streaming runtimes.
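If you want to sanity-check a target configuration before downloading anything, the arithmetic above fits in a few lines of Python (a rough sketch that treats 1 GB as 10^9 bytes and assumes ~30% runtime overhead):
def estimate_gb(params_billions: float, bits_per_weight: float, overhead: float = 0.30) -> float:
    # weights plus runtime overhead (buffers, KV cache for a modest context window)
    weight_bytes = params_billions * 1e9 * (bits_per_weight / 8)
    return weight_bytes * (1 + overhead) / 1e9

for label, bits in [("fp32", 32), ("int8", 8), ("int4", 4)]:
    print(f"7B @ {label}: ~{estimate_gb(7, bits):.1f} GB (incl. ~30% overhead)")
# int4 lands around 4.6 GB, leaving headroom for the OS on an 8 GB Pi 5;
# 13B at int4 is already ~8.5 GB before a generous context window, which is
# why the effective-memory budget above is 12-16 GB.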
2) Pick the right model
Pick a modern, well-optimized base model that has community quantization recipes. In 2026, popular choices include open models from Meta (Llama family), Mistral, and other HF-backed weights. For Pi 5 projects we recommend starting with a 7B variant optimized for quantization or an already distilled 7B. Why? The compute and latency/throughput envelope is friendly and you'll iterate faster.
3) Quantization: concepts and tools
Quantization is the single most effective lever to get models into memory. In 2026 the common choices are:
- Naive int8/4 (llama.cpp q8/q4) — fast, simple, but not optimal for all layers
- GPTQ — per-channel learned rounding quantization that preserves accuracy
- AWQ — newer algorithm (2024–2025) optimized for 4-bit with improved downstream accuracy
Which to choose?
- For fastest path and good-enough quality: use llama.cpp's q4_0 / q4_k variants.
- For best quality at 4-bit: use AWQ or GPTQ followed by conversion to GGUF/ONNX.
- If you want a plug-and-play path that supports NPU, convert to ONNX and use ONNX Runtime with your NPU provider.
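Before reaching for the tools, it helps to see what "4-bit with a group size of 128" actually stores. The sketch below is a naive symmetric group quantizer; GPTQ and AWQ choose the rounding and scaling far more carefully, but the storage layout (packed 4-bit codes plus one scale per group) is the same idea.
import numpy as np

def quantize_int4_grouped(w: np.ndarray, group_size: int = 128):
    groups = w.reshape(-1, group_size)                               # one scale per group of 128 weights
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0 + 1e-12  # int4 symmetric range is [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale                                                  # in practice: packed 4-bit codes + fp16 scales

w = np.random.randn(4096 * 128).astype(np.float32)
q, scale = quantize_int4_grouped(w)
reconstructed = (q * scale).ravel()
print(f"mean abs error: {np.abs(reconstructed - w).mean():.4f}")
print(f"bits per weight: ~{4 + 16 / 128:.2f}")                       # 4-bit codes plus amortized fp16 scale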
Example: convert HF model to GGUF and quantize (llama.cpp tools)
Work on a machine with enough RAM (desktop or cloud) where conversion runs comfortably.
# clone llama.cpp and build (ARM optimization will be used on Pi 5 later)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j$(nproc)
# Convert the HF checkpoint to GGUF (the converter script has been renamed across
# releases; in current llama.cpp it is convert_hf_to_gguf.py)
python3 convert_hf_to_gguf.py /path/to/hf-7b --outfile /tmp/7b.gguf
# Quantize with llama.cpp's quantizer (llama-quantize in newer builds, quantize in older ones)
./llama-quantize /tmp/7b.gguf /tmp/7b.q4_0.gguf q4_0
Notes: newer llama.cpp builds add q5_K/q6_K "K-quant" variants and importance-matrix (IQ) quant types; check the repo README for the exact quantizer names.
Example: GPTQ / AWQ conversion flow
GPTQ and AWQ generally require running scripts on a beefy machine and output a quantized set of parameters which you then convert to a runtime-friendly container.
# clone a community-maintained AWQ/GPTQ implementation
# (e.g. mit-han-lab/llm-awq; AutoAWQ and AutoGPTQ are common alternatives)
git clone https://github.com/mit-han-lab/llm-awq.git
cd llm-awq
pip install -r requirements.txt
# Run AWQ quantization (illustrative invocation; entry point and flag names
# differ between implementations, so check the repo's README)
python -m awq.quantize \
--model-name /path/to/hf-7b \
--out /tmp/7b-awq \
--wbits 4 \
--group-size 128
# Convert the AWQ output to GGUF (use a community converter or HF scripts;
# the script name below is illustrative)
python convert_awq_to_gguf.py --input /tmp/7b-awq --output /tmp/7b-awq.gguf
Tradeoffs: AWQ/GPTQ yield higher quality than naive q4 quantizers but take longer to compute and require a larger machine for conversion.
4) Pruning: when and how
Pruning removes parameters. It's a heavier hammer than quantization and needs care:
- Unstructured magnitude pruning removes individual weights — smaller memory wins but often needs fine-tuning to recover accuracy.
- Structured pruning removes neurons/heads/whole blocks — results in a leaner compute graph that some runtimes can exploit, but quality loss can be greater.
Recommended workflow for edge projects: apply light (10–30%) structured pruning on attention heads and feed-forward dimensions, then finetune for a few epochs with low learning rate. Use Hugging Face Optimum or SparseML recipes to automate the process.
Practical pruning example (magnitude + finetune)
# Install sparseml and transformers
pip install sparseml transformers accelerate
# Minimal SparseML sketch (the recipe file and its pruning targets are project-specific)
from transformers import AutoModelForCausalLM
from sparseml.pytorch.optim import ScheduledModifierManager

model = AutoModelForCausalLM.from_pretrained("/path/to/hf-7b")
manager = ScheduledModifierManager.from_yaml("pruning_recipe.yaml")  # hypothetical recipe path
manager.apply(model)  # one-shot apply; for gradual pruning, drive the manager from the training loop
# ...then finetune normally and export the pruned checkpoint
Note: Full recipes are project-specific. For a Pi 5 target, prune conservatively (10–25%) and always validate the downstream quality metrics most relevant to your application.
5) Target runtimes and deployment options
There are three practical runtime patterns for Pi 5 + AI HAT+ 2:
- CPU-only with llama.cpp / GGML (GGUF) — simplest and highly optimized for ARM (NEON). Best for privacy and minimal stack complexity.
- ONNX Runtime with NPU provider — convert model to ONNX and run with vendor-supplied NPU provider for AI HAT+ 2; gives lower latency and higher throughput, but conversion and provider stability vary.
- Hybrid split — run early transformer layers on NPU and later layers on CPU, or stream activations; complicated but useful for very large models.
llama.cpp example on Pi 5
# On the Pi 5 (after building llama.cpp; the binary is ./main in older builds, ./llama-cli in newer ones)
# Copy quantized gguf to Pi: /home/pi/models/7b.q4_0.gguf
./main -m /home/pi/models/7b.q4_0.gguf -p "Translate to Spanish: Hello world" -n 256
Tuning flags: use -t N to set the thread count, -n to cap generated tokens, and -b for batch size (where supported). --prompt-cache lets you reuse a prompt's KV state across runs, and --mlock (where available) keeps weights resident to reduce swap churn.
ONNX + NPU (AI HAT+ 2) example
Vendor NPU stacks vary. The canonical path is:
- Export HF model to ONNX
- Optimize/quantize ONNX for the NPU (vendor CLI)
- Run with ONNX Runtime using the NPU execution provider
# HF -> ONNX (using Optimum's exporter; older transformers versions shipped `python -m transformers.onnx`)
optimum-cli export onnx --model hf-7b /tmp/onnx_model
# Optimize/quantize ONNX with vendor tool (pseudo)
vendor-npu-opt --input /tmp/onnx_model --output /tmp/onnx_npu
# Run with onnxruntime (assuming NPU EP available)
python run_onnx_inference.py --model /tmp/onnx_npu --provider VendorNPU
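run_onnx_inference.py above is a placeholder; a minimal version might look like the sketch below. The provider name and paths are assumptions, input names and dtypes must match your export, and a real generation loop would also carry the KV cache.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Placeholder provider name; substitute whatever your AI HAT+ 2 SDK registers with ONNX Runtime.
providers = ["VendorNPUExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("/tmp/onnx_npu/model.onnx", providers=providers)

tokenizer = AutoTokenizer.from_pretrained("/path/to/hf-7b")
inputs = tokenizer("Translate to Spanish: Hello world", return_tensors="np")
# Input names/dtypes must match the exported graph (commonly input_ids / attention_mask as int64).
feed = {k: v.astype(np.int64) for k, v in inputs.items()}
logits = session.run(None, feed)[0]
next_token = int(np.argmax(logits[0, -1]))
print(tokenizer.decode([next_token]))
# This emits a single next token; real generation appends it and re-runs,
# ideally with past_key_values exported so each step is incremental.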
Tradeoffs: ONNX runtime+NPU can give large speedups and make 13B models usable. The downside: toolchain compatibility, conversion artifacts (ops not supported), and debugging complexity.
6) Runtime tuning & system tweaks for Pi 5
Use these practical tips to avoid out-of-memory and improve latency:
- Enable zram and tune swappiness; for low-latency workloads prefer zram over disk swap.
- Pin the process to specific cores (set GOMP_CPU_AFFINITY or use taskset) to avoid scheduling jitter.
- If running llama.cpp, build with ARM/NEON optimizations enabled (recent builds detect this automatically on arm64; check the build flags in the README) and enable mlock where supported to reduce page faults.
- When using ONNX Runtime, ensure the NPU provider plugin is installed and operator versions match; run a small operator conformance test during setup (see the sketch after this list).
- Measure end-to-end latency with representative prompts and account for tokenization and streaming output time.
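Here is one way to do that conformance check (a sketch; the provider name, test model path, and input name are placeholders for whatever your AI HAT+ 2 SDK ships):
import numpy as np
import onnxruntime as ort

print("Available providers:", ort.get_available_providers())

# Run the same tiny model on the NPU provider and on CPU, then compare outputs.
# This fails loudly if the NPU provider isn't registered, which is the point of the check.
sess_npu = ort.InferenceSession("tiny_test.onnx", providers=["VendorNPUExecutionProvider"])
sess_cpu = ort.InferenceSession("tiny_test.onnx", providers=["CPUExecutionProvider"])
feed = {"input": np.random.randn(1, 64).astype(np.float32)}
diff = np.abs(sess_npu.run(None, feed)[0] - sess_cpu.run(None, feed)[0]).max()
print("max abs diff vs CPU:", float(diff))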
7) Benchmarks: realistic expectations (2026)
Benchmarks vary by model, quantization, and prompt, but here are practical ballpark figures (observational ranges):
- CPU-only, 7B, q4_0 on Pi 5: ~5–20 tokens/s (single prompt generation, depends on thread count)
- Pi 5 + AI HAT+ 2 NPU, 7B quantized via ONNX: ~30–100 tokens/s for well-optimized pipelines
- 13B requires NPU offload or highly optimized GPTQ + ONNX to be interactive; expect longer latencies if using CPU-only
Always benchmark with your real prompts; synthetic throughput numbers are optimistic.
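For a quick, reproducible measurement on the Pi itself, the llama-cpp-python bindings work well (a sketch; the model path, context size, and thread count below are assumptions for an 8 GB Pi 5):
# pip install llama-cpp-python
import time
from llama_cpp import Llama

llm = Llama(model_path="/home/pi/models/7b.q4_0.gguf", n_ctx=2048, n_threads=4)

start = time.time()
out = llm("Summarize the plot of Hamlet in three sentences.", max_tokens=128)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
The figure includes tokenization and sampling time, which is what your users actually experience.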
8) End-to-end example: convert, quantize (AWQ), run on Pi 5 with llama.cpp
This recipe assumes you have a desktop with GPU to perform AWQ/GPTQ steps and a Pi 5 to run inference.
- On desktop: download HF model and run AWQ
- Convert AWQ artifacts to GGUF
- Copy to Pi 5 and run llama.cpp
# On desktop
git clone https://github.com/mit-han-lab/llm-awq.git
cd llm-awq
pip install -r requirements.txt
# Illustrative invocation; flag names vary by implementation (see section 3)
python -m awq.quantize --model hf-7b --out /tmp/7b-awq --wbits 4 --group-size 128
# Converter script (repo-specific)
python convert_awq_to_gguf.py --input /tmp/7b-awq --output /tmp/7b-awq.gguf
# Copy to the Pi 5 (run from the desktop), then SSH in and run inference
scp /tmp/7b-awq.gguf pi@pi5:/home/pi/models/
ssh pi@pi5
cd ~/llama.cpp
./main -m /home/pi/models/7b-awq.gguf -p "Summarize: " -n 256 -t 6
Adjust -t to match your CPU core count. For Pi 5 + AI HAT+ 2 NPU, experiment with the ONNX path for additional speed.
9) Common pitfalls and how to avoid them
- Out-of-memory on startup: quantize more aggressively or reduce context window.
- Model conversion fails due to unsupported ops: switch to an alternate converter or adjust export opset version.
- Inferior quality after quantization: try AWQ/GPTQ instead of naive q4 and preserve specific layers in higher precision (e.g., first/last layers FP16).
- NPU driver mismatches: pin vendor plugin versions and test a small model first to validate the execution provider.
10) Security, privacy and compliance notes
Running LLMs locally improves privacy by default, but you still need to verify model provenance, licensing, and possible data leakage risks. Keep models up to date with security patches and maintain a secure OTA path for model updates if your device fleet will be deployed. Consider automating your compliance checks (legal & operational) with CI integration and policy tooling.
2026 trends & futureproofing
Looking ahead in 2026, expect these trends to impact your Pi deployments:
- Better 4-bit algorithms: AWQ and successor algorithms will narrow the quality gap with int8 while keeping memory tiny.
- NPU runtime maturity: more stable, standardized NPU providers for ONNX Runtime and TVM will reduce vendor lock-in pain.
- Sparse/dense hybrid models: new model families will natively support sparsity, making pruning safer and more effective.
- Standardized edge formats: GGUF and ONNX will continue to be key interchange points; expect tooling to simplify conversions dramatically.
Rule of thumb (2026): prefer a 7B baseline + AWQ/GPTQ 4-bit + GGUF for most Pi 5 apps; move to ONNX+NPU only when you need extra throughput or a 13B model.
Actionable checklist
- Choose a baseline model (start 7B). Check HF for quantization-ready checkpoints.
- Set up a conversion workstation (GPU-enabled) for AWQ/GPTQ runs.
- Quantize with AWQ/GPTQ to 4-bit, convert to GGUF and ONNX.
- Test with llama.cpp on Pi 5; profile tokens/s and memory.
- If needed, convert ONNX and test NPU path with vendor runtime.
- Monitor quality; prune + finetune only as a last step if memory still tight.
Closing: tradeoffs recap and decision guide
Every choice is a balance between memory, latency, throughput and accuracy:
- Quantization — highest payoff, low operational complexity. Prefer AWQ/GPTQ when quality matters.
- Pruning — extra memory and compute reduction but increases engineering effort and risk to accuracy.
- Runtime — llama.cpp = easiest and most robust for CPU; ONNX+NPU = best throughput but costlier conversion & debugging.
Final takeaways
As of 2026, the Raspberry Pi 5 combined with an AI HAT+ 2 is a viable platform for private, low-latency LLM inference — provided you use modern quantization (AWQ/GPTQ), plan memory carefully, and pick the runtime that maps to your latency and accuracy goals. Start small (7B), iterate conversion offline, and benchmark on the actual Pi early in the process.
Call to action
Ready to try this on your Pi 5? Pick a 7B checkpoint and follow the conversion recipe above. If you want a tailor-made conversion checklist for your exact model and Pi configuration, share your model name and Pi memory size, and I'll return a tuned pipeline with exact commands and expected runtimes for your setup.