
Getting Started with the Raspberry Pi 5 AI HAT+ 2: A Practical Edge AI Workshop

programa
2026-01-21
10 min read

Hands-on workshop: set up Raspberry Pi 5 + AI HAT+ 2, convert and quantize a small generative model to ONNX, enable NPU acceleration, and optimize latency and power.

Get a real edge AI dev workflow on Raspberry Pi 5 with the AI HAT+ 2 — fast, local inference without the cloud

You want to prototype generative AI on-device, not wait for cloud queues or pay huge inference bills. The Raspberry Pi 5 paired with the new AI HAT+ 2 makes that possible, but getting usable latency and reasonable power draw requires practical choices: the right OS image, a compact model, quantization, and vendor NPU drivers. This workshop walks you through a complete, repeatable setup for running a small generative model locally in 2026, with real developer tips for squeezing performance and minimizing power.

Why this matters in 2026

Edge AI is a mainstream engineering requirement for teams shipping privacy-sensitive or offline-capable experiences. In late 2025 and early 2026, the ecosystem matured: lightweight generative models became deployable to NPUs on hobbyist hardware, ONNX Runtime added broader edge provider support, and vendor SDKs distributed optimized NPU drivers for single-board computers. The Raspberry Pi 5 + AI HAT+ 2 gives you a low-cost, local inference platform for prototypes, kiosks, and field devices.

Prototyping edge generative AI is no longer just academic — it’s feasible on a $200–$300 kit with the right model and optimizations.

What you’ll build in this workshop

  • Hardware and OS setup for Raspberry Pi 5 with AI HAT+ 2
  • Convert a compact generative model (example: DistilGPT-style) to ONNX
  • Quantize the model for edge inference
  • Run local generation with ONNX Runtime, attempt NPU acceleration
  • Measure latency and power, and apply targeted performance tuning

Prerequisites

  • Raspberry Pi 5 (64-bit recommended)
  • AI HAT+ 2 (official vendor hat with NPU and drivers)
  • microSD card or SSD running an up-to-date Raspberry Pi OS (64-bit)
  • USB-C power supply rated for sustained loads, plus active cooling (recommended)
  • Host dev machine with Git, Python 3.11+ (for model conversion)
  • Familiarity with Python, virtualenv, and basic Linux

1) Hardware assembly and first-boot checklist

  1. Attach the AI HAT+ 2 to the Raspberry Pi 5 per vendor instructions. Use the supplied standoffs and cable to ensure secure connections.
  2. Install active cooling — thermal headroom matters when the NPU and CPU run simultaneously.
  3. Flash Raspberry Pi OS (64-bit) — avoid 32-bit images. Use the Raspberry Pi Imager and select the latest 64-bit release (as of 2026 many vendor drivers target 64-bit only).
  4. On first boot, update packages:
    sudo apt update && sudo apt full-upgrade -y
  5. Confirm kernel modules and vendor drivers. The AI HAT+ 2 vendor typically ships a kernel module and user-space SDK — copy or install them now and reboot.

Quick check commands

uname -a
# check dmesg for hat initialization
dmesg | grep -i hat
# list PCI/connected devices
lspci -vv
lsusb
dmesg | grep -i npu

If the HAT exposes its NPU to the kernel, the vendor docs describe the matching user-space runtime. Keep that SDK handy; later we will point ONNX Runtime at the NPU execution provider if it is installed.

2) Prepare a reproducible Python environment

Create a lightweight virtual environment and install necessary packages. We keep the on-device footprint small — heavy conversions run on your workstation, while inference runs on the Pi.

# on Raspberry Pi 5
sudo apt install -y python3-venv python3-pip git build-essential
python3 -m venv ~/pi-ai-env
source ~/pi-ai-env/bin/activate
pip install --upgrade pip
# ONNX and ONNX Runtime (CPU) - vendor NPU provider is optional
pip install onnx==1.15.0 onnxruntime==1.17.0 tokenizers transformers

Note: In 2026, ONNX Runtime releases often publish vendor-specific wheels for NPUs. If your AI HAT+ 2 vendor provides an ONNX Runtime execution provider (EP), follow their installation guide to add it; it will then show up in ort.get_available_providers() and can be requested when you create a session (see below).
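
A quick way to see which execution providers your ONNX Runtime build knows about (CPU only until a vendor EP is installed):

python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"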

3) Choose a compact generative model

A full LLM is impractical for this workshop. Instead, use a compact model that still produces useful generative demos and converts cleanly to ONNX:

  • distilgpt2 or a distilled GPT-2 style model — small and fast
  • Other tiny open models (quantized) from Hugging Face with permissive licenses

On your workstation (not the Pi), export the Hugging Face model to ONNX so the heavy conversion work stays off the device. We use the optimum exporter, the maintained successor to the transformers.onnx helper.

Export to ONNX (workstation)

# install the exporter tooling (one approach; check the current optimum docs for flags)
python -m pip install "optimum[exporters]" transformers onnx
# export distilgpt2 as a causal-LM ONNX graph plus tokenizer files
optimum-cli export onnx --model distilgpt2 --task text-generation distilgpt2_onnx/
# the exporter writes model.onnx and tokenizer.json (among other files) into distilgpt2_onnx/
mv distilgpt2_onnx/model.onnx distilgpt2.onnx

Copy distilgpt2.onnx and the tokenizer files (at minimum tokenizer.json from distilgpt2_onnx/) to the Pi (scp, rsync, or USB).
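
For example (the hostname and target directory are placeholders; adjust for your setup):

scp distilgpt2.onnx distilgpt2_onnx/tokenizer.json pi@raspberrypi.local:~/ai-demo/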

4) Quantize the ONNX model for edge inference

Quantization reduces model size and speeds inference. We’ll use dynamic quantization for simplicity; later you can try INT8 calibration if the vendor NPU supports it.

# the quantization helpers live in onnxruntime.quantization (no extra packages needed)
python - <<'PY'
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic('distilgpt2.onnx', 'distilgpt2.quant.onnx', weight_type=QuantType.QInt8)
print('Quantized model saved as distilgpt2.quant.onnx')
PY

Dynamic quantization trades a small accuracy cost for big speed and memory improvements. If your AI HAT+ 2 NPU supports specific quantization formats (e.g., int8 with calibration), follow the vendor docs to produce vendor-optimized files.
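
For reference, static INT8 quantization with ONNX Runtime looks roughly like the sketch below. It is a minimal example: the random calibration batches are placeholders you should replace with tokenized real prompts, and some vendor NPUs require their own calibration toolchain instead.

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class PromptCalibrationReader(CalibrationDataReader):
    # Feeds a handful of representative, tokenized prompts to the calibrator
    def __init__(self, batches):
        self._iter = iter(batches)
    def get_next(self):
        return next(self._iter, None)

# Placeholder calibration data: keys must match the model's declared input names,
# and the random ids should be replaced with tokenized real prompts
calibration_batches = [
    {'input_ids': np.random.randint(0, 50257, size=(1, 32), dtype=np.int64),
     'attention_mask': np.ones((1, 32), dtype=np.int64)}
    for _ in range(8)
]

quantize_static(
    'distilgpt2.onnx',
    'distilgpt2.int8.onnx',
    calibration_data_reader=PromptCalibrationReader(calibration_batches),
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QInt8,
    per_channel=True,
)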

5) Run inference on the Pi using ONNX Runtime

Place the ONNX files and tokenizer on the Pi. A minimal generation loop using tokenizers and ONNX Runtime looks like this:

cat > generate.py <<'PY'
import time
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load the tokenizer.json exported alongside the model
tokenizer = Tokenizer.from_file('tokenizer.json')
model_path = 'distilgpt2.quant.onnx'

# Create a session and inspect which execution providers are active
sess = ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])
print('Providers:', sess.get_providers())
input_names = [i.name for i in sess.get_inputs()]
print('Model inputs:', input_names)

def tokens_to_text(ids):
    return tokenizer.decode([int(i) for i in ids])

def make_feed(ids):
    # Feed only the inputs the exported graph declares. Most causal-LM exports
    # want input_ids plus attention_mask; some also take position_ids.
    # If your export declares past_key_values, use a cached-decoding loop instead.
    feed = {'input_ids': ids}
    if 'attention_mask' in input_names:
        feed['attention_mask'] = np.ones_like(ids)
    if 'position_ids' in input_names:
        feed['position_ids'] = np.arange(ids.shape[1], dtype=np.int64)[None, :]
    return feed

prompt = 'Write a short product description for a portable AI device:'
encoded = tokenizer.encode(prompt)
input_ids = np.array([encoded.ids], dtype=np.int64)

# Warm-up run (the first call pays graph optimization and initialization costs)
_ = sess.run(None, make_feed(input_ids))

# Timed single forward pass
start = time.perf_counter()
outputs = sess.run(None, make_feed(input_ids))
latency = time.perf_counter() - start
print('Latency (s):', latency)

# outputs[0] holds the logits [batch, seq_len, vocab]; run a few greedy steps
ids = input_ids
for _ in range(20):
    logits = sess.run(None, make_feed(ids))[0]
    next_id = int(np.argmax(logits[0, -1]))
    ids = np.concatenate([ids, np.array([[next_id]], dtype=np.int64)], axis=1)
print(tokens_to_text(ids[0]))
PY

Run:

source ~/pi-ai-env/bin/activate
python generate.py

Inspect the available providers with ort.get_available_providers(). If you installed a vendor NPU provider, it should appear there (for example, 'AIHATExecutionProvider' or a similar vendor-specific name). To enable it, create the session with that provider name and any provider-specific options.

6) Attempt NPU acceleration

If the AI HAT+ 2 vendor provides an ONNX Runtime execution provider, install their package and pass the provider name to the session creation. Typical steps:

  1. Install vendor runtime per their instructions (may be a .deb, wheel, or bundled library).
  2. Confirm provider appears:
    import onnxruntime as ort
    print(ort.get_available_providers())
  3. Create session with the vendor provider:
    sess = ort.InferenceSession(model_path, providers=['VendorNPUExecutionProvider', 'CPUExecutionProvider'])

Important: vendor NPUs often require specific data types and layouts (for example, int8 with per-channel quantization). If the provider returns an error or silently falls back to CPU, check the runtime logs and convert the model to the vendor's expected format.
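
To confirm where nodes actually execute, enable verbose ONNX Runtime logging when creating the session; the verbose output includes how the graph was partitioned across providers. A minimal sketch (the provider name below is a placeholder for your vendor's EP):

import onnxruntime as ort

# Log severity: 0 = VERBOSE, 1 = INFO, 2 = WARNING (default), 3 = ERROR
so = ort.SessionOptions()
so.log_severity_level = 0

# 'VendorNPUExecutionProvider' is a placeholder; use the name your vendor's
# package registers (see ort.get_available_providers())
sess = ort.InferenceSession(
    'distilgpt2.quant.onnx',
    sess_options=so,
    providers=['VendorNPUExecutionProvider', 'CPUExecutionProvider'],
)
print(sess.get_providers())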

7) Measure latency and power — metrics that matter

Measure end-to-end latency (prompt → generated tokens) and power draw under workload. Use these tools and tips:

  • Latency: include warm-up runs, then report the median (and ideally p95) across 20+ requests.
  • Power: attach a USB-C power meter inline with your supply, or use INA219/INA3221 on I2C if the hat supports board-level telemetry.
  • CPU/GPU/NPU utilization: htop, vendor tooling, or tegrastats-like utilities from the vendor.

Sample latency measurement (Python):

# reuses sess, input_ids, and make_feed from generate.py
import time
N = 30
times = []
for _ in range(N):
    t0 = time.perf_counter()
    sess.run(None, make_feed(input_ids))
    times.append(time.perf_counter() - t0)
times.sort()
print('median (s):', times[len(times) // 2])
print('p95 (s):', times[int(len(times) * 0.95)])

8) Practical performance and power tips

Here are targeted, actionable optimizations that repeatedly deliver results in the field.

  • Model size and token budget: Shorter prompts and context windows cut compute directly (attention cost grows faster than linearly with sequence length). Pre-truncate the prompt when possible.
  • Quantization strategy: Start with dynamic quantization; move to per-channel INT8 calibration if accuracy demands it and the vendor NPU supports it (see the static quantization sketch in section 4).
  • Batching: For low-concurrency devices, keep batch size 1 to reduce latency. Use micro-batching only when throughput matters more than per-request latency.
  • CPU governor: For consistent latency, set the CPU governor to performance during inference:
    sudo apt install -y cpufrequtils
    sudo cpufreq-set -g performance
  • Thermal management: Forced-air cooling prevents thermal throttling. Monitor CPU frequency and temperature over your test run to detect throttling (see the monitoring snippet after this list).
  • Memory pressure: Avoid swap; if you must use swap, place it on a fast SSD and keep swap size minimal. Prefer quantization to reduce RAM footprint.
  • Selective offload: Use CPU for tokenization and lightweight orchestration, NPU for heavy matrix ops (logits). Keep data copies between devices minimal — use zero-copy if the provider supports it.
  • Power profiles: If the device will run on battery, test CPU frequency scaling vs. latency trade-offs and choose balanced governors or custom frequency capping.
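
On Raspberry Pi OS, vcgencmd gives a quick view of temperature, clock, and throttling state while a benchmark runs; leave this in a second terminal during your test run:

# get_throttled returns a bitmask; 0x0 means no under-voltage or throttling events
watch -n 1 'vcgencmd measure_temp; vcgencmd measure_clock arm; vcgencmd get_throttled'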

9) Debugging common failures

  • Provider not found — confirm vendor runtime installed and on PATH/LIBRARY_PATH.
  • Model layout/dtype mismatch — check that the quantization format matches provider expectations (int8 vs. float16).
  • OOM on device — further quantize or shrink context length; use smaller model variant.
  • High latency with NPU enabled — ensure the model's ops are supported by the NPU; unsupported ops often fall back to CPU causing data shuttling overhead.
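
To check op coverage, list the operator types in your exported graph and compare them against the vendor's supported-op documentation; a small sketch using the onnx package:

import onnx

# Enumerate the operator types the graph uses; anything the NPU provider
# cannot handle is partitioned back onto the CPU at session creation time
model = onnx.load('distilgpt2.quant.onnx')
print(sorted({node.op_type for node in model.graph.node}))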

10) Next steps and scaling tips

Once your prototype runs, consider:

  • Creating a tiny Flask or FastAPI wrapper for local HTTP inference (REST/gRPC); a minimal sketch follows this list.
  • Packaging as a systemd service for headless deployment.
  • Improving robustness with watchdogs and cold-start caches.
  • Exploring distillation and quantization-aware training (QAT) to regain quality lost to post-training quantization.
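
A minimal FastAPI wrapper could look like the sketch below (pip install fastapi uvicorn); the generate() helper is a placeholder you would wire to the ONNX Runtime loop from generate.py.

# server.py - minimal local HTTP wrapper (sketch)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 20

def generate(prompt: str, max_new_tokens: int) -> str:
    # Placeholder: replace with the ONNX Runtime greedy loop from generate.py
    return prompt + ' ...'

@app.post('/generate')
def generate_endpoint(req: Prompt):
    return {'completion': generate(req.text, req.max_new_tokens)}

# run with: uvicorn server:app --host 0.0.0.0 --port 8000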

Example: systemd service snippet (concept)

[Unit]
Description=Pi AI service
After=network.target

[Service]
User=pi
WorkingDirectory=/home/pi/ai-demo
ExecStart=/home/pi/pi-ai-env/bin/python server.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
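
To install and start it (assuming you saved the unit as pi-ai.service):

sudo cp pi-ai.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now pi-ai.service
journalctl -u pi-ai.service -f    # follow service logs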

Actionable takeaways

  • Start small: use tiny models and dynamic quantization to get working prototypes fast.
  • Measure early: latency and power metrics guide optimization choices — don’t guess.
  • Use vendor drivers carefully: NPUs help, but only when the provider supports your model's ops and data types.
  • Balance power vs performance: CPU governor, thermals, and prompt length are the easiest levers to tune.

Troubleshooting checklist

  1. Confirm OS is 64-bit and fully updated.
  2. Verify AI HAT+ 2 vendor runtime installation and kernel modules.
  3. Test CPU-only ONNX Runtime first — get correct outputs, then enable NPU provider.
  4. If NPU falls back to CPU, check op coverage logs from vendor provider.
  5. If memory is tight, further quantize or migrate to a smaller model.

Looking ahead in 2026, several trends should inform your architecture choices:

  • ONNX continues to solidify as the de facto interchange format; invest in ONNX-centric pipelines.
  • Vendor execution providers for hobbyist NPUs will stabilize, with improved tooling for calibration and profiling.
  • Smaller, task-specific generative models (instruction-tuned micro-models) will give better edge utility than naïvely shrinking LLMs.
  • Tooling for automated quantization-aware training (QAT) and neural architecture search will migrate from cloud labs to lightweight on-prem toolchains.

Final checklist before you ship a prototype

  • Reproduce inference results across temperature and power profiles.
  • Log and monitor both latency and accuracy drift in deployment.
  • Provide fallback paths (shorter prompts, degraded mode) when thermal throttling or memory limits occur.
  • Audit model licenses and data privacy: local inference reduces data egress but still requires governance.

Summary

The Raspberry Pi 5 + AI HAT+ 2 opens realistic paths to on-device generative AI in 2026. The practical recipe that works: pick a compact model, export and quantize to ONNX, use ONNX Runtime (with vendor NPU provider when it matches your model), and tune power and thermal settings. Measure early and iterate — the biggest wins are often prompt engineering, quantization, and eliminating unnecessary token context.

Call to action

Ready to try this workshop? Clone the companion repo (model export scripts, example ONNX files, and tuned configs) at our GitHub and follow the step-by-step README to reproduce this setup: github.com/programa-space/raspi5-ai-hat2. Share your latency & power numbers and open an issue for device-specific tips — we’ll keep this guide current with new vendor provider releases throughout 2026.

Quick start link: Boot the Pi, install the environment, copy the quantized ONNX file, and run python generate.py — you should have a working local generator within an hour.


Related Topics

#hardware, #edge AI, #tutorial

programa

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
