Docker + Triton + ONNX: CI/CD Patterns for Deploying LLMs to Edge Devices
Practical CI/CD pipelines for packaging, testing, and safely deploying quantized LLMs to Raspberry Pi 5 + AI HAT+ 2 using Docker, Triton, and ONNX.
Deploying quantized LLMs to Raspberry Pi 5 at scale shouldn't be guesswork
Keeping models updated and reliable on a fleet of Raspberry Pi 5 + AI HAT+ 2 devices is one of the hardest parts of modern edge ML. Teams struggle with repeatable packaging, CI/CD pipelines, cross-architecture builds, hardware-in-the-loop testing, and safe rollbacks when inference quality or latency regresses. This guide gives you concrete CI/CD pipelines and patterns — using Docker, Triton, and ONNX — so you can ship quantized LLMs to Pi-based edge hardware with confidence in 2026.
Why this matters in 2026
Late 2025 and early 2026 brought two important shifts: commodity ARM single-board computers like the Raspberry Pi 5 paired with the AI HAT+ 2 now provide practical NPU acceleration for on-device LLMs, and the inference ecosystem has matured around standards like ONNX and inference orchestration with Triton. Teams can now build deterministic CI pipelines that produce compact, quantized ONNX artifacts and deploy them via Docker images or model-artifact registries to fleets of Pis, but only if they treat packaging, tests, and rollbacks as first-class citizens.
High-level CI/CD pattern
Here’s the end-to-end flow I recommend:
- Source control and model provenance (Git + model-artifact registry)
- Model conversion + quantization stage (reproducible build container)
- Automated unit and hardware-in-the-loop tests (simulator + sample Pi device)
- Multi-arch Docker image build with Triton + ONNX model repository
- Image signing, SBOM generation, and push to registry
- Progressive rollout to edge (canary -> regional -> full) with health checks
- Monitoring, model metrics, and automated rollback triggers
The value of separating model artifacts from server images
Keep the ONNX model files and Triton model repository separate from the Triton runtime image. That lets you deploy new model versions without rebuilding core runtime images and simplifies rollbacks: swap the model-version symlink in the repository or update the container-mounted model volume.
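For example, the on-device runtime container can mount the model repository read-only from the host, so shipping a new model version only touches the host directory; the image name, paths, and flags below are illustrative:

# Run the Triton runtime image with a host-mounted model repository
docker run -d --name llm-triton \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /opt/edge/model_repository:/models:ro \
  ghcr.io/my-org/llm-triton:stable \
  tritonserver --model-repository=/models --model-control-mode=explicit --load-model=llm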
Concrete CI pipeline: GitHub Actions example
Below is an actionable GitHub Actions workflow that demonstrates stages from model conversion and quantization to a signed multi-arch image push. This is intentionally compact — adapt to your internal tools.
# .github/workflows/ci-cd-llm-edge.yml
name: Build and Deploy LLM to Edge
on:
push:
branches: [main]
jobs:
build-quantize:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install model tooling
run: |-
pip install torch transformers onnx onnxruntime onnxruntime-tools
- name: Convert HF -> ONNX
run: |
python tools/convert_to_onnx.py \
--model-id my-org/llm-small \
--output models/llm/1/model.onnx
- name: Quantize ONNX (post-training quant)
run: |
python tools/quantize_onnx.py \
--input models/llm/1/model.onnx \
--output models/llm/1/model.quant.onnx \
--mode dynamic --op-types MatMul,Conv
- name: Run unit inference tests
run: |
pytest tests/unit --onnx models/llm/1/model.quant.onnx
- name: Upload model artifact
uses: actions/upload-artifact@v4
with:
name: llm-model
path: models/llm/1/
build-and-push-image:
needs: build-quantize
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Download model
uses: actions/download-artifact@v4
with:
name: llm-model
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build & push multi-arch Triton image
run: |
docker buildx build . \
--platform linux/amd64,linux/arm64 \
--file docker/Dockerfile.triton \
--tag ghcr.io/my-org/llm-triton:stable \
--push
- name: Sign image (cosign)
env:
COSIGN_PASSWORD: ${{ secrets.COSIGN_PASSWORD }}
run: |
cosign sign --key cosign.key ghcr.io/my-org/llm-triton:stable
deploy-canary:
needs: build-and-push-image
runs-on: ubuntu-latest
steps:
- name: Trigger edge deploy (Mender or fleet controller)
run: |
curl -X POST -H "Authorization: Bearer ${{ secrets.FLEET_TOKEN }}" \
-H 'Content-Type: application/json' \
https://edge-controller.example.com/api/v1/deploy \
-d '{"image":"ghcr.io/my-org/llm-triton:stable","strategy":"canary","devices":["pi-canary-01"]}'
Model conversion and quantization best practices
Quantizing LLMs for Pi devices requires trade-offs. Use these practical rules:
- Reproducible containerized conversion — run conversion and quantization inside a pinned Docker image with explicit tool versions so artifacts are identical across runs.
- Prefer post-training dynamic quantization for CPUs and NPUs where calibration data is limited; use QAT when accuracy must be preserved and retraining is feasible.
- Validate numerics — run a suite of representative prompts and compare logits/top-k outputs between FP32 and quantized ONNX to quantify drift (a helper sketch follows this list).
- Store calibration metadata (calibration dataset hash, quantization knobs) next to the artifact for traceability.
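A minimal numerics check can compare top-k agreement between the FP32 and quantized graphs with ONNX Runtime. The sketch below assumes the input_ids/logits tensor names used elsewhere in this guide and a pass threshold you choose per model:

# tools/check_quant_drift.py (sketch; names and threshold are examples)
import numpy as np
import onnxruntime as ort

def topk_overlap(fp32_path, quant_path, input_ids, k=5):
    # Fraction of top-k next-token predictions shared by the FP32 and quantized graphs
    feed = {"input_ids": np.asarray([input_ids], dtype=np.int64)}
    sessions = [ort.InferenceSession(p, providers=["CPUExecutionProvider"])
                for p in (fp32_path, quant_path)]
    last_logits = [s.run(["logits"], feed)[0][0, -1] for s in sessions]
    top = [set(np.argsort(l)[-k:].tolist()) for l in last_logits]
    return len(top[0] & top[1]) / k

# Example gate: fail the build if agreement drops below 0.8 on any probe prompt
# assert topk_overlap("model.onnx", "model.quant.onnx", [1, 2, 3, 4]) >= 0.8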
Sample conversion script outline
# tools/convert_to_onnx.py (outline)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load model and tokenizer
# Export to ONNX with dynamic axes for batch/sequence
# Save the exported model to models/<model-name>/<version>/model.onnx (e.g., models/llm/1/model.onnx)
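Fleshing out that outline, a minimal export might look like the sketch below. It assumes the model's forward pass takes input_ids as its first argument and disables the KV cache so the graph has a single logits output; many teams use the Hugging Face Optimum exporters instead of calling torch.onnx.export directly.

# tools/convert_to_onnx.py (minimal sketch; flags match the workflow above)
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    model = AutoModelForCausalLM.from_pretrained(args.model_id)
    model.eval()
    model.config.use_cache = False      # drop past_key_values outputs for a simpler graph
    model.config.return_dict = False    # export plain tuples instead of ModelOutput objects

    dummy = tokenizer("hello world", return_tensors="pt")
    torch.onnx.export(
        model,
        (dummy["input_ids"],),
        args.output,
        input_names=["input_ids"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "logits": {0: "batch", 1: "sequence"},
        },
        opset_version=17,
    )

if __name__ == "__main__":
    main()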
Sample quantization call (ONNX Runtime Python API)
# Dynamic post-training quantization of the exported graph
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("model.onnx", "model.quant.onnx", weight_type=QuantType.QInt8)
Packaging with Triton and ONNX model repository
Triton expects a model repository layout. Structure your CI artifacts like this:
model_repository/
  llm/
    config.pbtxt
    1/
      model.quant.onnx
Example minimal config.pbtxt for a causal LLM:
name: "llm"
platform: "onnxruntime_onnx"
max_batch_size: 1
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [ -1 ]
}
]
output [
{
name: "logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
In production, tune instance_group, optimization sections, and backend-specific settings for ONNX Runtime or vendor NPU runtimes.
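As a starting point, the fragment below (appended to config.pbtxt) pins one CPU instance and caps ONNX Runtime threads; the parameter key follows the Triton ONNX Runtime backend documentation and the values are placeholders to tune per device:

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
parameters {
  key: "intra_op_thread_count"
  value: { string_value: "4" }
}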
Dockerfile pattern for multi-arch Triton + ONNX runtime
Create a lean image that installs a Triton runtime and exposes a model repository mount. Keep runtime and models decoupled.
# docker/Dockerfile.triton (concept)
FROM ubuntu:22.04
# buildx builds this stage once per target platform (linux/amd64, linux/arm64)
# Install the Triton runtime (community ARM builds available as of 2025)
# Install onnxruntime and vendor NPU SDKs as optional layers
# Create /models as the model-repository mount point
ENV TRITON_SERVER=/opt/tritonserver/bin/tritonserver
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
Note: If prebuilt Triton binaries are unavailable for your target architecture, fall back to ONNX Runtime Server or a lightweight custom gRPC/HTTP wrapper. The CI pipeline should consult your runtime compatibility matrix and choose the correct runtime image.
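If you do fall back to plain ONNX Runtime, the serving core is small. A hedged sketch (the model path and token handling are illustrative; the gRPC/HTTP layer is omitted):

# Minimal ONNX Runtime inference core for a device without a Triton build
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "/models/llm/1/model.quant.onnx",
    providers=["CPUExecutionProvider"],  # swap in a vendor NPU execution provider if available
)

def next_token_logits(input_ids):
    # input_ids: list of ints; shape becomes [batch=1, sequence] with the INT64 dtype the model expects
    feed = {"input_ids": np.asarray([input_ids], dtype=np.int64)}
    logits = session.run(["logits"], feed)[0]
    return logits[0, -1]  # logits for the last position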
Hardware-in-the-loop testing and validation
Unit tests alone are not sufficient. Add two device-level stages in CI:
- Emulated tests using QEMU + containerized Triton to validate end-to-end request/response logic on ARM.
- On-device smoke tests run on at least one Pi with AI HAT+ 2: latency, peak memory, top-k accuracy, and resource usage. Automate with SSH, Ansible, or a fleet manager. See Secure Remote Onboarding for Field Devices for tips on safely managing device access and credentials.
Example smoke test checklist (automated in the sketch after this list):
- Startup within target time (e.g., < 10s)
- P99 latency under SLA (e.g., < 500ms for single-token)
- Memory under threshold (e.g., < 90% of available)
- Perplexity or sample quality within acceptable drift vs. baseline
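A minimal on-device smoke test can be driven entirely over Triton's HTTP API. The script below is a sketch: the host, model name, probe tokens, and thresholds are examples that mirror the checklist above.

#!/usr/bin/env bash
# smoke_test.sh: run on (or via SSH to) the canary Pi; thresholds are examples
set -euo pipefail
HOST="localhost:8000"

# 1. Server must report ready within 10 seconds
timeout 10 bash -c "until curl -sf http://$HOST/v2/health/ready > /dev/null; do sleep 1; done"

# 2. Single-request latency must stay under 500 ms (KServe v2 inference protocol)
BODY='{"inputs":[{"name":"input_ids","shape":[1,4],"datatype":"INT64","data":[1,2,3,4]}]}'
LATENCY=$(curl -sf -o /dev/null -w '%{time_total}' \
  -H 'Content-Type: application/json' \
  -d "$BODY" "http://$HOST/v2/models/llm/infer")
awk -v t="$LATENCY" 'BEGIN { if (t >= 0.5) exit 1 }'

# 3. Memory headroom: fail if more than 90% of RAM is in use
free | awk '/Mem:/ { if ($3 / $2 >= 0.9) exit 1 }'

echo "smoke test passed"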
Deployment patterns: Canary, Blue-Green, and Rolling
Pick the pattern that matches your operational constraints:
- Canary — deploy to a small subset of devices, run HIL checks, then expand. Ideal when devices are reachable and you want to limit blast radius.
- Blue-Green — maintain two model repositories (blue/green) and switch symlinks or mount points atomically on-device. Good for zero-downtime switches on always-online devices.
- Rolling — sequentially update devices with health checks per-device; useful for very large fleets.
Atomic swap pattern
On-device, store model versions under /models/llm/1 and /models/llm/2 and keep a symlink /models/llm/current -> /models/llm/2. Triton can pick up repository changes automatically (poll mode) or load and unload models on demand via the model control API (explicit mode). Use this to implement a fast rollback: point the symlink back to the previous version and ask Triton to load the old model.
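A sketch of the swap itself, assuming the layout above and a shell on the device (the temporary symlink plus rename keeps the switch atomic):

# Promote version 2: build the new symlink beside the old one, then rename over it
ln -sfn /models/llm/2 /models/llm/current.tmp
mv -T /models/llm/current.tmp /models/llm/current
# Ask Triton to pick up the change (explicit model-control mode)
curl -X POST localhost:8000/v2/repository/models/llm/load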
Rollback strategies — concrete recipes
Rollbacks must be fast, observable, and safe. Here are two practical strategies:
1) Fast rollback via Triton model control and symlink
- Maintain at least two model versions on-device: stable (vN) and candidate (vN+1).
- After deploying vN+1, run health checks. If checks fail, update symlink /models/llm/current -> /models/llm/vN and call Triton model control unload/load APIs.
- Record rollback reason and trigger CI alert to capture stack traces and metrics.
# Example curl to Triton model control API
curl -X POST localhost:8000/v2/repository/models/llm/unload
# update symlink then
curl -X POST localhost:8000/v2/repository/models/llm/load
2) Image-based rollback with immutable tags
- Deploy images with immutable tags and signed manifests (e.g., ghcr.io/my-org/llm-triton:sha256-...)
- On failure, instruct device fleet manager (Mender/Balena) to redeploy previous image tag.
- Validate rollback completion via device health endpoints and metrics.
Image-based rollback is slower but safer when runtime-level changes (not just models) were deployed.
Observability and automated rollback triggers
Automate rollbacks using health and model-quality signals:
- Metrics — P50/P95/P99 latency, memory pressure, error rate (gRPC/HTTP 5xx).
- Quality signals — automated evaluation on holdout prompts producing BLEU/Rouge/perplexity or human-in-the-loop feedback aggregated via lightweight reporting.
- Watchdogs — on-device agent posts heartbeats and health status to fleet manager; missing or failing heartbeats trigger rollback policies.
Set thresholds conservatively and implement cooldown windows to avoid oscillation. For example: if P99 latency > 2x SLA for > 3 devices in the canary group for 10 minutes, trigger rollback.
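Policies like that are easiest to review and audit when they live in the repository. The YAML below is a hypothetical policy-as-code schema; the field names are illustrative rather than tied to any specific fleet manager.

# rollback-policy.yaml (hypothetical schema; adapt to your fleet controller)
policy: canary-latency-guard
trigger:
  metric: p99_latency_ms
  condition: "> 2x sla"
  min_affected_devices: 3
  sustained_minutes: 10
  scope: canary
action:
  type: rollback
  target: previous_model_version
cooldown_minutes: 30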
Security, compliance and supply-chain considerations
- Sign images and artifacts (cosign, Notary) and verify signatures on-device before activation (a verification sketch follows this list). See the AWS European Sovereign Cloud guide for patterns on cryptographic controls and isolation.
- Generate SBOMs for images and model tooling to meet audit requirements — keep them with your release artifacts and documentation (tooling for offline docs & SBOMs).
- Least-privilege runtime — run Triton under a dedicated user and limit device permissions (e.g., NPU SDK access only to the runtime).
- Model provenance — store model training commits, dataset hashes, and quantization parameters in the artifact metadata.
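On-device verification can be a single gate before a new image or model repository is activated; the key path and image tag below are placeholders:

# Refuse to activate an image whose signature does not verify against our public key
cosign verify --key /etc/keys/cosign.pub ghcr.io/my-org/llm-triton:stable || exit 1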
Edge fleet management options (practical suggestions)
Use the tool that matches your fleet size and connectivity characteristics:
- Mender — robust OTA with delta updates and rollbacks, great for constrained devices.
- Balena — easy container orchestration for small-to-medium fleets.
- Custom GitOps using lightweight agents + ArgoCD or Flux is workable when devices run k3s.
Operational checklist before first fleet rollout
- Pin runtime/tool versions and store build artifacts in an immutable registry.
- Have at least one physical canary (Pi with AI HAT+ 2) and a QEMU-based emulator for fast CI feedback.
- Automate smoke tests and maintain a runbook for manual rollback and emergency recovery.
- Instrument per-device telemetry (Prometheus + Grafana or lightweight push gateways); a scrape-config sketch follows this checklist.
- Define SLA targets and rollback triggers in code (policy-as-code).
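Triton exposes Prometheus metrics on its metrics port (8002 by default), so a minimal scrape job gives you per-device latency and resource series; the device hostnames below are placeholders:

# prometheus.yml fragment
scrape_configs:
  - job_name: "triton-edge"
    scrape_interval: 30s
    static_configs:
      - targets: ["pi-canary-01:8002", "pi-canary-02:8002"]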
2026 trends and future-proofing
Expect these trends to shape edge LLM deployments through 2026:
- Stronger vendor SDK support on ARM — more NPUs on SBCs will expose standardized runtimes; design pipelines to plug vendor backends into ONNX Runtime or Triton backend adapters.
- Model shards and streaming tokenization — artifacts may be shipped as smaller shards to fit constrained storage and streamed into Triton-like servers.
- Model signing and MLOps supply-chain standards will become mandatory in regulated industries; build signing into the pipeline now.
- Edge model orchestration frameworks will add rollout features (A/B testing, canaries) that integrate with fleet managers — design your CI to emit the metadata those systems need.
Case study: Quick end-to-end example
Team X needed to push a 1.2B-parameter LLM quantized to int8 to 100 Pi 5 devices. They used the following pattern:
- Model conversion and dynamic quantization ran in GitHub Actions containers. Artifacts were uploaded to an S3 model-artifact registry.
- Multi-arch Triton image built with buildx and signed with cosign.
- Canary rollout to 3 Pis using Mender. Automated on-device tests validated latency and sample-quality against 50 probe prompts.
- After 48 hours, rollout progressed regionally. One regional regression triggered automatic rollback to the previous model using Triton's model control API and device symlink swap.
- Post-mortem showed the regression came from a mis-tuned quantization op — fix applied in CI and re-deployed after passing the same pipeline.
Actionable takeaways
- Use containerized, reproducible conversion steps to guarantee identical ONNX quantized artifacts per CI run.
- Separate model artifacts from runtime images — makes rollbacks fast and reduces image churn.
- Automate hardware-in-the-loop validation on at least one physical Pi with AI HAT+ 2 before broad rollouts.
- Implement canary + atomic symlink swap in Triton to make rollbacks deterministic and low-latency.
- Sign images and produce SBOMs — treat the model pipeline as part of the supply chain.
Further reading and tools
- Triton Inference Server documentation (search for ONNX backend and model control API)
- ONNX Runtime quantization tools and examples
- Cosign for container signing; GitHub Actions for CI; Mender/Balena for OTA
Final notes
Deploying quantized LLMs to Raspberry Pi 5 + AI HAT+ 2 is now a practical engineering effort, not a research project — but only if you build CI/CD pipelines that treat model conversion, quantization, testing, and rollback orchestration as first-class components. The patterns above are battle-tested templates you can adapt quickly in 2026.
Call to action
Ready to implement this in your stack? Start with a one-week pilot: containerize your conversion step, set up a single Pi canary with Triton, and codify one rollback policy. If you want, drop your pipeline YAML and I’ll review it and suggest concrete improvements for performance, security, and rollback resilience.
Related Reading
- Secure Remote Onboarding for Field Devices in 2026: An Edge‑Aware Playbook for IT Teams
- Edge-Oriented Oracle Architectures: Reducing Tail Latency and Improving Trust in 2026
- AWS European Sovereign Cloud: Technical Controls, Isolation Patterns and What They Mean for Architects
- How to Build a CI/CD Favicon Pipeline — Advanced Playbook (2026)