Docker + Triton + ONNX: CI/CD Patterns for Deploying LLMs to Edge Devices

programa
2026-02-04
11 min read

Practical CI/CD pipelines for packaging, testing, and safely deploying quantized LLMs to Raspberry Pi 5 + AI HAT+ 2 using Docker, Triton, and ONNX.

Deploying quantized LLMs to Raspberry Pi 5 at scale shouldn't be guesswork

Keeping models updated and reliable on a fleet of Raspberry Pi 5 + AI HAT+ 2 devices is one of the hardest parts of modern edge ML. Teams struggle with repeatable packaging, CI/CD pipelines, cross-architecture builds, hardware-in-the-loop testing, and safe rollbacks when inference quality or latency regresses. This guide gives you concrete CI/CD pipelines and patterns — using Docker, Triton, and ONNX — so you can ship quantized LLMs to Pi-based edge hardware with confidence in 2026.

Why this matters in 2026

Late 2025 and early 2026 brought two important shifts: commodity ARM single-board computers like the Raspberry Pi 5 paired with the AI HAT+ 2 now provide practical NPU acceleration for on-device LLMs, and the inference ecosystem has matured around standards like ONNX and inference orchestration with Triton. Teams can now build deterministic CI pipelines that produce compact, quantized ONNX artifacts and deploy them via Docker images or model-artifact registries to fleets of Pis. But only if you treat packaging, tests, and rollbacks as first-class citizens.

High-level CI/CD pattern

Here’s the end-to-end flow I recommend:

  1. Source control and model provenance (Git + model-artifact registry)
  2. Model conversion + quantization stage (reproducible build container)
  3. Automated unit and hardware-in-the-loop tests (simulator + sample Pi device)
  4. Multi-arch Docker image build with Triton + ONNX model repository
  5. Image signing, SBOM generation, and push to registry
  6. Progressive rollout to edge (canary -> regional -> full) with health checks
  7. Monitoring, model metrics, and automated rollback triggers

The value of separating model artifacts from server images

Keep the ONNX model files and Triton model repository separate from the Triton runtime image. That lets you deploy new model versions without rebuilding core runtime images and simplifies rollbacks: swap the model-version symlink in the repository or update the container-mounted model volume.

Concrete CI pipeline: GitHub Actions example

Below is an actionable GitHub Actions workflow that demonstrates stages from model conversion and quantization to a signed multi-arch image push. This is intentionally compact — adapt to your internal tools.

# .github/workflows/ci-cd-llm-edge.yml
name: Build and Deploy LLM to Edge

on:
  push:
    branches: [main]

jobs:
  build-quantize:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install model tooling
        run: |-
          pip install torch transformers onnx onnxruntime onnxruntime-tools

      - name: Convert HF -> ONNX
        run: |
          python tools/convert_to_onnx.py \
            --model-id my-org/llm-small \
            --output models/llm/1/model.onnx

      - name: Quantize ONNX (post-training quant)
        run: |
          python tools/quantize_onnx.py \
            --input models/llm/1/model.onnx \
            --output models/llm/1/model.quant.onnx \
            --mode dynamic --op-types MatMul,Conv

      - name: Run unit inference tests
        run: |
          pytest tests/unit --onnx models/llm/1/model.quant.onnx

      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: llm-model
          path: models/llm/1/

  build-and-push-image:
    needs: build-quantize
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download model
        uses: actions/download-artifact@v4
        with:
          name: llm-model
          path: models/llm/1/

      - name: Set up QEMU (for arm64 emulation)
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build & push multi-arch Triton image
        run: |
          docker buildx build . \
            --platform linux/amd64,linux/arm64 \
            --file docker/Dockerfile.triton \
            --tag ghcr.io/my-org/llm-triton:stable \
            --push

      - name: Sign image (cosign)
        env:
          COSIGN_PASSWORD: ${{ secrets.COSIGN_PASSWORD }}
        run: |
          cosign sign --yes --key cosign.key ghcr.io/my-org/llm-triton:stable

  deploy-canary:
    needs: build-and-push-image
    runs-on: ubuntu-latest
    steps:
      - name: Trigger edge deploy (Mender or fleet controller)
        run: |
          curl -X POST -H "Authorization: Bearer ${{ secrets.FLEET_TOKEN }}" \
            -H 'Content-Type: application/json' \
            https://edge-controller.example.com/api/v1/deploy \
            -d '{"image":"ghcr.io/my-org/llm-triton:stable","strategy":"canary","devices":["pi-canary-01"]}'

Model conversion and quantization best practices

Quantizing LLMs for Pi devices requires trade-offs. Use these practical rules:

  • Reproducible containerized conversion — run conversion and quantization inside a pinned Docker image with explicit tool versions so artifacts are identical across runs.
  • Prefer post-training dynamic quantization for CPUs and NPUs where calibration data is limited; use QAT when accuracy must be preserved and retraining is feasible.
  • Validate numerics — run a suite of representative prompts and compare logits/top-k outputs between FP32 and quantized ONNX to quantify drift (a validation sketch follows this list).
  • Store calibration metadata (calibration dataset hash, quantization knobs) next to the artifact for traceability.
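
As a starting point for the numerics check flagged above, here is a minimal validation sketch that compares top-k agreement between the FP32 and quantized graphs with onnxruntime. The probe prompts, the top-5 window, and the 0.9 agreement threshold are illustrative assumptions, not values mandated by this pipeline:

# tools/validate_quantization.py (sketch; prompts, top-k and threshold are illustrative)
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

PROBES = ["Explain CI/CD in one sentence.", "What does ONNX stand for?"]  # assumed probe set
TOP_K = 5

tok = AutoTokenizer.from_pretrained("my-org/llm-small")
fp32 = ort.InferenceSession("models/llm/1/model.onnx")
quant = ort.InferenceSession("models/llm/1/model.quant.onnx")

def topk_last_token(session, input_ids):
    # Both graphs expose "input_ids" -> "logits", matching the conversion outline below.
    logits = session.run(["logits"], {"input_ids": input_ids})[0]
    return set(np.argsort(logits[0, -1])[-TOP_K:].tolist())

overlaps = []
for prompt in PROBES:
    ids = tok(prompt, return_tensors="np")["input_ids"].astype(np.int64)
    overlaps.append(len(topk_last_token(fp32, ids) & topk_last_token(quant, ids)) / TOP_K)

print(f"mean top-{TOP_K} agreement: {np.mean(overlaps):.2f}")
assert np.mean(overlaps) >= 0.9, "quantization drift exceeds threshold"  # assumed gate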

Sample conversion script outline

# tools/convert_to_onnx.py (outline)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer, then export to ONNX with dynamic axes for batch/sequence
model = AutoModelForCausalLM.from_pretrained("my-org/llm-small").eval()
model.config.use_cache = False  # keep the exported graph to input_ids -> logits
tokenizer = AutoTokenizer.from_pretrained("my-org/llm-small")
dummy = tokenizer("hello world", return_tensors="pt")["input_ids"]
torch.onnx.export(
    model, (dummy,), "models/llm/1/model.onnx",  # i.e. models/<name>/<version>/model.onnx
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"}, "logits": {0: "batch", 1: "seq"}},
)

Sample quantization call (ONNX Runtime's Python quantization API)

from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("model.onnx", "model.quant.onnx", weight_type=QuantType.QInt8)

Packaging with Triton and ONNX model repository

Triton expects a model repository layout. Structure your CI artifacts like this:

model_repository/
  llm/
    1/
      model.quant.onnx
      config.pbtxt

Example minimal config.pbtxt for a causal LLM:

name: "llm"
platform: "onnxruntime_onnx"
max_batch_size: 1
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]

In production, tune instance_group, optimization sections, and backend-specific settings for ONNX Runtime or vendor NPU runtimes.
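
To keep that layout identical across CI runs, assemble the repository with a small script rather than ad-hoc shell. Below is a minimal packaging sketch, assuming the artifact paths from the workflow above; the helper name, the config source path, and the metadata file are illustrative:

# tools/package_model_repo.py (sketch; config source and metadata file are illustrative)
import json
import shutil
from pathlib import Path

def package(model_path: str, config_path: str, repo_root: str = "model_repository",
            name: str = "llm", version: str = "1") -> None:
    """Assemble a Triton model repository from CI artifacts."""
    model_dir = Path(repo_root) / name
    version_dir = model_dir / version
    version_dir.mkdir(parents=True, exist_ok=True)
    # The file name must match default_model_filename in config.pbtxt.
    shutil.copy2(model_path, version_dir / "model.quant.onnx")
    shutil.copy2(config_path, model_dir / "config.pbtxt")
    # Keep quantization/provenance metadata next to the artifact for traceability.
    (version_dir / "metadata.json").write_text(json.dumps({"source_model": model_path}))

if __name__ == "__main__":
    package("models/llm/1/model.quant.onnx", "triton/config.pbtxt")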

Dockerfile pattern for multi-arch Triton + ONNX runtime

Create a lean image that installs a Triton runtime and exposes a model repository mount. Keep runtime and models decoupled.

# docker/Dockerfile.triton (concept)
# Buildx builds this stage once per target platform (linux/amd64, linux/arm64)
FROM ubuntu:22.04

# Install Triton runtime (community ARM builds available as of 2025)
# Install onnxruntime and vendor NPU SDKs as optional layers
# Create /models as mount point

ENV TRITON_SERVER=/opt/tritonserver/bin/tritonserver

COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]

Note: If Triton prebuilt binaries for ARM are unavailable for your target, fall back to ONNX Runtime Server or a lightweight custom gRPC/HTTP wrapper. The CI pipeline should detect the runtime compatibility matrix and choose the correct runtime image.
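
If you do end up writing that wrapper, it can stay very small. The following sketch wraps onnxruntime in a single-endpoint HTTP server; the port, payload shape, and greedy next-token response are assumptions, and it is not API-compatible with Triton clients:

# tools/onnx_http_wrapper.py (sketch; payload shape and port are assumptions)
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import numpy as np
import onnxruntime as ort

SESSION = ort.InferenceSession("/models/llm/current/model.quant.onnx")

class InferHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expects {"input_ids": [[...]]}; replies with the greedy next token id.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        input_ids = np.asarray(body["input_ids"], dtype=np.int64)
        logits = SESSION.run(["logits"], {"input_ids": input_ids})[0]
        reply = json.dumps({"next_token": int(np.argmax(logits[0, -1]))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), InferHandler).serve_forever()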

Hardware-in-the-loop testing and validation

Unit tests alone are not sufficient. Add two device-level stages in CI:

  • Emulated tests using QEMU + containerized Triton to validate end-to-end request/response logic on ARM.
  • On-device smoke tests run on at least one Pi with AI HAT+ 2: latency, peak memory, top-k accuracy, and resource usage. Automate with SSH, Ansible, or a fleet manager. See Secure Remote Onboarding for Field Devices for tips on safely managing device access and credentials.

Example smoke test checklist (a latency probe sketch follows the list):

  • Startup within target time (e.g., < 10s)
  • P99 latency under SLA (e.g., < 500ms for single-token)
  • Memory under threshold (e.g., < 90% of available)
  • Perplexity or sample quality within acceptable drift vs. baseline
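
For the startup and latency items, a small probe against Triton's HTTP API is usually enough. Here is a minimal sketch using the KServe-style v2 endpoints Triton exposes; the device address, probe token ids, and the 500 ms budget are illustrative:

# tests/smoke/latency_probe.py (sketch; host, token ids and budget are illustrative)
import json
import time
import urllib.request

TRITON = "http://pi-canary-01:8000"   # assumed canary device address
P99_BUDGET_S = 0.5                    # single-token SLA from the checklist above

def ready() -> bool:
    try:
        return urllib.request.urlopen(f"{TRITON}/v2/health/ready", timeout=2).status == 200
    except OSError:
        return False

def infer_once(token_ids) -> float:
    body = json.dumps({"inputs": [{"name": "input_ids", "shape": [1, len(token_ids)],
                                   "datatype": "INT64", "data": token_ids}]}).encode()
    req = urllib.request.Request(f"{TRITON}/v2/models/llm/infer", data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    urllib.request.urlopen(req, timeout=10).read()
    return time.perf_counter() - start

assert ready(), "Triton did not report ready"
latencies = sorted(infer_once([101, 2023, 2003, 1037]) for _ in range(50))  # assumed probe ids
p99 = latencies[int(0.99 * len(latencies)) - 1]
print(f"P99 latency: {p99:.3f}s")
assert p99 < P99_BUDGET_S, "P99 latency over budget"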

Deployment patterns: Canary, Blue-Green, and Rolling

Pick the pattern that matches your operational constraints:

  • Canary — deploy to a small subset of devices, run HIL checks, then expand. Ideal when devices are reachable and you want to limit blast radius.
  • Blue-Green — maintain two model repositories (blue/green) and switch symlinks or mount points atomically on-device. Good for zero-downtime switches on always-online devices.
  • Rolling — sequentially update devices with health checks per-device; useful for very large fleets.

Atomic swap pattern

On-device, store model versions under /models/llm/1, /models/llm/2 and keep a symlink /models/llm/current -> /models/llm/2. Triton detects repository changes and can load/unload models via the model control API. Use this to implement a fast rollback: point the symlink back to the previous version and call the Triton model control endpoint to load the old model.
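
Here is a minimal on-device sketch of that swap, following the layout above and assuming Triton runs with --model-control-mode=explicit so the load/unload endpoints are enabled; the paths and host are illustrative:

# tools/swap_model_version.py (sketch; assumes --model-control-mode=explicit)
import os
import urllib.request

MODEL_ROOT = "/models/llm"
TRITON = "http://localhost:8000"

def point_current_at(version: str) -> None:
    """Atomically repoint /models/llm/current at a version directory."""
    tmp_link = os.path.join(MODEL_ROOT, ".current.tmp")
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(os.path.join(MODEL_ROOT, version), tmp_link)
    os.replace(tmp_link, os.path.join(MODEL_ROOT, "current"))  # atomic rename on POSIX

def triton_model(action: str, model: str = "llm") -> None:
    # POST /v2/repository/models/<model>/load or .../unload
    req = urllib.request.Request(f"{TRITON}/v2/repository/models/{model}/{action}",
                                 data=b"{}", method="POST")
    urllib.request.urlopen(req, timeout=30).read()

if __name__ == "__main__":
    triton_model("unload")
    point_current_at("1")   # e.g. roll back from version 2 to version 1
    triton_model("load")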

Rollback strategies — concrete recipes

Rollbacks must be fast, observable, and safe. Here are two practical strategies:

1) Model-level rollback with symlink swap and Triton model control

  1. Maintain at least two model versions on-device: stable (vN) and candidate (vN+1).
  2. After deploying vN+1, run health checks. If checks fail, update symlink /models/llm/current -> /models/llm/vN and call Triton model control unload/load APIs.
  3. Record rollback reason and trigger CI alert to capture stack traces and metrics.
# Example curl to Triton model control API
curl -X POST localhost:8000/v2/repository/models/llm/unload
# update symlink then
curl -X POST localhost:8000/v2/repository/models/llm/load

2) Image-based rollback with immutable tags

  1. Deploy images with immutable tags and signed manifests (e.g., ghcr.io/my-org/llm-triton:sha256-...)
  2. On failure, instruct device fleet manager (Mender/Balena) to redeploy previous image tag.
  3. Validate rollback completion via device health endpoints and metrics.

Image-based rollback is slower but safer when runtime-level changes (not just models) were deployed.
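
The image rollback itself is just another request to your fleet controller, mirroring the deploy-canary job in the workflow. The sketch below targets the same hypothetical edge-controller API; the route, payload, and image reference are illustrative:

# tools/rollback_image.py (sketch; endpoint and payload are hypothetical)
import json
import os
import urllib.request

CONTROLLER = "https://edge-controller.example.com/api/v1/deploy"  # same hypothetical API as deploy-canary
PREVIOUS_IMAGE = "ghcr.io/my-org/llm-triton:sha256-..."           # last known-good immutable tag

def rollback(devices: list[str]) -> None:
    body = json.dumps({"image": PREVIOUS_IMAGE, "strategy": "rolling", "devices": devices}).encode()
    req = urllib.request.Request(
        CONTROLLER, data=body, method="POST",
        headers={"Authorization": f"Bearer {os.environ['FLEET_TOKEN']}",
                 "Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=30).read()

if __name__ == "__main__":
    rollback(["pi-canary-01"])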

Observability and automated rollback triggers

Automate rollbacks using health and model-quality signals:

  • Metrics — P50/P95/P99 latency, memory pressure, error rate (gRPC/HTTP 5xx).
  • Quality signals — automated evaluation on holdout prompts producing BLEU/Rouge/perplexity or human-in-the-loop feedback aggregated via lightweight reporting.
  • Watchdogs — on-device agent posts heartbeats and health status to fleet manager; missing or failing heartbeats trigger rollback policies.

Set thresholds conservatively and implement cooldown windows to avoid oscillation. For example: if P99 latency > 2x SLA for > 3 devices in the canary group for 10 minutes, trigger rollback.
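
That rule translates directly into policy-as-code. Below is a minimal sketch of the evaluation logic; the metric values would come from your Prometheus stack and the rollback action from your fleet manager, both of which are left as placeholders:

# tools/rollback_policy.py (sketch; metric source and rollback hook are placeholders)
from dataclasses import dataclass

@dataclass
class DeviceWindow:
    device: str
    p99_latency_s: float    # P99 over the evaluation window
    window_minutes: int     # how long the device has been over budget

SLA_P99_S = 0.5             # single-token SLA from the smoke-test checklist
BREACH_FACTOR = 2.0         # trigger at 2x SLA
MIN_DEVICES = 3             # more than 3 canary devices breaching
MIN_WINDOW_MIN = 10         # sustained for 10 minutes

def should_roll_back(canary: list[DeviceWindow]) -> bool:
    breaching = [d for d in canary
                 if d.p99_latency_s > BREACH_FACTOR * SLA_P99_S
                 and d.window_minutes >= MIN_WINDOW_MIN]
    return len(breaching) > MIN_DEVICES

if __name__ == "__main__":
    canary = [DeviceWindow("pi-canary-01", 1.4, 12), DeviceWindow("pi-canary-02", 1.2, 11),
              DeviceWindow("pi-canary-03", 1.1, 15), DeviceWindow("pi-canary-04", 1.3, 10)]
    print("rollback:", should_roll_back(canary))  # True: 4 devices over 2x SLA for >= 10 min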

Security, compliance and supply-chain considerations

  • Sign images and artifacts (cosign, Notary) and verify signatures on-device before activation (a verification sketch follows this list). See the AWS European Sovereign Cloud guide for patterns on cryptographic controls and isolation.
  • Generate SBOMs for images and model tooling to meet audit requirements — keep them with your release artifacts and documentation (tooling for offline docs & SBOMs).
  • Least-privilege runtime — run Triton under a dedicated user and limit device permissions (e.g., NPU SDK access only to the runtime).
  • Model provenance — store model training commits, dataset hashes, and quantization parameters in the artifact metadata.
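
On-device verification can be as simple as gating image activation on a cosign check. The sketch below shells out to the cosign CLI; the key path and image reference are illustrative:

# device-agent/verify_image.py (sketch; key path and image are illustrative)
import subprocess
import sys

PUBLIC_KEY = "/etc/keys/cosign.pub"   # provisioned during device onboarding
IMAGE = "ghcr.io/my-org/llm-triton:sha256-..."

def verify(image: str) -> bool:
    """Return True only if the image signature verifies against our public key."""
    result = subprocess.run(["cosign", "verify", "--key", PUBLIC_KEY, image],
                            capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    if not verify(IMAGE):
        sys.exit("signature verification failed; refusing to activate image")
    print("signature OK; proceeding with activation")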

Edge fleet management options (practical suggestions)

Use the tool that matches your fleet size and connectivity characteristics:

  • Mender — robust OTA with delta updates and rollbacks, great for constrained devices.
  • Balena — easy container orchestration for small-to-medium fleets.
  • Custom GitOps using lightweight agents + ArgoCD or Flux is workable when devices run k3s.

Operational checklist before first fleet rollout

  • Pin runtime/tool versions and store build artifacts in an immutable registry.
  • Have at least one physical canary (Pi with AI HAT+ 2) and a QEMU-based emulator for fast CI feedback.
  • Automate smoke tests and maintain a runbook for manual rollback and emergency recovery.
  • Instrument per-device telemetry (Prometheus + Grafana or lightweight push gateways).
  • Define SLA targets and rollback triggers in code (policy-as-code).

Looking ahead

Expect these trends to shape edge LLM deployments through 2026:

  • Stronger vendor SDK support on ARM — more NPUs on SBCs will expose standardized runtimes; design pipelines to plug vendor backends into ONNX-runtime or Triton backend adapters.
  • Model shards and streaming tokenization — artifacts may be shipped as smaller shards to fit constrained storage and streamed into Triton-like servers.
  • Model signing and MLOps supply-chain standards will become mandatory in regulated industries; build signing into the pipeline now.
  • Edge model orchestration frameworks will add rollout features (A/B testing, canaries) that integrate with fleet managers — design your CI to emit the metadata those systems need.

Case study: Quick end-to-end example

Team X needed to push a 1.2B-parameter LLM quantized to int8 to 100 Pi 5 devices. They used the following pattern:

  1. Model conversion and dynamic quantization ran in GitHub Actions containers. Artifacts were uploaded to an S3 model-artifact registry.
  2. Multi-arch Triton image built with buildx and signed with cosign.
  3. Canary rollout to 3 Pis using Mender. Automated on-device tests validated latency and sample-quality against 50 probe prompts.
  4. After 48 hours, rollout progressed regionally. One regional regression triggered automatic rollback to the previous model using Triton's model control API and device symlink swap.
  5. Post-mortem showed the regression came from a mis-tuned quantization op — fix applied in CI and re-deployed after passing the same pipeline.

Actionable takeaways

  • Use containerized, reproducible conversion steps to guarantee identical ONNX quantized artifacts per CI run.
  • Separate model artifacts from runtime images — makes rollbacks fast and reduces image churn.
  • Automate hardware-in-the-loop validation on at least one physical Pi with AI HAT+ 2 before broad rollouts.
  • Implement canary + atomic symlink swap in Triton to make rollbacks deterministic and low-latency.
  • Sign images and produce SBOMs — treat the model pipeline as part of the supply chain.

Further reading and tools

  • Triton Inference Server documentation (search for ONNX backend and model control API)
  • ONNX Runtime quantization tools and examples
  • Cosign for container signing; GitHub Actions for CI; Mender/Balena for OTA

Final notes

Deploying quantized LLMs to Raspberry Pi 5 + AI HAT+ 2 is now a practical engineering effort, not a research project — but only if you build CI/CD pipelines that treat model conversion, quantization, testing, and rollback orchestration as first-class components. The patterns above are battle-tested templates you can adapt quickly in 2026.

Call to action

Ready to implement this in your stack? Start with a one-week pilot: containerize your conversion step, set up a single Pi canary with Triton, and codify one rollback policy. If you want, drop your pipeline YAML and I’ll review it and suggest concrete improvements for performance, security, and rollback resilience.


Related Topics

#devops #edge #deployment

programa

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
