Puma vs Chrome: Building a Local-AI Browser Extension that Preserves Privacy
Hands-on guide to build a privacy-first browser extension with on-device LLM inference—model packaging, sandboxing, and Pixel/Android UX.
Stop shipping secrets to the cloud: build a local-AI browser extension that actually preserves privacy
Developers and DevOps teams are tired of the trade-offs: convenience from browser-based AI features vs. the privacy cost of sending page content and private data to a remote API. In 2026, on-device inference is finally practical on modern phones (Pixel family included) and desktops. This hands-on guide walks you through building a privacy-first browser extension that runs local LLM inference—in the spirit of Puma’s local-AI approach—covering model packaging, sandboxing, and a Pixel/Android UX that respects users' data.
Why you should care in 2026
Two trends make this tutorial timely and practical:
- Hardware and runtimes: modern mobile SoCs (Pixel 8/9/10 class) include dedicated NPU/VPU lanes and robust Vulkan/WebGPU paths. Android NNAPI and Vulkan-backed inference are mainstream by late 2025, enabling sub-second responses for small-to-medium LLMs. For infrastructure implications and low-level performance trade-offs, see Designing Data Centers for AI.
- Model efficiency: 4-bit and mixed-bit quantization, GGUF/ggml toolchains, and Wasm runtimes with SIMD/WASI allow models to run locally on mobile and in-browser workers without cloud calls. Consider responsible data-handling patterns from the Responsible Web Data Bridges playbook when designing your capture pipeline.
High-level architectures: Chrome extension vs. Puma-style local browser
Pick one of two realistic architectures depending on your constraints and distribution target:
- Chromium extension + local native host (desktop-first, can be adapted to Android via companion app). The extension acts as UI and content bridge; heavy LLM inference runs in a native process started by native messaging. For operational lessons on distributing local helpers and edge assets, review portfolio ops & edge distribution.
- Bundled mobile browser / WebView app (privacy-first mobile, Puma-like). Ship a browser wrapper or custom tab that includes the model and a local inference engine (WASM or native library). This avoids Play Store extension restrictions and lets you tightly control permissions and storage. For hybrid client-server flows and productivity tool integrations, see Hybrid Edge Workflows for Productivity Tools.
Trade-offs
- Extensions are easier to distribute for desktop and existing Chromium browsers, but mobile browsers restrict extension APIs and background workers.
- Bundling a browser/WebView as an Android app provides the strongest privacy guarantees and UX control, but requires maintaining a mobile app and model bundles.
Step 1 — Choose and package a model for on-device inference
Model selection and packaging determine feasibility. For privacy-first local use, choose models with permissive licenses and sizes that fit your target device.
Model selection guidelines (practical)
- Target model size: roughly 300–2,000 MB compressed for modern high-end phones; 500–6,000 MB or more for desktops. Smaller models (e.g., distilled ~250M–1.5B parameters) are fastest but less capable.
- Quantization: use 4-bit or 6-bit quantization (GGUF/ggml) to save memory. By late 2025, 4-bit group-wise quantization schemes (GPTQ, AWQ) are broadly usable.
- Format: prefer GGUF (portable across many local runtimes) or quantized ONNX for runtime-specific toolchains.
- License: verify model license allows local redistribution (avoid closed-source weights unless you have rights).
Packaging strategy for Android (Pixel) and desktop
For Android, shipping a model inside an APK is common but can bloat the app. Better approach: deliver a small APK and download model bundles post-install into encrypted app storage on first run. Be aware of evolving policy and regulatory guidance — in particular consult the EU synthetic media guidelines and on-device voice coverage for legal and store-compliance changes.
- Split into shards and stream-download only the shards you need (example: core tokenizer + smallest model initially, optional larger models later). Use an edge distribution strategy or CDN to keep downloads fast and reliable — see edge playbooks for distribution patterns.
- Use content-addressed storage: pack model shards with checksums and sign the checksum manifest with a release key whose public key ships in the app binary. Validate the signature and checksums before loading a shard (see the on-device verification sketch at the end of this step).
- Store encrypted-at-rest in app-private storage and readable only by the app (use Android Keystore / StrongBox to protect keys).
Example: convert a PyTorch checkpoint to a GGUF 4-bit quantized model
# Example pipeline (Linux/macOS). Script names below are placeholders; exact tooling
# varies by community toolchain (e.g., GPTQ/AWQ quantizers or llama.cpp's convert and quantize utilities).
# 1) Convert the checkpoint to Hugging Face format (if needed)
python convert_to_hf.py --input model.pt --output ./hf_model
# 2) Quantize to 4-bit GGUF using a community tool
python quantize.py --model ./hf_model --bits 4 --out ./model.gguf
# 3) Split into shards of at most 64 MB for mobile delivery
split -b 64M ./model.gguf model.gguf.part.
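On the device, verify each downloaded shard against the checksum manifest before mapping or loading it. A minimal Kotlin sketch, assuming a simple manifest of fileName:sha256Hex lines; the manifest layout and function names are illustrative, not part of any particular toolchain:
// Kotlin: verify a downloaded model shard against a checksum manifest (sketch)
import java.io.File
import java.security.MessageDigest

// Parse a manifest of "fileName:sha256Hex" lines (illustrative format)
fun loadManifest(manifestFile: File): Map<String, String> =
    manifestFile.readLines()
        .filter { it.isNotBlank() }
        .associate { line ->
            val (name, hash) = line.split(":", limit = 2)
            name.trim() to hash.trim()
        }

// Stream the file through SHA-256 so large shards never sit fully in memory
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(1 shl 16)
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Only map or load the shard if its checksum matches the manifest entry
fun verifyShard(shard: File, manifest: Map<String, String>): Boolean =
    manifest[shard.name]?.equals(sha256Hex(shard), ignoreCase = true) == true
Verifying the signature on the manifest itself, so the checksums can be trusted, is sketched in the hardening section below.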
Step 2 — Secure sandboxing and runtime choices
Running untrusted workloads or third-party models locally still needs defense-in-depth. You must sandbox the inference process to protect the browser and system.
Sandbox tiers
- WASM in WebWorker — Strong sandboxing inside the renderer process. Good for small models using wasm-compiled backends (ggml-wasm, ONNX-Wasm). Requires cross-origin isolation for SharedArrayBuffer; extensions can sometimes request it. Follow the guidance in Responsible Web Data Bridges when designing what data is captured and how it's handled.
- Native helper process (recommended) — Use native messaging (desktop) or a bound Android service (mobile) to run models in a separate OS process with tightened seccomp policies and strict file permissions. Operational lessons from edge-first deployments are useful here: see edge-first supervised models case studies.
- Containerized runtime — For desktop power users, use lightweight container runtimes (Firecracker-style microVMs) to limit syscalls and restrict network access. For broader infrastructure patterns, review material on AI datacenter design and edge CDNs.
Implementing a secure native host for Chromium extensions (Linux/Windows/macOS)
Use the browser native messaging API to start an isolated process. The process should:
- Run with least privilege.
- Drop network access by default. If network is required for optional plugins, make it opt-in and audited.
- Verify model signatures before mapping files.
// native_host.json (Windows example path); replace YOUR_EXTENSION_ID with your extension's ID
{
  "name": "com.example.local_ai",
  "description": "Local AI host",
  "path": "C:\\Program Files\\LocalAI\\local_ai.exe",
  "type": "stdio",
  "allowed_origins": [ "chrome-extension://YOUR_EXTENSION_ID/" ]
}
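The host process itself speaks Chrome's native messaging framing on stdio: each message is a 32-bit length prefix in the platform's native byte order (little-endian on x86/ARM) followed by a UTF-8 JSON payload. A minimal sketch of that loop, written in Kotlin/JVM for consistency with the other snippets; a production host would more likely be a small C++ or Rust binary wrapping llama.cpp, but the framing is identical:
// Kotlin/JVM sketch of the native messaging stdio loop (length-prefixed JSON messages)
import java.io.DataInputStream
import java.io.DataOutputStream
import java.nio.ByteBuffer
import java.nio.ByteOrder

fun main() {
    val input = DataInputStream(System.`in`)
    val output = DataOutputStream(System.out)

    while (true) {
        // Read the 4-byte length prefix; readFully throws EOFException when the extension disconnects
        val lenBytes = ByteArray(4)
        input.readFully(lenBytes)
        val length = ByteBuffer.wrap(lenBytes).order(ByteOrder.LITTLE_ENDIAN).int

        // Read the UTF-8 JSON payload sent by the extension
        val payload = ByteArray(length)
        input.readFully(payload)
        val request = String(payload, Charsets.UTF_8)

        // Run local inference here (placeholder: echo the request back to the extension)
        val response = """{"echo": $request}"""

        // Write the response with the same length-prefixed framing
        val out = response.toByteArray(Charsets.UTF_8)
        val lenPrefix = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(out.size).array()
        output.write(lenPrefix)
        output.write(out)
        output.flush()
    }
}
Apply the least-privilege and no-network rules from the list above to this process at startup.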
Android sandbox for bundled inference
On Android, run inference in a dedicated foreground service or a bound service in its own process (android:process; android:isolatedProcess gives the strictest sandbox, though model files must then be passed in as file descriptors over Binder). Leverage the Android sandbox, isolate files to app-private storage, and use Android Keystore / StrongBox for any keys.
// Kotlin: bound service exposing local inference over AIDL (simplified; imports omitted)
class InferenceService : Service() {
    // AIDL-generated stub (IInference.aidl) with a single infer() method
    private val binder = object : IInference.Stub() {
        override fun infer(requestJson: String): String =
            runInference(requestJson) // hypothetical helper: runs the model and returns the serialized result
    }

    override fun onBind(intent: Intent): IBinder = binder
}
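On the client side, bind to the service and call it through the AIDL proxy. A short sketch, assuming the IInference interface from the snippet above (illustrative, not a platform API):
// Kotlin: binding to InferenceService and calling infer() (sketch)
import android.content.ComponentName
import android.content.Context
import android.content.Intent
import android.content.ServiceConnection
import android.os.IBinder

private var inference: IInference? = null

private val connection = object : ServiceConnection {
    override fun onServiceConnected(name: ComponentName, service: IBinder) {
        inference = IInference.Stub.asInterface(service) // AIDL proxy into the service process
    }
    override fun onServiceDisconnected(name: ComponentName) {
        inference = null
    }
}

fun bindInference(context: Context) {
    context.bindService(Intent(context, InferenceService::class.java), connection, Context.BIND_AUTO_CREATE)
}

// Call from a background thread: infer() is a blocking Binder call
// val resultJson = inference?.infer("""{"prompt":"Summarize the selected text"}""")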
Step 3 — Extension mechanics: UI, content capture, and strict privacy rules
Your extension must make it obvious to users what is sent to the model and where inference runs. Treat privacy as a UX feature. Follow the principles in Responsible Web Data Bridges for consent and provenance.
Manifest and permission minimalism
For a Chromium Manifest V3 extension, request only the permissions you need:
{
  "manifest_version": 3,
  "name": "Local AI Assistant",
  "version": "1.0",
  "permissions": ["storage", "nativeMessaging"],
  "host_permissions": ["https://*/*"],
  "background": { "service_worker": "background.js" },
  "content_scripts": [{
    "matches": ["<all_urls>"],
    "js": ["content.js"],
    "run_at": "document_idle"
  }]
}
Content capture flow
- User activates the extension UI for a particular tab.
- Extension asks for explicit capture permission for that domain/session (not global capture).
- Content script extracts only the selected DOM/text and sends it to the background script.
- Background script forwards the request to the native host or WASM runtime. The host performs inference locally—no network by default.
Example: content script sending data to background
// content.js: extract only the selected element's text and send it to the background worker
const text = document.querySelector('#article')?.innerText ?? '';
chrome.runtime.sendMessage({ type: 'EXTRACT', text }, response => {
  // render the locally generated response in the page UI
});

// background.js
chrome.runtime.onMessage.addListener((msg, sender, sendResponse) => {
  if (msg.type === 'EXTRACT') {
    // forward msg.text to the native host (runtime.connectNative) or a WASM worker,
    // then call sendResponse with the result
  }
  return true; // keep the channel open for an asynchronous sendResponse
});
Step 4 — Inference runtimes: Wasm vs native libraries
Choose based on target platform and model size.
Wasm (WebAssembly + WASI)
- Pros: Runs inside browser sandbox, easier deployment (no native host). Good for small models and proof-of-concept local inference.
- Cons: Memory limits, slower for large models, SharedArrayBuffer and SIMD requirements can complicate setup.
Native (llama.cpp, ggml, ONNX Runtime, TensorFlow Lite, onnxruntime-mobile)
- Pros: Faster, access to NNAPI/Vulkan/Metal, supports larger models and quantization backends.
- Cons: Requires native host or bundled binaries; you must handle sandboxing and cross-platform builds. For hybrid deployments and retraining patterns, see Edge-First Model Serving & Local Retraining.
Example: calling a local process from background (native messaging)
// background.js (simplified)
const port = chrome.runtime.connectNative('com.example.local_ai');
port.onMessage.addListener(msg => console.log('from native', msg));
function askLocalModel(prompt) {
  port.postMessage({ prompt });
}
Step 5 — Privacy-first UX patterns for Pixel/Android
Users must trust that their content never leaves the device unless they explicitly opt in. The UX should make that explicit.
Key UX elements
- Clear offline indicator: show a badge when the model runs locally and an explicit warning if any network is used.
- Per-origin policy toggles: users choose which sites can use on-device assistant features.
- Ephemeral context mode: optional mode that guarantees no logs are kept and model caches are cleared after the session.
- Model provenance pane: show model name, size, quantization details, and license. Allow users to replace or delete the model.
- Audit trail: local-only activity log with export and delete controls. This builds trust and helps debugging; for enterprise audit trails consider best practices from portfolio & edge distribution.
Android-specific flows
On Pixel/Android, show a clear in-app consent dialog before downloading model bundles on first run (app-private storage itself needs no runtime permission on modern Android). Never upload model input without an explicit consent toggle per prompt.
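One way to enforce these rules, and the per-origin toggles above, is a small on-device policy store the app consults before running inference on a page's content or offering an upload. A sketch backed by SharedPreferences; the key names and the ConsentStore type are illustrative:
// Kotlin: local per-origin consent store (sketch; key names are illustrative)
import android.content.Context

class ConsentStore(context: Context) {
    private val prefs = context.getSharedPreferences("origin_consent", Context.MODE_PRIVATE)

    // Whether this origin may use the on-device assistant at all
    fun isLocalAssistantAllowed(origin: String): Boolean =
        prefs.getBoolean("local:$origin", false)

    fun setLocalAssistantAllowed(origin: String, allowed: Boolean) {
        prefs.edit().putBoolean("local:$origin", allowed).apply()
    }

    // Uploads are never implied by local consent: require a separate per-prompt opt-in
    fun isUploadAllowedForPrompt(origin: String, userTappedUploadToggle: Boolean): Boolean =
        userTappedUploadToggle && prefs.getBoolean("upload:$origin", false)
}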
// Kotlin snippet to show user model provenance
fun showModelInfo(model: ModelInfo) {
    AlertDialog.Builder(context)
        .setTitle("Local Model: ${model.name}")
        .setMessage("Size: ${model.sizeMB} MB\nQuant: ${model.quant}\nLicense: ${model.license}")
        .setPositiveButton("OK", null)
        .show()
}
Hardening: auditability, checksums, and reproducible builds
To be trustworthy, your local-AI feature must be auditable.
- Sign model bundles and provide the public key in the app binary. Verify signatures before loading (a verification sketch follows this list).
- Provide reproducible build artifacts for native binaries and Wasm modules where possible—publish hashes. For reproducible builds and supply-chain considerations, consult edge distribution playbooks like Portfolio Ops & Edge Distribution.
- Offer a transparency log or third-party verification of shipped model binaries (useful for enterprise customers).
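For the signing requirement in the first bullet above, embed the public key in the binary and verify the bundle manifest's signature before trusting any checksums. A minimal sketch using standard java.security APIs; the key algorithm, file names, and manifest format are assumptions for illustration:
// Kotlin: verify a detached signature over the model manifest (sketch)
import java.io.File
import java.security.KeyFactory
import java.security.PublicKey
import java.security.Signature
import java.security.spec.X509EncodedKeySpec

// Load the public key embedded in the app (DER-encoded X.509 SubjectPublicKeyInfo)
fun loadEmbeddedPublicKey(derBytes: ByteArray): PublicKey =
    KeyFactory.getInstance("EC").generatePublic(X509EncodedKeySpec(derBytes))

// Returns true only if the manifest bytes match the detached signature
fun verifyManifest(manifest: File, signatureFile: File, publicKey: PublicKey): Boolean {
    val verifier = Signature.getInstance("SHA256withECDSA")
    verifier.initVerify(publicKey)
    verifier.update(manifest.readBytes())
    return verifier.verify(signatureFile.readBytes())
}

// Usage: refuse to load any shard checksums unless verifyManifest(...) returns true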
Performance tuning tips
- Use NNAPI or Vulkan delegates on Pixel to accelerate FP16 and low-bit quantized inference when supported.
- Prefer token streaming and small context windows for interactive use; stream outputs as they're generated to avoid large memory spikes (see the streaming sketch after these tips). For streaming and prompt patterns, the prompt templates roundup is useful for prototyping conversation flows.
- Cache tokenized inputs and keep the tokenizer in memory for faster repeated operations.
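To stream tokens as suggested above, adapt the engine's callback into a Flow the UI can collect incrementally. A sketch assuming a hypothetical callback-style LocalEngine binding (not a real library API):
// Kotlin: adapt a callback-based inference engine to a token stream (sketch)
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

// Hypothetical engine binding: generate() returns immediately and invokes the
// callbacks from an engine thread; stop() cancels an in-flight generation
interface LocalEngine {
    fun generate(prompt: String, onToken: (String) -> Unit, onDone: () -> Unit)
    fun stop()
}

fun LocalEngine.tokenStream(prompt: String): Flow<String> = callbackFlow {
    generate(
        prompt,
        onToken = { token -> trySend(token) }, // emit each token as it is produced
        onDone = { close() }                   // complete the flow when generation ends
    )
    awaitClose { stop() } // cancel generation if the collector goes away
}

// Usage (e.g., in a ViewModel): tokenStream(prompt).collect { token -> appendToUi(token) }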
Developer checklist: practical dos and don'ts
- Do: Validate model signatures and check runtime permissions at install and run-time.
- Do: Make network access opt-in, and indicate when remote resources are used.
- Don't: Implicitly upload user content to cloud LLMs without explicit consent.
- Don't: Store models in world-readable storage or include plaintext keys in your APK/binary.
Advanced: hybrid mode and federated options
For advanced users and enterprise deployments, support a hybrid model:
- Run small tasks locally (summaries, classification) and route heavy tasks to a private, enterprise-hosted cluster only if the user opts in (a routing sketch follows this list). See hybrid edge workflow patterns in Hybrid Edge Workflows for Productivity Tools.
- Use on-device fine-tuning (QLoRA-style) with local checkpoints kept private and optionally encrypted — do training only with explicit user consent. Edge retraining patterns are further discussed in Edge-First Model Serving & Local Retraining.
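For the local-vs-remote split in the first item above, a small router can keep local inference the default and only consider the enterprise endpoint when the user has opted in and the task exceeds the on-device budget. The types, thresholds, and endpoint below are illustrative:
// Kotlin: route a task locally by default; remote only with explicit opt-in (sketch)
enum class TaskKind { SUMMARIZE, CLASSIFY, LONG_FORM_GENERATION }

data class Task(val kind: TaskKind, val inputTokens: Int)

sealed interface Route {
    object Local : Route
    data class EnterpriseCluster(val endpoint: String) : Route
}

fun chooseRoute(
    task: Task,
    remoteOptIn: Boolean,                  // explicit user opt-in, off by default
    localContextLimit: Int = 4096,         // illustrative on-device context budget
    enterpriseEndpoint: String = "https://llm.internal.example" // placeholder URL
): Route {
    val tooBigForDevice = task.inputTokens > localContextLimit ||
        task.kind == TaskKind.LONG_FORM_GENERATION
    return if (remoteOptIn && tooBigForDevice) Route.EnterpriseCluster(enterpriseEndpoint)
    else Route.Local // privacy default: never leave the device without opt-in
}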
"By keeping the inference on-device, you control the attack surface and the data flow — that's the privacy win modern browsers and Pixel users want." — Practical takeaway
Packaging and Play Store policy considerations (2026)
As of late 2025, Google tightened rules around background data collection and model downloads. Follow these rules:
- Disclose large model downloads and get explicit consent.
- Use the recommended APIs for runtime permissions and background services.
- Maintain an accessible privacy policy that explains local inference and data handling in plain language. Keep an eye on regional regulatory guidance such as the EU synthetic media guidelines.
Real-world example: minimal proof-of-concept flow
- Build a Chromium extension that injects a small assistant icon into pages.
- When clicked, the extension asks the user to select content. Only selected text is sent to the background worker.
- The background worker opens a connection to a native host running llama.cpp with a quantized GGUF model. The host returns a streamed response to the extension UI.
- The UI shows an offline badge, the model name, and a toggle to upload transcripts (off by default).
Minimal end-to-end timing targets (2026 hardware)
- Small model (250M quantized): 20–80 ms/token on modern Pixel NPUs (FP16 path) — interactive feel.
- Medium model (1–3B quantized): 100–400 ms/token on Pixel-class hardware; desktop GPU much faster.
Summary: Why this matters and next steps
Users increasingly demand AI features without sacrificing privacy. Building a Puma-like local-AI browser extension means combining practical model packaging, robust sandboxing, and a privacy-first UX—especially on Pixel/Android where hardware allows capable on-device inference. By late 2025 and into 2026, the toolchains and runtimes are mature enough to ship real products.
Actionable takeaways
- Start with a small quantized model in GGUF; test in Wasm for a quick prototype.
- For production, use a native host/service with strict sandboxing and no network by default.
- Design UX around explicit consent, model provenance, and ephemeral modes to build trust.
Want the reference code and a checklist?
Grab the companion repository (sample manifests, native host scripts, Android bound-service example, and model packaging scripts). If you need an enterprise-ready audit trail or help porting to Pixel NNAPI, get in touch—developers and teams shipping privacy-first local AI need solid primitives more than ever.
Call to action: Download the sample repo, run the prototype on a Pixel emulator or your desktop, and share feedback. If you're building this for a team or product, schedule a security review to validate model distribution and sandboxing before release.
Related Reading
- Edge-First Model Serving & Local Retraining: Practical Strategies for On‑Device Agents
- Practical Playbook: Responsible Web Data Bridges in 2026
- Regulatory Watch: EU Synthetic Media Guidelines and On‑Device Voice (2026)
- Roundup: Top 10 Prompt Templates for Creatives (2026)
- Checklist: Legal and technical controls you should require from cloud vendors for EU sovereign projects