Introducing Auto-Recovery: Making App Containers Production-Ready on Embedded

Most Pantavisor users start with full system containers — they come from a traditional embedded Linux background where the entire rootfs is one monolithic image. Over time, teams discover the power of decomposing their stack into functional units: a BSP container for hardware, a platform container for connectivity, and separate app containers for business logic.

This decomposition pattern has been growing steadily. But until now, there was a catch: if an app container crashed, the entire device rebooted. The same safety mechanism that protects against a broken BSP or network stack was also triggered by a flaky web UI or a monitoring agent that hit an edge case. For teams running lean app containers — the kind you want to iterate on fast and deploy independently — this was too aggressive.

Container auto-recovery changes that. Pantavisor can now automatically restart crashed app containers with configurable policies and exponential backoff, without taking down the rest of the system. The device stays up, the platform keeps running, and only the failed container gets restarted.

Designed for Resource-Constrained Devices

Auto-recovery was built with Pantavisor’s embedded DNA in mind. There are no background threads, no additional daemons, and no heap allocations on the recovery path. The entire feature runs inside Pantavisor’s existing single-threaded event loop using the same lightweight timer infrastructure that drives the rest of the platform lifecycle.

For devices that don’t use auto-recovery, the overhead is near-zero: a handful of integer comparisons per tick that short-circuit immediately when no recovery policy is configured. The feature is entirely opt-in — if you don’t configure it, your device behaves exactly as it always has.
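To make the "near-zero overhead" claim concrete, here is a minimal sketch of what a per-tick short-circuit can look like. The names (recovery_tick, recovery_deadline) are illustrative, not Pantavisor's actual internals — the point is that unconfigured containers cost only a couple of comparisons per tick:

```python
def recovery_tick(platforms, now):
    """One pass of the event loop: service any expired recovery timers.

    Illustrative sketch — platforms without a recovery policy are skipped
    with a couple of comparisons and no allocation on the hot path.
    """
    restarted = []
    for p in platforms:
        if p.get("auto_recovery") is None:
            continue                      # opt-out costs ~nothing
        deadline = p.get("recovery_deadline", 0)
        if deadline == 0 or now < deadline:
            continue                      # no timer armed, or not yet due
        p["recovery_deadline"] = 0        # disarm, then restart the container
        restarted.append(p["name"])
    return restarted
```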

How It Works

Add an auto-recovery object to your container's configuration — per-container via the PV_AUTO_RECOVERY key in args.json, or as an auto_recovery group-level default in device.json:

{
    "PV_AUTO_RECOVERY": {
        "policy": "on-failure",
        "max_retries": 5,
        "retry_delay": 5,
        "backoff_factor": 2.0,
        "stable_timeout": 30,
        "backoff_policy": "reboot"
    }
}

When this container crashes, Pantavisor restarts it up to 5 times with exponential backoff (5s, 10s, 20s, 40s, 80s). Only after all retries are exhausted does it escalate — in this case, to a system reboot. You can also set backoff_policy to "never" (leave it stopped, system keeps running) or a duration like "10min" (wait, then try the whole recovery cycle again).
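The delay schedule follows directly from retry_delay and backoff_factor. A one-line sketch (not Pantavisor source) that reproduces the 5s, 10s, 20s, 40s, 80s progression from the config above:

```python
def backoff_schedule(retry_delay, backoff_factor, max_retries):
    """Delay before each retry attempt: retry_delay * backoff_factor**attempt.

    Illustrative only; with retry_delay=5, backoff_factor=2.0, max_retries=5
    this yields the 5s, 10s, 20s, 40s, 80s progression described above.
    """
    return [retry_delay * backoff_factor ** n for n in range(max_retries)]
```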

Stability Tracking for Safer OTA Updates

Auto-recovery also makes OTA updates safer for devices running app containers. The stable_timeout field defines how many seconds a container must survive after reaching its status goal before being considered “stable.” During an update’s TESTING phase, Pantavisor holds the commit until all containers have proven stable — preventing a new revision from being locked in while an app container is still flapping.

Stability tracking does not slow down boot. Groups still chain on status_goal as before — stability is purely a gate on whether to commit the update.
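The commit gate described above can be sketched as a single predicate. This is an assumption-laden simplification (field names like goal_reached_at are hypothetical), but it captures the rule: every container that declares a stable_timeout must have survived that long past its status goal before the revision is committed:

```python
def can_commit(containers, now):
    """Gate the OTA commit during TESTING: all containers that declare a
    stable_timeout must have held their status goal that long.

    Illustrative sketch; goal_reached_at is a hypothetical timestamp set
    when the container reaches its status goal.
    """
    for c in containers:
        timeout = c.get("stable_timeout")
        if timeout is None:
            continue                        # no stability requirement
        reached = c.get("goal_reached_at")
        if reached is None or now - reached < timeout:
            return False                    # still unproven (or flapping)
    return True
```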

The App Group: Sensible Defaults Out of the Box

The app group in device.json now ships with a default recovery policy:

{
    "name": "app",
    "restart_policy": "container",
    "status_goal": "STARTED",
    "timeout": 30,
    "auto_recovery": {
        "policy": "on-failure",
        "max_retries": 5,
        "retry_delay": 5,
        "backoff_factor": 2.0,
        "stable_timeout": 30,
        "backoff_policy": "reboot"
    }
}

Any container in the app group automatically inherits this policy unless it defines its own. System containers in the root and platform groups are unaffected — they don’t have auto_recovery configured, so the existing behavior (immediate reboot on failure) is preserved.
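The inheritance is all-or-nothing: a container that defines its own auto_recovery object replaces the group default entirely rather than merging individual fields. A minimal sketch of that resolution rule (function name is illustrative):

```python
def effective_policy(container, group):
    """Resolve a container's auto-recovery policy with all-or-nothing
    inheritance: a container-level auto_recovery object wins wholesale;
    fields are never merged with the group default. Illustrative sketch.
    """
    own = container.get("auto_recovery")
    if own is not None:
        return own                          # container override, used as-is
    return group.get("auto_recovery")       # group default, or None (no recovery)
```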

What Landed

pantavisor/pantavisor PR #610 — Core implementation:

  • Recovery state machine with exponential backoff and retry limits
  • New RECOVERING platform status visible in pvcontrol ls
  • Stability timer that arms when a container reaches its status_goal (STARTED or READY)
  • Stability-gated commit during OTA TESTING with hub progress message
  • Configurable backoff_policy: reboot, never, or timed duration retry
  • Group-level auto_recovery in device.json with all-or-nothing inheritance
  • Documentation in containers.md and the state format reference

meta-pantavisor PR #183 — Yocto layer support:

  • Default app group in all BSP device.json variants now ships with auto-recovery
  • Example test containers for validating recovery, stabilization, and group inheritance
  • Test plan with 8 scenarios

Try It Out

If your device has a tailscale container in the app group (the default for meta-pantavisor BSPs), auto-recovery is already active after updating. SSH into your device and check:

ssh -p 8222 _pv_@<device-ip>
pvcontrol ls

You should see auto-recovery fields on the tailscale container:

{
    "name": "tailscale",
    "group": "app",
    "status": "STARTED",
    "auto_recovery": {
        "max_retries": 5,
        "current_retries": 0,
        "stable_timeout": 30,
        "is_stable": "true",
        "backoff_policy": "reboot"
    }
}

is_stable: "true" means the container survived its 30-second stability window. To simulate a crash and watch auto-recovery in action:

lxc-info -n tailscale -p
kill -9 <pid>

Then check the pantavisor log:

platform 'tailscale' crashed. Attempting auto-recovery (attempt 1/5) in 5 seconds...
platform 'tailscale' status is now RECOVERING
recovery timer finished for platform 'tailscale'. Restarting...
platform 'tailscale' reached its status goal; took 1 secs
platform 'tailscale' is now STABLE (survived 30 seconds)

Or check remotely via the HTTP log API:

curl "http://<device-ip>:12368/cgi-bin/logs?rev=<N>&source=pantavisor&tailn=100" \
  | grep -i "recover\|STABLE"
