Introducing Auto-Recovery: Making App Containers Production-Ready on Embedded
Most Pantavisor users start with full system containers — they come from a traditional embedded Linux background where the entire rootfs is one monolithic image. Over time, teams discover the power of decomposing their stack into functional units: a BSP container for hardware, a platform container for connectivity, and separate app containers for business logic.
This decomposition pattern has been growing steadily. But until now, there was a catch: if an app container crashed, the entire device rebooted. The same safety mechanism that protects against a broken BSP or network stack was also triggered by a flaky web UI or a monitoring agent that hit an edge case. For teams running lean app containers — the kind you want to iterate on fast and deploy independently — this was too aggressive.
Container auto-recovery changes that. Pantavisor can now automatically restart crashed app containers with configurable policies and exponential backoff, without taking down the rest of the system. The device stays up, the platform keeps running, and only the failed container gets restarted.
Designed for Resource-Constrained Devices
Auto-recovery was built with Pantavisor’s embedded DNA in mind. There are no background threads, no additional daemons, and no heap allocations on the recovery path. The entire feature runs inside Pantavisor’s existing single-threaded event loop using the same lightweight timer infrastructure that drives the rest of the platform lifecycle.
For devices that don’t use auto-recovery, the overhead is near-zero: a handful of integer comparisons per tick that short-circuit immediately when no recovery policy is configured. The feature is entirely opt-in — if you don’t configure it, your device behaves exactly as it always has.
How It Works
Add an auto-recovery object to your container’s configuration — per-container via the PV_AUTO_RECOVERY key in args.json, or as a group-level auto_recovery default in device.json:
{
  "PV_AUTO_RECOVERY": {
    "policy": "on-failure",
    "max_retries": 5,
    "retry_delay": 5,
    "backoff_factor": 2.0,
    "stable_timeout": 30,
    "backoff_policy": "reboot"
  }
}
When this container crashes, Pantavisor restarts it up to 5 times with exponential backoff (5s, 10s, 20s, 40s, 80s). Only after all retries are exhausted does it escalate — in this case, to a system reboot. You can also set backoff_policy to "never" (leave it stopped, system keeps running) or a duration like "10min" (wait, then try the whole recovery cycle again).
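The delay schedule follows directly from the config values above. As a purely illustrative calculation (not Pantavisor source code), the wait before retry attempt i is retry_delay multiplied by backoff_factor raised to i:

```python
# Illustrative arithmetic only: reproduce the delay schedule implied by
# the example config (retry_delay=5, backoff_factor=2.0, max_retries=5).
retry_delay = 5        # seconds before the first restart attempt
backoff_factor = 2.0   # multiplier applied after each failed attempt
max_retries = 5

delays = [retry_delay * backoff_factor ** attempt for attempt in range(max_retries)]
print(delays)  # [5.0, 10.0, 20.0, 40.0, 80.0] — the schedule quoted above
```

With these defaults, a container that keeps crashing consumes all five attempts in about two and a half minutes of cumulative backoff before the escalation policy kicks in.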
Stability Tracking for Safer OTA Updates
Auto-recovery also makes OTA updates safer for devices running app containers. The stable_timeout field defines how many seconds a container must survive after reaching its status goal before being considered “stable.” During an update’s TESTING phase, Pantavisor holds the commit until all containers have proven stable — preventing a new revision from being locked in while an app container is still flapping.
Stability tracking does not slow down boot. Groups still chain on status_goal as before — stability is purely a gate on whether to commit the update.
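As a rough mental model of that gate — a hypothetical sketch with illustrative names, not Pantavisor’s actual implementation — the TESTING-phase commit decision reduces to an all-containers-stable check:

```python
# Hypothetical sketch of the stability-gated commit; function and field
# names are illustrative, not taken from the Pantavisor source.
def can_commit(containers):
    """Commit the revision only once every container has survived its
    stable_timeout window after reaching its status goal."""
    return all(c["is_stable"] for c in containers)

state = [
    {"name": "tailscale", "is_stable": True},
    {"name": "webui", "is_stable": False},  # still flapping
]
print(can_commit(state))  # False: the update is held, not committed
```

The key property is that a single flapping container is enough to hold the commit, while containers that came up cleanly impose no extra wait beyond their own stable_timeout.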
The App Group: Sensible Defaults Out of the Box
The app group in device.json now ships with a default recovery policy:
{
  "name": "app",
  "restart_policy": "container",
  "status_goal": "STARTED",
  "timeout": 30,
  "auto_recovery": {
    "policy": "on-failure",
    "max_retries": 5,
    "retry_delay": 5,
    "backoff_factor": 2.0,
    "stable_timeout": 30,
    "backoff_policy": "reboot"
  }
}
Any container in the app group automatically inherits this policy unless it defines its own. System containers in the root and platform groups are unaffected — they don’t have auto_recovery configured, so the existing behavior (immediate reboot on failure) is preserved.
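The inheritance is all-or-nothing rather than a field-by-field merge. A hypothetical sketch (illustrative names, not Pantavisor’s code) of what that means in practice:

```python
# Hypothetical sketch of all-or-nothing inheritance: a container that
# defines its own auto_recovery uses it wholesale; the group default is
# never merged into it field by field.
def effective_policy(container, group):
    if "auto_recovery" in container:
        return container["auto_recovery"]   # container's policy wins outright
    return group.get("auto_recovery")       # otherwise inherit the group default

group = {"auto_recovery": {"policy": "on-failure", "max_retries": 5}}
custom = {"auto_recovery": {"max_retries": 2}}  # overrides, loses group's other fields
plain = {}

print(effective_policy(custom, group))  # {'max_retries': 2} — not merged with the default
print(effective_policy(plain, group))   # {'policy': 'on-failure', 'max_retries': 5}
```

So a container that overrides the policy must spell out every field it cares about; it does not pick up the group’s values for fields it omits.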
What Landed
pantavisor/pantavisor PR #610 — Core implementation:
- Recovery state machine with exponential backoff and retry limits
- New RECOVERING platform status visible in pvcontrol ls
- Stability timer that arms when a container reaches its status_goal (STARTED or READY)
- Stability-gated commit during OTA TESTING with hub progress message
- Configurable backoff_policy: reboot, never, or a timed duration retry
- Group-level auto_recovery in device.json with all-or-nothing inheritance
- Documentation in containers.md and the state format reference
meta-pantavisor PR #183 — Yocto layer support:
- Default app group in all BSP device.json variants now ships with auto-recovery
- Example test containers for validating recovery, stabilization, and group inheritance
- Test plan with 8 scenarios
Try It Out
If your device has a tailscale container in the app group (the default for meta-pantavisor BSPs), auto-recovery is already active after updating. SSH into your device and check:
ssh -p 8222 _pv_@<device-ip>
pvcontrol ls
You should see auto-recovery fields on the tailscale container:
{
  "name": "tailscale",
  "group": "app",
  "status": "STARTED",
  "auto_recovery": {
    "max_retries": 5,
    "current_retries": 0,
    "stable_timeout": 30,
    "is_stable": "true",
    "backoff_policy": "reboot"
  }
}
is_stable: "true" means the container survived its 30-second stability window. To simulate a crash and watch auto-recovery in action:
lxc-info -n tailscale -p
kill -9 <pid>
Then check the pantavisor log:
platform 'tailscale' crashed. Attempting auto-recovery (attempt 1/5) in 5 seconds...
platform 'tailscale' status is now RECOVERING
recovery timer finished for platform 'tailscale'. Restarting...
platform 'tailscale' reached its status goal; took 1 secs
platform 'tailscale' is now STABLE (survived 30 seconds)
Or check remotely via the HTTP log API:
curl "http://<device-ip>:12368/cgi-bin/logs?rev=<N>&source=pantavisor&tailn=100" \
| grep -i "recover\|STABLE"