Health & Readiness Probes

ChannelWatch exposes three HTTP probe endpoints designed for Kubernetes and any other orchestrator that supports HTTP health checks. Each probe has a distinct purpose and a distinct failure condition.

| Endpoint | Purpose | Healthy response |
| --- | --- | --- |
| /healthz/live | Process is alive and responding | 200 OK |
| /healthz/ready | All enabled DVRs are alive and receiving events | 200 OK |
| /healthz/startup | Core has completed initial load | 200 OK |

All three endpoints return plain text bodies (ok or a short error description) and are unauthenticated. They do not require an API key.
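
For orchestrators or scripts outside Kubernetes, the endpoints can be polled with any HTTP client. The following is a minimal sketch using only the Python standard library. The helper names `probe_url` and `check_probe` are illustrative and not part of ChannelWatch, and the base URL passed in is an assumption about where the service is reachable.

```python
import urllib.error
import urllib.request

# The three probe names documented on this page.
PROBES = ("live", "ready", "startup")


def probe_url(base_url: str, name: str) -> str:
    """Build the URL for one probe, e.g. <base>/healthz/ready."""
    if name not in PROBES:
        raise ValueError(f"unknown probe: {name!r}")
    return f"{base_url.rstrip('/')}/healthz/{name}"


def check_probe(base_url: str, name: str, timeout: float = 5.0) -> tuple[int, str]:
    """Return (HTTP status, plain-text body) for one probe.

    urllib raises HTTPError for non-2xx responses such as 503, so that
    case is caught and reported as an ordinary result. Connection
    failures (refused, timeout) still propagate as OSError subclasses.
    """
    url = probe_url(base_url, name)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace").strip()
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace").strip()
```

Assuming ChannelWatch listens on port 8501 on the local machine, `check_probe("http://localhost:8501", "ready")` would return the status code together with the body text, so a caller can treat 200 as healthy and log the short error description otherwise. No API key header is sent, since the endpoints are unauthenticated.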

The liveness probe answers one question: is the ChannelWatch process still running and able to handle HTTP requests? It always returns 200 OK as long as the web server is up. It does not check DVR connectivity or event flow.

Use this probe to let Kubernetes restart a container that has deadlocked or crashed. Do not use it to detect a degraded DVR connection.

livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8501
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3

The readiness probe is the one that reflects DVR health. It returns 200 OK only when every enabled DVR task is alive and every enabled DVR has received an event within the staleness threshold (default 300 seconds). If any enabled DVR task has died, or any enabled DVR has not received an event within that window, the probe returns 503 Service Unavailable.

This means a pod can become unready during a DVR restart or a brief network interruption. That is intentional. Kubernetes will stop routing traffic to an unready pod, which prevents stale data from being served.

readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8501
  initialDelaySeconds: 15
  periodSeconds: 15
  failureThreshold: 2

A background watchdog coroutine runs every 30 seconds. For each enabled DVR it checks two things:

  1. The DVR’s asyncio task is alive (not crashed or cancelled).
  2. The DVR’s last_event_at timestamp is within staleness_threshold_seconds of the current time.

If either check fails for any DVR, the watchdog marks that DVR as unhealthy. The readiness probe reflects the aggregate state: all DVRs healthy means 200, any DVR unhealthy means 503.

The watchdog cannot be disabled by configuration. It is a safety feature, not an optional component.
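
The two watchdog checks and the way they aggregate into the readiness result can be sketched as follows. This is an illustration of the logic described above, not ChannelWatch's actual implementation: the `DvrState` class, the function names, and the choice to treat an age exactly equal to the threshold as still fresh are all assumptions.

```python
from dataclasses import dataclass

# Default staleness threshold stated in the readiness section.
DEFAULT_STALENESS_THRESHOLD_SECONDS = 300.0


@dataclass
class DvrState:
    """Hypothetical per-DVR snapshot; the class itself is illustrative."""
    name: str
    task_alive: bool      # in asyncio code this could be `not task.done()`
    last_event_at: float  # Unix timestamp of the most recent event


def dvr_is_healthy(
    dvr: DvrState,
    now: float,
    staleness_threshold_seconds: float = DEFAULT_STALENESS_THRESHOLD_SECONDS,
) -> bool:
    """Apply both watchdog checks to one DVR."""
    if not dvr.task_alive:
        return False  # check 1: task crashed or was cancelled
    # check 2: last event is recent enough (boundary treated as fresh, by assumption)
    return (now - dvr.last_event_at) <= staleness_threshold_seconds


def readiness_status(
    dvrs: list[DvrState],
    now: float,
    staleness_threshold_seconds: float = DEFAULT_STALENESS_THRESHOLD_SECONDS,
) -> int:
    """Aggregate: 200 if every enabled DVR is healthy, otherwise 503."""
    all_healthy = all(
        dvr_is_healthy(d, now, staleness_threshold_seconds) for d in dvrs
    )
    return 200 if all_healthy else 503
```

Passing `now` explicitly, rather than reading the clock inside the functions, keeps the sketch easy to test. Note that in this sketch an empty list of DVRs yields 200, because `all()` of nothing is true; how ChannelWatch handles that case is not stated here.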

The startup probe returns 200 OK once ChannelWatch has completed its initial load: configuration parsed, database migrated, and DVR tasks started. Use it to give the container extra time to initialize before liveness checks begin.

startupProbe:
  httpGet:
    path: /healthz/startup
    port: 8501
  failureThreshold: 30
  periodSeconds: 5

With this configuration Kubernetes allows up to 150 seconds for startup (failureThreshold 30 × periodSeconds 5) before declaring the container failed. Adjust failureThreshold if your environment has slow storage or a large database to migrate.
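
The same arithmetic applies to the other probes and is handy when tuning thresholds. The short sketch below computes the approximate failure budget for each probe from the values used on this page. The function name `failure_budget_seconds` is illustrative, and the calculation is a rule of thumb that ignores initialDelaySeconds and per-request timeouts.

```python
def failure_budget_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time Kubernetes tolerates consecutive probe failures."""
    return period_seconds * failure_threshold


# (periodSeconds, failureThreshold) as configured in the examples on this page.
PROBE_SETTINGS = {
    "startup": (5, 30),
    "liveness": (30, 3),
    "readiness": (15, 2),
}

budgets = {
    name: failure_budget_seconds(period, threshold)
    for name, (period, threshold) in PROBE_SETTINGS.items()
}

for name, seconds in budgets.items():
    print(f"{name}: about {seconds} s")
# startup: about 150 s
# liveness: about 90 s
# readiness: about 30 s
```

Read this way, a deadlocked process is restarted after roughly 90 seconds, and a pod is marked unready roughly 30 seconds after the readiness endpoint starts returning 503.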

All three probes combined in one container spec:

livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8501
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8501
  initialDelaySeconds: 15
  periodSeconds: 15
  failureThreshold: 2
startupProbe:
  httpGet:
    path: /healthz/startup
    port: 8501
  failureThreshold: 30
  periodSeconds: 5

When the watchdog detects a stale DVR, it also triggers a red banner in the ChannelWatch web UI:

DVR ‘Living Room’ has not received events for 312 seconds. Monitoring may be degraded.

The banner includes a Diagnose button that links to the channelwatch doctor output. See Per-DVR Health for the full health endpoint and Debug Bundles for the doctor CLI.