Health & Readiness Probes

ChannelWatch exposes three HTTP probe endpoints designed for Kubernetes and any other orchestrator that supports HTTP health checks. Each probe has a distinct purpose and a distinct failure condition.

Probe endpoints

Endpoint	Purpose	Healthy response
`/healthz/live`	Process is alive and responding	`200 OK`
`/healthz/ready`	All enabled DVRs are alive and receiving events	`200 OK`
`/healthz/startup`	Core has completed initial load	`200 OK`

All three endpoints are unauthenticated (no API key is required) and return a plain-text body: ok, or a short error description.
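As a quick way to see what each endpoint returns outside of Kubernetes, the following is a minimal client sketch using only the Python standard library. The helper name check_probe and the base URL http://localhost:8501 are assumptions for illustration (the port is taken from the probe examples on this page); adjust them to your deployment.

```python
import urllib.error
import urllib.request

# The three probe paths documented in the table above.
PROBE_PATHS = ("/healthz/live", "/healthz/ready", "/healthz/startup")


def check_probe(base_url, path, timeout=2.0):
    """Return (status, body). status is None if no HTTP response arrived.

    Hypothetical helper, not part of ChannelWatch.
    """
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace").strip()
    except urllib.error.HTTPError as err:
        # Non-2xx answers such as 503 are raised as HTTPError but still carry a body.
        return err.code, err.read().decode("utf-8", "replace").strip()
    except OSError as err:
        # Connection refused, DNS failure, or timeout: the server never answered.
        return None, str(err)


if __name__ == "__main__":
    # Assumed address; replace with the host and port of your instance.
    for path in PROBE_PATHS:
        status, body = check_probe("http://localhost:8501", path)
        print(f"{path}: {status} {body}")
```

A status of None means the request never reached the web server, which is the situation the liveness probe is meant to catch. A 503 from /healthz/ready means the server answered but at least one DVR is unhealthy.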

Liveness probe (`/healthz/live`)

The liveness probe answers one question: is the ChannelWatch process still running and able to handle HTTP requests? It always returns 200 OK as long as the web server is up. It does not check DVR connectivity or event flow.

Use this probe to let Kubernetes restart a container that has deadlocked or otherwise stopped answering HTTP requests. Do not use it to detect a degraded DVR connection; that is the job of the readiness probe.

livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8501
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3

Readiness probe (`/healthz/ready`)

The readiness probe reports whether ChannelWatch is actually monitoring. It returns 200 OK only when every enabled DVR task is alive and every enabled DVR has received an event within the staleness threshold (default 300 seconds). If any enabled DVR’s task has died, or its last event is older than that window, the probe returns 503 Service Unavailable.

This means a pod can become unready during a DVR restart or a brief network interruption. That is intentional. Kubernetes will stop routing traffic to an unready pod, which prevents stale data from being served.
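The readiness rule described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the ChannelWatch implementation: the DvrState class and both function names are hypothetical, timestamps are assumed to be epoch seconds, and "within the threshold" is assumed to include the boundary.

```python
from dataclasses import dataclass

STALENESS_THRESHOLD_SECONDS = 300  # default stated in the text


@dataclass
class DvrState:
    """Hypothetical snapshot of one enabled DVR."""
    name: str
    task_alive: bool      # for an asyncio.Task this would be: not task.done()
    last_event_at: float  # assumed: epoch seconds of the most recent event


def dvr_is_healthy(dvr, now, threshold=STALENESS_THRESHOLD_SECONDS):
    """A DVR is healthy when its task is alive and its last event is fresh."""
    fresh = (now - dvr.last_event_at) <= threshold
    return dvr.task_alive and fresh


def readiness_status(dvrs, now, threshold=STALENESS_THRESHOLD_SECONDS):
    """Return 200 if every enabled DVR is healthy, otherwise 503.

    Assumption: with no enabled DVRs the result is 200, because all()
    over an empty sequence is True. The source does not cover that case.
    """
    return 200 if all(dvr_is_healthy(d, now, threshold) for d in dvrs) else 503
```

For example, with now = 1000.0, a DVR whose last event was at 688.0 is 312 seconds stale, so a list containing it yields 503 even if every other DVR is healthy.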

readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8501
  initialDelaySeconds: 15
  periodSeconds: 15
  failureThreshold: 2

What drives readiness state

A background watchdog coroutine runs every 30 seconds. For each enabled DVR it checks two things:

The DVR’s asyncio task is alive (not crashed or cancelled).
The DVR’s last_event_at timestamp is within staleness_threshold_seconds of the current time.

If either check fails for any DVR, the watchdog marks that DVR as unhealthy. The readiness probe reflects the aggregate state: all DVRs healthy means 200, any DVR unhealthy means 503.

The watchdog cannot be disabled by configuration. It is a safety feature, not an optional component.
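A periodic background check of this kind is commonly written as an asyncio loop. The sketch below shows one possible shape, assuming a hypothetical check_all callable that performs the per-DVR checks. The stop_event parameter is also an assumption, included only so the loop can shut down cleanly; it is not a configuration switch.

```python
import asyncio


async def run_watchdog(check_all, interval_seconds=30.0, stop_event=None):
    """Call check_all() immediately, then once per interval until stopped.

    Hypothetical sketch; the names and the stop mechanism are assumptions.
    """
    stop_event = stop_event or asyncio.Event()
    while not stop_event.is_set():
        check_all()
        try:
            # Sleep for one interval, but wake early if a stop is requested.
            await asyncio.wait_for(stop_event.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            # The interval elapsed without a stop request: run the next check.
            pass
```

Waiting on the event with a timeout, rather than calling asyncio.sleep directly, lets the loop exit promptly at shutdown instead of finishing a full 30-second sleep first.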

Startup probe (`/healthz/startup`)

The startup probe returns 200 OK once ChannelWatch has completed its initial load: configuration parsed, database migrated, and DVR tasks started. Kubernetes holds off liveness and readiness checks until the startup probe succeeds, so use it to give the container extra time to initialize.

startupProbe:
  httpGet:
    path: /healthz/startup
    port: 8501
  failureThreshold: 30
  periodSeconds: 5

With this configuration Kubernetes allows up to 150 seconds (failureThreshold 30 × periodSeconds 5) for startup before declaring the container failed. Increase failureThreshold if your environment has slow storage or a large database to migrate.
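The startup budget is simply failureThreshold multiplied by periodSeconds. The helper below is a small sketch (function names are illustrative) for working out the budget of a given configuration, or the failureThreshold needed to cover a target startup time.

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds):
    """Maximum time Kubernetes waits for the startup probe to succeed."""
    return failure_threshold * period_seconds


def failure_threshold_for(target_seconds, period_seconds):
    """Smallest failureThreshold whose budget covers target_seconds."""
    return math.ceil(target_seconds / period_seconds)


print(startup_budget_seconds(30, 5))  # 150, matching the example above
print(failure_threshold_for(240, 5))  # 48
```

So a deployment that needs up to four minutes to migrate its database would keep periodSeconds at 5 and raise failureThreshold to 48.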

Full Kubernetes example

livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8501
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8501
  initialDelaySeconds: 15
  periodSeconds: 15
  failureThreshold: 2

startupProbe:
  httpGet:
    path: /healthz/startup
    port: 8501
  failureThreshold: 30
  periodSeconds: 5

Staleness banner in the UI

When the watchdog detects a stale DVR, it also triggers a red banner in the ChannelWatch web UI:

DVR ‘Living Room’ has not received events for 312 seconds. Monitoring may be degraded.

The banner includes a Diagnose button that links to the channelwatch doctor output. See Per-DVR Health for the full health endpoint and Debug Bundles for the doctor CLI.

Related pages

Per-DVR Health - per-DVR health endpoint and watchdog details
Prometheus Metrics - channelwatch_dvr_last_event_seconds_ago gauge
Install (Kubernetes / Helm) - full Helm chart values reference
