From 2f3d184a3b9e5626a65f9c189e7e6962fa649999 Mon Sep 17 00:00:00 2001
From: Micha
Date: Fri, 5 Jun 2026 22:25:23 +0200
Subject: [PATCH] ops: prepare docker critical events watcher

---
 docs/ALERT_RULES.md                                |   6 +-
 docs/MASTER_TODO.md                                |   2 +-
 docs/SERVICE_CATALOG.md                            |   2 +-
 .../docker-critical-events-supervisor.sh           | 142 ++++++++++++++++++
 .../tests/test-docker-critical-events.sh           |  55 +++++++
 services/posture-check/unraid-user-scripts.md      |  27 +++-
 6 files changed, 225 insertions(+), 9 deletions(-)
 create mode 100755 services/posture-check/docker-critical-events-supervisor.sh
 create mode 100755 services/posture-check/tests/test-docker-critical-events.sh

diff --git a/docs/ALERT_RULES.md b/docs/ALERT_RULES.md
index 01b4374..a9e2fc3 100644
--- a/docs/ALERT_RULES.md
+++ b/docs/ALERT_RULES.md
@@ -1,6 +1,6 @@
 # Alert Rules
 
-As of: 2026-05-31
+As of: 2026-06-05
 
 This file describes the production alerting paths and the most important rules. The
 configuration itself lives in `monitoring/prometheus/alerts.yml` and in the
@@ -49,4 +49,6 @@ The list of monitored critical containers is in
 - No inode alert. Useful later for Paperless/Immich, but currently no
   documented incident.
 - Container memory limits are set only once real peak data exists; OOM/kill
-  is already reported via `docker-critical-events.sh`.
+  is reported via `docker-critical-events.sh` once the host watcher is enabled
+  through an Unraid User Script. Start/stop/status/smoke run through
+  `services/posture-check/docker-critical-events-supervisor.sh`.
diff --git a/docs/MASTER_TODO.md b/docs/MASTER_TODO.md
index 267dbe8..7f49a6e 100644
--- a/docs/MASTER_TODO.md
+++ b/docs/MASTER_TODO.md
@@ -59,7 +59,7 @@ Deliberately not now - with review trigger.
 | Cold backup rotation | **Deliberately Hetzner-only** (2026-06-05). No second rotating cold copy. Trigger: sharply growing data value, repeated Hetzner problems, changed preference | `docs/HARDWARE_INVENTORY.md` |
 | WAN outage protection | **Evaluate later** (2026-06-05). Cellular failover inactive; local apps keep running during a WAN outage. Trigger: more frequent or longer DSL outages, or critical remote access | `docs/NETWORK_INVENTORY.md` |
 | Home Assistant InfluxDB bind | Currently `127.0.0.1:8181`, validated. Plan a deliberate bind change only if HA does not write to the host locally | `docs/NETWORK_INVENTORY.md` |
-| Docker Critical Events Watcher | Optionally enable as an Unraid User Script `at array start` and run an ntfy smoke test; not a mandatory path | `docs/SERVICE_CATALOG.md`, `services/posture-check/docker-critical-events.sh` |
+| Docker Critical Events Watcher | Activation prepared: `services/posture-check/docker-critical-events-supervisor.sh` wraps start/stop/status/smoke for the Unraid User Script. Optionally enable it later in the Unraid UI `at array start` and run an ntfy smoke test; not a mandatory path | `docs/SERVICE_CATALOG.md`, `services/posture-check/docker-critical-events.sh`, `services/posture-check/unraid-user-scripts.md` |
 | Negative test for backup freshness | Quarterly: deliberately simulate a broken/missing dump in a test path and check whether `homelab-alerts` fires | `docs/AUDIT_2026-05-25_TODO.md` |
 | End-to-end DR drill | Full bootstrap of phases 1-5 on a throwaway host; realistic only with second hardware (see also Externally blocked) | `docs/AUDIT_2026-05-25_TODO.md`, `docs/DISASTER_RECOVERY.md` |
 | Recurring restore drills | Rotate Vaultwarden, Gitea, Authelia, Komodo, Paperless, Immich, Traefik, PostgreSQL, Mongo, Nextcloud, Mealie, Mail-Archiver according to the matrix intervals | `docs/RESTORE_MATRIX.md`, `docs/RESTORE_HANDBOOK.md` |
diff --git a/docs/SERVICE_CATALOG.md b/docs/SERVICE_CATALOG.md
index 017419c..3d4aaee 100644
--- a/docs/SERVICE_CATALOG.md
+++ b/docs/SERVICE_CATALOG.md
@@ -85,7 +85,7 @@ Secret values are not included. Only secret names, env key names and
 | Service | Purpose | Authoritative path | URL / access | Dependencies | Data paths | Backup / restore | Traefik | Notes / TODOs |
 |---|---|---|---|---|---|---|---|---|
 | `posture-check` | Host posture audit for filesystem, mover drift, NVMe SMART, fill level, and Authelia repo<->host drift | `services/posture-check/posture-check.sh` | Unraid User Script / cron / Borg pre-hook | `findmnt`, `df`, `nvme`, optionally `curl` for ntfy; calls `services/authelia-diff.sh` for `authelia_config_drift` | `/mnt/user/services/posture-check/last.json` | Repo script + last JSON status | no | Must run on the Unraid host at boot, hourly, and before Borg; Disk1 NTFS is no longer allowed after Disk1 phase 2 (`ALLOW_DISK1_NTFS=0` by default); warning/critical alert via ntfy only on a new cause or after `ALERT_REPEAT_SECONDS`. The Authelia drift check needs a repo mirror under `/mnt/user/services/homelab-infra/` (see `docs/WORKFLOW.md`, section "Exception: Authelia configuration.yml") |
-| `docker-critical-events` | Live alerting for Docker `die`/`oom`/`kill` events | `services/posture-check/docker-critical-events.sh` | Unraid User Script / background process | Docker CLI, ntfy | `/mnt/user/services/posture-check/docker-critical-events-last.log` | Repo script + last event log | no | Optionally start as an Unraid User Script `at array start`; sends to `homelab-alerts` |
+| `docker-critical-events` | Live alerting for Docker `die`/`oom`/`kill` events | `services/posture-check/docker-critical-events.sh`, supervisor `services/posture-check/docker-critical-events-supervisor.sh` | Unraid User Script / background process | Docker CLI, ntfy | `/mnt/user/services/posture-check/docker-critical-events-last.log`, PID/out file under `/mnt/user/services/posture-check/` | Repo script + last event log | no | Optionally start as an Unraid User Script `at array start`; the supervisor supports `start`, `stop`, `status`, `smoke`; sends to `homelab-alerts` |
 
 ## Backup and Restore Notes
 
diff --git a/services/posture-check/docker-critical-events-supervisor.sh b/services/posture-check/docker-critical-events-supervisor.sh
new file mode 100755
index 0000000..d8adfe2
--- /dev/null
+++ b/services/posture-check/docker-critical-events-supervisor.sh
@@ -0,0 +1,142 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+BASE_DIR="${BASE_DIR:-/mnt/user/services/posture-check}"
+WATCHER_SCRIPT="${WATCHER_SCRIPT:-/mnt/user/services/homelab-infra/services/posture-check/docker-critical-events.sh}"
+PID_FILE="${PID_FILE:-$BASE_DIR/docker-critical-events.pid}"
+OUT_FILE="${OUT_FILE:-$BASE_DIR/docker-critical-events.out}"
+EVENT_LOG="${EVENT_LOG:-$BASE_DIR/docker-critical-events-last.log}"
+NTFY_SCRIPT="${NTFY_SCRIPT:-/mnt/user/services/homelab-infra/ops/restore-tests/send-ntfy.sh}"
+NTFY_TOPIC="${NTFY_TOPIC:-homelab-alerts}"
+
+usage() {
+  cat >&2 <<EOF
+Usage: $0 {start|stop|restart|status|smoke}
+EOF
+}
+
+# NOTE: the usage text above and the body of is_running were lost when this
+# patch was extracted; both are reconstructed from how the rest of the script
+# uses them and may differ from the original commit.
+is_running() {
+  [ -s "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" >/dev/null 2>&1
+}
+
+start_watcher() {
+  mkdir -p "$BASE_DIR"
+
+  if is_running; then
+    echo "docker-critical-events watcher already running (pid $(cat "$PID_FILE"))"
+    return 0
+  fi
+
+  if [ ! -r "$WATCHER_SCRIPT" ]; then
+    echo "Watcher script not readable: $WATCHER_SCRIPT" >&2
+    return 1
+  fi
+
+  # Reconstructed: the stdin redirect, backgrounding, and PID capture were
+  # garbled in extraction; this is the form implied by PID_FILE and is_running.
+  NTFY_SCRIPT="$NTFY_SCRIPT" \
+    NTFY_TOPIC="$NTFY_TOPIC" \
+    OUTPUT_PATH="$EVENT_LOG" \
+    nohup bash "$WATCHER_SCRIPT" >"$OUT_FILE" 2>&1 </dev/null &
+  echo "$!" >"$PID_FILE"
+  sleep 1
+
+  if is_running; then
+    echo "docker-critical-events watcher started (pid $(cat "$PID_FILE"))"
+  else
+    echo "docker-critical-events watcher failed to stay running; see $OUT_FILE" >&2
+    return 1
+  fi
+}
+
+stop_watcher() {
+  if ! is_running; then
+    rm -f "$PID_FILE"
+    echo "docker-critical-events watcher is not running"
+    return 0
+  fi
+
+  local pid
+  pid="$(cat "$PID_FILE")"
+  kill "$pid" >/dev/null 2>&1 || true
+  sleep 1
+
+  if kill -0 "$pid" >/dev/null 2>&1; then
+    echo "watcher still running after SIGTERM; sending SIGKILL"
+    kill -9 "$pid" >/dev/null 2>&1 || true
+  fi
+
+  rm -f "$PID_FILE"
+  echo "docker-critical-events watcher stopped"
+}
+
+status_watcher() {
+  if is_running; then
+    echo "status=running pid=$(cat "$PID_FILE")"
+  else
+    echo "status=stopped"
+    [ -e "$PID_FILE" ] && echo "stale_pidfile=$PID_FILE"
+  fi
+
+  echo "watcher_script=$WATCHER_SCRIPT"
+  echo "event_log=$EVENT_LOG"
+  echo "out_file=$OUT_FILE"
+
+  if [ -s "$EVENT_LOG" ]; then
+    echo
+    echo "Recent critical events:"
+    tail -n 20 "$EVENT_LOG"
+  fi
+
+  if [ -s "$OUT_FILE" ]; then
+    echo
+    echo "Recent watcher output:"
+    tail -n 20 "$OUT_FILE"
+  fi
+}
+
+smoke_ntfy() {
+  if [ ! -r "$NTFY_SCRIPT" ]; then
+    echo "ntfy helper not readable: $NTFY_SCRIPT" >&2
+    return 1
+  fi
+
+  bash "$NTFY_SCRIPT" \
+    "$NTFY_TOPIC" \
+    "Docker critical watcher smoke" \
+    "Smoke test from $(hostname) at $(date -Iseconds). No container was stopped." \
+    default
+  echo "smoke notification sent to $NTFY_TOPIC"
+}
+
+case "${1:-}" in
+  start)
+    start_watcher
+    ;;
+  stop)
+    stop_watcher
+    ;;
+  restart)
+    stop_watcher
+    start_watcher
+    ;;
+  status)
+    status_watcher
+    ;;
+  smoke)
+    smoke_ntfy
+    ;;
+  *)
+    usage
+    exit 2
+    ;;
+esac
diff --git a/services/posture-check/tests/test-docker-critical-events.sh b/services/posture-check/tests/test-docker-critical-events.sh
new file mode 100755
index 0000000..56381a3
--- /dev/null
+++ b/services/posture-check/tests/test-docker-critical-events.sh
@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+WATCHER="$SCRIPT_DIR/../docker-critical-events.sh"
+
+if [ ! -r "$WATCHER" ]; then
+  echo "FAIL: watcher not readable at $WATCHER" >&2
+  exit 1
+fi
+
+tmp="$(mktemp -d)"
+trap 'rm -rf "$tmp"' EXIT
+
+mkdir -p "$tmp/bin"
+cat > "$tmp/bin/docker" <<'EOF'
+#!/usr/bin/env bash
+if [ "${1:-}" != "events" ]; then
+  echo "unexpected docker command: $*" >&2
+  exit 1
+fi
+
+cat <<'EVENTS'
+{"Type":"container","Action":"die","Actor":{"Attributes":{"name":"ok-container","image":"example:latest","exitCode":"0"}}}
+{"Type":"container","Action":"die","Actor":{"Attributes":{"name":"bad-container","image":"example:latest","exitCode":"137"}}}
+{"Type":"container","Action":"oom","Actor":{"Attributes":{"name":"oom-container","image":"example:latest"}}}
+EVENTS
+EOF
+chmod +x "$tmp/bin/docker"
+
+PATH="$tmp/bin:$PATH" \
+SEND_NTFY=0 \
+OUTPUT_PATH="$tmp/events.log" \
+bash "$WATCHER"
+
+fail() {
+  echo "FAIL: $*" >&2
+  echo "--- events.log ---" >&2
+  cat "$tmp/events.log" >&2 || true
+  exit 1
+}
+
+[ -s "$tmp/events.log" ] || fail "expected critical event log to be written"
+
+if grep -q 'ok-container' "$tmp/events.log"; then
+  fail "exitCode 0 die event should not alert"
+fi
+
+grep -q 'bad-container' "$tmp/events.log" || fail "non-zero die event missing"
+grep -q 'oom-container' "$tmp/events.log" || fail "oom event missing"
+
+line_count="$(wc -l < "$tmp/events.log" | tr -d ' ')"
+[ "$line_count" = "2" ] || fail "expected 2 logged critical events, got $line_count"
+
+echo "OK - docker critical events filter test passed"
diff --git a/services/posture-check/unraid-user-scripts.md b/services/posture-check/unraid-user-scripts.md
index 22aadea..7c00e3a 100644
--- a/services/posture-check/unraid-user-scripts.md
+++ b/services/posture-check/unraid-user-scripts.md
@@ -93,12 +93,29 @@ bash /mnt/user/services/homelab-infra/services/posture-check/daily-status-report
 ## `docker-critical-events-at-start`
 
-Time: array start. This job starts a background watcher and exits immediately.
+Time: array start. This job starts a background watcher and exits
+immediately. The supervisor writes the PID, stdout/stderr, and event log to
+`/mnt/user/services/posture-check/`.
 
 ```bash
 #!/bin/bash
-ps -ef | grep -F -- "docker events --filter event=die --filter event=oom --filter event=kill" | grep -v grep >/dev/null && exit 0
-mkdir -p /mnt/user/services/posture-check
-nohup bash /mnt/user/services/homelab-infra/services/posture-check/docker-critical-events.sh >/mnt/user/services/posture-check/docker-critical-events.out 2>&1