ops: prepare docker critical events watcher
+4
-2
@@ -1,6 +1,6 @@
 # Alert Rules
 
-As of: 2026-05-31
+As of: 2026-06-05
 
 This file describes the production alert paths and the most important rules. The
 configuration itself lives in `monitoring/prometheus/alerts.yml` and in the
@@ -49,4 +49,6 @@ The list of monitored critical containers is in
 - No inode alert. Useful later for Paperless/Immich, but currently no
   documented incident.
 - Container memory limits are set only after real peak data is available; OOM/kill
-  is already reported via `docker-critical-events.sh`.
+  is reported via `docker-critical-events.sh` once the host watcher is enabled via
+  Unraid User Script. Start/stop/status/smoke run through
+  `services/posture-check/docker-critical-events-supervisor.sh`.
+1
-1
@@ -59,7 +59,7 @@ Deliberately not now - with review trigger.
 | Cold backup rotation | **Deliberately Hetzner-only** (2026-06-05). No second rotating cold copy. Trigger: sharply growing data value, repeated Hetzner problems, changed preference | `docs/HARDWARE_INVENTORY.md` |
 | WAN outage protection | **Evaluate later** (2026-06-05). Cellular failover inactive; local apps keep running during a WAN outage. Trigger: more frequent/longer DSL outages or critical remote access | `docs/NETWORK_INVENTORY.md` |
 | Home Assistant InfluxDB bind | Currently `127.0.0.1:8181`, validated. Plan a deliberate bind change only if HA does not write to the host locally | `docs/NETWORK_INVENTORY.md` |
-| Docker Critical Events Watcher | Optionally enable as Unraid User Script `at array start` and run an ntfy smoke test; not a mandatory path | `docs/SERVICE_CATALOG.md`, `services/posture-check/docker-critical-events.sh` |
+| Docker Critical Events Watcher | Activation prepared: `services/posture-check/docker-critical-events-supervisor.sh` wraps start/stop/status/smoke for the Unraid User Script. Optionally enable `at array start` in the Unraid UI later and run an ntfy smoke test; not a mandatory path | `docs/SERVICE_CATALOG.md`, `services/posture-check/docker-critical-events.sh`, `services/posture-check/unraid-user-scripts.md` |
 | Negative test for backup freshness | Quarterly: deliberately simulate a broken/missing dump in a test path and check whether `homelab-alerts` fires | `docs/AUDIT_2026-05-25_TODO.md` |
 | End-to-end DR drill | Full bootstrap of phases 1-5 on a throwaway host; realistic only with second hardware (see also Externally blocked) | `docs/AUDIT_2026-05-25_TODO.md`, `docs/DISASTER_RECOVERY.md` |
 | Recurring restore drills | Rotate Vaultwarden, Gitea, Authelia, Komodo, Paperless, Immich, Traefik, PostgreSQL, Mongo, Nextcloud, Mealie, Mail-Archiver according to the matrix intervals | `docs/RESTORE_MATRIX.md`, `docs/RESTORE_HANDBOOK.md` |
@@ -85,7 +85,7 @@ Secret values are not included. Only secret names, env key names and
 | Service | Purpose | Authoritative path | URL / access | Dependencies | Data paths | Backup / restore | Traefik | Notes / TODOs |
 |---|---|---|---|---|---|---|---|---|
 | `posture-check` | Host posture audit for filesystem, mover drift, NVMe SMART, fill level and Authelia repo<->host drift | `services/posture-check/posture-check.sh` | Unraid User Script / cron / Borg pre-hook | `findmnt`, `df`, `nvme`, optionally `curl` for ntfy; calls `services/authelia-diff.sh` for `authelia_config_drift` | `/mnt/user/services/posture-check/last.json` | Repo script + last JSON status | no | Must run on the Unraid host at boot, hourly and before Borg; Disk1 NTFS is no longer allowed after Disk1 phase 2 (`ALLOW_DISK1_NTFS=0` by default); warning/critical states alert via ntfy only on a new cause or after `ALERT_REPEAT_SECONDS`. The Authelia drift check needs a repo mirror under `/mnt/user/services/homelab-infra/` (see `docs/WORKFLOW.md`, section "Ausnahme: Authelia configuration.yml") |
-| `docker-critical-events` | Live alerting for Docker `die`/`oom`/`kill` events | `services/posture-check/docker-critical-events.sh` | Unraid User Script / background process | Docker CLI, ntfy | `/mnt/user/services/posture-check/docker-critical-events-last.log` | Repo script + last event log | no | Optionally start as Unraid User Script `at array start`; sends to `homelab-alerts` |
+| `docker-critical-events` | Live alerting for Docker `die`/`oom`/`kill` events | `services/posture-check/docker-critical-events.sh`, supervisor `services/posture-check/docker-critical-events-supervisor.sh` | Unraid User Script / background process | Docker CLI, ntfy | `/mnt/user/services/posture-check/docker-critical-events-last.log`, PID/outfile under `/mnt/user/services/posture-check/` | Repo script + last event log | no | Optionally start as Unraid User Script `at array start`; the supervisor supports `start`, `stop`, `status`, `smoke`; sends to `homelab-alerts` |
 
 ## Backup and Restore Notes
 
+142
@@ -0,0 +1,142 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+BASE_DIR="${BASE_DIR:-/mnt/user/services/posture-check}"
+WATCHER_SCRIPT="${WATCHER_SCRIPT:-/mnt/user/services/homelab-infra/services/posture-check/docker-critical-events.sh}"
+PID_FILE="${PID_FILE:-$BASE_DIR/docker-critical-events.pid}"
+OUT_FILE="${OUT_FILE:-$BASE_DIR/docker-critical-events.out}"
+EVENT_LOG="${EVENT_LOG:-$BASE_DIR/docker-critical-events-last.log}"
+NTFY_SCRIPT="${NTFY_SCRIPT:-/mnt/user/services/homelab-infra/ops/restore-tests/send-ntfy.sh}"
+NTFY_TOPIC="${NTFY_TOPIC:-homelab-alerts}"
+
+usage() {
+  cat >&2 <<EOF
+Usage: $0 start|stop|restart|status|smoke
+
+  start    Start Docker critical-events watcher in the background.
+  stop     Stop the watcher by pidfile.
+  restart  Stop and start the watcher.
+  status   Print watcher status and recent log tail.
+  smoke    Send one ntfy test message through the same alert path.
+EOF
+}
+
+is_running() {
+  [ -s "$PID_FILE" ] || return 1
+  local pid
+  pid="$(cat "$PID_FILE")"
+  [ -n "$pid" ] || return 1
+  kill -0 "$pid" >/dev/null 2>&1
+}
+
+start_watcher() {
+  mkdir -p "$BASE_DIR"
+
+  if is_running; then
+    echo "docker-critical-events watcher already running (pid $(cat "$PID_FILE"))"
+    return 0
+  fi
+
+  if [ ! -r "$WATCHER_SCRIPT" ]; then
+    echo "Watcher script not readable: $WATCHER_SCRIPT" >&2
+    return 1
+  fi
+
+  NTFY_SCRIPT="$NTFY_SCRIPT" \
+  NTFY_TOPIC="$NTFY_TOPIC" \
+  OUTPUT_PATH="$EVENT_LOG" \
+    nohup bash "$WATCHER_SCRIPT" >"$OUT_FILE" 2>&1 </dev/null &
+
+  echo "$!" > "$PID_FILE"
+  sleep 1
+
+  if is_running; then
+    echo "docker-critical-events watcher started (pid $(cat "$PID_FILE"))"
+  else
+    echo "docker-critical-events watcher failed to stay running; see $OUT_FILE" >&2
+    return 1
+  fi
+}
+
+stop_watcher() {
+  if ! is_running; then
+    rm -f "$PID_FILE"
+    echo "docker-critical-events watcher is not running"
+    return 0
+  fi
+
+  local pid
+  pid="$(cat "$PID_FILE")"
+  kill "$pid" >/dev/null 2>&1 || true
+  sleep 1
+
+  if kill -0 "$pid" >/dev/null 2>&1; then
+    echo "watcher still running after SIGTERM; sending SIGKILL"
+    kill -9 "$pid" >/dev/null 2>&1 || true
+  fi
+
+  rm -f "$PID_FILE"
+  echo "docker-critical-events watcher stopped"
+}
+
+status_watcher() {
+  if is_running; then
+    echo "status=running pid=$(cat "$PID_FILE")"
+  else
+    echo "status=stopped"
+    [ -e "$PID_FILE" ] && echo "stale_pidfile=$PID_FILE"
+  fi
+
+  echo "watcher_script=$WATCHER_SCRIPT"
+  echo "event_log=$EVENT_LOG"
+  echo "out_file=$OUT_FILE"
+
+  if [ -s "$EVENT_LOG" ]; then
+    echo
+    echo "Recent critical events:"
+    tail -n 20 "$EVENT_LOG"
+  fi
+
+  if [ -s "$OUT_FILE" ]; then
+    echo
+    echo "Recent watcher output:"
+    tail -n 20 "$OUT_FILE"
+  fi
+}
+
+smoke_ntfy() {
+  if [ ! -r "$NTFY_SCRIPT" ]; then
+    echo "ntfy helper not readable: $NTFY_SCRIPT" >&2
+    return 1
+  fi
+
+  bash "$NTFY_SCRIPT" \
+    "$NTFY_TOPIC" \
+    "Docker critical watcher smoke" \
+    "Smoke test from $(hostname) at $(date -Iseconds). No container was stopped." \
+    default
+  echo "smoke notification sent to $NTFY_TOPIC"
+}
+
+case "${1:-}" in
+  start)
+    start_watcher
+    ;;
+  stop)
+    stop_watcher
+    ;;
+  restart)
+    stop_watcher
+    start_watcher
+    ;;
+  status)
+    status_watcher
+    ;;
+  smoke)
+    smoke_ntfy
+    ;;
+  *)
+    usage
+    exit 2
+    ;;
+esac
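Review note: the supervisor's `is_running` combines a non-empty pidfile with `kill -0`, which delivers no signal and only reports whether the PID exists and may be signalled. The following is a minimal sketch of that pattern in isolation, not code from this commit; the temp directory, the `demo.pid` name and the `sleep` process standing in for the watcher are all illustrative assumptions.

```bash
#!/usr/bin/env bash
set -eu

# Throwaway directory and pidfile; both names are illustrative only.
demo_dir="$(mktemp -d)"
pid_file="$demo_dir/demo.pid"

# Same shape as the supervisor's check: non-empty pidfile plus signalable PID.
is_running() {
  [ -s "$pid_file" ] || return 1
  kill -0 "$(cat "$pid_file")" >/dev/null 2>&1
}

# A sleep process stands in for the watcher.
sleep 30 &
echo "$!" > "$pid_file"
if is_running; then state_before="running"; else state_before="stopped"; fi

# Terminate and reap the process; the pidfile is now stale.
kill "$(cat "$pid_file")"
wait "$(cat "$pid_file")" 2>/dev/null || true
if is_running; then state_after="running"; else state_after="stopped"; fi

echo "before=$state_before after=$state_after"
rm -rf "$demo_dir"
```

One limitation of any pidfile check of this kind: if the kernel reuses the PID for an unrelated process, `kill -0` still succeeds, so a stale pidfile can look like a running watcher.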
@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+WATCHER="$SCRIPT_DIR/../docker-critical-events.sh"
+
+if [ ! -r "$WATCHER" ]; then
+  echo "FAIL: watcher not readable at $WATCHER" >&2
+  exit 1
+fi
+
+tmp="$(mktemp -d)"
+trap 'rm -rf "$tmp"' EXIT
+
+mkdir -p "$tmp/bin"
+cat > "$tmp/bin/docker" <<'EOF'
+#!/usr/bin/env bash
+if [ "${1:-}" != "events" ]; then
+  echo "unexpected docker command: $*" >&2
+  exit 1
+fi
+
+cat <<'EVENTS'
+{"Type":"container","Action":"die","Actor":{"Attributes":{"name":"ok-container","image":"example:latest","exitCode":"0"}}}
+{"Type":"container","Action":"die","Actor":{"Attributes":{"name":"bad-container","image":"example:latest","exitCode":"137"}}}
+{"Type":"container","Action":"oom","Actor":{"Attributes":{"name":"oom-container","image":"example:latest"}}}
+EVENTS
+EOF
+chmod +x "$tmp/bin/docker"
+
+PATH="$tmp/bin:$PATH" \
+SEND_NTFY=0 \
+OUTPUT_PATH="$tmp/events.log" \
+  bash "$WATCHER"
+
+fail() {
+  echo "FAIL: $*" >&2
+  echo "--- events.log ---" >&2
+  cat "$tmp/events.log" >&2 || true
+  exit 1
+}
+
+[ -s "$tmp/events.log" ] || fail "expected critical event log to be written"
+
+if grep -q 'ok-container' "$tmp/events.log"; then
+  fail "exitCode 0 die event should not alert"
+fi
+
+grep -q 'bad-container' "$tmp/events.log" || fail "non-zero die event missing"
+grep -q 'oom-container' "$tmp/events.log" || fail "oom event missing"
+
+line_count="$(wc -l < "$tmp/events.log" | tr -d ' ')"
+[ "$line_count" = "2" ] || fail "expected 2 logged critical events, got $line_count"
+
+echo "OK - docker critical events filter test passed"
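Review note: the watcher script itself is not part of this diff. The test only pins down its observable behavior: a `die` event with `exitCode` 0 is ignored, while a `die` with a non-zero exit code and an `oom` event are logged. The sketch below shows one filter that would satisfy those three cases, assuming one JSON event per line. The function name `is_critical_event` and the `sed`-based parsing are illustrative assumptions and say nothing about how the real watcher is implemented; `kill` events are left out because the test does not cover them.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: succeeds if the JSON event line should be logged.
is_critical_event() {
  line="$1"
  # Pull single fields out of the flat JSON line; empty if the key is absent.
  action="$(printf '%s\n' "$line" | sed -n 's/.*"Action":"\([^"]*\)".*/\1/p')"
  exit_code="$(printf '%s\n' "$line" | sed -n 's/.*"exitCode":"\([^"]*\)".*/\1/p')"
  case "$action" in
    oom) return 0 ;;
    die) [ -n "$exit_code" ] && [ "$exit_code" != "0" ] ;;
    *)   return 1 ;;   # everything else, including kill, is ignored in this sketch
  esac
}

# Feed the same three sample events the test stub emits and count the hits.
logged=0
while IFS= read -r event; do
  if is_critical_event "$event"; then
    logged=$((logged + 1))
  fi
done <<'EVENTS'
{"Type":"container","Action":"die","Actor":{"Attributes":{"name":"ok-container","image":"example:latest","exitCode":"0"}}}
{"Type":"container","Action":"die","Actor":{"Attributes":{"name":"bad-container","image":"example:latest","exitCode":"137"}}}
{"Type":"container","Action":"oom","Actor":{"Attributes":{"name":"oom-container","image":"example:latest"}}}
EVENTS

echo "logged=$logged"
```

Regex extraction only works here because the sample events are flat and predictable; a production filter would more likely use a real JSON parser.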
@@ -93,12 +93,29 @@ bash /mnt/user/services/homelab-infra/services/posture-check/daily-status-report
 
 ## `docker-critical-events-at-start`
 
-Schedule: Array Start. This job starts a background watcher and exits immediately.
+Schedule: Array Start. This job starts a background watcher and exits
+immediately. The supervisor writes the PID, stdout/stderr and event log to
+`/mnt/user/services/posture-check/`.
 
 ```bash
 #!/bin/bash
-ps -ef | grep -F -- "docker events --filter event=die --filter event=oom --filter event=kill" | grep -v grep >/dev/null && exit 0
-mkdir -p /mnt/user/services/posture-check
-nohup bash /mnt/user/services/homelab-infra/services/posture-check/docker-critical-events.sh >/mnt/user/services/posture-check/docker-critical-events.out 2>&1 </dev/null &
-exit 0
+exec /mnt/user/services/homelab-infra/services/posture-check/docker-critical-events-supervisor.sh start
+```
+
+Check status:
+
+```bash
+/mnt/user/services/homelab-infra/services/posture-check/docker-critical-events-supervisor.sh status
+```
+
+Stop:
+
+```bash
+/mnt/user/services/homelab-infra/services/posture-check/docker-critical-events-supervisor.sh stop
+```
+
+ntfy smoke test without stopping a container:
+
+```bash
+/mnt/user/services/homelab-infra/services/posture-check/docker-critical-events-supervisor.sh smoke
 ```
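Review note: the supervisor's `status` command prints `status=running pid=...` or `status=stopped` as its first line, so other host scripts can consume it without scraping `ps`. The sketch below is one possible way to do that and is not part of this commit; the helper name `watcher_state` is an illustrative assumption, and the sample inputs are canned strings in the format `status_watcher` prints.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: reads supervisor `status` output on stdin and prints
# the value of the leading status= key (running or stopped).
watcher_state() {
  sed -n '1s/^status=\([a-z]*\).*/\1/p'
}

# Canned samples mirroring the two shapes of the first status line.
state_running="$(printf 'status=running pid=1234\nwatcher_script=/path/to/watcher.sh\n' | watcher_state)"
state_stopped="$(printf 'status=stopped\nstale_pidfile=/tmp/demo.pid\n' | watcher_state)"

echo "$state_running $state_stopped"
```

On the host this would be used as `docker-critical-events-supervisor.sh status | watcher_state`, for example inside a periodic check that calls `start` again when the result is `stopped`.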