From bc9ace315a2cd8d079f6d640c8e58b2589c2fd22 Mon Sep 17 00:00:00 2001
From: Micha
Date: Thu, 18 Jun 2026 20:25:54 +0200
Subject: [PATCH] Backup audit hardening: dump freshness monitoring and scope
 consistency

Implements the findings from the backup/restore audit of 2026-06-18:

- Dump freshness as Prometheus metrics (homelab_borg_dump_present /
  homelab_borg_dump_age_seconds) in the host exporter; closes the blind
  spot where Borg keeps running and archives stale dumps without any
  job failure.
- New alerts HomelabBorgDumpMissing / HomelabBorgDumpStale (critical),
  plus ALERT_RULES.md.
- Freshness gate (.sh + .ps1) and H: nearline pull extended with
  n8n.sqlite.dump and postgresql17-globals.sql.
- Critical container watch extended with mail-archiver, n8n,
  homeassistant, smarthome-mosquitto.
- BACKUP_SCOPE: /mnt/user/projekte and other user shares outside the
  app scope documented as a deliberate, still open operator decision;
  Hermes data path clarified as parked.
- MASTER_TODO: added nearline pull monitoring, the host pull follow-up,
  and the projekte scope decision.

Also contains the previously prepared scope extensions (nextcloud
html+data, n8n, filebrowser, influxdb3) and the scope drift, retention,
compact, and check alerts.

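For reviewers, a minimal standalone sketch of the dump-freshness
measurement described above. It mirrors the loop this patch adds to
`export-prometheus-textfile.sh`, but `emit_dump_metrics` is a hypothetical
helper name used only for illustration (the patch inlines the loop), and
GNU `stat -c %Y` is assumed:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper, not part of the patch: print the two freshness
# metrics for each named dump file expected in a dump directory.
emit_dump_metrics() {
  local dump_dir="$1"
  shift
  local now dump dump_path dump_mtime
  now="$(date +%s)"
  for dump in "$@"; do
    dump_path="$dump_dir/$dump"
    if [ -f "$dump_path" ]; then
      # Assumes GNU stat; falls back to mtime 0 (very old) if stat fails.
      dump_mtime="$(stat -c %Y "$dump_path" 2>/dev/null || echo 0)"
      printf 'homelab_borg_dump_present{dump="%s"} 1\n' "$dump"
      printf 'homelab_borg_dump_age_seconds{dump="%s"} %s\n' "$dump" "$(( now - dump_mtime ))"
    else
      # A missing dump gets no age sample, only present=0.
      printf 'homelab_borg_dump_present{dump="%s"} 0\n' "$dump"
    fi
  done
}

# Demo against a throwaway directory: one dump present, one missing.
demo_dir="$(mktemp -d)"
touch "$demo_dir/n8n.sqlite.dump"
emit_dump_metrics "$demo_dir" n8n.sqlite.dump mealie.dump
rm -rf "$demo_dir"
```

A missing dump emits only `homelab_borg_dump_present ... 0`, so
HomelabBorgDumpMissing and HomelabBorgDumpStale cannot both fire for the
same artifact.
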
Co-Authored-By: Claude Opus 4.8
---
 docs/ALERT_RULES.md                           |  10 +-
 docs/MASTER_TODO.md                           |  12 +-
 docs/RESTORE_MATRIX.md                        |   2 +
 monitoring/prometheus/alerts.yml              |  72 ++++++++++
 ops/borg-ui/BACKUP_SCOPE.md                   |  15 ++-
 ops/borg-ui/all-important-sources.txt         |   6 +
 ops/borg-ui/scripts/pre-backup-dumps.sh       |   1 +
 .../pull-critical-backups.ps1                 |   1 +
 ops/restore-tests/check-restore-freshness.ps1 |   2 +
 ops/restore-tests/check-restore-freshness.sh  |   2 +
 .../export-prometheus-textfile.sh             | 127 +++++++++++++++++-
 11 files changed, 243 insertions(+), 7 deletions(-)

diff --git a/docs/ALERT_RULES.md b/docs/ALERT_RULES.md
index a9e2fc3..68a876a 100644
--- a/docs/ALERT_RULES.md
+++ b/docs/ALERT_RULES.md
@@ -1,6 +1,6 @@
 # Alert Rules
 
-Stand: 2026-06-05
+Stand: 2026-06-18
 
 Diese Datei beschreibt die produktiven Alarmwege und wichtigsten Regeln. Die
 Konfiguration selbst liegt in `monitoring/prometheus/alerts.yml` und in den
@@ -36,6 +36,14 @@ Skripten unter `services/posture-check/`.
 | `HomelabBorgBackupStale` | letztes Borg-Backup >30h | warning | Backup-Lauf nachholen/pruefen |
 | `HomelabBorgLastJobFailed` | letzter Borg-Job fehlgeschlagen | critical | Borg-UI-Job-Log pruefen |
 | `HomelabBorgLastJobCompletedWithWarnings` | letzter Borg-Job mit Warnungen | warning | Warnung im Borg-UI-Job lesen |
+| `HomelabBorgDumpMissing` | erwartetes Dump-Artefakt fehlt im aktuellen Dump-Set | critical | `pre-backup-dumps.sh`/User-Script pruefen |
+| `HomelabBorgDumpStale` | Dump-Artefakt >30h alt (Borg laeuft, Dumps eingefroren) | critical | `pre-backup-dumps.sh`/User-Script pruefen, nicht nur den Borg-Job |
+| `HomelabBorgScopeSourceListMissing` | Repo-Quellliste fuer Borg-Drift-Check fehlt | critical | Borg-UI-Mount `/local/services/homelab-infra` und Repo-Pfad pruefen |
+| `HomelabBorgScopeMissingSources` | Borg UI enthaelt nicht alle Pfade aus `ops/borg-ui/all-important-sources.txt` | critical | Live-Borg-Scope an Repo-Quelle angleichen |
+| `HomelabBorgScopeExtraSources` | Borg UI enthaelt Pfade
ausserhalb der Repo-Quellliste | warning | Doku oder Live-Scope bereinigen | +| `HomelabBorgRepositoryCheckStale` | letzter Borg-Check >14 Tage alt | warning | Borg-Repository-Check ausfuehren oder Scheduler pruefen | +| `HomelabBorgRetentionDisabled` | Scheduled Job fuehrt kein Prune aus | warning | Retention-Einstellung in Borg UI pruefen | +| `HomelabBorgCompactDisabled` | Scheduled Job fuehrt kein Compact aus | warning | Compact-Einstellung in Borg UI pruefen | | `HomelabCriticalContainerDown` | kritischer Container fehlt | critical | Komodo/Docker-Status pruefen | | `HomelabPrometheusTargetDown` | Scrape-Ziel down | critical | node-exporter/cadvisor/blackbox/traefik pruefen | diff --git a/docs/MASTER_TODO.md b/docs/MASTER_TODO.md index 18130d0..845e42e 100644 --- a/docs/MASTER_TODO.md +++ b/docs/MASTER_TODO.md @@ -1,6 +1,6 @@ # Master To-do - KalliLab CORE -Typ: Status/To-do · Stand: 2026-06-17 · Status: aktiv +Typ: Status/To-do · Stand: 2026-06-18 · Status: aktiv Diese Liste ist die **einzige** Arbeitsliste fuer offene operative Punkte im Homelab. Detailablaeufe stehen in den verlinkten Runbooks; Entscheidungen mit @@ -25,14 +25,19 @@ Host-Reports (`/mnt/user/backups/restore-reports/`) und in der Git-Historie. | Restore-Test Tailscale | Operator | State-Validierung + Reconnect nur auf Wegwerf-Host/VM, danach Geraet in Tailscale-Admin entfernen | `ops/restore-tests/tailscale-runbook.md` | | Authelia OIDC fuer Apps | Operator/Codex | Live: Grafana + Mealie login-verifiziert; Paperless Secret verdrahtet und Service-Smoke am 2026-06-17 gruen, finaler Browser-Login mit Operator-Account offen. Immich + Nextcloud bewusst geparkt bis Family-Onboarding (siehe `docs/DECISIONS.md` 2026-06-06) | `docs/AUTHELIA_OIDC_PLAN.md` | | Home Assistant Tibber | Operator/Codex | Tibber per HA-UI-Config-Flow verbinden. 
Danach Energy-Dashboard um echte Kosten/Preisquelle ergaenzen; SolarEdge-PV, Netz und Speicher sind bereits konfiguriert und validiert | `docs/runbooks/smart-home-bootstrap.md`, `docs/DECISIONS.md` | +| Nearline-Pull Ueberwachung | Operator | H:-Pull war 2026-06-04 bis 2026-06-18 still gestoppt (kein Scheduled Task, kein Alarm). Am 2026-06-18 Lauf manuell nachgeholt + Task neu registriert. **Naechster Schritt:** externen Dead-Man's-Switch (Healthchecks.io-Ping am Ende von `pull-critical-backups.ps1` und `ops/borg-ui/scripts/pre-borg.sh`), da Prometheus auf Unraid den baerchen-Pull nicht sieht | `ops/h-drive-nearline/README.md` | +| Host-Pull nach Backup-Hardening | Operator | Auf `/mnt/user/services/homelab-infra` `git pull`, damit der aktualisierte `export-prometheus-textfile.sh` (Dump-Frische-Metriken) und die Freshness-Checks live greifen. Borg-UI-Live-Quellen auf neue Pfade (nextcloud/html, nextcloud/data, n8n, filebrowser, influxdb3) angleichen, bis `homelab_borg_scope_missing_sources_total` 0 ist | `services/posture-check/export-prometheus-textfile.sh`, `ops/borg-ui/all-important-sources.txt` | --- ## Operator-Entscheidung -**Stand 2026-06-11: keine offenen Operator-Entscheidungen.** Getroffene Entscheidungen mit Begruendung und Review-Trigger: `docs/DECISIONS.md`. +| Thema | Entscheidung noetig | Quelle | +|---|---|---| +| `/mnt/user/projekte` Backup-Scope | Filebrowser serviert `projekte` (und ganze `documents`/`photos`), aber nur App-Unterordner sind im Borg-Scope. 
Entscheiden: `projekte` als read-only Borg-UI-Mount + Quelllisten-Eintrag aufnehmen, oder bewusst als "nur lokal, nicht DR-relevant" bestaetigen | `ops/borg-ui/BACKUP_SCOPE.md` Abschnitt "User-Daten-Shares ausserhalb des App-Scope" | + --- ## Geparkt @@ -52,6 +57,7 @@ Bewusst nicht jetzt - Begruendungen in `docs/DECISIONS.md`, hier nur Thema und T | Filebrowser-Mount-Scope | naechster Hardening-Sprint | `docs/SERVICE_CATALOG.md` | | Scrutiny Privileged-Ausnahme | nur mit klarer Begruendung aendern | `docs/SERVICE_CATALOG.md` | | Immich Redis named volume | passende Wartung am Immich-Stack | `docs/SERVICE_CATALOG.md` | +| Komodo keys named volume | gemeinsames Wartungsfenster mit Operator | Live-Volume `komodo_komodo_keys` nach `/mnt/user/appdata/komodo/keys` migrieren, Compose anpassen, Periphery-Reconnect pruefen, dann in Borg-Scope aufnehmen | | Storage-Wachstum (zweite NVMe, zweite Array-Disk, ZFS/BTRFS) | Trigger aus Capacity-Doku | `docs/STORAGE_LAYOUT.md`, `docs/CAPACITY_AND_LIFECYCLE.md` | | Wiederkehrende Restore-Drills | laufend nach Kadenz, inkl. quartalsweisem Frische-Negativtest (`run-restore-checks.sh freshness-negative`) | `docs/RESTORE_MATRIX.md`, `ops/restore-tests/schedule.md` | | Doku-Quartals-Gaertnern (~15 min) | quartalsweise, erster Lauf mit Q3-Review ab 2026-07-01: Datiertes archivieren, Done-/Review-Logs kuerzen, tote Links pruefen | `docs/REPO_MAP.md` Doku-Regeln | @@ -71,8 +77,8 @@ Bewusst nicht jetzt - Begruendungen in `docs/DECISIONS.md`, hier nur Thema und T - **2026-06-17** Offene TODOs gegen Live-Stand abgeglichen: Paperless-OIDC-Secret verdrahtet und Service-Smoke gruen; alter Tailscale-Docker-State nach `_archive/tailscale-removed-2026-06-06/` verschoben; Tailnet-Restpunkt geschlossen. - **2026-06-17** Repo-Hygiene abgeschlossen: Glance-Widget-Tokens sind in Runtime gesetzt, Audit-PDF liegt extern unter `H:\kallilab-recovery\audits`, Worktree clean. 
- **2026-06-17** Komodo/Gitea-Webhooks normalisiert: aktive Komodo-Hooks fuer `Micha/homelab-infra` nutzen Branch-Filter `master`; DB-Backup vor Host-Hotfix erstellt. Workflow-Regel nachgezogen. +- **2026-06-18** Backup-Audit-Hardening: Dump-Frische-Metriken + Alerts `HomelabBorgDumpMissing/Stale`, Freshness-Checks + Nearline-Pull um `n8n`/`globals` ergaenzt, 4 Tier-2-Container in Critical-Watch, Scope-Doku fuer `projekte`/Hermes praezisiert. H:-Nearline (still seit 2026-06-04) nachgeholt + Task neu registriert. - **2026-06-13** Home Assistant MQTT-Integration produktiv verbunden: Config-Entry `smarthome-mosquitto` ist `loaded`, Mosquitto sieht den HA-Client `homeassistant`; `check_config` gruen. -- **2026-06-13** HA Energy Dashboard konfiguriert: Netz, PV und Speicher aus SolarEdge Local gesetzt, `energy/validate` ohne Issues; HA-Backup danach erzeugt. --- diff --git a/docs/RESTORE_MATRIX.md b/docs/RESTORE_MATRIX.md index 9734d0e..2aa74da 100644 --- a/docs/RESTORE_MATRIX.md +++ b/docs/RESTORE_MATRIX.md @@ -60,6 +60,7 @@ Sie ist die fachliche Ergaenzung zu `docs/DISASTER_RECOVERY.md`. 
| Glance | Git / Borg-Repo | Repo-Konfiguration unter `ops/glance/config/glance.yml`; keine kritische Datenpersistenz | keine | `GLANCE_IMMICH_API_KEY`, `GLANCE_ADGUARD_USERNAME`, `GLANCE_ADGUARD_PASSWORD`, `GLANCE_SPEEDTEST_API_KEY` | Traefik, Authelia, optional interne API-Ziele | Dashboard startet, Widgets laden, Docker-Status laeuft nur ueber `glance-docker-socket-proxy` | | ntfy | Borg / Share | `/mnt/user/appdata/ntfy` | keine | keine besonderen Secret-Dateien dokumentiert | Traefik | UI und Push-Endpunkt erreichbar | | Paperless-GPT | Borg / Share | `/mnt/user/appdata/paperless-gpt` | keine eigene DB | `PAPERLESS_API_TOKEN`, `OPENAI_API_KEY` | Traefik, Paperless, OpenAI API | UI startet, Konfiguration vorhanden; LLM-Provider zeigt `openai` / `gpt-5.4-mini` | +| n8n | Borg + Dump | `/mnt/user/appdata/n8n/data` | `n8n.sqlite.dump`; Credentials sind nur mit dem passenden `N8N_ENCRYPTION_KEY` entschluesselbar | `N8N_ENCRYPTION_KEY`, GMX/OpenAI/Gitea-Credentials in n8n | Traefik, GMX IMAP, OpenAI API, Gitea API | UI startet, Owner-Login funktioniert, kritischer Mail->LLM->Gitea-Workflow ist vorhanden und deaktiviert/aktiv wie vor Restore | | Home Assistant | Borg + HA-native Backups + Fachrepo | `/mnt/user/appdata/homeassistant` inkl. 
`.storage`, `secrets.yaml`, `trusted_proxies.yaml`, `custom_components` (HACS, `solaredge_modbus_multi`); Fach-YAML aus `/mnt/user/services/smart-home-kalli/home-assistant` | HA-native Backup-Artefakte unter `/mnt/user/appdata/homeassistant/backups`; erstes Artefakt 2026-06-13 erzeugt und tar-lesbar (`backup.json`, `homeassistant.tar.gz`); Backup nach SolarEdge-Integration: `Custom_backup_2026.6.1_2026-06-13_14.59_48645373.tar`; Backup nach Energy-Dashboard-Konfiguration: `Custom_backup_2026.6.1_2026-06-13_15.59_25670583.tar`; keine externe DB in Phase 1 | HA-Secrets in `secrets.yaml`, Integrations-Tokens in `.storage`, MQTT-Credentials, Agent-API-Tokens als Host-Secrets `ha_token_codex`/`ha_token_claude` (nur mit erhaltenem `.storage`-Auth-State nutzbar), spaeter Tibber/InfluxDB-Tokens | Traefik, `frontend_net`, `smarthome_net`, Mosquitto, Fachrepo-Clone, SolarEdge-Wechselrichter `192.168.178.111:1502` | Restore-Test am 2026-06-13 erfolgreich: HA-native Backup + Mosquitto-Appdata + Fachrepo-Clone isoliert gestartet, HA HTTP/API/check_config gruen; produktiv danach HA-MQTT-Config-Entry `smarthome-mosquitto` geladen, SolarEdge Local `solaredge_modbus_multi` loaded mit 68 Entitaeten und Energy Dashboard fuer Netz/PV/Speicher per `energy/validate` ohne Issues; Report `/mnt/user/backups/restore-reports/homeassistant-2026-06-13.md` | | Smart-Home MQTT / Mosquitto | Borg / Share | `/mnt/user/appdata/mosquitto/config`, `/mnt/user/appdata/mosquitto/data`, `/mnt/user/appdata/mosquitto/log` | Mosquitto persistiert retained messages/subscriptions dateibasiert | `passwordfile`, `aclfile`, spaeter per-Device-User | `smarthome_net`, Home Assistant, spaeter ESPHome/Zigbee2MQTT | Restore-Test am 2026-06-13 erfolgreich: authentifizierter Publish/Subscribe-Smoke mit `homeassistant`-User und retained Topic nach Broker-Restart gruen; produktiv verbindet sich HA als User `homeassistant` | | Smart-Home Fachrepo | Gitea + Borg-Repo-Clone | `/mnt/user/services/smart-home-kalli` | keine | 
keine echten Secrets im Repo; `secrets-template/` nur Beispiele | Gitea, Home Assistant Mounts | `git status` sauber, HA liest `configuration.yaml` und `packages/` aus dem Clone | @@ -104,6 +105,7 @@ Aktuell relevante Dump-Artefakte unter `/mnt/user/backups/borg/dumps/latest`: - `filebrowser.bolt.dump` - `borg-ui.sqlite` - `grafana.sqlite` +- `n8n.sqlite.dump` - `unraid-flash-config.tar.gz` plus `unraid-flash-config.tar.gz.sha256` und Manifest - Monitoring-Stack: keine verpflichtenden Dump-Artefakte; Prometheus/Loki/Grafana named volumes sind Diagnose-/Dashboard-Zustand, keine primaere Restore-Quelle. - `komodo-mongo.archive.gz` (noch gesondert verifizieren) diff --git a/monitoring/prometheus/alerts.yml b/monitoring/prometheus/alerts.yml index 970d3ba..9fbb10f 100644 --- a/monitoring/prometheus/alerts.yml +++ b/monitoring/prometheus/alerts.yml @@ -131,6 +131,78 @@ groups: summary: "Latest Borg backup completed with warnings" description: "The latest Borg UI job completed with warnings for archive {{ $labels.archive }}." + - alert: HomelabBorgScopeSourceListMissing + expr: homelab_borg_scope_expected_file_present != 1 + for: 15m + labels: + severity: critical + annotations: + summary: "Borg expected source list is not visible" + description: "Borg UI cannot see the repo source list used for drift checks." + + - alert: HomelabBorgScopeMissingSources + expr: homelab_borg_scope_missing_sources_total > 0 + for: 15m + labels: + severity: critical + annotations: + summary: "Borg UI is missing expected backup sources" + description: "Borg UI is missing {{ $value }} source path(s) from ops/borg-ui/all-important-sources.txt." + + - alert: HomelabBorgScopeExtraSources + expr: homelab_borg_scope_extra_sources_total > 0 + for: 30m + labels: + severity: warning + annotations: + summary: "Borg UI has sources not tracked in the repo" + description: "Borg UI has {{ $value }} source path(s) that are not listed in ops/borg-ui/all-important-sources.txt." 
+ + - alert: HomelabBorgDumpMissing + expr: homelab_borg_dump_present == 0 + for: 15m + labels: + severity: critical + annotations: + summary: "Borg pre-backup dump is missing: {{ $labels.dump }}" + description: "Expected dump artifact {{ $labels.dump }} is not present in the latest dump set. The pre-backup dump job may have failed or stopped." + + - alert: HomelabBorgDumpStale + expr: homelab_borg_dump_age_seconds > 30 * 60 * 60 + for: 15m + labels: + severity: critical + annotations: + summary: "Borg pre-backup dump is stale: {{ $labels.dump }}" + description: "Dump artifact {{ $labels.dump }} is older than 30 hours. pre-backup-dumps.sh may have stopped; Borg would keep archiving stale database content without a job failure." + + - alert: HomelabBorgRepositoryCheckStale + expr: time() - homelab_borg_repository_last_check_timestamp_seconds > 14 * 24 * 60 * 60 + for: 30m + labels: + severity: warning + annotations: + summary: "Borg repository check is stale" + description: "Borg repository {{ $labels.repository }} has not had a recorded check for more than 14 days." + + - alert: HomelabBorgRetentionDisabled + expr: homelab_borg_schedule_prune_after_enabled != 1 + for: 30m + labels: + severity: warning + annotations: + summary: "Borg retention pruning is disabled" + description: "Scheduled Borg job {{ $labels.schedule }} does not run prune after backup." + + - alert: HomelabBorgCompactDisabled + expr: homelab_borg_schedule_compact_after_enabled != 1 + for: 30m + labels: + severity: warning + annotations: + summary: "Borg compaction is disabled" + description: "Scheduled Borg job {{ $labels.schedule }} does not run compact after backup." 
+ - alert: HomelabCriticalContainerDown expr: homelab_critical_container_running == 0 for: 5m diff --git a/ops/borg-ui/BACKUP_SCOPE.md b/ops/borg-ui/BACKUP_SCOPE.md index f251499..77cb2b5 100644 --- a/ops/borg-ui/BACKUP_SCOPE.md +++ b/ops/borg-ui/BACKUP_SCOPE.md @@ -48,11 +48,12 @@ The Unraid flash configuration archive is intentional as well and must be treate | Grafana | SQLite dump from `monitoring_grafana_data` + provisioned config in Git | `/local/borg-dumps`, `monitoring/grafana/provisioning`, `monitoring/grafana/dashboards` | | Filebrowser | file-backed state dump + file data | `/local/borg-dumps`, `/local/appdata/filebrowser` | | InfluxDB 3 Core | file data | `/local/appdata/influxdb3/data`, `/local/appdata/influxdb3/plugins` | +| n8n | SQLite dump + encrypted workflow/credential state | `/local/borg-dumps`, `/local/appdata/n8n/data` | | Home Assistant | HA-native backup + file state | `/local/appdata/homeassistant`, `/local/services/smart-home-kalli` | | Smart-Home MQTT / Mosquitto | file data | `/local/appdata/mosquitto/config`, `/local/appdata/mosquitto/data` | | Zigbee2MQTT (planned) | file data + coordinator state | `/local/appdata/zigbee2mqtt`, `/local/services/smart-home-kalli` | | ESPHome (planned) | Fachrepo + optional build/runtime cache | `/local/services/smart-home-kalli/esphome`, optional `/local/appdata/esphome` | -| Hermes Agent | file data + SSH key | `/local/appdata/hermes-agent/data`, `/local/secrets/hermes_runner_id_ed25519` | +| Hermes Agent | file data + SSH key | SSH-Key via `/local/secrets`; `/local/appdata/hermes-agent/data` ist bewusst NICHT in `all-important-sources.txt`, weil der Stack geparkt ist (Review 2026-07-25). Beim Aktivieren des Stacks in die Quellliste aufnehmen. 
| | BentoPDF | rebuildable | no critical persistence in compose | ## Open Decisions and Coverage Gaps @@ -71,6 +72,17 @@ Option A umgesetzt: `pre-backup-dumps.sh` writes `nextcloud.dump` from `nextclou The live Unraid User Scripts execute repo scripts from `/mnt/user/services/homelab-infra`, while Komodo keeps stack workspaces below `/mnt/user/services/stacks`. These paths are now mounted into Borg UI as `/local/services/...` and included explicitly so host-side script hotfixes, stack workspace state, and posture-check state are recoverable. +### User-Daten-Shares ausserhalb des App-Scope + +Filebrowser serviert `/mnt/user/projekte`, `/mnt/user/documents` und `/mnt/user/photos` komplett (`ops/filebrowser/docker-compose.yml`). Der Borg-Scope deckt aber bewusst nur die App-Unterordner ab (`documents/paperless*`, `documents/nextcloud-data`, `documents/scans_inbox`, `photos/immich`, `photos/family_archive`). + +- **`/mnt/user/projekte`** ist aktuell in **keinem** Borg-Scope. Ad-hoc-Dateien, die direkt unter `documents/` oder `photos/` (ausserhalb der genannten App-Ordner) abgelegt werden, ebenfalls nicht. +- Entscheidung Operator offen (Eintrag in `docs/MASTER_TODO.md`): Entweder `projekte` als eigenen read-only Borg-UI-Mount + Quelllisten-Eintrag aufnehmen, oder bewusst als "nur lokal, nicht DR-relevant" bestaetigen. Bis zur Entscheidung gilt: dort liegende Originaldaten sind **nicht** wiederherstellbar. + +### Komodo keys + +Production still stores Komodo Core/Periphery keys in the Docker named volume `komodo_komodo_keys`. This is a known open migration item and is not fixed by the Borg source list alone. Target state: move the keys to a host path such as `/mnt/user/appdata/komodo/keys` and mount that path into both Komodo containers, then include it in Borg. Do not treat this as solved until the live Compose stack has been migrated and Periphery reconnect has been verified. 
+ ## Database Dumps Required ### Shared PostgreSQL (`postgresql17`, runtime PostgreSQL 18) @@ -89,6 +101,7 @@ The live Unraid User Scripts execute repo scripts from `/mnt/user/services/homel - Komodo MongoDB - SQLite: `gitea`, `vaultwarden`, `speedtest-tracker`, `borg-ui`, `grafana` +- SQLite: `n8n` (`n8n.sqlite.dump`, credentials require the matching `N8N_ENCRYPTION_KEY`) - File-backed state: `filebrowser.bolt.dump` - Unraid flash config: `unraid-flash-config.tar.gz` plus `unraid-flash-config.tar.gz.sha256` - Home Assistant native backups: created by HA under `/mnt/user/appdata/homeassistant/backups` and captured as file state diff --git a/ops/borg-ui/all-important-sources.txt b/ops/borg-ui/all-important-sources.txt index 459ae7f..ad6b4c8 100644 --- a/ops/borg-ui/all-important-sources.txt +++ b/ops/borg-ui/all-important-sources.txt @@ -18,6 +18,12 @@ /local/appdata/borg-ui/data /local/appdata/komodo/periphery /local/appdata/komodo/core +/local/appdata/nextcloud/html +/local/nextcloud/data +/local/appdata/n8n/data +/local/appdata/filebrowser +/local/appdata/influxdb3/data +/local/appdata/influxdb3/plugins /local/services/homelab-infra /local/services/smart-home-kalli /local/services/stacks diff --git a/ops/borg-ui/scripts/pre-backup-dumps.sh b/ops/borg-ui/scripts/pre-backup-dumps.sh index d64c83f..0ffd5df 100755 --- a/ops/borg-ui/scripts/pre-backup-dumps.sh +++ b/ops/borg-ui/scripts/pre-backup-dumps.sh @@ -325,6 +325,7 @@ main() { # Additional host-side SQLite dumps for admin tooling with appdata files. 
dump_sqlite_file "/mnt/user/appdata/borg-ui/data/borg.db" "$LATEST_DIR/borg-ui.sqlite" "borg-ui" dump_sqlite_file "/var/lib/docker/volumes/monitoring_grafana_data/_data/grafana.db" "$LATEST_DIR/grafana.sqlite" "grafana" + dump_sqlite_file "/mnt/user/appdata/n8n/data/database.sqlite" "$LATEST_DIR/n8n.sqlite.dump" "n8n" # MongoDB dump_mongo_container "komodo-mongo" "$LATEST_DIR/komodo-mongo.archive.gz" diff --git a/ops/h-drive-nearline/pull-critical-backups.ps1 b/ops/h-drive-nearline/pull-critical-backups.ps1 index c227a06..dedb07f 100644 --- a/ops/h-drive-nearline/pull-critical-backups.ps1 +++ b/ops/h-drive-nearline/pull-critical-backups.ps1 @@ -25,6 +25,7 @@ $Jobs = @( "immich.dump", "komodo-mongo.archive.gz", "mealie.dump", + "n8n.sqlite.dump", "nextcloud.dump", "postgresql17-authelia.dump", "postgresql17-globals.sql", diff --git a/ops/restore-tests/check-restore-freshness.ps1 b/ops/restore-tests/check-restore-freshness.ps1 index dbfcd75..957f855 100644 --- a/ops/restore-tests/check-restore-freshness.ps1 +++ b/ops/restore-tests/check-restore-freshness.ps1 @@ -6,6 +6,7 @@ param( ) $checks = @( + @{ Name = "postgresql17-globals.sql"; Path = Join-Path $DumpRoot "postgresql17-globals.sql" }, @{ Name = "postgresql17-paperless.dump"; Path = Join-Path $DumpRoot "postgresql17-paperless.dump" }, @{ Name = "postgresql17-mailarchiver.dump"; Path = Join-Path $DumpRoot "postgresql17-mailarchiver.dump" }, @{ Name = "mealie.dump"; Path = Join-Path $DumpRoot "mealie.dump" }, @@ -13,6 +14,7 @@ $checks = @( @{ Name = "nextcloud.dump"; Path = Join-Path $DumpRoot "nextcloud.dump" }, @{ Name = "gitea.sqlite.dump"; Path = Join-Path $DumpRoot "gitea.sqlite.dump" }, @{ Name = "vaultwarden.sqlite.dump"; Path = Join-Path $DumpRoot "vaultwarden.sqlite.dump" }, + @{ Name = "n8n.sqlite.dump"; Path = Join-Path $DumpRoot "n8n.sqlite.dump" }, @{ Name = "speedtest-tracker.sqlite.dump"; Path = Join-Path $DumpRoot "speedtest-tracker.sqlite.dump" }, @{ Name = "filebrowser.bolt.dump"; Path = 
Join-Path $DumpRoot "filebrowser.bolt.dump" }, @{ Name = "unraid-flash-config.tar.gz"; Path = Join-Path $DumpRoot "unraid-flash-config.tar.gz" } diff --git a/ops/restore-tests/check-restore-freshness.sh b/ops/restore-tests/check-restore-freshness.sh index 48dadfc..84813d0 100755 --- a/ops/restore-tests/check-restore-freshness.sh +++ b/ops/restore-tests/check-restore-freshness.sh @@ -89,6 +89,7 @@ check_pg_header() { } for dump in \ + postgresql17-globals.sql \ postgresql17-paperless.dump \ postgresql17-mailarchiver.dump \ mealie.dump \ @@ -96,6 +97,7 @@ for dump in \ nextcloud.dump \ gitea.sqlite.dump \ vaultwarden.sqlite.dump \ + n8n.sqlite.dump \ speedtest-tracker.sqlite.dump \ filebrowser.bolt.dump \ unraid-flash-config.tar.gz; do diff --git a/services/posture-check/export-prometheus-textfile.sh b/services/posture-check/export-prometheus-textfile.sh index d66cdb3..26038f3 100755 --- a/services/posture-check/export-prometheus-textfile.sh +++ b/services/posture-check/export-prometheus-textfile.sh @@ -4,7 +4,11 @@ set -euo pipefail TEXTFILE_DIR="${TEXTFILE_DIR:-/mnt/user/services/posture-check/textfile}" OUTPUT_FILE="${OUTPUT_FILE:-$TEXTFILE_DIR/homelab.prom}" BORG_CONTAINER="${BORG_CONTAINER:-borg-ui}" -CRITICAL_CONTAINERS="${CRITICAL_CONTAINERS:-traefik authelia postgresql17 gitea komodo-core komodo-mongo komodo-periphery vaultwarden borg-ui ntfy adguard unbound monitoring-alertmanager monitoring-alertmanager-ntfy-bridge monitoring-blackbox-exporter monitoring-cadvisor monitoring-grafana monitoring-loki monitoring-node-exporter monitoring-promtail immich_server immich_postgres immich_redis paperless-ngx nextcloud nextcloud-postgres nextcloud-redis mealie mealie-postgres}" +BORG_EXPECTED_SOURCES_FILE="${BORG_EXPECTED_SOURCES_FILE:-/local/services/homelab-infra/ops/borg-ui/all-important-sources.txt}" +# Host-Pfad der aktuellen Dump-Artefakte (pre-backup-dumps.sh schreibt hierhin). +# Wird host-seitig gestattet; der Exporter laeuft als Unraid User Script. 
+BORG_DUMP_DIR="${BORG_DUMP_DIR:-/mnt/user/backups/borg/dumps/latest}" +CRITICAL_CONTAINERS="${CRITICAL_CONTAINERS:-traefik authelia postgresql17 gitea komodo-core komodo-mongo komodo-periphery vaultwarden borg-ui ntfy adguard unbound monitoring-alertmanager monitoring-alertmanager-ntfy-bridge monitoring-blackbox-exporter monitoring-cadvisor monitoring-grafana monitoring-loki monitoring-node-exporter monitoring-promtail immich_server immich_postgres immich_redis paperless-ngx nextcloud nextcloud-postgres nextcloud-redis mealie mealie-postgres mail-archiver n8n homeassistant smarthome-mosquitto}" # Hinweis: Tailscale laeuft als natives Unraid-Plugin (kein Docker-Container) und # wird daher hier bewusst NICHT als kritischer Container gefuehrt (Stand 2026-06-06). @@ -90,11 +94,32 @@ EOF # TYPE homelab_borg_last_success gauge # HELP homelab_borg_last_job_warning Whether the most recent Borg backup job completed with warnings. # TYPE homelab_borg_last_job_warning gauge +# HELP homelab_borg_repository_last_check_timestamp_seconds Unix timestamp of the latest Borg repository check known to Borg UI. +# TYPE homelab_borg_repository_last_check_timestamp_seconds gauge +# HELP homelab_borg_scope_expected_file_present Whether the expected Borg source list file is visible inside Borg UI. +# TYPE homelab_borg_scope_expected_file_present gauge +# HELP homelab_borg_scope_expected_sources_total Number of expected Borg source paths from the repo source list. +# TYPE homelab_borg_scope_expected_sources_total gauge +# HELP homelab_borg_scope_configured_sources_total Number of Borg source paths configured in Borg UI. +# TYPE homelab_borg_scope_configured_sources_total gauge +# HELP homelab_borg_scope_missing_sources_total Number of expected Borg source paths missing from Borg UI. +# TYPE homelab_borg_scope_missing_sources_total gauge +# HELP homelab_borg_scope_extra_sources_total Number of Borg UI source paths not present in the repo source list. 
+# TYPE homelab_borg_scope_extra_sources_total gauge +# HELP homelab_borg_scope_source_configured Whether an expected Borg source path is configured in Borg UI. +# TYPE homelab_borg_scope_source_configured gauge +# HELP homelab_borg_schedule_prune_after_enabled Whether a Borg scheduled job runs prune after backup. +# TYPE homelab_borg_schedule_prune_after_enabled gauge +# HELP homelab_borg_schedule_compact_after_enabled Whether a Borg scheduled job runs compact after backup. +# TYPE homelab_borg_schedule_compact_after_enabled gauge EOF if docker inspect "$BORG_CONTAINER" >/dev/null 2>&1; then - docker exec -i "$BORG_CONTAINER" python3 - <<'PY' + docker exec -i -e BORG_EXPECTED_SOURCES_FILE="$BORG_EXPECTED_SOURCES_FILE" "$BORG_CONTAINER" python3 - <<'PY' import datetime as dt +import json +import os +from pathlib import Path import sqlite3 conn = sqlite3.connect("/data/borg.db") @@ -135,6 +160,9 @@ def parse_ts(value): def escape_label(value): return (value or "").replace("\\", "\\\\").replace('"', '\\"') +def bool_metric(value): + return 1 if value else 0 + latest_status = latest["status"] if latest else "missing" latest_success = 1 if latest_status in ("completed", "completed_with_warnings") else 0 latest_warning = 1 if latest_status == "completed_with_warnings" else 0 @@ -145,12 +173,107 @@ completed_archive = escape_label(completed["archive_name"] if completed else "") print(f'homelab_borg_last_success{{status="{latest_status}",archive="{latest_archive}"}} {latest_success}') print(f'homelab_borg_last_job_warning{{status="{latest_status}",archive="{latest_archive}"}} {latest_warning}') print(f'homelab_borg_last_completed_timestamp_seconds{{archive="{completed_archive}"}} {completed_ts}') + +repo = cur.execute(""" + select id, name, source_directories, last_check + from repositories + order by id + limit 1 +""").fetchone() + +if repo: + repo_name = escape_label(repo["name"] or str(repo["id"])) + 
print(f'homelab_borg_repository_last_check_timestamp_seconds{{repository="{repo_name}"}} {parse_ts(repo["last_check"])}') + + try: + configured_sources = json.loads(repo["source_directories"] or "[]") + except json.JSONDecodeError: + configured_sources = [] +else: + configured_sources = [] + +expected_path = Path(os.environ.get("BORG_EXPECTED_SOURCES_FILE", "")) +expected_file_present = expected_path.is_file() +if expected_file_present: + expected_sources = [ + line.strip() + for line in expected_path.read_text(encoding="utf-8").splitlines() + if line.strip() and not line.lstrip().startswith("#") + ] +else: + expected_sources = [] + +configured_set = set(configured_sources) +expected_set = set(expected_sources) +missing_sources = [source for source in expected_sources if source not in configured_set] +extra_sources = [source for source in configured_sources if source not in expected_set] + +print(f"homelab_borg_scope_expected_file_present {bool_metric(expected_file_present)}") +print(f"homelab_borg_scope_expected_sources_total {len(expected_sources)}") +print(f"homelab_borg_scope_configured_sources_total {len(configured_sources)}") +print(f"homelab_borg_scope_missing_sources_total {len(missing_sources)}") +print(f"homelab_borg_scope_extra_sources_total {len(extra_sources)}") + +for source in expected_sources: + value = 1 if source in configured_set else 0 + print(f'homelab_borg_scope_source_configured{{source="{escape_label(source)}"}} {value}') + +for source in extra_sources: + print(f'homelab_borg_scope_source_configured{{source="{escape_label(source)}",state="extra"}} 0') + +for schedule in cur.execute(""" + select id, name, run_prune_after, run_compact_after + from scheduled_jobs + where enabled = 1 + order by id +"""): + schedule_name = escape_label(schedule["name"] or str(schedule["id"])) + print(f'homelab_borg_schedule_prune_after_enabled{{schedule="{schedule_name}"}} {bool_metric(schedule["run_prune_after"])}') + 
print(f'homelab_borg_schedule_compact_after_enabled{{schedule="{schedule_name}"}} {bool_metric(schedule["run_compact_after"])}') PY else printf 'homelab_borg_last_success{status="container_missing",archive=""} 0\n' printf 'homelab_borg_last_job_warning{status="container_missing",archive=""} 0\n' printf 'homelab_borg_last_completed_timestamp_seconds{archive=""} 0\n' + printf 'homelab_borg_repository_last_check_timestamp_seconds{repository=""} 0\n' + printf 'homelab_borg_scope_expected_file_present 0\n' + printf 'homelab_borg_scope_expected_sources_total 0\n' + printf 'homelab_borg_scope_configured_sources_total 0\n' + printf 'homelab_borg_scope_missing_sources_total 0\n' + printf 'homelab_borg_scope_extra_sources_total 0\n' fi + + # Dump-Frische host-seitig messen. Schliesst den Blindfleck, dass Borg + # weiterlaeuft und stale Dumps archiviert, ohne dass ein Job-Fehler entsteht + # (pre-backup-dumps.sh gestoppt). Laeuft ausserhalb des borg-ui-Containers, + # weil die Dumps host-seitig unter $BORG_DUMP_DIR liegen. + cat <<'EOF' +# HELP homelab_borg_dump_present Whether an expected Borg pre-backup dump artifact exists in the latest dump set. +# TYPE homelab_borg_dump_present gauge +# HELP homelab_borg_dump_age_seconds Age in seconds of an expected Borg pre-backup dump artifact. 
+# TYPE homelab_borg_dump_age_seconds gauge +EOF + for dump in \ + postgresql17-globals.sql \ + postgresql17-mailarchiver.dump \ + postgresql17-paperless.dump \ + mealie.dump \ + immich.dump \ + nextcloud.dump \ + gitea.sqlite.dump \ + vaultwarden.sqlite.dump \ + n8n.sqlite.dump \ + unraid-flash-config.tar.gz \ + komodo-mongo.archive.gz; do + dump_path="$BORG_DUMP_DIR/$dump" + if [ -f "$dump_path" ]; then + dump_mtime="$(stat -c %Y "$dump_path" 2>/dev/null || echo 0)" + printf 'homelab_borg_dump_present{dump="%s"} 1\n' "$dump" + printf 'homelab_borg_dump_age_seconds{dump="%s"} %s\n' "$dump" "$(( now - dump_mtime ))" + else + printf 'homelab_borg_dump_present{dump="%s"} 0\n' "$dump" + fi + done } > "$tmp" # 0644 statt mktemp-default 0600, damit der node-exporter-Textfile-Collector