Prepare monitoring alert rules
```diff
@@ -62,9 +62,9 @@ Context deliberately secured before further live changes are made:
 | Status | Task | Result |
 |---|---|---|
 | in progress (prepared) | Implement Immich restore test | `ops/restore-tests/immich-restore-test.sh`, `immich-compose.test.yml` and dispatcher entry prepared; local and host `--what-if` successful; host preflight 2026-05-27: `immich.dump` 66M, `/mnt/user/backups` approx. 3.7T free; completion only after a real host run with a report under `/mnt/user/backups/restore-reports/` |
-| open | Build Borg stale alert | Alert fires when the Borg archive is too old |
-| open | Build TLS cert expiry alert | Alert fires when the remaining validity drops below a threshold |
-| open | Build container down alert | Unexpectedly missing containers become visible |
+| in progress (rules prepared) | Build Borg stale alert | Textfile metric `homelab_borg_last_completed_timestamp_seconds` and Prometheus rules prepared; completion after host schedule + Prometheus reload/test alert |
+| in progress (rules prepared) | Build TLS cert expiry alert | Blackbox rules for 21-/7-day thresholds prepared; completion after Prometheus reload/test alert |
+| in progress (rules prepared) | Build container down alert | Textfile metric `homelab_critical_container_running{name=...}` and alert prepared; completion after host schedule + Prometheus reload/test alert |
 | open | Define Family View dashboard | Uptime, backup freshness, cert days, disk fill level on one page |
 
 ## Sprint 4 - Family and operations docs
```
```diff
@@ -37,6 +37,12 @@ This document is now only a historical log. The current operational
 - Physical disk values taken from `docs/HARDWARE_INVENTORY.md` and the host readout: cache Samsung 970 EVO Plus 1.8T XFS, Disk1 WDC WD60EFAX 5.5T XFS on `md1p1`, parity TOSHIBA HDWG480 7.3T, boot Samsung Flash Drive 59.8G FAT32, H:/ as nearline target.
 - `docs/AUDIT_2026-05-25_TODO.md` Sprint 2 for storage layout and disk/share baseline marked as done. Retention calibration, monitoring thresholds, and the RESTORE_MATRIX detail classification remain regular follow-up tasks.
 
+### 2026-05-27 - F-08 alert rules prepared
+
+- `services/posture-check/export-prometheus-textfile.sh` generates textfile metrics for Borg backup freshness and critical containers under `/mnt/user/services/posture-check/textfile/homelab.prom`.
+- `monitoring/docker-compose.yml` enables the node exporter textfile collector. `monitoring/prometheus/alerts.yml` contains prepared alerts for Borg stale, Borg error status, Borg warning status, textfile stale, critical container down, and TLS cert expiry at 21/7 days.
+- No monitoring redeploy and no scheduled task in this step. Completion follows after the host schedule, a Prometheus reload, and a test alert.
+
 ### 2026-05-26 - Audit F-16 and F-20 completed (docs only)
 
 - F-16: `infra/redis` label reconciled with reality. `docs/SERVICE_CATALOG.md`, `docs/REPO_MAP.md`, `HOMELAB_ARCHITECTURE_MASTER_V2.md` section 13 and `docs/DISASTER_RECOVERY.md` bootstrap stage 2 now describe Redis as "primarily the Paperless Redis (app cache); historically set up as shared, in practice only used by Paperless". Immich, Nextcloud, and Mealie have their own Redis instances; Authelia deliberately runs without Redis. No Compose change.
```
```diff
@@ -62,6 +62,7 @@ INFLUXDB_BIND_IP=192.168.178.58
 - `https://monitoring.kaleschke.info` redirects to Authelia.
 - Grafana datasources `Prometheus`, `Loki` and `InfluxDB 3 Core` test successfully.
 - Prometheus targets: `prometheus`, `node-exporter`, `cadvisor`, `traefik`, `blackbox-http`.
+- Node exporter textfile collector: `/mnt/user/services/posture-check/textfile/homelab.prom` is populated by the host script `services/posture-check/export-prometheus-textfile.sh`.
 - Alertmanager is reachable and sends via `monitoring-alertmanager-ntfy-bridge` to `https://ntfy.kaleschke.info/homelab-alerts`.
 - Loki shows container logs with the labels `container`, `compose_project`, `compose_service`.
 - InfluxDB 3 Core contains the database `homelab`.
```
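Since the node exporter now simply forwards whatever the host script writes to `homelab.prom`, it can help to inspect that file directly. The sketch below is a hypothetical helper (`parse_textfile` is not part of the repository) that reads only the subset of the Prometheus text exposition format the host script produces: `# HELP`/`# TYPE` comments and `name{label="value",...} value` samples without timestamps. It keeps label values in their escaped form and is not a general-purpose parser.

```python
import re
from typing import Dict, Tuple

# One sample line: metric name, optional {labels}, whitespace, value (no timestamp).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair; the value may contain escaped characters such as \" or \\.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Key = Tuple[str, Tuple[Tuple[str, str], ...]]


def parse_textfile(text: str) -> Dict[Key, float]:
    """Map (metric name, sorted label pairs) to the sample value."""
    samples: Dict[Key, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no sample
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(LABEL_RE.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


if __name__ == "__main__":
    example = (
        "# TYPE homelab_critical_container_running gauge\n"
        'homelab_critical_container_running{name="traefik"} 1\n'
        'homelab_critical_container_running{name="gitea"} 0\n'
    )
    parsed = parse_textfile(example)
    down = [dict(labels)["name"] for (name, labels), value in parsed.items()
            if name == "homelab_critical_container_running" and value == 0]
    print(down)  # ['gitea']
```

Pointed at the real file (for example with `open(...).read()`), the same filter lists every critical container the exporter currently reports as not running.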
````diff
@@ -83,9 +84,17 @@ Blackbox HTTP alerts distinguish between a single broken endpoint and
 
 - `HomelabExternalConnectivityDown` fires when at least 5 public endpoints are unreachable at the same time for 8 minutes. This covers WAN, DNS, or provider outages, including longer DSL reconnects.
 - `HomelabEndpointDown` fires for individual endpoints only after 8 minutes and is suppressed while the collective alert is active. As a result, the Telekom 24-hour reconnect does not cause a flood of ntfy messages, one per domain.
+- `HomelabCertificateExpiresSoon` and `HomelabCertificateExpiresCritical` use Blackbox TLS metrics for 21-/7-day warnings.
+- `HomelabBorgBackupStale`, `HomelabBorgLastJobFailed`, `HomelabBorgLastJobCompletedWithWarnings` and `HomelabCriticalContainerDown` use host textfile metrics. Prerequisite: `services/posture-check/export-prometheus-textfile.sh` runs regularly on the host, recommended every 15 minutes.
 
 Test:
 
 ```bash
 curl -fsS http://alertmanager-ntfy-bridge:8080/healthz
 ```
+
+Refresh textfile metrics manually:
+
+```bash
+bash /mnt/user/services/homelab-infra/services/posture-check/export-prometheus-textfile.sh
+```
````
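The interplay between the two blackbox alerts can be modelled in a few lines. This is a simplified sketch, not Alertmanager's implementation: it ignores the 8-minute `for` window, grouping, and notification timing, and treats an endpoint as down when its (hypothetical) `probe_success` value is 0. Only the threshold of 5 endpoints and the suppression rule come from the text.

```python
from typing import Dict, List

COLLECTIVE_THRESHOLD = 5  # "at least 5 public endpoints" from the README


def notifications(probe_success: Dict[str, int]) -> List[str]:
    """Return the alerts that would notify for one snapshot of probe results."""
    down = sorted(endpoint for endpoint, ok in probe_success.items() if ok == 0)
    if len(down) >= COLLECTIVE_THRESHOLD:
        # The collective alert is active, so per-endpoint alerts are suppressed.
        return ["HomelabExternalConnectivityDown"]
    return [f"HomelabEndpointDown:{endpoint}" for endpoint in down]


if __name__ == "__main__":
    # Hypothetical endpoint names; a WAN reconnect takes all of them down at once.
    reconnect = {f"service{i}.example": 0 for i in range(8)}
    print(notifications(reconnect))  # ['HomelabExternalConnectivityDown']
    single = {"service0.example": 0, "service1.example": 1}
    print(notifications(single))  # ['HomelabEndpointDown:service0.example']
```

With fewer than 5 endpoints down, each one notifies on its own; at 5 or more, a single collective notification replaces them, which is what keeps a provider reconnect from producing one ntfy message per domain.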
```diff
@@ -280,11 +280,13 @@ services:
       - --path.procfs=/host/proc
       - --path.sysfs=/host/sys
       - --path.rootfs=/rootfs
+      - --collector.textfile.directory=/textfile
       - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
     volumes:
       - /proc:/host/proc:ro
       - /sys:/host/sys:ro
       - /:/rootfs:ro
+      - /mnt/user/services/posture-check/textfile:/textfile:ro
     networks:
       - monitoring_net
     expose:
```
```diff
@@ -28,6 +28,24 @@ groups:
           summary: "{{ $labels.instance }} is slow"
           description: "Blackbox probe duration is above 5 seconds for {{ $labels.instance }}."
 
+      - alert: HomelabCertificateExpiresSoon
+        expr: (probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time()) < 21 * 24 * 3600 and (probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time()) > 7 * 24 * 3600
+        for: 30m
+        labels:
+          severity: warning
+        annotations:
+          summary: "TLS certificate expires soon for {{ $labels.instance }}"
+          description: "The earliest certificate expiry for {{ $labels.instance }} is below 21 days."
+
+      - alert: HomelabCertificateExpiresCritical
+        expr: (probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time()) <= 7 * 24 * 3600
+        for: 15m
+        labels:
+          severity: critical
+        annotations:
+          summary: "TLS certificate is close to expiry for {{ $labels.instance }}"
+          description: "The earliest certificate expiry for {{ $labels.instance }} is at or below 7 days, or already expired."
+
   - name: homelab-host
     rules:
       - alert: HomelabDiskAlmostFull
```
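The two certificate rules split the remaining lifetime into three bands. A minimal sketch of the same comparisons in Python, assuming `expiry_ts` plays the role of `probe_ssl_earliest_cert_expiry` and `now_ts` the role of `time()`; the `for: 30m` / `for: 15m` pending periods are not modelled.

```python
from typing import Optional

DAY = 24 * 3600


def cert_alert(expiry_ts: float, now_ts: float) -> Optional[str]:
    """Return the name of the rule whose expr would match, or None."""
    remaining = expiry_ts - now_ts
    if remaining <= 7 * DAY:
        # Also matches already expired certificates (negative remaining time).
        return "HomelabCertificateExpiresCritical"
    if remaining < 21 * DAY:
        # Strictly between 7 and 21 days, like the `and ... > 7 * 24 * 3600` clause.
        return "HomelabCertificateExpiresSoon"
    return None
```

Because the warning expr excludes the last 7 days, a certificate matches at most one of the two rules at any time: exactly 21 days remaining matches neither, and exactly 7 days matches the critical rule.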
```diff
@@ -56,3 +74,59 @@ groups:
         annotations:
           summary: "Traefik 5xx responses for {{ $labels.service }}"
           description: "Traefik reports at least 5 5xx responses for {{ $labels.service }} within 5 minutes."
+
+  - name: homelab-backup-and-containers
+    rules:
+      - alert: HomelabTextfileExporterStale
+        expr: time() - homelab_textfile_exporter_last_run_timestamp_seconds > 2 * 60 * 60
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Homelab textfile metrics are stale"
+          description: "The host textfile exporter has not refreshed metrics for more than 2 hours."
+
+      - alert: HomelabBorgMetricsMissing
+        expr: absent(homelab_borg_last_completed_timestamp_seconds)
+        for: 15m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Borg backup metrics are missing"
+          description: "Prometheus cannot see the homelab_borg_last_completed_timestamp_seconds metric."
+
+      - alert: HomelabBorgBackupStale
+        expr: time() - homelab_borg_last_completed_timestamp_seconds > 30 * 60 * 60
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Borg backup is stale"
+          description: "The latest completed Borg backup is older than 30 hours."
+
+      - alert: HomelabBorgLastJobFailed
+        expr: homelab_borg_last_success != 1
+        for: 15m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Latest Borg backup did not complete successfully"
+          description: "The latest Borg UI job status is {{ $labels.status }} for archive {{ $labels.archive }}."
+
+      - alert: HomelabBorgLastJobCompletedWithWarnings
+        expr: homelab_borg_last_job_warning == 1
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Latest Borg backup completed with warnings"
+          description: "The latest Borg UI job completed with warnings for archive {{ $labels.archive }}."
+
+      - alert: HomelabCriticalContainerDown
+        expr: homelab_critical_container_running == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Critical container is down: {{ $labels.name }}"
+          description: "The host textfile exporter reports that critical container {{ $labels.name }} is not running."
```
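The six rules in `homelab-backup-and-containers` are plain threshold checks on the textfile metrics. A hedged sketch of their combined logic follows, assuming a hypothetical `TextfileSnapshot` that holds the values of one scrape; it ignores the `for:` durations and Prometheus label matching and is only meant for reasoning about which alerts a given file state would raise.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

HOUR = 3600


@dataclass
class TextfileSnapshot:
    """Hypothetical in-memory view of one scrape of homelab.prom."""
    last_run_ts: float
    borg_completed_ts: Optional[float]  # None models absent Borg series
    borg_last_success: int = 1
    borg_last_warning: int = 0
    containers: Dict[str, int] = field(default_factory=dict)


def firing_alerts(s: TextfileSnapshot, now: float) -> Set[str]:
    """Alerts whose expr would be true for this snapshot (`for:` ignored)."""
    alerts: Set[str] = set()
    if now - s.last_run_ts > 2 * HOUR:
        alerts.add("HomelabTextfileExporterStale")
    if s.borg_completed_ts is None:
        # absent() matches; the other Borg series are missing too, so their rules stay quiet.
        alerts.add("HomelabBorgMetricsMissing")
    else:
        if now - s.borg_completed_ts > 30 * HOUR:
            alerts.add("HomelabBorgBackupStale")
        if s.borg_last_success != 1:
            alerts.add("HomelabBorgLastJobFailed")
        if s.borg_last_warning == 1:
            alerts.add("HomelabBorgLastJobCompletedWithWarnings")
    for name, running in s.containers.items():
        if running == 0:
            alerts.add(f"HomelabCriticalContainerDown:{name}")
    return alerts


if __name__ == "__main__":
    now = 100 * HOUR
    # Mirrors the script's fallback when the borg-ui container is missing: timestamp 0, success 0.
    fallback = TextfileSnapshot(last_run_ts=now - 600, borg_completed_ts=0, borg_last_success=0)
    print(sorted(firing_alerts(fallback, now)))
    # ['HomelabBorgBackupStale', 'HomelabBorgLastJobFailed']
```

The fallback case shows the design choice in the script: when Borg data cannot be read, it writes a timestamp of 0 and a success value of 0 rather than omitting the series, so both the stale and the failed rule trip instead of the alerts going silent.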
```diff
@@ -0,0 +1,110 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+TEXTFILE_DIR="${TEXTFILE_DIR:-/mnt/user/services/posture-check/textfile}"
+OUTPUT_FILE="${OUTPUT_FILE:-$TEXTFILE_DIR/homelab.prom}"
+BORG_CONTAINER="${BORG_CONTAINER:-borg-ui}"
+CRITICAL_CONTAINERS="${CRITICAL_CONTAINERS:-traefik authelia postgresql17 gitea komodo-core komodo-mongo komodo-periphery vaultwarden borg-ui ntfy adguard unbound Tailscale-Docker monitoring-alertmanager monitoring-alertmanager-ntfy-bridge monitoring-blackbox-exporter monitoring-cadvisor monitoring-grafana monitoring-loki monitoring-node-exporter monitoring-promtail immich_server immich_postgres immich_redis paperless-ngx nextcloud nextcloud-postgres nextcloud-redis mealie mealie-postgres}"
+
+mkdir -p "$TEXTFILE_DIR"
+tmp="$(mktemp "$TEXTFILE_DIR/homelab.prom.XXXXXX")"
+cleanup() {
+  rm -f "$tmp"
+}
+trap cleanup EXIT
+
+now="$(date +%s)"
+
+{
+  cat <<'EOF'
+# HELP homelab_textfile_exporter_last_run_timestamp_seconds Unix timestamp of the last successful homelab textfile exporter run.
+# TYPE homelab_textfile_exporter_last_run_timestamp_seconds gauge
+EOF
+  printf 'homelab_textfile_exporter_last_run_timestamp_seconds %s\n' "$now"
+
+  cat <<'EOF'
+# HELP homelab_critical_container_running Whether a critical container is currently running according to docker inspect.
+# TYPE homelab_critical_container_running gauge
+EOF
+  for container in $CRITICAL_CONTAINERS; do
+    running="0"
+    if docker inspect -f '{{.State.Running}}' "$container" 2>/dev/null | grep -qx true; then
+      running="1"
+    fi
+    printf 'homelab_critical_container_running{name="%s"} %s\n' "$container" "$running"
+  done
+
+  cat <<'EOF'
+# HELP homelab_borg_last_completed_timestamp_seconds Unix timestamp of the most recent completed Borg backup job known to Borg UI.
+# TYPE homelab_borg_last_completed_timestamp_seconds gauge
+# HELP homelab_borg_last_success Whether the most recent Borg backup job completed successfully.
+# TYPE homelab_borg_last_success gauge
+# HELP homelab_borg_last_job_warning Whether the most recent Borg backup job completed with warnings.
+# TYPE homelab_borg_last_job_warning gauge
+EOF
+
+  if docker inspect "$BORG_CONTAINER" >/dev/null 2>&1; then
+    docker exec -i "$BORG_CONTAINER" python3 - <<'PY'
+import datetime as dt
+import sqlite3
+
+conn = sqlite3.connect("/data/borg.db")
+conn.row_factory = sqlite3.Row
+cur = conn.cursor()
+
+latest = cur.execute("""
+    select status, completed_at, archive_name
+    from backup_jobs
+    order by coalesce(started_at, created_at) desc
+    limit 1
+""").fetchone()
+
+completed = cur.execute("""
+    select completed_at, archive_name
+    from backup_jobs
+    where status in ('completed', 'completed_with_warnings')
+      and completed_at is not null
+    order by completed_at desc
+    limit 1
+""").fetchone()
+
+def parse_ts(value):
+    if not value:
+        return 0
+    value = value.replace("Z", "+00:00")
+    try:
+        parsed = dt.datetime.fromisoformat(value)
+    except ValueError:
+        try:
+            parsed = dt.datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
+        except ValueError:
+            return 0
+    if parsed.tzinfo is None:
+        parsed = parsed.replace(tzinfo=dt.timezone.utc)
+    return int(parsed.timestamp())
+
+def escape_label(value):
+    return (value or "").replace("\\", "\\\\").replace('"', '\\"')
+
+latest_status = latest["status"] if latest else "missing"
+latest_success = 1 if latest_status in ("completed", "completed_with_warnings") else 0
+latest_warning = 1 if latest_status == "completed_with_warnings" else 0
+completed_ts = parse_ts(completed["completed_at"]) if completed else 0
+latest_archive = escape_label(latest["archive_name"] if latest else "")
+completed_archive = escape_label(completed["archive_name"] if completed else "")
+
+print(f'homelab_borg_last_success{{status="{latest_status}",archive="{latest_archive}"}} {latest_success}')
+print(f'homelab_borg_last_job_warning{{status="{latest_status}",archive="{latest_archive}"}} {latest_warning}')
+print(f'homelab_borg_last_completed_timestamp_seconds{{archive="{completed_archive}"}} {completed_ts}')
+PY
+  else
+    printf 'homelab_borg_last_success{status="container_missing",archive=""} 0\n'
+    printf 'homelab_borg_last_job_warning{status="container_missing",archive=""} 0\n'
+    printf 'homelab_borg_last_completed_timestamp_seconds{archive=""} 0\n'
+  fi
+} > "$tmp"
+
+mv "$tmp" "$OUTPUT_FILE"
+trap - EXIT
+
+printf '%s\n' "$OUTPUT_FILE"
```
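The script writes into a temp file in the target directory and then `mv`s it into place, so the node exporter never reads a half-written `homelab.prom`; the temp name `homelab.prom.XXXXXX` does not end in `.prom`, so the textfile collector ignores it meanwhile. The same pattern in Python, as a sketch with hypothetical helper names, uses `tempfile.mkstemp` plus `os.replace`. It additionally widens the file mode to 0644, on the assumption that the node exporter container might not run as root; whether that matters here depends on the compose `user` setting, which this commit does not show. The label escaping also adds line feeds, which the script's helper leaves unescaped.

```python
import os
import tempfile


def escape_label(value: str) -> str:
    """Escape a label value: backslash first, then double quote (as in the script),
    plus line feeds, which the script's helper does not handle."""
    return (value or "").replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def write_textfile_atomically(path: str, content: str) -> None:
    directory = os.path.dirname(path) or "."
    # Same directory as the target, so os.replace() is a rename on one filesystem.
    # The name "homelab.prom.<random>" does not end in ".prom", so the collector skips it.
    fd, tmp = tempfile.mkstemp(prefix=os.path.basename(path) + ".", dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # mkstemp (like mktemp(1)) creates mode 0600; 0644 lets a non-root exporter read it.
        os.chmod(tmp, 0o644)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # like the script's EXIT trap: no partial temp file left behind
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "homelab.prom")
        line = f'homelab_critical_container_running{{name="{escape_label("traefik")}"}} 1\n'
        write_textfile_atomically(target, line)
        print(os.listdir(workdir))  # ['homelab.prom']
```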
````diff
@@ -42,6 +42,27 @@ Time: daily 06:20, cron `20 6 * * *`.
 bash /mnt/user/services/homelab-infra/services/posture-check/compose-runtime-drift.sh
 ```
 
+## `prometheus-textfile-export-15min`
+
+Time: every 15 minutes, cron `*/15 * * * *`.
+
+Purpose:
+
+- make Borg backup freshness visible to Prometheus
+- export critical containers as an explicit 0/1 metric
+- basis for `HomelabBorgBackupStale`, `HomelabBorgLastJobFailed` and `HomelabCriticalContainerDown`
+
+```bash
+#!/bin/bash
+bash /mnt/user/services/homelab-infra/services/posture-check/export-prometheus-textfile.sh
+```
+
+Target file:
+
+```text
+/mnt/user/services/posture-check/textfile/homelab.prom
+```
+
 ## `homelab-operations-report-daily`
 
 Time: daily after Borg and the morning checks, e.g. 07:30, cron `30 7 * * *`.
````
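The 15-minute schedule and the 2-hour `HomelabTextfileExporterStale` threshold together determine how long a broken export can go unnoticed. A worked sketch of that arithmetic follows, ignoring scrape and rule-evaluation intervals (which add a little delay on top): the alert needs the threshold plus its `for: 15m` to elapse, and roughly every scheduled run inside that window must fail, since any successful run resets the timestamp.

```python
from typing import Tuple


def stale_alert_budget(run_interval_s: int, stale_threshold_s: int, for_s: int) -> Tuple[int, int]:
    """Return (seconds after the last successful run until the alert can fire,
    approximate number of consecutive scheduled runs that must fail in that window)."""
    fire_after_s = stale_threshold_s + for_s
    # Runs happen at run_interval_s, 2 * run_interval_s, ... after the last success;
    # a success at any of them up to fire_after_s would refresh the timestamp.
    missed_runs = fire_after_s // run_interval_s
    return fire_after_s, missed_runs


if __name__ == "__main__":
    # Values from this commit: cron */15, `> 2 * 60 * 60`, `for: 15m`.
    print(stale_alert_budget(15 * 60, 2 * 60 * 60, 15 * 60))  # (8100, 9)
```

With the values in this commit, the stale alert fires roughly 2 h 15 min after the last good run, once about nine consecutive 15-minute runs have failed.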