chore(deps): update minor-and-patch-updates

report: list unhealthy containers by name + image-age allowlist
Two improvements to the Daily Operations Report, prompted by the hidden immich_machine_learning outage (the container ran unhealthy for 2.3 days because the report only counted "unhealthy=1" without a name or reason):

1. collect_container_state: a new "Unhealthy Container" section lists every unhealthy container with its FailingStreak and the latest healthcheck output, so it is immediately visible WHICH container is failing and WHY.
2. collect_image_freshness: a new image-age allowlist (image-age-allow.patterns). Deliberately pinned but current/recommended images (immich_postgres = exactly Immich's pin; blackbox-exporter v0.28.0 = latest) are exempted from the image-age warning until a recheck date. Once the recheck date has passed, the exemption no longer applies, which forces a re-evaluation instead of silent aging. The Top 10 table now has a "Hinweis" (note) column with the values "ueberaltert" (outdated) and "bewusst gepinnt" (deliberately pinned).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
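The "latest healthcheck output" column from item 1 can be illustrated with a minimal Python sketch. It is not the script's implementation (the report itself is a shell script); the function name `summarize_health_log` is hypothetical, and the input is assumed to be a list of dicts shaped like the `.State.Health.Log` entries of `docker inspect` (with `ExitCode` and `Output` fields). The 160-character limit and the fallback text mirror the values used in the shell code.

```python
def summarize_health_log(entries, max_len=160):
    """Return a one-line, Markdown-table-safe summary of the newest log entry.

    `entries` is assumed to look like docker inspect's .State.Health.Log:
    a list of dicts with at least "ExitCode" and "Output".
    """
    if not entries:
        return "(kein Output)"  # same fallback text as in the report script
    last = entries[-1]  # Docker appends entries, so the newest one is last
    text = f"exit={last['ExitCode']} out={last['Output']}"
    text = " ".join(text.split())      # collapse newlines and runs of spaces
    text = text.replace("|", "\\|")    # escape pipes so the table row stays intact
    return text[:max_len]


if __name__ == "__main__":
    sample = [
        {"ExitCode": 0, "Output": "ok"},
        {"ExitCode": 1, "Output": "curl: (28) timed out\nafter 5s | retry"},
    ]
    print(summarize_health_log(sample))
```

Escaping the pipe before truncating keeps the order used in the shell pipeline; a truncation may still cut an escape sequence in half at exactly 160 characters, which this sketch does not guard against.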
2026-06-10 10:20:20 +00:00 · 2026-06-10 11:08:44 +02:00 · 2026-06-10 11:02:27 +02:00 · 2026-06-10 10:06:52 +02:00
6 changed files with 144 additions and 15 deletions
@@ -34,6 +34,15 @@ services:
    container_name: immich_machine_learning
    image: ghcr.io/immich-app/immich-machine-learning:release@sha256:a2501141440f10516d329fdfba2c68082e19eb9ba6016c061ac80d23beadf7f3
    restart: unless-stopped
    environment:
      # Workaround for the gunicorn 25.1.0 fork deadlock (the worker hangs in
      # futex after "Control socket listening" and never reaches "Application
      # startup complete"). Disabling mimalloc via LD_PRELOAD avoids the lock
      # in the forked worker. mimalloc is purely an allocator optimization, so
      # this is functionally non-critical.
      # Upstream regression since Immich 2.6 (immich#27228, #22317), no
      # official fix. Re-check: on an Immich/gunicorn update, remove this and
      # verify that the worker boots cleanly again.
      LD_PRELOAD: ""
    volumes:
      - model-cache:/cache
    networks:
@@ -1,6 +1,6 @@
 services:
  n8n:
-    image: docker.n8n.io/n8nio/n8n:2.26.1@sha256:1e6a06e20e78ca62e39ecad02d42ffbef4e36f23ea0b18938b8c65b8c58a9fa0
+    image: docker.n8n.io/n8nio/n8n:2.26.2@sha256:61ba01bc5e39304bbc928c9dbecd938c3a5cc1331b68affba6a34d0f654c43d9
    container_name: n8n
    restart: unless-stopped
@@ -66,15 +66,18 @@ services:
    image: prom/blackbox-exporter:v0.28.0@sha256:e753ff9f3fc458d02cca5eddab5a77e1c175eee484a8925ac7d524f04366c2fc
    container_name: monitoring-blackbox-exporter
    restart: unless-stopped
    # Use AdGuard so *.kaleschke.info resolves to the internal Traefik IP.
    # External resolvers (1.1.1.1/8.8.8.8) return the public WAN IP, which
    # causes hairpin-NAT timeouts when probing from inside the Docker network.
    dns:
-      - 1.1.1.1
+      - 172.23.0.3
      - 8.8.8.8
    command:
      - --config.file=/etc/blackbox_exporter/blackbox.yml
    volumes:
      - ./blackbox/blackbox.yml:/etc/blackbox_exporter/blackbox.yml:ro
    networks:
      - monitoring_net
      - dns_net
    expose:
      - "9115"
    security_opt:
@@ -367,6 +370,8 @@ networks:
    driver: bridge
  frontend_net:
    external: true
  dns_net:
    external: true
 volumes:
  prometheus_data:
@@ -11,6 +11,7 @@ SINCE="${SINCE:-24h}"
 MAX_LOG_LINES="${MAX_LOG_LINES:-80}"
 CERT_MAX_ROWS="${CERT_MAX_ROWS:-12}"
 IMAGE_AGE_WARN_DAYS="${IMAGE_AGE_WARN_DAYS:-180}"
 IMAGE_AGE_ALLOW_FILE="${IMAGE_AGE_ALLOW_FILE:-/mnt/user/services/homelab-infra/services/posture-check/image-age-allow.patterns}"
 LOG_VOLUME_TOP_N="${LOG_VOLUME_TOP_N:-10}"
 DISK_USAGE_WARN_PCT="${DISK_USAGE_WARN_PCT:-85}"
 CERT_WARN_DAYS="${CERT_WARN_DAYS:-21}"
@@ -459,6 +460,10 @@ with open("/acme.json", "r", encoding="utf-8") as handle:
    data = json.load(handle)
 now = datetime.now(timezone.utc)
 # Deduplicate: for each unique set of domains keep only the longest-lived cert.
 # Traefik stores both the old and the newly-issued cert in acme.json during
 # the renewal window, which would otherwise produce a false warning.
 best = {}  # frozenset(domains) -> (days, expire_date_iso, names)
 for resolver in data.values():
    for cert in resolver.get("Certificates", []):
        domain = cert.get("domain", {}).get("main") or "-"
@@ -474,7 +479,11 @@ for resolver in data.values():
        not_after = datetime.strptime(decoded["notAfter"], "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
        days = (not_after - now).days
        names = ", ".join([domain, *sans])
-        print(f"{days}\t{not_after.date().isoformat()}\t{names}")
+        key = frozenset([domain, *sans])
        if key not in best or days > best[key][0]:
            best[key] = (days, not_after.date().isoformat(), names)
 for days, expires, names in best.values():
    print(f"{days}\t{expires}\t{names}")
 PY
  then
    if [ ! -s "$cert_file" ]; then
@@ -573,13 +582,36 @@ collect_image_freshness() {
  local image_file="$TMP_DIR/images.tsv"
  local image_warnings=0
  local image_allowed=0
  local now_epoch
  : > "$image_file"
  now_epoch="$(date +%s)"
  # Parse the image-age allowlist: containers deliberately pinned to a stable or
  # upstream-recommended image. Each entry carries a recheck date; once that
  # date has passed the suppression lapses, so a pin gets re-reviewed instead
  # of silently aging forever.
  local allow_file="$TMP_DIR/image-allow.tsv"
  : > "$allow_file"
  if [ -f "$IMAGE_AGE_ALLOW_FILE" ]; then
    while IFS= read -r line; do
      line="${line%%#*}"
      line="$(printf '%s' "$line" | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//')"
      [ -n "$line" ] || continue
      local a_name a_date a_epoch
      a_name="$(printf '%s' "$line" | awk '{ print $1 }')"
      a_date="$(printf '%s' "$line" | awk '{ print $2 }')"
      [ -n "$a_name" ] && [ -n "$a_date" ] || continue
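      # Compare against the END of the recheck day so the exemption lapses only
      # AFTER that date, as documented in the allowlist file.
      a_epoch="$(date -d "$a_date 23:59:59" +%s 2>/dev/null || echo 0)"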
      if [ "$a_epoch" -ge "$now_epoch" ]; then
        printf '%s\t%s\n' "$a_name" "$a_date" >> "$allow_file"
      fi
    done < "$IMAGE_AGE_ALLOW_FILE"
  fi
  while IFS= read -r name; do
    [ -n "$name" ] || continue
-    local image_id created_iso created_epoch age_days image_tag
+    local image_id created_iso created_epoch age_days image_tag note recheck
    image_id="$(docker inspect --format '{{.Image}}' "$name" 2>/dev/null || true)"
    [ -n "$image_id" ] || continue
    created_iso="$(docker image inspect --format '{{.Created}}' "$image_id" 2>/dev/null || true)"
@@ -588,33 +620,46 @@ collect_image_freshness() {
    created_epoch="$(date -d "$created_iso" +%s 2>/dev/null || echo 0)"
    [ "$created_epoch" -gt 0 ] || continue
    age_days=$(( (now_epoch - created_epoch) / 86400 ))
-    printf '%d\t%s\t%s\n' "$age_days" "$name" "$image_tag" >> "$image_file"
+    note=""
    if [ "$age_days" -ge "$IMAGE_AGE_WARN_DAYS" ]; then
      recheck="$(awk -F '\t' -v n="$name" '$1 == n { print $2; found = 1 } END { exit !found }' "$allow_file" || true)"
      if [ -n "$recheck" ]; then
        note="bewusst gepinnt (recheck $recheck)"
        image_allowed=$((image_allowed + 1))
      else
        note="ueberaltert"
        image_warnings=$((image_warnings + 1))
      fi
    fi
    printf '%d\t%s\t%s\t%s\n' "$age_days" "$name" "$image_tag" "$note" >> "$image_file"
  done < <(docker ps --format '{{.Names}}')
  set_summary "image_warnings" "$image_warnings"
  set_summary "image_allowed" "$image_allowed"
  if [ ! -s "$image_file" ]; then
    append "- Keine Image-Daten verfuegbar."
    record_section_error "images" "Keine Image-Daten ermittelt"
  else
    append "- Schwelle Warnung: Image aelter als $IMAGE_AGE_WARN_DAYS Tage"
-    append "- Container mit Image >= $IMAGE_AGE_WARN_DAYS Tage: $image_warnings"
+    append "- Container mit ueberaltertem Image (gewarnt): $image_warnings"
    append "- Davon bewusst gepinnt (von Warnung ausgenommen): $image_allowed"
    append "- Allowlist-Quelle: \`$IMAGE_AGE_ALLOW_FILE\`"
    append ""
    append "### Aelteste Images (Top 10)"
    append ""
-    append "| Alter Tage | Container | Image |"
+    append "| Alter Tage | Container | Image | Hinweis |"
-    append "|---:|---|---|"
+    append "|---:|---|---|---|"
-    sort -nr "$image_file" | head -n 10 | while IFS="$(printf '\t')" read -r age name img; do
+    sort -nr "$image_file" | head -n 10 | while IFS="$(printf '\t')" read -r age name img note; do
-      append "| $age | $name | $img |"
+      append "| $age | $name | $img | ${note:-} |"
    done
    append ""
-    if [ "$image_warnings" -eq 0 ]; then
+    if [ "$image_warnings" -eq 0 ] && [ "$image_allowed" -eq 0 ]; then
      append "Bewertung: Keine Container mit ueberalterten Images. CVE-Hygiene aus dieser Sicht ok."
    elif [ "$image_warnings" -eq 0 ]; then
      append "Bewertung: Keine ungeprueft ueberalterten Images. $image_allowed Container sind bewusst gepinnt und mit Recheck-Datum dokumentiert."
    else
-      append "Bewertung: $image_warnings Container nutzen Images aelter als $IMAGE_AGE_WARN_DAYS Tage. Update-Pipeline und CVE-Status pruefen."
+      append "Bewertung: $image_warnings Container nutzen ueberalterte Images (nicht in der Allowlist). Update-Pipeline und CVE-Status pruefen."
    fi
  fi
  append ""
@@ -655,6 +700,31 @@ collect_container_events() {
 collect_container_state() {
  append "## Container-Zustand"
  append ""
  append "### Unhealthy Container"
  local unhealthy_file="$TMP_DIR/unhealthy.log"
  docker ps --filter health=unhealthy --format '{{.Names}}' > "$unhealthy_file"
  if [ ! -s "$unhealthy_file" ]; then
    append "- Keine."
  else
    append "| Container | FailingStreak | Letzter Healthcheck |"
    append "|---|---:|---|"
    while IFS= read -r name; do
      [ -n "$name" ] || continue
      local streak hc
      streak="$(docker inspect "$name" --format '{{.State.Health.FailingStreak}}' 2>/dev/null || echo '?')"
      # Fetch the last non-empty health log entry, flatten it to one line, and
      # escape pipe characters so the Markdown table does not break.
      hc="$(docker inspect "$name" --format '{{range .State.Health.Log}}exit={{.ExitCode}} out={{.Output}}~~~{{end}}' 2>/dev/null \
        | tr '\n' ' ' \
        | awk -F '~~~' '{ for (i = NF - 1; i >= 1; i--) { if ($i != "") { print $i; break } } }' \
        | sed -E 's/[[:space:]]+/ /g; s/\|/\\|/g' \
        | cut -c1-160)"
      append "| \`$name\` | ${streak:-?} | ${hc:-(kein Output)} |"
    done < "$unhealthy_file"
  fi
  append ""
  append "### Nicht laufende Container"
  local stopped_file="$TMP_DIR/stopped.log"
  docker ps -a --filter status=exited --filter status=dead --filter status=created --format '{{.Names}}\t{{.Status}}' > "$stopped_file"
@@ -0,0 +1,30 @@
 # image-age-allow.patterns - Daily Operations Report
 #
 # Containers that are deliberately pinned to an older but current/recommended
 # image should not raise an "image outdated" warning every day.
 #
 # Format per line:
 #   <container-name>  <YYYY-MM-DD recheck>   # reason
 #
 #   - Column 1: exact container name (docker ps {{.Names}}).
 #   - Column 2: recheck date. AFTER this date the exemption NO LONGER
 #     applies and the container shows up as a warning again, which forces a
 #     human re-evaluation instead of silent aging.
 #   - Everything after '#' is a comment. Blank lines are ignored.
 #
 # An exemption does NOT mean "the image does not matter" but "check again on
 # date X whether it is still the recommended/current version".
 #
 # Last reviewed: 2026-06-10
 # immich_postgres: exactly the DB image officially recommended by Immich,
 # pinned by digest (14-vectorchord0.4.3-pgvectors0.2.0). As of 2026-06-10,
 # Immich's own docker-compose on main pins the same tag with an identical
 # digest. No update as long as Immich does not recommend anything newer.
 # Re-check: whether Immich recommends a newer Postgres image.
 immich_postgres 2026-09-10
 # monitoring-blackbox-exporter: as of 2026-06-10, v0.28.0 is the LATEST
 # release (Dec 2025). The image age is only build age, not an outdated version.
 # Re-check: whether a blackbox_exporter version > v0.28.0 has been released.
 monitoring-blackbox-exporter 2026-09-10
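To make the allowlist format above concrete, here is a minimal Python sketch of how such a file could be parsed. It is an illustration only: the report script does this in shell, the function name `parse_allowlist` is hypothetical, and the sketch assumes that the recheck date itself is still covered ("after this date the exemption no longer applies"). Lines with a missing or unparseable date are skipped, which matches the shell code treating an invalid date as already expired.

```python
from datetime import date


def parse_allowlist(text, today):
    """Return {container_name: recheck_date_iso} for entries still in effect.

    Assumption: an entry is active while today <= recheck date.
    """
    active = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # a name without a recheck date is ignored
        name, recheck = parts[0], parts[1]
        try:
            recheck_date = date.fromisoformat(recheck)
        except ValueError:
            continue  # an unparseable date never suppresses a warning
        if recheck_date >= today:
            active[name] = recheck
    return active


if __name__ == "__main__":
    sample = """
    # comment line
    immich_postgres 2026-09-10
    some-old-pin 2026-01-01   # hypothetical, already expired
    no-date-entry
    """
    print(parse_allowlist(sample, date(2026, 6, 10)))
```

Expired entries are dropped at parse time rather than at lookup time, so a container with a lapsed pin is handled exactly like one that was never listed.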
@@ -18,7 +18,7 @@
 # Removing a pattern: replace with a fresh attention example in the next
 # daily report and consult before reintroducing.
 #
-# Last reviewed: 2026-05-21
+# Last reviewed: 2026-06-10
 # Loki internal query cancellations / scheduler chatter.
 # Why: Loki cancels internal queries continuously when downstream Promtails
@@ -72,3 +72,18 @@ authelia.*Request timeout occurred.*status_code=408
 #       noise becomes overwhelming, add a *narrow* pattern restricted to
 #       push contexts only (e.g. `vaultwarden.*push.*(ResolveError|...)`).
 vaultwarden.*(Token has expired|Invalid refresh token|Failed to decode.*refresh_token|POST /identity/connect/token => 401 Unauthorized)
 # AdGuard: Fritz!Box sends malformed SOA queries for myfritz.net / myfritz.link.
 # Why: AVM Fritz!Box devices send multi-question DNS SOA queries that violate
 #      RFC 1035 ("only 1 question allowed"). AdGuard rejects them with an error
 #      but they have no operational impact.
 # Re-check: if the same error appears for non-AVM domains, or if rate spikes
 #           well above 1000/day without a Fritz!Box reboot explaining it.
 adguard.*bad question section.*only 1 question allowed
 # Grafana: usage-stats collector looks for the Amazon Prometheus plugin, which
 # is not installed in this setup. The error is emitted once per stats cycle.
 # Why: GF_PLUGINS_PREINSTALL_DISABLED=true keeps the plugin list minimal;
 #      this lookup is harmless and does not affect any dashboard.
 # Re-check: only if Amazon Prometheus is added as a datasource.
 monitoring-grafana.*grafana-amazonprometheus-datasource not found
Author	SHA1	Message	Date
renovate	fc5807d2c6	chore(deps): update minor-and-patch-updates	2026-06-10 10:20:20 +00:00
Micha	2f64aee109	report: list unhealthy containers by name + image-age allowlist Two improvements to the Daily Operations Report, prompted by the hidden immich_machine_learning outage (the container ran unhealthy for 2.3 days because the report only counted "unhealthy=1" without a name or reason): 1. collect_container_state: a new "Unhealthy Container" section lists every unhealthy container with its FailingStreak and the latest healthcheck output, so it is immediately visible WHICH container is failing and WHY. 2. collect_image_freshness: a new image-age allowlist (image-age-allow.patterns). Deliberately pinned but current/recommended images (immich_postgres = exactly Immich's pin; blackbox-exporter v0.28.0 = latest) are exempted from the image-age warning until a recheck date. Once the recheck date has passed, the exemption no longer applies, which forces a re-evaluation instead of silent aging. The Top 10 table now has a "Hinweis" (note) column with the values "ueberaltert" (outdated) and "bewusst gepinnt" (deliberately pinned). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 11:08:44 +02:00
Micha	ed55b88ec1	immich-ml: clear LD_PRELOAD to work around the gunicorn 25.1.0 fork deadlock immich_machine_learning has been unhealthy since June 7: the gunicorn worker gets stuck in futex_do_wait after "Control socket listening" and never reaches "Application startup complete" (/ping -> ConnectTimeout/ReadTimeout). No OOM (22 GB free), no disk I/O wait, runs as root, the socket is created: a classic fork deadlock of mimalloc (LD_PRELOAD) in the forked worker under gunicorn 25.1.0. mimalloc is disabled via LD_PRELOAD="". It is purely an allocator optimization, so the change is functionally non-critical and reversible. Known upstream regression since Immich 2.6 (immich#27228, #22317) without an official fix; restart and force-recreate are documented there as ineffective. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 11:02:27 +02:00
Micha	ce747f687f	ops-report: cert dedup, blackbox DNS to AdGuard, new noise patterns Fixes three findings from the operations report of 2026-06-10: - daily-status-report.sh: certificates are deduplicated per domain set before evaluation; only the longest-lived cert counts. Traefik keeps both the old and the new cert in acme.json during renewal, which previously triggered a false KRITISCH (critical) warning (traefik.kaleschke.info, 5 days) although the new cert has 65 days of validity left. - monitoring/blackbox-exporter: DNS switched from 1.1.1.1/8.8.8.8 to AdGuard (172.23.0.3 via dns_net). External resolvers returned the WAN IP, which caused hairpin-NAT timeouts (9.5 s) for probes of cloud/glances (662 errors/day). - log-noise.patterns: Fritz!Box SOA errors (AdGuard, RFC 1035 violation) and the missing grafana-amazonprometheus-datasource plugin classified as known noise (~1800 lines/day). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 10:06:52 +02:00
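The certificate deduplication described in commit ce747f687f can be sketched in isolation as follows. This is a simplified illustration, not the script's code: the function name `dedupe_certs` and the tuple layout are assumptions, and the expiry dates in the example are made up to match the 5-day and 65-day figures from the commit message. The key idea, taken from the diff, is to use the `frozenset` of all names on a certificate as the key and keep only the entry with the most days remaining.

```python
def dedupe_certs(certs):
    """Keep only the longest-lived certificate per unique set of domain names.

    `certs` is an iterable of (days_left, expires_iso, names) tuples, where
    `names` is a list with the main domain first, followed by any SANs.
    """
    best = {}  # frozenset(names) -> (days_left, expires_iso, "a, b, c")
    for days, expires, names in certs:
        key = frozenset(names)  # name order must not create a second key
        if key not in best or days > best[key][0]:
            best[key] = (days, expires, ", ".join(names))
    return list(best.values())


if __name__ == "__main__":
    certs = [
        (5, "2026-06-15", ["traefik.kaleschke.info"]),   # old cert, about to expire
        (65, "2026-08-14", ["traefik.kaleschke.info"]),  # freshly renewed cert
        (30, "2026-07-10", ["a.example", "b.example"]),  # hypothetical second set
    ]
    for days, expires, names in dedupe_certs(certs):
        print(f"{days}\t{expires}\t{names}")
```

With this input the 5-day entry is dropped, so only the renewed certificate feeds into the warning threshold.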