Selbst-enthaltener Stafettenstab nach Glance-Ausschluss (130s-Stop-Test):
Polling-Rate unveraendert mit Glance down. Restkandidaten dokumentiert
(Posture-Check, Periphery, Komodo-Self-Check, LAN-Geraet) plus konkrete
Testreihenfolge und Fix-Erwartung.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Schliesst die zwei in ALERT_RULES.md identifizierten Hoch-Luecken:
- up==0 (5m) als critical in neuer Gruppe homelab-meta — Scrape-Targets
(node-exporter/cadvisor/blackbox/traefik) sind nicht laenger stille
Ausfaelle.
- Disk-Critical bei >95% (5m) als critical, zusaetzlich zum bestehenden
Warning bei >85% — fuer DB/appdata/Cache-Schreibblockaden.
ALERT_RULES.md Tabellen und Status-Abschnitt aktualisiert.
Wird wirksam nach Prometheus-Reload via Komodo-Redeploy des monitoring-Stacks.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Nachschlagetabelle aller Prometheus-Alarmregeln (Trigger/Schwelle/Severity/
Aktion) plus Bewertung der Abdeckung. Identifiziert zwei echte blinde Flecke
(kein up==0 Target-Down, kein Disk-Critical-Tier) mit fertigem PromQL als
Empfehlung. Cross-Ref aus ALERTING_MAP.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original 2026-05-23 baseline was kept as a historical anchor but
the banner was too soft about how much of the concrete content is
already addressed. Reading the document standalone could mislead it
as a current TODO list.
Two changes, original text untouched:
1. Banner now explicitly says the document is mostly outdated,
not to be read as a TODO list, and that the per-finding status
lives in an appendix.
2. New "Status-Anhang 2026-05-30" at the end maps every concrete,
actionable finding to its current state (erledigt / geparkt /
entschieden nicht / offen / teilweise), grouped by the original
sections (Block 1-8) and by the Top-5 lists and Phase-1-to-4
roadmap.
Summary of what the appendix shows:
- Top 5 sofort: 5/5 erledigt
- Quick Wins: 6/7 erledigt, 1 geparkt
- Phase 1: 4/6 erledigt, 1 geparkt, 1 wartend
- Phase 2: 2/5 erledigt, 2 geparkt, 1 offen
- Phase 3: 1 entschieden-nicht, 1 teilweise, 3 offen
- Auth-Block (F-04/13/14/18): fully parked
Original "Schulnote 2-" no longer reflects reality; new note would
land at 1- to 2 but is not the point.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The three notes from 2026-05-23 had been sitting untracked in docs/
for a week. Variante A from today's review: keep them in docs/ with
explicit status banners and reference them from REPO_MAP.md, so they
stop being silent roommates and become discoverable.
- docs/STRATEGISCHE_BEWERTUNG_2026-05-23.md: historical baseline that
kicked off the 2026-05-25 audit cycle. Permanent audit anchor and
"where we stood on 2026-05-23" snapshot. Do not edit further.
- docs/CODEX_KONSOLIDIERUNG_2026-05-23.md: first Codex prompt for the
audit cycle, content worked through; kept as a Codex-prompt
template for future consolidation sweeps.
- docs/CODEX_JELLYFIN_REMOVAL_2026-05-23.md: Codex removal pattern,
task executed 2026-05-25; kept as a template for future stack
removals (Hermes review 2026-07-25, possibly BentoPDF / paperless-gpt
follow-ups).
REPO_MAP.md "Wichtige Dokumente" now lists all three with one-line
purpose plus the F-19 prep doc committed earlier today.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ops/policy-checks/mem-limits-baseline.md captures the deliberate
"not today" decision for memory limits plus the plan for when it
becomes relevant:
- Phase 1: 7 days of hourly docker stats snapshots
- Phase 2: derive Tier-1 peak per container
- Phase 3: set limits at peak * 1.5 with documented floors
(Postgres 1G, Mongo 1G, Redis 256M, etc.)
- Phase 4: roll out smallest-risk containers first, observe 24h
between stages
- Phase 5: Tier-2 only after a concrete trigger event
Next trigger: family invitation out + 4 weeks stable use, or
first real OOM event in docker-critical-events.sh, or a sudden
Immich/Nextcloud load spike where host swap becomes visible.
Today's policy check is clean (0 Critical, 1 documented Warning
on influxdb3-core user 0, 13 documented Info findings on host
ports / privileged exceptions / latest+digest tags).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Six files had outdated status notes that the F-09 first run on
2026-05-30 made wrong:
- ops/restore-tests/komodo-bootstrap-runbook.md: "Erster echter Lauf
steht noch aus" -> first run confirmed
- ops/restore-tests/komodo-bootstrap-plan.md: "Noch offen vor dem
ersten echten Lauf" section -> "Bestaetigte Laeufe" table with
the --what-if and --keep-data runs
- ops/restore-tests/immich-runbook.md: status note still said
"Erster echter Lauf steht noch aus" although the Immich first run
was 2026-05-27; correcting in the same sweep
- docs/AUDIT_2026-05-25_TODO.md: Sprint 2 entry on Komodo bootstrap
path no longer carries the "Trockenlauf-Skript bleibt als offene
Folgeaufgabe" tail
- docs/SERVICES_RECOVERY.md: replaced the "Trockenlauf-Idee (Doku-only,
nicht ausgefuehrt)" section with the confirmed repo-script flow and
marked the two "Naechste Aufgaben" rows about the dry-run as done
- docs/RESTORE_DRILL_ROUTINE.md: Q2 2026 DR-Sanity-Check entry now
splits Komodo-Bootstrap-Pfad (done) from the two still-open items
(Gitea bundles, secrets inventory)
No behavior change, only documentation consistency.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Result on host: SUCCESS, all 5 smoke checks green.
- docker compose config valid
- Test-Mongo healthy in ~6s
- Mongo authenticated ping ok (Test-Creds)
- Komodo Core HTTP 200 on 127.0.0.1:19120
- Test-Periphery container state running
Production komodo-{mongo,core,periphery} and /mnt/user/appdata/komodo/
were not touched; test ran in isolated project restoretest-komodo with
disposable datadir under /mnt/user/backups/restore-lab/komodo/.
Report at /mnt/user/backups/restore-reports/komodo-bootstrap-2026-05-30.md.
Operator-click pattern preserved: SSH to root@kallilabcore is an action
class that requires explicit instruction per CLAUDE.md; the auto-mode
classifier correctly blocked a non-destructive SSH probe. Operator ran
the command via the Unraid web terminal.
ops/komodo/docker-compose.yml is now demonstrably viable as the recovery
anchor for the bootstrap stages in docs/SERVICES_RECOVERY.md, not just
assumed viable. Image digests (mongo:7.0.32, komodo-core:2,
komodo-periphery:2) and Mongo auth schema verified.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New services/authelia-diff.sh compares the access_control: section of the
repo baseline against the live host configuration.yml. OIDC clients,
identity providers, and secret values stay out of scope by design.
Exit codes: 0 ok, 1 drift, 2 file missing, 3 section missing, 4 tool missing.
posture-check.sh gains check_authelia_config_drift, which calls the diff
script and reports drift as warning (not critical). SKIP_AUTHELIA_DRIFT=1
opts out; AUTHELIA_DIFF_SCRIPT overrides the path.
WORKFLOW.md gets a dedicated "Ausnahme: Authelia configuration.yml" section
analogous to the Traefik dynamic-config exception, with the mandatory
repo->host merge workflow and the env-variable contract.
Smoke-tested locally: identical files rc=0, ACL change rc=1 with proper
unified diff, non-ACL change (session.default_redirection_url) correctly
ignored.
Operator follow-up: set up a read-only repo mirror at
/mnt/user/services/homelab-infra/ so the check finds a current baseline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Setup-Pfad final geworden, vier Reparaturen unterwegs:
1. EAI_AGAIN: Container kann git.kaleschke.info nicht aufloesen ->
--add-host (analog zur Komodo-extra_hosts)
2. Token-Sichtbarkeit in ps/inspect -> --env-file mit 0600 tempfile
3. EACCES auf State-Mount: Renovate-Image laeuft als uid 12021 ->
chmod 0777 auf /mnt/user/services/renovate/state
4. "Repository does not permit pull or push": Renovate-Source-
Code (lib/modules/platform/gitea/index.ts) prueft hardcoded
repo.permissions.push aus der Gitea-API. Mein initialer
SQL-INSERT in die collaboration-Tabelle hatte den Gitea-
In-Memory-Permission-Cache nicht aktualisiert; Operator-
UI-Klick "Entfernen + neu hinzufuegen" loeste den Cache-
Refresh.
Konfigurations-Trennung:
- renovate.json (Repo): nur Repo-Settings (extends, packageRules,
ignorePaths, manager file patterns, labels)
- ops/renovate/bot-config.js: Bot-Settings (platform, endpoint,
autodiscover=false, repositories=[Micha/homelab-infra],
Concurrent-Limits)
Bot-Felder in renovate.json fuehren zu "Repository is forbidden,
status: disabled" weil Renovate die Repo-Config nicht als Bot-
Config wertet.
Erstlauf am 2026-05-29: 5 PRs, 1 Dependency-Dashboard, 8 Branches.
Komodo-Major bleibt durch packageRule deaktiviert wie erwartet.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Gitea's /api/v1/user/repos (which Renovate calls during autodiscover)
returns repos where the user is owner or org member, but NOT
collaborator-only repos. Our renovate service account has write
collaborator access on Micha/homelab-infra but no own/org repos,
so autodiscover yielded an empty list.
Switching to explicit "repositories": ["Micha/homelab-infra"] is
the pragmatic fix for a homelab with one repo to scan; avoids
having to create an org just for one service account.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Renovate 41 migriert schedule:weekly auf einen 5am-Monday-only
Lauf - das verhindert beim manuellen Erstlauf jede PR. Wir wollen
dass Renovate bei jedem User-Script-Tick (alle 6h) tatsaechlich
scannt; die Quartals-/Wochen-Rhythmik regeln wir ueber den Cron.
Auch docker-compose.fileMatch ist in Renovate 41 deprecated;
Renovate migriert es zur Laufzeit auf managerFilePatterns mit
regex-Slash-Wrapping. Wir uebernehmen die migrierte Form direkt,
damit die WARN "Config needs migrating" verschwindet.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drei Issues beim Erstlauf gefunden und gefixt:
1. EAI_AGAIN: Renovate-Container konnte git.kaleschke.info nicht
aufloesen. Analog zu Komodos extra_hosts mappen wir den Hostname
per --add-host auf 192.168.178.58 (LAN-IP des Unraid-Hosts).
Zusaetzlich --dns 1.1.1.1/8.8.8.8 fuer externe Image-Registries.
2. Token-Leak in ps und docker inspect: -e RENOVATE_TOKEN=... macht
den Wert in Process-Listing sichtbar. Stattdessen --env-file mit
einem 0600 tempfile unter $RENOVATE_STATE_DIR/.env, das nach dem
Lauf via shred bzw. rm geloescht wird.
3. Doppelter rc=$? Block plus return innerhalb einer {}-Subshell
waren Tot-Code; aufgeraeumt.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
renovate.json: gitea platform, autodiscover Micha/*, group rules
(major separate, minor+patch+digest grouped, stateful tier-1
individual, komodo-major disabled), pin range strategy, no
automerge, dependency dashboard enabled.
ops/renovate/run-renovate.sh: one-shot docker run wrapper that
reads the Gitea PAT from /mnt/user/appdata/secrets/renovate_token.txt,
runs renovate/renovate:41, logs into /mnt/user/services/renovate/logs/.
docs/RENOVATE.md: 5-step operator setup (Gitea service account,
PAT, token file, first run, six-hourly user script). Explicit
no-automerge stance with notfall-stop checklist.
Cross-doc sweep: SECRETS_MAP entry for renovate_token.txt,
REPO_MAP entry for RENOVATE.md, AUDIT_2026-05-25_TODO new
Sprint 8 with F-15, F-07, F-09 rest, F-12 status, MIGRATION_LOG
captures the four-block sprint in one entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirror of the Immich restore-test pattern for the Komodo bootstrap
anchor. Brings up a throwaway komodo-mongo + komodo-core +
komodo-periphery under project restoretest-komodo, isolated from
production:
- same image digests as production (mongo:7.0.32, komodo-core:2,
komodo-periphery:2) to prove compose-level bootstrap compatibility
- restore-lab paths under /mnt/user/backups/restore-lab/komodo
- 127.0.0.1:19120 only, no LAN bind, no Traefik, no Authelia
- test periphery runs WITHOUT docker.sock mount and WITHOUT
/mnt/user/services mount; cannot manage productive containers
- KOMODO_* secrets are throwaway placeholders hardcoded in the test
compose; productive secrets never enter this path
Smoke test: compose config valid, mongo healthy, mongo auth-ping
with test creds, komodo-core HTTP 200/302/303/401, periphery
container running. Report under restore-reports/komodo-bootstrap-*.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reads live RepoDigests of each running monitoring container and
freezes the compose to the exact image manifest. Brings the
monitoring stack to the same digest-pin discipline as the
stateful tier-1 services. influxdb3-core was already pinned.
Affected: prometheus, alertmanager, alertmanager-ntfy-bridge,
blackbox-exporter, loki, promtail, grafana, node-exporter,
cadvisor (plus a second python:3.13-alpine for the bootstrap
dashboard importer).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Vaultwarden image ships curl, not wget. Switched the CMD-SHELL
test from wget --spider to curl -fsS.
Authelia 4.39.x removed the "helper health-check" subcommand;
use the /api/health endpoint via wget instead (verified inside
the running container).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Authelia ships its own health-check binary subcommand since 4.37+.
Avoids needing wget/curl in the container.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Enable --ping=true and use traefik healthcheck --ping. Lightweight
binary call inside the container, no extra tooling needed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Gitea exposes /api/healthz unauthenticated. 60s start_period
because Gitea sqlite migration on cold start can take a while.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tier-1 health visibility for the shared Redis. Uses redis-cli with
the password from the mounted secret, fails on anything but PONG.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tier-1 health visibility for shared Postgres cluster. pg_isready
against the admin DB; 30s interval, 30s start_period.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Befund vom 2026-05-29: HomelabBorgLastJobCompletedWithWarnings
zuendete vier Tage in Folge mit Borg-Exit-Code 107. Ursache im
Logfile: /local/appdata/homepage wurde am 25.05. entfernt, aber
in der Borg-UI-Source-Liste blieb der Eintrag drin und Borg
warnte taeglich BackupFileNotFoundError. Backups selbst waren
nicht gefaehrdet (alle 23 anderen Quellen sauber archiviert).
Operator hat den Eintrag in der Borg-UI manuell entfernt;
Source-Liste jetzt 23 statt 24, naechster Lauf 2026-05-30 sollte
wieder completed ohne Warning sein.
Erkenntnis: bei Stack-Removal wurde die Borg-Source-Liste nicht
mit-aufgeraeumt. WORKFLOW.md um neuen Abschnitt "Service-Removal-
Checkliste" erweitert mit 9 Pflichtschritten inklusive
Borg-UI-Source-Bereinigung als Schritt 8.
Positiv: die am 2026-05-27 scharfgeschaltete Alert-Pipeline
(Cron Textfile -> node-exporter -> Prometheus -> Alertmanager
-> ntfy-Bridge) hat den Drift binnen 24 h sichtbar gemacht.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Windows Scheduled Task "KalliLab H Drive Nearline Pull" auf dem
Operator-Windows-PC registriert: taeglich 05:30 nach dem Borg-
Dump-Fenster. RunLevel Limited, StartWhenAvailable, Akku-OK,
Execution-Time-Limit 2h. Naechster Lauf 2026-05-29 05:30.
Repo-Snippet in H_DRIVE_NEARLINE_PULL.md korrigiert: PowerShell-
Enum-Wert ist Limited, nicht LeastPrivilege (alter Snippet haette
beim ersten Register-ScheduledTask einen Parameter-Binding-Fehler
geworfen). Status auf "produktiv" gesetzt.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Operator-Entscheidungen 2026-05-28:
- F-03 zweites Off-site: bewusst NICHT umgesetzt. 3-2-1 ist mit
Live + lokalem Borg + Hetzner + H:/-Nearline erfuellt; ein
zweites Off-site deckt nur den Fall "Hetzner-Account verloren"
ab, Aufwand unverhaeltnismaessig fuer Familien-Homelab.
Stattdessen drei Folge-TODOs zur Haertung der bestehenden
Topologie. Hetzner-2FA bewusst ohne (Operator-Praeferenz,
analog USV-Risiko-Akzeptanz), durch starkes Passwort +
Backup-Zahlungsweg + Login-Mails ersetzt. Borg-Append-Only-
Befund: Repo laeuft im Mode 'full', custom_flags leer; Setup
waere server-seitig in Hetzner-authorized_keys (Folge-Sprint).
Review-Trigger in OFFSITE_BACKUP_OPTIONS.md dokumentiert.
- paperless-gpt: behalten bis Paperless-NGX 3.0 (erwartete
native KI-Features). Aktuell 0 Traefik-Zugriffe in 7 Tagen,
Resource-Footprint 34 MB RAM.
- BentoPDF: behalten als situatives Tool. 0 Traefik-Zugriffe,
4 MB RAM. Begruendungs-Anker im SERVICE_CATALOG.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>