Add daily operations report with hardened log-noise filtering

Brings the previously untracked daily-status-report.sh and
send-operations-report-mail.sh into the repo, plus a refactor of the
log-noise pipeline:

- New helper services/posture-check/lib/normalize-noise-patterns.sh
  strips comments, empty lines and trailing whitespace from
  log-noise.patterns before grep -f sees it. An empty line in the
  pattern file is an empty pattern that matches every line, so
  grep -Eaif would otherwise treat every hit as noise and silently
  wipe the log highlights.
- log-noise.patterns is now documented per-pattern (Why / Re-check).
  The Vaultwarden pattern is split: token/session noise stays as
  noise; DNS/Connect/Resolve/reqwest/hyper errors are removed from
  the noise set so real network signals stay visible.
- collect_log_highlights now reports a per-container and per-pattern
  noise breakdown (Top N) and an escalation flag when any pattern
  exceeds NOISE_ESCALATION_THRESHOLD (default 500). The flag is fed
  into derive_report_status and the management summary.
- New shell tests under services/posture-check/tests/ verify that the
  normalize helper handles comments, empty lines, and whitespace-only
  lines, and that unknown error lines remain in the attention set.
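
The normalization step described above can be sketched as follows. This
is a minimal illustration, not the contents of
lib/normalize-noise-patterns.sh: the function name
normalize_noise_patterns and the single-sed approach are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real helper may be structured differently.

# normalize_noise_patterns FILE
# Prints FILE without comment lines, empty or whitespace-only lines,
# and trailing whitespace, so grep -f never receives an empty pattern.
normalize_noise_patterns() {
  # 1) strip trailing whitespace (whitespace-only lines become empty)
  # 2) drop comment lines, allowing leading whitespace before '#'
  # 3) drop the remaining empty lines
  sed -E \
    -e 's/[[:space:]]+$//' \
    -e '/^[[:space:]]*#/d' \
    -e '/^$/d' \
    "$1"
}
```

In bash, one possible use is
grep -Eaivf <(normalize_noise_patterns log-noise.patterns) hits.log,
where hits.log is a placeholder input and -v (drop matching lines) is
an assumption about how the report filters noise.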

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 10:41:33 +02:00
parent b7cbbe51de
commit 9e7bebbd3c
5 changed files with 2026 additions and 0 deletions
@@ -0,0 +1,81 @@
# log-noise.patterns - Daily Operations Report
#
# Format:
# - One Extended Regex (ERE) per non-comment line.
# - Lines starting with '#' (after optional whitespace) are comments.
# - Empty / whitespace-only lines are ignored.
# - Patterns are applied case-insensitively (grep -Eaif).
# - The file is normalized via lib/normalize-noise-patterns.sh before use.
#
# Per pattern, document:
# - Why this is noise (root cause, not just "expected").
# - When to re-check / what would invalidate the assumption.
#
# Adding a new pattern: prefer the narrowest container.* prefix and the
# narrowest message anchor. A pattern that matches across containers or
# matches generic error strings will hide real signal.
#
# Removing a pattern: confirm that the previously suppressed lines show up
# as attention items in the next daily report, and consult before
# reintroducing the pattern.
#
# Last reviewed: 2026-05-21
# Loki internal query cancellations / scheduler chatter.
# Why: Loki cancels internal queries continuously when downstream Promtails
# or Grafana panels drop connections; no user-visible outage by itself.
# Re-check: if Grafana dashboards show real Loki query failures or if
# Prometheus alerts fire on Loki ingestion / availability.
monitoring-loki.*(context canceled|error notifying scheduler|closing iterator)
# node-exporter parsing /host/proc/mdstat on Unraid.
# Why: Unraid uses its own array driver, not Linux mdadm, so node-exporter
# cannot parse its /proc/mdstat layout. Pure collector noise.
# Re-check: only if migrating to mdadm-based RAID. Then remove this entry
# and act on real mdadm errors.
monitoring-node-exporter.*mdadm.*Cannot parse /host/proc/mdstat
# Gitea OpenID login attempts return 403.
# Why: OpenID provider is intentionally disabled in Gitea config; 403 is
# the expected response for stale OAuth callback URLs.
# Re-check: when OpenID/OIDC gets enabled again. Remove this and treat
# the 403 as a real auth failure signal.
gitea.*user/login/openid.*403 Forbidden
# Uptime Kuma monitor for legacy domain grafana.kaleschke.info returning 404.
# Why: Monitor was not removed when the public Grafana endpoint was
# decommissioned.
# Re-check: at next Uptime-Kuma cleanup. Action: delete the obsolete monitor
# and remove this pattern.
uptime-kuma.*grafana\.kaleschke\.info.*status code 404
# Tailscale PCP port mapping failure (NAT-PMP unsupported by router).
# Why: Tailscale falls back to STUN/DERP transparently; no functional impact.
# Re-check: if Tailscale reports persistent connectivity problems in real
# usage, or if a router change adds NAT-PMP support.
Tailscale-Docker.*failed to get PCP mapping
# Immich version check failed to reach GitHub releases API.
# Why: External GitHub release check; transient failures do not affect
# Immich core functionality.
# Re-check: if Immich UI persistently warns about being outdated or if
# security updates are missed because of this.
immich_server.*Failed to fetch latest release
# Authelia 408 client-side request timeouts.
# Why: Clients (browsers, Vaultwarden-CLI etc.) drop slow connections;
# without correlated login failures or 5xx, individual 408s are normal.
# Re-check: if the 408 rate spikes (>5/min sustained) or if login problems
# are reported. In that case, narrow this pattern instead of removing it.
authelia.*Request timeout occurred.*status_code=408
# Vaultwarden expired sessions and invalid refresh tokens (auth/session class).
# Why: Normal session expiry; clients retry and re-login transparently.
# Re-check: if many distinct external IPs trigger 401s in a short window
# (possible brute-force or credential-stuffing pattern).
#
# NOTE: DNS / Connect / Resolve / reqwest / hyper-client errors are
# intentionally NOT suppressed here. They are real network signals
# and should be visible in the attention list. If push-notification
# noise becomes overwhelming, add a *narrow* pattern restricted to
# push contexts only (e.g. `vaultwarden.*push.*(ResolveError|...)`).
vaultwarden.*(Token has expired|Invalid refresh token|Failed to decode.*refresh_token|POST /identity/connect/token => 401 Unauthorized)
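
A per-pattern noise breakdown with an escalation flag, as described in the commit message, could look roughly like the sketch below. It is not the actual collect_log_highlights code: the function name noise_breakdown, its arguments, and the tab-separated output are assumptions made for illustration. Only NOISE_ESCALATION_THRESHOLD and its default of 500 come from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the actual collect_log_highlights code.

# noise_breakdown PATTERN_FILE LOG_FILE [THRESHOLD]
# Prints "<count><TAB><pattern>" for each pattern, highest count first.
# Returns 1 if any pattern's count exceeds THRESHOLD, otherwise 0.
noise_breakdown() {
  local patterns=$1 log=$2
  local threshold=${3:-${NOISE_ESCALATION_THRESHOLD:-500}}
  local comment_re='^[[:space:]]*#'
  local pattern count rows='' escalate=0

  while IFS= read -r pattern || [[ -n "$pattern" ]]; do
    # Trim trailing whitespace, then skip blank lines and comments so an
    # empty pattern (which would match every line) never reaches grep.
    pattern="${pattern%"${pattern##*[![:space:]]}"}"
    [[ -z "$pattern" || "$pattern" =~ $comment_re ]] && continue

    # grep -c prints 0 and exits non-zero when nothing matches.
    count=$(grep -Eaic -e "$pattern" "$log" || true)
    rows+="${count}"$'\t'"${pattern}"$'\n'
    if (( count > threshold )); then
      escalate=1
    fi
  done < "$patterns"

  printf '%s' "$rows" | sort -t $'\t' -k1,1nr
  return "$escalate"
}
```

Called as noise_breakdown log-noise.patterns hits.log, where hits.log is a placeholder, the return status could feed a status derivation step. Counting one pattern at a time costs one grep run per pattern, which seems acceptable for a short pattern file like this one.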