ce747f687f
Fixes three findings from the operations report of 2026-06-10:
- daily-status-report.sh: Certificates are deduplicated per domain set before evaluation; only the certificate with the longest remaining validity counts. During renewal, Traefik keeps both the old and the new certificate in acme.json, which previously triggered a false KRITISCH warning (traefik.kaleschke.info, 5 days) even though the new certificate has 65 days of validity left.
- monitoring/blackbox-exporter: DNS switched from 1.1.1.1/8.8.8.8 to AdGuard (172.23.0.3 via dns_net). External resolvers returned the WAN IP, which caused hairpin-NAT timeouts (9.5 s) for probes of cloud/glances (662 errors/day).
- log-noise.patterns: Fritz!Box SOA errors (AdGuard, RFC 1035 violation) and the missing grafana-amazonprometheus-datasource plugin are classified as known noise (~1800 lines/day).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
90 lines
4.7 KiB
Plaintext
# log-noise.patterns - Daily Operations Report
#
# Format:
# - One Extended Regex (ERE) per non-comment line.
# - Lines starting with '#' (after optional whitespace) are comments.
# - Empty / whitespace-only lines are ignored.
# - Patterns are applied case-insensitively (grep -Eaif).
# - The file is normalized via lib/normalize-noise-patterns.sh before use.
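#
# Illustrative sketch only (an assumption, not taken from the report script):
# if the normalized copy of this file were stored at the hypothetical path
# /tmp/log-noise.normalized and the collected log lines were in a placeholder
# file combined.log, suppressing known noise with the flags named above plus
# -v (invert match) could look like this:
#
#   grep -Eaivf /tmp/log-noise.normalized combined.log
#
# -E selects ERE, -a treats binary input as text, -i ignores case, and
# -f reads one pattern per line from the given file.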
#
# Per pattern, document:
# - Why this is noise (root cause, not just "expected").
# - When to re-check / what would invalidate the assumption.
#
# Adding a new pattern: prefer the narrowest container.* prefix and the
# narrowest message anchor. A pattern that matches across containers or
# matches generic error strings will hide real signal.
#
# Removing a pattern: replace with a fresh attention example in the next
# daily report and consult before reintroducing.
#
# Last reviewed: 2026-06-10

# Loki internal query cancellations / scheduler chatter.
# Why: Loki cancels internal queries continuously when downstream Promtails
# or Grafana panels drop connections; no user-visible outage by itself.
# Re-check: if Grafana dashboards show real Loki query failures or if
# Prometheus alerts fire on Loki ingestion / availability.
monitoring-loki.*(context canceled|error notifying scheduler|closing iterator)

# node-exporter parsing /host/proc/mdstat on Unraid.
# Why: Unraid uses its own array driver, not Linux mdadm, so /proc/mdstat
# layout is unparsable for node-exporter. Pure collector noise.
# Re-check: only if migrating to mdadm-based RAID. Then remove this entry
# and act on real mdadm errors.
monitoring-node-exporter.*mdadm.*Cannot parse /host/proc/mdstat

# Gitea OpenID login attempts return 403.
# Why: OpenID provider is intentionally disabled in Gitea config; 403 is
# the expected response for stale OAuth callback URLs.
# Re-check: when OpenID/OIDC gets enabled again. Remove this and treat
# the 403 as a real auth failure signal.
gitea.*user/login/openid.*403 Forbidden

# Tailscale PCP port mapping failure (NAT-PMP unsupported by router).
# Why: Tailscale falls back to STUN/DERP transparently; no functional impact.
# Re-check: if Tailscale reports persistent connectivity problems in real
# usage, or if a router change adds NAT-PMP support.
Tailscale-Docker.*failed to get PCP mapping

# Immich version check failed to reach GitHub releases API.
# Why: External GitHub release check; transient failures do not affect
# Immich core functionality.
# Re-check: if Immich UI persistently warns about being outdated or if
# security updates are missed because of this.
immich_server.*Failed to fetch latest release

# Authelia 408 client-side request timeouts.
# Why: Clients (browsers, Vaultwarden-CLI etc.) drop slow connections;
# without correlated login failures or 5xx, individual 408s are normal.
# Re-check: if the 408 rate spikes (>5/min sustained) or if login flows
# start failing. Then narrow this pattern instead of removing it.
authelia.*Request timeout occurred.*status_code=408

# Vaultwarden expired sessions and invalid refresh tokens (auth/session class).
# Why: Normal session expiry; clients retry and re-login transparently.
# Re-check: if many distinct external IPs trigger 401s in a short window
# (possible brute-force or credential-stuffing pattern).
#
# NOTE: DNS / Connect / Resolve / reqwest / hyper-client errors are
# intentionally NOT suppressed here. They are real network signals
# and should be visible in the attention list. If push-notification
# noise becomes overwhelming, add a *narrow* pattern restricted to
# push contexts only (e.g. `vaultwarden.*push.*(ResolveError|...)`).
vaultwarden.*(Token has expired|Invalid refresh token|Failed to decode.*refresh_token|POST /identity/connect/token => 401 Unauthorized)

# AdGuard: Fritz!Box sends malformed SOA queries for myfritz.net / myfritz.link.
# Why: AVM Fritz!Box devices send DNS SOA queries with more than one question.
# AdGuard rejects them as an RFC 1035 violation and logs "only 1 question
# allowed"; the rejection has no operational impact.
# Re-check: if the same error appears for non-AVM domains, or if the rate
# spikes well above 1000/day without a Fritz!Box reboot explaining it.
adguard.*bad question section.*only 1 question allowed

# Grafana: usage-stats collector looks for the Amazon Prometheus plugin, which
# is not installed in this setup. The error is emitted once per stats cycle.
# Why: GF_PLUGINS_PREINSTALL_DISABLED=true keeps the plugin list minimal;
# this lookup is harmless and does not affect any dashboard.
# Re-check: only if Amazon Prometheus is added as a datasource.
monitoring-grafana.*grafana-amazonprometheus-datasource not found