Skip to main content

Alerting

AlertManager routes alerts from Prometheus to Gotify for push notifications, with inhibition rules to suppress noise and silences for known false-positives.

Alert Flow

AlertManager Configuration

Routing

Alerts are grouped by namespace with escalation-based routing:

route:
group_by: ['namespace']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'null'
routes:
# Critical and warning alerts → Gotify push notifications
- matchers:
- severity = "critical"
receiver: 'gotify'
- matchers:
- severity = "warning"
receiver: 'gotify'

Silenced Alerts

These alerts are routed to the null receiver to avoid noise:

AlertReason
WatchdogAlways-firing health check, not actionable
KubeDaemonSetMisScheduled (gpu-operator)False-positive from GPU operator DaemonSet
KubeDaemonSetRolloutStuck (gpu-operator)False-positive from GPU operator DaemonSet
KubeControllerManagerDownK3s bundles control plane into the binary
KubeSchedulerDownK3s bundles control plane into the binary
KubeProxyDownK3s bundles control plane into the binary

Inhibition Rules

Higher-severity alerts suppress lower-severity duplicates within the same namespace:

  • Critical inhibits warning and info
  • Warning inhibits info
  • InfoInhibitor suppresses info-level alerts

Gotify Bridge

The alertmanager_gotify_bridge (v2.3.2) converts AlertManager webhooks to Gotify API calls:

gotifyBridge:
enabled: true
image:
repository: ghcr.io/druggeri/alertmanager_gotify_bridge
tag: "2.3.2"
gotifyEndpoint: "http://gotify.monitoring.svc.cluster.local"
defaultPriority: "5"

The bridge has its own ServiceMonitor for self-monitoring metrics.

Alert Rules

Velero Backup Alerts

Five PrometheusRule alerts monitor backup health:

AlertSeverityCondition
VeleroBackupStorageLocationUnavailablecriticalStorage location unavailable for 5m
VeleroBackupFailedcriticalAny backup failure in 24h
VeleroNoRecentBackupwarningNo successful backup in 24h
VeleroBackupPartiallyFailedwarningBackup completed with errors in 24h
VeleroRestoreFailedcriticalAny restore failure in 1h

Example rule:

- alert: VeleroBackupStorageLocationUnavailable
expr: velero_backup_storage_location_status{status="Unavailable"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Velero BackupStorageLocation unavailable"
description: "BackupStorageLocation {{ $labels.name }} has been unavailable for 5 minutes."
priority: "8"

kube-prometheus-stack Default Rules

The Helm chart includes built-in PrometheusRules covering:

  • KubernetesSystem — API server, etcd, kubelet health
  • Node — CPU, memory, disk, network anomalies
  • Pod — CrashLoopBackOff, OOMKilled, pending pods
  • Deployment — Replica mismatch, rollout stuck
  • PersistentVolume — Capacity warnings

AlertManager Storage

AlertManager persists its silence and notification state:

alertmanagerSpec:
storage:
volumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 2Gi
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 256Mi