
Prometheus Monitoring

Prometheus, deployed via the kube-prometheus-stack Helm chart (v65.8.1), collects metrics from all applications and cluster components. Metrics are retained locally for 15 days and written to Mimir for long-term storage (90 days).

Monitoring Stack

Prometheus Configuration

Prometheus runs as a single replica pinned to the VPS node, with local-path storage:

prometheus:
  prometheusSpec:
    replicas: 1
    retention: 15d
    nodeSelector:
      kubernetes.io/hostname: vmi2951245
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: local-path
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 4Gi

Remote Write to Mimir

All metrics are forwarded to Mimir for long-term retention beyond the 15-day local window:

remoteWrite:
  - url: http://prometheus-mimir-gateway.monitoring.svc.cluster.local/api/v1/push
    name: mimir
    remoteTimeout: 30s
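
On the Mimir side, the 90-day window is enforced by Mimir's compactor rather than by Prometheus. The snippet below is a minimal sketch of how that limit is typically set in a mimir-distributed Helm values file; the key placement is an assumption based on Mimir's standard configuration, not copied from this deployment:

# Illustrative only: retention is assumed to be set via
# limits.compactor_blocks_retention_period in the chart's structuredConfig.
mimir:
  structuredConfig:
    limits:
      # Blocks older than this are deleted by the compactor
      # (matches the 90-day long-term window described above).
      compactor_blocks_retention_period: 90d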

ServiceMonitor Pattern

Applications are scraped via ServiceMonitor CRDs. The monitoring chart defines several:

Portfolio Applications (label-based discovery)

Any service with the prometheus-scrape: "true" label is automatically discovered:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: portfolio-applications
spec:
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      prometheus-scrape: "true"
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
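
Opting an application into scraping is just a matter of labelling its Service and exposing a port named http. A minimal sketch, with the application name and port number made up for illustration:

apiVersion: v1
kind: Service
metadata:
  name: example-app            # hypothetical application
  labels:
    prometheus-scrape: "true"  # picked up by the selector above
spec:
  selector:
    app: example-app
  ports:
    - name: http               # must match the endpoint port name in the ServiceMonitor
      port: 8080
      targetPort: 8080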

Triton Inference Servers

Triton endpoints are scraped at a higher frequency (15s) for inference monitoring:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: triton-amd
spec:
  endpoints:
    - interval: 15s
      path: /metrics
      port: metrics
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: triton-amd

Three Triton ServiceMonitors exist: triton-amd (CPU inference on VPS), triton-embeddings (VPS), and triton-gpu (local GPU node).

All ServiceMonitors

| ServiceMonitor             | Target                                     | Interval | Namespace  |
|----------------------------|--------------------------------------------|----------|------------|
| devops-portfolio-api       | DevOps Portfolio API                       | 30s      | default    |
| devops-portfolio-dashboard | DevOps Portfolio Dashboard                 | 30s      | default    |
| portfolio-applications     | Any service with prometheus-scrape: "true" | 30s      | any        |
| triton-amd                 | Triton CPU inference                       | 15s      | default    |
| triton-embeddings          | Triton embeddings                          | 15s      | default    |
| triton-gpu                 | Triton GPU inference                       | 15s      | default    |
| gotify-bridge              | AlertManager-Gotify bridge                 | 30s      | monitoring |
| minio                      | MinIO object storage                       |          | monitoring |

Key Metrics

| Metric Category | Examples                                                         |
|-----------------|------------------------------------------------------------------|
| HTTP            | Request rate, latency percentiles, error rate                    |
| Node            | CPU usage, memory pressure, disk I/O                             |
| Pod             | Restart count, resource utilization vs limits                    |
| HPA             | Current vs desired replicas, scaling events                      |
| Triton          | Inference request count, queue time, compute time                |
| Backup          | Velero backup success/failure counts, last successful timestamp  |
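
These categories map onto alerting rules in the usual way. The PrometheusRule below is a sketch only: the rule names, thresholds, and metric names are assumptions based on standard kube-state-metrics and Velero metrics, not rules taken from the monitoring chart:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-portfolio-rules   # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: example.rules
      rules:
        # Pod category: alert on containers restarting repeatedly.
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 10m
          labels:
            severity: warning
        # Backup category: alert when a Velero backup fails.
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[24h]) > 0
          labels:
            severity: critical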

Ingress

Prometheus is exposed externally via Traefik with TLS:

annotations:
  traefik.ingress.kubernetes.io/router.entrypoints: websecure
  traefik.ingress.kubernetes.io/router.tls: "true"
  cert-manager.io/cluster-issuer: letsencrypt-prod-dns
rules:
  - host: prometheus.el-jefe.me
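
The fragments above appear to come from an Ingress definition; a complete sketch of how such a resource might look, with the resource name, backend Service, and TLS secret name assumed rather than taken from the manifests:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus                # hypothetical name
  namespace: monitoring
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod-dns
spec:
  rules:
    - host: prometheus.el-jefe.me
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-operated   # assumed; actual Service name depends on the Helm release
                port:
                  number: 9090
  tls:
    - hosts:
        - prometheus.el-jefe.me
      secretName: prometheus-tls            # assumed secret name, issued by cert-manager
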
Live Metrics

The Cluster Dashboard displays live Prometheus metrics, including node count, pod count, and CPU/memory utilization, sourced from the devops-portfolio-manager API.