From Metabase to Prometheus: Cluster Monitoring Done Right
I initially tried using Metabase for cluster analytics. It didn't work out. Today I ripped it out and replaced it with a proper Prometheus + Grafana stack.
The Metabase Experiment
My first attempt at cluster visibility was embedding Metabase dashboards into my portfolio orchestration dashboard. The idea was to have business analytics alongside infrastructure metrics.
// What I tried - Metabase React SDK
// metabaseConfig holds the Metabase instance URL and embedding auth settings
import { MetabaseProvider, StaticQuestion } from "@metabase/embedding-sdk-react";

const ClusterAnalytics = () => (
  <MetabaseProvider config={metabaseConfig}>
    <StaticQuestion questionId={42} />
  </MetabaseProvider>
);

Problems quickly emerged:
- Authentication complexity - Metabase's embedding auth didn't play well with my JWT setup
- Resource hungry - Metabase needs a database backend and significant memory
- Wrong tool - I needed infrastructure metrics, not business intelligence
After two days of wrestling with SDK authentication issues, I made the call to switch.
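For the curious: the SDK doesn't reuse your app's session. It expects the backend to expose an auth endpoint that signs a Metabase-specific JWT with the shared secret configured in Metabase's admin settings. Roughly the shape of that endpoint, as a sketch (Express plus the jsonwebtoken package; authenticateUser and the req.user fields stand in for my existing auth layer):

// Hypothetical Metabase-JWT endpoint for the embedding SDK (sketch, not my final code)
import jwt from "jsonwebtoken";

app.get("/sso/metabase", authenticateUser, (req, res) => {
  const token = jwt.sign(
    {
      email: req.user.email,
      first_name: req.user.firstName,
      last_name: req.user.lastName,
      exp: Math.round(Date.now() / 1000) + 10 * 60, // short-lived token
    },
    process.env.METABASE_JWT_SHARED_SECRET // shared secret from Metabase's JWT auth settings
  );
  res.json({ jwt: token }); // the SDK fetches this endpoint and expects the signed token back as JSON
});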
Enter Prometheus
Prometheus is purpose-built for infrastructure monitoring. The kube-prometheus-stack Helm chart bundles everything:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring --create-namespace \
  -f prometheus-values.yaml
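The chart exposes a lot of knobs, so before writing overrides it helps to dump the full default values and see exactly what you're changing:

helm show values prometheus-community/kube-prometheus-stack > default-values.yaml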
Initial Values
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        cpu: 200m
        memory: 1Gi
      limits:
        cpu: 1000m
        memory: 2Gi

grafana:
  enabled: true
  adminPassword: ${GRAFANA_ADMIN_PASSWORD}
  ingress:
    enabled: true
    hosts:
      - grafana.el-jefe.me

alertmanager:
  enabled: true
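One gotcha with the values file above: Helm doesn't expand environment variables, so the ${GRAFANA_ADMIN_PASSWORD} placeholder has to be filled in before the chart sees it. Two ways to do that (a sketch; pick whichever fits your workflow):

# Option 1: substitute env vars, then feed the result to Helm on stdin
envsubst < prometheus-values.yaml | helm install prometheus \
  prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f -

# Option 2: keep the password out of the file entirely and pass it as a flag
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace -f prometheus-values.yaml \
  --set grafana.adminPassword="$GRAFANA_ADMIN_PASSWORD"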
The OOMKilled Incident
Within hours of deploying, Prometheus crashed:
$ kubectl get pods -n monitoring
NAME                                      READY   STATUS      RESTARTS
prometheus-prometheus-kube-prometheus-0   0/2     OOMKilled   3

Checking the events:
$ kubectl describe pod prometheus-prometheus-kube-prometheus-0 -n monitoring
...
    Last State:   Terminated
      Reason:     OOMKilled
      Exit Code:  137

Root cause: With 6 applications, multiple databases, and system components, Prometheus was scraping hundreds of metrics. The default 2Gi memory limit wasn't enough.
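If you want to size the problem rather than guess, two quick checks help (kubectl top needs metrics-server; the second line is a PromQL query to run against Prometheus itself):

# Actual memory consumption of the Prometheus pod
kubectl top pod prometheus-prometheus-kube-prometheus-0 -n monitoring

# Active time series in the TSDB head block - the main driver of memory use
prometheus_tsdb_head_series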
The Fix
prometheus:
  prometheusSpec:
    resources:
      limits:
        memory: 4Gi        # Doubled from 2Gi
      requests:
        memory: 2Gi
    # Also tuned scrape intervals
    scrapeInterval: 30s    # Default was 15s
    evaluationInterval: 30s
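Rolling the change out is a helm upgrade against the same release and values file:

helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring -f prometheus-values.yaml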
After applying the new limits:

$ kubectl get pods -n monitoring
NAME                                      READY   STATUS    RESTARTS
prometheus-prometheus-kube-prometheus-0   2/2     Running   0

Custom Metrics in the Dashboard
With Prometheus running, I updated my portfolio dashboard to query it directly:
// API endpoint for Prometheus queries (calls Prometheus' HTTP API directly;
// PROMETHEUS_URL points at the in-cluster service, e.g. http://prometheus-operated.monitoring:9090)
app.get('/api/metrics/pods', async (req, res) => {
  const query = 'sum(kube_pod_status_phase{phase="Running"}) by (namespace)';
  const url = `${process.env.PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`;
  const result = await (await fetch(url)).json();
  res.json(result);
});

The dashboard now shows (sample PromQL for these panels follows the list):
- Pod counts by namespace
- CPU/Memory utilization
- Network I/O
- Request latency percentiles
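The pod-count query above backs the first panel; the utilization and network panels use queries along these lines (assuming the standard cAdvisor metric names that the kubelet exposes and kube-prometheus-stack scrapes out of the box):

# CPU usage by pod (cores)
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)

# Memory working set by pod
sum(container_memory_working_set_bytes{namespace="default"}) by (pod)

# Network receive throughput by pod (bytes/s)
sum(rate(container_network_receive_bytes_total{namespace="default"}[5m])) by (pod)

Latency percentiles come from the histogram_quantile query shown in the next section.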
Grafana Dashboards
Grafana came pre-configured with excellent dashboards:
- Kubernetes / Compute Resources / Cluster - Overall cluster health
- Kubernetes / Compute Resources / Namespace (Pods) - Per-namespace breakdown
- Node Exporter Full - Host-level metrics
I added custom dashboards for my applications using PromQL:
# Request rate per application
sum(rate(http_requests_total{namespace="default"}[5m])) by (app)
# 95th percentile latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, app)
)
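To keep those custom dashboards in Git rather than hand-editing them in the UI, the chart's Grafana sidecar can load dashboard JSON from labeled ConfigMaps. A minimal sketch, assuming the sidecar's default grafana_dashboard label:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the sidecar imports any ConfigMap carrying this label
data:
  # the value is the dashboard JSON exported from Grafana
  app-overview.json: |
    { "title": "App Overview", "panels": [] }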
Lessons Learned
- Right tool for the job - Metabase is great for business analytics, not infrastructure monitoring
- Memory matters - Always check resource requirements and monitor for OOMKilled
- Start with defaults, then tune - kube-prometheus-stack's defaults are good, but every cluster is different
- Scrape interval tradeoffs - Shorter intervals = more data = more memory
Current Stack
| Component | Purpose |
|---|---|
| Prometheus | Metrics collection and storage |
| Grafana | Visualization and dashboards |
| Alertmanager | Alert routing and notifications |
| Node Exporter | Host-level metrics |
| kube-state-metrics | Kubernetes object metrics |
The monitoring namespace now gives me full visibility into cluster health. Next up: adding GPU metrics once I add the GPU node.
Update: This stack has since grown significantly. I added Grafana Mimir for long-term metric storage backed by S3, Loki for log aggregation, and Alloy as a unified collector — all deployed as an umbrella Helm chart. The Mimir rollout came with a fun debugging session involving a Kafka message size mismatch that crash-looped the distributor. Full write-up: Building a Long-Term Metrics Stack with Mimir.
Documenting the evolution of my homelab infrastructure.
