
Backup Strategy

The platform uses a multi-layer backup architecture: Velero for Kubernetes-native volume snapshots, database-specific CronJobs for logical dumps, DNS and secrets cold storage, and CloudCasa as a SaaS-managed secondary layer. All off-cluster backups target Backblaze B2.

Architecture

Backup Overview

All CronJobs

| CronJob | Schedule | Target | Retention | Image |
|---|---|---|---|---|
| Velero daily | 0 2 * * * (2:00 AM) | All PVs across 8 namespaces | 7 days | velero/velero:v1.14.0 |
| Velero weekly | 0 3 * * 0 (Sun 3:00 AM) | All PVs across 8 namespaces | 30 days | velero/velero:v1.14.0 |
| postgresql-backup (Code Talk) | 30 3 * * * (3:30 AM) | codetalk database | 30 days | postgres:14-alpine |
| postgresql-bookmarked-backup | 45 3 * * * (3:45 AM) | bookmarked database | 30 days | postgres:14-alpine |
| mongodb-backup (intervalai) | 0 3 * * * (3:00 AM) | intervalai database | 30 days | maxjeffwell/mongodb-backup |
| mongodb-backup-educationelly | 5 3 * * * (3:05 AM) | educationelly database | 30 days | maxjeffwell/mongodb-backup |
| mongodb-backup-educationelly-graphql | 10 3 * * * (3:10 AM) | educationelly-graphql database | 30 days | maxjeffwell/mongodb-backup |
| mongodb-backup-microservices | 15 3 * * * (3:15 AM) | microservices database | 30 days | maxjeffwell/mongodb-backup |
| mongodb-backup-vertex-platform | 20 3 * * * (3:20 AM) | vertex-platform database | 30 days | maxjeffwell/mongodb-backup |
| dns-zone-backup | 0 4 * * 0 (Sun 4:00 AM) | Cloudflare DNS records | 12 weeks | alpine:3.21 |
| doppler-secrets-backup | 0 5 * * 0 (Sun 5:00 AM) | All Doppler secrets | 12 weeks | alpine:3.21 |

All Restore Tests

| CronJob | Schedule | What It Tests |
|---|---|---|
| velero-backup-verify | 0 4 * * 0 (weekly Sun) | Velero BSL health, backup age, test restore of default namespace |
| postgresql-restore-test | 0 6 1-7 * 0 (1st Sun monthly) | Download + pg_restore of codetalk and bookmarked dumps |
| mongodb-restore-test | 30 6 1-7 * 0 (1st Sun monthly) | Download + mongorestore of all 5 MongoDB databases |
| dns-zone-restore-test | 0 7 1-7 * 0 (1st Sun monthly) | Download + extract tarball, validate JSON, compare against live Cloudflare |
| doppler-secrets-restore-test | 15 7 1-7 * 0 (1st Sun monthly) | Download + AES-256 decrypt, validate JSON, compare against live Doppler |
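
A caveat on the monthly schedules: under standard (Vixie) cron semantics, which Kubernetes CronJob schedules appear to follow, an expression that restricts both day-of-month and day-of-week fires when either field matches. An expression such as 0 6 1-7 * 0 would then run on every Sunday and on each of days 1 through 7, not only on the first Sunday. A common workaround is a guard at the top of the job script. The sketch below is hypothetical (the function name and the way it is wired into the job are assumptions, not taken from the actual CronJobs):

```shell
#!/bin/sh
# Hypothetical guard for the top of a restore-test script.
# is_first_sunday DOW DOM
#   DOW: ISO weekday as printed by `date +%u` (1 = Monday ... 7 = Sunday)
#   DOM: day of month as printed by `date +%d` (01-31)
is_first_sunday() {
    dow=$1
    dom=${2#0}    # strip a single leading zero ("07" -> "7")
    [ "$dow" -eq 7 ] && [ "$dom" -le 7 ]
}

# Example wiring inside a job (exits quietly on any other day):
#   is_first_sunday "$(date -u +%u)" "$(date -u +%d)" || {
#       echo "Not the first Sunday of the month, skipping"
#       exit 0
#   }
```

With a guard like this, the schedule can stay as written and the extra invocations simply exit without doing any work.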

Velero (Primary — Volume Snapshots)

Velero v1.14.0 handles all scheduled volume-level backups with file-system protection via Kopia.

Backup Schedules

| Schedule | Cron | Retention |
|---|---|---|
| Daily | 0 2 * * * (2:00 AM) | 7 days (168h) |
| Weekly | 0 3 * * 0 (3:00 AM Sun) | 30 days (720h) |

Both schedules use defaultVolumesToFsBackup: true, meaning all PVs are backed up at the filesystem level through Kopia rather than relying on CSI volume snapshots.
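
For illustration, the daily schedule could be expressed with the Velero CLI roughly as follows. This is a sketch under assumptions: the schedule name daily-backup is inferred from the backup names used in the restore examples, and the cluster may well define its schedules declaratively (for example through Helm values) instead of through the CLI.

```shell
#!/bin/sh
# Sketch: create the daily schedule with filesystem backup enabled for all volumes.
# "daily-backup" is an assumed name inferred from backups called daily-backup-<timestamp>.
create_daily_schedule() {
    velero schedule create daily-backup \
        --schedule="0 2 * * *" \
        --ttl 168h0m0s \
        --include-namespaces default,monitoring,microservices,vertex-platform,argocd,cert-manager,gpu-operator,keel \
        --default-volumes-to-fs-backup
}

# Call create_daily_schedule once against a cluster where Velero is installed.
```

The --default-volumes-to-fs-backup flag is the CLI counterpart of defaultVolumesToFsBackup: true in the Schedule spec.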

Protected Namespaces

includedNamespaces:
- default
- monitoring
- microservices
- vertex-platform
- argocd
- cert-manager
- gpu-operator
- keel

Storage Backend

| Setting | Value |
|---|---|
| Provider | AWS-compatible (Backblaze B2) |
| Bucket | Private B2 bucket |
| Prefix | velero/ |
| Path style | s3ForcePathStyle: true |
| Access mode | ReadWrite |

Credentials are stored as a Kubernetes Secret in the velero namespace.

Kopia Maintenance

Velero runs Kopia repository maintenance jobs on a per-namespace basis. These compact, deduplicate, and verify the backup repository in B2:

| Namespace | Frequency |
|---|---|
| default | ~65 min |
| argocd | ~65 min |
| microservices | ~65 min |
| monitoring | ~65 min |
| vertex-platform | ~65 min |

Components

| Component | Type | Purpose |
|---|---|---|
| velero Deployment | Server | Backup orchestration, schedule management, S3 communication |
| node-agent DaemonSet | Agent | Filesystem-level PV backup via Kopia on each node |
| velero-backup-verify CronJob | Verifier | Weekly automated restore validation |

Database Backups (Logical Dumps)

Database CronJobs provide application-consistent logical backups that complement Velero's filesystem-level snapshots. Each dump is uploaded to the B2 bucket under database-specific prefixes.

PostgreSQL Backups

Two PostgreSQL databases are backed up daily via pg_dump:

| CronJob | Database | Host | Schedule | B2 Prefix |
|---|---|---|---|---|
| postgresql-backup | codetalk | postgresql-codetalk | 3:30 AM | postgresql-backups/codetalk-* |
| postgresql-bookmarked-backup | bookmarked | postgresql-bookmarked | 3:45 AM | postgresql-backups/bookmarked-* |

Dump format: Custom (-Fc) with --no-owner for portability.

Cleanup: Each CronJob deletes its own backups older than 30 days, scoped by prefix (grep 'codetalk-' / grep 'bookmarked-') to prevent cross-deletion.
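
The prefix-scoped, age-based cleanup can be sketched as a filter over an `aws s3 ls` listing. This is an illustration, not the CronJob's actual script: the function name is hypothetical, it uses awk where the jobs are described as using grep, and it assumes object names of the form PREFIX-YYYYMMDD-HHMMSS.dump as shown in the bucket layout.

```shell
#!/bin/sh
# expired_keys PREFIX CUTOFF
#   Reads `aws s3 ls` output on stdin (date, time, size, key) and prints the keys
#   that start with "PREFIX-" and whose embedded date is older than CUTOFF (YYYYMMDD).
expired_keys() {
    prefix=$1
    cutoff=$2
    awk -v p="${prefix}-" -v c="$cutoff" '
        index($4, p) == 1 {
            stamp = substr($4, length(p) + 1, 8)    # YYYYMMDD part of the key
            # Skip keys where the prefix is not followed by an 8-digit date.
            if (stamp ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ && stamp < c) print $4
        }'
}
```

A job could compute the cutoff with something like `date -u -d "@$(( $(date +%s) - 30 * 86400 ))" +%Y%m%d` (the @epoch form is accepted by both GNU and BusyBox date, as far as I know) and pass each printed key to `aws s3 rm`. Because the filter requires the prefix to be followed directly by a date, the codetalk job never sees bookmarked dumps.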

Secrets:

  • Database passwords from postgresql-codetalk-secrets / postgresql-bookmarked-secrets
  • B2 credentials from velero-b2-credentials

MongoDB Backups

Five MongoDB databases are backed up daily via mongodump --archive --gzip:

| CronJob | Database | Schedule | B2 Prefix |
|---|---|---|---|
| mongodb-backup | intervalai | 3:00 AM | mongodb-backups/intervalai-* |
| mongodb-backup-educationelly | educationelly | 3:05 AM | mongodb-backups/educationelly-* |
| mongodb-backup-educationelly-graphql | educationelly-graphql | 3:10 AM | mongodb-backups/educationelly-graphql-* |
| mongodb-backup-microservices | microservices | 3:15 AM | mongodb-backups/microservices-* |
| mongodb-backup-vertex-platform | vertex-platform | 3:20 AM | mongodb-backups/vertex-platform-* |

Image: maxjeffwell/mongodb-backup:latest — custom image with MongoDB tools + AWS CLI pre-installed (avoids apt-get IPv6 failures on mongo:7-jammy in the cluster).

Cleanup: Each CronJob scopes deletion by its own prefix.

Secrets:

  • MongoDB credentials from per-database ExternalSecrets (microservices, vertex-platform) or inline connection strings (default namespace)
  • B2 credentials from velero-b2-credentials

Cold Storage Backups

DNS Zone Backup

Weekly export of all Cloudflare DNS records for the el-jefe.me zone, providing disaster recovery if the Cloudflare account or zone is lost.

Schedule: Sunday 4:00 AM | Image: alpine:3.21 | B2 prefix: dns-backups/

What's captured:

  • all_records.json — Paginated Cloudflare API export of all DNS records (JSON)
  • zone_settings.json — Zone-level settings (SSL mode, security level, etc.)
  • zone-export.txt — BIND-format zone file (can be imported into any DNS provider)

Package: dns-zone-TIMESTAMP.tar.gz | Retention: 12 weekly backups (3 months)
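
Count-based retention of this kind (keep the newest 12, delete the rest) can be sketched as below. How the real job implements it is not shown here, so the function name and approach are assumptions. The sketch relies on the timestamp in the file name sorting lexicographically.

```shell
#!/bin/sh
# keys_beyond_newest KEEP
#   Reads object keys (one per line) on stdin and prints every key that is NOT
#   among the KEEP newest, i.e. the candidates for deletion.
keys_beyond_newest() {
    sort -r | awk -v keep="$1" 'NR > keep'
}

# Example wiring (B2_BUCKET and B2_ENDPOINT as used elsewhere on this page):
#   aws s3 ls "s3://${B2_BUCKET}/dns-backups/" --endpoint-url "${B2_ENDPOINT}" \
#     | awk '{print $4}' | keys_beyond_newest 12
```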

Doppler Secrets Backup

Weekly encrypted cold-storage export of all secrets from Doppler, providing recovery if the Doppler account is compromised or unavailable.

Schedule: Sunday 5:00 AM | Image: alpine:3.21 | B2 prefix: doppler-backups/

Security:

  • Secrets are downloaded as JSON via the Doppler API
  • Immediately encrypted with AES-256-CBC (PBKDF2, 100k iterations) using a passphrase from doppler-backup-encryption Secret
  • Plaintext is deleted before upload; only the encrypted .enc file reaches B2
  • Secret values never appear in pod logs
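
The encrypt and decrypt steps can be sketched with openssl as follows. The cipher and KDF parameters come from the list above; the function names and the BACKUP_PASSPHRASE variable are hypothetical. Reading the passphrase from the environment (-pass env:) is a choice made for this sketch so that it does not appear in the process argument list; the actual job may pass it differently.

```shell
#!/bin/sh
# encrypt_file IN OUT / decrypt_file IN OUT
# Both expect the passphrase in the exported BACKUP_PASSPHRASE variable (assumed name).
encrypt_file() {
    openssl enc -aes-256-cbc -salt -pbkdf2 -iter 100000 \
        -in "$1" -out "$2" -pass env:BACKUP_PASSPHRASE
}

decrypt_file() {
    openssl enc -d -aes-256-cbc -pbkdf2 -iter 100000 \
        -in "$1" -out "$2" -pass env:BACKUP_PASSPHRASE
}

# Example:
#   export BACKUP_PASSPHRASE=...            # e.g. injected from the Kubernetes Secret
#   encrypt_file secrets.json secrets.enc && rm -f secrets.json
```

The -pbkdf2 and -iter values must match between encryption and decryption, otherwise the derived key differs and decryption fails.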

Retention: 12 weekly backups (3 months)


Restore Testing

Every backup type has an automated restore test. Velero verification runs weekly on Sunday; the PostgreSQL, MongoDB, DNS, and Doppler tests are scheduled for the first Sunday of each month. The tests validate that backups are downloadable, recent, structurally valid, and restorable.

Velero Backup Verify

Schedule: 0 4 * * 0 (weekly Sunday)

  1. Confirms BackupStorageLocation is Available
  2. Finds the most recent successful daily backup
  3. Fails if the latest backup is older than 48 hours
  4. Restores the default namespace into a temporary velero-test-restore namespace
  5. Verifies restore completed successfully
  6. Cleans up the test namespace and restore object
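
The freshness check in step 3 comes down to simple epoch arithmetic. A minimal sketch, with a hypothetical function name; how the verifier actually obtains the backup's timestamp is not shown here (one option would be the completionTimestamp field in the Backup object's status).

```shell
#!/bin/sh
# backup_is_fresh NOW_EPOCH BACKUP_EPOCH MAX_AGE_HOURS
#   Succeeds when the backup is no older than MAX_AGE_HOURS.
backup_is_fresh() {
    age_seconds=$(( $1 - $2 ))
    [ "$age_seconds" -le $(( $3 * 3600 )) ]
}

# Example:
#   backup_is_fresh "$(date +%s)" "$BACKUP_EPOCH" 48 || { echo "FAIL: latest backup is stale"; exit 1; }
```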

PostgreSQL Restore Test

Schedule: 0 6 1-7 * 0 (first Sunday monthly) | Image: postgres:14-alpine

For each database (codetalk, bookmarked):

  1. Downloads the latest dump from B2
  2. Checks backup age (warns if > 2 days)
  3. Starts a local PostgreSQL instance via pg_ctl
  4. Runs pg_restore into a test database
  5. Validates table count and row counts via pg_stat_user_tables
  6. Reports PASS/FAIL per database

Alpine Date Compatibility

Alpine's BusyBox date doesn't accept bare YYYYMMDD format. The test converts timestamps to YYYY-MM-DD with sed before passing to date -d.
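
A sketch of that conversion, assuming the YYYYMMDD-HHMMSS stamp used in the backup file names. The function name is hypothetical, and this version also reformats the time portion, which goes slightly beyond the date-only conversion described above.

```shell
#!/bin/sh
# normalize_stamp "YYYYMMDD-HHMMSS" prints "YYYY-MM-DD HH:MM:SS",
# a form that both BusyBox and GNU `date -d` accept.
normalize_stamp() {
    printf '%s\n' "$1" | sed -E \
        's/^([0-9]{4})([0-9]{2})([0-9]{2})-([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3 \4:\5:\6/'
}

# Example: epoch seconds of a backup, for an age check
#   date -u -d "$(normalize_stamp 20250105-033000)" +%s
```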

MongoDB Restore Test

Schedule: 30 6 1-7 * 0 (first Sunday monthly) | Image: maxjeffwell/mongodb-backup:latest

For each database (intervalai, educationelly, educationelly-graphql, microservices, vertex-platform):

  1. Downloads the latest archive from B2
  2. Checks backup age (warns if > 2 days)
  3. Starts a local mongod instance on port 27099
  4. Runs mongorestore --archive --gzip --drop
  5. Validates collection and document counts via mongosh
  6. Empty databases (e.g. microservices, vertex-platform) pass with "restore mechanics verified"

Grep Prefix Collision

The B2 file lookup uses grep "${DB}-[0-9]" instead of grep "${DB}-" to prevent educationelly- from matching educationelly-graphql- files. The [0-9] anchors to the timestamp digit that follows the database name.
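
The effect of the pattern can be shown on a small list of object keys. The function name is hypothetical; the grep pattern is the one quoted above.

```shell
#!/bin/sh
# latest_archive DB
#   Reads object keys (one per line) on stdin and prints the newest key for DB.
#   The [0-9] after the dash requires a timestamp digit, so "educationelly"
#   does not pick up "educationelly-graphql-..." keys.
latest_archive() {
    grep "${1}-[0-9]" | sort | tail -n 1
}

# Example:
#   aws s3 ls "s3://${B2_BUCKET}/mongodb-backups/" --endpoint-url "${B2_ENDPOINT}" \
#     | awk '{print $4}' | latest_archive educationelly
```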

DNS Zone Restore Test

Schedule: 0 7 1-7 * 0 (first Sunday monthly) | Image: alpine:3.21

Validates 5 checks:

  1. Backup age: must be < 8 days (weekly schedule)
  2. Tarball extraction: archive must decompress successfully
  3. DNS records JSON: all_records.json must be valid JSON with record count > 0; reports type breakdown (A, CNAME, MX, TXT)
  4. Zone settings JSON: zone_settings.json must be valid JSON; reports zone name
  5. Live comparison: queries the live Cloudflare API and compares record count; warns if drift > 5
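
The drift warning in the live comparison is an absolute-difference check against a threshold. A minimal sketch with a hypothetical function name; the same shape would fit the Doppler key-count comparison with its own threshold.

```shell
#!/bin/sh
# drift_exceeds BACKUP_COUNT LIVE_COUNT THRESHOLD
#   Succeeds (exit 0) when |BACKUP_COUNT - LIVE_COUNT| is greater than THRESHOLD.
drift_exceeds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -gt "$3" ]
}

# Example:
#   drift_exceeds "$BACKUP_RECORDS" "$LIVE_RECORDS" 5 && echo "WARN: record count drift > 5"
```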

Doppler Secrets Restore Test

Schedule: 15 7 1-7 * 0 (first Sunday monthly) | Image: alpine:3.21

Validates 4 checks:

  1. Backup age — Must be < 8 days (weekly schedule)
  2. Decryption — AES-256-CBC decrypt with the same passphrase must succeed
  3. JSON validation — Decrypted output must be valid JSON with key count > 0; prints sample key names (never values)
  4. Live comparison — Queries live Doppler API and compares key count; warns if drift > 3

CloudCasa (Secondary)

CloudCasa provides a SaaS-managed backup layer. The agent runs in the cloudcasa-io namespace and communicates with CloudCasa's control plane — backup schedules and policies are managed from the CloudCasa web dashboard rather than through in-cluster CRDs.

| Component | Purpose |
|---|---|
| cloudcasa-kubeagent-manager | Agent lifecycle management |
| kubeagent (2 containers) | Backup execution and data movement |

CloudCasa's CRDs (backups, restores, schedules, volume snapshot locations) are registered in the cluster but orchestrated from the SaaS side, providing an independent backup path that doesn't share Velero's storage or failure modes.


What's Protected

Stateful Data (Active Backup)

The following PVCs are actively backed up via Velero's filesystem-level Kopia snapshots:

| Namespace | PVC | Size | Storage Class |
|---|---|---|---|
| default | mongodb-educationelly-pvc | 2Gi | local-path |
| default | mongodb-educationelly-graphql-pvc | 2Gi | local-path |
| default | mongodb-intervalai-pvc | 2Gi | local-path |
| default | mongodb-data-mongodb-intervalai-0 | 8Gi | local-path |
| default | postgresql-codetalk-pvc | 2Gi | local-path |
| default | redis-pvc | 1Gi | local-path |
| default | llm-models-pvc | 100Gi | csi-s3 |
| default | triton-models-pvc | 50Gi | csi-s3 |
| langfuse | data-langfuse-clickhouse-shard0-0 | 2Gi | local-path |
| langfuse | langfuse-s3 | 8Gi | local-path |
| microservices | mongodb | 8Gi | local-path |
| microservices | redis-data-redis-replicas-{0,1,2} | 8Gi each | local-path |
| microservices | data-vertex-kafka-* | 5Gi | local-path |
| monitoring | prometheus-*-db-* (x2) | 50Gi each | local-path |
| monitoring | prometheus-minio | 20Gi | local-path |
| monitoring | storage-prometheus-mimir-* (x3) | 10Gi each | local-path |
| monitoring | storage-prometheus-loki-0 | 5Gi | local-path |
| monitoring | gotify-pvc | 1Gi | local-path |
| monitoring | alertmanager-*-db-* | 2Gi | local-path |
| monitoring | kafka-data-prometheus-mimir-kafka-0 | 5Gi | local-path |
| vertex-platform | influxdb-data-pvc | 5Gi | local-path |
| vertex-platform | influxdb-config-pvc | 100Mi | local-path |
| vertex-platform | mongodb-pvc | 5Gi | local-path |
| vertex-platform | redis-pvc | 1Gi | local-path |

Declarative State (Protected by GitOps)

These don't need active backup — they can be fully reconstructed from Git:

| What | Source |
|---|---|
| All Helm charts and values | devops-portfolio-manager repo |
| Application source code | Individual GitHub repos |
| CI/CD workflows | GitHub Actions in each repo |
| ArgoCD Applications | Git-managed in devops-portfolio-manager |
| Docker images | Docker Hub (tagged with YYYYMMDD-HHMMSS-shortsha) |
| TLS certificates | Auto-issued by cert-manager + Let's Encrypt |
| Secrets definitions | Doppler (source of truth) + ESO manifests in Git |

Design Principle

If it's declarative and lives in Git, GitOps is the backup. Active backup targets only stateful data that can't be reconstructed: database contents, persistent volumes, and monitoring history. Database logical dumps provide an independent recovery path from Velero in case of filesystem-level corruption.


B2 Bucket Layout

All off-cluster backups share a single private B2 bucket organized by prefix:

s3://<bucket>/
├── velero/                        # Velero volume snapshots (managed by Velero)
├── postgresql-backups/
│   ├── codetalk-YYYYMMDD-HHMMSS.dump
│   └── bookmarked-YYYYMMDD-HHMMSS.dump
├── mongodb-backups/
│   ├── intervalai-YYYYMMDD-HHMMSS.archive
│   ├── educationelly-YYYYMMDD-HHMMSS.archive
│   ├── educationelly-graphql-YYYYMMDD-HHMMSS.archive
│   ├── microservices-YYYYMMDD-HHMMSS.archive
│   └── vertex-platform-YYYYMMDD-HHMMSS.archive
├── dns-backups/
│   └── dns-zone-YYYYMMDD-HHMMSS.tar.gz
└── doppler-backups/
    └── doppler-secrets-YYYYMMDD-HHMMSS.enc

All database dump CronJobs share the velero-b2-credentials Secret for B2 access. Bucket name and endpoint are configured via environment variables in each CronJob, not hardcoded in documentation.


Restore Procedures

Restore a Single Namespace (Velero)

velero restore create --from-backup daily-backup-YYYYMMDDHHMMSS \
  --include-namespaces default

Restore to a Different Namespace

velero restore create --from-backup daily-backup-YYYYMMDDHHMMSS \
  --include-namespaces default \
  --namespace-mappings default:restored-default

Restore a Specific Resource

velero restore create --from-backup daily-backup-YYYYMMDDHHMMSS \
  --include-namespaces default \
  --include-resources persistentvolumeclaims,persistentvolumes \
  --selector app=bookmarked

Restore a PostgreSQL Database from B2

# Download the latest dump
aws s3 cp s3://${B2_BUCKET}/postgresql-backups/codetalk-YYYYMMDD-HHMMSS.dump /tmp/ \
--endpoint-url "${B2_ENDPOINT}"

# Restore into the target database
pg_restore -h postgresql-codetalk -U codetalk_user -d codetalk --no-owner /tmp/codetalk-*.dump

Restore a MongoDB Database from B2

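# Download the archive (substitute the timestamp of the backup to restore)
aws s3 cp s3://${B2_BUCKET}/mongodb-backups/intervalai-YYYYMMDD-HHMMSS.archive /tmp/ \
  --endpoint-url "${B2_ENDPOINT}"

# Restore into the target MongoDB. The shell does not expand a wildcard that is
# part of an --archive=... argument, so resolve the file name first.
ARCHIVE=$(ls -t /tmp/intervalai-*.archive | head -n 1)
mongorestore --host mongodb-intervalai --archive="${ARCHIVE}" --gzip --drop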

Restore DNS Records from B2

# Download and extract
aws s3 cp s3://${B2_BUCKET}/dns-backups/dns-zone-YYYYMMDD-HHMMSS.tar.gz /tmp/ \
  --endpoint-url "${B2_ENDPOINT}"
mkdir -p /tmp/dns-restore   # tar -C requires the target directory to exist
tar xzf /tmp/dns-zone-*.tar.gz -C /tmp/dns-restore

# Import via Cloudflare API (individual records)
# read -r keeps backslashes in record content (e.g. TXT values) intact
jq -c '.[]' /tmp/dns-restore/all_records.json | while IFS= read -r record; do
  curl -X POST "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records" \
    -H "Authorization: Bearer ${CF_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$record"
done

Restore Doppler Secrets from B2

# Download encrypted backup
aws s3 cp s3://${B2_BUCKET}/doppler-backups/doppler-secrets-YYYYMMDD-HHMMSS.enc /tmp/ \
--endpoint-url "${B2_ENDPOINT}"

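# Decrypt. The passphrase is read from the exported ENCRYPTION_PASSPHRASE variable,
# which keeps it out of the process argument list.
openssl enc -aes-256-cbc -d -pbkdf2 -iter 100000 \
  -in /tmp/doppler-secrets-*.enc \
  -out /tmp/secrets.json \
  -pass env:ENCRYPTION_PASSPHRASE

# Re-import to Doppler, then remove the plaintext copy
doppler secrets upload /tmp/secrets.json
rm -f /tmp/secrets.json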

Full Cluster Recovery

  1. Reinstall K3s on the node
  2. Install Velero with the same B2 credentials and bucket configuration
  3. Restore infrastructure namespaces first:
    velero restore create --from-backup weekly-backup-YYYYMMDDHHMMSS \
    --include-namespaces cert-manager,argocd
  4. Restore application namespaces:
    velero restore create --from-backup daily-backup-YYYYMMDDHHMMSS \
    --include-namespaces default,monitoring,microservices,vertex-platform
  5. Verify ArgoCD reconciles remaining state from Git
  6. Restore DNS from B2 if Cloudflare zone was lost
  7. Restore secrets from encrypted B2 backup if Doppler is unavailable

List Available Backups

# Velero backups
velero backup get --sort-by .metadata.creationTimestamp

# Database dumps in B2
aws s3 ls s3://${B2_BUCKET}/postgresql-backups/ --endpoint-url "${B2_ENDPOINT}"
aws s3 ls s3://${B2_BUCKET}/mongodb-backups/ --endpoint-url "${B2_ENDPOINT}"

Known Issues

| Issue | Impact | Status |
|---|---|---|
| GPU daemonset pods not running on node | Backup warning for gpu-operator pods | GPU operator pods only run when GPU node is active |
| PodVolumeBackup timeout | Backup completes as PartiallyFailed | Large PVs (models, Prometheus) occasionally exceed the default timeout |
| Cluster IPv6 egress unavailable | apt-get install fails in mongo:7-jammy pods | MongoDB CronJobs use custom mongodb-backup image with tools pre-installed |