AI / ML Infrastructure

The platform includes a self-hosted AI inference stack that provides embedding generation, text generation, and LLM observability across all portfolio applications.

Architecture

Components

| Component | Purpose | Technology |
| --- | --- | --- |
| Shared AI Gateway | Unified API for all AI features | Node.js/Express with multi-tier fallback |
| Triton Semantic Search | GPU-accelerated embeddings and code search | NVIDIA Triton + ONNX + pgvector |
| Langfuse | LLM observability and tracing | Langfuse + ClickHouse + LiteLLM |
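As an illustration of how a portfolio app talks to the Shared AI Gateway, here is a minimal sketch of an embedding request. The `/v1/embeddings` path, the request/response shapes, and the `AI_GATEWAY_URL` variable are assumptions for this example, not the gateway's documented API.

```typescript
// Hypothetical client call to the Shared AI Gateway (Node 18+, global fetch).
// The endpoint path and response shape are assumptions for illustration.
const GATEWAY_URL = process.env.AI_GATEWAY_URL ?? "http://localhost:3000";

interface EmbeddingResponse {
  embedding: number[]; // vector suitable for storage in pgvector
  backend: string;     // which tier actually served the request
}

async function embed(text: string): Promise<EmbeddingResponse> {
  const res = await fetch(`${GATEWAY_URL}/v1/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input: text }),
  });
  if (!res.ok) {
    throw new Error(`Gateway returned ${res.status}`);
  }
  return (await res.json()) as EmbeddingResponse;
}
```

Because every app goes through the same gateway, backend selection, fallback, and tracing live in one place rather than in each client.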

Design Principles

  • Cost optimization — Free or low-cost backends (HuggingFace, Groq) are tried first; expensive backends (Anthropic, RunPod) are used only when needed
  • Reliability — Multi-tier fallback keeps AI features available even when an individual backend fails (see the sketch after this list)
  • Observability — Every LLM call is traced through Langfuse via LiteLLM
  • Self-hosted where possible — Triton runs on a local GPU and llama.cpp on the VPS, reducing external dependencies
  • Shared infrastructure — One gateway serves all portfolio apps instead of duplicating AI logic in each app
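The following sketch shows the multi-tier fallback idea: backends are tried in cost order and the first success wins. The backend names and the `generate` signature are illustrative assumptions, not the gateway's actual implementation.

```typescript
// Illustrative multi-tier fallback: try cheap backends first, fall through on failure.
// Backend names and the generate() signature are assumptions for this sketch.
type Backend = {
  name: string;
  generate: (prompt: string) => Promise<string>;
};

async function generateWithFallback(
  prompt: string,
  backends: Backend[],
): Promise<{ text: string; backend: string }> {
  let lastError: unknown;
  for (const backend of backends) {
    try {
      const text = await backend.generate(prompt);
      return { text, backend: backend.name };
    } catch (err) {
      lastError = err; // record the failure and move on to the next tier
    }
  }
  throw new Error(`All backends failed: ${String(lastError)}`);
}

// Ordered cheapest-first, matching the cost-optimization principle:
// const result = await generateWithFallback(prompt, [huggingface, groq, anthropic]);
```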