GPU Inference Cluster
Self-hosted LLM inference on bare metal: 18GB of VRAM across two GPUs, an Open WebUI chat frontend, and an OpenAI-compatible API. Local models plus passthrough to OpenAI and Anthropic.
Overview
Rather than paying per-token API costs for every request, I run a full local inference stack on GPU hardware. An RTX 3080 (10GB) and RTX 3060 Ti (8GB) combine for 18GB of usable VRAM, enough to run mid-size models at full precision. The stack routes traffic through nginx to Open WebUI at chat.reneau.me, which can dispatch to local Ollama/vLLM backends or pass through to OpenAI and Anthropic when needed.
Request Flow
```
Internet
└── nginx (TLS, chat.reneau.me)
    └── Open WebUI
        ├── Ollama (local - smaller models)
        ├── vLLM (local - larger / batched)
        ├── OpenAI (passthrough)
        └── Anthropic (passthrough)
```
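The nginx layer terminates TLS and proxies everything to Open WebUI. A minimal sketch of that server block, assuming Let's Encrypt certificate paths and Open WebUI listening on local port 8080 (both are assumptions, not details from this setup):

```nginx
server {
    listen 443 ssl;
    server_name chat.reneau.me;

    ssl_certificate     /etc/letsencrypt/live/chat.reneau.me/fullchain.pem;  # path assumed
    ssl_certificate_key /etc/letsencrypt/live/chat.reneau.me/privkey.pem;    # path assumed

    location / {
        proxy_pass http://127.0.0.1:8080;        # Open WebUI (port assumed)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;  # WebSocket support for the chat UI
        proxy_set_header Connection "upgrade";
        proxy_buffering off;                     # don't buffer streamed tokens
        proxy_read_timeout 300s;                 # allow long generations
    }
}
```

Disabling `proxy_buffering` matters here: with buffering on, nginx would hold back the token-by-token SSE stream until the response completes.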
Architecture
Hardware
- RTX 3080 - 10GB GDDR6X
- RTX 3060 Ti - 8GB GDDR6
- 18GB combined VRAM - multi-GPU model distribution
- CUDA 12, NVIDIA container toolkit
- NVMe storage for model weights
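One common way to wire the NVIDIA container toolkit into a stack like this is a Compose service with a GPU device reservation. A sketch, with the NVMe mount path and image tag as assumptions:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - /nvme/models:/root/.ollama   # NVMe-backed model weights (path assumed)
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all             # expose both the 3080 and 3060 Ti
              capabilities: [gpu]
```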
Inference Backends
- Ollama - runtime for smaller models (<13B)
- vLLM - high-throughput serving, larger models
- OpenAI-compatible API - drop-in for existing clients
- Streaming SSE - token-by-token output
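With streaming enabled, token-by-token output arrives as OpenAI-style SSE `data:` lines. A small sketch of how a client might pull the content deltas out of such a stream (field names follow the OpenAI chat-completions chunk format; the function name is illustrative):

```python
import json

def extract_tokens(sse_lines):
    """Collect content deltas from OpenAI-style SSE 'data:' lines."""
    tokens = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip keep-alives and blank separators
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            tokens.append(delta["content"])
    return tokens
```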
Frontend
- Open WebUI at chat.reneau.me
- Multi-model switching - local and cloud in one UI
- RAG pipeline - local document ingestion + retrieval
- Local accounts - no external IdP
External Providers
- OpenAI - GPT-4o, o1, etc. via API key passthrough
- Anthropic - Claude models via API key passthrough
- Single chat UI for all providers
- Provider selected per-conversation
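Per-conversation provider selection amounts to dispatching each request by model name. Open WebUI handles this internally; a toy sketch of the idea (the prefixes and backend labels are illustrative, not its actual routing table):

```python
# Hypothetical routing table: model-name prefix -> backend.
BACKENDS = {
    "gpt-": "openai",
    "o1": "openai",
    "claude-": "anthropic",
}

def route(model: str) -> str:
    """Return the backend for a model name; local Ollama/vLLM is the default."""
    for prefix, backend in BACKENDS.items():
        if model.startswith(prefix):
            return backend
    return "local"
```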
Performance
Benchmark Configuration
Prompt: "Explain the difference between Kubernetes Deployments, StatefulSets, and DaemonSets. Keep it technical and around 250 to 350 words."
Results are warm averages over runs 2–3 (run 1 is the cold start).
| Model | Output tok/s (warm) | Cold start tok/s | Cold Δ vs warm | Prompt tok/s (warm) | Load avg (s) | Generation avg (s) | Wall time avg (s) | Response (chars) |
|---|---|---|---|---|---|---|---|---|
| gemma3:1b | 263.6 | 285.3 | — | 4039 | 1.85 | 1.11 | 3.67 | 1595 |
| gemma3:4b | 153.0 | 148.6 | −3% | 2341 | 3.66 | 1.98 | 6.36 | 1514 |
| mistral:7b | 130.6 | 130.5 | −0% | 2132 | 1.65 | 2.30 | 3.99 | 1366 |
| llama3.1:8b | 123.0 | 123.1 | — | 2184 | 2.50 | 2.44 | 5.03 | 1497 |
| gemma3:12b | 68.8 | 68.9 | — | 1185 | 6.89 | 4.36 | 11.98 | 1463 |
| qwen2.5:14b | 56.5 | 56.5 | — | 1338 | 3.49 | 5.31 | 8.94 | 1590 |
| gemma3:27b | 9.1 | 9.0 | −0% | 142 | 14.48 | 33.14 | 48.66 | 1564 |
Cold Δ = how much slower the first (cold) run was versus the warm average. High values indicate significant model load overhead.
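The Cold Δ column is just the percent change of the cold run's throughput against the warm average; a minimal sketch of the arithmetic:

```python
def cold_delta_pct(warm_tps: float, cold_tps: float) -> float:
    """Percent change of cold-run tok/s vs the warm average (negative = cold slower)."""
    return (cold_tps - warm_tps) / warm_tps * 100
```

For gemma3:4b this gives (148.6 − 153.0) / 153.0 × 100 ≈ −2.9%, the −3% shown in the table.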
API Quick Start
List available models:

```shell
curl https://chat.reneau.me/api/tags
```

Chat completion (OpenAI-compatible):

```shell
curl https://chat.reneau.me/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":true}'
```