GPU Inference Cluster
Self-hosted LLM inference on bare metal: 18GB of VRAM across two GPUs, an Open WebUI chat frontend, and an OpenAI-compatible API. Local models plus passthrough to OpenAI and Anthropic.
Overview
Rather than paying per-token API costs for every request, I run a full local inference stack on GPU hardware. An RTX 3080 (10GB) and RTX 3060 Ti (8GB) combine for 18GB of usable VRAM, enough to run mid-size models at full precision. The stack routes traffic through nginx to Open WebUI at chat.reneau.me, which can dispatch to local Ollama/vLLM backends or pass through to OpenAI and Anthropic when needed.
Request Flow
```
Internet
└── nginx (TLS, chat.reneau.me)
    └── Open WebUI
        ├── Ollama (local - smaller models)
        ├── vLLM (local - larger / batched)
        ├── OpenAI (passthrough)
        └── Anthropic (passthrough)
```
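The nginx layer terminates TLS and proxies everything to Open WebUI. A minimal sketch of that server block, assuming Let's Encrypt certificate paths and Open WebUI listening on local port 8080 (both are assumptions, not details from this setup):

```nginx
server {
    listen 443 ssl;
    server_name chat.reneau.me;

    ssl_certificate     /etc/letsencrypt/live/chat.reneau.me/fullchain.pem;  # path assumed
    ssl_certificate_key /etc/letsencrypt/live/chat.reneau.me/privkey.pem;    # path assumed

    location / {
        proxy_pass http://127.0.0.1:8080;        # Open WebUI (port assumed)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;  # WebSocket support for the chat UI
        proxy_set_header Connection "upgrade";
        proxy_buffering off;                     # don't buffer streamed tokens
        proxy_read_timeout 300s;                 # allow long generations
    }
}
```

Disabling `proxy_buffering` matters here: with buffering on, nginx would hold back the token-by-token SSE stream until the response completes.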
Architecture
Hardware
- RTX 3080 - 10GB GDDR6X
- RTX 3060 Ti - 8GB GDDR6
- 18GB combined VRAM - multi-GPU model distribution
- CUDA 12, NVIDIA container toolkit
- NVMe storage for model weights
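One common way to wire the NVIDIA container toolkit into a stack like this is a Compose service with a GPU device reservation. A sketch, with the NVMe mount path and image tag as assumptions:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - /nvme/models:/root/.ollama   # NVMe-backed model weights (path assumed)
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all             # expose both the 3080 and 3060 Ti
              capabilities: [gpu]
```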
Inference Backends
- Ollama - runtime for smaller models (<13B)
- vLLM - high-throughput serving, larger models
- OpenAI-compatible API - drop-in for existing clients
- Streaming SSE - token-by-token output
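With streaming enabled, token-by-token output arrives as OpenAI-style SSE `data:` lines. A small sketch of how a client might pull the content deltas out of such a stream (field names follow the OpenAI chat-completions chunk format; the function name is illustrative):

```python
import json

def extract_tokens(sse_lines):
    """Collect content deltas from OpenAI-style SSE 'data:' lines."""
    tokens = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip keep-alives and blank separators
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            tokens.append(delta["content"])
    return tokens
```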
Frontend
- Open WebUI at chat.reneau.me
- Multi-model switching - local and cloud in one UI
- RAG pipeline - local document ingestion + retrieval
- Local accounts - no external IdP
External Providers
- OpenAI - GPT-4o, o1, etc. via API key passthrough
- Anthropic - Claude models via API key passthrough
- Single chat UI for all providers
- Provider selected per-conversation
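Per-conversation provider selection amounts to dispatching each request by model name. Open WebUI handles this internally; a toy sketch of the idea (the prefixes and backend labels are illustrative, not its actual routing table):

```python
# Hypothetical routing table: model-name prefix -> backend.
BACKENDS = {
    "gpt-": "openai",
    "o1": "openai",
    "claude-": "anthropic",
}

def route(model: str) -> str:
    """Return the backend for a model name; local Ollama/vLLM is the default."""
    for prefix, backend in BACKENDS.items():
        if model.startswith(prefix):
            return backend
    return "local"
```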
Performance
Benchmark Configuration
Prompt: "Explain the difference between Kubernetes Deployments, StatefulSets, and DaemonSets. Keep it technical and around 250 to 350 words."
Results are warm averages over runs 2–3 (run 1 is the cold start).
| Model | Output tok/s (warm) | Cold start tok/s | Cold Δ vs warm | Prompt tok/s (warm) | Load avg (s) | Generation avg (s) | Wall time avg (s) | Response (chars) |
|---|---|---|---|---|---|---|---|---|
| gemma3:1b | 263.6 | 285.3 | — | 4039 | 1.85 | 1.11 | 3.67 | 1595 |
| gemma3:4b | 153.0 | 148.6 | −3% | 2341 | 3.66 | 1.98 | 6.36 | 1514 |
| mistral:7b | 130.6 | 130.5 | −0% | 2132 | 1.65 | 2.30 | 3.99 | 1366 |
| llama3.1:8b | 123.0 | 123.1 | — | 2184 | 2.50 | 2.44 | 5.03 | 1497 |
| gemma3:12b | 68.8 | 68.9 | — | 1185 | 6.89 | 4.36 | 11.98 | 1463 |
| qwen2.5:14b | 56.5 | 56.5 | — | 1338 | 3.49 | 5.31 | 8.94 | 1590 |
| gemma3:27b | 9.1 | 9.0 | −0% | 142 | 14.48 | 33.14 | 48.66 | 1564 |
Cold Δ = how much slower the first (cold) run was versus the warm average. High values indicate significant model load overhead.
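The Cold Δ column is just the percent change of the cold run's throughput against the warm average; a minimal sketch of the arithmetic:

```python
def cold_delta_pct(warm_tps: float, cold_tps: float) -> float:
    """Percent change of cold-run tok/s vs the warm average (negative = cold slower)."""
    return (cold_tps - warm_tps) / warm_tps * 100
```

For gemma3:4b this gives (148.6 − 153.0) / 153.0 × 100 ≈ −2.9%, the −3% shown in the table.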
API Quick Start
List available models:

```shell
curl https://chat.reneau.me/api/tags
```

Chat completion (OpenAI-compatible):

```shell
curl https://chat.reneau.me/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":true}'
```