Overview

Rather than paying per-token API costs for every request, I run a full local inference stack on GPU hardware. An RTX 3080 (10GB) and an RTX 3060 Ti (8GB) combine for 18GB of usable VRAM, enough to keep quantized mid-size models fully resident on GPU. The stack routes traffic through nginx to Open WebUI at chat.reneau.me, which can dispatch to local Ollama/vLLM backends or pass through to OpenAI and Anthropic when needed.

Request Flow

Internet
  └── nginx (TLS, chat.reneau.me)
        └── Open WebUI
              ├── Ollama  (local - smaller models)
              ├── vLLM    (local - larger / batched)
              ├── OpenAI  (passthrough)
              └── Anthropic (passthrough)
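The fan-out above amounts to a routing rule from model name to backend. A minimal sketch in Python; Open WebUI handles this internally, and the name prefixes below are illustrative assumptions, not its actual routing logic:

```python
# Hypothetical model-name -> backend routing, illustrating the fan-out above.
# The prefix mapping is an assumption for illustration only.

BACKENDS = {
    "gpt-": "openai",        # passthrough to OpenAI
    "claude-": "anthropic",  # passthrough to Anthropic
}

def pick_backend(model: str) -> str:
    """Map a model name to a backend; default to the local Ollama runtime."""
    for prefix, backend in BACKENDS.items():
        if model.startswith(prefix):
            return backend
    return "ollama"

print(pick_backend("gpt-4o"))    # openai
print(pick_backend("llama3.1"))  # ollama
```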

Architecture

Hardware

  • RTX 3080 - 10GB GDDR6X
  • RTX 3060 Ti - 8GB GDDR6
  • 18GB combined VRAM - multi-GPU model distribution
  • CUDA 12, NVIDIA container toolkit
  • NVMe storage for model weights

Inference Backends

  • Ollama - runtime for smaller models (<13B)
  • vLLM - high-throughput serving, larger models
  • OpenAI-compatible API - drop-in for existing clients
  • Streaming SSE - token-by-token output
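The SSE streaming mentioned above delivers tokens as `data:`-prefixed JSON chunks terminated by a `[DONE]` sentinel, per the OpenAI streaming format. A simplified parser (real chunks carry more fields than shown here):

```python
import json

def extract_tokens(sse_lines):
    """Yield content tokens from OpenAI-style SSE chat-completion chunks."""
    for line in sse_lines:
        if not line.startswith("data: "):
            continue                      # skip blanks and keep-alive comments
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":   # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Canned example stream (shape matches the OpenAI streaming format):
stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(extract_tokens(stream)))  # Hello
```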

Frontend

  • Open WebUI at chat.reneau.me
  • Multi-model switching - local and cloud in one UI
  • RAG pipeline - local document ingestion + retrieval
  • Local accounts - no external IdP

External Providers

  • OpenAI - GPT-4o, o1, etc. via API key passthrough
  • Anthropic - Claude models via API key passthrough
  • Single chat UI for all providers
  • Provider selected per-conversation

Performance

Benchmark Configuration

  • Date: 2026-03-12
  • Backend: Ollama
  • Duration: 274.7s
  • Runs per model: 3
  • Max tokens: 300
  • Temperature: 0
  • Unload between runs: yes
  • Unload between models: yes
  • Models tested: 7
  • Test prompt: "Explain the difference between Kubernetes Deployments, StatefulSets, and DaemonSets. Keep it technical and around 250 to 350 words."
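The throughput figures below come from per-response timing fields that Ollama reports (`eval_count` in tokens, `eval_duration` in nanoseconds, per its `/api/generate` response). The arithmetic is just:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_count (tokens) and eval_duration (nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 300 tokens generated in 2.0 s of eval time:
print(tokens_per_second(300, 2_000_000_000))  # 150.0
```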

Results (warm average over runs 2–3; run 1 is the cold start)

  Model         Warm tok/s  Cold tok/s  Cold Δ   Prompt tok/s  Load avg s  Gen avg s  Wall avg s  Resp chars
  gemma3:1b          263.6       285.3    +8%            4039        1.85       1.11        3.67        1595
  gemma3:4b          153.0       148.6    −3%            2341        3.66       1.98        6.36        1514
  mistral:7b         130.6       130.5    −0%            2132        1.65       2.30        3.99        1366
  llama3.1:8b        123.0       123.1    +0%            2184        2.50       2.44        5.03        1497
  gemma3:12b          68.8        68.9    +0%            1185        6.89       4.36       11.98        1463
  qwen2.5:14b         56.5        56.5     0%            1338        3.49       5.31        8.94        1590
  gemma3:27b           9.1         9.0    −0%             142       14.48      33.14       48.66        1564

Cold Δ = the cold run's generation throughput relative to the warm average. Values near zero show that generation speed is stable once the weights are resident; model load overhead itself appears separately in the Load column.
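The Cold Δ column is the percent change of cold throughput versus the warm average, e.g. for the gemma3:4b row:

```python
def cold_delta(cold_tps: float, warm_tps: float) -> int:
    """Percent change of the cold run's throughput versus the warm average."""
    return round((cold_tps - warm_tps) / warm_tps * 100)

print(cold_delta(148.6, 153.0))  # -3  (gemma3:4b: cold run ~3% slower)
```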

API Quick Start

List available models:

curl https://chat.reneau.me/api/tags

Chat completion (OpenAI-compatible):

curl https://chat.reneau.me/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":true}'
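The same chat completion from Python using only the standard library. This is a sketch: whether the endpoint requires an Authorization header depends on the Open WebUI auth configuration, which is not shown here.

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,  # set True for SSE token streaming
}

req = urllib.request.Request(
    "https://chat.reneau.me/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

def send(request):
    """Perform the request; returns the assistant message (requires network)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# send(req)  # uncomment to actually call the endpoint
```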