# Self-Hosted LLM (Ollama)
Run local LLMs for data privacy. Recommended when prompts must stay on your infrastructure.
By default, the app uses Anthropic (direct API, Claude Sonnet 4) when `ANTHROPIC_API_KEY` is set; otherwise it falls back to Open Router or Ollama. Prefer the Anthropic provider for the best integration, and use Open Router only when an Anthropic key is unavailable. When data privacy is a priority, we recommend self-hosting with Ollama: prompts and responses stay on your infrastructure, with no third-party API calls.
This guide covers Ollama—a local inference runtime with an OpenAI-compatible API—for both development and production.
## Why Self-Host for Data Privacy
| Concern | Cloud APIs (OpenAI, Anthropic, etc.) | Self-Hosted (Ollama) |
|---|---|---|
| Data leaves your infra | Yes—sent to vendor | No—stays local |
| Vendor retention | Varies by ToS | None |
| Compliance (HIPAA, SOC2) | Requires BAA, audits | Full control |
| Cost at scale | Per-token pricing | Fixed (hardware) |
| Latency | Network round-trip | Local inference |
Use self-hosted when: handling PII, sensitive business data, or regulated workloads. Use cloud APIs for rapid prototyping or when you need the latest frontier models.
## Development Setup (Simple)
For local development, run Ollama on your machine. No Nginx or TLS required.
### 1. Install Ollama
```shell
curl -fsSL https://ollama.com/install.sh | sh
```

### 2. Pull a Small Model
```shell
ollama pull qwen2.5:3b
ollama run qwen2.5:3b "Say hello"
```

### 3. Point Your App to Local Ollama
Ollama exposes an OpenAI-compatible API at http://localhost:11434. Configure your app:
```shell
# .env.local or .env
OLLAMA_BASE_URL=http://localhost:11434
```

If using Open Router or a custom provider, set the base URL to `http://localhost:11434/v1` for OpenAI-style endpoints.
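As a quick end-to-end check, you can query the OpenAI-compatible chat completions route directly (a sketch assuming `qwen2.5:3b` is already pulled and `jq` is installed):

```shell
# Send a chat request to Ollama's OpenAI-compatible endpoint
# and extract just the assistant's reply text.
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5:3b",
    "messages": [{"role": "user", "content": "Say hello"}]
  }' | jq -r '.choices[0].message.content'
```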
### Dev Hardware
- CPU-only: 8 GB RAM minimum; 16 GB recommended for 3B models
- GPU (NVIDIA): 6 GB VRAM for 7B models; 8 GB+ for 13B
- Model sizes: `qwen2.5:3b` (~2 GB), `llama3.2:3b` (~2 GB), `llama3.1:8b` (~4.7 GB)
## Production Setup
For production, run Ollama on a server behind Nginx with TLS. Bind Ollama to localhost only; expose only through the reverse proxy.
### Recommended Hardware (Production)
| Use Case | RAM | GPU | Example Models |
|---|---|---|---|
| Demo / low traffic | 16 GB | Optional (CPU OK) | qwen2.5:3b, llama3.2:3b |
| Chat, moderate traffic | 32 GB | 12–24 GB VRAM | llama3.1:8b, mistral:7b |
| Heavy inference | 64 GB+ | 24 GB+ VRAM | llama3.1:70b, mixtral:8x7b |
CPU-only: Expect 2–10 tokens/sec for 3B models; 7B+ models are slow without GPU. For production chat, a GPU is strongly recommended.
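To sanity-check throughput on your own hardware before committing to a model size, `ollama run --verbose` prints timing statistics after the response, including the eval rate in tokens/s (the prompt here is arbitrary):

```shell
# Run one generation and inspect the timing stats that
# --verbose appends, notably "eval rate" (tokens/s).
ollama run qwen2.5:3b --verbose "Summarize the benefits of self-hosting in one sentence."
```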
### 1. Install Ollama on the Server
```shell
curl -fsSL https://ollama.com/install.sh | sh
```

### 2. Configure systemd (Low-Memory Mode)
Create an override to limit memory usage when coexisting with other workloads:
```shell
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_CONTEXT_LENGTH=2048"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl enable ollama
```

### 3. CI / Parallel Workflows
When Ollama is used as the CI AI provider (e.g. an `OLLAMA_BASE_URL` secret in GitHub Actions), `api-e2e` and `web-e2e` can run in parallel and both hit the same server. With `OLLAMA_NUM_PARALLEL=1`, requests serialize and tests may time out.
For CI hosts, increase parallelism if the machine has enough RAM:
```shell
# In /etc/systemd/system/ollama.service.d/override.conf
Environment="OLLAMA_NUM_PARALLEL=2"  # or 4 if RAM allows
```

Keep `OLLAMA_MAX_LOADED_MODELS=1` to avoid loading multiple models at once.
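A rough way to verify the concurrency setting is to fire two requests at once and compare the wall-clock time against a single request: with `OLLAMA_NUM_PARALLEL=1` the second request queues behind the first, so the total is roughly double. This sketch assumes `qwen2.5:3b` is pulled on the CI host:

```shell
# Launch two generation requests in parallel and time them.
# If requests serialize, total time ~= 2x a single request.
time (
  for i in 1 2; do
    curl -s http://127.0.0.1:11434/api/generate \
      -d '{"model":"qwen2.5:3b","prompt":"Count to five.","stream":false}' \
      -o /dev/null &
  done
  wait
)
```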
### 4. Install Nginx and Create Site Config
```shell
sudo apt install nginx
```

Create `/etc/nginx/sites-available/ollama.example.com`:
```nginx
server {
    listen 80;
    listen [::]:80;
    server_name ollama.example.com;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Host 127.0.0.1:11434;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_buffering off;
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
        proxy_send_timeout 300s;
        chunked_transfer_encoding on;
    }
}
```

**Important**: Use `proxy_set_header Host 127.0.0.1:11434`; Ollama returns 403 for unknown Host headers.
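You can sanity-check the Host-header behavior directly against the local API; per the note above, a request carrying an unrecognized Host header should be rejected with 403, while the default succeeds:

```shell
# Default Host header (127.0.0.1:11434): expected to succeed.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:11434/api/tags

# Spoofed Host header: per the note above, Ollama rejects it.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'Host: ollama.example.com' http://127.0.0.1:11434/api/tags
```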
```shell
sudo ln -sf /etc/nginx/sites-available/ollama.example.com /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```

### 5. DNS Configuration
Before TLS, the domain must resolve to your server. Add a record in your DNS provider (Cloudflare, Namecheap, Google Domains, etc.):
**Option A — A record (recommended)**
| Type | Name/Host | Value | TTL |
|---|---|---|---|
| A | ollama | <your-server-public-ip> | 300 |
If your provider uses full hostnames for the name field, use `ollama.example.com` instead of `ollama`.
Find your server's public IP:
```shell
curl -s ifconfig.me
# or
curl -s icanhazip.com
```

**Option B — CNAME (if you have an existing A record)**
If you already have an A record for the server (e.g. server.example.com):
| Type | Name | Value |
|---|---|---|
| CNAME | ollama | server.example.com |
Verify propagation:
```shell
dig +short ollama.example.com
# or
nslookup ollama.example.com
```

Wait 5–15 minutes (up to 48 hours in rare cases) before running Certbot.
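Rather than checking manually, you can poll until the record resolves before moving on to Certbot (a bounded loop so it gives up after about ten minutes; adjust the hostname to your domain):

```shell
# Poll DNS every 30s, up to 20 attempts (~10 minutes).
for i in $(seq 1 20); do
  if dig +short ollama.example.com | grep -q .; then
    echo "DNS resolves: $(dig +short ollama.example.com)"
    break
  fi
  echo "Waiting for DNS propagation ($i/20)..."
  sleep 30
done
```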
### 6. TLS with Let's Encrypt
```shell
sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d ollama.example.com --non-interactive --agree-tos --email admin@example.com
```

Certbot adds HTTPS and an HTTP→HTTPS redirect. Auto-renewal runs via `certbot.timer`.
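To confirm renewal will work unattended, Certbot supports a dry run against the staging environment, and the systemd timer can be inspected directly:

```shell
# Simulate a renewal without touching the real certificate.
sudo certbot renew --dry-run

# Confirm the renewal timer is scheduled.
systemctl list-timers certbot.timer
```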
### 7. Security
- **Ollama**: Binds to `127.0.0.1:11434` by default; not exposed publicly
- **Auth**: Add Nginx `auth_basic` or API-key validation if exposing beyond trusted users
- **Firewall**: Allow 80/443; do not open 11434
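As a sketch of the `auth_basic` option: create a credentials file with `htpasswd` and reference it from the site config. The username `ollama-user` and the config lines in the comments are illustrative placeholders:

```shell
# htpasswd ships with apache2-utils on Debian/Ubuntu.
sudo apt install apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd ollama-user

# Then add inside the location / block of the site config:
#   auth_basic "Ollama";
#   auth_basic_user_file /etc/nginx/.htpasswd;
sudo nginx -t && sudo systemctl reload nginx

# Clients authenticate with -u:
curl -s -u ollama-user:yourpassword https://ollama.example.com/api/tags
```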
## Model Recommendations
| Model | Size | Use Case |
|---|---|---|
| qwen2.5:3b | ~2 GB | Dev, demos, low-resource |
| llama3.2:3b | ~2 GB | Dev, demos |
| llama3.1:8b | ~4.7 GB | Production chat (GPU recommended) |
| mistral:7b | ~4.1 GB | Production chat |
| llama3.1:70b | ~40 GB | Heavy workloads (24 GB+ VRAM) |
## Validation
```shell
# Service
systemctl status ollama

# Models
ollama list

# Local API
curl -s http://127.0.0.1:11434/api/tags

# Via Nginx (after TLS)
curl -s https://ollama.example.com/api/tags
curl -s https://ollama.example.com/api/generate -d '{"model":"qwen2.5:3b","prompt":"Hello","stream":false}'
```

## AI Provider in Fastify
The Fastify API supports Anthropic (default), Open Router, and Ollama. Provider precedence (when `AI_PROVIDER` is unset): Anthropic direct API → Open Router → Ollama. Use Anthropic when possible; Open Router is a fallback. Privacy-first routing with Ollama requires `AI_PROVIDER=ollama` when other provider keys exist. See `getResolvedProvider()` for the exact logic.
### Routes
| Route | Use case |
|---|---|
| `POST /ai/chat` | Chat UI, messages, tools (`getAccountInfo`, `braveSearch`) |
| `POST /ai/generate` | CLI, scripts, pipelines: single prompt, plain-text SSE |
Both routes use the same provider; clients call the Fastify API.
### Environment Variables
At least one of `ANTHROPIC_API_KEY`, `OPEN_ROUTER_API_KEY`, or `OLLAMA_BASE_URL` must be set; `OLLAMA_BASE_URL` is used for self-hosted/private setups.
| Variable | Required | Default | Description |
|---|---|---|---|
| `ANTHROPIC_API_KEY` | Conditional (default) | — | Anthropic API key (preferred); Claude Sonnet 4 via direct API |
| `OPEN_ROUTER_API_KEY` | Conditional (fallback) | — | Open Router API key when Anthropic is unavailable |
| `OLLAMA_BASE_URL` | Conditional | `http://localhost:11434` | Ollama server URL when self-hosting |
| `AI_PROVIDER` | Conditional | Inferred | `anthropic`, `openrouter`, or `ollama`; inferred when only one provider is configured, must be set when multiple vars are present |
| `AI_DEFAULT_MODEL` | No | Varies by provider | Anthropic/Open Router: Claude Sonnet 4; Ollama: `qwen3:8b`; use `openrouter/free` for the Open Router free tier |
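For example, a privacy-first deployment that forces Ollama even when other keys are present might use the following (hostname and model are placeholders):

```shell
# .env: pin the provider to the self-hosted server
AI_PROVIDER=ollama
OLLAMA_BASE_URL=https://ollama.example.com
AI_DEFAULT_MODEL=llama3.1:8b
```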
### Provider Selection
- If `AI_PROVIDER` is set, use it (requires the corresponding env var).
- Else if `ANTHROPIC_API_KEY` is set, use Anthropic.
- Else if `OPEN_ROUTER_API_KEY` is set, use Open Router.
- Else if `OLLAMA_BASE_URL` is set, use Ollama.
- Else the server returns 500 with "No AI provider configured".
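The precedence above can be sketched as a small shell function; this mirrors the documented order, not the actual `getResolvedProvider()` implementation:

```shell
# Resolve the AI provider using the documented precedence order.
resolve_provider() {
  if [ -n "$AI_PROVIDER" ]; then
    echo "$AI_PROVIDER"
  elif [ -n "$ANTHROPIC_API_KEY" ]; then
    echo "anthropic"
  elif [ -n "$OPEN_ROUTER_API_KEY" ]; then
    echo "openrouter"
  elif [ -n "$OLLAMA_BASE_URL" ]; then
    echo "ollama"
  else
    echo "No AI provider configured" >&2
    return 1
  fi
}

# An Anthropic key wins over an Ollama URL unless AI_PROVIDER
# pins the choice explicitly (subshells keep the examples isolated):
( ANTHROPIC_API_KEY=sk-test OLLAMA_BASE_URL=http://localhost:11434 resolve_provider )
( AI_PROVIDER=ollama ANTHROPIC_API_KEY=sk-test resolve_provider )
```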
### Tool Support
Open Router models support tools (`getAccountInfo`, `braveSearch`). Small Ollama models (e.g. `qwen2.5:3b`) may not; use 7B+ models such as `llama3.1:8b` or `mistral:7b` for reliable tool calling.
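To probe whether a given model actually emits tool calls, you can send a throwaway tool definition to Ollama's `/api/chat` endpoint; `get_weather` here is a hypothetical tool defined only for the probe. Capable models return a `message.tool_calls` array instead of plain text:

```shell
# Offer the model a single tool and check whether it calls it.
curl -s http://127.0.0.1:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}' | jq '.message.tool_calls'
```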
### Optional: Brave Search (Web Search Tool)
Set `BRAVE_SEARCH_API_KEY` to enable the web search tool. The assistant can then search the web when users ask for current or recent information. Free tier: 2,000 queries/month. See Brave Search API.
## Related Documentation
- Vercel Deployment — Frontend and API deployment
- Portability Strategy — Migration paths