
Self-Hosted LLM (Ollama)

Run local LLMs for data privacy. Recommended when prompts must stay on your infrastructure.

By default, the app uses Anthropic (direct API, Claude Sonnet 4) when ANTHROPIC_API_KEY is set. Otherwise it falls back to Open Router or Ollama. Prefer the Anthropic provider for best integration; use Open Router only when an Anthropic key is unavailable. We recommend self-hosting with Ollama when data privacy is a priority—prompts and responses stay on your infrastructure, with no third-party API calls.

This guide covers Ollama—a local inference runtime with an OpenAI-compatible API—for both development and production.

Why Self-Host for Data Privacy

Concern                  | Cloud APIs (OpenAI, Anthropic, etc.) | Self-Hosted (Ollama)
Data leaves your infra   | Yes—sent to vendor                   | No—stays local
Vendor retention         | Varies by ToS                        | None
Compliance (HIPAA, SOC2) | Requires BAA, audits                 | Full control
Cost at scale            | Per-token pricing                    | Fixed (hardware)
Latency                  | Network round-trip                   | Local inference

Use self-hosted when: handling PII, sensitive business data, or regulated workloads. Use cloud APIs for rapid prototyping or when you need the latest frontier models.


Development Setup (Simple)

For local development, run Ollama on your machine. No Nginx or TLS required.

1. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

2. Pull a Small Model

ollama pull qwen2.5:3b
ollama run qwen2.5:3b "Say hello"

3. Point Your App to Local Ollama

Ollama exposes an OpenAI-compatible API at http://localhost:11434. Configure your app:

# .env.local or .env
OLLAMA_BASE_URL=http://localhost:11434

If your client uses an OpenAI-compatible SDK (e.g. the Open Router provider or a custom provider), set the base URL to http://localhost:11434/v1 for OpenAI-style endpoints.
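To sanity-check the OpenAI-compatible endpoint, you can hit it directly with curl (this assumes the server is running and qwen2.5:3b has been pulled):

```sh
# List available models via the OpenAI-compatible endpoint
curl -s http://localhost:11434/v1/models

# Minimal chat completion against the /v1 API
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:3b", "messages": [{"role": "user", "content": "Say hello"}]}'
```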

Dev Hardware

  • CPU-only: 8 GB RAM minimum; 16 GB recommended for 3B models
  • GPU (NVIDIA): 6 GB VRAM for 7B models; 8 GB+ for 13B
  • Model sizes: qwen2.5:3b (~2 GB), llama3.2:3b (~2 GB), llama3.1:8b (~4.7 GB)

Production Setup

For production, run Ollama on a server behind Nginx with TLS. Bind Ollama to localhost only; expose only through the reverse proxy.

Use Case               | RAM    | GPU               | Example Models
Demo / low traffic     | 16 GB  | Optional (CPU OK) | qwen2.5:3b, llama3.2:3b
Chat, moderate traffic | 32 GB  | 12–24 GB VRAM     | llama3.1:8b, mistral:7b
Heavy inference        | 64 GB+ | 24 GB+ VRAM       | llama3.1:70b, mixtral:8x7b

CPU-only: Expect 2–10 tokens/sec for 3B models; 7B+ models are slow without GPU. For production chat, a GPU is strongly recommended.
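To see what your hardware actually delivers, run a prompt with --verbose; Ollama prints timing stats, including the eval rate in tokens/sec, after the response:

```sh
# Reports load time, prompt eval rate, and eval rate (tokens/sec)
ollama run qwen2.5:3b --verbose "Summarize HTTP in one sentence."
```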

1. Install Ollama on the Server

curl -fsSL https://ollama.com/install.sh | sh

2. Configure systemd (Low-Memory Mode)

Create an override to limit memory usage when coexisting with other workloads:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_CONTEXT_LENGTH=2048"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl enable ollama
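To confirm the override took effect, inspect the unit's resolved environment:

```sh
# Should list OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL, etc.
systemctl show ollama --property=Environment

# Check the service is running and watch its memory footprint
systemctl status ollama
```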

3. CI / Parallel Workflows

When Ollama is used as the CI AI provider (e.g. OLLAMA_BASE_URL secret in GitHub Actions), api-e2e and web-e2e can run in parallel and both hit the same server. With OLLAMA_NUM_PARALLEL=1, requests serialize and tests may time out.

For CI hosts, increase parallelism if the machine has enough RAM:

# In /etc/systemd/system/ollama.service.d/override.conf
Environment="OLLAMA_NUM_PARALLEL=2"   # or 4 if RAM allows

Keep OLLAMA_MAX_LOADED_MODELS=1 to avoid loading multiple models at once.
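After editing the override file, reload systemd and restart Ollama so the new parallelism takes effect:

```sh
sudo systemctl daemon-reload
sudo systemctl restart ollama
```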

4. Install Nginx and Create Site Config

sudo apt install nginx

Create /etc/nginx/sites-available/ollama.example.com:

server {
    listen 80;
    listen [::]:80;
    server_name ollama.example.com;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Host 127.0.0.1:11434;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_buffering off;
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
        proxy_send_timeout 300s;
        chunked_transfer_encoding on;
    }
}

Important: Use proxy_set_header Host 127.0.0.1:11434—Ollama returns 403 for unknown Host headers.

sudo ln -sf /etc/nginx/sites-available/ollama.example.com /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
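Before requesting a certificate, you can verify the proxy over plain HTTP (substitute your domain; the second form uses curl's --resolve to bypass DNS and hit the local Nginx directly):

```sh
# Test the proxied API over HTTP (requires DNS to resolve already)
curl -s http://ollama.example.com/api/tags

# Or bypass DNS: resolve the domain to localhost while keeping the Host header
curl -s --resolve ollama.example.com:80:127.0.0.1 http://ollama.example.com/api/tags
```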

5. DNS Configuration

Before TLS, the domain must resolve to your server. Add a record in your DNS provider (Cloudflare, Namecheap, Google Domains, etc.):

Option A — A record (recommended)

Type | Name/Host | Value                   | TTL
A    | ollama    | <your-server-public-ip> | 300

If your provider uses full hostnames for the name field, use ollama.example.com instead of ollama.

Find your server's public IP:

curl -s ifconfig.me
# or
curl -s icanhazip.com

Option B — CNAME (if you have an existing A record)

If you already have an A record for the server (e.g. server.example.com):

Type  | Name   | Value
CNAME | ollama | server.example.com

Verify propagation:

dig +short ollama.example.com
# or
nslookup ollama.example.com

Wait 5–15 minutes (up to 48 hours in rare cases) before running Certbot.

6. TLS with Let's Encrypt

sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d ollama.example.com --non-interactive --agree-tos --email admin@example.com

Certbot adds HTTPS and HTTP→HTTPS redirect. Auto-renewal runs via certbot.timer.
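To verify that renewal will work before the certificate nears expiry:

```sh
# Confirm the renewal timer is scheduled
systemctl list-timers certbot.timer

# Simulate a renewal without touching real certificates
sudo certbot renew --dry-run
```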

7. Security

  • Ollama: Binds to 127.0.0.1:11434 by default—not exposed publicly
  • Auth: Add Nginx auth_basic or API-key validation if exposing beyond trusted users
  • Firewall: Allow 80/443; do not open 11434
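If you manage the firewall with ufw (an assumption; adapt to your tooling), a minimal ruleset matching the above looks like:

```sh
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable

# Verify: 11434 should not appear in the rules
sudo ufw status
```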

Model Recommendations

Model        | Size    | Use Case
qwen2.5:3b   | ~2 GB   | Dev, demos, low-resource
llama3.2:3b  | ~2 GB   | Dev, demos
llama3.1:8b  | ~4.7 GB | Production chat (GPU recommended)
mistral:7b   | ~4.1 GB | Production chat
llama3.1:70b | ~40 GB  | Heavy workloads (24 GB+ VRAM)

Validation

# Service
systemctl status ollama

# Models
ollama list

# Local API
curl -s http://127.0.0.1:11434/api/tags

# Via Nginx (after TLS)
curl -s https://ollama.example.com/api/tags
curl -s https://ollama.example.com/api/generate -d '{"model":"qwen2.5:3b","prompt":"Hello","stream":false}'

AI Provider in Fastify

The Fastify API supports Anthropic (default), Open Router, and Ollama. Provider precedence (when AI_PROVIDER is unset): Anthropic direct API → Open Router → Ollama. Use Anthropic when possible; Open Router is a fallback. Privacy-first routing with Ollama requires AI_PROVIDER=ollama when other keys exist. See getResolvedProvider() for the exact logic.

Routes

Route             | Use case
POST /ai/chat     | Chat UI, messages, tools (getAccountInfo, braveSearch)
POST /ai/generate | CLI, scripts, pipelines—single prompt, plain text SSE

Both routes use the same provider; clients call the Fastify API.
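A quick smoke test of the generate route might look like the following; the host is a placeholder and the request body shape is an assumption — check the route's schema for the actual fields:

```sh
# Hypothetical request — replace api.example.com and verify the body schema
# -N disables curl's output buffering so the SSE stream prints as it arrives
curl -N https://api.example.com/ai/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Say hello"}'
```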

Environment Variables

At least one of ANTHROPIC_API_KEY, OPEN_ROUTER_API_KEY, or OLLAMA_BASE_URL must be set; OLLAMA_BASE_URL is used for self-hosted/private setups.

Variable            | Required               | Default                | Description
ANTHROPIC_API_KEY   | Conditional (default)  | (none)                 | Anthropic API key (preferred); Claude Sonnet 4 via direct API
OPEN_ROUTER_API_KEY | Conditional (fallback) | (none)                 | Open Router API key when Anthropic unavailable
OLLAMA_BASE_URL     | Conditional            | http://localhost:11434 | Ollama server URL when self-hosting
AI_PROVIDER         | Conditional            | Inferred               | anthropic, openrouter, or ollama; inferred when only one provider is configured, must be set when multiple vars are present
AI_DEFAULT_MODEL    | No                     | Varies by provider     | Anthropic/OpenRouter: Claude Sonnet 4; Ollama: qwen3:8b; use openrouter/free for OpenRouter free tier

Provider Selection

  1. If AI_PROVIDER is set, use it (requires the corresponding env var).
  2. Else if ANTHROPIC_API_KEY is set, use Anthropic.
  3. Else if OPEN_ROUTER_API_KEY is set, use Open Router.
  4. Else if OLLAMA_BASE_URL is set, use Ollama.
  5. Else the server returns 500 with "No AI provider configured".
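For privacy-first routing when cloud keys are also present, pin the provider explicitly in your environment:

```sh
# .env — force Ollama even if ANTHROPIC_API_KEY / OPEN_ROUTER_API_KEY are set
AI_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
```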

Tool Support

Open Router models support tools (getAccountInfo, braveSearch). Small Ollama models (e.g. qwen2.5:3b) may not; use 7B+ such as llama3.1:8b or mistral:7b for reliable tool calling.

Optional: Brave Search (Web Search Tool)

Set BRAVE_SEARCH_API_KEY to enable the web search tool. The assistant can then search the web when users ask for current or recent information. Free tier: 2,000 queries/month. See Brave Search API.

