
Self-Hosted LLM (Ollama)

Run local LLMs for data privacy. Recommended when prompts must stay on your infrastructure.

By default, the app uses Anthropic (direct API, Claude Sonnet 4) when ANTHROPIC_API_KEY is set. Otherwise it falls back to Open Router or Ollama. Prefer the Anthropic provider for best integration; use Open Router only when an Anthropic key is unavailable. We recommend self-hosting with Ollama when data privacy is a priority—prompts and responses stay on your infrastructure, with no third-party API calls.

This guide covers Ollama—a local inference runtime with an OpenAI-compatible API—for both development and production.

Why Self-Host for Data Privacy

Concern                  | Cloud APIs (OpenAI, Anthropic, etc.) | Self-Hosted (Ollama)
Data leaves your infra   | Yes—sent to vendor                   | No—stays local
Vendor retention         | Varies by ToS                        | None
Compliance (HIPAA, SOC2) | Requires BAA, audits                 | Full control
Cost at scale            | Per-token pricing                    | Fixed (hardware)
Latency                  | Network round-trip                   | Local inference

Use self-hosted when: handling PII, sensitive business data, or regulated workloads. Use cloud APIs for rapid prototyping or when you need the latest frontier models.


Development Setup (Simple)

For local development, run Ollama on your machine. No Nginx or TLS required.

1. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

2. Pull a Small Model

ollama pull qwen2.5:3b
ollama run qwen2.5:3b "Say hello"

3. Point Your App to Local Ollama

Ollama exposes an OpenAI-compatible API at http://localhost:11434. Configure your app:

# .env.local or .env
OLLAMA_BASE_URL=http://localhost:11434

If your client uses an OpenAI-compatible SDK (e.g. the Open Router provider or a custom provider), set the base URL to http://localhost:11434/v1 for OpenAI-style endpoints.
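To sanity-check the OpenAI-compatible endpoint, you can hit it directly with curl (this assumes the server is running and qwen2.5:3b has been pulled):

```sh
# List available models via the OpenAI-compatible endpoint
curl -s http://localhost:11434/v1/models

# Minimal chat completion against the /v1 API
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:3b", "messages": [{"role": "user", "content": "Say hello"}]}'
```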

Dev Hardware

  • CPU-only: 8 GB RAM minimum; 16 GB recommended for 3B models
  • GPU (NVIDIA): 6 GB VRAM for 7B models; 8 GB+ for 13B
  • Model sizes: qwen2.5:3b (~2 GB), llama3.2:3b (~2 GB), llama3.1:8b (~4.7 GB)

Production Setup

For production, run Ollama on a server behind Nginx with TLS. Bind Ollama to localhost only; expose only through the reverse proxy.

Use Case               | RAM    | GPU               | Example Models
Demo / low traffic     | 16 GB  | Optional (CPU OK) | qwen2.5:3b, llama3.2:3b
Chat, moderate traffic | 32 GB  | 12–24 GB VRAM     | llama3.1:8b, mistral:7b
Heavy inference        | 64 GB+ | 24 GB+ VRAM       | llama3.1:70b, mixtral:8x7b

CPU-only: Expect 2–10 tokens/sec for 3B models; 7B+ models are slow without GPU. For production chat, a GPU is strongly recommended.
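To see what your hardware actually delivers, run a prompt with --verbose; Ollama prints timing stats, including the eval rate in tokens/sec, after the response:

```sh
# Reports load time, prompt eval rate, and eval rate (tokens/sec)
ollama run qwen2.5:3b --verbose "Summarize HTTP in one sentence."
```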

1. Install Ollama on the Server

curl -fsSL https://ollama.com/install.sh | sh

2. Configure systemd (Low-Memory Mode)

Create an override to limit memory usage when coexisting with other workloads:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_CONTEXT_LENGTH=2048"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl enable ollama
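To confirm the override took effect, inspect the unit's resolved environment:

```sh
# Should list OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL, etc.
systemctl show ollama --property=Environment

# Check the service is running and watch its memory footprint
systemctl status ollama
```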

3. CI / Parallel Workflows

When Ollama is used as the CI AI provider (e.g. OLLAMA_BASE_URL secret in GitHub Actions), api-e2e and web-e2e can run in parallel and both hit the same server. With OLLAMA_NUM_PARALLEL=1, requests serialize and tests may time out.

For CI hosts, increase parallelism if the machine has enough RAM:

# In /etc/systemd/system/ollama.service.d/override.conf
Environment="OLLAMA_NUM_PARALLEL=2"   # or 4 if RAM allows

Keep OLLAMA_MAX_LOADED_MODELS=1 to avoid loading multiple models at once.
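After editing the override file, reload systemd and restart Ollama so the new parallelism takes effect:

```sh
sudo systemctl daemon-reload
sudo systemctl restart ollama
```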

4. Install Nginx and Create Site Config

sudo apt install nginx

Create /etc/nginx/sites-available/ollama.example.com:

server {
    listen 80;
    listen [::]:80;
    server_name ollama.example.com;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Host 127.0.0.1:11434;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_buffering off;
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
        proxy_send_timeout 300s;
        chunked_transfer_encoding on;
    }
}

Important: Use proxy_set_header Host 127.0.0.1:11434—Ollama returns 403 for unknown Host headers.

sudo ln -sf /etc/nginx/sites-available/ollama.example.com /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
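Before requesting a certificate, you can verify the proxy over plain HTTP (substitute your domain; the second form uses curl's --resolve to bypass DNS and hit the local Nginx directly):

```sh
# Test the proxied API over HTTP (requires DNS to resolve already)
curl -s http://ollama.example.com/api/tags

# Or bypass DNS: resolve the domain to localhost while keeping the Host header
curl -s --resolve ollama.example.com:80:127.0.0.1 http://ollama.example.com/api/tags
```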

5. DNS Configuration

Before TLS, the domain must resolve to your server. Add a record in your DNS provider (Cloudflare, Namecheap, Google Domains, etc.):

Option A — A record (recommended)

Type | Name/Host | Value                   | TTL
A    | ollama    | <your-server-public-ip> | 300

If your provider uses full hostnames for the name field, use ollama.example.com instead of ollama.

Find your server's public IP:

curl -s ifconfig.me
# or
curl -s icanhazip.com

Option B — CNAME (if you have an existing A record)

If you already have an A record for the server (e.g. server.example.com):

Type  | Name   | Value
CNAME | ollama | server.example.com

Verify propagation:

dig +short ollama.example.com
# or
nslookup ollama.example.com

Wait 5–15 minutes (up to 48 hours in rare cases) before running Certbot.

6. TLS with Let's Encrypt

sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d ollama.example.com --non-interactive --agree-tos --email admin@example.com

Certbot adds HTTPS and HTTP→HTTPS redirect. Auto-renewal runs via certbot.timer.
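To verify that renewal will work before the certificate nears expiry:

```sh
# Confirm the renewal timer is scheduled
systemctl list-timers certbot.timer

# Simulate a renewal without touching real certificates
sudo certbot renew --dry-run
```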

7. Security

  • Ollama: Binds to 127.0.0.1:11434 by default—not exposed publicly
  • Auth: Add Nginx auth_basic or API-key validation if exposing beyond trusted users
  • Firewall: Allow 80/443; do not open 11434
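If you manage the firewall with ufw (an assumption; adapt to your tooling), a minimal ruleset matching the above looks like:

```sh
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable

# Verify: 11434 should not appear in the rules
sudo ufw status
```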

Model Recommendations

Model        | Size    | Use Case
qwen2.5:3b   | ~2 GB   | Dev, demos, low-resource
llama3.2:3b  | ~2 GB   | Dev, demos
llama3.1:8b  | ~4.7 GB | Production chat (GPU recommended)
mistral:7b   | ~4.1 GB | Production chat
llama3.1:70b | ~40 GB  | Heavy workloads (24 GB+ VRAM)

Validation

# Service
systemctl status ollama

# Models
ollama list

# Local API
curl -s http://127.0.0.1:11434/api/tags

# Via Nginx (after TLS)
curl -s https://ollama.example.com/api/tags
curl -s https://ollama.example.com/api/generate -d '{"model":"qwen2.5:3b","prompt":"Hello","stream":false}'

AI Provider in Fastify

The Fastify API supports Anthropic (default), Open Router, and Ollama. Provider precedence (when AI_PROVIDER is unset): Anthropic direct API → Open Router → Ollama. Use Anthropic when possible; Open Router is a fallback. Privacy-first routing with Ollama requires AI_PROVIDER=ollama when other keys exist. See getResolvedProvider() for the exact logic.

Routes

Route             | Use case
POST /ai/chat     | Chat UI, messages, tools (getAccountInfo, braveSearch)
POST /ai/generate | CLI, scripts, pipelines—single prompt, plain text SSE

Both routes use the same provider; clients call the Fastify API.
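A quick smoke test of the generate route might look like the following; the host is a placeholder and the request body shape is an assumption — check the route's schema for the actual fields:

```sh
# Hypothetical request — replace api.example.com and verify the body schema
# -N disables curl's output buffering so the SSE stream prints as it arrives
curl -N https://api.example.com/ai/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Say hello"}'
```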

Environment Variables

At least one of ANTHROPIC_API_KEY, OPEN_ROUTER_API_KEY, or OLLAMA_BASE_URL must be set; OLLAMA_BASE_URL is used for self-hosted/private setups.

Variable            | Required               | Default                | Description
ANTHROPIC_API_KEY   | Conditional (default)  | (none)                 | Anthropic API key (preferred); Claude Sonnet 4 via direct API
OPEN_ROUTER_API_KEY | Conditional (fallback) | (none)                 | Open Router API key when Anthropic unavailable
OLLAMA_BASE_URL     | Conditional            | http://localhost:11434 | Ollama server URL when self-hosting
AI_PROVIDER         | Conditional            | Inferred               | anthropic, openrouter, or ollama; inferred when only one provider is configured, must be set when multiple vars are present
AI_DEFAULT_MODEL    | No                     | Varies by provider     | Anthropic/OpenRouter: Claude Sonnet 4; Ollama: qwen3:8b; use openrouter/free for OpenRouter free tier

Provider Selection

  1. If AI_PROVIDER is set, use it (requires the corresponding env var).
  2. Else if ANTHROPIC_API_KEY is set, use Anthropic.
  3. Else if OPEN_ROUTER_API_KEY is set, use Open Router.
  4. Else if OLLAMA_BASE_URL is set, use Ollama.
  5. Else the server returns 500 with "No AI provider configured".
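For privacy-first routing when cloud keys are also present, pin the provider explicitly in your environment:

```sh
# .env — force Ollama even if ANTHROPIC_API_KEY / OPEN_ROUTER_API_KEY are set
AI_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
```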

Tool Support

Open Router models support tools (getAccountInfo, braveSearch). Small Ollama models (e.g. qwen2.5:3b) may not; use 7B+ such as llama3.1:8b or mistral:7b for reliable tool calling.

Optional: Brave Search (Web Search Tool)

Set BRAVE_SEARCH_API_KEY to enable the web search tool. The assistant can then search the web when users ask for current or recent information. Free tier: 2,000 queries/month. See Brave Search API.

