
How to Set Up a Local AI API Server for Your Team

By Georgia Thomas · 5 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Running a local LLM on your own machine is one thing. Sharing it with five teammates is a different problem entirely — and most guides never get to this part.

The setup isn't complicated, but the choices you make early determine whether your team gets a fast, reliable AI tool or a frustrating shared machine that falls over when two people use it at once.

This guide covers the full picture: hardware sizing for concurrent users, which serving software to pick, and how to expose it securely across your team.

Why Bother With a Shared Server vs Individual Installs

Individual installs are fine until they're not. Your developer doesn't want to babysit model downloads on six machines. Your legal team's sensitive documents shouldn't be processed through OpenAI's servers regardless of their retention policy. And enterprise API costs at even moderate usage — say, a 10-person team querying Claude or GPT-4 a hundred times a day each — run $400–800/month.

A shared local server means: one hardware investment, one model library, one maintenance burden, zero API costs, and complete data isolation.

API savings alone often pay back the hardware in 4–8 months for teams that use AI regularly.

Note

API cost comparison: A team sending 1,000 queries/day to GPT-4o averages ~$300–400/month depending on token length. A $1,200 local workstation running Qwen 2.5 14B handles the same volume at zero marginal cost after the hardware purchase.
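The payback math from the note above is simple enough to sketch. A quick calculation using the note's figures (a $1,200 build against roughly $350/month in API spend; substitute your own numbers):

```python
# Back-of-envelope payback calculation. Both figures are assumptions
# taken from the note above; plug in your own usage and pricing.
hardware_cost = 1200      # one-time workstation cost, USD
monthly_api_cost = 350    # midpoint of the $300-400/month GPT-4o estimate

payback_months = hardware_cost / monthly_api_cost
print(f"Payback in ~{payback_months:.1f} months")  # → Payback in ~3.4 months
```

At the note's assumed query volume, payback comes even faster than the 4–8 month figure; lighter usage or a pricier build stretches it toward the top of that range.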

Hardware Sizing for Concurrent Users

Concurrent users are where most people underestimate their requirements. It's not just about fitting the model — it's about handling parallel inference requests without grinding to a halt.

Each active user during inference adds to GPU load. With most frameworks, requests queue and process sequentially. So 4 people hitting the model simultaneously means the 4th user waits for the other 3 to finish their responses.
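The cost of sequential processing is easy to quantify. A sketch of the queueing arithmetic, using assumed throughput numbers (the response length and tokens-per-second figures are illustrative, not from any benchmark):

```python
# Sequential serving: each request waits for all earlier ones to finish.
# Assumed numbers for illustration only.
tokens_per_response = 500   # a typical chat-length answer
tokens_per_second = 40      # plausible 14B-model throughput on a 24GB GPU

response_time = tokens_per_response / tokens_per_second   # 12.5 s per answer
queue_position = 4                                        # the 4th simultaneous user
wait_before_start = (queue_position - 1) * response_time

print(f"User {queue_position} waits {wait_before_start:.1f}s "
      f"before generation even starts")  # → 37.5s
```

Half a minute of dead air per request is what turns a shared box into the "frustrating shared machine" scenario — and it's why batching schedulers like vLLM exist.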

Rough sizing guidelines: behavior degrades in stages as concurrency grows. At low concurrency, responses are smooth with minimal wait; with a few simultaneous users, expect an occasional queue at peak; under regular bursts, you'll need vLLM's batching to keep up; and under sustained concurrent load, vLLM is required.

For most small teams (2–6 people), a single RTX 4090 running a 14B or 32B model is perfectly adequate for asynchronous usage. People rarely query simultaneously in practice.

System RAM matters too. You need 32GB minimum; 64GB if you're loading large models or running CPU offloading as overflow. Fast NVMe storage cuts model load time dramatically — a 14B model loads in ~8 seconds from a PCIe 5.0 drive vs 35+ seconds from a 3.0 drive. See Best NVMe SSDs for Local LLM Workflows for specifics.

Choosing Your Serving Software

There are two main options, each serving a different purpose.

Ollama — The Easy Path

Ollama exposes a REST API on port 11434 by default. Any model you pull is immediately accessible via POST /api/chat or POST /api/generate. Pair it with Open WebUI and your team gets a ChatGPT-style interface without writing a line of code.

Setup on Linux:

curl -fsSL https://ollama.ai/install.sh | sh
ollama pull qwen2.5:14b

That's it. Ollama handles model management, GPU detection, and quantization automatically. It's the right choice for 95% of small teams.
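Once the model is pulled, the `/api/chat` endpoint mentioned above accepts a plain JSON body. A stdlib sketch of the request shape, assuming Ollama's default port (the prompt is just an example):

```python
import json
import urllib.request

# Minimal body for Ollama's POST /api/chat endpoint.
body = {
    "model": "qwen2.5:14b",
    "messages": [{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    "stream": False,   # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(body).encode(),          # data= makes this a POST
    headers={"Content-Type": "application/json"},
)

# With Ollama running, sending it is one call:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["message"]["content"])
```

The same endpoint is what Open WebUI talks to behind the scenes, so anything you script against it behaves identically to the chat interface.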

One limitation: Ollama is single-request sequential by default. It doesn't batch concurrent requests or use advanced scheduling. Under heavy concurrent load (5+ people simultaneously), you'll feel it.
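Recent Ollama releases can relax this somewhat: the `OLLAMA_NUM_PARALLEL` environment variable allows a few requests to be processed in parallel, at the cost of extra VRAM per slot. A sketch for a systemd-managed Linux install (the slot count of 4 is an assumption; tune it to your VRAM headroom):

```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
```

Run `sudo systemctl daemon-reload && sudo systemctl restart ollama` afterwards. This isn't a substitute for vLLM's scheduler, but it takes the edge off occasional overlap.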

vLLM — When You Need Real Concurrency

vLLM is what actual API providers use. It handles concurrent requests properly using PagedAttention — a memory management technique that lets it process multiple requests in parallel without VRAM overhead growing linearly with the number of requests.

For a 6-person team whose members all use the model heavily during the same work hours, vLLM is worth the additional setup complexity. It requires Docker and more configuration, but the throughput difference under concurrent load is significant.

docker run --runtime nvidia --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-14B-Instruct \
  --max-model-len 16384

The vLLM API is OpenAI-compatible. Any tool that works with OpenAI's API (including most AI coding assistants, Cursor, Continue.dev) points to your local endpoint instead.
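Because the endpoint speaks the OpenAI wire format, pointing any client at it is just a matter of changing the base URL. A stdlib sketch against the server started above (localhost:8000 matches the docker command; vLLM ignores the Authorization header unless you start it with `--api-key`):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # your local vLLM endpoint

def build_chat_request(prompt, model="Qwen/Qwen2.5-14B-Instruct"):
    """Build an OpenAI-style chat completion request for the local endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # placeholder; vLLM doesn't check it by default
        },
    )

# With the server up, sending is one call:
# with urllib.request.urlopen(build_chat_request("Hello")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Tools like Continue.dev or any OpenAI SDK work the same way: set the base URL to `http://your-server-ip:8000/v1` and use any non-empty string as the API key.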

Tip

Start with Ollama + Open WebUI. It's faster to deploy, easier to maintain, and sufficient for most small teams. Migrate to vLLM only when you hit real concurrency limits — not in anticipation of them.

Deploying Open WebUI

Open WebUI gives your team a browser-based chat interface. Non-technical users love it because it feels like ChatGPT. It handles multi-user accounts, conversation history, and model switching.

Docker deployment (recommended):

docker run -d \
  -p 3000:80 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Your team accesses it at http://your-server-ip:3000. First user to register becomes admin.

Network Access and Basic Security

Don't expose port 11434 (Ollama's API) directly to the internet. It has no authentication by default. Instead:

  • Put Ollama and Open WebUI behind a reverse proxy (Nginx or Caddy)
  • Add HTTPS via Let's Encrypt if accessible from outside your office network
  • Use Open WebUI's built-in user accounts for team authentication
  • Restrict access to your internal IP range at the network level if staying LAN-only

For a team on the same local network, the simplest setup is LAN-only access with no external exposure. Users connect to the server's IP address directly.
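If you do need HTTPS access from outside the LAN, Caddy is the shortest path to the reverse-proxy-plus-Let's-Encrypt setup described above. A minimal Caddyfile sketch, assuming a hypothetical public hostname (Caddy provisions and renews the certificate automatically):

```
ai.example.com {
    reverse_proxy localhost:3000   # Open WebUI only; never proxy port 11434 directly
}
```

Open WebUI's own account system then handles authentication at the application layer, while Caddy handles TLS.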

Caution

Don't skip authentication. An unauthenticated Ollama endpoint on your LAN means anyone on the network can query your models. Enable Open WebUI user accounts or add Nginx basic auth before sharing the URL with your team.

For a 2–8 person team on a budget:

  • Hardware: $1,200–2,000 local workstation build (used RTX 3090 at the low end, RTX 4090 at the high end)
  • OS: Ubuntu 22.04 or 24.04 (best driver support)
  • Model: Qwen 2.5 14B or 32B at Q4_K_M (solid quality, fits in 24GB)
  • Serving: Ollama
  • Interface: Open WebUI
  • Access: LAN-only, Open WebUI accounts for auth

Total cost: $1,200–2,000 hardware, zero ongoing. Versus $300–500/month in API costs for a team using Claude or GPT-4 regularly. The business ROI for local LLMs is clearer than most people realize.

The server doesn't have to run 24/7 either. If your team only uses it during work hours, sleep/wake on demand is fine. If you need always-on availability, budget for the electricity — a single RTX 4090 system draws roughly 400–500W under load, which is $40–60/month depending on your electricity rates.
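The electricity figure is worth checking against your own rates. A worst-case sketch that assumes continuous load around the clock (real usage will idle much lower):

```python
# Always-on power cost estimate. Assumed draw and rate; substitute your own.
avg_draw_watts = 450     # RTX 4090 system under sustained load (pessimistic 24/7)
hours_per_month = 730
rate_per_kwh = 0.15      # USD; varies widely by region

kwh = avg_draw_watts / 1000 * hours_per_month   # 328.5 kWh
monthly_cost = kwh * rate_per_kwh
print(f"~${monthly_cost:.0f}/month at ${rate_per_kwh}/kWh")  # → ~$49/month
```

Idle draw for the same system is typically well under 100W, so a server that mostly waits for queries lands far below this ceiling.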
