CraftRigs
Architecture Guide

How to Use Your Gaming PC as a Local LLM API Server (Home Lab Setup)

By Georgia Thomas · 5 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Your gaming PC is sitting at 5% CPU utilization most of the day. If it has 8GB or more VRAM, it can be a local AI API server that every other device in your house — your laptop, your phone via a web client, your other PC — can query without sending a single byte to OpenAI.

This is not a complicated setup. Ollama and LM Studio both expose OpenAI-compatible API endpoints out of the box. The main thing most guides miss: by default, these servers only accept connections from the local machine. Getting them to listen on your LAN requires one environment variable or one checkbox, and that change opens up the entire use case.

Quick Summary

  • Ollama: Set OLLAMA_HOST=0.0.0.0:11434 to accept LAN connections — default only listens on localhost
  • Security: Bind to your LAN IP only, block with ufw — these servers have no authentication
  • Client compatibility: Any tool with an OpenAI-compatible base URL works — Continue.dev, OpenWebUI, custom Python scripts

Hardware Prerequisites

Any modern gaming GPU with 8GB+ VRAM works. The server does not need to run 24/7 — you can start it when you want it available and stop it when not in use. Minimum viable setup:

  • GPU: 8GB VRAM (RTX 3060 8GB, RTX 4060, RX 7600, etc.)
  • RAM: 16GB system RAM minimum; 32GB if you plan to offload layers to the CPU
  • CPU: Any modern gaming CPU — Ryzen 5/7, Intel Core 12th gen+ — inference is GPU-bound
  • OS: Windows or Linux both work; Linux is easier to configure as always-available

For a primer on which GPU specs matter most for inference, see our "how much VRAM do you need" guide. For a step-by-step LM Studio server setup specifically, see our LM Studio tutorial.


Setting Up Ollama as a LAN API Server

Ollama is the most straightforward option. It handles model downloads, CUDA/ROCm setup, and OpenAI-compatible routing with minimal configuration.

Installation

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com. It installs as a background service.

After installation, pull a model:

ollama pull llama3.1:8b

Making Ollama Listen on LAN

By default, Ollama binds to 127.0.0.1:11434. To accept connections from other devices:

Linux (systemd service):

Create an override file:

sudo systemctl edit ollama.service

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Restart:

sudo systemctl restart ollama

Linux (manual / in your shell profile):

export OLLAMA_HOST=0.0.0.0:11434
ollama serve

Windows:

Add OLLAMA_HOST as a system environment variable via Control Panel → System → Advanced System Settings → Environment Variables. Set it to 0.0.0.0:11434. Restart the Ollama service from the system tray.

Verify It Is Listening

From another device on your network, find your server's LAN IP first:

# Linux
ip addr show | grep "inet " | grep -v 127

# Windows
ipconfig | findstr IPv4

Then test from any device:

curl http://192.168.1.100:11434/api/tags

You should get a JSON response listing your installed models.
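The same check can be scripted so any client machine can confirm the server is reachable before pointing tools at it. A minimal sketch using only the Python standard library; the server address is an assumption for this example network:

```python
import json
import urllib.request

def parse_tags(payload: dict) -> list[str]:
    """Extract installed model names from Ollama's /api/tags response."""
    return [m["name"] for m in payload.get("models", [])]

def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query a (possibly remote) Ollama server for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_tags(json.load(resp))

# Example: print(list_models("http://192.168.1.100:11434"))
```

If this raises a connection error from another device but works on the server itself, the `OLLAMA_HOST` change has not taken effect.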


Setting Up LM Studio as a LAN API Server

LM Studio's server mode is configured in the GUI, which makes it accessible for users who prefer not to touch environment variables.

  1. Open LM Studio
  2. Click the server icon in the left sidebar (looks like <->)
  3. Under "Server Address", change localhost to 0.0.0.0
  4. Set port to 1234 (default) or any open port
  5. Load your model
  6. Click "Start Server"

LM Studio exposes an OpenAI-compatible API at http://YOUR-LAN-IP:1234/v1.
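Because the API speaks the OpenAI wire format, you can verify it the same way you would verify any OpenAI-compatible endpoint: list the models at /v1/models. A stdlib-only sketch; the IP and port are assumptions for this example network:

```python
import json
import urllib.request

def parse_openai_models(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def lm_studio_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """List models served by LM Studio (or any OpenAI-compatible server)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return parse_openai_models(json.load(resp))

# Example: print(lm_studio_models("http://192.168.1.100:1234"))
```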


Connecting Clients to Your Local API Server

Continue.dev (VS Code AI Coding Assistant)

Continue.dev is one of the best reasons to run a local API server — it gives you GitHub Copilot-style autocomplete and chat in VS Code without any cloud dependency.

Install the Continue extension in VS Code, then edit ~/.continue/config.json:

{
  "models": [
    {
      "title": "Home Lab Llama 3.1 8B",
      "provider": "ollama",
      "model": "llama3.1:8b",
      "apiBase": "http://192.168.1.100:11434"
    }
  ]
}

For LM Studio:

{
  "models": [
    {
      "title": "Home Lab LM Studio",
      "provider": "openai",
      "model": "lmstudio-model",
      "apiBase": "http://192.168.1.100:1234/v1",
      "apiKey": "lm-studio"
    }
  ]
}

OpenWebUI (Browser-Based Chat Interface)

OpenWebUI runs as a Docker container and auto-discovers local Ollama instances. It provides a ChatGPT-like web interface accessible from any device on your LAN.

docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://192.168.1.100:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Access it at http://192.168.1.100:3000 from any browser on your network.

Custom Python Scripts Using OpenAI Client

Any code using the OpenAI Python SDK can point at your local server with one change:

from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.100:11434/v1",
    api_key="ollama"  # Required by the client library, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Explain PCIe bandwidth limits for multi-GPU setups"}
    ]
)

print(response.choices[0].message.content)

This pattern works with any library built on the OpenAI SDK: LangChain, LlamaIndex, AutoGen, and most agent frameworks.
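For interactive use on home-lab hardware, streaming matters: tokens print as they arrive rather than after the full completion. A dependency-free sketch against Ollama's native /api/chat endpoint, which streams newline-delimited JSON chunks (server address and model name are assumptions here):

```python
import json
import urllib.request

def parse_stream_line(line: bytes) -> "str | None":
    """Return the text fragment from one streamed chunk, or None when done."""
    chunk = json.loads(line)
    if chunk.get("done"):
        return None
    return chunk["message"]["content"]

def stream_chat(base_url: str, model: str, prompt: str):
    """Yield response fragments as the server generates them."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            fragment = parse_stream_line(line)
            if fragment is not None:
                yield fragment

# for piece in stream_chat("http://192.168.1.100:11434", "llama3.1:8b", "hi"):
#     print(piece, end="", flush=True)
```

The OpenAI SDK supports the same behavior by passing `stream=True` to `client.chat.completions.create`.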


Security Configuration

This is the part most tutorials skip, and it matters.

Bind to LAN IP Instead of 0.0.0.0

0.0.0.0 means "all interfaces" — including any virtual adapters, VPNs, or Docker networks. A tighter option is to bind to your specific LAN IP:

export OLLAMA_HOST=192.168.1.100:11434

This prevents the server from being reachable on other network interfaces.
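You can confirm the binding from another machine with a quick TCP socket check; a sketch in which the server address and the Docker bridge address are assumptions for this example network:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Expect True for the LAN IP and False for other interfaces, e.g.:
# port_open("192.168.1.100", 11434)   # LAN binding works
# port_open("172.17.0.1", 11434)      # Docker bridge should be closed
```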

Firewall Rules (Linux with ufw)

# Allow LAN subnet to reach Ollama
sudo ufw allow from 192.168.1.0/24 to any port 11434

# Block all other access to that port
sudo ufw deny 11434

# For LM Studio
sudo ufw allow from 192.168.1.0/24 to any port 1234
sudo ufw deny 1234

Adjust 192.168.1.0/24 to match your actual LAN subnet (check with ip route).

Never Expose These Ports to the Internet

Ports 11434 and 1234 have zero authentication. If you forward these ports on your router, anyone on the internet can use your GPU to run inference — or use it to probe your local network. Never add port forwarding rules for these services.

If you need remote access to your home lab AI server, use WireGuard or Tailscale to VPN into your home network first, then connect via LAN IP.


Running Multiple Models

Ollama can serve multiple models and will swap them in VRAM on demand. By default it keeps the last loaded model in VRAM for 5 minutes after the last request.

To control concurrent VRAM usage, set:

export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_KEEP_ALIVE=10m

OLLAMA_MAX_LOADED_MODELS=2 allows two models to reside in VRAM simultaneously on systems with 24GB+ VRAM (e.g., one 7B and one 13B, or two 7B models for different use cases).
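The keep-alive window can also be overridden per request, which is useful when one client wants its model held warm longer than the global default. A stdlib sketch using Ollama's /api/generate endpoint (the address, model, and durations shown are assumptions):

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str, keep_alive: str = "30m") -> dict:
    """Request body for Ollama's /api/generate. keep_alive accepts
    durations like "30m" or "10s", or 0 to unload immediately."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }

def generate(base_url: str, model: str, prompt: str, keep_alive: str = "30m") -> str:
    """One-shot completion that asks the server to keep the model warm."""
    body = json.dumps(build_generate_payload(model, prompt, keep_alive)).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example: generate("http://192.168.1.100:11434", "llama3.1:8b", "hi", "10m")
```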

For a deeper dive on running multiple models from the same GPU, see our guide on running multiple LLMs on one GPU. If you need multi-user concurrency rather than sequential requests, see our vLLM consumer setup guide.


Practical Model Choices for a Home Lab Server

For a home lab API server that other devices will query, you want models that are fast and broadly capable rather than the largest model your GPU can barely fit:

What to look for:

  • Fast, with high quality for its size
  • Supports a 128K context window
  • Instant response; fits on any 4GB+ VRAM card

On an 8GB VRAM card, load a single 7B/8B model. On a 16GB card, you can run two 7B models simultaneously or one 13B model. An RTX 3090 or 4090 at 24GB gives you the flexibility to keep a 7B coder and a 13B general model in VRAM at the same time.
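A rough rule of thumb behind those pairings: a Q4-quantized model needs about half a gigabyte of VRAM per billion parameters for its weights, plus roughly 2 GB for KV cache and runtime overhead. This is a heuristic sketch, not a guarantee; real usage varies with context length and quantization:

```python
def fits_in_vram(params_billions: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Back-of-envelope check: do a Q4-quantized model's weights plus
    runtime overhead fit in the given VRAM? Heuristic only."""
    weights_gb = params_billions * 0.5  # ~4-bit weights: about 0.5 bytes/param
    return weights_gb + overhead_gb <= vram_gb

# By this estimate an 8B model fits on an 8GB card, while a 13B model
# needs a 16GB card.
```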

