Quick Summary:
- GPU VRAM is the hard limit: 7B models need 6-8GB, 13B need 10-12GB, 34B need 20-24GB, 70B need 40-48GB at Q4_K_M quantization.
- Server mode is one toggle: LM Studio 0.3.x exposes an OpenAI-compatible API at localhost:1234 — any AI tool that supports OpenAI connects to it automatically.
- Network access requires a firewall rule: Binding to 0.0.0.0 lets other devices connect, but you'll need to allow the port through Windows Defender Firewall.
Your gaming PC is probably underutilized. When you're not gaming, that RTX 4070 Ti or RX 7900 XTX is sitting idle with 12-16GB of VRAM that could be running local language models 24/7. LM Studio is the fastest path from "I want to run a local LLM" to actually having one running — no command line required.
This guide covers the full setup: installing LM Studio, choosing the right model for your GPU, loading it, enabling the local API server, and connecting from other devices on your network.
Step 1: Install LM Studio
Download LM Studio from lmstudio.ai. It's available for Windows, macOS, and Linux.
Windows install: Standard .exe installer. Run it; it installs to %LOCALAPPDATA%\Programs\LM Studio. No admin rights required.
macOS install: .dmg file, drag to Applications. Apple Silicon and Intel both supported.
Linux install: .AppImage file. Make it executable and run:
chmod +x LM-Studio-*.AppImage
./LM-Studio-*.AppImage
For GPU acceleration on Linux:
- NVIDIA: Ensure CUDA 12.x drivers are installed. LM Studio detects CUDA automatically.
- AMD: ROCm 6.x. Confirm `rocm-smi` shows your card before launching.
After first launch, LM Studio checks your hardware and shows detected GPU(s) in the bottom status bar. Confirm your GPU is listed — if it shows "CPU only," your drivers need attention.
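Before launching, you can confirm the vendor GPU utility is even on your PATH. A minimal sketch — `detect_gpu_tools` is a hypothetical helper, not part of LM Studio; it only checks for the standard `nvidia-smi` and `rocm-smi` binaries:

```python
import shutil

def detect_gpu_tools() -> dict:
    """Report which vendor GPU utilities are on PATH.

    If neither is found, LM Studio will likely fall back to CPU-only mode.
    """
    tools = {"nvidia-smi": shutil.which("nvidia-smi"),
             "rocm-smi": shutil.which("rocm-smi")}
    for name, path in tools.items():
        print(f"{name}: {path or 'not found'}")
    return tools

if __name__ == "__main__":
    detect_gpu_tools()
```

If both print "not found" on a machine with a discrete GPU, fix the drivers before troubleshooting anything inside LM Studio.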
Step 2: Know Your VRAM Limit Before Downloading
This is the step most tutorials skip, and it's why people download models that won't fit.
LM Studio runs GGUF models. GGUF supports multiple quantization levels, each trading quality for size. At Q4_K_M (the recommended baseline — good quality, reasonable size), memory requirements are:
| Model size | Min VRAM (GPU-only) |
|---|---|
| 3B | 3 GB |
| 7B | 6 GB |
| 8B | 6-7 GB |
| 13B | 10-12 GB |
| 14B | 10-12 GB |
| 34B | 22-24 GB |
| 70B | 42-48 GB |

The VRAM requirement is slightly higher than the model file size because of KV cache and framework overhead. Plan on needing ~1.5-2GB above the file size for comfortable operation. For a deep dive on why VRAM fills up mid-conversation, see our KV cache explainer. For choosing which GGUF quantization variant to download, see our GGUF vs GPTQ vs AWQ vs EXL2 guide.
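The rule of thumb above can be turned into a quick estimate — a sketch, assuming a flat ~2 GB allowance for KV cache and framework overhead on top of the GGUF file size (the 4.4 GB example file size is a ballpark for a 7B Q4_K_M model, not an exact figure):

```python
def estimated_vram_gb(gguf_file_gb: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM needed to run a GGUF model fully on GPU.

    Rule of thumb: file size plus ~1.5-2 GB for KV cache
    and framework overhead.
    """
    return gguf_file_gb + overhead_gb

# A 7B model at Q4_K_M is roughly a 4.4 GB file:
print(f"~{estimated_vram_gb(4.4):.1f} GB VRAM")  # → ~6.4 GB VRAM
```

That lands right in the 6 GB row of the table above, with a little breathing room.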
Your GPU's VRAM:
- RTX 3060 12GB → runs 7B comfortably, 13B at Q3 or smaller
- RTX 4060 Ti 16GB → runs 13B comfortably
- RTX 4070 Ti 12GB → runs 13B at Q4, tight; better at Q3_K_M
- RTX 4070 Ti Super 16GB → runs 13B comfortably
- RTX 3090 / 4090 24GB → runs 34B at Q4_K_M
- Dual RTX 3090 48GB → runs 70B at Q4_K_M
If your model is larger than your VRAM, LM Studio can split it across CPU RAM and GPU (partial GPU offload). This works but is significantly slower — expect 5-20 tokens/sec instead of 50-120 t/s.
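Partial offload works layer by layer: LM Studio keeps as many transformer layers on the GPU as fit and runs the rest on CPU. A rough sketch of that split, assuming equally sized layers — the layer count, file size, and the `offload_split` helper are all illustrative, not taken from any specific model or LM Studio API:

```python
def offload_split(model_gb: float, n_layers: int, free_vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit on the GPU.

    reserve_gb leaves headroom for KV cache and framework overhead.
    """
    per_layer_gb = model_gb / n_layers
    usable = max(0.0, free_vram_gb - reserve_gb)
    return min(n_layers, int(usable / per_layer_gb))

# Illustrative: an 18 GB model with 48 layers on a 12 GB card
print(offload_split(18.0, 48, 12.0))  # → 28
```

In this sketch roughly 28 of 48 layers run on GPU, which is why partial offload still beats CPU-only but can't match full-offload speeds.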
Step 3: Download a Model
In LM Studio's left sidebar, click the magnifying glass (Discover) icon.
Search for the model you want. For a first model, recommended starting points by VRAM tier:
- 6-8GB VRAM: `llama3.2:3b` or `qwen2.5-7b-instruct`
- 10-12GB VRAM: `llama3.1-8b-instruct` or `mistral-7b-instruct`
- 16GB VRAM: `qwen2.5-14b-instruct`
- 24GB VRAM: `qwen2.5-32b-instruct` (at Q3_K_M) or `deepseek-r1-distill-qwen-14b`
When LM Studio shows a model in search results, it lists multiple quantization variants. For most users:
- Q4_K_M — best default. Good quality, practical size.
- Q5_K_M — slightly better quality if you have headroom.
- Q8_0 — near-lossless, use if the model fits comfortably.
- Q2_K or Q3_K_M — only if your VRAM is tight and you need to fit a larger model.
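One way to sanity-check a variant before downloading — a sketch using approximate effective bits-per-weight for common GGUF quant types (the figures are ballpark averages, and `fits` is a hypothetical helper, not an LM Studio API):

```python
# Approximate effective bits per weight for common GGUF quant types
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0,
}

def fits(params_b: float, quant: str, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """Check whether a model at a given quant fits in VRAM.

    params_b is the parameter count in billions; overhead covers
    KV cache and framework allocations.
    """
    weights_gb = params_b * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb <= vram_gb

print(fits(7, "Q4_K_M", 8))    # 7B at Q4_K_M on an 8 GB card → True
print(fits(13, "Q4_K_M", 8))   # 13B on the same card → False
```

This mirrors the tier list above: when `fits` returns False for your preferred quant, step down one level rather than relying on partial offload.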
Click the download arrow on your chosen variant. LM Studio downloads to ~/lm-studio/models/ by default (configurable in Preferences → Storage).
Step 4: Load the Model
Click the home icon (house) in the left sidebar to go to the main view. Click "Select a model to load" and choose your downloaded model.
LM Studio shows a configuration panel before loading:
- GPU Offload: Set this to 100% (all layers on GPU) if the model fits in VRAM. Reduce it only if you're intentionally splitting to CPU RAM.
- Context Length: Default is usually 2048-4096. Higher context = more VRAM used by KV cache. For a 7B model on 8GB VRAM, 4096 context is safe. For 13B on 12GB, cap at 2048-4096.
- Prompt Template: Leave as Auto — LM Studio detects the correct chat template from the model's metadata.
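The context-length knob matters because the KV cache grows linearly with context. A sketch of the standard KV cache size formula, using Llama-3-8B's published geometry (32 layers, 8 KV heads, head dimension 128) as the worked example — the function itself is illustrative, not an LM Studio API:

```python
def kv_cache_bytes(n_layers: int, ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: two tensors (K and V) per layer, one vector
    per token per KV head, fp16 (2 bytes) by default."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# Llama-3-8B geometry at 4096 context, fp16 cache:
gib = kv_cache_bytes(32, 4096, 8, 128) / 2**30
print(f"{gib:.2f} GiB")  # → 0.50 GiB
```

Doubling the context to 8192 doubles that to 1 GiB — which is exactly the headroom that disappears when you raise Context Length on a card that barely fits the weights.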
Click Load. The status bar at the bottom shows loading progress. A 7B model loads from an NVMe drive into VRAM in 5-15 seconds. A 70B model may take 30-60 seconds.
Once loaded, the model name and token/sec indicator appear in the bottom bar. You can test it in the Chat tab.
Step 5: Enable the Local API Server
This is what turns LM Studio from a chat toy into a local AI infrastructure component.
In the left sidebar, click the </> (Developer) icon. You'll see the Local Server tab.
Enable the server toggle. The server starts at localhost:1234 by default.
To access from other devices on your network:
- Change the Host field from `localhost` to `0.0.0.0`
- The port stays at `1234` (change it if you need to)
- Click the green "Start Server" button
LM Studio shows the full URL — something like http://192.168.1.105:1234. Note this IP address — you'll use it from other devices.
Windows Firewall (required for network access):
Open PowerShell as Administrator and run:
New-NetFirewallRule -DisplayName "LM Studio" -Direction Inbound -Protocol TCP -LocalPort 1234 -Action Allow
Or manually: Windows Security → Firewall → Advanced Settings → Inbound Rules → New Rule → Port → TCP 1234.
Test it from another device:
curl http://192.168.1.105:1234/v1/models
You should get a JSON response listing your loaded model. If you get connection refused, the firewall rule wasn't applied or the server isn't bound to 0.0.0.0.
Step 6: Connect Other Applications
LM Studio's API is OpenAI-compatible. Any tool that accepts an OpenAI API endpoint can point at your local server.
Open WebUI (browser-based chat frontend):
docker run -d -p 3000:8080 \
-e OPENAI_API_BASE_URL=http://YOUR-PC-IP:1234/v1 \
-e OPENAI_API_KEY=lm-studio \
ghcr.io/open-webui/open-webui:main
Continue.dev (VS Code AI coding assistant):
In ~/.continue/config.json:
{
"models": [{
"title": "Local LM Studio",
"provider": "openai",
"model": "local-model",
"apiBase": "http://localhost:1234/v1",
"apiKey": "lm-studio"
}]
}
Python script:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Explain quantization in one paragraph."}]
)
print(response.choices[0].message.content)
The api_key value doesn't matter — LM Studio accepts any non-empty string.
VRAM Requirements Reference Table
For quick reference when choosing quantization level:
| Quantization | 7B | 13B | 34B | 70B |
|---|---|---|---|---|
| F16 | ~13.5 GB | ~26 GB | ~68 GB | ~140 GB |
| Q4_K_M | 6-8 GB | 10-12 GB | 20-24 GB | 40-48 GB |

For understanding exactly why these numbers are what they are — and why VRAM usage grows during long conversations — see our KV cache and VRAM guide.
Troubleshooting Common Issues
Model loads but inference is slow (< 5 tokens/sec): GPU offload is probably set to 0% or low. In the load dialog, set GPU Offload to 100%. Check the status bar shows your GPU name, not "CPU."
Out of memory error on load: The model is too large for your VRAM. Either choose a more aggressive quantization (Q3_K_M or Q2_K), reduce the context length, or switch to a model with a smaller parameter count.
Can't connect from another device:
- Confirm server is bound to 0.0.0.0, not localhost
- Check Windows Firewall rule is active
- Confirm both devices are on the same subnet (same router)
- Try disabling firewall temporarily to confirm it's the culprit
Model not showing in API response: The server only serves the currently loaded model. Go back to the main view, confirm the model is loaded (green indicator), then return to Developer tab.
For a comparison of LM Studio against Ollama and llama.cpp, see Ollama vs LM Studio vs llama.cpp vs vLLM. For a broader homelab API server setup covering multi-device access, see our gaming PC local LLM server guide.