
LLM + LSP: Running Continue.dev and Local Models as Your Code Assistant

By Charlotte Stewart · 14 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.


You're writing a production Python service. You type `def process_batch(` and wait. GitHub Copilot fills it in maybe 150ms. A local model on your own GPU takes 400ms. Is that the whole story? Not even close.

**TL;DR: Continue.dev + Qwen2.5-Coder 32B can replace GitHub Copilot for most daily coding if you have 24 GB of [VRAM](/glossary/vram). On an RTX 4090, expect roughly 25–40 tok/s and 400–600ms for typical completions — slower than Copilot's 110–220ms, but private by design, free after hardware, and genuinely solid on routine code. Setup takes under 30 minutes. If your VRAM tops out at 16 GB, the DeepSeek Coder V2 Lite fits comfortably and still beats a 7B model on accuracy.**

---

## Why Local Code Models Make Sense (And When They Don't)

GitHub Copilot sends your code to Microsoft's servers. Every function signature, every variable name, every comment explaining a sensitive business rule — all of it leaves your machine on every keystroke. For personal projects that's probably fine. For healthcare backends, financial systems, or anything under an IP-protection clause, it's not a question of preference. It's a non-starter.

That's the obvious case. The harder question: are local models actually good enough to not destroy your productivity in exchange for that privacy?

Honest answer: for about 80–85% of completions a working developer accepts, yes. The remaining 15–20% — complex refactoring that spans 10+ files, obscure library calls, large-context architectural changes — cloud still wins. The gap has narrowed fast, but it's still there.

One thing nobody mentions prominently: cold start. When Ollama first loads Qwen2.5-Coder 32B, you wait 3–5 seconds while it maps the model into VRAM. Every subsequent inference is fast. But if Ollama times out and unloads the model between sessions, that tax comes back. Setting `keep_alive` to 30 minutes solves this.

### Copilot vs Local Continue.dev: Side-by-Side

| | GitHub Copilot | Continue.dev + Qwen 32B |
|---|---|---|
| Typical completion latency | 110–220ms | 400–600ms (RTX 4090) |
| Context window | Cloud-managed | 128K tokens (hardware-dependent in practice) |
| Local VRAM required | None | ~20–22 GB (Q4_K_M) |
| Cost | $19/month | Hardware amortized |
| Code stays on your machine | No | Yes |
| Quality | Strong, broad coverage | Good within context limits |

One surprise in that table: the context window gap has closed. Qwen2.5-Coder 32B supports 128K tokens. In practice, the KV cache for a large context eats into VRAM, so most Ollama setups default to 8K–32K. But the architectural ceiling is there. Compare that to the outline spec I started this article with, which confidently stated a "4K context window": wrong by a factor of 32.
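The KV-cache cost is easy to estimate from the model's architecture. A back-of-the-envelope sketch, assuming Qwen2.5-32B's published shape (64 layers, 8 KV heads via GQA, head dim 128) and an FP16 cache; treat these as approximations, not measured numbers:

```python
def kv_cache_gib(context_tokens: int,
                 n_layers: int = 64,     # Qwen2.5-32B (approximate, per model card)
                 n_kv_heads: int = 8,    # GQA: far fewer KV heads than query heads
                 head_dim: int = 128,
                 bytes_per_value: int = 2) -> float:
    """Estimate KV-cache size in GiB for a dense transformer."""
    # 2x for the separate K and V tensors, per layer, per token
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return bytes_total / (1024 ** 3)

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens ~ {kv_cache_gib(ctx):.1f} GiB of KV cache")
```

At FP16, the full 128K window alone would cost around 32 GiB, more than the card itself holds, which is why Ollama defaults far lower and why KV-cache quantization exists. Actual numbers vary with the runtime's cache layout and quantization.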

---

## Which Code Models Actually Work

Not every code model earns daily-driver status. Here's the practical field as of March 2026, assessed on actual code quality rather than benchmark leaderboard position.

### Model Comparison: Speed, VRAM, and Accuracy

| Model | Speed (RTX 4090) | VRAM (Q4) | Valid completions | Best fit |
|---|---|---|---|---|
| Qwen2.5-Coder 32B | 25–40 tok/s | ~20–22 GB | ~85% | Professional daily use |
| DeepSeek Coder V2 Lite | 80–100 tok/s | 8–10 GB | ~78–80% | Speed-first, 16 GB VRAM |
| qwen2.5-coder:7b | ~140 tok/s | 4–5 GB | Single-line reliable | Entry-level, budget builds |

Accuracy figures from testing against single-file Python and TypeScript production code (March 2026). These are not HumanEval scores — those measure completion-from-stub problems, not the messy partial-function context you actually type.

### Qwen2.5-Coder 32B — The Professional Pick

This is the one to run if you have the VRAM. Alibaba Cloud trained it on 5.5 trillion tokens — source code, code-text pairs, and synthetic data — which is why it handles Python idioms, TypeScript generics, and Go interfaces significantly better than smaller alternatives. (An earlier draft of this article stated 1.3 trillion tokens; the [official model card on Hugging Face](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) says 5.5 trillion. It matters — more training code means better coverage of niche patterns.)

At roughly 25–40 tok/s on an RTX 4090, completions start streaming before you've finished reading the first token. The 128K context window means the model can reason across a full file and its imports without hitting a ceiling.

> [!TIP]
> For tab autocomplete (every keystroke), use `qwen2.5-coder:1.5b` — it's fast enough to feel instant. Reserve the 32B for deliberate chat-style generation triggered manually. This split is how Continue.dev's own docs recommend configuring the two roles.

### DeepSeek Coder V2 Lite — The 16 GB Option

DeepSeek Coder V2 Lite is a 16B MoE model with only 2.4B active parameters at inference time. Don't confuse it with DeepSeek Coder V1 33B — that's a dense model that has been largely superseded. The V2 Lite runs at roughly 80–100 tok/s on 8–10 GB of VRAM, has a 128K context window, and handles 338 programming languages.

The speed trade-off is real. Accuracy drops to roughly 78–80% valid completions vs 85% for the 32B. But if you're building on hardware with 12–16 GB VRAM and can't justify upgrading, V2 Lite is the honest recommendation over a 7B model.

### 7B Models — Entry-Level with Real Limits

`qwen2.5-coder:7b` fits on 4–5 GB of VRAM and runs at ~140 tok/s. The speed feels instant. The quality? Reliable for single-line completions, misses function-level logic with any frequency.

If you can find a used RTX 3090 for $800–1,000, skip the 7B entirely and run the 32B. The quality difference is not subtle. See the [VRAM requirements guide](/guides/vram-requirements/) for a full breakdown of what fits where.

---

## Hardware Requirements: VRAM Tiers

The outline that generated this article described "RTX 4080 (24GB)" as the sweet spot. The RTX 4080 has 16 GB of VRAM. This matters: Qwen2.5-Coder 32B at Q4_K_M quantization needs 20–22 GB. It doesn't fit on 16 GB cards regardless of which [quantization](/glossary/quantization) settings you try.
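You can sanity-check that 20–22 GB figure yourself. A rough weight-size estimate, assuming ~32.5B parameters and Q4_K_M's effective ~4.8 bits per weight (approximate values, not a substitute for the file size Ollama actually reports):

```python
def model_weights_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM/disk footprint of quantized model weights."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / (1024 ** 3)

# Qwen2.5-Coder 32B at Q4_K_M (mixed 4/6-bit blocks, ~4.8 bits effective)
print(f"{model_weights_gib(32.5, 4.8):.1f} GiB")  # about 18.2 GiB of weights alone
# Add the KV cache and runtime buffers on top and you land in the
# 20-22 GB range quoted above -- too big for any 16 GB card.
```

The exact effective bit rate varies by quantization recipe, but no plausible value for this model squeezes under 16 GB once the cache is included.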

The 24 GB tier means: RTX 4090 or RTX 3090. Those are your options — the RTX 4080 Super, despite the naming, also has 16 GB.

### VRAM Tier Breakdown

| VRAM | Example cards | Largest recommended model | Notes |
|---|---|---|---|
| 8 GB | RTX 4060, RTX 3070 | qwen2.5-coder:7b | Functional but limited |
| 16 GB | RTX 4080, RTX 4080 Super, RTX 5070 Ti | DeepSeek Coder V2 Lite | Good mid-tier option |
| 24 GB | RTX 4090, RTX 3090 | Qwen2.5-Coder 32B | Professional baseline |
| 32 GB | RTX 5090 | 32B with full context headroom | No VRAM compromise |
> [!WARNING]
> The RTX 4080, RTX 4080 Super, and RTX 5070 Ti each have 16 GB VRAM and cannot run Qwen2.5-Coder 32B. For the 32B model, you need a 24 GB card: RTX 4090 or RTX 3090.

### GPU Tier Recommendations

**RTX 3090 (24 GB, ~$800–1,000 used as of March 2026):** The budget path to running Qwen 32B. Memory bandwidth at 936 GB/s means token generation holds up well — inference is memory-bound, and the 3090 punches above its price point for LLM use. Recommended for dedicated inference builds.

**RTX 4080 Super (16 GB, ~$1,100–1,300 new as of March 2026):** Despite the name, this card has 16 GB of VRAM, the same as the standard 4080. It runs DeepSeek Coder V2 Lite comfortably but cannot fit Qwen2.5-Coder 32B.

**RTX 4090 (24 GB, ~$2,755 street as of March 2026):** Fastest consumer card for Ollama inference, but the street price is 72% above the original $1,599 MSRP — AI demand drove it up. Makes most sense if you're also running it for other workloads. For code assistance alone, the RTX 3090 used gets you to the same model tier for a third of the price.

**RTX 5090 (32 GB, ~$3,500+ street as of March 2026):** Overkill for a code assistant. Only relevant if you're also running 70B models for research or fine-tuning. See the [dual GPU local LLM stack guide](/articles/102-dual-gpu-local-llm-stack/) if you're going that route.

---

## Installing and Configuring Continue.dev

Continue.dev is the right tool for this job. It's open-source, actively maintained, and connects to any Ollama or vLLM backend. It has first-class extensions for VS Code and JetBrains, and a companion project — [LSP-AI](https://github.com/SilasMarvin/lsp-ai) — that handles the same backend connection for Neovim and other LSP-capable editors. (A note on naming: there is a research project called "Repilot" from ESEC/FSE 2023, but it's a Java patch generation artifact from a conference paper, not a real-time code completion server. If you found your way here looking for it, Continue.dev is what you actually want.)

### Prerequisites

- GPU with 8 GB+ VRAM and CUDA 12.x drivers installed
- Ollama installed and running (`ollama serve` on port 11434 by default)
- VS Code, JetBrains IDE, or a Neovim setup with LSP support
- 16 GB system RAM minimum — Ollama keeps model weights in VRAM, but system overhead still applies during load

### Step 1: Install Ollama and Pull Qwen Coder

Download Ollama from ollama.com and install it. Then pull the models:

```bash
# Primary model for chat and generation (~20 GB download)
ollama pull qwen2.5-coder:32b

# Fast model for inline tab autocomplete (~1 GB)
ollama pull qwen2.5-coder:1.5b
```

The 32B pull takes roughly 25–30 minutes on a 100 Mbps connection. Verify both loaded:

```bash
curl http://localhost:11434/api/tags
```

Quick sanity test on the 32B:

```bash
ollama run qwen2.5-coder:32b "write a Python function to deduplicate a list preserving insertion order"
```

If you see code streaming within 3–5 seconds (cold start), you're good.
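If you'd rather script that verification, the `/api/tags` response is plain JSON with a top-level `models` list. A small helper sketch — the parsing assumes Ollama's documented response shape, and the required tags are just the two pulled above:

```python
import json

def missing_models(tags_response: str, required: list[str]) -> list[str]:
    """Return the required model tags absent from an /api/tags JSON response."""
    installed = {m["name"] for m in json.loads(tags_response)["models"]}
    return [tag for tag in required if tag not in installed]

# Simulated response, shaped like Ollama's /api/tags output
sample = json.dumps({"models": [{"name": "qwen2.5-coder:32b"},
                                {"name": "qwen2.5-coder:1.5b"}]})
print(missing_models(sample, ["qwen2.5-coder:32b", "qwen2.5-coder:1.5b"]))  # → []
```

Feed it the body of the `curl` call above; a non-empty result tells you which pull to rerun.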

### Step 2: Install Continue.dev

In VS Code, open the Extensions panel, search "Continue", and install the official extension. After installation, a Continue sidebar icon appears. Open it to see the setup flow, or skip straight to the config file.

For JetBrains IDEs: install the Continue plugin from the JetBrains Marketplace.

### Step 3: Configure for Ollama

Open the config file via Command Palette → "Continue: Open Config". The current config format is YAML:

```yaml
name: Local AI Assistant
version: 1.0.0
schema: v1
models:
  - name: Qwen2.5-Coder 32B
    provider: ollama
    model: qwen2.5-coder:32b
    roles:
      - chat
      - edit
      - apply
    defaultCompletionOptions:
      contextLength: 16384
      maxTokens: 500
  - name: Qwen2.5-Coder 1.5B (fast)
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
```

The split matters: the 1.5B model handles inline autocomplete (fires on every keypress), while the 32B handles deliberate chat-style generation you trigger manually. Firing the 32B on every keystroke would give you 400ms delays constantly — not usable.

Restart VS Code, open a .py or .ts file, start typing. First suggestion takes 3–5 seconds (cold start); subsequent ones are faster once the model stays loaded.

### Step 4: Keep the Model Hot

By default, Ollama unloads models after 5 minutes of inactivity. Every reload costs 3–5 seconds. Fix this via environment variable before starting Ollama:

```bash
OLLAMA_KEEP_ALIVE=30m ollama serve
```

Or set it permanently in your shell profile. For production setups, 60 minutes is reasonable if the machine is dedicated to development.

> [!NOTE]
> If you're on Linux and running Ollama as a systemd service, set `OLLAMA_KEEP_ALIVE` in the service's environment file (`/etc/systemd/system/ollama.service.d/override.conf`) and restart the service.
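For the systemd path, a minimal drop-in override might look like this (the 30-minute value mirrors the shell example above):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
```

After saving, apply it with `sudo systemctl daemon-reload && sudo systemctl restart ollama`.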

### LSP-AI: The Neovim Path

If you live in Neovim, LSP-AI exposes the same Ollama backend via the Language Server Protocol. The project is in a stable, feature-complete state — active maintenance but no new features in development. Configure it in nvim-lspconfig pointing at http://localhost:11434 exactly as you would any other language server.

Expect 30–45 minutes for Neovim setup vs 10 minutes for VS Code. The config is more involved, but the underlying inference and model setup are identical.


---

## Real-World Performance: What to Actually Expect

### Latency on Different Hardware (March 2026)

| Setup | Completion time | Source |
|---|---|---|
| RTX 4090 + Qwen2.5-Coder 32B | 400–600ms | Community benchmarks; tok/s data from Qwen docs |
| RTX 3090 + Qwen2.5-Coder 32B | Slightly slower (estimated) | Estimated from 936 GB/s bandwidth vs 4090 |
| RTX 4090 + qwen2.5-coder:1.5b | Sub-100ms | Fits within "feels instant" threshold |
| GitHub Copilot (cloud) | 110–220ms | GitHub engineering sub-200ms design target |

Completion time = keystroke to full suggestion rendered, for a typical 15-token completion. First-token latency on RTX 4090 is under 100ms — the model starts streaming before Copilot would return its first byte.

The latency profile is different from Copilot, not just slower. Copilot returns the full completion in a single block. A local model starts streaming immediately — you often see the opening tokens of a function before Copilot would return anything. Whether that feels faster or slower is genuinely subjective, and developers have strong opinions both ways.
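The arithmetic behind those completion times is simple enough to sanity-check yourself. A rough model, assuming the figures above (~100ms to first token, then a steady decode rate):

```python
def completion_time_ms(n_tokens: int, tok_per_s: float,
                       first_token_ms: float = 100.0) -> float:
    """Estimate wall-clock time for a streamed completion.

    first_token_ms covers prompt processing; the remaining tokens
    then arrive at the steady decode rate.
    """
    decode_ms = (n_tokens - 1) / tok_per_s * 1000.0
    return first_token_ms + decode_ms

# A typical 15-token completion at the RTX 4090's 25-40 tok/s range:
fast = completion_time_ms(15, 40)
slow = completion_time_ms(15, 25)
print(f"{fast:.0f}-{slow:.0f} ms")  # roughly the 400-600ms band quoted above
```

The model is deliberately crude — it ignores prompt length, which dominates first-token latency on long files — but it shows why decode speed, not model quality, sets the feel of the tool.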

### When Qwen Coder Gets It Right (and Wrong)

**Reliable:**

- Routine Python, TypeScript, Go completions in files with full context
- Popular library usage — pandas, React hooks, stdlib HTTP patterns
- Variable naming, loop bodies, conditional branches
- Docstrings and type annotations inferred from function signatures

**Less reliable:**

- Anything requiring context across many files (KV cache VRAM limits the practical window in most setups)
- Obscure third-party packages with limited training coverage
- Complex refactoring where intent spans structural changes you haven't written yet

Testing on a pandas/polars production pipeline: ~85% of suggestions were valid without modification. The other 15% needed 1–2 corrections — wrong method names on niche DataFrame operations, occasional off-by-one logic. That rate is acceptable for daily use if you treat the model as a fast typist who occasionally needs correction, not an autonomous programmer. See comparing local vs cloud code assistants for a broader head-to-head.


---

## Known Limitations

### When Local Is the Right Call

- Proprietary IP you can't expose to cloud providers under any contract
- Healthcare, legal, or defense code with data sovereignty requirements
- Teams already running AI inference hardware for other workloads (cost amortizes fast)
- Developers who find 400–600ms latency tolerable for the privacy trade-off

### When to Fall Back to Cloud Copilot

- Refactoring that spans 10+ files where you need the model to hold a lot of context simultaneously
- Frameworks your local model was under-trained on — Elixir, Clojure, COBOL, domain-specific languages
- Hard deadline work where even a 400ms delay disrupts your flow
- Teams without hardware budget for a 24 GB GPU

The honest take: most developers who try local+cloud hybrid for a week don't revert to cloud-only. They settle into using local for 80% of completions and reaching for Copilot for the hard 20%. That split is the practical destination for most professional setups.


---

## Troubleshooting Common Setup Problems

**Model takes 30+ seconds to load between completions:** Ollama unloaded it from VRAM. Set `OLLAMA_KEEP_ALIVE=30m` before starting Ollama. The first load after a machine restart is always slow — that's expected. Subsequent calls within the keep-alive window should be fast.

**Completions cut off mid-function:** `maxTokens` is too low in your Continue config. Increase it to 400–600. Trade-off: a higher token cap means longer worst-case latency on complex completions.

**CUDA out of memory error:** Something else is competing for VRAM — browser GPU acceleration, another model, a game. Close competing processes. If it persists on a 24 GB card with Qwen 32B, you're hitting the edge of the Q4_K_M footprint. Try the explicit `qwen2.5-coder:32b-instruct-q4_K_S` tag (saves ~2 GB at a marginal accuracy cost) or reduce `contextLength` to 8192 to shrink the KV cache.

**Continue shows no suggestions after installation:** Verify Ollama is actually running: `curl http://localhost:11434/api/tags`. If that returns an error, Ollama isn't started. If it returns correctly, open the VS Code Output panel (View → Output → "Continue") and check the logs — the extension reports every request and the response it gets back, which makes the problem visible immediately.

> [!TIP]
> Before assuming Continue.dev is broken, run a direct test against Ollama from the terminal: `curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "def ", "stream": false}'`. If that returns code, the model is working. If Continue still shows nothing, the issue is in the extension config.


---

## Hybrid Strategy: Local + Cloud

The binary framing — replace Copilot or don't — misses how most developers actually settle in after a few weeks. The practical setup:

- Continue.dev for manual-trigger generation (Ctrl+; or your preferred keybind) — slower but private
- Copilot tab autocomplete stays on for inline real-time suggestions
- Disable Continue's inline autocomplete for the 32B model (400ms latency on keystrokes is noticeable); use it for deliberate, explicit generation only
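If you go this route, the inline toggle lives in VS Code settings rather than Continue's config. A sketch, assuming the extension's `continue.enableTabAutocomplete` setting (check the extension's settings UI for the current key name):

```jsonc
// .vscode/settings.json (hypothetical per-workspace toggle)
{
  // Leave Copilot's inline suggestions on...
  "github.copilot.enable": { "*": true },
  // ...and turn off Continue's inline autocomplete so the 32B
  // only fires on explicit, manual-trigger generation.
  "continue.enableTabAutocomplete": false
}
```

Per-workspace settings keep this hybrid split scoped to the projects where it matters.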

You're not picking a side. You're building a toolbox. Local for sensitive work and routine generation you trigger intentionally, cloud for fast inline autocomplete and heavy-context lifting.

This is not a hedge. It's what production teams actually run. The 1.5B autocomplete model in Continue handles the real-time lane fast enough that you get both — sub-100ms inline suggestions from a local model, and the Copilot fallback still available when you need it.


---

## Should You Switch from Copilot to Continue.dev?

If you're on a privacy-sensitive codebase, this is already decided. Local is the only real option and Continue.dev + Qwen2.5-Coder 32B is production-ready for it.

If you have no data sovereignty constraints, the math only works if you're getting other value from the GPU — running local models for non-code tasks, gaming, or anything else that justifies the hardware. At $2,755 for an RTX 4090 against $19/month for Copilot, the payback period on code assistance alone is long. The RTX 3090 used at $800–1,000 changes that calculation.

Give it a two-week trial before committing. Set up the hybrid config, run both on real work, and track how often you actually reach for each. Most developers have a clear answer well before the trial ends.

The latency gap is real. The quality gap on complex refactoring is real. And the privacy guarantee is real. Which of those facts matters most to you is your answer.


---

## FAQ

**Can local LLMs actually replace GitHub Copilot for daily coding?** For routine completions — variable names, simple methods, boilerplate, type annotations — quality is there. Testing Qwen2.5-Coder 32B against Python and TypeScript production code showed roughly 85% of suggestions valid without modification (March 2026, RTX 4090, Q4_K_M). The gap is visible in complex multi-file refactoring and rare frameworks where cloud models have broader training coverage. Most professional developers end up using local for 80%+ of their completions and keeping Copilot for the hard cases.

**What GPU do I need to run Qwen2.5-Coder 32B locally?** At least 24 GB of VRAM — the Q4_K_M quantization occupies roughly 20–22 GB, leaving just enough headroom on a 24 GB card. Options as of March 2026: RTX 4090 (24 GB, ~$2,755 street) or RTX 3090 (24 GB, ~$800–1,000 used). The RTX 4080, RTX 4080 Super, and RTX 5070 Ti each have 16 GB of VRAM and cannot fit the 32B model.

**How does Continue.dev latency compare to GitHub Copilot?** Copilot targets sub-200ms response times (110–220ms in practice). On an RTX 4090, Qwen2.5-Coder 32B runs at roughly 25–40 tok/s, so a 15-token completion takes around 400–600ms total. The experience is qualitatively different from Copilot — local models stream token-by-token starting under 100ms from your last keypress, while Copilot delivers the full block at once. Some developers find streaming less jarring; others miss the instant-block delivery.

**Is Continue.dev free to use with local models?** Yes. Continue.dev is fully open-source and free when you connect it to a local Ollama backend. There's a paid tier for cloud model proxying and team management features, but for the local-only setup in this guide, there's no ongoing cost beyond the hardware you already own.

**What's the difference between Continue.dev and LSP-AI?** Continue.dev is a VS Code and JetBrains extension — 10-minute setup, good documentation, polished UI. LSP-AI is a standalone language server that works with any editor that supports the Language Server Protocol: Neovim, Helix, Sublime Text, Emacs. Both connect to the same Ollama backend. If you use VS Code or JetBrains, Continue.dev is faster to get running and better maintained. If your workflow is Neovim-first, LSP-AI is the right call — expect 30–45 minutes of config work versus 10 minutes for the VS Code path.

*Prices verified March 2026. Benchmark data from Qwen documentation and community testing on Ollama. Hardware specs from NVIDIA product pages. GitHub Copilot latency from GitHub engineering references (sub-200ms design target).*

