CraftRigs
Architecture Guide

WSL2 Local LLM Setup: Fix CUDA, GPU Passthrough, Driver Conflicts

By Georgia Thomas 15 min read
WSL2 Local LLM Setup: Fix CUDA, GPU Passthrough, Driver Conflicts — diagram

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Install the Windows NVIDIA driver first—never the Linux one inside WSL2. Use Ubuntu 24.04, verify with nvidia-smi showing driver 560+ and CUDA 12.6+. Build llama.cpp with cmake -B build -DGGML_CUDA=ON after confirming nvcc --version matches your host driver. Move all models off /mnt/c to an ext4 .vhdx or WSL2 native path; the 9p filesystem's mmap() emulation cuts token generation by 60% on 30GB+ files. The mmap() trap cost me 6 hours—this guide prevents yours.

Host Driver Foundation

Windows NVIDIA driver 560.94 or newer is mandatory before you touch a single WSL2 command. The 555.xx series carries known WSL CUDA initialization failures. They'll waste your afternoon. Don't install any Linux driver inside WSL2—Microsoft's WSL2 GPU support uses a paravirtualized path where the Windows host driver speaks directly to the WSL2 kernel through dxgkrnl. A Linux .run file inside your Ubuntu instance breaks this chain entirely.

Run wsl --update until your kernel hits 5.15.153.1 or newer. Earlier kernels simply don't expose /dev/dxg, the DirectX graphics device node that CUDA leans on in WSL2. Without it, nvidia-smi inside Ubuntu returns a blank stare even when the Windows host sees your GPU perfectly.

One buried Windows setting kills RTX 30-series setups specifically. Disable "Hardware-accelerated GPU scheduling" in Settings — System — Display — Graphics. It's a confirmed source of CUDA_ERROR_UNKNOWN on Ampere cards under WSL2. The feature helps native DirectX games. It interferes with CUDA's paravirtualized memory allocation path.

Your first verification happens on Windows, not in WSL2. Open PowerShell and run nvidia-smi. You should see your actual GPU name, VRAM capacity, and driver version. "Microsoft Basic Display Adapter" means your host driver is broken or you're running on integrated graphics. Fix that before proceeding.

Driver Version Compatibility Matrix

Windows DriverWSL2 CUDA ToolkitStatusNotes
560.9412.6Green pathFull DirectML/CUDA interop; target this combination
555.8512.4Yellow pathcudaMallocAsync failures on allocations above 16 GB
552.44 or olderAnyRed pathdxgkrnl crash loop; upgrade mandatory before any WSL2 LLM work
565.xx Insider PreviewAnyRed pathnvidia-smi fails intermittently inside WSL2; avoid for production inference

The 560.94 — 12.6 pairing is the only green path I'd run production inference on. The 555.85 yellow path will lure you in—nvidia-smi works, basic CUDA samples compile—but cudaMallocAsync silently fails when you load a 70B Q4_K_M weights buffer. You'll think your build is broken when it's your driver pairing.

Insider Preview 565.xx deserves special warning. The intermittent nvidia-smi failures inside WSL2 aren't cosmetic; they signal unstable dxgkrnl communication that'll corrupt long-running inference jobs. I've had llama-server drop from 8.2 tok/s to CPU fallback mid-generation on 565 builds. Roll back to 560.94, sleep better.

WSL2 Distribution and Kernel Setup

Ubuntu 24.04 LTS is the only distribution I'd use for this. Ubuntu 22.04 ships glibc 2.35, which is incompatible with CUDA 12.6's libcudart. You'll compile llama.cpp successfully. Then you'll hit cryptic runtime crashes when the loader resolves the wrong libc version. Save yourself the headache—start with 24.04.

Your .wslconfig file lives at C:\Users\<name>\.wslconfig on the Windows host, not inside WSL2. Set memory=24GB, processors=16, and swap=0. Default WSL2 caps at 50% RAM and 8 logical processors. That isn't enough for 70B model loads. The swap=0 setting matters more than you'd think—Windows pagefile thrashing during a 70B Q4_K_M KV cache allocation adds stutter that masquerades as a GPU problem. I've watched llama-server drop to 2 tok/s because Windows was paging WSL2 memory to a SATA SSD.

Every driver or kernel change demands wsl --shutdown before verification. WSL2's VM state caches PCI passthrough configurations in a way that survives a standard wsl --terminate. Your new driver looks loaded, nvidia-smi shows the version, but CUDA context creation still hits the old broken path. Shutdown the VM entirely, let Hyper-V rebuild the virtual PCI tree from scratch.

Once your baseline works, freeze it. wsl --export Ubuntu-24.04 wsl-llm-baseline.tar creates a rollback snapshot in about 90 seconds. Windows driver updates have broken my WSL2 CUDA chain three times in 2024. Restoring from this export takes 60 seconds versus six hours of re-debugging.

WSL2 VM Resource Tuning

The default memory cap isn't just suboptimal—it's actively misleading. You can load a 70B Q4_K_M weights file into VRAM with 8 GB system RAM. The KV cache overhead needs 20 GB+ system RAM during prompt processing. WSL2's 50% default on a 32 GB machine gives you 16 GB. The model loads, then OOMs on your first long context query. The error looks like a CUDA failure; it's a resource limit.

localhostForwarding=true in .wslconfig exposes llama-server's API port to Windows host browsers and tools. Without it, you're SSH tunneling or editing Windows firewall rules. All to reach a server running on the same physical machine. The setting defaults to true on recent WSL2. Verify it anyway. Corporate Group Policy sometimes locks it false.

guiApplications=false frees ~400 MB VRAM from the WSLg compositor. On 24 GB cards running edge-of-memory quants, that's the difference between fitting a 70B Q3_K_M at 8192 context and hitting an allocation failure. WSLg is slick for Linux GUI apps, but you're not running GIMP—you're running inference. That compositor sits in VRAM doing nothing while your model spills to system memory.

nestedVirtualization=false prevents Hyper-V scheduler interference with GPU compute queue dispatch. If you're not running Docker Desktop or nested VMs, there's no reason to keep it on. The Hyper-V root partition's scheduler can delay CUDA kernel launches in ways that don't show up in nvidia-smi utilization percentages, but absolutely murder tok/s consistency. I measured 12% variance in generation speed with nested virtualization enabled. With it off, variance drops to 3%.

CUDA Toolkit Inside WSL2

Install cuda-toolkit-12-6 through NVIDIA's official repository. Never use nvidia-cuda-toolkit from Ubuntu universe—those are stub packages without nvcc, and you'll spend an hour wondering why CMake can't find your compiler. The NVIDIA repo packages carry the full toolchain: nvcc, libcudart.so.12, and the headers that llama.cpp's CMake detection actually probes for.

CUDA_HOME=/usr/local/cuda-12.6 and your PATH must precede any conda or pyenv CUDA paths. Version mismatch is the #1 build failure I see in r/LocalLLaMA WSL2 threads. Someone has CUDA 11.8 lingering in a conda environment. CMake finds it. It compiles against it. Then the runtime loads Windows host driver 12.6 and explodes. Hardcode the path in ~/.bashrc, not ~/.profile—WSL2 non-login shells skip the latter.

nvcc --version must match your Windows host driver CUDA version to within one minor revision. 12.6 inside WSL2 against a 12.4 driver API produces the infamous CUDA driver version is insufficient error. The binary builds, links, then dies at first context creation. Check both versions before touching CMake. If your Windows host shows 12.4, either downgrade WSL2 to cuda-toolkit-12-4 or upgrade Windows—there's no middle ground that works.

Run sudo ldconfig after toolkit install. libcudart.so.12 sits in /usr/local/cuda-12.6/lib64, but the dynamic loader cache doesn't refresh automatically on Debian-family systems. Your compilation succeeds, you run llama-server, and get "error while loading shared libraries: libcudart.so.12: cannot open shared object file." The file exists. The cache is stale. One command fixes it.

Ubuntu 24.04's update-alternatives for CUDA creates /usr/local/cuda/usr/local/cuda-12, not cuda-12.6. This looks sensible until CMake resolves that symlink, finds the 12.6.85 runtime, but your nvidia-smi reports a 12.4 driver API. The binary crashes with a version mismatch. The error message blames the driver, not the symlink.

llama.cpp's CMake module searches CUDA_HOME first, then falls back to /usr/local/cuda. If you've set CUDA_HOME=/usr/local/cuda expecting the latest, you hit the generic symlink trap. The build log shows -I/usr/local/cuda/include and -L/usr/local/cuda/lib64, both resolving to 12.6 content, while the runtime loader binds against the 12.4 driver interface. It compiles, it links, it dies.

Fix it permanently: hardcode CUDA_HOME=/usr/local/cuda-12.6 in ~/.bashrc, and remove any generic /usr/local/cuda entry from PATH. I keep a dedicated .wsl-cuda-env file I source before any build:

export CUDA_HOME=/usr/local/cuda-12.6
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

Verify with:

ldd build/bin/llama-server | grep cuda

Every .so.12 path should resolve to /usr/local/cuda-12.6/lib64/, not "not found" or a generic /usr/local/cuda prefix. If ldd shows unresolved dependencies, your build is broken regardless of whether nvidia-smi looks happy.

Building llama.cpp with CUDA

CMake is the only build path I'd trust in WSL2. Start with cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 for Ampere cards like the RTX 3090 or RTX 4090. The default native arch detection fails inside WSL2 with "cannot find device" because the virtualized environment doesn't expose physical PCI topology the way native Linux does. CMake's native tries to probe the GPU, gets confused by the dxg paravirtualization layer, and either errors out or silently falls back to a generic compute capability that misses tensor core optimizations. Hardcode 86 for Ampere, 89 for Ada, 80 for Turing. Know your silicon.

The make invocation matters too. Run make -j$(nproc) inside the build/ directory, not the project root. llama.cpp's root-level Makefile ignores CMake CUDA flags entirely and builds a CPU-only binary with zero warning. I've watched people run make in the root, get a clean build log, then wonder why llama-server generates at 0.4 tok/s on a 4090. The root Makefile is legacy scaffolding; the CMake-generated build system in build/ is where CUDA actually happens.

Check what you built. Run ./build/bin/llama-server --model dummy.gguf --n-gpu-layers 99 and watch the startup log. You need to see CUDA:0 with your GPU name and VRAM allocation, not CPU with a shrug. The CUDA:0 line confirms ggml_cuda_init succeeded, the context was created against the WSL2 dxg path, and layers are actually offloading. If you see CPU instead, your binary is CPU-only regardless of what CMake claimed to detect.

ggml_cuda_init ggml_backend_cuda_init failure with "no CUDA devices found" sends you straight back to the Host Driver Foundation section. It means the WSL2 kernel isn't exposing /dev/dxg properly, the Windows host driver isn't forwarding, or your kernel is pre-5.15.153.1. Don't try to hack around it with environment variables—fix the virtualization boundary first.

CMake vs Makefile Build Divergence

AspectCMake PathMakefile Path
CUDA detectioncmake -B build -DGGML_CUDA=ON probes for nvcc, sets -arch=sm_86make ignores CUDA unless LLAMA_CUDA=1 explicitly exported
Output binaryllama-server with dynamic libggml-cuda.so loadingCan static-link CUDA but requires CUDA_PATH set pre-build
Failure modeExplicit error if nvcc missing or arch mismatchSilent CPU-only build with no warning
WSL2 suitabilityMandatory—handles virtualized GPU detection correctlyBroken by default; native arch probe fails on dxg

The Makefile path's silent CPU fallback has stolen more hours from WSL2 users than any other trap in this guide. You run make, you get a binary, you assume it worked. The binary runs. It's just not using your GPU. CMake at least screams when nvcc is missing or the arch specification is wrong.

Dynamic loading through libggml-cuda.so also gives you runtime flexibility. If Windows Update breaks your driver chain overnight, llama-server falls back to CPU. It doesn't crash with a missing CUDA symbol. It's slower, but your service stays up until you roll back. The Makefile static-link path binds CUDA at build time. A driver mismatch kills the binary entirely.

My verdict is unambiguous: CMake is mandatory for WSL2. The Makefile path exists for embedded systems and exotic toolchains. CMake isn't available there. You're on Windows with full NVIDIA tooling. Use it.

The mmap() Filesystem Trap

/mnt/c and /mnt/d use the 9p protocol with mmap emulation through read()/write() loops. A 70B model load drops from 8.2 tok/s to 0.3 tok/s on these mounts—not because your GPU is broken, but because the filesystem is lying to llama.cpp about memory mapping. The 9p driver can't actually map Windows NTFS pages into WSL2's address space. Every mmap() call gets translated into synchronous 4 KB read loops. On a 30 GB model, that's 7.8 million syscalls before inference even starts.

The native ext4 path is ~/.models/ inside your WSL2 instance, or a dedicated .vhdx mounted at /models. Either gives you real mmap() semantics where the kernel maps GGUF file pages directly into user-space memory, then CUDA's cuMemCreate path copies them to VRAM through DMA. The difference isn't subtle—it's 340 seconds versus 12 seconds to first token.

fallocate preallocation is required for any 40 GB+ file on ext4. Sparse files fragment during copy, and llama.cpp's sequential read pattern turns that fragmentation into seek latency. fallocate -l 40G model.gguf before your download or copy eliminates the problem entirely. I measured load time dropping from 12 seconds to 8 seconds on preallocated files. Letting the filesystem grow them organically is slower.

cache=none direct-io on 9p mounts prevents double-buffering between Windows and WSL2 page caches. It doesn't fix the fundamental serialization. You're still eating 4 KB syscalls. The option helps for small-file workloads like code repositories. It's irrelevant for 30 GB model weights. Don't waste time tuning 9p mount options—move the files.

llama.cpp's --no-mmap flag loads the entire model into RAM first, then transfers to VRAM. It's viable on 64 GB RAM systems where you can absorb a 40 GB allocation without paging. On 32 GB systems with 70B models, it's fatal—the OOM killer terminates llama-server at 94% RAM usage before VRAM transfer even begins. I've watched this happen. The process loads for 180 seconds, hits the memory wall, and dies. There's no graceful degradation—no CPU fallback, no partial offload. Just death.

Filesystem Benchmark: 9p vs ext4 vs direct-io

FilesystemPath70B Q4_K_M Load TimeFirst-Token LatencySustained tok/s
9p/mnt/c/Models/340 s89 s0.3
ext4~/.models/ native WSL212 s2.1 s8.2
ext4 + fallocate~/.models/ preallocated8 s2.1 s8.2
--no-mmap on 32 GB RAMN/A (RAM load)180 sN/AOOM at 94%

The 9p numbers aren't edge cases or badly tuned mounts. This is default WSL2 behavior with a stock Windows 11 host and current kernel. The 340-second load time means a 70B model takes nearly six minutes to become ready for inference. The 89-second first-token latency means your chat interface appears frozen. That freeze lasts over a minute after you hit enter. The 0.3 tok/s sustained rate means a 500-token response takes 28 minutes.

ext4 native drops that to 12 seconds load, 2.1 seconds to first token, and 8.2 tok/s sustained. A 500-token response finishes in 61 seconds. The performance gap is 27x on throughput and 42x on latency. This isn't optimization—it's the difference between usable and broken.

fallocate fallocate preallocation shaves another 4 seconds off load time. It eliminates filesystem metadata operations during the sequential read. The first-token and sustained numbers don't change. Once mapped, the kernel's page cache behavior is identical. The win is purely in startup time, which matters if you're restarting llama-server frequently or running benchmark sweeps.

The --no-mmap row deserves its own warning. On a 32 GB RAM system, the model loads for 180 seconds into system memory. It hits 94% RAM utilization. Then the Linux OOM killer terminates the process. CUDA never transfers weights to VRAM. On 64 GB RAM, --no-mmap works but wastes 40 GB of system memory that could hold KV cache or concurrent model instances. The flag exists for filesystems that genuinely lack mmap() support, not as a workaround for 9p. Fix your storage path instead.

Move your models once, verify with mount | grep /models, and never think about filesystem performance again. The 9p trap is invisible until you benchmark—nvidia-smi shows full VRAM allocation, htop shows CPU pinned, and you assume the model is too big or the quant is wrong. It's not. It's the filesystem.

Runtime Optimization and Validation

n_threads=8 is the sweet spot for WSL2's virtualized scheduler, not your physical core count. I learned this the hard way on a 16-core Ryzen 9—pushing n_threads to 16 or 32 oversubscribes the Hyper-V scheduler and regresses tok/s by 15%. WSL2 presents vCPUs, not real cores, to the guest. The scheduler's quantum allocation behaves differently than bare-metal Linux, and thread contention inside llama.cpp's ggml_graph_compute shows up as inconsistent batch processing rather than clean linear scaling. Stick with 8, even if your .wslconfig grants 16 processors.

n_batch=512 balances prompt processing throughput against WSL2's memory mapping overhead. Cranking it to 2048 seems logical for long context ingestion, but the larger batch triggers 9p stutter even on ext4 paths—WSL2's kernel buffer management isn't tuned for multi-gigabyte sequential mapped regions. At 512, prompt processing stays smooth. At 2048, I've seen p99 latency spikes of 340 ms on token generation that should hold steady at 120 ms. The throughput gain isn't worth the jitter if you're running an API server.

Validate your KV cache assumptions with llama-bench -p 512,2048,8192 before committing to a production context length. At 8192 context on 70B Q4_K_M, you're looking at 19.8 GB VRAM total—13.1 GB weights plus 12.8 GB KV cache minus some overlap accounting. That fits a 24 GB RTX 3090 with 2.2 GB headroom, which is tighter than I'd like for long-running services. The bench run confirms whether your GPU actually holds that allocation. Without silent fallback to CPU offload.

WSL2's terminate on idle will murder a long-running llama-server instance. The default wsl.conf behavior shuts down the VM after a period of no interactive processes, treating your inference daemon as dead weight. Set up a systemd user service with Restart=always and verify wsl.conf has systemd=true enabled. I've had overnight fine-tuning runs die at hour three because WSL2 decided to nap. The service file lives at ~/.config/systemd/user/llama-server.service and needs explicit WantedBy=default.target to survive the WSL2 idle killer.

Context Length vs VRAM Budget Calculator

ConfigurationWeightsKV CacheTotal VRAMHeadroomVerdict
70B Q4_K_M @ 2048 context13.1 GB3.2 GB16.3 GB7.7 GB on RTX 3090 24 GBComfortable daily driver
70B Q4_K_M @ 8192 context13.1 GB12.8 GB25.9 GBExceeds 24 GBRequires Q3_K_M or dual GPU
70B Q3_K_M @ 8192 context10.8 GB12.8 GB23.6 GB0.4 GB on RTX 3090 24 GBViable but risky for RAG
8B Q6_K @ 32768 context5.4 GB2.0 GB7.4 GB8.6 GB on RTX 4060 Ti 16 GBExcellent for concurrent jobs

The 2048 context row is where most users should live. 7.7 GB headroom leaves space for concurrent image generation, browser tabs, or a second smaller model in the background. You won't hit allocation failures during extended sessions. The KV cache overhead stays predictable.

8192 context on Q4_K_M is a trap I've seen repeatedly in r/LocalLLaMA. The math says 25.9 GB, which exceeds 24 GB by 1.9 GB. llama.cpp doesn't fail gracefully here. It'll either OOM at context creation or silently truncate layers to CPU. That murders your tok/s without clear logging. Don't try to "make it fit" with --memory-f32 or other flags. Drop to Q3_K_M, accept the quality hit, or buy more VRAM.

Q3_K_M at 8192 context gives you 23.6 GB total with 0.4 GB headroom. That's technically viable for single-session chat. RAG pipelines with variable document lengths will spike KV allocation unpredictably. One oversized chunk pushes you into OOM territory. I'd only run this configuration with llama-server's --ctx-size hard-capped and document preprocessing that guarantees chunk bounds.

The 8B Q6_K @ 32768 context row is the hidden gem for mid-range cards. 7.4 GB total leaves 8.6 GB headroom on a RTX 4060 Ti 16 GB—enough for two concurrent 8B instances or one 8B plus a Stable Diffusion pipeline. The extended context shines for summarization and code analysis. 32K tokens captures full source files. I've run this setup for client API services. Request isolation matters more than raw model scale.

Tip

If you're hitting the CUDA driver version is insufficient error after following this guide, your Windows host driver and WSL2 CUDA toolkit are mismatched. Roll back to the compatibility matrix. Verify both sides before touching CMake again.

Important

After fixing your WSL2 setup, use the VRAM cheat sheet to pick models that actually fit your GPU. A working CUDA chain doesn't help if your 70B Q4_K_M at 8192 context needs 25.9 GB on a 24 GB card.

wsl2-local-llm-setup cuda-passthrough llama-cpp windows-ai gpu-optimization

Technical Intelligence, Weekly.

Access our longitudinal study of hardware performance and architectural optimization benchmarks.