WDDM
Windows Display Driver Model — the GPU driver framework Windows uses to share and manage graphics memory, which on WSL2 reserves a slice of your VRAM before any LLM ever loads.
WDDM (Windows Display Driver Model) is the driver architecture Windows uses to virtualize and arbitrate GPU resources across processes. For local AI builders, it matters because WDDM silently reserves a chunk of your VRAM for desktop composition, driver overhead, and paging structures — VRAM you cannot reclaim for model weights.
How WDDM Eats Your VRAM
Under WSL2, the GPU is exposed through a paravirtualized WDDM path rather than passed through directly. That virtualization layer, plus the standard Windows display reservation, hides roughly 8–15% of total VRAM before your inference runtime starts. On a 24 GB RTX 4090 that leaves roughly 20.4 to 22.1 GB usable; on a 16 GB card the same range leaves 13.6 to 14.7 GB, and you can drop below 13 GB effective once the desktop, browser, and other CUDA contexts take their cut.
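The arithmetic behind those figures is simple enough to script. The following is a minimal sketch, assuming the 8–15% reservation range quoted above holds for your card; the function names are hypothetical and nothing here queries the driver.

```python
def usable_vram_gb(total_gb: float, reserved_fraction: float) -> float:
    """Return the VRAM left after a fractional reservation is subtracted."""
    if not 0.0 <= reserved_fraction < 1.0:
        raise ValueError("reserved_fraction must be in [0, 1)")
    return total_gb * (1.0 - reserved_fraction)


def usable_vram_range_gb(total_gb: float, low: float = 0.08, high: float = 0.15):
    """Return (worst_case, best_case) usable VRAM for an assumed reservation range.

    The 0.08 and 0.15 defaults are the rough range quoted in the text,
    not values measured on any particular system.
    """
    return usable_vram_gb(total_gb, high), usable_vram_gb(total_gb, low)


if __name__ == "__main__":
    for card_gb in (24, 16):
        worst, best = usable_vram_range_gb(card_gb)
        print(f"{card_gb} GB card: {worst:.1f} to {best:.1f} GB usable")
```

Planning against the worst-case number is the safer default, since other CUDA contexts and the desktop can push the real figure lower still.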
WSL2 vs WSL1 Tradeoff
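WSL1 is sometimes floated as a way around WDDM virtualization, but it is not a real alternative for inference: GPU compute (CUDA) in WSL is only supported on WSL2, so WSL1 gives you no GPU path at all. It also lacks systemd and Docker Desktop's WSL 2 backend, and its Linux filesystem I/O is slower for model loading. On WSL2, the memory settings in .wslconfig govern the VM's system RAM, not VRAM, and disabling GPU support there removes the GPU rather than freeing memory, so they help only indirectly. What does recover headroom is closing GPU-hungry Windows applications before you launch; an explicit CUDA_VISIBLE_DEVICES mapping then ensures the model lands on the card you intended, all while keeping the modern toolchain. Bare-metal Linux still wins on absolute VRAM headroom, but the gap narrows once WDDM overhead is tuned.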
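An explicit CUDA_VISIBLE_DEVICES mapping can be sketched as follows. This is a minimal example, assuming you launch your inference runtime as a child process; the helper name pinned_cuda_env is hypothetical. CUDA_DEVICE_ORDER and CUDA_VISIBLE_DEVICES are standard CUDA environment variables and must be set before the CUDA runtime initializes in the target process.

```python
import os


def pinned_cuda_env(device_indices, base_env=None):
    """Return a copy of the environment with CUDA device order and visibility pinned.

    device_indices: GPU indices in the order the child process should see them.
    base_env: environment to copy; defaults to the current process environment.
    """
    if not device_indices:
        raise ValueError("device_indices must not be empty")
    env = dict(os.environ if base_env is None else base_env)
    # PCI_BUS_ID makes CUDA enumerate GPUs in bus order (matching nvidia-smi)
    # instead of the default fastest-first ordering.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Only the listed devices are visible, renumbered from 0 in the given order.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_indices)
    return env


# Usage sketch (the command is a placeholder for your own runtime):
#   import subprocess
#   subprocess.run(["your-inference-command"], env=pinned_cuda_env([0]))
```

Building the environment for a child process, rather than mutating os.environ after imports, avoids the common mistake of setting the variables too late for them to take effect.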
Why It Matters for Local AI
A 7B model at Q4 fits almost anywhere, but a 13B at Q5 or a 34B at Q4 sits right at the edge of consumer VRAM, which is exactly where the WDDM tax decides whether your model loads or runs out of memory (OOM). If you're sizing a build around a specific quantization and context length, budget the WDDM reservation up front rather than discovering it when llama.cpp aborts at 95% load. For multi-GPU rigs, the reservation applies to every card, so the total loss grows with the card count, which is why pinning CUDA device order is essential on Windows hosts.
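A rough fit check makes the budgeting concrete. The sketch below rests on loud assumptions: the bits-per-weight figures are approximate placeholders for Q4 and Q5 quantizations rather than exact values for any specific format, the 2 GB overhead for KV cache and runtime buffers is a guess that depends heavily on context length, and all sizes are decimal GB. The function names are hypothetical.

```python
# Assumed average bits per weight; placeholders, not measured values.
ASSUMED_BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB: billions of params * bits per weight / 8."""
    return params_billion * ASSUMED_BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, total_vram_gb: float,
         reserved_fraction: float = 0.15, overhead_gb: float = 2.0) -> bool:
    """Check whether weights plus an assumed overhead fit in post-reservation VRAM."""
    usable_gb = total_vram_gb * (1.0 - reserved_fraction)
    return weights_gb(params_billion, quant) + overhead_gb <= usable_gb


if __name__ == "__main__":
    # A 34B model at Q4 on a 24 GB card, at both ends of the reservation range.
    for reserved in (0.08, 0.15):
        verdict = "fits" if fits(34, "Q4", 24, reserved_fraction=reserved) else "does not fit"
        print(f"34B Q4 on 24 GB with {reserved:.0%} reserved: {verdict}")
```

Under these assumptions the 34B case fits at an 8% reservation and fails at 15%, which is the kind of edge case the reservation decides. Treat the output as a planning aid only and confirm against the actual file size of your quantized model.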