TL;DR: Your GPU running at 83C during inference isn't going to die tomorrow, but it will degrade faster, throttle more, and cost you tokens per second. The single best thing you can do is undervolt: it's free, takes 10 minutes, and typically drops temps 8-15C with little to no performance loss (0-2% in the RTX 4090 example below). After that: fix your fan curve, consider a repaste if your card is over a year old, and make sure your case airflow isn't fighting you.
Why GPU Temps Matter More for LLM Rigs
Gaming GPUs hit peak temps for a few hours, then cool off. An LLM inference GPU might sit at sustained load for 12+ hours a day. That sustained heat does real damage over time:
- Thermal throttling kicks in earlier than you think. Most NVIDIA cards start reducing clock speeds above 80C. That directly reduces your inference speed.
- VRAM runs hotter than the GPU die. On the RTX 4090, GDDR6X memory can hit 100C+ even when the GPU core reads 75C. VRAM at high temps degrades and can cause memory errors.
- Capacitors and VRMs age faster. Every 10C increase roughly halves the expected lifespan of electrolytic capacitors on the board. At 85C sustained, you're accelerating wear significantly.
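To put the 10C rule of thumb into numbers, here is a minimal sketch. The 65C reference point is an assumption chosen for illustration, not a manufacturer rating, and the halving rule itself is only an approximation.

```python
def relative_lifespan(temp_c: float, reference_c: float = 65.0) -> float:
    """Expected capacitor lifespan relative to running at reference_c.

    Applies the rule of thumb that every +10C halves the lifespan.
    reference_c is an illustrative assumption, not a datasheet value.
    """
    return 2.0 ** ((reference_c - temp_c) / 10.0)


for temp in (65, 75, 85):
    print(f"{temp}C: {relative_lifespan(temp):.2f}x the lifespan at 65C")
```

Under this rule, a board held at 85C gets about a quarter of the capacitor life it would get at 65C.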
Target temperatures for 24/7 operation:
- GPU core: under 75C (ideal), under 80C (acceptable)
- VRAM (GDDR6X): under 90C (ideal), under 100C (acceptable)
- Hotspot: under 85C (ideal), under 90C (acceptable)
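A small sketch of how those targets could be checked in a script. The thresholds are copied from the list above; the function names are made up for this example. On many consumer cards nvidia-smi reports only the core temperature, so hotspot and VRAM readings usually have to come from a tool such as HWiNFO or GPU-Z and be passed in by hand.

```python
import subprocess

# (ideal, acceptable) ceilings in C, copied from the targets above
LIMITS = {"core": (75, 80), "vram": (90, 100), "hotspot": (85, 90)}


def classify(sensor: str, temp_c: float) -> str:
    """Return 'ideal', 'acceptable', or 'too hot' for one reading."""
    ideal, acceptable = LIMITS[sensor]
    if temp_c < ideal:
        return "ideal"
    if temp_c < acceptable:
        return "acceptable"
    return "too hot"


def read_core_temps() -> list[int]:
    """Core temperature of each GPU via nvidia-smi (needs the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]


# Example on a machine with an NVIDIA GPU:
#   for temp in read_core_temps():
#       print(temp, classify("core", temp))
```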
Step 1: Undervolt (Biggest Impact, Zero Cost)
Undervolting lowers the voltage your GPU runs at while holding the same, or nearly the same, clock speed. Less voltage means less heat, which means less throttling, which often means better sustained performance than stock. It sounds too good to be true, but GPUs ship with more voltage than most chips need, as a safety margin that covers the weakest silicon. You're just trimming the excess.
How to undervolt with MSI Afterburner (Windows):
- Open MSI Afterburner and press Ctrl+F to open the voltage/frequency curve editor.
- Find your GPU's current peak frequency (typically 2500-2700 MHz on a 4090 at stock).
- Click the point at a lower voltage — start with 900mV — and drag it up to your target frequency.
- Select the points to the right of 900mV and drag them down to or below your target frequency so the curve is flat from 900mV onward. The card then has no reason to climb to a higher voltage, so it tops out at your target clock at 900mV.
- Hit Apply and run a stress test (FurMark or just run inference for 30 minutes).
- If it's stable, you're done. If you get crashes or artifacts, increase voltage by 12.5mV (try 912mV, then 925mV) until stable.
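The step-up procedure in the last two bullets is a simple linear search. In the sketch below, is_stable is a hypothetical stand-in for the manual part (set the curve point in Afterburner, run the stress test, report whether it passed). Nothing here talks to Afterburner itself.

```python
from typing import Callable, Optional


def find_stable_voltage(
    is_stable: Callable[[float], bool],
    start_mv: float = 900.0,
    step_mv: float = 12.5,
    max_mv: float = 1050.0,
) -> Optional[float]:
    """Walk upward from start_mv in step_mv increments until a test passes.

    is_stable is supplied by the caller; here it represents a manual
    stress-test run at the given voltage. Returns None if nothing up to
    max_mv (assumed to be the stock voltage) is stable.
    """
    mv = start_mv
    while mv <= max_mv:
        if is_stable(mv):
            return mv
        mv += step_mv
    return None


# Simulated card that only holds the target clock at 925mV or more
print(find_stable_voltage(lambda mv: mv >= 925.0))
```

Writing down each voltage you tried and whether it passed saves repeating a 30-minute test you have already run.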
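On Linux: The NVIDIA driver doesn't expose a voltage/frequency curve editor, so there's no direct equivalent. The usual approximation is to cap the power limit and lock the maximum clock with nvidia-smi, then add a positive clock offset with nvidia-settings or GreenWithEnvy so the card reaches your target frequency at a lower point on its stock voltage curve.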
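A minimal sketch of the nvidia-smi route on Linux. It builds the commands for a power cap and a locked clock range and, by default, only prints them. The 260 W and 2550 MHz values come from the RTX 4090 figures in this section; the 210 MHz floor is an assumption about idle clocks, so check what your card actually supports. The commands need root, and they cap power and clocks rather than setting a voltage directly.

```python
import subprocess


def undervolt_commands(gpu: int, power_limit_w: int,
                       min_mhz: int, max_mhz: int) -> list[list[str]]:
    """Build nvidia-smi commands: persistence mode, power cap, clock lock."""
    idx = str(gpu)
    return [
        ["nvidia-smi", "-i", idx, "-pm", "1"],                      # keep settings loaded
        ["nvidia-smi", "-i", idx, "-pl", str(power_limit_w)],       # power limit in watts
        ["nvidia-smi", "-i", idx, "-lgc", f"{min_mhz},{max_mhz}"],  # lock GPU clock range
    ]


def apply(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False (needs root)."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


apply(undervolt_commands(gpu=0, power_limit_w=260, min_mhz=210, max_mhz=2550))
```

To undo the clock lock, nvidia-smi -rgc resets the GPU clocks to their defaults.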
Expected results on an RTX 4090:
- Stock: ~2600 MHz at 1050mV, 320W power draw, 80-85C sustained
- Undervolted: ~2550 MHz at 900mV, 260W power draw, 68-73C sustained
- Performance loss: 0-2% (often 0%, because less throttling compensates for the slight clock reduction)
That's a roughly 12C drop for essentially free. Do this first. Do it today.
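As a sanity check on those numbers, a first-order model says dynamic power scales with voltage squared times frequency. The sketch below applies that approximation to the stock and undervolted figures above. Treat it as a rough estimate under that assumption, not a measurement.

```python
def scaled_power(p_stock_w: float, v_stock_mv: float, f_stock_mhz: float,
                 v_new_mv: float, f_new_mhz: float) -> float:
    """Scale stock power by (V_new / V_stock)^2 * (f_new / f_stock).

    First-order dynamic-power approximation only; it ignores static
    leakage and everything on the board that does not follow core voltage.
    """
    return p_stock_w * (v_new_mv / v_stock_mv) ** 2 * (f_new_mhz / f_stock_mhz)


predicted = scaled_power(320, 1050, 2600, 900, 2550)
measured_saving = (320 - 260) / 320

print(f"predicted by V^2 * f scaling: {predicted:.1f} W")
print(f"saving implied by the figures above: {measured_saving:.1%}")
```

The model lands near 230 W, lower than the 260 W quoted above. That gap is expected, since memory, fans, and leakage don't shrink with core voltage, but the direction and rough size of the saving match.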
Step 2: Fix Your Fan Curve
Stock fan curves are tuned for noise, not thermals. Manufacturers assume gamers want a quiet card more than a cool one. For a 24/7 inference rig, you want the opposite — keep it cool, and if it's in a closet or under a desk, noise doesn't matter.
Recommended fan curve for LLM workloads (MSI Afterburner):
- Below 40C: 30% fan speed (keeps it quiet at idle)
- 40-55C: 45% fan speed
- 55-65C: 60% fan speed
- 65-75C: 75% fan speed
- Above 75C: 100% fan speed
The key difference from stock: fans spin up earlier and faster. Stock curves often don't hit 60% until the GPU is already at 75C, by which point you're playing catch-up. Starting the ramp at 40C keeps temps from ever spiking.
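The curve above can be sketched as a lookup table. This version is a plain step function, and a temperature that sits exactly on a boundary is assigned to the higher band, which is an assumption on the cautious side. Afterburner's own curve editor draws ramps between points rather than steps.

```python
import bisect

# Upper edge of each temperature band in C, and the fan % used in each band
BAND_LIMITS_C = [40, 55, 65, 75]
FAN_PERCENT = [30, 45, 60, 75, 100]


def fan_percent(temp_c: float) -> int:
    """Map a core temperature to a fan speed using the bands above."""
    return FAN_PERCENT[bisect.bisect_right(BAND_LIMITS_C, temp_c)]


for temp in (35, 50, 60, 70, 80):
    print(f"{temp}C -> {fan_percent(temp)}%")
```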
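On Linux: Use nvidia-settings with the "Coolbits" Xorg option enabled, or nvfancontrol to run a curve as a background daemon. Both talk to the driver through X, so a headless box generally still needs an X server running. Set Option "Coolbits" "28" in the Device section of your xorg.conf to unlock manual fan control. The value 4 is the bit that enables fan control; 28 additionally unlocks clock offsets and overvoltage.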
Make it persistent. MSI Afterburner can start with Windows and apply your profile automatically. On Linux, add your fan curve script to your startup. Don't rely on manually setting it every reboot — you'll forget, and your GPU will cook during an overnight job.
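One way such a Linux startup script might look, as a sketch. It assumes Coolbits is enabled and an X session is available to nvidia-settings, and that the card exposes fan controllers 0 and 1, which varies by model. The curve argument is any function that maps a temperature to a fan percentage, for example a step function built from the table in this section.

```python
import subprocess
import time
from typing import Callable, Sequence


def fan_command(gpu: int, fan: int, percent: int) -> list[str]:
    """nvidia-settings call that enables manual control and sets one fan."""
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu}]/GPUFanControlState=1",
        "-a", f"[fan:{fan}]/GPUTargetFanSpeed={percent}",
    ]


def read_core_temp(gpu: int = 0) -> int:
    """Core temperature in C for one GPU, read through nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu), "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())


def control_loop(curve: Callable[[float], int], gpu: int = 0,
                 fans: Sequence[int] = (0, 1), interval_s: float = 5.0) -> None:
    """Poll the temperature and update the fans only when the target changes."""
    last = None
    while True:
        target = curve(read_core_temp(gpu))
        if target != last:
            for fan in fans:
                subprocess.run(fan_command(gpu, fan, target), check=True)
            last = target
        time.sleep(interval_s)


# Start from a systemd unit or login script, for example:
#   control_loop(lambda t: 100 if t >= 75 else 60)
```

Skipping the nvidia-settings call when the target hasn't changed avoids needless calls every few seconds.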
Step 3: Repaste the Thermal Compound
If your GPU is more than 12-18 months old, the factory thermal paste has likely dried out somewhat. Repasting with a good compound can drop core temps 3-8C. For a brand-new card, this isn't worth voiding your warranty over. For a card out of warranty, it's a no-brainer.
Recommended thermal pastes:
- Thermal Grizzly Kryonaut: Best non-conductive paste. Won't degrade for 2-3 years. Around $12 for enough to do multiple applications.
- Noctua NT-H2: Slightly easier to apply, nearly as good thermally. Around $10.
- Avoid liquid metal unless you know exactly what you're doing. It's electrically conductive — one drip on a capacitor and your GPU is dead.
The process (simplified):
- Remove the backplate screws (typically 4 around the GPU die, plus several around the PCB)
- Carefully separate the heatsink from the PCB
- Clean old paste with isopropyl alcohol (90%+) and a lint-free cloth
- Apply new paste — a pea-sized dot in the center of the GPU die
- Reassemble and tighten screws in a cross pattern
Budget 30-45 minutes for your first time. Watch a teardown video for your specific card model before starting — screw locations and cable connectors vary between manufacturers.
Step 4: Thermal Pad Replacement (VRAM Cooling)
This is specifically for VRAM temps. The stock thermal pads connecting GDDR6X chips to the heatsink are often mediocre. Replacing them with higher-quality pads can drop VRAM temps 10-15C.
Recommended pads:
- Thermalright Odyssey: Great balance of thermal performance and price. Around $8-12 per pack.
- Gelid GP-Ultimate: Premium option, slightly better thermal transfer. Around $15 per pack.
Important: You need the correct pad thickness for your card. This varies by manufacturer and model. Too thin = poor contact with the memory chips. Too thick = the heatsink can't seat fully on the GPU die, so core temps go up. Check your specific card's teardown for the right thickness (typically 1.0mm, 1.5mm, or 2.0mm depending on the location).
Only do this if your VRAM temps are consistently above 100C. If they're in the 85-95C range, the other steps (undervolting + fan curve) will likely bring them into the safe zone without opening the card.
Step 5: Case Airflow (The Multiplier)
All the GPU cooling mods in the world won't help if your case is recirculating hot air. For LLM rigs, especially multi-GPU builds:
- Front intake fans are mandatory. At least two 140mm fans pulling cool air directly toward the GPU(s).
- Rear and top exhaust. Hot air rises. Let it leave. One rear 120/140mm and one or two top exhaust fans.
- Remove unused drive cages and slot covers. Anything blocking the path from intake to GPU to exhaust is costing you degrees.
- Cable management matters here. A rat's nest of cables in front of your GPU chokes airflow. Route cables behind the motherboard tray.
For multi-GPU setups, check our best cases guide — standard gaming cases often don't have enough slot spacing or airflow for two cards under sustained load.
The Priority Order
If you're going to do just one thing: undervolt. It's free, reversible, and gives the biggest improvement.
If you're going to do two things: undervolt + fix your fan curve. Together, these typically drop sustained temps 15-20C with minimal noise impact.
If you're going all-in: undervolt, fan curve, repaste, thermal pads, and optimize case airflow. This is the "my rig runs 24/7 and I want it to last 5+ years" approach.
What NOT to bother with:
- Aftermarket GPU coolers (Raijintek Morpheus, etc.) — great products, but the effort and cost only make sense for extreme overclocking, not inference workloads where undervolting is the better approach.
- Water blocks and custom loops — massive overkill for inference. The money is better spent on more VRAM.
For the complete build perspective, see our ultimate hardware guide or the budget tier guide to make sure your cooling budget is proportional to your overall build.