
Cloud AI Is a Security Risk. Here's How a Local LLM Setup Changes That.

By Charlotte Stewart · 9 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Three weeks ago, a supply chain attack on LiteLLM — a library with 95 million monthly downloads present in roughly 36% of cloud AI environments — deployed a credential-stealing payload. According to Arctic Wolf's post-incident analysis, 600+ public GitHub repositories were directly impacted, with an estimated 1,000+ enterprise SaaS environments potentially exposed. The attackers didn't compromise OpenAI or Anthropic. They compromised the plumbing between your code and those APIs.

TL;DR: Cloud AI has structural security problems that better passwords don't fix. Data retention, API credential exposure, supply chain vulnerabilities, and legal discovery risk are built into the architecture. A local LLM setup eliminates all of them. A professional-grade build costs $2,100–$4,800 upfront; most teams running serious inference volume break even against cloud API costs within a few months.

That LiteLLM incident isn't a story about one bad library. It's a story about how cloud AI security works: your risk profile is only as good as every upstream dependency used by every developer on your team.

This article maps the full threat model, shows where local infrastructure eliminates each vector, and gives you three specific hardware builds for professional deployment — with honest current prices, not MSRP.


Why Cloud AI Became a Security Liability

The supply chain attack is the headline risk. The data retention problem is quieter and affects everyone.

The LiteLLM Incident: Supply Chain Breaks Everything at Once

On March 24, 2026, attackers targeted LiteLLM with a malicious dependency that harvested API credentials from environment logs and source code. Because LiteLLM is an abstraction layer sitting between application code and cloud AI providers, compromised credentials spanned multiple vendors simultaneously — OpenAI, Anthropic, Google. One attack, every cloud provider you use.

The scale that makes cloud AI convenient — massive ecosystem, broad library support — is exactly what makes supply chain attacks so effective against it.

Your Conversations Stay Longer Than You Think

Cloud providers retain API interactions by default. OpenAI holds API inputs and outputs for 30 days for abuse monitoring, with Zero Data Retention available only to eligible enterprise customers on explicit request. Anthropic reduced their API log retention to 7 days in September 2025 — an improvement, but still 7 days of conversation data on servers you don't control.

Note

"Encrypted in transit" and "we don't train on your API data" are different claims than "your data isn't stored on our servers." Retention for abuse monitoring and compliance auditing means your conversations exist somewhere — accessible to legal processes you won't see coming.

In early 2026, a federal judge ordered OpenAI to hand over 20 million de-identified ChatGPT logs under FRCP 26(b). Users were not notified before the order was issued. Deleted conversations persisted in backups under legal hold. That's not a hypothetical — it happened.


The Full Threat Model: Five Vectors Cloud Can't Fix

How a local LLM deployment answers each vector:

API key exposure: Eliminated — no external credentials
Data retention: Zero — you control deletion
Legal discovery: Data never leaves your hardware
Supply chain blast radius: Isolated to local system
Vendor dependency: None — you own the model and runtime

API Key Exposure: One Leak, Full Access

API keys live in environment variables, CI/CD pipelines, Docker configs, and developer laptops. They're rotated rarely because rotation breaks integrations. A single compromised developer machine hands an attacker functional credentials for every cloud service your codebase touches.

The LiteLLM attack worked because these keys were where they were supposed to be. That's not careless deployment — that's standard practice.
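A quick repository scan illustrates how discoverable these credentials are. The patterns below are illustrative rather than exhaustive; dedicated scanners such as gitleaks ship far broader rule sets:

```python
import re
from pathlib import Path

# Illustrative key-shaped patterns only; real scanners maintain
# much larger, vendor-specific rule sets.
KEY_PATTERNS = {
    "openai": re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    "anthropic": re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),
    "generic": re.compile(r"(?i)api[_-]?key\s*[=:]\s*['\"][^'\"]{16,}['\"]"),
}

def scan_text(text: str) -> list[str]:
    """Return the names of any key patterns found in a blob of text."""
    return [name for name, pat in KEY_PATTERNS.items() if pat.search(text)]

def scan_tree(root: str) -> dict[str, list[str]]:
    """Scan .env, .py and .yml files under root for credential-shaped strings."""
    hits = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if path.suffix in {".py", ".yml", ".yaml"} or path.name == ".env":
            found = scan_text(path.read_text(errors="ignore"))
            if found:
                hits[str(path)] = found
    return hits
```

Running this against a typical service repo is usually sobering: the keys it finds are exactly the ones a supply chain payload harvests.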

Subpoena Risk: Discovery Orders Don't Need Your Permission

Federal Rules of Civil Procedure 26(b) allows courts to compel cloud providers to produce conversation logs. You won't necessarily know it happened. A legal hold preserves data that would otherwise be deleted. Years of API inference data that felt ephemeral can surface in litigation without warning.

For legal, medical, and financial teams handling genuinely sensitive information, that risk profile is hard to accept once you've run the math on what local infrastructure actually costs.


How Local LLM Infrastructure Eliminates Every Vector

Inference on your hardware produces no external network traffic. There's nothing to intercept on a third-party server because no third-party server is involved.

No API keys cross the internet. If a developer's machine is compromised, an attacker gains access to your local subnet — not a credential that works against every cloud provider at once.

Your audit trail lives on your hardware. Every query, every model version, every access event is logged locally — visible to your compliance team, not to a vendor's analytics platform.
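A local audit trail can be as simple as an append-only JSONL file. A minimal sketch with illustrative field names; only a hash of the prompt is stored, so the log can be retained longer than the data it describes:

```python
import json
import hashlib
import time
from pathlib import Path

AUDIT_LOG = Path("inference_audit.jsonl")  # hypothetical log location

def log_inference(user: str, model: str, model_version: str, prompt: str) -> dict:
    """Append one audit record per query; the prompt itself is never stored."""
    record = {
        "ts": time.time(),
        "user": user,
        "model": model,
        "model_version": model_version,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
    }
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```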

Tip

For HIPAA and SOC 2, local infrastructure simplifies vendor assessment requirements significantly. You're auditing hardware and software you own, not sending questionnaires to third parties and reviewing their certifications.

Supply chain risk doesn't vanish — a compromised local library still affects your system — but the blast radius is contained. It affects one machine you can isolate and remediate. It doesn't hand an attacker credentials to your entire cloud infrastructure.


The Real Trade-Off: Upfront Cost vs. Ongoing Risk

Cloud API looks cheap until you run the math at scale.

RTX 4090 Local Build
Upfront hardware: ~$3,100 one-time
Marginal cost per query: ~$0.002 (electricity only)
Comparable annual cloud API spend at production volume: ~$4,000–5,000

Warning

This math favors local at serious production volume — thousands of queries daily. If you're running a few hundred queries per day, cloud API is still cheaper on a raw cost basis. The security argument stands regardless of volume, but the financial argument requires meaningful inference load.
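The break-even point is simple arithmetic. A sketch using the figures above, with the cloud per-query rate left as an assumed input since it varies by model and token mix:

```python
def breakeven_months(hardware_cost: float,
                     queries_per_day: int,
                     cloud_cost_per_query: float,
                     local_cost_per_query: float = 0.002) -> float:
    """Months until cumulative cloud spend exceeds hardware cost plus
    local running cost. Returns float('inf') if local never catches up."""
    monthly_saving = queries_per_day * 30 * (cloud_cost_per_query - local_cost_per_query)
    if monthly_saving <= 0:
        return float("inf")
    return hardware_cost / monthly_saving

# RTX 4090 build at an assumed $0.01/query cloud rate:
# 5,000 queries/day saves ~$1,200/month, so roughly 2.6 months to break even.
# At a few hundred queries/day the saving shrinks and cloud stays cheaper.
```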

The hidden costs of cloud compound over time: compliance overhead for vendor assessments, incident response when API keys are leaked, and liability exposure if a breach involves customer data. Those don't show up in per-token pricing.


Building Your Professional Local AI Setup

Three builds for three use cases. All prices reflect March 2026 street prices — not MSRP, which is largely irrelevant at current GPU inventory levels.

Professional Entry: RTX 5070 Ti — ~$2,100 Total

GPU: RTX 5070 Ti 16 GB VRAM — street price approximately $880–$1,000 as of March 2026 (MSRP is $749, but units are trading significantly above it due to inventory shortages)

Platform: Ryzen 5 7500F, 64 GB DDR5, 1000W PSU, mid-tower case — approximately $1,100–$1,200

What it runs well: Llama 3.1 8B, Qwen 2.5 14B, Mistral 22B — all run fully in 16 GB VRAM. Document summarization, contract review, compliance classification, and coding assistance are comfortable workloads.

What it doesn't: Llama 3.1 70B at Q4_K_M quantization requires approximately 35–40 GB VRAM — well beyond 16 GB. You can run it with CPU layer offloading, but token speeds drop to 3–6 tok/s, which is usable for batch workflows but slow for interactive use.
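A back-of-envelope sizing check makes that offloading threshold concrete. This is a rough sketch (real VRAM use varies with context length and runtime) using an assumed bits-per-weight figure for Q4-class quantization:

```python
def model_vram_gb(params_billion: float, bits_per_weight: float,
                  overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: quantized weights plus a flat allowance
    for KV cache and runtime buffers. Treat as an approximation only."""
    return params_billion * bits_per_weight / 8 + overhead_gb

def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float) -> bool:
    """True if the estimated footprint fits entirely in GPU memory."""
    return model_vram_gb(params_billion, bits_per_weight) <= vram_gb
```

Anything that fails the check runs only with CPU layer offloading, at the reduced token rates described above.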

Best for: Legal teams running document classification and summarization, compliance analysts, solo practitioners who need private AI assistance for sensitive client matters.

Professional Standard: RTX 4090 — ~$3,100 Total

GPU: RTX 4090 24 GB VRAM — used market approximately $2,000–$2,200 on eBay as of March 2026

Platform: Ryzen 7 7700X, 128 GB DDR5, 1200W PSU, full-tower case — approximately $900–$1,000

What it runs well: Models up to ~30B parameters sit entirely within 24 GB VRAM. Qwen 2.5 32B Q4_K_M (~18–20 GB), Mistral 22B, and code-specialist models like DeepSeek Coder 33B are all comfortable. 70B models require CPU layer offloading — expect 7–10 tok/s depending on RAM bandwidth and how many layers land in VRAM. Viable for async document analysis; slower for interactive chat.

Best for: Medical records processing, legal discovery support, financial data analysis where 30B model quality is sufficient and keeping data on your hardware is non-negotiable. See our model selection guide for specific model recommendations by use case.

Power User: Dual GPU — ~$4,500–4,800 Total

GPUs: RTX 4090 24 GB (used, $2,100) + RTX 5080 16 GB ($1,100–$1,400 street price, March 2026) — 40 GB combined VRAM

One critical correction from common advice: Consumer GeForce GPUs, including both of these cards, do not support NVLink. Multi-GPU tensor parallelism runs over PCIe bandwidth via software frameworks — vLLM and DeepSpeed both handle this well in production. It works. It's not as fast as NVLink-connected professional cards, but for inference serving (not training), PCIe throughput is sufficient.
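Frameworks that split a model across unequal cards (llama.cpp's --tensor-split option, for example) take per-GPU proportions. A sketch of deriving a layer split from VRAM sizes, with the rounding remainder assigned to the larger card:

```python
def split_layers(total_layers: int, vram_gb: list[float]) -> list[int]:
    """Apportion transformer layers across GPUs in proportion to VRAM.
    Any remainder from integer rounding goes to the largest card."""
    total_vram = sum(vram_gb)
    shares = [int(total_layers * v / total_vram) for v in vram_gb]
    shares[vram_gb.index(max(vram_gb))] += total_layers - sum(shares)
    return shares

# An 80-layer 70B model across a 24 GB 4090 and a 16 GB 5080
# splits proportionally as [48, 32].
```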

Platform: Ryzen 9 7950X, X670E motherboard, 128 GB DDR5, 2000W PSU, dual-GPU case — approximately $1,300–$1,500

What it runs: 40 GB combined VRAM handles Llama 3.1 70B Q4_K_M (~37–40 GB) entirely in GPU memory via tensor parallelism. Expect 12–18 tok/s on 70B — fast enough for interactive professional use. You can also run parallel inference across two models simultaneously for team-serving workloads.

Best for: Inference serving for a team, multi-model deployment, research workloads requiring 70B inference at interactive speeds. For full deployment stack setup, our Ollama setup guide covers the software layer.


Operational Security: Keeping Your Rig Secure

A local inference rig is only as secure as the network it sits on and the models you load onto it.

Network Isolation and Access Control

Deploy the inference server on a dedicated VLAN with no direct internet routing. Clients connect over TLS internally — treat this like a database server, not a developer tool. No public exposure, authenticated access only. Disable any remote access that isn't strictly required.
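As a small pre-flight guard, the server can refuse to start unless its bind address is loopback or RFC 1918. A sketch (the VLAN itself remains the real control; a wildcard bind is rejected because it exposes every interface):

```python
import ipaddress

def is_private_bind(addr: str) -> bool:
    """True if addr is a loopback or RFC 1918 address, i.e. unreachable
    from the public internet by routing alone."""
    if addr == "0.0.0.0":  # wildcard bind exposes every interface
        return False
    ip = ipaddress.ip_address(addr)
    return ip.is_private or ip.is_loopback
```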

Model Provenance and Supply Chain

Only load models from verified sources — official Hugging Face repositories from the model's original authors, with checksums verified before deployment. Hugging Face hosts community models alongside official ones; confirm the repository owner matches the model's creator (Meta for Llama, Mistral AI for Mistral, and so on) before deploying anything.
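Checksum verification before deployment takes only a few lines. The function names here are illustrative, and the expected digest must come from the model author's published checksums:

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256; model files run to tens of GB,
    so never read them into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model(path: str, expected_sha256: str) -> bool:
    """Compare against the checksum published by the model's original author."""
    return sha256_file(path) == expected_sha256.lower()
```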

Warning

Maintain a model audit log: which versions are running, who deployed them, and when. If a model is later found to contain a backdoor or unsafe behavior, you need to know exactly which queries ran on which version and when to assess exposure.

Credential Management

No credentials stored in .env files. Use HashiCorp Vault or Kubernetes sealed secrets for anything touching your inference server. Rotate access tokens quarterly, audit access logs monthly. If a token is compromised, you should be able to revoke one service's access without touching the rest of your infrastructure. For a detailed cost comparison including compliance overhead, see our cloud API vs. local cost breakdown.


FAQ

Can we meet HIPAA compliance with a local LLM setup? Local infrastructure gives you what HIPAA requires: data control, full audit trails, and your own deletion policies — no third-party BAAs for the AI layer. Compliance certification still requires auditing your full stack (access controls, encryption at rest, incident response processes), but the AI component becomes an asset rather than a vendor risk.

What about uptime — shouldn't we rely on cloud redundancy? You can cluster local nodes for redundancy. A second inference node at the Professional Entry tier costs $2,100 — compare that to 12 months of enterprise cloud API at serious volume. A practical hybrid architecture: local for sensitive inference, cloud for non-critical tasks. You get redundancy where it matters without the blanket exposure of cloud-only.
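That hybrid split reduces to a routing rule. A minimal sketch, assuming queries carry sensitivity tags from your own taxonomy (the tag set below is an example, not a standard):

```python
# Example taxonomy; define your own per compliance requirements.
SENSITIVE_TAGS = {"phi", "pii", "privileged", "financial"}

def route(query_tags: set[str]) -> str:
    """Send anything tagged sensitive to the local endpoint;
    everything else may use cloud capacity."""
    return "local" if query_tags & SENSITIVE_TAGS else "cloud"
```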

What if a model we're running gets compromised? You pull it down in seconds. Your audit logs show exactly which queries ran on the compromised version, when, and from which client. With cloud, you're waiting on vendor communications and have no direct visibility into what happened to your data during the incident window.

Do consumer GPUs have the reliability needed for professional deployments? For inference workloads, yes. What consumer GPUs lack is ECC memory (reducing but not eliminating memory error risk), NVLink (capping multi-GPU bandwidth to PCIe speeds), and formal support SLAs. For most professional teams deploying local AI primarily for data control reasons, consumer hardware is the right trade-off. At scale — serving a large team, running continuous inference — NVIDIA's professional line becomes worth the premium.


CraftRigs Take: When Local Is Non-Negotiable

Cloud AI is the right default for non-sensitive workloads. For legal, medical, and financial teams handling data they can't afford to have subpoenaed, leaked, or retained on someone else's servers, the calculus is different.

The LiteLLM attack didn't require cracking a cloud provider's core infrastructure. It required compromising one widely-used dependency. That attack surface only grows as AI API adoption expands. Every new integration is another link in the chain.

Local infrastructure isn't a statement about trusting AI companies. It's a statement about keeping sensitive workloads in a threat model you can reason about, defend, and actually audit. For the use cases where that matters — and there are many — the hardware investment is straightforward. Check our hardware upgrade ladder if you're mapping a path from entry level to production-grade deployment.

local-llm security privacy professional build-guide
