CraftRigs
Architecture Guide

Running Vision Models Locally on Mac: What Works and What Doesn't

By Ellie Garcia · 5 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: Vision models work on Mac with 16GB+ unified memory, but they use significantly more memory than text-only models. LLaVA and Moondream are your practical daily options. Llama 4 Scout is impressive but needs 48GB at minimum and realistically 64GB to run comfortably.

Multimodal AI — models that handle both text and images — is the direction everything is moving. The ability to point a model at an image and ask questions about it is genuinely useful: analyzing screenshots, describing photos, reading documents, OCR-style extraction. Apple Silicon handles this better than most people expect, with some real constraints you need to know before you start.


How Vision Models Work (And Why They Need More Memory)

A vision model adds an image encoder to a standard language model. When you send an image, the encoder converts it into a sequence of tokens (sometimes thousands of tokens representing the image), then those tokens get fed into the language model alongside your text prompt.
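The token math is easy to make concrete. LLaVA 1.5, for example, uses a CLIP ViT-L/14 encoder at 336x336 resolution: the image is split into 14x14-pixel patches and each patch becomes one token. A quick sketch (the resolutions here match LLaVA 1.5; other models differ):

```python
def image_token_count(image_size: int, patch_size: int) -> int:
    """Number of visual tokens a ViT-style encoder emits:
    one token per (patch_size x patch_size) patch."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2

# LLaVA 1.5: CLIP ViT-L/14 at 336x336 -> 24x24 patches = 576 tokens per image
print(image_token_count(336, 14))  # 576

# A tiled high-res model that processes a 672x672 input as four 336x336
# tiles plus one downscaled overview emits roughly 5x as many tokens.
print(5 * image_token_count(336, 14))  # 2880
```

Every image you attach spends hundreds to thousands of context tokens before your text prompt even starts.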

The implication: vision models need more memory than their text-only equivalents because:

  • The image encoder itself takes memory
  • Image tokens add to the effective context length dramatically
  • Some models encode images at multiple resolutions simultaneously

A vision-capable model that's nominally 7B parameters might use 10–12GB of unified memory in practice because of the encoder overhead and the extra context the image tokens consume. Plan accordingly.
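The bullets above can be turned into back-of-the-envelope arithmetic. Every figure below is an illustrative assumption, not a measurement: Q8 weights (~1 byte per parameter), an ~0.6GB fp16 CLIP-style encoder, a five-tile high-resolution encoding of ~2,880 image tokens, and ~0.5MB of fp16 KV cache per token (roughly what a 32-layer, 4096-dim model needs):

```python
def vision_model_memory_gb(params_b: float, bytes_per_param: float,
                           encoder_gb: float, image_tokens: int,
                           kv_bytes_per_token: int) -> float:
    """Rough unified-memory estimate for a quantized vision model."""
    weights = params_b * bytes_per_param                 # language model weights, GB
    kv_cache = image_tokens * kv_bytes_per_token / 1e9   # KV cache for image tokens, GB
    return weights + encoder_gb + kv_cache

# Assumed figures: 7B model at Q8, ~0.6GB encoder, five-tile encoding
# (2,880 image tokens), ~0.5MB of fp16 KV cache per token.
est = vision_model_memory_gb(7, 1.0, 0.6, 2880, 524_288)
print(round(est, 1))  # 9.1
```

That ~9GB is before activation buffers, runtime overhead, and OS headroom, which is how a "7B" vision model ends up in the 10–12GB range.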


Vision Models That Work Well on Apple Silicon

Moondream (1.8B parameters):

  • Memory requirement: ~2–3GB unified memory
  • Speed: very fast, even on M1 with 8GB
  • Use case: simple image descriptions, basic captioning, answering questions about images
  • Limitations: small model means limited reasoning ability. It can describe what's in an image but struggles with complex analysis.
  • Best for: quick image tagging, generating alt text, basic visual Q&A
  • Runs in: llama.cpp (GGUF available), Python via Hugging Face, Ollama

LLaVA 7B (7B language model + CLIP vision encoder):

  • Memory requirement: ~8–10GB unified memory
  • Speed: ~15–30 T/s on M3 Pro, adequate for interactive use
  • Use case: image description, visual question answering, reading text in images
  • Limitations: image understanding quality is noticeably below modern cloud models like GPT-4o or Claude. Good enough for many practical tasks, not for complex analysis.
  • Best for: analyzing screenshots, describing photos for accessibility, reading signage
  • Runs in: Ollama (ollama pull llava), llama.cpp with vision support

LLaVA 13B:

  • Memory requirement: ~14–16GB unified memory
  • Speed: ~10–18 T/s on M3 Max 36GB
  • Better reasoning about images than 7B variant
  • 16GB is the absolute floor, and only with a tight quantization and nothing else running; 24GB is comfortable

Llama 4 Scout (vision-capable, mixture-of-experts):

  • Architecture: ~109B total parameters with ~17B active per token, so it generates faster than its memory footprint suggests
  • Memory requirement: roughly 55–65GB unified memory at Q4 quantization; aggressive 2–3-bit quants can squeeze into 48GB with a quality penalty
  • Speed on M4 Max: usable interactive speeds, since only ~17B parameters are read per token; verify against current benchmarks for your quantization
  • Arguably the strongest open-weights vision model runnable locally as of early 2026
  • Significantly better image reasoning than LLaVA
  • Realistically requires a 64GB Mac; 48GB is the floor, and only with heavy quantization
  • Best for: complex image analysis, reading dense documents, multimodal research tasks
  • Note: Llama 4 Scout shipped in April 2025. Treat the memory and speed figures above as estimates, and check community benchmarks for the exact quantization you plan to run.

What Vision Models Are Actually Good For

On Mac, with adequate memory, local vision models genuinely work for:

  • Screenshot analysis: Paste a UI screenshot, ask "what's on this screen?" or "what error is shown here?" — works well even at 7B level.
  • Alt text generation: Feed images through Moondream or LLaVA to auto-generate descriptions for accessibility.
  • Basic OCR: LLaVA and larger models can read text in images reasonably well — menus, signs, scanned documents. Not perfect, but often good enough.
  • Image categorization: Classifying or labeling images in bulk by routing through a vision model API locally.
  • Multimodal chat: Uploading photos to ask "what is in this image?" or "what's wrong with this code in the screenshot?" in an Open WebUI session.

What Doesn't Work Well (Honest Limitations)

Vision models running locally on Mac have real limitations compared to cloud services:

  • Detail in complex images: Cloud models like GPT-4o and Claude have larger, more sophisticated vision encoders. For complex diagrams, dense charts, or technical drawings, the quality gap is significant.
  • Multi-image conversations: Most local vision models handle one image at a time. Sending multiple images in a single conversation often either doesn't work or confuses the model.
  • Video: No local vision models reliably handle video as of early 2026. The practical workaround is extracting individual frames (e.g., with ffmpeg) and analyzing them as static images.
  • High-resolution detail: Older vision models (LLaVA 1.5) used 336x336 pixel image inputs. Newer models (LLaVA 1.6, Llama 4 Scout, and other 2024–2026 models) support tiled high-resolution inputs at 672x672 or higher. The 336x336 limit no longer applies to modern vision models, but fine detail in very high-res images may still be lost depending on the model's tile configuration.
  • Speed: Even on fast Mac hardware, vision inference is slower than text-only inference because of the image encoding step. A 7B LLaVA model is noticeably slower per query than a 7B text model.

Practical Setup on Mac

Easiest path (Ollama):

ollama pull llava
ollama run llava

Then in Open WebUI, you can attach images to prompts. Works out of the box.
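The same LLaVA setup is also scriptable through Ollama's HTTP API, which accepts base64-encoded images in the images field of /api/generate. A minimal sketch using only the standard library; it assumes Ollama is running on its default port with the llava model pulled, and the build_request and describe helper names are ours, not Ollama's:

```python
import base64
import json
from urllib import request

def build_request(model: str, prompt: str, image_path: str) -> bytes:
    """Build an Ollama /api/generate payload with one attached image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [image_b64],  # Ollama expects raw base64 strings, no data: prefix
        "stream": False,        # return one JSON object instead of a stream
    }).encode("utf-8")

def describe(image_path: str) -> str:
    """Send an image to a locally running Ollama instance and return the answer."""
    body = build_request("llava", "What is in this image?", image_path)
    req = request.Request("http://localhost:11434/api/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

This is the pattern behind the bulk image-categorization use case above: loop describe() over a folder of files and collect the responses.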

LLaVA via llama.cpp: Download the appropriate GGUF model and its multimodal projector (mmproj) file from Hugging Face (both files are required), then run the multimodal CLI:

./llama-mtmd-cli -m llava-7b-q4.gguf --mmproj mmproj-model-f16.gguf --image yourimage.jpg -p "What is in this image?"

(Older llama.cpp builds shipped this tool as llama-llava-cli; recent builds consolidate vision support under llama-mtmd-cli.)

MLX for vision models: MLX support for vision models comes via the mlx-vlm package, which covers LLaVA and some Llama 4 variants. Check mlx-community on Hugging Face for MLX-formatted weights. For supported models, MLX-based vision inference is often faster than llama.cpp's Metal backend.


Memory Recommendations by Use Case

  • Just want to try vision AI, have 16GB Mac: LLaVA 7B via Ollama. Works, limited.
  • Practical daily image analysis, 24GB Mac: LLaVA 13B or similar. Better reasoning.
  • Serious multimodal work, 48GB+ Mac: Llama 4 Scout or LLaVA 34B class models. Noticeably better.
  • Cloud is still required: High-stakes image analysis, complex diagrams, legal or medical document reading — use GPT-4o or Claude. Local models aren't there yet for precision work.
