Multimodal
AI models that process and generate more than one type of data — typically text plus images, audio, or video.
A multimodal model can understand and work with multiple types of input — the most common combination being text and images. A multimodal LLM can analyze an image, answer questions about it, describe its contents, or use it as context for a text response. Some models extend this to audio input/output, video, or structured data.
Popular Local Multimodal Models
Several multimodal models run well on consumer hardware:
- LLaVA (Large Language and Vision Assistant) — one of the first capable open-source multimodal models, available in sizes from 7B to 34B
- LLaVA-NeXT (LLaVA 1.6) — improved version with better image understanding
- Moondream — extremely small (1.8B) vision model that runs on minimal hardware
- Qwen-VL — strong multimodal performance across multiple sizes
- Pixtral — Mistral's multimodal model with strong document understanding
Ollama and LM Studio support several of these natively.
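As a concrete sketch of how these tools accept images: Ollama's `/api/generate` endpoint takes an `images` list of base64-encoded strings alongside the text prompt. The helper below builds that request payload; the model name `llava` assumes you have pulled a vision model, and the endpoint URL assumes Ollama's default port.

```python
import base64
import json
import urllib.request

def build_vision_request(model, prompt, image_bytes):
    """Build a JSON payload for Ollama's /api/generate endpoint.

    Images are passed as base64-encoded strings in the "images" list,
    alongside the ordinary text prompt.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def send_request(payload, url="http://localhost:11434/api/generate"):
    """Send the payload to a running Ollama server and return the reply text.

    Requires Ollama running locally with the named vision model pulled.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Build (but don't send) a request describing an image file:
payload = build_vision_request("llava", "Describe this image.", b"\x89PNG...")
```

The payload-building step is separated from the network call so you can inspect or log the request before sending it.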
VRAM Requirements for Vision Models
Multimodal models need more VRAM than their text-only counterparts. The vision encoder (the component that processes images) adds overhead on top of the base LLM. A LLaVA 7B model needs roughly 8–10GB, compared to ~5GB for a text-only 7B at the same quantization.
High-resolution image processing increases VRAM requirements further. A 4K image split into many tiles requires significantly more processing than a 512×512 input.
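The arithmetic behind those figures can be sketched as weights plus vision-encoder overhead plus KV cache. The overhead and cache values below are assumed ballpark numbers, not measured figures for any specific model:

```python
def estimate_vram_gb(params_b, bytes_per_weight=2.0,
                     vision_overhead_gb=0.0, kv_cache_gb=1.0):
    """Rough VRAM estimate in GB: weights + vision encoder + KV cache.

    params_b: model size in billions of parameters.
    bytes_per_weight: 2.0 for FP16; roughly 0.55 for 4-bit quantization
    (assumed average including quantization metadata).
    vision_overhead_gb: assumed extra memory for the vision encoder and
    image tokens; varies by model and image resolution.
    """
    weights_gb = params_b * bytes_per_weight
    return weights_gb + vision_overhead_gb + kv_cache_gb

# A 4-bit quantized 7B, text-only vs. with a vision encoder attached:
text_only = estimate_vram_gb(7, bytes_per_weight=0.55)
with_vision = estimate_vram_gb(7, bytes_per_weight=0.55,
                               vision_overhead_gb=2.5)
```

Treat this as a back-of-the-envelope check, not a guarantee: actual usage also depends on context length, batch size, and the runtime's own buffers.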
How Vision Encoding Works
Image inputs are processed by a separate vision encoder (often a CLIP or SigLIP model) that converts the image into a sequence of tokens. These image tokens are then combined with text tokens and processed by the language model. The number of image tokens determines the VRAM overhead: higher resolution produces more tokens, which consume more memory.
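The token count follows directly from the encoder's patch grid: a ViT-style encoder emits one token per patch. A minimal sketch, using the patch size and input resolution of CLIP ViT-L/14 at 336×336 (the encoder used by LLaVA); the 5× tiling multiplier is an illustrative assumption for a tiled high-resolution scheme:

```python
def image_tokens(resolution, patch_size=14):
    """Patch tokens a ViT-style encoder produces for a square image:
    one token per (patch_size x patch_size) patch."""
    per_side = resolution // patch_size
    return per_side * per_side

# CLIP ViT-L/14 at 336x336: a 24x24 grid of patches.
base = image_tokens(336)    # 576 tokens per image
# A tiled high-resolution scheme (e.g. 4 tiles plus the base view)
# multiplies that cost per image:
tiled = 5 * base            # 2880 tokens
```

This is why a 4K input split into many tiles is so much more expensive than a single 512×512 image: every extra tile adds another full grid of image tokens to the context.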
Audio and Video Multimodal
Audio multimodal models (often pairing a Whisper-style audio encoder with an LLM) transcribe audio input and can answer questions about it. Video multimodal models sample frames and process each frame like a still image. Both require substantially more VRAM than text-only inference.
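The frame-sampling approach makes the video cost easy to estimate: token usage scales linearly with the number of sampled frames. A sketch, assuming each frame costs 576 tokens (the per-image figure for a CLIP ViT-L/14 encoder at 336×336):

```python
def video_image_tokens(duration_s, sample_fps, tokens_per_frame=576):
    """Approximate image-token budget for a video processed by sampling
    frames and encoding each one like a still image.

    tokens_per_frame is an assumed per-frame cost; real models vary.
    """
    frames = int(duration_s * sample_fps)
    return frames, frames * tokens_per_frame

# A 60-second clip sampled at 1 frame per second:
frames, tokens = video_image_tokens(60, 1)   # 60 frames, 34560 tokens
```

Even at a modest 1 fps, a one-minute clip consumes tens of thousands of tokens of context, which is why video inference needs higher-tier hardware.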
For local deployment, image-text multimodal is the practical tier for most users. Audio and video multimodal are feasible but require higher-tier hardware.