Multimodal

AI models that process and generate more than one type of data — typically text plus images, audio, or video.

A multimodal model can understand and work with multiple types of input — the most common combination being text and images. A multimodal LLM can analyze an image, answer questions about it, describe its contents, or use it as context for a text response. Some models extend this to audio input/output, video, or structured data.

Several multimodal models run well on consumer hardware:

  • LLaVA (Large Language and Vision Assistant) — one of the first capable open-source multimodal models, available in sizes from 7B to 34B
  • LLaVA-NeXT (LLaVA 1.6) — improved version with better image understanding
  • Moondream — extremely small (~1.8B) vision model that runs on minimal hardware
  • Qwen-VL — strong multimodal performance, multiple sizes
  • Pixtral — Mistral's multimodal model with strong document understanding

Ollama and LM Studio support several of these natively.
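As a minimal sketch of what "native support" looks like in practice, the snippet below builds a request body for Ollama's /api/chat endpoint, which accepts images as base64 strings inside a message's "images" list. The model name "llava" and the placeholder image bytes are illustrative; no server call is made here.

```python
import base64
import json

def build_vision_request(model, prompt, image_bytes):
    """Build a request body for Ollama's /api/chat endpoint.

    Ollama accepts images as base64-encoded strings in a message's
    "images" list and decodes them server-side.
    """
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    }

# Placeholder bytes stand in for a real image file's contents.
body = build_vision_request("llava", "Describe this image.", b"\x89PNG")
print(json.dumps(body)[:40])
```

In real use you would read the image file with open(path, "rb").read() and POST the body to a running Ollama instance.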

VRAM Requirements for Vision Models

Multimodal models have higher VRAM requirements than their text-only counterparts. The vision encoder (the component that processes images) adds overhead on top of the base LLM, and the image tokens it produces enlarge the KV cache. A LLaVA 7B model needs roughly 8–10GB, compared with ~5GB for a text-only 7B at the same quantization.
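The overhead can be sketched as a back-of-envelope estimate. All constants below (bytes per quantized parameter, fp16 vision encoder, KV cache cost per token, flat overhead) are illustrative assumptions, not measurements:

```python
def vram_estimate_gb(params_b, quant_bytes_per_param=0.6,
                     vision_params_b=0.0, image_tokens=0,
                     kv_mb_per_token=0.5, overhead_gb=1.0):
    """Back-of-envelope VRAM estimate in GB.

    - LLM weights at ~0.6 bytes/param (roughly a 4-bit quant)
    - vision encoder kept in fp16 (2 bytes/param), as is common
    - KV cache growth from image tokens (~0.5 MB/token, 7B-class model)
    - flat overhead for activations and buffers
    All constants are illustrative, not benchmarks.
    """
    weights = params_b * quant_bytes_per_param
    vision = vision_params_b * 2.0
    kv = image_tokens * kv_mb_per_token / 1024
    return weights + vision + kv + overhead_gb

print(round(vram_estimate_gb(7), 1))  # text-only 7B
print(round(vram_estimate_gb(7, vision_params_b=0.3, image_tokens=2304), 1))
```

The point is not the exact figures but the shape: the multimodal total exceeds the text-only total through both the encoder weights and the extra KV cache.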

High-resolution image processing increases VRAM requirements further. A 4K image split into many tiles requires significantly more processing than a 512×512 input.
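The tiling effect is easy to quantify. Assuming a LLaVA-NeXT-style scheme where the image is cut into 336×336 crops and each crop encodes to a fixed 576 tokens (illustrative numbers), the token count grows with the tile grid:

```python
import math

def tiled_image_tokens(width, height, tile=336, tokens_per_tile=576):
    """Image tokens under a tiling scheme: the image is cut into
    tile-by-tile crops and each crop is encoded to a fixed token
    count (constants are illustrative)."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * tokens_per_tile

print(tiled_image_tokens(512, 512))    # 2x2 grid of tiles
print(tiled_image_tokens(3840, 2160))  # a 4K frame needs many more tiles
```

A 4K input ends up roughly twenty times the token count of a 512×512 one under these assumptions, which is where the extra VRAM goes.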

How Vision Encoding Works

Image inputs are processed by a separate vision encoder (often a CLIP or SigLIP model) that converts the image into a sequence of tokens. These image tokens are then combined with text tokens and processed by the language model. The number of image tokens determines the VRAM overhead — higher resolution = more tokens = more memory.
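The token count follows directly from the encoder's patch size: a ViT-style encoder splits the image into non-overlapping patches and emits one token per patch. For example, CLIP ViT-L/14 at a 336px input (the encoder used by LLaVA) yields (336/14)² = 576 tokens:

```python
def vit_patch_tokens(image_size, patch_size=14):
    """Token count for a square ViT input: each non-overlapping
    patch_size x patch_size patch becomes one token."""
    per_side = image_size // patch_size
    return per_side * per_side

print(vit_patch_tokens(336, 14))  # 576 tokens: CLIP ViT-L/14 at 336px
print(vit_patch_tokens(224, 14))  # 256 tokens at the smaller 224px input
```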

Audio and Video Multimodal

Audio multimodal models (typically pairing a Whisper-style speech encoder with an LLM) transcribe audio input and can answer questions about it. Video multimodal models sample frames and process them much like still images. Both require substantially more VRAM than text-only inference.
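Frame sampling makes the video cost easy to estimate: each sampled frame is encoded like a still image, so the token budget is frames times tokens per frame. The sampling rate and per-frame token count below are illustrative assumptions:

```python
def video_tokens(duration_s, sampled_fps=1.0, tokens_per_frame=576):
    """Token budget for a sampled video clip: each sampled frame is
    encoded like a still image (constants are illustrative)."""
    frames = int(duration_s * sampled_fps)
    return frames * tokens_per_frame

print(video_tokens(60))  # one minute sampled at 1 frame/second
```

Even at one frame per second, a minute of video costs tens of thousands of image tokens, which is why video multimodal sits in a higher hardware tier.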

For local deployment, image-text multimodal is the practical tier for most users. Audio and video multimodal are feasible but require higher-tier hardware.