Vision-Language Model
A model capable of jointly reasoning over both images and text.
Full Definition
Vision-language models (VLMs) combine a visual encoder (typically a ViT — Vision Transformer) with a language model decoder, enabling tasks that require understanding both visual and textual information. They can describe images, answer visual questions, read text in images (OCR), locate objects, and perform multi-step visual reasoning. Training involves massive datasets of image-text pairs from the web. Modern VLMs like LLaVA, GPT-4V, Claude 3, and Gemini Pro Vision integrate vision natively rather than as a bolt-on. Applications include medical image analysis, document understanding, robotics perception, and accessibility tools for visually impaired users.
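The coupling of visual encoder and language decoder can be sketched in a few lines. This is a stdlib-only toy, not any real model's code: the dimensions, the random "embeddings", and the linear projector are all invented for illustration. The one real idea it shows is how patch embeddings from a vision encoder are projected into the language model's token-embedding space and interleaved with text tokens into a single sequence for the decoder.

```python
import random

random.seed(0)

# Toy dimensions, chosen only for illustration.
PATCH_DIM = 4   # size of a visual patch embedding from the ViT
TEXT_DIM = 6    # size of the language model's token embeddings

# Projection matrix mapping visual features into the LM embedding space.
# In a real VLM this is learned (e.g. a linear or MLP projector).
projector = [[random.gauss(0, 1) for _ in range(TEXT_DIM)]
             for _ in range(PATCH_DIM)]

def project(patch):
    """Map one visual patch embedding into the text embedding space."""
    return [sum(patch[i] * projector[i][j] for i in range(PATCH_DIM))
            for j in range(TEXT_DIM)]

# Stand-in outputs of a frozen vision encoder: one vector per image patch.
image_patches = [[random.gauss(0, 1) for _ in range(PATCH_DIM)]
                 for _ in range(3)]

# Stand-in text token embeddings for a prompt like "Describe this image".
text_tokens = [[random.gauss(0, 1) for _ in range(TEXT_DIM)]
               for _ in range(4)]

# The decoder sees one interleaved sequence: projected image tokens, then text.
sequence = [project(p) for p in image_patches] + text_tokens
print(len(sequence), len(sequence[0]))  # 7 tokens, each TEXT_DIM wide
```

From the decoder's point of view, the projected patches are just extra "tokens"; attention over this mixed sequence is what lets the model ground its text generation in the image.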
Examples
Submitting a photo of a restaurant menu in Japanese to GPT-4V, which translates each dish and adds dietary information.
A VLM inspecting assembly line photos to flag defective products without human visual inspection.
Related Terms
Multimodal Model
A model that can process and generate across multiple data types such as text, i…
Transformer
The neural network architecture that underpins all modern large language models,…
Attention Mechanism
The core transformer operation that weighs the relevance of each token to every …