Vision-Language Model
A model capable of jointly reasoning over both images and text.
Full Definition
Vision-language models (VLMs) combine a visual encoder (typically a ViT — Vision Transformer) with a language model decoder, enabling tasks that require understanding both visual and textual information. They can describe images, answer visual questions, read text in images (OCR), locate objects, and perform multi-step visual reasoning. Training involves massive datasets of image-text pairs from the web. Modern VLMs like LLaVA, GPT-4V, Claude 3, and Gemini Pro Vision integrate vision natively rather than as a bolt-on. Applications include medical image analysis, document understanding, robotics perception, and accessibility tools for visually impaired users.
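The coupling of visual encoder and language decoder can be sketched in a few lines. This is a stdlib-only toy, not any real model's code: the dimensions, the random "embeddings", and the linear projector are all invented for illustration. The one real idea it shows is how patch embeddings from a vision encoder are projected into the language model's token-embedding space and interleaved with text tokens into a single sequence for the decoder.

```python
import random

random.seed(0)

# Toy dimensions, chosen only for illustration.
PATCH_DIM = 4   # size of a visual patch embedding from the ViT
TEXT_DIM = 6    # size of the language model's token embeddings

# Projection matrix mapping visual features into the LM embedding space.
# In a real VLM this is learned (e.g. a linear or MLP projector).
projector = [[random.gauss(0, 1) for _ in range(TEXT_DIM)]
             for _ in range(PATCH_DIM)]

def project(patch):
    """Map one visual patch embedding into the text embedding space."""
    return [sum(patch[i] * projector[i][j] for i in range(PATCH_DIM))
            for j in range(TEXT_DIM)]

# Stand-in outputs of a frozen vision encoder: one vector per image patch.
image_patches = [[random.gauss(0, 1) for _ in range(PATCH_DIM)]
                 for _ in range(3)]

# Stand-in text token embeddings for a prompt like "Describe this image".
text_tokens = [[random.gauss(0, 1) for _ in range(TEXT_DIM)]
               for _ in range(4)]

# The decoder sees one interleaved sequence: projected image tokens, then text.
sequence = [project(p) for p in image_patches] + text_tokens
print(len(sequence), len(sequence[0]))  # 7 tokens, each TEXT_DIM wide
```

From the decoder's point of view, the projected patches are just extra "tokens"; attention over this mixed sequence is what lets the model ground its text generation in the image.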
Examples
Submitting a photo of a restaurant menu in Japanese to GPT-4V, which translates each dish and adds dietary information.
A VLM inspecting assembly line photos to flag defective products without human visual inspection.
Related Terms
Multimodal Model
A model that can process and generate across multiple data types such as text, i…
Transformer
The neural network architecture that underpins all modern large language models,…
Attention Mechanism
The core transformer operation that weighs the relevance of each token to every …