Multimodal Model
A model that can process and generate across multiple data types such as text, images, and audio.
Full Definition
Multimodal models accept input, and in many cases produce output, in more than one modality, typically combining text with images, audio, video, or code within a single architecture. Early multimodal systems stitched together separate encoders (e.g., CLIP for images paired with a language model for text), whereas modern models such as GPT-4o, Gemini, and Claude 3 process all modalities through a unified transformer. This enables cross-modal reasoning: answering questions about images, generating image captions, describing audio, or combining visual and textual instructions. Native multimodality is increasingly the default for frontier models because real-world tasks rarely involve text alone.
Examples
Uploading a photo of a broken circuit board to GPT-4 Vision and asking 'What component is likely causing the short circuit?'
Using Gemini to transcribe and summarise a one-hour meeting recording, outputting structured meeting notes.
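The first example above, pairing a photo with a text question, can be sketched as the message payload a multimodal chat API expects. This is a minimal illustration using the content-parts layout of OpenAI-style chat APIs; the helper name and placeholder image bytes are assumptions, not part of any official SDK.

```python
import base64

def build_vision_message(question: str, image_bytes: bytes) -> dict:
    """Combine a text question and an image into one chat message,
    using the content-parts format accepted by multimodal chat APIs
    (e.g. OpenAI's GPT-4o endpoint)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            # Text part: the user's question.
            {"type": "text", "text": question},
            # Image part: the photo, inlined as a base64 data URL.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

# Placeholder bytes; a real call would read the circuit-board photo from disk.
msg = build_vision_message(
    "What component is likely causing the short circuit?",
    b"\x89PNG placeholder",
)
print(msg["content"][0]["type"], msg["content"][1]["type"])
```

The point is that a single message carries both modalities, so the model can reason across them; the same structure extends to audio or additional images as further content parts.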
Related Terms
Vision-Language Model
A model capable of jointly reasoning over both images and text.…
Large Language Model
A neural network with billions of parameters trained on text to understand and g…
Foundation Model
A large model trained on broad data that can be adapted to many downstream tasks…