Multimodal Models
Models that work with multiple types of data (text, images, audio). Seven models are listed below.
LLaVA 1.6
Developer: LLaVA Team | Parameters: 34B | Rating: 8.4
Vision-language model combining visual understanding with language generation.
Capabilities: Vision understanding, Image captioning, Visual Q&A, Reasoning
CogVLM
Developer: Tsinghua University | Parameters: 17B | Rating: 8.3
Powerful vision-language model with strong visual grounding.
Capabilities: Visual grounding, Image understanding, Visual Q&A, OCR
Qwen-VL
Developer: Alibaba Cloud | Parameters: 9.6B | Rating: 8.2
Multilingual vision-language model with strong Chinese support.
Capabilities: Multilingual, Vision understanding, Chinese OCR, Visual Q&A
BLIP-2
Developer: Salesforce | Parameters: 7.8B | Rating: 7.9
Efficient vision-language model with strong zero-shot capabilities.
Capabilities: Zero-shot learning, Image captioning, Visual Q&A, Efficient
InstructBLIP
Developer: Salesforce | Parameters: 7.8B | Rating: 8.0
Instruction-tuned vision-language model for diverse visual tasks.
Capabilities: Instruction following, Visual tasks, Zero-shot, Versatile
LLaVA-NeXT
Developer: LLaVA Team | Parameters: 34B | Rating: 8.7
Next-generation LLaVA with improved visual reasoning.
Capabilities: Strong visual reasoning, High resolution, Improved understanding
MiniGPT-4
Developer: MiniGPT-4 Team | Parameters: 7B | Rating: 7.8
Lightweight vision-language model with strong image understanding.
Capabilities: Image understanding, Visual Q&A, Lightweight, Efficient