Qualcomm® AI Hub

Qwen3-VL-2B-Instruct

Multimodal 2B vision‑language model capable of understanding text and images.

Qwen3‑VL is a vision‑language model from Alibaba Cloud capable of understanding both text and images for multimodal reasoning tasks such as visual question answering and image captioning.

Technical Details

Model architecture: Transformer with ViT Vision Encoder, Grouped Query Attention (GQA), and SwiGLU activation.
Supported languages: 100+ languages and dialects
TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt.
Response Rate: Rate of response generation after the first response token.
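
The two latency metrics defined above can be illustrated with a small calculation. This is a minimal sketch, not the measurement method used on this page: it assumes you have recorded a wall-clock timestamp when the request was sent and one for each generated token, and the function name `latency_metrics` and its parameters are hypothetical.

```python
from typing import Sequence, Tuple


def latency_metrics(request_time: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) from wall-clock timestamps.

    Hypothetical helper: request_time is when the prompt was submitted,
    token_times[i] is when the i-th response token arrived.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")

    # TTFT: delay between submitting the prompt and the first response token.
    ttft = token_times[0] - request_time

    # Response rate: tokens generated after the first one, divided by the
    # time elapsed between the first token and the last token.
    decode_tokens = len(token_times) - 1
    decode_seconds = token_times[-1] - token_times[0]
    rate = decode_tokens / decode_seconds if decode_seconds > 0 else 0.0

    return ttft, rate


# Example: prompt sent at t=0.0 s, five tokens arriving 0.25 s apart from t=0.5 s.
ttft, rate = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(f"TTFT: {ttft} s, response rate: {rate} tokens/s")
```

Because TTFT includes prompt processing, running the same calculation with a longer prompt would typically give a larger first timestamp, which is why the metric is reported as a range.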

Applicable Scenarios

  • Dialogue
  • Content Generation

Supported Form Factors

  • Phone
  • Tablet

License

Tags

  • llm
  • generative-ai

Supported Devices

  • Dragonwing IQ-9075 EVK
  • Snapdragon 8 Elite QRD
  • Snapdragon X Elite CRD
  • Snapdragon X2 Elite CRD

Supported Chipsets

  • Qualcomm® QCS9075
  • Snapdragon® 8 Elite Mobile
  • Snapdragon® X Elite
  • Snapdragon® X2 Elite
