Qwen2.5-VL-7B-Instruct

State‑of‑the‑art vision‑language model capable of understanding images and generating text responses.

Qwen2.5‑VL‑7B‑Instruct is a multimodal vision‑language model with 7 billion parameters that can process both images and text, enabling visual question answering, image description, and other vision‑language tasks.

Technical Details

Input sequence length for Prompt Processor: 128
Input image size for Vision Encoder: 504x336
Context lengths: 512, 1024, 2048
Use: Initiate the conversation with the prompt processor, then use the token generator for subsequent iterations.
Minimum QNN SDK version required: 2.45.0
Supported languages: English, Chinese, and many others.
TTFT: Time To First Token is the time it takes to generate the first response token. It is expressed as a range because it varies with the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length.
Response Rate: Rate of response generation after the first response token.
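
The two-stage flow described above (prompt processor first, token generator afterwards) and the reason TTFT scales with prompt length can be sketched roughly as follows. This is a minimal illustration, not the actual runtime API: `prompt_processor`, `token_generator`, `eos_token`, and the helper names are hypothetical stand-ins, and the sketch assumes the prompt is fed to the prompt processor in consecutive 128-token chunks, with any cached state kept inside the callables.

```python
import math
from typing import Callable, List, Sequence

PROMPT_CHUNK = 128                   # prompt processor input sequence length (from the list above)
CONTEXT_LENGTHS = (512, 1024, 2048)  # context lengths listed above


def prompt_processor_iterations(num_prompt_tokens: int, chunk: int = PROMPT_CHUNK) -> int:
    """Number of prompt-processor passes needed to consume the whole prompt."""
    if num_prompt_tokens <= 0:
        raise ValueError("prompt must contain at least one token")
    return math.ceil(num_prompt_tokens / chunk)


def split_prompt(tokens: Sequence[int], chunk: int = PROMPT_CHUNK) -> List[List[int]]:
    """Split the prompt into consecutive chunks of at most `chunk` tokens."""
    return [list(tokens[i:i + chunk]) for i in range(0, len(tokens), chunk)]


def generate(
    prompt_tokens: Sequence[int],
    prompt_processor: Callable[[List[int]], int],  # hypothetical: consumes a chunk, returns a candidate next token
    token_generator: Callable[[int], int],         # hypothetical: consumes one token, returns the next token
    context_length: int,
    eos_token: int,
) -> List[int]:
    """Run the prompt processor over every chunk, then generate token by token."""
    if context_length not in CONTEXT_LENGTHS:
        raise ValueError(f"context_length must be one of {CONTEXT_LENGTHS}")
    if not 0 < len(prompt_tokens) < context_length:
        raise ValueError("prompt must be non-empty and shorter than the context length")

    # Stage 1: prompt processing. TTFT grows with the number of chunks here.
    next_token = eos_token
    for chunk in split_prompt(prompt_tokens):
        next_token = prompt_processor(chunk)

    # Stage 2: token generation, one token per iteration (this sets the response rate).
    output = [next_token]
    while next_token != eos_token and len(prompt_tokens) + len(output) < context_length:
        next_token = token_generator(next_token)
        output.append(next_token)
    return output
```

Under these assumptions, a prompt of up to 128 tokens needs a single prompt-processor pass (the TTFT lower bound), while a prompt that fills a 2048-token context needs 16 passes (the upper bound for that context length).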

Applicable Scenarios

  • Dialogue
  • Content Generation
  • Customer Support

License

Tags

  • llm
  • generative-ai

Supported Compute Devices

  • Snapdragon X Elite CRD
  • Snapdragon X2 Elite CRD

Supported Compute Chipsets

  • Snapdragon® X Elite
  • Snapdragon® X2 Elite
