Qualcomm® AI Hub

OpenAI-Clip

Multi-modal foundation model for vision and language tasks such as image/text similarity and zero-shot image classification.

Contrastive Language-Image Pre-Training (CLIP) uses a ViT-like transformer to extract visual features and a causal language model to extract text features. Because both encoders map their inputs into a shared embedding space, the text and visual features can be compared directly and used for a variety of zero-shot learning tasks.
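
The zero-shot use of the two feature sets can be sketched as follows. This is a minimal illustration, not the model's actual inference code: the 3-dimensional vectors are hypothetical stand-ins for real encoder outputs, the function names are made up for this example, and the `logit_scale` of 100.0 is an assumed temperature applied before the softmax.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, logit_scale=100.0):
    """Return one probability per candidate label.

    logit_scale is an assumption for this sketch, not a value from the page.
    """
    image_unit = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text_unit = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Hypothetical toy embeddings; real ones would come from the two encoders.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, text_embeddings)
best_label = labels[probs.index(max(probs))]
print(best_label)
```

No label-specific training is involved: the candidate classes are supplied as text prompts at inference time, which is what makes the classification zero-shot.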

Technical Details

Model checkpoint: ViT-B/16
Image input resolution: 224x224
Text context length: 77
Number of parameters (CLIPTextEncoder): 76.0M
Model size (CLIPTextEncoder): 290 MB
Number of parameters (CLIPImageEncoder): 115M
Model size (CLIPImageEncoder): 437 MB
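
A few of the figures above can be cross-checked with simple arithmetic. The sketch below assumes 32-bit floating-point weights (4 bytes per parameter) and sizes measured in units of 1024 x 1024 bytes; the page states neither, so both are assumptions. The patch size of 16 is read from the "/16" in the checkpoint name.

```python
# Figures copied from the Technical Details list.
IMAGE_SIZE = 224        # image input resolution, pixels per side
PATCH_SIZE = 16         # from the "/16" in ViT-B/16
TEXT_PARAMS = 76.0e6    # CLIPTextEncoder
IMAGE_PARAMS = 115e6    # CLIPImageEncoder (a rounded figure)

BYTES_PER_PARAM = 4     # assumption: float32 weights
MIB = 1024 * 1024       # assumption: sizes reported in mebibytes


def patch_count(image_size, patch_size):
    """Number of non-overlapping square patches the image is split into."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of patch size")
    per_side = image_size // patch_size
    return per_side * per_side


def estimated_size_mib(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Rough weight-storage estimate; ignores metadata and graph overhead."""
    return num_params * bytes_per_param / MIB


patches = patch_count(IMAGE_SIZE, PATCH_SIZE)
text_mib = estimated_size_mib(TEXT_PARAMS)
image_mib = estimated_size_mib(IMAGE_PARAMS)

print(f"patches per image: {patches}")
print(f"text encoder estimate: {text_mib:.0f} MiB")
print(f"image encoder estimate: {image_mib:.0f} MiB")
```

Under these assumptions the estimates land close to the listed 290 MB and 437 MB. The small gap for the image encoder is what one would expect if the 115M parameter count is itself rounded.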

Applicable Scenarios

  • Image Search
  • Content Moderation
  • Caption Creation
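
As an illustration of the Image Search scenario, the sketch below ranks a small catalog of images against a text query by cosine similarity. The file names, the `search_images` helper, and the 3-dimensional embeddings are all hypothetical; in practice the vectors would be produced by the image and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_images(query_embedding, image_index, top_k=2):
    """Return the names of the top_k images most similar to the query."""
    scored = [
        (cosine(query_embedding, embedding), name)
        for name, embedding in image_index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical precomputed image embeddings, keyed by file name.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.2, 0.1, 0.9],
}
# Hypothetical text embedding for a query such as "sunny coastline".
query_embedding = [0.8, 0.3, 0.1]

results = search_images(query_embedding, image_index)
print(results)
```

Image embeddings can be computed once and stored, so only the text query needs to be encoded at search time.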

Licenses

Source Model: MIT
Deployable Model: AI Model Hub License

Tags

  • foundation

Supported Automotive Devices

  • SA7255P ADP
  • SA8255 (Proxy)
  • SA8295P ADP
  • SA8650 (Proxy)
  • SA8775P ADP

Supported Automotive Chipsets

  • Qualcomm® SA7255P
  • Qualcomm® SA8255P (Proxy)
  • Qualcomm® SA8295P
  • Qualcomm® SA8650P (Proxy)
  • Qualcomm® SA8775P
