
OpenAI-Clip

Multi-modal foundation model for vision-and-language tasks such as image/text similarity and zero-shot image classification.

Contrastive Language-Image Pre-Training (CLIP) uses a ViT-like transformer to extract visual features and a causal language model to extract text features. The text and visual features can then be used for a variety of zero-shot learning tasks.
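The zero-shot classification step can be sketched as follows, assuming the image and prompt embeddings have already been produced by the two encoders. The `temperature` parameter stands in for CLIP's learned logit scale (around 100 in the released checkpoints); names here are illustrative, not part of any specific API.

```python
import numpy as np

def zero_shot_classify(image_embedding, text_embeddings, temperature=100.0):
    """Score one image embedding against candidate prompt embeddings, CLIP-style.

    Both sides are L2-normalized, so the logits are scaled cosine
    similarities; a softmax turns them into per-prompt probabilities.
    """
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = temperature * (txt @ img)      # one cosine similarity per prompt
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()
```

Prompts are usually phrased as short captions (e.g. "a photo of a cat"); the prompt with the highest probability is taken as the predicted class.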

Technical Details

Model checkpoint: ViT-B/16
Image input resolution: 224x224
Text context length: 77
Number of parameters (CLIPTextEncoder): 76.0M
Model size (CLIPTextEncoder): 290 MB
Number of parameters (CLIPImageEncoder): 115M
Model size (CLIPImageEncoder): 437 MB
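Because the deployable model has fixed input sizes, every image must be resized to 224x224 and every token sequence must be padded or truncated to exactly 77 entries before inference. A minimal sketch of the token-side preparation (the token IDs and `pad_id` here are placeholders, not CLIP's actual BPE vocabulary):

```python
CONTEXT_LENGTH = 77  # fixed text context length from the table above

def pad_or_truncate(token_ids, pad_id=0):
    """Fit a token-ID sequence to CLIP's fixed context length.

    Longer sequences are truncated; shorter ones are right-padded,
    so the encoder always receives exactly CONTEXT_LENGTH tokens.
    """
    ids = list(token_ids)[:CONTEXT_LENGTH]
    return ids + [pad_id] * (CONTEXT_LENGTH - len(ids))
```

The same fixed-shape constraint applies on the image side: the image encoder expects a single 224x224 RGB input per inference call.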

Applicable Scenarios

  • Image Search
  • Content Moderation
  • Caption Creation

Supported Form Factors

  • Phone
  • Tablet

Licenses

Source Model: MIT
Deployable Model: AI Model Hub License

Tags

  • foundation
    A “foundation” model is a versatile, general-purpose model designed for multiple tasks without task-specific fine-tuning.

Supported Devices

  • Google Pixel 3
  • Google Pixel 3a
  • Google Pixel 3a XL
  • Google Pixel 4
  • Google Pixel 4a
  • Google Pixel 5a 5G
  • QCS8550 (Proxy)
  • SA8255 (Proxy)
  • SA8650 (Proxy)
  • SA8775 (Proxy)
  • Samsung Galaxy S21
  • Samsung Galaxy S21 Ultra
  • Samsung Galaxy S21+
  • Samsung Galaxy S22 5G
  • Samsung Galaxy S22 Ultra 5G
  • Samsung Galaxy S22+ 5G
  • Samsung Galaxy S23
  • Samsung Galaxy S23 Ultra
  • Samsung Galaxy S23+
  • Samsung Galaxy S24
  • Samsung Galaxy S24 Ultra
  • Samsung Galaxy S24+
  • Samsung Galaxy Tab S8
  • Xiaomi 12
  • Xiaomi 12 Pro

Supported Chipsets

  • Qualcomm® QCS8550
  • Qualcomm® SA8255P
  • Qualcomm® SA8650P
  • Qualcomm® SA8775P
  • Snapdragon® 8 Gen 1 Mobile
  • Snapdragon® 8 Gen 2 Mobile
  • Snapdragon® 8 Gen 3 Mobile
  • Snapdragon® 888 Mobile
  • Snapdragon® X Elite