OpenAI-Clip
Multi-modal foundation model for vision and language tasks such as image/text similarity and zero-shot image classification.
Contrastive Language-Image Pre-Training (CLIP) uses a ViT-like transformer to extract visual features and a causal language model to extract text features. Both the text and visual features can then be used for a variety of zero-shot learning tasks.
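As an illustration of the zero-shot workflow, here is a minimal sketch using the Hugging Face transformers CLIP API with the openai/clip-vit-base-patch16 checkpoint; the image path and label prompts are placeholder assumptions, and this runs the reference PyTorch model rather than the deployable on-device asset.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the ViT-B/16 CLIP checkpoint (Hugging Face hub id assumed for this sketch).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg")  # placeholder local image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# The processor resizes the image and tokenizes the prompts for the two encoders.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image/text similarity scores; softmax turns them into
# zero-shot class probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```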
Technical Details
Model checkpoint: ViT-B/16
Image input resolution: 224x224
Text context length: 77
Number of parameters (CLIPTextEncoder): 76.0M
Model size (CLIPTextEncoder): 290 MB
Number of parameters (CLIPImageEncoder): 115M
Model size (CLIPImageEncoder): 437 MB
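The sketch below illustrates how the listed input specs show up in practice: the image encoder consumes 224x224 pixel inputs and the text encoder a 77-token context, with each encoder producing an embedding in the shared space. It again assumes the Hugging Face transformers API and a placeholder image path, not the exported on-device encoders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Images are resized/cropped to the 224x224 input resolution listed above.
image_inputs = processor(images=Image.open("example.jpg"), return_tensors="pt")
print(image_inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])

# Text is padded to the 77-token context length listed above.
text_inputs = processor(text=["a photo of a dog"], return_tensors="pt",
                        padding="max_length", max_length=77)
print(text_inputs["input_ids"].shape)      # torch.Size([1, 77])

with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)  # image embedding
    text_emb = model.get_text_features(**text_inputs)     # text embedding

# Cosine similarity between the two embeddings drives image/text matching.
print(torch.nn.functional.cosine_similarity(image_emb, text_emb).item())
```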
Applicable Scenarios
- Image Search
- Content Moderation
- Caption Creation
Supported Form Factors
- Phone
- Tablet
Licenses
Source Model: MIT
Deployable Model: AI Model Hub License
Tags
- foundation: A "foundation" model is versatile and designed for multi-task capabilities without the need for fine-tuning.
Supported Devices
- Google Pixel 3
- Google Pixel 3a
- Google Pixel 3a XL
- Google Pixel 4
- Google Pixel 4a
- Google Pixel 5a 5G
- QCS8550 (Proxy)
- SA8255 (Proxy)
- SA8650 (Proxy)
- SA8775 (Proxy)
- Samsung Galaxy S21
- Samsung Galaxy S21 Ultra
- Samsung Galaxy S21+
- Samsung Galaxy S22 5G
- Samsung Galaxy S22 Ultra 5G
- Samsung Galaxy S22+ 5G
- Samsung Galaxy S23
- Samsung Galaxy S23 Ultra
- Samsung Galaxy S23+
- Samsung Galaxy S24
- Samsung Galaxy S24 Ultra
- Samsung Galaxy S24+
- Samsung Galaxy Tab S8
- Xiaomi 12
- Xiaomi 12 Pro
Supported Chipsets
- Qualcomm® QCS8550
- Qualcomm® SA8255P
- Qualcomm® SA8650P
- Qualcomm® SA8775P
- Snapdragon® 8 Gen 1 Mobile
- Snapdragon® 8 Gen 2 Mobile
- Snapdragon® 8 Gen 3 Mobile
- Snapdragon® 888 Mobile
- Snapdragon® X Elite