
    OpenAI-Clip

    Multi-modal foundation model for vision-and-language tasks such as image/text similarity and zero-shot image classification.

    Contrastive Language-Image Pre-Training (CLIP) uses a ViT-like transformer to extract visual features and a causal language model to extract text features. Both the text and visual features can then be used for a variety of zero-shot learning tasks.
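
    As a quick illustration of these zero-shot capabilities, the sketch below classifies an image against free-text labels. It uses the Hugging Face transformers port of the same ViT-B/16 checkpoint rather than the deployable on-device asset, and the image path and label set are assumptions made for the example.

    ```python
    # Minimal zero-shot classification sketch using the Hugging Face
    # `transformers` port of the ViT-B/16 CLIP checkpoint.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

    image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

    # The processor resizes the image to 224x224 and tokenizes the text
    # (context length 77), matching the figures under Technical Details.
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Image-text similarity logits, softmaxed over the candidate labels.
    probs = outputs.logits_per_image.softmax(dim=-1)
    for label, p in zip(labels, probs[0].tolist()):
        print(f"{label}: {p:.3f}")
    ```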

    Runtime: TorchScript → TFLite
    Inference Time: 11.2 ms
    Memory Usage: 0-209 MB
    Layers: 574 NPU, 2 CPU

    Technical Details

    Model checkpoint: ViT-B/16
    Image input resolution: 224x224
    Text context length: 77
    Number of parameters (CLIPTextEncoder): 76.0M
    Model size (CLIPTextEncoder): 290 MB
    Number of parameters (CLIPImageEncoder): 115M
    Model size (CLIPImageEncoder): 437 MB

    Applicable Scenarios

    • Image Search (see the retrieval sketch after this list)
    • Content Moderation
    • Caption Creation
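
    For the Image Search scenario, the sketch below embeds a small gallery with the image encoder and ranks it against a text query by cosine similarity. As above, it relies on the Hugging Face ViT-B/16 port; the file names and query string are hypothetical.

    ```python
    # Hypothetical image-search sketch: embed a gallery once, then rank it
    # against a free-text query by cosine similarity of CLIP embeddings.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

    gallery_paths = ["beach.jpg", "city.jpg", "forest.jpg"]  # hypothetical files
    images = [Image.open(p).convert("RGB") for p in gallery_paths]

    with torch.no_grad():
        image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
        text_emb = model.get_text_features(
            **processor(text=["a sunny beach"], return_tensors="pt", padding=True)
        )

    # Normalize the embeddings, then score each gallery image against the query.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (text_emb @ image_emb.T).squeeze(0)
    for path, score in sorted(zip(gallery_paths, scores.tolist()), key=lambda x: -x[1]):
        print(f"{path}: {score:.3f}")
    ```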

    Supported Form Factors

    • Phone
    • Tablet

    Licenses

    Source Model: MIT
    Deployable Model: AI Model Hub License

    Tags

    • foundation
      A “foundation” model is versatile and designed for multi-task capabilities, without the need for fine-tuning.

    Supported Devices

    • Google Pixel 3
    • Google Pixel 3a
    • Google Pixel 3a XL
    • Google Pixel 4
    • Google Pixel 4a
    • Google Pixel 5a 5G
    • QCS8550 (Proxy)
    • Samsung Galaxy S21
    • Samsung Galaxy S21 Ultra
    • Samsung Galaxy S21+
    • Samsung Galaxy S22 5G
    • Samsung Galaxy S22 Ultra 5G
    • Samsung Galaxy S22+ 5G
    • Samsung Galaxy S23
    • Samsung Galaxy S23 Ultra
    • Samsung Galaxy S23+
    • Samsung Galaxy S24
    • Samsung Galaxy S24 Ultra
    • Samsung Galaxy S24+
    • Samsung Galaxy Tab S8
    • Xiaomi 12
    • Xiaomi 12 Pro

    Supported Chipsets

    • Qualcomm® QCS8550
    • Snapdragon® 8 Gen 1 Mobile
    • Snapdragon® 8 Gen 2 Mobile
    • Snapdragon® 8 Gen 3 Mobile
    • Snapdragon® 888 Mobile