OpenAI-Clip

Multi-modal foundation model for vision and language tasks such as image/text similarity and zero-shot image classification.

Contrastive Language-Image Pre-Training (CLIP) uses a ViT-like transformer to extract visual features and a causal language model to extract text features. Both sets of features can then be used for a variety of zero-shot learning tasks.
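As a rough illustration of how the two sets of features combine for zero-shot classification, the sketch below scores one image embedding against several text-prompt embeddings using cosine similarity followed by a softmax. It is a minimal sketch, not the deployed model's code: the three-dimensional embeddings are toy values rather than real CLIP outputs, the function names are hypothetical, and the `logit_scale` of 100 is an assumed temperature.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, labels, logit_scale=100.0):
    """Return the best-matching label and the probability of every label.

    logit_scale is an assumed temperature applied to the similarities
    before the softmax; it is not taken from this page.
    """
    sims = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy embeddings standing in for the outputs of the image and text encoders.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [
    [1.0, 0.0, 0.0],  # "a photo of a dog"
    [0.0, 1.0, 0.0],  # "a photo of a cat"
    [0.0, 0.0, 1.0],  # "a photo of a car"
]
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

label, probs = zero_shot_classify(image_embedding, text_embeddings, labels)
print(label)
```

No classifier is trained here: the candidate classes are supplied only as text prompts at inference time, which is what makes the approach zero-shot.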


Technical Details

Model checkpoint: ViT-B/16

Image input resolution: 224x224

Text context length: 77

Number of parameters: 150M

Model size (float): 571 MB
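The figures above can be related with some quick arithmetic. The sketch below assumes a patch size of 16 pixels (read from the checkpoint name ViT-B/16), 4 bytes per parameter for 32-bit floats, the rounded 150M parameter count, and that "MB" on this page means mebibytes; none of these assumptions is stated on the page itself.

```python
# Assumed values: patch size from the checkpoint name, float32 storage.
image_size = 224            # pixels per side, from "Image input resolution"
patch_size = 16             # assumption: the "/16" in ViT-B/16
bytes_per_param = 4         # assumption: 32-bit floating-point weights
num_params = 150_000_000    # rounded figure from "Number of parameters"

# Number of non-overlapping patches the image encoder sees per image.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Approximate weight storage in mebibytes.
size_mib = num_params * bytes_per_param / 2**20

print(f"{num_patches} patches per image")
print(f"about {size_mib:.0f} MiB of float32 weights")
```

Under these assumptions the estimate lands within about one megabyte of the listed 571 MB, a gap that would be consistent with the parameter count being rounded to 150M.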

Applicable Scenarios

Image Search
Content Moderation
Caption Creation

Supported Mobile Form Factors

Phone
Tablet

Licenses

Source Model: MIT

Deployable Model: AI-HUB-MODELS-LICENSE

Supported Mobile Devices

Samsung Galaxy S21
Samsung Galaxy S21 Ultra
Samsung Galaxy S21+
Samsung Galaxy S22 5G
Samsung Galaxy S22 Ultra 5G
Samsung Galaxy S22+ 5G
Samsung Galaxy S23
Samsung Galaxy S23 Ultra
Samsung Galaxy S23+
Samsung Galaxy S24
Samsung Galaxy S24 Ultra
Samsung Galaxy S24+
Samsung Galaxy Tab S8
Snapdragon 8 Elite QRD
Xiaomi 12
Xiaomi 12 Pro

Supported Mobile Chipsets

Snapdragon® 8 Elite Mobile
Snapdragon® 8 Gen 1 Mobile
Snapdragon® 8 Gen 2 Mobile
Snapdragon® 8 Gen 3 Mobile
Snapdragon® 888 Mobile
