OpenAI-Clip

Multi-modal foundation model for vision and language tasks such as image/text similarity and zero-shot image classification.

Contrastive Language-Image Pre-Training (CLIP) uses a ViT-like transformer to extract visual features and a causal language model to extract text features. Both sets of features can then be used for a variety of zero-shot learning tasks.
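As a rough illustration of how the two feature sets combine for zero-shot classification, the sketch below scores one image embedding against several text-prompt embeddings using cosine similarity followed by a softmax. The toy 3-dimensional vectors, the prompt strings, the helper names, and the `logit_scale` value of 100 are assumptions for illustration only; real embeddings would come from the image and text encoders, and this is not the AI Hub API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Return softmax probabilities over text prompts for one image embedding.

    logit_scale is an assumed temperature factor, not a value from this page.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy embeddings standing in for real encoder outputs.
image_emb = [0.9, 0.1, 0.0]
prompts = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}

labels = list(prompts)
probs = zero_shot_probs(image_emb, [prompts[label] for label in labels])
best = labels[probs.index(max(probs))]
print(best)
```

No class-specific training is involved: adding a new class only requires embedding another text prompt, which is what makes the approach zero-shot.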


Technical Details

Model checkpoint: ViT-B/16

Image input resolution: 224x224

Text context length: 77

Number of parameters: 150M

Model size (float): 571 MB
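The parameter count and the float model size listed above can be cross-checked with simple arithmetic. The sketch below assumes that "float" means 32-bit weights (4 bytes per parameter) and that "MB" is counted as 2^20 bytes; the page states neither, so treat both as assumptions.

```python
BYTES_PER_FLOAT32 = 4   # assumption: "float" means 32-bit weights
MIB = 1024 ** 2         # assumption: "MB" is counted as 2**20 bytes


def float32_size_mib(num_params):
    """Approximate on-disk size of the weights alone, in MiB."""
    return num_params * BYTES_PER_FLOAT32 / MIB


def params_from_size_mib(size_mib):
    """Invert the estimate: how many float32 parameters fit in size_mib."""
    return size_mib * MIB / BYTES_PER_FLOAT32


estimated_size = round(float32_size_mib(150_000_000), 1)
implied_params_m = round(params_from_size_mib(571) / 1e6, 1)
print(estimated_size)    # 572.2
print(implied_params_m)  # 149.7
```

Under these assumptions, 571 MB corresponds to roughly 149.7M parameters, which is consistent with the rounded figure of 150M.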

Applicable Scenarios

Image Search
Content Moderation
Caption Creation

Licenses

Source Model: MIT

Deployable Model: AI-HUB-MODELS-LICENSE

Supported Automotive Devices

SA7255P ADP
SA8255 (Proxy)
SA8295P ADP
SA8650 (Proxy)
SA8775P ADP

Supported Automotive Chipsets

Qualcomm® SA7255P
Qualcomm® SA8255P (Proxy)
Qualcomm® SA8295P
Qualcomm® SA8650P (Proxy)
Qualcomm® SA8775P
