OpenAI-Clip

Multimodal foundation model for vision and language tasks such as image/text similarity and zero‑shot image classification.

Contrastive Language‑Image Pre‑Training (CLIP) uses a ViT‑like transformer to extract visual features and a causal language model to extract text features. The two sets of features can then be compared and used for a variety of zero‑shot learning tasks.
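As a rough illustration of how the two feature sets are combined for zero‑shot classification, the sketch below scores one image embedding against several text embeddings using cosine similarity and a softmax. It does not call the real model: the three‑dimensional embeddings, the labels, and the helper names (`zero_shot_classify`, `cosine_similarity`) are hypothetical, and the `logit_scale` value of 100 is an assumption about the scaling applied to the similarities.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, text_embeddings, logit_scale=100.0):
    """Return one probability per text prompt for a single image.

    logit_scale is an assumed temperature; a real model supplies its own.
    """
    sims = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    return softmax([logit_scale * s for s in sims])

# Toy embeddings standing in for real encoder outputs (hypothetical values).
labels = ["a photo of a cat", "a photo of a dog"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_embedding = [0.9, 0.1, 0.0]

probs = zero_shot_classify(image_embedding, text_embeddings)
best_label = labels[probs.index(max(probs))]
```

The same similarity scores, without the softmax, can be used directly to rank images against a text query, which is the basis of the image search scenario.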


Technical Details

Model checkpoint: ViT-B/16

Image input resolution: 224x224

Text context length: 77

Number of parameters: 150M

Model size (float): 571 MB
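The figures above can be sanity‑checked with a little arithmetic. The sketch below assumes that the "/16" in the checkpoint name is the patch size, that the vision transformer prepends one class token, that the float model stores 4 bytes per parameter, and that "MB" is counted in units of 2^20 bytes. The function names are hypothetical, and because the parameter count is rounded, the size is expected to match only approximately.

```python
def vit_sequence_length(image_size, patch_size):
    """Number of tokens the vision transformer sees: patches plus one class token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1

def float32_size_mib(num_parameters):
    """Approximate weight storage in MiB, assuming 4 bytes per parameter."""
    return num_parameters * 4 / (1024 ** 2)

seq_len = vit_sequence_length(224, 16)      # 14 x 14 patches + 1 class token
size_mib = float32_size_mib(150_000_000)    # close to the listed 571 MB

print(seq_len, round(size_mib, 1))
```

Under these assumptions, the rounded 150M parameter count gives a size within about 1 MB of the listed value.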

Applicable Scenarios

Image Search
Content Moderation
Caption Creation

Licenses

Source Model: MIT

Deployable Model: AI-HUB-MODELS-LICENSE

Supported IoT Devices

QCS8275 (Proxy)
QCS8550 (Proxy)
QCS9075 (Proxy)

Supported IoT Chipsets

Qualcomm® QCS8275 (Proxy)
Qualcomm® QCS8550 (Proxy)
Qualcomm® QCS9075 (Proxy)
