State-of-the-art generative AI model that produces spectrogram images from text prompts. The spectrograms can then be converted into audio clips.

Generates high-resolution spectrogram images from text prompts using a latent diffusion model. The model uses CLIP ViT-L/14 as the text encoder, a U-Net-based latent denoiser, and a VAE-based decoder to generate the final image.
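To illustrate what converting a spectrogram into an audio clip involves, here is a minimal sketch using only the Python standard library. It treats each spectrogram row as a frequency bin and each column as a time frame, then sums sine waves weighted by the magnitudes. This is an assumption-laden toy, not the model's actual conversion: the function names, the linear frequency mapping, the sample rate, and the hop size are all hypothetical, and phase is ignored entirely, whereas real pipelines typically estimate phase with an algorithm such as Griffin-Lim.

```python
import io
import math
import struct
import wave


def spectrogram_to_samples(spec, sample_rate=8000, hop=80, max_freq=2000.0):
    """Additive synthesis from a magnitude spectrogram (toy, phase ignored).

    spec[r][t] is the magnitude of frequency bin r at time frame t,
    with row 0 being the lowest frequency. All parameters are assumptions.
    """
    n_bins = len(spec)
    n_frames = len(spec[0])
    # Hypothetical linear mapping from bin index to frequency in Hz.
    freqs = [max_freq * (r + 1) / n_bins for r in range(n_bins)]
    samples = []
    for t in range(n_frames):
        for k in range(hop):
            time = (t * hop + k) / sample_rate
            samples.append(sum(
                spec[r][t] * math.sin(2 * math.pi * freqs[r] * time)
                for r in range(n_bins)
            ))
    # Normalize to [-1, 1]; fall back to 1.0 for an all-silent input.
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


def to_wav_bytes(samples, sample_rate=8000):
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(b"".join(
            struct.pack("<h", int(s * 32767)) for s in samples
        ))
    return buf.getvalue()


# Toy 4-bin x 10-frame spectrogram: one active bin per frame, rising in pitch.
spec = [[1.0 if min(t * 4 // 10, 3) == r else 0.0 for t in range(10)]
        for r in range(4)]
samples = spectrogram_to_samples(spec)
wav_bytes = to_wav_bytes(samples)
```

Writing `wav_bytes` to a `.wav` file yields a short clip of four rising tones. A generated spectrogram image would first need its pixel intensities mapped back to magnitudes, a step this sketch leaves out.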

Snapdragon® X Elite
TorchScript to Qualcomm® AI Engine Direct

Technical Details

Input: Text prompt to generate spectrogram image
Text Encoder Number of parameters: 340M
UNet Number of parameters: 865M
VAE Decoder Number of parameters: 83M
Model size: 1 GB
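The figures above can be combined into a rough sanity check of how compactly the weights are stored. The sketch below sums the three parameter counts and divides the listed model size by the total. It assumes that "1 GB" means 10^9 bytes and that the listed values are rounded, so the result is only an estimate and says nothing definite about the precision used for any individual component.

```python
# Parameter counts as listed in the technical details.
PARAMS = {
    "text_encoder": 340_000_000,
    "unet": 865_000_000,
    "vae_decoder": 83_000_000,
}

# Assumption: the listed "1 GB" is decimal (10**9 bytes) and rounded.
MODEL_SIZE_BYTES = 1_000_000_000

total_params = sum(PARAMS.values())

# Share of the total parameter count held by each component.
shares = {name: count / total_params for name, count in PARAMS.items()}

# Average storage per parameter implied by the listed model size.
bytes_per_param = MODEL_SIZE_BYTES / total_params
bits_per_param = 8 * bytes_per_param

# For comparison: size if every parameter were stored as 32-bit float.
fp32_size_gb = total_params * 4 / 1e9

print(f"total parameters: {total_params / 1e6:.0f}M")
for name, share in shares.items():
    print(f"  {name}: {share:.1%}")
print(f"implied storage: {bytes_per_param:.2f} bytes "
      f"({bits_per_param:.1f} bits) per parameter")
print(f"unquantized fp32 size would be about {fp32_size_gb:.2f} GB")
```

Under these assumptions the U-Net accounts for roughly two thirds of the parameters, and the implied average is below one byte per parameter, which is consistent with the model being tagged as quantized.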

Applicable Scenarios

  • Music Generation
  • Music Editing
  • Content Creation


  • generative-ai
    Models capable of generating text, images, or other data using generative models, often in response to prompts.
  • quantized
    A “quantized” model can run in low or mixed precision, which can substantially reduce inference latency.
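As a minimal illustration of what running in low precision means, the sketch below applies symmetric per-tensor 8-bit quantization to a handful of weights and then restores them. The scheme, the function names, and the example weights are assumptions chosen for clarity; the page does not state which quantization scheme or bit width this model actually uses.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the signed range [-127, 127].

    Returns the integer codes and the scale needed to restore them.
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical example weights.
weights = [0.82, -1.27, 0.05, 0.0, 0.4]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a step (`scale / 2`). Integer arithmetic on such codes is what allows hardware accelerators to reduce inference latency.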

Supported Compute Chipsets

  • Snapdragon® X Elite