Instructions to use ai-forever/kandinsky-4-v2a with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use ai-forever/kandinsky-4-v2a with Diffusers:
pip install -U diffusers transformers accelerate
import torch from diffusers import DiffusionPipeline # switch to "mps" for apple devices pipe = DiffusionPipeline.from_pretrained("ai-forever/kandinsky-4-v2a", dtype=torch.bfloat16, device_map="cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" image = pipe(prompt).images[0] - Notebooks
- Google Colab
- Kaggle
YAML Metadata Warning:The pipeline tag "video-to-audio" is not in the official list: text-classification, token-classification, table-question-answering, question-answering, zero-shot-classification, translation, summarization, feature-extraction, text-generation, fill-mask, sentence-similarity, text-to-speech, text-to-audio, automatic-speech-recognition, audio-to-audio, audio-classification, audio-text-to-text, voice-activity-detection, depth-estimation, image-classification, object-detection, image-segmentation, text-to-image, image-to-text, image-to-image, image-to-video, unconditional-image-generation, video-classification, reinforcement-learning, robotics, tabular-classification, tabular-regression, tabular-to-text, table-to-text, multiple-choice, text-ranking, text-retrieval, time-series-forecasting, text-to-video, image-text-to-text, image-text-to-image, image-text-to-video, visual-question-answering, document-question-answering, zero-shot-image-classification, graph-ml, mask-generation, zero-shot-object-detection, text-to-3d, image-to-3d, image-feature-extraction, video-text-to-text, keypoint-detection, visual-document-retrieval, any-to-any, video-to-video, other
Kandinsky-4-v2a: A Video to Audio pipeline
Description
Video to Audio pipeline consists of a visual encoder, a text encoder, UNet diffusion model to generate spectrogram and Griffin-lim algorithm to convert spectrogram into audio. Visual and text encoders share the same multimodal visual language decoder (cogvlm2-video-llama3-chat).
Our UNet diffusion model is a finetune of the music generation model riffusion. We made modifications in the architecture to condition on video frames and improve the synchronization between video and audio. Also, we replace the text encoder with the decoder of cogvlm2-video-llama3-chat.
Installation
git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
conda install -c conda-forge ffmpeg -y
pip install -r kandinsky4_video2audio/requirements.txt
pip install "git+https://github.com/facebookresearch/pytorchvideo.git"
Inference
Inference code for Video-to-Audio:
import torch
import torchvision
from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video
device='cuda:0'
pipe = Video2AudioPipeline(
"ai-forever/kandinsky-4-v2a",
torch_dtype=torch.float16,
device = device
)
video_path = 'assets/inputs/1.mp4'
video, _, fps = torchvision.io.read_video(video_path)
prompt="clean. clear. good quality."
negative_prompt = "hissing noise. drumming rythm. saying. poor quality."
video_input, video_complete, duration_sec = load_video(video, fps['video_fps'], num_frames=96, max_duration_sec=12)
out = pipe(
video_input,
prompt,
negative_prompt=negative_prompt,
duration_sec=duration_sec,
)[0]
save_path = f'assets/outputs/1.mp4'
create_video(
out,
video_complete,
display_video=True,
save_path=save_path,
device=device
)
Authors
- Downloads last month
- 3
Model tree for ai-forever/kandinsky-4-v2a
Base model
riffusion/riffusion-model-v1