MMAudio
MMAudio is a video-to-audio synthesis model, presented at CVPR 2025, that generates high-quality, synchronized audio from video input and an optional text prompt. Developed by researchers from the University of Illinois Urbana-Champaign and Sony AI, it uses a multimodal joint training approach to learn from both audio-visual and audio-text datasets. A dedicated synchronization module aligns the generated sound with the video frames. The system is designed to add sound to silent videos, such as those produced by generative video models like Sora or Veo, by generating realistic environmental sounds and dialogue. It accepts video-only input as well as video paired with a text description that guides the generated audio. The project provides pre-trained models through Hugging Face, along with a command-line interface and demo scripts for local deployment, which requires PyTorch and a CUDA-compatible GPU. Inference is optimized for modern h