Experience VibeVoice's capabilities directly in your browser. Generate multi-speaker conversational audio from text.
Seamless switching between English and Mandarin Chinese
Natural emotions and singing with context awareness
Up to 4 distinct speakers with natural turn-taking
Generate up to 90 minutes of continuous speech
Generate up to 90 minutes of continuous, high-quality speech, well suited to podcasts and other long-form content
Support for up to 4 distinct speakers with natural turn-taking and speaker consistency
Fully open source under the MIT license, with pre-trained models available on Hugging Face
Generate spontaneous emotions and singing with deep understanding of textual context and dialogue flow
Seamless language switching between English and Mandarin Chinese within the same conversation
Continuous speech tokenizers operating at a 7.5 Hz frame rate keep long-form generation efficient and scalable
Generate full podcast episodes with background music and natural conversational flow
An LLM interprets textual context and dialogue flow, and a diffusion head generates the high-fidelity acoustic details
Audible disclaimers and imperceptible watermarks to ensure responsible AI usage
Choose from our range of pre-trained models optimized for different use cases
Optimal for most use cases with excellent balance of quality and efficiency
Larger model with enhanced quality for professional applications
📏 Context Length: TBA
⏱️ Generation Length: Real-time
🔗 Weight: Coming Soon
Optimized for real-time streaming applications
Clone the repository and install dependencies using pip
Create a text file with dialogue format and speaker names
Run inference to create long-form, multi-speaker conversational audio
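The script-preparation step can be sketched in Python. This is a minimal illustration, not the project's own tooling: it assumes each line of the dialogue file has the form `Name: utterance`, which this page does not spell out, and the helper names `parse_script`, `speakers_in`, and `format_script` are hypothetical. The only value taken from this page is the limit of 4 speakers.

```python
import re

MAX_SPEAKERS = 4  # speaker limit stated on this page

# Assumed line format (not confirmed by this page): "<speaker name>: <utterance>"
LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def speakers_in(turns):
    """Return distinct speaker names in order of first appearance."""
    return list(dict.fromkeys(speaker for speaker, _ in turns))


def parse_script(text):
    """Parse dialogue text into (speaker, utterance) turns, skipping blank lines."""
    turns = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected 'Name: utterance'")
        turns.append((match.group(1), match.group(2)))
    names = speakers_in(turns)
    if len(names) > MAX_SPEAKERS:
        raise ValueError(f"{len(names)} speakers found, at most {MAX_SPEAKERS} allowed")
    return turns


def format_script(turns):
    """Serialize turns back into the assumed one-turn-per-line text format."""
    return "\n".join(f"{speaker}: {utterance}" for speaker, utterance in turns) + "\n"
```

Calling `parse_script` on a draft before running inference catches malformed lines and scripts with more than four speakers early. `format_script(turns)` returns text that can be written to a file, for example with `pathlib.Path("dialogue.txt").write_text(...)`.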
VibeVoice's ability to generate 90-minute podcasts with multiple speakers has revolutionized our content creation workflow. The cross-lingual capabilities are particularly impressive.
AI Research Scientist
The spontaneous emotion and singing generation features showcase the frontier capabilities of VibeVoice. It's a significant advancement in conversational TTS.
Speech Technology Lab
Microsoft's open-source approach with VibeVoice accelerates research in the field. The technical documentation and pre-trained models are excellent resources.
University Research Director
Join the frontier of open-source text-to-speech technology. Create expressive, long-form, multi-speaker audio today.
Yes! VibeVoice is fully open source under the MIT license, with pre-trained models available on Hugging Face. It is free to use for research and development purposes.
Currently, VibeVoice supports English and Mandarin Chinese with cross-lingual generation capabilities that allow natural language switching within conversations.
VibeVoice can generate up to 90 minutes of continuous audio with the 1.5B model and up to 45 minutes with the 7B model, well beyond what typical TTS systems produce in a single pass.
VibeVoice is intended for research and development purposes. For commercial applications, please review the risks and limitations section and consider further testing.
VibeVoice specializes in long-form, multi-speaker conversational audio with up to 4 speakers, spontaneous emotion generation, cross-lingual capabilities, and an efficient 7.5 Hz tokenizer frame rate.
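To see why the 7.5 Hz figure matters for long-form audio, the arithmetic can be worked through in a few lines of Python. This is a rough sketch that assumes "7.5 Hz" means the speech tokenizer emits 7.5 frames per second of audio. It counts only speech frames (not text tokens or any other overhead), and the helper name `frames_for_minutes` is hypothetical.

```python
FRAME_RATE_HZ = 7.5  # frame rate stated on this page


def frames_for_minutes(minutes, frame_rate_hz=FRAME_RATE_HZ):
    """Number of speech frames needed to cover `minutes` of audio."""
    return int(round(minutes * 60 * frame_rate_hz))


# Maximum durations quoted in the FAQ above.
print(frames_for_minutes(90))  # 40500 frames for 90 minutes
print(frames_for_minutes(45))  # 20250 frames for 45 minutes
```

At this rate a 90-minute session is about 40,500 speech frames. Passing a higher `frame_rate_hz` shows how quickly the sequence grows: the same 90 minutes at an illustrative 50 frames per second (a figure chosen for comparison, not taken from this page) would need 270,000 frames.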