VibeVoice: Frontier Open-Source Text-to-Speech

Generate expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. Synthesize up to 90 minutes of speech with up to 4 distinct speakers using Microsoft's open-source VibeVoice models.

Try VibeVoice Live Demo

Experience VibeVoice's capabilities directly in your browser. Generate multi-speaker conversational audio from text

🎙️ Multi-Speaker
🌐 Cross-Lingual
⏱️ 90min Audio
🎵 AI Singing

Key Capabilities

🎙️

Cross-Lingual Generation

Seamless switching between English and Mandarin Chinese

🎵

Spontaneous Expression

Natural emotions and singing with context awareness

👥

Multi-Speaker Conversations

Up to 4 distinct speakers with natural turn-taking

⏱️

Long-Form Audio

Generate up to 90 minutes of continuous speech

Why Choose VibeVoice?

⏱️

Ultra-Long Audio Generation

Generate up to 90 minutes of continuous, high-quality speech, suited to podcasts and other long-form content

👥

Multi-Speaker Conversations

Support for up to 4 distinct speakers with natural turn-taking and speaker consistency

🌐

Open Source & Free

Fully open-source under the MIT license, with pre-trained models available on Hugging Face

Advanced Technical Features

🎭

Context-Aware Expression

Generate spontaneous emotions and singing with deep understanding of textual context and dialogue flow

🔄

Cross-Lingual Generation

Seamless language switching between English and Mandarin Chinese within the same conversation

Ultra-Low Frame Rate

Continuous speech tokenizers operate at a frame rate of 7.5 Hz, which keeps sequences short enough for scalable long-form generation
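
As a rough illustration of why the low frame rate matters, the sketch below counts how many speech frames a recording of a given length needs at 7.5 Hz. It assumes one sequence position per frame and ignores the script text and any other tokens, so the result is a back-of-the-envelope estimate rather than the model's actual token accounting. `frames_needed` is a hypothetical helper, not part of VibeVoice.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted on this page


def frames_needed(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of speech frames for `minutes` of audio.

    Assumption: one frame per sequence position; text tokens are ignored.
    """
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    for minutes in (45, 90):
        print(f"{minutes} min -> {frames_needed(minutes)} frames")
```

Under these assumptions, 90 minutes works out to 40,500 frames and 45 minutes to 20,250 frames, both below the 64K and 32K context lengths listed for the 1.5B and 7B models.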

🎵

Podcast with Background Music

Generate full podcast episodes with background music and natural conversational flow

🧠

Next-Token Diffusion

An LLM models textual context and dialogue flow, and a diffusion head generates the high-fidelity acoustic details

🔒

Built-in Safety Features

Audible disclaimers and imperceptible watermarks to help mitigate misuse and support responsible AI usage

Available Models

Choose from our range of pre-trained models optimized for different use cases

VibeVoice-1.5B

AVAILABLE

📏 Context Length: 64K tokens

⏱️ Generation Length: ~90 minutes

🔗 Weight: Hugging Face

Optimal for most use cases with excellent balance of quality and efficiency

VibeVoice-7B

AVAILABLE

📏 Context Length: 32K tokens

⏱️ Generation Length: ~45 minutes

🔗 Weight: Hugging Face

Larger model with enhanced quality for professional applications

VibeVoice-0.5B-Streaming

COMING SOON

📏 Context Length: TBA

⏱️ Generation Length: Real-time

🔗 Weight: Coming Soon

Optimized for real-time streaming applications
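
One way to act on the listing above is to encode the stated limits as data and filter by target duration. The sketch below is illustrative only: the `ModelSpec` type and `models_for_duration` helper are hypothetical, the figures are copied from the listing, and the streaming model is left out because its limits are not yet announced.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelSpec:
    name: str
    context_tokens: int  # context length in tokens, as listed above
    max_minutes: int     # approximate generation length, as listed above


# "64K" and "32K" are read as multiples of 1024 here (an assumption).
MODELS = [
    ModelSpec("VibeVoice-1.5B", 64 * 1024, 90),
    ModelSpec("VibeVoice-7B", 32 * 1024, 45),
]


def models_for_duration(minutes: float) -> List[str]:
    """Return names of listed models whose approximate limit covers `minutes`."""
    return [m.name for m in MODELS if minutes <= m.max_minutes]


if __name__ == "__main__":
    print(models_for_duration(60))
```

A 60-minute target leaves only the 1.5B model, while anything up to about 45 minutes can use either, so quality preferences can decide.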

Getting Started with VibeVoice

1. Install VibeVoice

Clone the repository and install its dependencies with pip

2. Prepare Your Script

Create a text file in dialogue format, labeling each turn with its speaker name

3. Generate Audio

Run inference to create long-form, multi-speaker conversational audio
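
This page does not spell out the script format, so the following is a sketch under an assumed convention: one turn per line, written as `Name: text`. The `parse_script` helper and the sample dialogue are hypothetical. The check for at most 4 speakers reflects the limit stated on this page. Consult the repository for the format its inference scripts actually expect.

```python
from typing import List, Tuple

MAX_SPEAKERS = 4  # speaker limit stated on this page


def parse_script(text: str) -> List[Tuple[str, str]]:
    """Parse 'Name: utterance' lines into (speaker, utterance) pairs.

    Assumed format: one turn per line; blank lines are skipped.
    """
    turns = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        speaker, sep, utterance = line.partition(":")
        if not sep or not speaker.strip() or not utterance.strip():
            raise ValueError(f"line {lineno}: expected 'Name: text'")
        turns.append((speaker.strip(), utterance.strip()))
    speakers = {s for s, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found; limit is {MAX_SPEAKERS}")
    return turns


SAMPLE = """\
Alice: Welcome back to the show.
Bob: Thanks, glad to be here.

Alice: Let's get started.
"""

if __name__ == "__main__":
    for speaker, utterance in parse_script(SAMPLE):
        print(f"{speaker} -> {utterance}")
```

Validating the script before inference catches malformed lines and excess speakers early, which matters when a single run may produce more than an hour of audio.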

What Researchers Say

"VibeVoice's ability to generate 90-minute podcasts with multiple speakers has revolutionized our content creation workflow. The cross-lingual capabilities are particularly impressive."

Dr. Emily Zhang, AI Research Scientist

"The spontaneous emotion and singing generation features showcase the frontier capabilities of VibeVoice. It's a significant advancement in conversational TTS."

Prof. Alex Rodriguez, Speech Technology Lab

"Microsoft's open-source approach with VibeVoice accelerates research in the field. The technical documentation and pre-trained models are excellent resources."

Dr. Lisa Kim, University Research Director

Start Building with VibeVoice

Join the frontier of open-source text-to-speech technology. Create expressive, long-form, multi-speaker audio today

Frequently Asked Questions

Is VibeVoice really open-source and free?

Yes. VibeVoice is fully open-source under the MIT license, with pre-trained models available on Hugging Face. It is free to use for research and development purposes.

What languages does VibeVoice support?

Currently, VibeVoice supports English and Mandarin Chinese with cross-lingual generation capabilities that allow natural language switching within conversations.

How long can VibeVoice generate audio?

VibeVoice can generate up to about 90 minutes of continuous audio with the 1.5B model and up to about 45 minutes with the 7B model, well beyond the short utterances that typical TTS systems produce in a single pass.

Can I use VibeVoice for commercial projects?

VibeVoice is intended for research and development purposes. Before using it in a commercial application, review the project's risks and limitations section and carry out further testing of your own.

What makes VibeVoice different from other TTS models?

VibeVoice specializes in long-form, multi-speaker conversational audio: up to 4 speakers, spontaneous emotion generation, cross-lingual capabilities, and speech tokenizers that run at an efficient 7.5 Hz frame rate.