VibeVoice: Frontier Open-Source Text-to-Speech

Generate expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. VibeVoice, Microsoft's open-source text-to-speech model, synthesizes up to 90 minutes of speech with up to 4 distinct speakers.

Key Capabilities

Cross-Lingual Generation

Seamless switching between English and Mandarin Chinese.

Spontaneous Expression

Natural emotions and singing with context awareness.

Multi-Speaker Conversations

Up to 4 distinct speakers with natural turn-taking.

Long-Form Audio

Generate up to 90 minutes of continuous speech.

Why Choose VibeVoice?

Ultra-Long Audio Generation

Generate up to 90 minutes of continuous, high-quality speech, well suited to podcasts and other long-form content.

Multi-Speaker Conversations

Support for up to 4 distinct speakers with natural turn-taking and consistent voices for each speaker.

Open Source & Free

Fully open-source under the MIT license, with pre-trained models available on Hugging Face.

Advanced Technical Features

Context-Aware Expression

Generate spontaneous emotion and singing informed by the textual context and dialogue flow.

Cross-Lingual Generation

Seamless language switching between English and Mandarin Chinese within the same conversation.

Ultra-Low Frame Rate

Continuous speech tokenizers operate at a frame rate of 7.5 Hz, which keeps long-form generation computationally scalable.
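To see why a low frame rate matters for long-form audio, the arithmetic can be sketched as follows. The 7.5 Hz figure comes from this page; the 50 Hz comparison rate is a hypothetical value chosen only for illustration, and the function name is made up for this example.

```python
FRAME_RATE_HZ = 7.5  # frame rate quoted on this page


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of speech frames needed for `minutes` of audio."""
    seconds = minutes * 60
    return round(seconds * frame_rate_hz)


if __name__ == "__main__":
    # 90 minutes is the longest generation length quoted on this page.
    print(frames_for(90))        # frames at 7.5 Hz
    # Hypothetical 50 Hz tokenizer, for comparison only.
    print(frames_for(90, 50.0))  # frames at 50 Hz
```

At 7.5 Hz, 90 minutes of audio is 40,500 frames, against 270,000 at the hypothetical 50 Hz rate, so the sequence the model has to handle is far shorter.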

Podcast with Background Music

Generate full podcast episodes with background music and natural conversational flow.

Next-Token Diffusion

An LLM models the textual context and dialogue flow, and a diffusion head generates the high-fidelity acoustic details.

Built-in Safety Features

Audible disclaimers and imperceptible watermarks support responsible use of AI-generated audio.

Available Models

Choose from pre-trained models optimized for different use cases.

VibeVoice-1.5B (available)

Context length: 64K tokens
Generation length: ~90 minutes
Weights: Hugging Face

Optimal for most use cases, with a good balance of quality and efficiency.

VibeVoice-7B (available)

Context length: 32K tokens
Generation length: ~45 minutes
Weights: Hugging Face

Larger model with enhanced quality for professional applications.

VibeVoice-0.5B-Streaming (coming soon)

Context length: TBA
Generation length: real-time
Weights: coming soon

Optimized for real-time streaming applications.

Getting Started with VibeVoice

Install VibeVoice

Clone the repository and install its dependencies with pip.

Prepare Your Script

Create a text file in dialogue format, with each line of dialogue labeled with its speaker's name.

Generate Audio

Run inference to produce long-form, multi-speaker conversational audio.
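Before running inference, it can help to sanity-check the script file from the "Prepare Your Script" step. The sketch below is a minimal example and assumes each turn is a single line of the form `Name: text`. That line format, the `parse_script` function, and the `MAX_SPEAKERS` constant are assumptions made for illustration, not the project's documented format or API; only the 4-speaker limit comes from this page.

```python
MAX_SPEAKERS = 4  # speaker limit stated on this page


def parse_script(text: str):
    """Split a dialogue script into (speaker, utterance) turns.

    Assumed (hypothetical) format: one turn per line, written as "Name: text".
    Returns the list of turns and the speakers in order of first appearance.
    """
    turns = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines between turns
        # Split on the first colon only, so colons inside the utterance survive.
        name, sep, utterance = line.partition(":")
        name, utterance = name.strip(), utterance.strip()
        if not sep or not name or not utterance:
            raise ValueError(f"line {number}: expected 'Name: text', got {line!r}")
        turns.append((name, utterance))

    # dict.fromkeys removes duplicates while keeping first-seen order.
    speakers = list(dict.fromkeys(name for name, _ in turns))
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found; at most {MAX_SPEAKERS} are supported"
        )
    return turns, speakers


if __name__ == "__main__":
    sample = "Alice: Welcome to the show.\nBob: Thanks, glad to be here."
    turns, speakers = parse_script(sample)
    print(speakers)
    for name, utterance in turns:
        print(f"{name} -> {utterance}")
```

Rejecting a malformed script up front is cheaper than discovering the problem partway through a long generation run.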

What Researchers Say

VibeVoice's ability to generate 90-minute podcasts with multiple speakers has revolutionized our content creation workflow. The cross-lingual capabilities are particularly impressive.

Dr. Emily Zhang

AI Research Scientist

The spontaneous emotion and singing generation features showcase the frontier capabilities of VibeVoice. It's a significant advancement in conversational TTS.

Prof. Alex Rodriguez

Speech Technology Lab

Microsoft's open-source approach with VibeVoice accelerates research in the field. The technical documentation and pre-trained models are excellent resources.

Dr. Lisa Kim

University Research Director

Frequently Asked Questions

Is VibeVoice really open-source and free?

Yes. VibeVoice is fully open-source under the MIT license, with pre-trained models available on Hugging Face. It is free to use for research and development purposes.

What languages does VibeVoice support?

Currently, VibeVoice supports English and Mandarin Chinese with cross-lingual generation capabilities that allow natural language switching within conversations.

How long can VibeVoice generate audio?

VibeVoice can generate up to 90 minutes of continuous audio with the 1.5B model, and up to 45 minutes with the 7B model, far exceeding typical TTS limitations.
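The limits in this answer can be encoded as a small lookup for choosing a model by target duration. This is a minimal sketch: the `ModelSpec` class and `models_for` function are hypothetical names, and the minute limits are the approximate figures quoted on this page, not hard guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    max_minutes: int  # approximate generation length quoted on this page


MODELS = (
    ModelSpec("VibeVoice-1.5B", 90),
    ModelSpec("VibeVoice-7B", 45),
)


def models_for(minutes: float) -> list[str]:
    """Return the names of models whose quoted limit covers `minutes`."""
    return [spec.name for spec in MODELS if minutes <= spec.max_minutes]


if __name__ == "__main__":
    for target in (30, 60, 120):
        print(target, models_for(target))
```

A 30-minute episode fits either model, a 60-minute one fits only the 1.5B model, and a 120-minute target exceeds both quoted limits.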

Can I use VibeVoice for commercial projects?

VibeVoice is intended for research and development purposes. For commercial applications, please review the risks and limitations section and consider further testing.

What makes VibeVoice different from other TTS models?

VibeVoice specializes in long-form, multi-speaker conversational audio with up to 4 speakers, spontaneous emotion generation, cross-lingual capabilities, and efficient 7.5 Hz processing.
