vibevoice
About this tag
VibeVoice is an open-source text-to-speech framework from Microsoft Research designed for long-form, multi-speaker conversational audio. It can synthesize up to 90 minutes of coherent speech with up to four distinct speakers, using a compact LLM planner, novel continuous tokenizers, and a diffusion-based acoustic decoder. The framework supports English and Mandarin, includes safety features like an audible disclaimer and imperceptible watermark, and is intended for research use. VibeVoice represents a shift from short, single-voice clips to hour-scale, podcast-like synthetic dialogue, with models available on GitHub and Hugging Face.
Microsoft’s new VibeVoice marks a striking shift in what open-source text-to-speech can do: from short, single-voice clips to hour‑scale, multi‑speaker spoken audio that resembles a produced podcast — and it’s available now for researchers and tinkerers to try. The framework packages a compact...
ai in windows
continuous_tokenizers
diffusion acoustic head
english mandarin
gpu
hour-scale
llm planner
long form audio
multi-speaker
open source
podcast editing
research release
safety features
speech synthesis
text-to-speech
tts
vibevoice
watermark
Microsoft Research has released VibeVoice, an open-source text‑to‑speech (TTS) framework built for long-form, multi‑speaker conversational audio and designed to push the boundaries of scalability, speaker consistency, and natural turn‑taking in synthetic dialogue. (github.com, huggingface.co)...
Microsoft’s VibeVoice-1.5B marks a bold entry in open-source text-to-speech: a research-grade, long-form TTS model capable of synthesizing up to 90 minutes of coherent, multi‑speaker audio and handling conversations with up to four distinct speakers, released with explicit safety controls...