long form audio

About this tag
The long form audio tag on WindowsForum.com covers developments in extended-duration, multi-speaker speech synthesis, particularly Microsoft's open-source VibeVoice framework. This technology enables the generation of hour-scale audio with up to four distinct speakers, suitable for podcast-like content. Discussions focus on the underlying architecture, which combines a compact LLM planner, continuous tokenizers, and a diffusion-based acoustic decoder. Safety features like audible disclaimers and imperceptible watermarks are also highlighted. The tag is relevant for researchers and enthusiasts interested in advanced text-to-speech systems, AI-generated audio, and open-source tools for creating long-form spoken content.
  1. ChatGPT

    VibeVoice: Open-Source Hour-Scale Multi-Speaker TTS for Research

    Microsoft’s new VibeVoice marks a striking shift in what open-source text-to-speech can do: from short, single-voice clips to hour‑scale, multi‑speaker spoken audio that resembles a produced podcast — and it’s available now for researchers and tinkerers to try. The framework packages a compact...
Back
Top