DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching

Abstract

Recent advances in text-to-speech (TTS) synthesis, particularly those leveraging large language models (LLMs), have significantly improved expressiveness and naturalness. However, generating human-like, interactive dialogue speech remains challenging. Current systems are limited by the scarcity of dual-track data and by the difficulty of achieving naturalness, contextual coherence, speaker consistency, and interactional dynamics such as turn-taking and overlapping speech in multi-turn conversations. To address these challenges, we propose DialoSpeech, a dual-track architecture that combines a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech generates natural multi-turn conversations with coherent speaker turns and natural overlaps, and it supports Chinese, English, and cross-lingual speech synthesis. We further introduce a data processing pipeline for constructing dual-track dialogue datasets, which facilitates scalable training and experimental validation. Experiments show that our model outperforms baselines, offering a solution for generating human-like spoken dialogues.

Inference Pipeline

Figure: DialoSpeech inference pipeline.

Training Pipeline

Figure: DialoSpeech training pipeline.

Demos

Chinese Dialogue Samples

Dialogue Text | Synthesized Audio
[s1]我觉得啊,就是经历了这么多年的经验, 就是补剂的作用就是九分的努力, 十分之一的补剂。 嗯,选的话肯定是九分更重要,但是我觉得补剂它能够让你九分的努力更加的有效率,更加的避免徒劳无功。 嗯,就是你,你你得先得真的锻炼,真的努力,真的健康饮食,然后再考虑补剂, 那你再加十十分之一的补剂的话,他可能就是说啊, 一半是心理作用,[s2] 对,其实很多时候心理作用是非常重要的。嗯,然后我每次用补剂的时候,我就会更加努力,就比如说我在健身之前我喝了一勺蛋白粉,我就会督促自己多练,[s1] 其实心理作用只要能实现你的预期目的就可以了。 就比如说给自行车链条加油, 它其实不是必要的,但是它可以让你骑行更顺畅, 然后提高你骑行的频率。

A:哎,好久不见,李明听说你研究生是保研到了西北工业大学的 ASLP 实验室是吗?

B:嗯,是的,我之前就一直对像音频啊语音合成这些特别感兴趣,刚好实验室也在专注研究这块,就真的,嗯,挺幸运的吧呵呵。

A:哦?那今天能和大家分享一下你们最近在语音合成方面的一些进展吗?感觉大家呃现在对这方面都挺感兴趣呢。

B:对,最近我们刚开发出了一个用16万小时中英文数据训练的模型,呃效果怎么说呢,嗯,真的非常不错。
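The first sample above marks speaker changes inline with `[s1]` and `[s2]` tags. As a minimal sketch of how such a tagged string could be split into per-speaker turns before synthesis, the snippet below uses a regular expression. The helper name `split_turns` and the assumption that tags always take the form `[s<number>]` are hypothetical, inferred only from the sample text; this is not the DialoSpeech preprocessing code.

```python
import re

# Assumption: speaker tags look like [s1], [s2], ... as in the sample above.
TAG = re.compile(r"\[(s\d+)\]")


def split_turns(text):
    """Split a tagged dialogue string into (speaker, utterance) pairs."""
    # re.split with a capturing group keeps the captured tags, so the result
    # is [text_before_first_tag, tag1, text1, tag2, text2, ...].
    parts = TAG.split(text)
    turns = []
    for speaker, utterance in zip(parts[1::2], parts[2::2]):
        utterance = utterance.strip()
        if utterance:  # skip tags that are followed by no text
            turns.append((speaker, utterance))
    return turns


if __name__ == "__main__":
    for speaker, utterance in split_turns("[s1]Hi there,[s2] yes.[s1] ok"):
        print(speaker, utterance)
```

Any text before the first tag is discarded in this sketch, because it cannot be attributed to a speaker.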

DialoSpeech vs. CoVoMix (En)

Dialogue Text | Prompt (A) | Prompt (B) | DialoSpeech | CoVoMix