Yinlin Guo, Yening Lv, Jinqiao Dou, Yan Zhang, Yuehai Wang
Department of Information and Electronic Engineering, Zhejiang University, China
Accepted to Interspeech 2024
While recent advances in text-to-speech (TTS) synthesis have yielded remarkable improvements in speech quality, research on lightweight and fast models remains limited. This paper introduces FLY-TTS, a fast, lightweight, and high-quality speech synthesis system based on VITS. Specifically, 1) we replace the decoder with ConvNeXt blocks that generate Fourier spectral coefficients, followed by an inverse short-time Fourier transform (iSTFT) to synthesize waveforms; 2) to compress the model, we introduce a grouped parameter-sharing mechanism into the text encoder and the flow-based model; 3) we employ the large pre-trained WavLM model for adversarial training to improve synthesis quality. Experimental results show that our model achieves a real-time factor (RTF) of 0.0139 on an Intel Core i9 CPU, 8.8x faster than the baseline (0.1221), with 1.6x parameter compression. Objective and subjective evaluations indicate that FLY-TTS achieves speech quality comparable to the strong baseline.
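The real-time factor quoted above is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The following is a minimal sketch of how such a figure could be measured, not the authors' benchmarking code: `dummy_synthesize` is a hypothetical stand-in for a model, and the 22050 Hz sample rate is an assumption (it matches LJ Speech) rather than a detail stated on this page.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text: str, sample_rate: int) -> float:
    """Time one call to `synthesize` and convert it to a real-time factor.

    `synthesize` is assumed to map a string to a sequence of audio samples.
    """
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform) / sample_rate)


def dummy_synthesize(text: str) -> list:
    # Hypothetical placeholder model: 0.1 s of silence per input character
    # at an assumed sample rate of 22050 Hz.
    return [0.0] * (len(text) * 2205)


rtf = measure_rtf(dummy_synthesize, "He was reported to have fallen away to a shadow.", 22050)
print(f"RTF of the placeholder synthesizer: {rtf:.6f}")
```

In practice the measurement is usually averaged over many utterances, since a single call is sensitive to warm-up and scheduling noise.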
This accompanying page presents examples of speech synthesized with the proposed and conventional methods. The ground-truth audio clips are from the LJ Speech dataset [1]. These clips are not included in the training data.
Model | Text: He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity. | Text: Mr. Sturges Bourne, Sir James Mackintosh, Sir James Scarlett, and William Wilberforce. | Text: The fatal consequences whereof might be prevented if the justices of the peace were duly authorized | Text: He was reported to have fallen away to a shadow. |
---|---|---|---|---|
Ground truth | (audio) | (audio) | (audio) | (audio) |
VITS [2] | (audio) | (audio) | (audio) | (audio) |
MB-iSTFT-VITS [3] | (audio) | (audio) | (audio) | (audio) |
FLY-TTS | (audio) | (audio) | (audio) | (audio) |
Mini FLY-TTS | (audio) | (audio) | (audio) | (audio) |
Comparison of model size and average real-time factor (RTF) on an Intel Core i9 CPU (RTF-CPU) and an NVIDIA 3090 GPU (RTF-GPU)
Model | #Params | RTF-CPU | RTF-GPU |
---|---|---|---|
VITS-base | 28.11 M | 0.1221 | 0.0276 |
MB-iSTFT-base | 27.49 M | 0.0274 | 0.0095 |
FLY-TTS | 17.89 M | 0.0139 | 0.0062 |
Mini FLY-TTS | 10.92 M | 0.0127 | 0.0061 |
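The speedup and compression ratios implied by the table can be checked with a few lines of arithmetic. The sketch below copies the table values and divides the VITS-base figures by each model's figures; the function and key names are illustrative choices, not taken from the paper.

```python
# (parameters in millions, RTF on CPU, RTF on GPU), copied from the table above.
RESULTS = {
    "VITS-base": (28.11, 0.1221, 0.0276),
    "MB-iSTFT-base": (27.49, 0.0274, 0.0095),
    "FLY-TTS": (17.89, 0.0139, 0.0062),
    "Mini FLY-TTS": (10.92, 0.0127, 0.0061),
}


def relative_to_baseline(results: dict, baseline: str = "VITS-base") -> dict:
    """Express each model's size and speed as a ratio against the baseline.

    Ratios above 1 mean fewer parameters or a lower RTF than the baseline.
    """
    base_params, base_cpu, base_gpu = results[baseline]
    return {
        name: {
            "compression": base_params / params,
            "cpu_speedup": base_cpu / rtf_cpu,
            "gpu_speedup": base_gpu / rtf_gpu,
        }
        for name, (params, rtf_cpu, rtf_gpu) in results.items()
    }


ratios = relative_to_baseline(RESULTS)
for name, r in ratios.items():
    print(
        f"{name}: {r['compression']:.2f}x smaller, "
        f"{r['cpu_speedup']:.2f}x faster on CPU, "
        f"{r['gpu_speedup']:.2f}x faster on GPU"
    )
```

With these numbers, FLY-TTS comes out at roughly 8.8x the CPU speed of VITS-base with about 1.6x fewer parameters, which matches the figures quoted in the abstract.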