FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis

Yinlin Guo, Yening Lv, Jinqiao Dou, Yan Zhang, Yuehai Wang

Department of Information and Electronic Engineering, Zhejiang University, China

Accepted to Interspeech 2024

Abstract

While recent advances in text-to-speech (TTS) synthesis have yielded remarkable improvements in speech quality, research on lightweight and fast models remains limited. This paper introduces FLY-TTS, a fast, lightweight and high-quality speech synthesis system based on VITS. Specifically, 1) we replace the decoder with ConvNeXt blocks that generate Fourier spectral coefficients, followed by the inverse short-time Fourier transform (iSTFT) to synthesize waveforms; 2) to compress the model size, we introduce a grouped parameter-sharing mechanism into the text encoder and the flow-based model; 3) we further employ the large pre-trained WavLM model for adversarial training to improve synthesis quality. Experimental results show that our model achieves a real-time factor (RTF) of 0.0139 on an Intel Core i9 CPU, 8.8x faster than the VITS baseline (RTF 0.1221), with a 1.6x parameter compression. Objective and subjective evaluations indicate that FLY-TTS achieves speech quality comparable to that of the strong baseline.
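The real-time factor quoted above is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The following is a minimal sketch of such a measurement, not the authors' benchmarking code. The names real_time_factor and dummy_synthesize are hypothetical, and the 22050 Hz default sample rate is an assumption chosen to match LJ Speech.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int = 22050,  # assumption: LJ Speech sample rate
) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(waveform) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesizer returned an empty waveform")
    return elapsed / audio_seconds


def dummy_synthesize(text: str) -> list:
    """Hypothetical stand-in for a TTS model: 0.1 s of silence per character."""
    return [0.0] * (len(text) * 2205)


rtf = real_time_factor(dummy_synthesize, "He was reported to have fallen away to a shadow.")
print(f"RTF = {rtf:.6f}")
```

In practice the measurement would be averaged over many utterances after a warm-up run, since a single call is sensitive to caching and scheduling noise.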

Demo

This accompanying page presents examples of speech synthesized with the proposed and conventional methods. The ground-truth audio clips are from the LJ Speech dataset [1]. These clips are not included in the training data.


Text 1: He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity.
Text 2: Mr. Sturges Bourne, Sir James Mackintosh, Sir James Scarlett, and William Wilberforce.
Text 3: The fatal consequences whereof might be prevented if the justices of the peace were duly authorized
Text 4: He was reported to have fallen away to a shadow.

Models: Ground truth, VITS [2], MB-iSTFT-VITS [3], FLY-TTS, Mini-FLY-TTS


Comparison of model size and average RTF on an Intel Core i9 CPU (RTF-CPU) and an NVIDIA 3090 GPU (RTF-GPU)

Model           #Params    RTF-CPU   RTF-GPU
VITS-base       28.11 M    0.1221    0.0276
MB-iSTFT-base   27.49 M    0.0274    0.0095
FLY-TTS         17.89 M    0.0139    0.0062
Mini FLY-TTS    10.92 M    0.0127    0.0061
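The speedup and compression figures quoted in the abstract can be re-derived from the table above. The short sketch below does the arithmetic relative to VITS-base. The dictionary layout and the function name ratios_vs_baseline are illustrative choices, and the numbers are copied directly from the table.

```python
# (parameters in millions, RTF on CPU, RTF on GPU), copied from the table
RESULTS = {
    "VITS-base": (28.11, 0.1221, 0.0276),
    "MB-iSTFT-base": (27.49, 0.0274, 0.0095),
    "FLY-TTS": (17.89, 0.0139, 0.0062),
    "Mini FLY-TTS": (10.92, 0.0127, 0.0061),
}


def ratios_vs_baseline(results: dict, baseline: str = "VITS-base") -> dict:
    """Compute parameter compression and RTF speedups relative to the baseline."""
    base_params, base_cpu, base_gpu = results[baseline]
    out = {}
    for name, (params, rtf_cpu, rtf_gpu) in results.items():
        out[name] = {
            "compression": base_params / params,  # larger means smaller model
            "cpu_speedup": base_cpu / rtf_cpu,    # larger means faster on CPU
            "gpu_speedup": base_gpu / rtf_gpu,    # larger means faster on GPU
        }
    return out


for name, r in ratios_vs_baseline(RESULTS).items():
    print(
        f"{name:14s} compression {r['compression']:.1f}x  "
        f"CPU {r['cpu_speedup']:.1f}x  GPU {r['gpu_speedup']:.1f}x"
    )
```

For FLY-TTS this gives roughly 8.8x on CPU and 1.6x parameter compression, matching the abstract.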


References

[1] K. Ito, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[2] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. ICML, 2021, pp. 5530-5540.
[3] M. Kawamura et al., "Lightweight and high-fidelity end-to-end text-to-speech with multi-band generation and inverse short-time Fourier transform," in Proc. ICASSP, 2023.