FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis

While recent advances in text-to-speech synthesis have yielded remarkable improvements in generating high-quality speech, research on lightweight and fast models is limited. This paper introduces FLY-TTS, a new fast, lightweight and high-quality speech synthesis system based on VITS. Specifically, 1) we replace the decoder with ConvNeXt blocks that generate Fourier spectral coefficients, followed by the inverse short-time Fourier transform to synthesize waveforms; 2) to compress the model size, we introduce a grouped parameter-sharing mechanism to the text encoder and flow-based model; 3) we further employ the large pre-trained WavLM model for adversarial training to improve synthesis quality. Experimental results show that our model achieves a real-time factor of 0.0139 on an Intel Core i9 CPU, 8.8x faster than the baseline (0.1221), with a 1.6x parameter compression. Objective and subjective evaluations indicate that FLY-TTS exhibits comparable speech quality to the strong baseline.

This accompanying page includes examples of speech synthesized with the proposed and conventional methods. The ground-truth audio clips are from the LJ Speech dataset [1]. These clips are not included in the training data.

Model

Text: He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity.

Text: Mr. Sturges Bourne, Sir James Mackintosh, Sir James Scarlett, and William Wilberforce.

Text: The fatal consequences whereof might be prevented if the justices of the peace were duly authorized

Text: He was reported to have fallen away to a shadow.

Ground truth

VITS [2]

MB-iSTFT-VITS [3]

FLY-TTS

Mini-FLY-TTS

Model	#Params	RTF (CPU)	RTF (GPU)
VITS-base	28.11 M	0.1221	0.0276
MB-iSTFT-base	27.49 M	0.0274	0.0095
FLY-TTS	17.89 M	0.0139	0.0062
Mini FLY-TTS	10.92 M	0.0127	0.0061
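The speedup and compression figures quoted in the abstract can be re-derived from the table above. The sketch below is only an illustration: it assumes the usual definition of the real-time factor (synthesis time divided by the duration of the generated audio), and the names `MODELS`, `real_time_factor` and `relative_to_baseline` are hypothetical helpers, not part of any released FLY-TTS code.

```python
# Values copied from the table above: (parameters in millions, CPU RTF, GPU RTF).
MODELS = {
    "VITS-base": (28.11, 0.1221, 0.0276),
    "MB-iSTFT-base": (27.49, 0.0274, 0.0095),
    "FLY-TTS": (17.89, 0.0139, 0.0062),
    "Mini FLY-TTS": (10.92, 0.0127, 0.0061),
}


def real_time_factor(synthesis_seconds, audio_seconds):
    """Assumed definition: time spent synthesizing / duration of the audio produced."""
    return synthesis_seconds / audio_seconds


def relative_to_baseline(model, baseline="VITS-base"):
    """Return (CPU speedup, GPU speedup, parameter compression) against the baseline."""
    b_params, b_cpu, b_gpu = MODELS[baseline]
    m_params, m_cpu, m_gpu = MODELS[model]
    # A lower RTF is faster, so the speedup is baseline RTF over model RTF.
    return b_cpu / m_cpu, b_gpu / m_gpu, b_params / m_params


cpu_speedup, gpu_speedup, compression = relative_to_baseline("FLY-TTS")
print(f"CPU speedup: {cpu_speedup:.1f}x, parameter compression: {compression:.1f}x")
# prints "CPU speedup: 8.8x, parameter compression: 1.6x"
```

An RTF below 1 means synthesis runs faster than playback; under the assumed definition, an RTF of 0.0139 corresponds to roughly 0.14 seconds of compute for 10 seconds of audio.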

References

[1] K. Ito, "The LJ speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[2] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. ICML, 2021, pp. 5530-5540.
[3] M. Kawamura et al., "Lightweight and high-fidelity end-to-end text-to-speech with multi-band generation and inverse short-time Fourier transform," in Proc. ICASSP, 2023.
