DocLayoutTTS: Dataset and Baselines for Layout-informed Document-level Neural Speech Synthesis

Abstract

We propose a new task: synthesizing speech directly from semi-structured documents, where the text tokens extracted by OCR systems may not be in the correct reading order because of the complex document layout. We refer to this task as layout-informed document-level TTS and present the DocSpeech dataset, which consists of 10K audio clips (approximately 830 hours, with an average duration of 5 minutes) of a single speaker reading Word documents with complex layouts. For each document, we provide the natural reading order of the text tokens, the corresponding bounding boxes, and the audio clips synthesized with the correct reading order. We also introduce DocLayoutTTS, a Transformer encoder-decoder architecture that generates speech end-to-end given a document image with OCR-extracted text. Our architecture learns text reordering and mel-spectrogram prediction simultaneously in a multi-task setup. Moreover, we use curriculum learning to progressively learn longer, more challenging document-level text using the DocSpeech and LJSpeech datasets. Our empirical results show that the underlying task is challenging. Our proposed architecture performs slightly better than competitive baseline TTS models with a pre-trained model providing reading order priors. We also release samples of the DocSpeech dataset.
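The curriculum idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the actual schedule, so everything here is an assumption for illustration: the function name `curriculum_stages`, the use of token count as the difficulty measure, the cumulative staging, and the toy sample ids are all hypothetical and are not taken from the DocLayoutTTS implementation.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample record: (sample id, number of text tokens).
Sample = Tuple[str, int]


def curriculum_stages(samples: Sequence[Sample], num_stages: int) -> List[List[Sample]]:
    """Build cumulative training stages ordered from shortest to longest text.

    Assumption: text length stands in for difficulty. Stage k contains the
    shortest ceil(k / num_stages) fraction of the samples, so later stages
    keep the easy samples and add progressively longer ones.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(samples, key=lambda s: s[1])
    stages: List[List[Sample]] = []
    for k in range(1, num_stages + 1):
        # Integer ceiling of len(ordered) * k / num_stages.
        cutoff = (len(ordered) * k + num_stages - 1) // num_stages
        stages.append(ordered[:cutoff])
    return stages


# Toy data (made up): two short sentence-level items and two long document-level items.
toy = [("doc_a", 900), ("sent_a", 12), ("doc_b", 750), ("sent_b", 20)]
stages = curriculum_stages(toy, num_stages=2)
```

With this toy input, the first stage holds only the two short items and the second stage holds all four, which mirrors the general pattern of starting on short utterances before moving to full documents.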

Publication
In Interspeech.