
Development and Evaluation of Vietnamese Text-to-Speech Systems in the 2021 VLSP Campaign

GOOVER DAILY REPORT June 19, 2024

TABLE OF CONTENTS

  1. Summary
  2. Introduction to the Vietnamese TTS Systems for VLSP 2021
  3. Thunder Text-To-Speech System
  4. VLSP 2021 TTS Challenge
  5. Comparative Analysis of TTS Systems
  6. Conclusion
  7. Conclusion
  8. Glossary

1. Summary

  • The report titled 'Development and Evaluation of Vietnamese Text-to-Speech Systems in the 2021 VLSP Campaign' covers the creation and assessment of Vietnamese Text-to-Speech (TTS) systems for the VLSP 2021 evaluation campaign. The central topic is the Thunder Text-To-Speech System, which utilizes the FastSpeech2 model. Key challenges discussed include synthesizing spontaneous speech and processing inconsistent data. The report outlines the structure of the Thunder TTS system, the dataset used, and the system's data processing techniques. It highlights the Thunder TTS system's evaluation results, including a Mean Opinion Score (MOS) of 3.94 for in-domain data and 3.3 for out-domain data, and an 85.00% intelligibility score in Semantically Unpredictable Sentences (SUS) tests. Additionally, the report provides an overview of the VLSP 2021 TTS Challenge, participation statistics, evaluation metrics, and the performance of different teams in the competition.

2. Introduction to the Vietnamese TTS Systems for VLSP 2021

  • 2-1. Introduction to VLSP 2021 evaluation campaign

  • The VLSP (Vietnamese Language and Speech Processing) consortium held its eighth annual international workshop in 2021, which also marked the fourth time it ran a Text-To-Speech (TTS) shared task. The TTS challenge in the VLSP 2021 campaign aimed at synthesizing speech for spoken dialog systems. That year's key challenge was synthesizing spontaneous speech so that dialogue in conversational applications sounds more natural.

  • 2-2. Goals and objectives of the TTS track

  • The main objective of the VLSP 2021 TTS track was to build a TTS system using a 7.5-hour dataset collected from the YouTube channel 'Giang ơi'. This dataset comprised spontaneous speech recordings in Vietnamese. Participants needed to handle various challenges like background noise, inconsistent prosody, and inaccurate transcripts. The goal was to create systems that could produce natural, spontaneous speech suitable for conversational contexts.

3. Thunder Text-To-Speech System

  • 3-1. Overview and Purpose

  • The Thunder Text-To-Speech (TTS) System was developed to participate in the Vietnamese Text-to-Speech track of the 2021 VLSP evaluation campaign. The primary objective was to synthesize a natural voice from the provided corpus of spontaneous Vietnamese speech. The system uses the FastSpeech2 model to generate natural, spontaneous-sounding speech suitable for various human-machine communication applications.

  • 3-2. System Architecture with FastSpeech2 and HiFi-GAN Vocoder

  • The Thunder TTS system consists of two major components: an acoustic model and a vocoder. FastSpeech2 serves as the acoustic model, generating mel-spectrograms from a sequence of phonemes. It is modified in two ways: an additional Post-Net layer is included to improve mel-spectrogram quality, and the pitch and energy predictor modules are removed. HiFi-GAN serves as the vocoder, generating high-quality speech from the predicted mel-spectrogram frames. The vocoder stage also employs the denoiser module from the WaveGlow model to reduce synthesis noise.
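
  • The two-stage flow described above can be sketched as plain function composition. This is only a structural illustration, not the Thunder implementation: the function names, the 80 mel bins, the hop length of 256 samples, and the toy stand-ins that return silence are all assumptions made so that the sketch runs on its own.

```python
from typing import Callable, List, Sequence

Mel = List[List[float]]   # mel-spectrogram: frames x mel bins
Wave = List[float]        # mono waveform samples


def synthesize(
    phonemes: Sequence[str],
    acoustic_model: Callable[[Sequence[str]], Mel],
    vocoder: Callable[[Mel], Wave],
    denoiser: Callable[[Wave], Wave],
) -> Wave:
    """Phonemes -> mel-spectrogram -> waveform -> denoised waveform."""
    mel = acoustic_model(phonemes)   # role played by FastSpeech2 in the report
    wave = vocoder(mel)              # role played by HiFi-GAN in the report
    return denoiser(wave)            # role played by the WaveGlow denoiser


# Hypothetical stand-ins so the sketch is runnable; real models replace these.
def toy_acoustic_model(phonemes: Sequence[str], n_mels: int = 80) -> Mel:
    # One silent frame per phoneme (a real model predicts durations).
    return [[0.0] * n_mels for _ in phonemes]


def toy_vocoder(mel: Mel, hop_length: int = 256) -> Wave:
    # Each mel frame expands to hop_length waveform samples.
    return [0.0] * (len(mel) * hop_length)


def toy_denoiser(wave: Wave) -> Wave:
    # Identity placeholder for the noise-reduction step.
    return list(wave)


wave = synthesize(["x", "i", "n"], toy_acoustic_model, toy_vocoder, toy_denoiser)
print(len(wave))
```

  • Keeping the three stages behind simple callables makes it easy to swap any one of them, which matches how the challenge teams varied preprocessing and model components around the same basic pipeline.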

  • 3-3. Data Processing Steps and Dataset Details

  • The dataset used for this system was provided by the VLSP 2021 evaluation campaign and contained 5,341 utterances (approximately 7.23 hours) from a single speaker at a sampling rate of 44.1 kHz. Audio preprocessing involved removing background noise and identifying the main voice, normalizing peak amplitude, adjusting silent portions, and concatenating short utterances. Transcript preprocessing involved decoding with an ASR system, filtering based on word error rate, and converting sentences into phoneme sequences. The final processed dataset contained 7,743 utterances (9.67 hours), with subsets designated for training, validation, and testing.
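
  • Two of the steps above, peak amplitude normalization and filtering by word error rate (WER), can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the target peak of 0.95, the WER threshold of 0.2, and the idea that WER is computed between the provided transcript and the ASR output are all assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def peak_normalize(samples: Sequence[float], target_peak: float = 0.95) -> List[float]:
    """Scale the waveform so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(
    pairs: Sequence[Tuple[str, str]], max_wer: float = 0.2
) -> List[Tuple[str, str]]:
    """Keep (transcript, asr_output) pairs whose WER is at most max_wer."""
    return [(t, a) for t, a in pairs if word_error_rate(t, a) <= max_wer]


pairs = [
    ("xin chào các bạn", "xin chào các bạn"),  # ASR agrees with the transcript
    ("xin chào các bạn", "xin chào bạn"),      # one word missing
]
print(peak_normalize([0.1, -0.5, 0.25], target_peak=1.0))
print([word_error_rate(t, a) for t, a in pairs])
print(len(filter_by_wer(pairs)))
```

  • A low WER suggests the transcript matches what was actually said, so a threshold of this kind is one way to discard utterances with unreliable text before training.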

  • 3-4. Challenges in Synthesizing Spontaneous Speech

  • The primary challenges in synthesizing spontaneous speech were dealing with background noise, varying stress and prosody, and inconsistent transcripts. Spontaneous speech data is inherently random and unscripted, making it difficult to create high-quality training datasets. This randomness necessitated special strategies for processing the spontaneous datasets, such as inserting punctuation to align with internal silences and eliminating extraneous noises.

  • 3-5. Evaluation Results: MOS Scores and SUS

  • The Thunder TTS system was evaluated using Mean Opinion Score (MOS) and Semantically Unpredictable Sentences (SUS) tests. The system achieved a MOS of 3.94 in-domain and 3.3 out-domain, indicating effective natural speech synthesis. In the SUS intelligibility test, the system recorded an 85.00% intelligibility score, affirming the clarity and comprehensibility of the synthesized speech.
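
  • For readers unfamiliar with MOS, the score is the arithmetic mean of listener ratings on a 5-point scale. The sketch below computes it together with an approximate 95% confidence interval. The ratings are made-up illustrative values, not data from the campaign, and the normal-approximation interval (factor 1.96) is an assumption rather than the method used by the organizers.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[int]) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.fmean(ratings)
    sample_sd = statistics.stdev(ratings)                    # n - 1 denominator
    half_width = 1.96 * sample_sd / math.sqrt(len(ratings))  # normal approximation
    return mos, half_width


# Illustrative ratings only (1 = bad, 5 = excellent).
ratings = [4, 4, 3, 5, 4]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

  • Reporting the interval alongside the mean helps judge whether a gap such as the one between in-domain and out-domain scores is larger than listener-to-listener variation.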

4. VLSP 2021 TTS Challenge

  • 4-1. Overview of the TTS challenge

  • The VLSP 2021 TTS Challenge was part of the eighth annual international VLSP workshop and focused on Vietnamese spontaneous speech synthesis. Unlike previous years, which used read-speech datasets, this year's task required participants to build a Text-To-Speech (TTS) system from a spontaneous speech dataset. This approach aimed to produce more natural voices suitable for spoken dialog systems. A 7.5-hour spontaneous speech dataset was collected from the YouTube channel 'Giang ơi' and processed to create utterances and corresponding texts.

  • 4-2. Details on the dataset preparation

  • The dataset preparation involved collecting audio from the 'Giang ơi' YouTube channel, resulting in 22,839 sound files, equivalent to 72 hours of audio. Subsequently, extensive preprocessing was carried out to address quality issues such as background noise, multiple voices, and inconsistencies in stress and intonation. After cleaning, the dataset was reduced to 6,266 files (approximately 11 hours). Finally, 5,341 high-quality utterances (7.5 hours) were selected for the challenge.
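
  • The selection funnel above can be checked with simple arithmetic: the final set keeps roughly 23% of the collected files but only about 10% of the collected audio duration. The sketch below derives these shares and the mean utterance length at each stage from the figures quoted in this section; the stage labels and function name are illustrative.

```python
from typing import List, Tuple

# (stage label, number of files, hours of audio), as quoted in the text above.
STAGES: List[Tuple[str, int, float]] = [
    ("collected", 22839, 72.0),
    ("after cleaning", 6266, 11.0),
    ("selected", 5341, 7.5),
]


def summarize(stages: List[Tuple[str, int, float]]) -> List[Tuple[str, float, float, float]]:
    """For each stage: share of files kept, share of hours kept, mean length in seconds."""
    base_files, base_hours = stages[0][1], stages[0][2]
    rows = []
    for name, files, hours in stages:
        rows.append((name, files / base_files, hours / base_hours, hours * 3600 / files))
    return rows


for name, file_share, hour_share, mean_sec in summarize(STAGES):
    print(
        f"{name:<15} files kept: {file_share:6.1%}  "
        f"hours kept: {hour_share:6.1%}  mean length: {mean_sec:5.1f} s"
    )
```

  • The falling mean length across stages indicates that the removed material was longer than average per file, which is consistent with long, noisy, or multi-speaker recordings being discarded during cleaning, although the source does not state this explicitly.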

  • 4-3. Participation statistics and submission process

  • A total of 43 teams registered for the TTS challenge, with 18 teams obtaining the dataset after signing a user agreement. Ultimately, 10 teams submitted their TTS systems for evaluation. Participants had to validate provided data, build a synthetic voice from the dataset, and submit a TTS API. An evaluation was conducted using the submitted systems, and teams were required to submit technical reports.

  • 4-4. Evaluation metrics and results

  • The evaluation involved two types of perceptual tests: Mean Opinion Score (MOS) for naturalness and Semantically Unpredictable Sentences (SUS) for intelligibility. The best MOS score achieved on dialog utterances was 3.98 out of 5. The top-performing team used FastSpeech2 with HiFi-GAN and a denoiser, achieving an out-domain MOS score of 3.56. In terms of intelligibility, the lowest syllable error rate achieved was 15%. Although many systems achieved prosody and speaking rates similar to those of natural voices, their output still contained distorted segments and background noise.

  • 4-5. Key challenges encountered by participants

  • Participants faced challenges such as inconsistencies in speaking rate, intensity, stress, and prosody across the dataset. Background noises, multiple voices in one audio, and inaccurate transcripts also posed significant difficulties. Effective preprocessing strategies were essential to mitigate these issues. The challenge allowed participants to explore appropriate TTS models and preprocessing techniques to handle the spontaneous dataset.

5. Comparative Analysis of TTS Systems

  • 5-1. Comparison of different TTS systems developed for the challenge

  • The Vietnamese Language and Speech Processing (VLSP) 2021 challenge attracted 43 registered teams, 10 of which made final submissions. Each team was tasked with creating a Text-to-Speech (TTS) system using a provided spontaneous speech dataset. The dataset posed challenges such as background noise and inconsistent speaking rate, prosody, and stress. The Thunder Text-to-Speech System, utilizing FastSpeech2, achieved notable results. Evaluation metrics included Mean Opinion Score (MOS) for naturalness and Semantically Unpredictable Sentences (SUS) for intelligibility. Nearly all teams used FastSpeech2 as the acoustic model and HiFi-GAN as the vocoder, with variations in preprocessing techniques and architectural enhancements. Some teams, such as Team5, used a completely different technology stack based on the VITS model.

  • 5-2. Analysis of error rates and MOS scores

  • The MOS scores and SUS results varied significantly among the submitted systems. Key metrics included both in-domain and out-domain MOS scores: the in-domain score covers text drawn directly from the challenge speech corpus, while the out-domain score evaluates performance on external data. For in-domain MOS scores, the top performance was achieved by Team10 with a score of 3.94. In the out-domain MOS test, Team1 scored the highest with 3.56. In the SUS test, the most intelligible system had a syllable error rate (SER) of 15%, whereas several systems had SERs of at least 30%. These metrics show how differently the teams' approaches coped with the challenges of spontaneous speech synthesis.

  • 5-3. Strengths and weaknesses of various approaches

  • The strengths of the different approaches centered on model selection and preprocessing strategies. For example, Team1's strategy included replacing the external aligner used with FastSpeech2 and extensive audio filtering, which reduced the impact of background noise and inconsistent prosody. On the other hand, Team5's use of VITS, a fully end-to-end model, showed that an approach diverging from the conventional pipeline could also be effective. However, common weaknesses across several submissions included distorted audio segments and inaccuracies in transcript alignment. It was also noted that some models performed significantly better on out-domain MOS tests, while others were better suited to in-domain tasks, depending on their training specifics and preprocessing sophistication.

6. Conclusion

  • 6-1. Summary of Key Findings

  • The development of Vietnamese Text-to-Speech systems for the 2021 VLSP campaign demonstrated significant advancements and highlighted several challenges. The Thunder Text-To-Speech System, which was built using FastSpeech2, proved to be effective in synthesizing natural speech from spontaneous data. The participation of 43 teams and the competitive nature of the VLSP 2021 TTS Challenge fostered an innovative environment, leading to noteworthy results.

  • 6-2. Importance and Implications of the Results

  • The findings underscore the progress made in the field of speech synthesis, particularly regarding the generation of natural-sounding speech from spontaneous speech datasets. The success in achieving high MOS scores indicates the model's effectiveness in natural speech synthesis. These advancements have significant implications for enhancing human-machine communication, making interactions more natural and intuitive.

  • 6-3. Limitations of the Research

  • Despite the advancements, the research faced notable limitations. These included issues with inconsistent data, such as variations in speaking rate, intensity, stress, and prosody, as well as background noises and inaccurate transcripts. Such limitations hindered the overall performance and robustness of the TTS systems.

7. Conclusion

  • The development of the Vietnamese Text-to-Speech systems for the 2021 VLSP campaign, epitomized by the Thunder Text-To-Speech System leveraging FastSpeech2, illustrates significant progress in synthesizing natural, spontaneous speech. The involvement of 43 teams in the VLSP 2021 TTS Challenge, culminating in competitive MOS scores and various system innovations, underscores the collaborative efforts in this field. Despite notable advancements, the research encountered limitations like inconsistent data and background noise, which impacted the overall performance. Future research should pivot towards enhancing data preprocessing techniques and developing more robust models to facilitate superior TTS quality. The promising outcomes reflect the potential for practical applications in improving human-machine communication, making interactions more intuitive and natural. The next steps involve addressing these limitations and focusing on methods to produce even more accurate and high-quality speech synthesis.

8. Glossary

  • 8-1. Thunder Text-To-Speech System [Technology]

  • An advanced TTS system developed during the 2021 VLSP campaign using the FastSpeech2 model for acoustic modeling and the HiFi-GAN vocoder. It focuses on synthesizing speech from spontaneous speech data and demonstrated promising results in MOS and SUS tests.

  • 8-2. VLSP 2021 TTS Challenge [Event]

  • A Text-to-Speech challenge organized as part of the 2021 VLSP evaluation campaign, which attracted 43 teams to develop TTS systems within a 24-day period, using a dataset derived from spontaneous speech by a famous YouTuber.

  • 8-3. FastSpeech2 [Technology]

  • A model used for acoustic modeling in the development of TTS systems, known for its robustness in handling spontaneous speech and delivering high-quality synthetic voices.

  • 8-4. HiFi-GAN [Technology]

  • A vocoder used in conjunction with the FastSpeech2 model in TTS system development, contributing to high-quality audio generation.