Multi-speaker child speech synthesis in low-resource Hungarian language

  • Metadata
Content: http://hdl.handle.net/10890/54983
Archive: Műegyetem Digitális Archívum
Collection: 1. Scientific papers, publications
Conference collections
2nd Workshop on Intelligent Infocommunication Networks, Systems and Services, 2024
Title:
Multi-speaker child speech synthesis in low-resource Hungarian language
Creator:
Alwaisi, Shaimaa
Al-Radhi, Mohammed Salah
Németh, Géza
Date:
2024-02-26T15:42:07Z
2024
Description:
Current deep learning-based text-to-speech (TTS) models can synthesize speech that sounds remarkably like human voices. Despite recent developments in TTS systems for adults, there are numerous considerations when it comes to TTS systems for children. These involve various obstacles, such as the lack of adequate child speech resources and the unique acoustic and linguistic characteristics specific to children. The main objective of this work is to explore advanced neural vocoders, namely BIGVGAN and AutoVocoder, for synthesizing child speech in Hungarian. In our study, we focused on the Hungarian language to evaluate the efficacy of neural vocoders in capturing the specific linguistic nuances and phonetic characteristics relevant to Hungarian-speaking children. In addition, we examine the fine-tuning and adaptation of the vocoders to accurately capture the unique attributes of child speech while minimizing dependency on extensive child speech datasets. The experimental outcomes showcased the high performance of BIGVGAN and AutoVocoder in effectively synthesizing clear and natural-sounding multi-speaker child speech in conversational settings. Despite the high quality of the speech synthesized by BIGVGAN, the model has a more complex architecture with a multi-scale discriminator and requires more resources and longer training time, owing to its larger batch size, than AutoVocoder. AutoVocoder notably improved the quality and clarity of the generated child speech. Initial findings suggest that the BIGVGAN model successfully produced high-quality synthesized child speech.
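The abstract refers to fine-tuning pretrained neural vocoders on limited child speech data. The sketch below is an illustrative assumption, not the authors' actual training code: it shows how a pretrained mel-to-waveform generator might be adapted in PyTorch with a simple reconstruction loss and a low learning rate. The names `fine_tune`, `generator`, and the dataset object are hypothetical placeholders; GAN vocoders such as BIGVGAN additionally use multi-scale/multi-period discriminator losses, omitted here for brevity.

```python
# Minimal fine-tuning sketch (hypothetical, not the paper's code):
# adapt a pretrained mel-to-waveform generator to a small child-speech dataset.
import torch
from torch.utils.data import DataLoader

def fine_tune(generator, dataset, epochs=10, lr=1e-4, device="cuda"):
    """Further train a pretrained vocoder generator on child speech pairs."""
    generator = generator.to(device).train()
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    opt = torch.optim.AdamW(generator.parameters(), lr=lr)
    l1 = torch.nn.L1Loss()
    for _ in range(epochs):
        for mel, wav in loader:                  # mel: (B, n_mels, T), wav: (B, samples)
            mel, wav = mel.to(device), wav.to(device)
            wav_hat = generator(mel).squeeze(1)  # predicted waveform
            # Waveform reconstruction loss only; adversarial and feature-matching
            # terms used by GAN vocoders are left out of this sketch.
            loss = l1(wav_hat, wav)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return generator
```

A low learning rate and small batch size are one plausible way to adapt to a low-resource child-speech corpus without catastrophically forgetting the pretrained model's general acoustic knowledge.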
Language:
English
Type:
Conference paper
Format:
application/pdf
Identifier: