RefineGAN: Universally Generating High-Fidelity Waveform Better than Ground Truth

Shengyuan Xu, Wenxiao Zhao, Jing Guo

[paper]

Note: Comparison samples generated by other neural vocoders are being prepared and will be released later.

timedomAIn is a music technology company founded in 2019. We are dedicated to exploring the use of AI to empower non-professionals to create original music and express themselves.

ACE Virtual Singer is our main product, in which users write songs that AI-powered virtual singers perform in highly realistic voices.

Singing Voice

Chinese (Seen Speaker Seen Language)

speaker predicted ground-truth
Female 1
Female 2
Female 3
Male 1
Male 2

Chinese (Unseen Speaker Seen Language)

speaker predicted ground-truth
Female 1
Female 1
Female 2
Female 2
Male 1

Japanese (Unseen Speaker Unseen Language)

Source:
speaker predicted ground-truth file name
Female 1 (Tohoku Kiritan) 24.wav
Female 1 (Tohoku Kiritan) 36.wav

Speech voice (completely unseen)

Chinese

Source:

https://www.data-baker.com/data/index/source/

speaker predicted ground-truth file name
Female 1 001279.wav
Female 1 005116.wav
Female 1 006757.wav
Female 1 007658.wav

English

Source:

https://openslr.org/109/
Bakhturina, E., Lavrukhin, V., Ginsburg, B., & Zhang, Y. (2021). Hi-Fi Multi-Speaker English TTS Dataset. arXiv preprint arXiv:2104.01497.

speaker predicted ground-truth file name
Male 1 antoinetteromances4_01_dumas_0061.flac
Female 1 dayoffate_33_roe_0129.flac
Male 2 roots_24_morris_0068.flac
Male 3 shadesofwilderness_13_altsheler_0097.flac

French

Source:

https://gitlab.com/nicolasobin/att-hack
Moine, C. L., & Obin, N. (2020). Att-HACK: An Expressive Speech Database with Social Attitudes. arXiv preprint arXiv:2004.04410.

speaker predicted ground-truth file name
Female 1 F03_a1_s079_v05.wav
Female 2 F15_a3_s100_v02.wav
Male 1 M07_a4_s060_v05.wav
Male 2 M17_a1_s079_v01.wav

Japanese

Source:

https://sites.google.com/site/shinnosuketakamichi/publication/jsut
Sonobe, R., Takamichi, S., & Saruwatari, H. (2017). JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis. arXiv preprint arXiv:1711.00354.

speaker predicted ground-truth file name
Female ONOMATOPEE300_107.wav
Female REPEAT500_set4_060.wav
Female TRAVEL1000_0670.wav
Female UT-PARAPHRASE-sent124-phrase2.wav

Under-Resourced Languages

Source:

Sodimana, K., Pipatsrisawat, K., Ha, L., Jansche, M., Kjartansson, O., De Silva, P., & Sarin, S. (2018). A step-by-step process for building TTS voices using open source data and framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese.

speaker language predicted ground-truth file name
Female 1 Javanese jvf_08305_00814037052.wav
Male 1 Javanese jvm_03424_01297287738.wav
Female 1 Khmer khm_3154_0157853181.wav
Female 1 Khmer khm_6753_3404534535.wav
Female 1 Nepali nep_0546_7054581764.wav
Female 1 Nepali nep_3614_7960099494.wav
Female 1 Sundanese suf_02395_01693235787.wav
Male 1 Sundanese sum_05186_00415408849.wav