1. Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion (arXiv)
Abstract : Speech anonymisation prevents misuse of spoken data by removing any personal identifier while preserving at least linguistic content. However, emotion preservation is crucial for natural human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but fails to preserve emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on the emotion embeddings and acoustic features correlated to emotion. Additionally, we use an emotion classifier to provide direct emotion supervision. Objective and subjective evaluations show that the proposed approach significantly improves emotion preservation over the vanilla StarGANv2-VC. This considerable improvement is seen over diverse datasets, emotions, target speakers, and inter-group conversions without compromising intelligibility and anonymisation.
2. Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature (arXiv)
Abstract : We propose a highly controllable voice manipulation system that can perform any-to-any voice conversion (VC) and prosody modulation simultaneously. State-of-the-art VC systems can transfer sentence-level characteristics such as speaker, emotion, and speaking style. However, manipulating frame-level prosody, such as pitch, energy and speaking rate, remains challenging. Our proposed model utilizes a frame-level prosody feature to effectively transfer such properties. Specifically, pitch and energy trajectories are integrated in a prosody conditioning module and then fed alongside speaker and content embeddings to a diffusion-based decoder that generates a converted speech mel-spectrogram. To adjust the speaking rate, our system includes a post-processing step based on a self-supervised model, which allows improved controllability. The proposed model showed comparable speech quality and improved intelligibility compared to a SOTA approach. It can cover a varying range of fundamental frequency (F0), energy and speed modulation while maintaining converted speech quality.