DeepGen Network based Voice Conversion
Sheena Christabel Pravin, M Palanivelan, S Saravanan
Published: September 30, 2021.
Abstract  
A DeepGen network is proposed for voice conversion, the process of modifying a speech utterance from a source speaker so that it sounds like a target speaker while preserving the linguistic content. Automatic dubbing is a speech-processing application that involves modelling the large variability of pitch. The proposed DeepGen network automates voice conversion by taking blocks of consecutive frame-wise linguistic and fundamental-frequency features as input. This block-wise approach models temporal dependencies among the features within each input block. Voice conversion, also called voice cloning, has traditionally required a significant amount of recorded speech; the proposed DeepGen network instead aims to convert one voice to another using a relatively small set of speech samples, obtained by bootstrapping samples from a larger speech dataset. The proposed generative model is a variant of the convolutional generative model, with encoder and decoder blocks that align the hidden structures of the feature spaces of the source and target voices through a two-stage training process. The proposed DeepGen network thus provides efficient voice conversion in real-world scenarios. The novelty of the work lies in introducing a deep learning generative network for voice cloning in the Tamil language.
Keywords: DeepGen network; Voice Conversion; Generative Network; Convolutional generative model; Feature spaces
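The block-wise input and the bootstrapped training subset described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `make_blocks` and `bootstrap_sample`, the block size, the hop, and the two-value toy frames (one stand-in linguistic feature plus one fundamental-frequency value) are all assumptions made for illustration only.

```python
import random
from typing import List, Sequence

Frame = List[float]


def make_blocks(frames: Sequence[Frame], block_size: int, hop: int = 1) -> List[List[float]]:
    """Concatenate `block_size` consecutive frames into one flat vector per block.

    Hypothetical helper: each block gives a model access to the temporal
    context spanning its frames.
    """
    if block_size <= 0 or hop <= 0:
        raise ValueError("block_size and hop must be positive")
    blocks: List[List[float]] = []
    for start in range(0, len(frames) - block_size + 1, hop):
        block: List[float] = []
        for frame in frames[start:start + block_size]:
            block.extend(frame)
        blocks.append(block)
    return blocks


def bootstrap_sample(items: Sequence, n: int, seed: int = 0) -> list:
    """Draw `n` items with replacement (a plain statistical bootstrap)."""
    rng = random.Random(seed)
    return [rng.choice(items) for _ in range(n)]


# Toy frames: [stand-in linguistic feature, fundamental frequency in Hz].
frames = [[float(t), 100.0 + t] for t in range(5)]

# Three overlapping blocks of three frames each (hop of one frame).
blocks = make_blocks(frames, block_size=3)

# A bootstrapped training subset drawn from the available blocks.
subset = bootstrap_sample(blocks, n=10, seed=1)
```

In this sketch, five frames with a block size of three yield three overlapping blocks of six values each. Sampling with replacement is only one reading of "bootstrapping"; the abstract does not specify the exact procedure.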