Google has unveiled Translatotron, a system that translates a person's speech into another language while preserving the speaker's own voice. Unlike existing models, it is built on an end-to-end approach, and it is expected to open up a new future for speech translation.
Until now, speech translation has typically been built as a cascade: automatic speech recognition transcribes the speaker's words into text, machine translation converts that text into the target language, and speech synthesis then produces the spoken output. In other words, the pipeline goes from speech to text and back to speech. Translatotron, by contrast, uses an end-to-end method that translates speech directly into speech from start to finish. Because the process is simpler, it can be expected to translate faster than conventional methods.
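To make the structural difference concrete, here is a toy sketch in Python. Everything in it is hypothetical: "audio" is faked as a list of tokens such as `"sound:hola"`, and `asr`, `mt`, `tts`, `cascade`, `direct` and the tiny `LEXICON` are illustrative stand-ins, not Google's code. The only point is the shape of the two pipelines: three chained stages with text in between versus a single mapping from speech to speech.

```python
from typing import List

# Hypothetical toy representation: "audio" is a list of tokens like "sound:hola".
Audio = List[str]
# Assumed toy Spanish-to-English lexicon standing in for a real translation model.
LEXICON = {"hola": "hello", "mundo": "world"}


def _strip(sound: str) -> str:
    """Drop the fake "sound:" prefix to recover the spoken word."""
    return sound.split(":", 1)[1]


def asr(audio: Audio) -> str:
    """Cascade stage 1: speech -> source-language text."""
    return " ".join(_strip(s) for s in audio)


def mt(text: str) -> str:
    """Cascade stage 2: source-language text -> target-language text."""
    return " ".join(LEXICON.get(w, w) for w in text.split())


def tts(text: str) -> Audio:
    """Cascade stage 3: target-language text -> speech."""
    return ["sound:" + w for w in text.split()]


def cascade(audio: Audio) -> Audio:
    # Three separate systems chained together; text is the interface between them,
    # so a recognition error in asr() is passed on to mt() and tts().
    return tts(mt(asr(audio)))


def direct(audio: Audio) -> Audio:
    # One mapping from source speech to target speech with no intermediate text.
    # A real end-to-end model learns this mapping; a lookup fakes it here.
    return ["sound:" + LEXICON.get(_strip(s), _strip(s)) for s in audio]


print(cascade(["sound:hola", "sound:mundo"]))  # ['sound:hello', 'sound:world']
```

Both functions produce the same output on this toy input; the difference lies in how many separately built components sit between the input and the output.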
Translatotron is the first end-to-end model to translate speech directly from one language into another, and the translated speech can retain the voice of the original speaker. On the BLEU metric, Translatotron's translation quality is slightly lower than that of the existing cascade system, but its accuracy still exceeds the translation standard set for this model.
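BLEU scores a candidate translation by how many of its word n-grams also appear in a reference translation, with a penalty for candidates that are too short. The following is a minimal, unsmoothed single-sentence sketch for illustration only; published results are normally computed at corpus level with standard tooling, and the function names here are our own.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Unsmoothed BLEU of one candidate against one reference (sketch)."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clipped ("modified") precision: a candidate n-gram is credited at most
        # as many times as it occurs in the reference.
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any zero precision makes BLEU zero
        log_precision_sum += math.log(overlap / total)
    # Brevity penalty: candidates shorter than the reference are penalized.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat is on the mat".split()
print(sentence_bleu("the cat is on the mat".split(), reference))        # 1.0
print(sentence_bleu("the cat is on the mat today".split(), reference))  # (3/7) ** 0.25
```

The second candidate has precisions 6/7, 5/6, 4/5 and 3/4 and is longer than the reference, so its score is the fourth root of their product, about 0.81.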
(Audio samples: reference translation (English) and baseline cascade translation.)
End-to-end models for speech translation have been researched since they were first published in 2016, and in 2017 they were shown to outperform the earlier cascade approach. Translatotron is based on a sequence-to-sequence network that takes source spectrograms as input and generates spectrograms of the translated speech in the target language. It also uses a vocoder to convert the output spectrogram into a time-domain waveform, and a speaker encoder can additionally be used to preserve the speaker's voice when the translated speech is synthesized.
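The spectrogram-in, spectrogram-out design means something must turn a magnitude spectrogram back into audio. The NumPy sketch below computes a short-time Fourier transform (STFT) and then applies the classic Griffin-Lim algorithm, which iteratively estimates the missing phase. Griffin-Lim is shown only as one well-known, non-neural vocoder; learned neural vocoders are another option, and this article does not say which a given Translatotron setup uses. The window and hop sizes (`N_FFT`, `HOP`) are assumed values.

```python
import numpy as np

N_FFT = 256  # window length in samples (assumed value)
HOP = 64     # hop size in samples (assumed value)


def stft(x: np.ndarray) -> np.ndarray:
    """Complex STFT with a Hann window; shape (frames, N_FFT // 2 + 1)."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec: np.ndarray, length: int) -> np.ndarray:
    """Least-squares inverse STFT via windowed overlap-add."""
    window = np.hanning(N_FFT)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * window
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * HOP
        out[start:start + N_FFT] += frame
        norm[start:start + N_FFT] += window ** 2
    return out / np.maximum(norm, 1e-8)  # avoid dividing by zero at the edges


def griffin_lim(magnitude: np.ndarray, length: int, n_iter: int = 50, seed: int = 0) -> np.ndarray:
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random initial phase
    for _ in range(n_iter):
        signal = istft(magnitude * phase, length)
        phase = np.exp(1j * np.angle(stft(signal)))  # keep the phase, reset the magnitude
    return istft(magnitude * phase, length)


# Usage: a 440 Hz tone sampled at 8 kHz, reduced to a magnitude spectrogram.
t = np.arange(4096) / 8000.0
x = np.sin(2 * np.pi * 440.0 * t)
magnitude = np.abs(stft(x))
waveform = griffin_lim(magnitude, len(x))
```

The reconstructed waveform generally differs from the original sample by sample, because phase is lost in the spectrogram, but its magnitude spectrogram moves closer to the target as the iterations proceed.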