Google quietly announces the Translatotron direct speech-to-speech translation model

Google Translate helps you talk to people who don’t speak your language through its built-in conversation mode. You can speak a message in English and have it spoken back in Japanese. To do this, it first transcribes your speech into text, performs text-to-text translation, and then reads the translated text aloud using good old TTS (text-to-speech synthesis). Google is now proposing a new method: Translatotron, a direct speech-to-speech translation model.
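The three-stage cascade described above can be pictured as three functions chained together. The sketch below is only an illustration of the data flow: the function names and the toy dictionary are hypothetical stand-ins, not Google's code or any real API.

```python
from typing import List

# Hypothetical stand-ins for the three cascade stages. Real systems use
# trained models; these stubs only show how data moves between stages.

def recognize_speech(audio: List[float]) -> str:
    """ASR stage: audio samples -> source-language text (stubbed)."""
    return "hello"

def translate_text(text: str) -> str:
    """MT stage: source text -> target text (toy dictionary lookup)."""
    return {"hello": "konnichiwa"}.get(text, "<unk>")

def synthesize_speech(text: str) -> List[float]:
    """TTS stage: target text -> audio (stubbed as one silent sample per character)."""
    return [0.0] * len(text)

def cascade_translate(audio: List[float]) -> List[float]:
    """Run the stages in order; each stage only sees the previous stage's output."""
    text = recognize_speech(audio)
    translated = translate_text(text)
    return synthesize_speech(translated)

output_audio = cascade_translate([0.1, 0.2, 0.3])
```

Because each stage only sees the output of the one before it, a recognition mistake in the first stage is passed straight into translation and synthesis.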
Still at an experimental stage, the Translatotron model cuts out the middleman. In other words, it translates the input speech directly into output speech using a single attention-based sequence-to-sequence model, with no intermediate text. According to Google, this direct approach has several advantages: faster inference, no compounding of errors between recognition and translation, an easier way to retain the original speaker’s voice after translation, and better handling of words that do not need to be translated, such as names.
Google’s illustration of how Translatotron works
Work on end-to-end speech translation began in 2016, Google wrote in a blog post. A year later, the company showed that such direct models can outperform the conventional cascade. According to Google, Translatotron takes spectrograms of the source speech as input and generates corresponding spectrograms in the target language. “During training, the sequence-to-sequence model uses a multitask objective to predict source and target transcripts at the same time as generating target spectrograms. However, no transcripts or other intermediate text representations are used during inference,” Google wrote.
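For readers unfamiliar with spectrograms, here is a minimal sketch of how one can be computed from raw audio with NumPy. It produces a plain magnitude spectrogram. The frame length, hop size and 16 kHz sample rate are assumptions chosen for illustration, and this is not Google's feature pipeline.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=400, hop=160):
    """Return a (num_frames, frame_len // 2 + 1) magnitude spectrogram.

    The signal is cut into overlapping frames, each frame is tapered with
    a Hann window, and the FFT magnitude of each frame becomes one row.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 16 kHz (illustrative values).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = magnitude_spectrogram(tone)
# Each frequency bin is sample_rate / frame_len = 40 Hz wide.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 400
```

Each row describes how much energy the sound has at each frequency during one short slice of time. A model of this kind consumes a sequence of such rows and emits a new sequence, which still has to be converted back into an audible waveform.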
Although Google now has a new translation model, it is not yet ready to be built into Google Translate and related tools. The new system still trails the conventional cascaded approach on BLEU scores, a standard measure of translation quality, meaning its translations are not yet accurate enough. On the plus side, the new model can retain the user’s own voice after translation instead of replacing it with a stock TTS voice. “By incorporating a speaker encoder network, Translatotron is able to retain the vocal features of the original speaker in the translated speech, making the translated speech sound more natural and less jarring,” Google added.
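BLEU, mentioned above, scores a translation by how many of its word sequences also appear in a reference translation. The sketch below is a simplified, single-sentence version for illustration only. Standard BLEU is computed over a whole test set with sequences of up to four words, and the function name and the two-word default here are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU against one reference (illustrative)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)

perfect = simple_bleu("the cat sat", "the cat sat")
partial = simple_bleu("the cat sat on mat", "the cat sat on the mat")
```

An exact match scores 1.0, while a translation that drops or changes words scores lower. That is the sense in which a system can "lag behind" on BLEU.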