As mentioned by Gurdeep Pall in this morning’s Big Blog post, today we have announced the first phase of the Skype Translator preview program, which will kick off with English and Spanish as the first two languages. We couldn’t be more excited; Skype Translator is the result of decades of research in speech recognition, automated translation, and general machine learning technologies, combined with an intense focus on the user experience. This next phase in the Skype Translator journey is an exciting milestone, and we’re looking forward to sharing it with those who use Windows 8.1 and have signed up for the Spanish language via the Skype Translator sign-up page.
Microsoft was one of the first to delve into the challenge of speech translation. Recent improvements in speech recognition, made possible by the introduction of deep neural networks combined with Microsoft’s proven statistical machine translation technology, allow for better translation outcomes, making meaningful one-on-one conversation possible. Skype is about helping people communicate – mind to mind, heart to heart. Skype Translator is the latest evolution of this.
The Skype Translator preview program is currently available for Spanish-speaking and English-speaking Skype customers who use Windows 8.1 or Windows 10 Technical Preview on their desktop or tablet. In addition, the app also translates text IM conversations in more than 40 languages. The preview stage is critical to the development and advancement of Skype Translator: it allows customers to use the product and provide valuable feedback, which in turn helps us improve the product and, over time, helps the technology get smarter and learn more languages.
How Skype Translator Works
Machine learning is the capability of software to learn from training examples, and Skype Translator is built on a robust machine learning platform. By learning from the training data gathered during this preview stage, with all of its nuances, the software can better recognize and translate the diverse topics, accents, and language variations of actual Skype Translator users.
Skype Translator’s machine learning protocols train and optimize speech recognition (SR) and automatic machine translation (MT) tasks, acting as the glue that holds these elements together. This “glue” transforms the recognized text to facilitate translation. This process includes the removal of disfluencies (e.g. ‘ahs’ and ‘umms’ as well as rephrasings), division of the text into sentences, and the addition of punctuation and capitalization.
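As a rough illustration of what this normalization step does (a toy sketch, not Skype Translator’s actual implementation — the real system learns disfluencies statistically rather than from a fixed list), the “glue” can be imagined as a function that strips fillers, splits the recognized text into sentences, and restores capitalization before translation:

```python
import re

# Hypothetical filler list for illustration only; a production system
# learns which tokens are disfluencies from data.
FILLERS = {"um", "umm", "uh", "ah", "er"}

def normalize_transcript(raw: str) -> list[str]:
    """Toy text normalization: remove filler words, split into
    sentences, and capitalize each sentence for the MT engine."""
    words = [w for w in raw.lower().split() if w.strip(",.") not in FILLERS]
    text = " ".join(words)
    # Naive sentence split on terminal punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    return [s[0].upper() + s[1:] for s in sentences]

print(normalize_transcript("um so i was thinking, ah, maybe we could meet tomorrow."))
# → ['So i was thinking, maybe we could meet tomorrow.']
```

Real systems also handle rephrasings (“I want— I’d like a coffee”), which require more than word filtering; that is one reason this step is trained rather than hand-written.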
The training data for speech recognition and machine translation comes from a variety of sources, including translated web pages, videos with captions, and previously translated and transcribed one-on-one conversations. Skype Translator records conversations in order to analyze the transcripts and train the system to better learn each language. Many people have also donated data from previous conversations, which we analyze and use to create training material for the statistical models that teach the speech recognition and machine translation engines how to map the incoming audio stream to text, and then the text to another language. Skype Translator participants are all clearly notified as the call begins that their conversation will be recorded and used to improve the quality of Microsoft’s translation and voice recognition services.
After the data is prepared and entered into the machine learning system, the machine learning software builds a statistical model of the words in these conversations, and their context. When you say something, the software can find something similar in its statistical model, and apply the previously learned transformation from audio to text and from text into the foreign language.
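The idea of a statistical model of words and their context can be shown with a deliberately tiny example. The sketch below builds a bigram model (which word tends to follow which) from a handful of invented conversational sentences; the real system uses far richer models trained over enormous corpora, but the principle — count patterns in training data, then prefer what was seen most often — is the same:

```python
from collections import defaultdict

def train_bigrams(corpus: list[str]) -> dict:
    """Toy statistical model: count how often each word follows another."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts

def most_likely_next(counts: dict, word: str) -> str:
    """Pick the continuation seen most often in training."""
    following = counts[word.lower()]
    return max(following, key=following.get) if following else ""

# Invented, illustrative corpus of conversational text.
corpus = ["how are you", "how are things", "how are you today"]
model = train_bigrams(corpus)
print(most_likely_next(model, "are"))  # → you  ("you" seen twice, "things" once)
```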
While speech recognition has been an important research topic for decades, widespread adoption of the technology had been stymied by high error rates and sensitivity to speaker variation, noisy conditions, and the like. The advent of Deep Neural Networks (DNNs) for speech recognition, pioneered by Microsoft Research, dramatically reduced error rates and improved robustness, finally enabling the use of this technology in broad contexts such as Skype Translator. At the same time, the dream of global human-to-human communication was a major motivating factor and driving force for the MSR researchers working on this technology.
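In a DNN-based recognizer, the neural network typically maps a frame of acoustic features to a probability over sound classes. The sketch below shows only that forward pass, with invented layer sizes and random weights (a real acoustic model is trained on thousands of hours of speech and feeds a full decoder):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dnn_forward(frame, weights, biases):
    """Toy forward pass: an acoustic feature frame flows through
    hidden layers; the output is a probability per sound class."""
    h = frame
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return softmax(weights[-1] @ h + biases[-1])

rng = np.random.default_rng(0)
# Illustrative sizes: 40 acoustic features, two hidden layers, 10 classes.
sizes = [40, 64, 64, 10]
weights = [rng.normal(0, 0.1, (o, i)) for i, o in zip(sizes, sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
probs = dnn_forward(rng.normal(size=40), weights, biases)
print(probs.shape)  # → (10,); the probabilities sum to 1
```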
The MT portion of Skype Translator translates text from one language to another. We use the same technology that powers Bing Translator on the web, which pioneered the combined use of syntax and statistical models, but trained specifically for a conversational type of language. This is particularly challenging, as the typical training data used to build text translation systems today is optimized for clean, well-formed written language. Our system combines the broad language knowledge of Bing Translator, and an extensive layer of the words and phrases that are used in spoken conversations.
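One way to picture this layering (purely an illustration — both phrase tables below are invented, not Bing Translator data) is a conversational layer consulted first, with the broad general model as the fallback:

```python
# Invented example entries, for illustration only.
GENERAL = {"what is up": "qué está arriba"}   # literal, written-style rendering
CONVERSATIONAL = {"what is up": "qué tal"}    # casual, spoken-style rendering

def translate_phrase(phrase: str) -> str:
    """Prefer the conversational layer; fall back to the general
    model; pass the phrase through unchanged if neither knows it."""
    return CONVERSATIONAL.get(phrase, GENERAL.get(phrase, phrase))

print(translate_phrase("what is up"))  # → qué tal
```

The design point is that the general system supplies coverage while the conversational layer supplies register, so casual speech is not translated word-for-word.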
Additionally, we’ve created a customized bot that orchestrates the entire experience. The bot is responsible for creating the call, and sending audio streams to the speech engines in exchange for translation and transcription. The translator bot acts like a third participant in the call. It translates what you just said when you’ve finished talking, and it translates what the person you called said, when they finish talking.
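The bot’s turn-by-turn flow can be sketched as a simple pipeline. Each function below is a stand-in for a real service (the actual bot streams audio to cloud engines), and the names and tiny phrase table are hypothetical:

```python
def recognize(audio: str) -> str:
    """Stand-in for the speech recognition engine (audio -> raw text).
    Here the 'audio' is just pretend-transcribed as itself."""
    return audio

def normalize(text: str) -> str:
    """Stand-in for the 'glue' step: strip fillers before translation."""
    return " ".join(w for w in text.split() if w not in {"um", "ah"})

def translate(text: str, target: str) -> str:
    """Stand-in for machine translation, using a tiny invented table."""
    phrases = {("hello friend", "es"): "hola amigo"}
    return phrases.get((text, target), text)

def on_turn_finished(audio: str, target_lang: str) -> str:
    """What the bot does once a speaker finishes a turn:
    recognize, clean up, then translate for the other party."""
    raw = recognize(audio)
    clean = normalize(raw)
    return translate(clean, target_lang)

print(on_turn_finished("um hello ah friend", "es"))  # → hola amigo
```

Treating the translator as a third participant keeps the architecture modular: each engine can improve independently while the bot only orchestrates the hand-offs.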
The bot’s creation required the Microsoft Research and Skype teams to combine their expertise and engineering capabilities, resulting in a complex architecture but hopefully a simple, straightforward experience for people.
Language is a wild beast. It changes all the time, it comes in many flavors and varieties, and there is a surprisingly huge difference between how people write language and how they speak it. To provide the best experience, we had to overcome several language challenges.
Humans are, well, human: they make mistakes, pause to think, and change their minds. These thought processes show up in spoken language as disfluencies. As mentioned previously, when speaking, people pause, repeat themselves, and add fillers like ‘um’ and ‘ah’. Ideally, none of these nuances should appear in the transcript or in the translation. Our machine learning models account for these disfluencies. In the preview you will see that some of these fillers are removed and some are not; we hope to improve our capabilities through user feedback.
Humans are unique and our spoken language reflects our regional, national, and cultural identities through colloquialisms or slang. Microsoft Translator has built up strength in colloquial translation, from its years of working with social media sites, like Facebook, for instance. This existing strength helped us with Skype Translator by improving our ability to translate casual phrases and terminology. Skype Translator preview will help the system observe and learn additional levels of casual conversation bringing the system closer to compatibility with truly conversational speech.
Additionally, there are specific challenges inherent in the user experience of language translation. The automated translator in Skype Translator appears almost as a third speaker. We have seen that customers who are used to speaking through a human interpreter are quickly at ease with the situation. Others need some time to get used to this new mode of interaction.
While this moment is a major milestone for our team, we see the preview as simply another step in creating the best translation experience we possibly can. We will rely on the feedback and data that our preview users share with us to help improve our technology and optimize the unique experience Skype Translator creates. It’s still early days for this technology, and although we have a solid foundation, we know that in some respects our work has just begun.
Until now, this has been our journey. Today, we’re excited to have you join us as we seek to make it easier for people to connect, communicate, and collaborate using Skype Translator. To get started, please sign up via the Skype Translator Preview page.