Facebook has developed a new machine translation system, M2M-100, that can automatically translate almost any content posted on the platform into the user's language. M2M-100 translates directly between 100 languages without using English as an intermediate language.
Facebook already provides 20 billion translations per day in News Feed. However, the existing translation system usually uses English as an intermediate language. For example, translating from Chinese to French means first translating the Chinese into English and then the English into French.
This method is used because the translation datasets that pair English with other languages are vast. However, routing through English degrades the overall accuracy of the translation. Facebook AI pointed out that many regions of the world speak languages other than English, so serving people who do not speak English is an important task for a machine translation system. Billions of posts are made on the Facebook platform every day in 160 languages, and more than two-thirds of them are in languages other than English.
Facebook therefore developed M2M-100, a new machine translation system that can translate directly between two languages without using English as an intermediate language. Facebook claims that M2M-100 is the first multilingual machine translation model capable of translating directly between any pair of 100 languages in either direction.
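The difference between routing through English and translating directly can be illustrated with a small sketch. It is not Facebook's implementation: the `translate` and `direction_counts` functions and the toy lookup tables are hypothetical stand-ins, and each "model" is just a function from text to text.

```python
from typing import Callable, Dict, Tuple

# A "model" here is just a function from source text to target text.
Translator = Callable[[str], str]


def translate(text: str, src: str, tgt: str,
              models: Dict[Tuple[str, str], Translator],
              pivot: str = "en") -> str:
    """Use a direct src->tgt model if one exists, else go through `pivot`."""
    direct = models.get((src, tgt))
    if direct is not None:
        return direct(text)
    # English-centric fallback: two hops, so any error made in the first
    # hop is passed on to the second.
    intermediate = models[(src, pivot)](text)
    return models[(pivot, tgt)](intermediate)


def direction_counts(num_languages: int) -> Tuple[int, int]:
    """Return (all ordered language pairs, ordered pairs involving the pivot)."""
    all_pairs = num_languages * (num_languages - 1)
    pivot_pairs = 2 * (num_languages - 1)
    return all_pairs, pivot_pairs


# Toy lookup tables standing in for real translation models (assumption).
ZH_EN = {"你好": "hello"}
EN_FR = {"hello": "bonjour"}
toy_models: Dict[Tuple[str, str], Translator] = {
    ("zh", "en"): lambda s: ZH_EN[s],
    ("en", "fr"): lambda s: EN_FR[s],
}

print(translate("你好", "zh", "fr", toy_models))  # routed zh -> en -> fr
print(direction_counts(100))  # (9900, 198)
```

With 100 languages there are 9,900 translation directions, of which only 198 have English on one side. That gap is what makes a direct many-to-many model attractive.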
While developing M2M-100, Facebook built a vast dataset of 7.5 billion sentences in 100 languages. The text was reportedly collected from Common Crawl, which crawls web pages, and the language of each text was then identified using a text classification system called FastText.
Translation data is often created by human translators, but it is much more difficult to find a translator who speaks French and Tamil than one who speaks English and Tamil. To obtain data for direct translation between languages other than English, the research team used LASER (Language-Agnostic SEntence Representations), a tool that matches sentences in different languages according to their meaning.
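The idea of matching sentences by meaning can be sketched as follows. This is a toy illustration, not the LASER API. The three-dimensional vectors are made up, the `mine_pairs` helper is hypothetical, and scoring by plain cosine similarity with a fixed threshold is a simplifying assumption, not necessarily the method the research team used.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_pairs(src: Dict[str, Sequence[float]],
               tgt: Dict[str, Sequence[float]],
               threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """For each source sentence, keep its best target match if similar enough."""
    pairs = []
    for s_text, s_vec in src.items():
        best_text, best_score = None, -1.0
        for t_text, t_vec in tgt.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_text is not None and best_score >= threshold:
            pairs.append((s_text, best_text, best_score))
    return pairs


# Made-up "embeddings" (assumption); real sentence embeddings are far larger.
french = {
    "Le chat dort.": [0.9, 0.1, 0.0],
    "Il pleut.": [0.0, 0.2, 0.9],
}
tamil = {
    "பூனை தூங்குகிறது.": [0.88, 0.12, 0.02],
    "நான் படிக்கிறேன்.": [0.1, 0.9, 0.1],
}

for fr, ta, score in mine_pairs(french, tamil):
    print(f"{score:.3f}  {fr}  <->  {ta}")
```

Only the first French sentence finds a Tamil sentence above the threshold. The second is dropped, so the mined set stays small but clean.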
Facebook also introduced a strategy of sorting languages into 14 groups based on linguistic classification, geography and cultural similarity. People who speak languages in the same group tend to communicate with each other more often, so higher-quality translation data exists between those languages. Of course, not all languages have a lot of text available on the Internet, so the research team also turned to monolingual data. Facebook gives Chinese-to-French translation as an example: if the goal is to translate from Chinese to French but not enough data can be obtained, French monolingual data is used to improve the system by training a model in the reverse direction, from French to Chinese. For example, all the French data on Wikipedia is taken and translated into Chinese. Adding the new text obtained from this reverse translation to the dataset increases the data available for both input and output and makes the machine translation system more powerful.
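The reverse-translation step described here can be sketched in a few lines. This is a minimal illustration under stated assumptions: `back_translate` and `augment` are hypothetical helpers, and the French-to-Chinese "model" is a toy dictionary lookup standing in for a trained system.

```python
from typing import Callable, List, Tuple

# A (source, target) sentence pair, here (Chinese, French).
Pair = Tuple[str, str]


def back_translate(monolingual_target: List[str],
                   reverse_model: Callable[[str], str]) -> List[Pair]:
    """Build synthetic (source, target) pairs from target-language text only."""
    return [(reverse_model(sentence), sentence) for sentence in monolingual_target]


def augment(parallel: List[Pair],
            monolingual_target: List[str],
            reverse_model: Callable[[str], str]) -> List[Pair]:
    """Append synthetic pairs to the real parallel data."""
    return parallel + back_translate(monolingual_target, reverse_model)


# Toy French -> Chinese "model" (assumption): a lookup table, not a real system.
FR_ZH = {"Bonjour": "你好", "Merci": "谢谢"}


def fr_to_zh(sentence: str) -> str:
    return FR_ZH[sentence]


real_pairs = [("再见", "Au revoir")]   # scarce human-made zh -> fr data
french_only = ["Bonjour", "Merci"]     # plentiful monolingual French text

training_data = augment(real_pairs, french_only, fr_to_zh)
for zh, fr in training_data:
    print(zh, "->", fr)
```

In the synthetic pairs, the French side is genuine human-written text and only the Chinese side is machine-generated, so the Chinese-to-French system still learns to produce natural French.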
M2M-100, developed in this way, is said to outperform machine translation systems that use English as an intermediate language by 10 points on BLEU (Bilingual Evaluation Understudy), a score that measures the accuracy of machine translation. Of course, a vast number of languages are still not covered by M2M-100, and it remains unknown whether this work will eventually lead to a system that can translate directly between all languages. Related information can be found here.