The Current State of Machine Translation

Mathieu Gautier
8 min read · Sep 2, 2019


Takeaways from the 2019 Machine Translation Summit in Dublin

“Has machine translation (MT) become so good that we are leaving massive productivity gains on the table when we translate from scratch? In other words, are the non-MT users of today the non-CAT tool users of the early 2000s (a.k.a. dinosaurs)?”

I set out to Dublin to find out.

After spending five days at the 17th Machine Translation Summit, here is what I learned.

The alchemy of neural machine translation

If you don’t understand how neural machine translation (NMT) actually works, you’re in good company. The previous paradigm, statistical machine translation (SMT), worked essentially by decomposing a sentence into building blocks, translating those blocks, then putting them back together. NMT is still grounded in statistical methods, but the big shift is that instead of stitching building blocks together, it translates through a sophisticated sequence of encoding and decoding performed by neural networks, a notion early computer scientists saw as “mystical” and that some language researchers hope might reveal a universal grammar. To put the system’s complexity into perspective, consider that the same model that translates from English to Korean and from English to Italian can also translate from Italian to Korean without ever having seen data in that language combination (a process called “zero-shot translation”).
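
To make zero-shot translation a little more concrete, here is a minimal sketch of the data preparation typically used for multilingual NMT, where a target-language token prepended to each source sentence tells a single model which language to produce. The sentences, tag format and variable names below are illustrative assumptions, not taken from any real system.

```python
# Minimal sketch of the data preparation behind multilingual ("zero-shot") NMT.
# The engine itself is a neural encoder-decoder; the only multilingual trick
# shown here is prepending a target-language token to each source sentence.
# The training pairs below are illustrative placeholders, not real data.

training_pairs = [
    ("en", "ko", "The weather is nice today.", "오늘은 날씨가 좋다."),
    ("en", "it", "The weather is nice today.", "Oggi il tempo è bello."),
    # Note: no Italian -> Korean pairs anywhere in the training data.
]

def tag_source(src_lang, tgt_lang, src_sentence):
    """Prepend a token telling the model which language to produce."""
    return f"<2{tgt_lang}> {src_sentence}"

corpus = [(tag_source(s, t, src), tgt) for s, t, src, tgt in training_pairs]

# At inference time, the same trained model can be asked for an unseen
# direction simply by changing the tag, e.g. Italian to Korean:
zero_shot_input = tag_source("it", "ko", "Oggi il tempo è bello.")
print(zero_shot_input)  # "<2ko> Oggi il tempo è bello."
```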

The practical implication is that, outside of Google and co., NMT engines are essentially a black box. Instead of looking under the hood to fine-tune performance, industry and academia are mainly working upstream and downstream, i.e. adding pre- and post-processing mechanisms to improve domain adaptation or correct the machine’s “hallucinations”, the term used in the community for the bizarre mistakes characteristic of NMT, such as inserting content into the translation that does not exist in the source text.

One downstream technique is automatic post-editing, where the translation produced by one engine is corrected by a second model trained for that purpose. For example, the post-editing model can learn from corrections made by human post-editors and automatically apply similar corrections to the machine translation.
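
As an illustration, here is a minimal sketch of how an automatic post-editing training set is typically assembled from (source, raw MT output, human post-edit) triples; the sentences and the input format are illustrative assumptions, not a description of any specific system presented at the summit.

```python
# Sketch of how an automatic post-editing (APE) training set is typically
# assembled: each example is a (source, raw MT output, human post-edit) triple.
# A second model is then trained to map (source, mt_output) -> post_edit,
# i.e. to reproduce the corrections human post-editors keep making.
# The records below are illustrative placeholders.

ape_triples = [
    {
        "source":    "Cliquez sur le bouton pour continuer.",
        "mt_output": "Click the button for continue.",
        "post_edit": "Click the button to continue.",
    },
    # ... thousands more triples collected from past post-editing jobs
]

def to_training_example(triple):
    """Concatenate source and MT output as the input; the human fix is the target."""
    model_input = f"{triple['source']} ||| {triple['mt_output']}"
    model_target = triple["post_edit"]
    return model_input, model_target

train_set = [to_training_example(t) for t in ape_triples]
```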

In a fascinating study published this summer, Google’s Markus Freitag et al. were successful in making translations more idiomatic by automatically post-editing them with a model trained on target-language data. If a machine can learn from monolingual data (naturally written, non-translated texts), could it eventually produce more natural-sounding texts than humans by correcting our tendency to reproduce source-text attributes in the target text (a phenomenon referred to as “translationese”)? While Freitag and his team were apparently able to achieve perfect results within a very narrow domain (sports results), the researcher said that the machine is still very far from achieving this, if it ever does.

Another big research focus is quality estimation (QE). QE can be used to help translators zero in on the passages that most likely require their attention, to assess the usability of machine translation, to compare the output of different engines, to estimate the effort required for post-editing, and to increase the amount of raw machine translation that can be sent directly to publication. Given the fluency of NMT output, there is a lot of interest in QE, as a reliable indicator of potential errors could allow for a substantial reduction in post-editing work in many contexts. At the summit, Unbabel, an innovative localization company based in Portugal, presented its in-house system, with which it appears to be having some success, although only a fraction of machine-translated sentences reach the threshold required to forgo human review. A lot of work is being done on QE and some off-the-shelf systems are available, such as Memsource’s, but none seem to have reached the level of robustness required for large-scale adoption yet.
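
To give a sense of how QE slots into a workflow, here is a minimal sketch of threshold-based routing; the scoring function is a toy stand-in for a real QE model, and the threshold value is purely illustrative.

```python
# Sketch of how quality estimation (QE) is used in a workflow: each
# machine-translated segment gets a score, and only segments above a
# confidence threshold skip human review. `estimate_quality` is a toy
# stand-in, not a real QE model.

PUBLISH_THRESHOLD = 0.9  # illustrative value; tuned per content type in practice

def estimate_quality(source: str, mt_output: str) -> float:
    """Toy stand-in for a trained QE model: penalizes large length mismatches.
    A real system would use a learned model, not this heuristic."""
    if not source or not mt_output:
        return 0.0
    ratio = len(mt_output) / len(source)
    return max(0.0, 1.0 - abs(1.0 - ratio))

def route_segment(source: str, mt_output: str) -> str:
    score = estimate_quality(source, mt_output)
    if score >= PUBLISH_THRESHOLD:
        return "publish raw MT"          # only a fraction of segments get here
    return "send to human post-editing"  # flagged for the translator's attention
```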

[Image: courtesy of Tate Modern]

It’s all about the data

A recurring theme throughout the conference was that the quality of an engine’s output largely depends on the quality and domain specificity of the data used to train it. Everyone who has been involved in NMT implementation seems to agree that a large volume of high-quality, in-domain data is the key to producing output that requires minimal post-editing. As a result, data has become a new source of power, which tends to disadvantage individual translators relative to big companies holding large amounts of private, industry-specific translation memories. This may be mitigated to a certain extent by the work being done to allow domain adaptation with small amounts of data layered on top of a generic engine, as ModernMT does, for example.

Harvesting, aligning and cleaning data have become major activities in the translation industry, and programs like ParaCrawl and Sketch Engine are joining PDF converters and alignment software in a growing number of translators’ toolkits. In a world where domain-specific corpora will become increasingly valuable, translators may want to invest effort into building their own over time, for example by adding domain metatags to their translation memories.
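
For translators who want to start tagging their own memories, here is a minimal sketch of one way to do it on a TMX file (the standard exchange format for translation memories); the “x-domain” property name and the file names are illustrative choices, not requirements of the format.

```python
# Sketch of one way to build a domain-tagged corpus over time: add a domain
# property to every translation unit in a TMX file.

import xml.etree.ElementTree as ET

def tag_tmx_with_domain(in_path: str, out_path: str, domain: str) -> None:
    tree = ET.parse(in_path)
    for tu in tree.getroot().iter("tu"):
        prop = ET.Element("prop", {"type": "x-domain"})
        prop.text = domain
        tu.insert(0, prop)  # TMX expects <prop> before the <tuv> elements
    tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Example: tag everything from a financial-statements project.
# tag_tmx_with_domain("project_memory.tmx", "project_memory_tagged.tmx", "finance")
```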

Efforts are also being made to get better output by manipulating the training data itself. One way of doing this is adding placeholders to reduce the “noise” created in the data by people, company and product names, for example. Another method, called “oversampling”, is to flood the data with multiple instances of desired groups of words (such as those in a client glossary) to increase the probability that the machine will use those words in the translation. Other methods for improving the robustness of NMT engines through data preparation are presented in this paper by machine translation provider Iconic Translation Machines.
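
Here is a toy sketch of both techniques on a two-sentence corpus; the placeholder token, the entity list, the glossary and the oversampling factor are all illustrative assumptions.

```python
# Toy sketch of the two data-manipulation techniques mentioned above:
# (1) placeholders: replace entity names with a token so they stop adding noise;
# (2) oversampling: repeat sentence pairs containing glossary terms so the
#     engine is more likely to reproduce the preferred terminology.
# The corpus, entity list and glossary below are illustrative.

corpus = [
    ("Acme Corp released the X200 today.", "Acme Corp a lancé le X200 aujourd'hui."),
    ("The invoice is attached.", "La facture est jointe."),
]
entities = ["Acme Corp", "X200"]   # people/company/product names
glossary_terms_en = ["invoice"]    # client-preferred terminology

def add_placeholders(pair, entities):
    src, tgt = pair
    for i, ent in enumerate(entities):
        token = f"<ENT{i}>"
        src, tgt = src.replace(ent, token), tgt.replace(ent, token)
    return src, tgt

def oversample(pairs, terms, factor=3):
    out = []
    for src, tgt in pairs:
        copies = factor if any(t in src.lower() for t in terms) else 1
        out.extend([(src, tgt)] * copies)
    return out

prepared = oversample([add_placeholders(p, entities) for p in corpus], glossary_terms_en)
```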

NMT use and impact

Everyone is using the same open-source engines. Most neural machine translation users and providers seem to have switched from OpenNMT to marianNMT, and to now be progressively moving to Google’s Transformer.

Successful use cases in a wide range of contexts were presented at the summit, including at the European Commission, a major Swiss bank, an Italian agency specializing in patents, and e-commerce companies like eBay and Alibaba. The productivity gains reported range from marginal (<10%) to massive (>50%).

Speakers noted throughout the conference that the dust has yet to settle in the industry since the advent of NMT. When translation memories became ubiquitous in the late ’90s and early 2000s, standard billing practices eventually emerged based on average daily production and match rates. The variability and unpredictability of NMT quality and productivity gains have left everyone wondering where the industry is headed. While edit distance could potentially serve as a new objective basis for billing, as it does at Amazon, many see hourly billing as the only way forward. In any case, there were no clear answers at the summit.
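
For the curious, here is a minimal sketch of what an edit-distance-based effort metric could look like: a character-level Levenshtein distance between the raw MT output and the post-edited segment, normalized by the longer of the two. The normalization choice is an assumption, not an industry standard.

```python
# Sketch of an edit-distance-based effort metric: character-level Levenshtein
# distance between the raw MT output and the post-edited segment, normalized
# by the longer of the two strings.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def post_edit_distance(mt_output: str, post_edited: str) -> float:
    """0.0 = untouched segment, 1.0 = completely rewritten."""
    longest = max(len(mt_output), len(post_edited)) or 1
    return levenshtein(mt_output, post_edited) / longest

print(post_edit_distance("Click the button for continue.",
                         "Click the button to continue."))
```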

In terms of market impact, the advances in machine translation have made it possible for tech-savvy companies like Unbabel to effectively cater to the middle ground between human and raw machine translation. Unbabel translates emails written by English-speaking customer service agents into the language of customers worldwide, a service for which raw MT wouldn’t be suitable but traditional human translation would be too slow and expensive. This type of service might have saved the Government of Canada from the Portage fiasco, and will inevitably replace human translation in this market segment, where affordability and fast turnaround are key and anything above “good enough” is over-quality.

What’s next

If the delegation at the summit is any indication, the next advances in MT are likely to come from companies like Google, Facebook, Amazon, Microsoft, Baidu and Alibaba, and from research hubs in Spain, Italy, Ireland, the UK, the Netherlands, Germany, China and Japan. Sidenote: Canada, apparently famous in the machine translation world for developing the first successful system (designed for weather forecasts), was represented by only three delegates at the summit, none of them researchers, among the record 280 in attendance.

There was no talk at the summit of any emerging successor to NMT, which replaced the statistical model as the state of the art around 2016, just as SMT had overtaken the rule-based model in the early 2000s. Nor does there seem to be an expectation of any massive breakthroughs on the horizon. Because machine translation performance is highly dependent on training data, domain adaptation and customized upstream and downstream processes, the biggest gains may be context-specific rather than coming from improvements to the general model.

One research focus that may bring NMT to a new level is the machine’s ability to consider context beyond the sentence level, since many of its current shortcomings, such as the tendency to translate terms inconsistently within the same document, result from translating sentences in isolation. In his award-winning study, Longyue Wang was successful in correcting mistranslations of dropped pronouns (in Chinese and Japanese, pronouns can be omitted when the referent can be inferred from context), which, like cohesion markers, are particularly challenging for machine translation.
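
One simple way researchers feed document context to an engine is to concatenate the previous source sentence to the current one with a separator token; the sketch below illustrates that generic concatenation approach and is not the specific method used in Wang’s study.

```python
# Minimal sketch of one simple way to give an NMT engine context beyond the
# sentence level: concatenate the previous source sentence to the current one,
# marked by a separator token, so the model sees more than one sentence at a time.

SEP = "<sep>"

def with_context(sentences, window=1):
    """Yield context-augmented source strings for each sentence in a document."""
    for i, sent in enumerate(sentences):
        context = sentences[max(0, i - window):i]
        yield f" {SEP} ".join(context + [sent]) if context else sent

doc = ["The battery lasts ten hours.", "It charges in thirty minutes."]
print(list(with_context(doc)))
# ['The battery lasts ten hours.',
#  'The battery lasts ten hours. <sep> It charges in thirty minutes.']
```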

Another big topic of discussion at the summit was the integration of machine translation and translation memories (TM). There was debate over the ideal threshold at which MT should be suggested to the translator rather than a TM match. Establishing a threshold that would be relevant in all contexts seems difficult, as it depends highly on specific factors, such as the importance of consistency and the quality of TM matches. Examples of work being done to better leverage translation memories in machine translation include John Ortega et al.’s study on fuzzy match repair through automatic post-editing, and Catarina Silva’s study on improving domain adaptation by using “translation pieces” (or sub-segment TM matches).
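
Here is a minimal sketch of the decision logic under discussion: suggest the TM fuzzy match when its score clears a threshold, and MT otherwise. The 75% value and the function names are purely illustrative, which is precisely the point of the debate.

```python
# Sketch of the MT/TM integration decision discussed above: show the translator
# a TM fuzzy match when its score clears a threshold, and machine translation
# otherwise. The threshold below is illustrative, not a recommendation.

FUZZY_THRESHOLD = 75  # percent match score below which MT is preferred

def pick_suggestion(best_tm_match, machine_translation):
    """best_tm_match is a (score, translation) tuple, or None if the TM has nothing."""
    if best_tm_match is not None:
        score, tm_translation = best_tm_match
        if score >= FUZZY_THRESHOLD:
            return "TM", tm_translation
    return "MT", machine_translation

print(pick_suggestion((82, "La facture est jointe."), "La facture est en pièce jointe."))
print(pick_suggestion((60, "Veuillez trouver la facture."), "La facture est en pièce jointe."))
```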

Conclusion

So, if you are not already using MT, should you be? The answer is not as straightforward as whether you should have invested in a translation memory fifteen years ago. But those who have large amounts of in-domain data can easily train an off-the-shelf engine with providers such as KantanMT, Tilde, Globalese or ModernMT, have the machine translation results fed directly into their CAT tool, and see for themselves, with a relatively small investment, where they fall on the wide-ranging NMT productivity gain scale.


Mathieu Gautier

EN-FR | FR-EN translator and agency owner with interests in translation quality, productivity and technology.