It’s arguably the most exciting technology to arrive since the invention of the internet itself. The ability to converse effortlessly with anyone in any language is finally here, thanks to tiny in-ear devices or free apps on your phone. Or so the technologists say. The results so far might have been mixed at best, but Moore’s Law and lots of hand-waving tell us that we are close to the finish line of replacing humans, right?
The Problem as Most Technologists See It
Until very recently, the problem statement of speech translation was simple: take in spoken language, turn it into written language, use machine translation to flip that into another language, then use voice synthesis to speak out the result.
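That cascade is easy to sketch in code. The following toy version (all function names and the word-for-word lookup are my own illustrative stand-ins, not any real system's API) shows only the structure: three stages, each consuming the previous stage's output.

```python
# Toy illustration of the classic cascade: ASR -> MT -> TTS.
# Each stage is a deliberately naive stub; the point is the pipeline shape.

def speech_to_text(audio: str) -> str:
    """Stand-in for automatic speech recognition (ASR)."""
    return audio  # pretend the audio arrives already transcribed

def translate(text: str, table: dict) -> str:
    """Stand-in for machine translation: naive word-for-word lookup."""
    return " ".join(table.get(word, word) for word in text.split())

def text_to_speech(text: str) -> str:
    """Stand-in for voice synthesis."""
    return f"[synthesised audio: '{text}']"

def speech_translation(audio: str, table: dict) -> str:
    # The cascade: each stage consumes the previous stage's output,
    # so any error made early on is passed downstream untouched.
    transcript = speech_to_text(audio)
    translated = translate(transcript, table)
    return text_to_speech(translated)

en_fr = {"where": "où", "is": "est", "the": "le", "toilet": "toilette"}
print(speech_translation("where is the toilet", en_fr))
```

Notice that even this toy version already exhibits the core weakness discussed below: the word-for-word stage happily produces "le toilette" (wrong gender), and nothing downstream can detect or repair it.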
The key in that process was to hit 100% accuracy at each stage and suffer no loss at any single point. Hence the plethora of press releases proudly parroting figures like “97% accuracy” (the exact phrasing used by Tencent about their system before it fell down spectacularly in front of an audience).
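Even taking a figure like "97% accuracy" at face value, the cascade makes things worse than the headline number suggests: per-stage accuracies multiply, so the end-to-end figure is necessarily lower. A quick back-of-the-envelope check (assuming, generously, that errors at each stage are independent):

```python
# If each of the three stages (speech recognition, machine translation,
# voice synthesis) were independently 97% accurate, the chance of a
# sentence surviving the whole cascade intact would be 0.97 cubed.
per_stage = 0.97
end_to_end = per_stage ** 3
print(round(end_to_end, 3))  # roughly 0.913, i.e. about 91%, not 97%
```

In other words, roughly one sentence in eleven comes out damaged, before we even ask what "accuracy" is supposed to mean.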
A Major Flaw Appears
Apart from the gigantic holes in their reasoning that are obvious to anyone who has ever performed or studied interpreting (and which will be discussed at length in my new book), there is one major flaw in their problem definition. No-one, not even the greatest expert in interpreting, not even the best machine translation researcher, has a solid, empirically reliable and practically realistic definition of accuracy. Any attempt to produce one quickly runs up against either real life or logical potholes the size of a small continent, as the video below illustrates:
Why this Leads to “The Wall” (at least for now)
To discuss all the difficulties caused by this problem would take a book, not a blog post. The main point to understand is that this problem with “accuracy” is a symptom of the wider problem that the makers of speech translation systems do not understand how communication works between people, never mind how interpreting works. This point was underlined by Prof Andy Way at the recent ITI conference, when he pointed out that recently, there has been a trend for newcomers to attempt to solve machine translation without ever having learned a language or studied linguistics. This inevitably leads to embarrassing shocks.
In speech translation, the shocks are even worse. Without a basic knowledge of culture-specific pronoun and register use, the relationship of language to social context and how, for example, the functions of intonation in English are mirrored by sentence structure in French, any attempt at speech translation will never get past the stage of helping people find the toilet.
Haven’t Google Solved all that?
Google might just have found a way through some of the mess, with its much-vaunted “Translatotron”, which claims to work directly from speech to speech, even to the point of keeping speech patterns in the interpreted version. If they are actually telling the complete truth, that would indeed be a real breakthrough, but that breakthrough also hides an uncomfortable fact.
Speech patterns don’t work the same in different languages. Where English uses intonation for emphasis, clarification and expressing attitude, other languages use word order, speed, noun declensions or even code switching to do the same things. That means that the goal of making you sound the same in Spanish as you do in English is itself a pretty pointless goal.
The Coming Wall
This hints at a coming moment, which is likely to arrive sooner rather than later, when investment in speech translation begins to generate diminishing returns. While the current ways of doing speech translation are sufficient to make passable devices for tourists, the costs of doing so are still high. The ability to take this technology and turn it into either a replacement for human interpreters or even a consistently useful help for them seems out of reach. Why?
Quite simply, the current capability of speech translation isn’t limited by processor power or memory or programming but simply by that superficial understanding of language and communication I mentioned earlier. Pouring more money into speech translation might be a very good way to eventually make the devices cheaper or improve resistance to background noise but it won’t solve the underlying problems. In short, the weakest link in speech translation is the thinking of the engineers making the software.
Could this change? There is no reason why not. Anyone smart enough to build a system that can connect speech recognition, machine translation and voice synthesis is smart enough to pick up any book on interpreting or any book on spoken language and rewrite their algorithms accordingly.
That might be enough, assuming that there can be an algorithm that can fully understand not just words but meaning and intention. It might be enough if it is possible to make an algorithm that processes language as flexibly and quickly as the human brain and can detect new words and phrases and work out their meaning from context.
In the Meantime
As I haven’t seen any sign of speech translation makers moving away from the current faulty understanding of language and communication, I wouldn’t presume to predict the future of their work. I would suggest, however, that if current trends continue, the gains in quality in speech translation will soon slow to a crawl.
For businesses, this means relying on humans for all important communication and leaving finding your way to the nearest metro stop to the speech translation devices. For interpreters, this means keeping an eye on our “robot overlords” and keeping one step ahead. And if you want to know exactly how to do that, keep your eye out for a new book.