For the first time in history, AI can teach itself to translate any language on Earth

Source: Yee World   Author:   Date: 2017/12/07

As a think tank-style research and information platform, Yee World [official WeChat account: YEEWORLD] has launched its "Global Translation Affairs" column, which focuses on the global language services industry. Through bilingual translation, original features, and other formats, it reviews industry developments from a professional, forward-looking perspective. You are welcome to follow along!


This third installment shares the latest advance in machine translation research: machines that teach themselves to translate.


Machine-based translation is amazing, but hundreds of millions of people on our Pale Blue Dot can’t enjoy its benefits–because their language is nowhere to be found in the translator’s pull-down menu. Now, two new artificial intelligence systems–one from the Universidad del País Vasco (UPV) in Spain and another from Facebook–promise to change all that, opening the door to true universal translators like the ones in Star Trek.

 


To understand the potential of these new systems, it helps to know how current machine translation works. The current de facto standard is Google Translate, a system that covers 103 languages from Afrikaans to Zulu, including the top 10 languages in the world–in order, Mandarin, Spanish, English, Hindi, Bengali, Portuguese, Russian, Japanese, German, and Javanese. Google’s system uses human-supervised neural networks that compare parallel texts–books and articles that have been previously translated by humans. By comparing extremely large amounts of these parallel texts, Google Translate learns the equivalences between any two given languages, thus acquiring the ability to quickly translate between them. Sometimes the translations are funny or don’t really capture the original meaning but, in general, they are functional and, over time, they’re getting better and better.
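To make "learning equivalences by comparing parallel texts" concrete, here is a minimal, hypothetical Python sketch: it counts how often source and target words co-occur across aligned sentence pairs and reads a crude bilingual dictionary off those statistics. Real neural systems such as Google Translate learn soft alignments inside a network rather than raw counts, and the toy corpus below is invented purely for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy parallel corpus (human-translated sentence pairs), standing in
# for the translated books and articles a system like Google Translate learns from.
parallel_corpus = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats",   "le chat mange"),
]

# Crude illustration of "learning equivalences by comparison": score each
# source/target word pair by how often the two co-occur in aligned sentences,
# normalised by how common each word is on its own.
pair_count, src_count, tgt_count = Counter(), Counter(), Counter()
for src, tgt in parallel_corpus:
    src_words, tgt_words = set(src.split()), set(tgt.split())
    src_count.update(src_words)
    tgt_count.update(tgt_words)
    pair_count.update(product(src_words, tgt_words))

def best_translation(word):
    candidates = [t for s, t in pair_count if s == word]
    return max(candidates,
               key=lambda t: pair_count[(word, t)] / (src_count[word] * tgt_count[t]))

for word in ["cat", "dog", "sleeps"]:
    print(word, "->", best_translation(word))
```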

 


Google’s approach is good, and it works. But unfortunately, it’s not universally functional. That’s because supervised training requires a very long time and a lot of supervisors–so many that Google actually uses crowdsourcing–but also because there just aren’t enough of these parallel texts translated between all the languages in the world. Consider this: According to the Ethnologue catalog of world languages, there are 6,909 living languages on Earth. Of those, 414 account for 94% of humanity. Since Google Translate covers 103, that leaves 6,806 languages without automated translation–311 of them with more than one million speakers. In total, at least eight hundred million people can’t enjoy the benefits of automated translation.


Now, the two new artificial intelligence systems show that neural networks can learn to translate with no parallel texts—a surprising advance that could make documents in many languages more accessible.


“Imagine that you give one person lots of Chinese books and lots of Arabic books—none of them overlapping—and the person has to learn to translate Chinese to Arabic. That seems impossible, right?” says Mikel Artetxe, a computer scientist at the UPV. “But we show that a computer can do that.”


Most machine learning—in which neural networks and other computer algorithms learn from experience—is “supervised.” A computer makes a guess, receives the right answer, and adjusts its process accordingly. That works well when teaching a computer to translate between, say, English and French, because many documents exist in both languages. It doesn’t work so well for rare languages, or for popular ones without many parallel texts.
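The guess-check-adjust loop described above fits in a few lines. The snippet below is not a translation model; it is a toy linear model with an invented target, meant only to show the supervised pattern: make a prediction, compare it with the known right answer, and nudge the parameters to shrink the error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised loop (invented data, not a translation model): the "right answers"
# come from a hidden linear rule, and the learner repeatedly guesses, checks, adjusts.
true_w = np.array([2.0, -1.0, 0.5])      # hidden rule that generates the answers
X = rng.normal(size=(256, 3))            # training inputs
y = X @ true_w                           # supervised targets (the right answers)

w = np.zeros(3)                          # the model's adjustable parameters
learning_rate = 0.1
for step in range(100):
    guess = X @ w                                 # 1) make a guess
    error = guess - y                             # 2) compare it with the right answer
    w -= learning_rate * X.T @ error / len(X)     # 3) adjust the parameters accordingly

print("learned parameters:", np.round(w, 3))      # close to [ 2.  -1.   0.5]
```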


But the two new artificial intelligence systems focus on another method: unsupervised machine learning. To start, each constructs bilingual dictionaries without the aid of a human teacher telling them when their guesses are right. That’s possible because languages have strong similarities in the ways words cluster around one another. The words for table and chair, for example, are frequently used together in all languages. So if a computer maps out these co-occurrences like a giant road atlas with words for cities, the maps for different languages will resemble each other, just with different names. A computer can then figure out the best way to overlay one atlas on another. Voilà! You have a bilingual dictionary.
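The "overlaying one atlas on another" step has a compact geometric core: if each language's words are embedded as vectors whose layout reflects co-occurrence, the two point clouds can often be aligned by a single rotation. The sketch below builds synthetic data in which language B is, by construction, a hidden rotation of language A, then recovers that rotation with an orthogonal Procrustes solution and reads translations off nearest neighbors. The real systems must also discover which points correspond without any seed dictionary (for example through adversarial training), a step this toy example skips.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "monolingual" embeddings: language B is a hidden rotation of language A
# plus a little noise, mimicking the claim that the co-occurrence "maps" of two
# languages have the same shape under different names. (Invented data, not real
# word vectors.)
n_words, dim = 1000, 50
X = rng.normal(size=(n_words, dim))                      # language A word vectors
R_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))    # hidden orthogonal overlay
Y = X @ R_true + 0.01 * rng.normal(size=(n_words, dim))  # language B word vectors

# Orthogonal Procrustes: find the rotation W that best lays atlas X over atlas Y.
# The real unsupervised systems must first guess the word correspondences; here the
# rows are already aligned so the sketch stays short.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# "Bilingual dictionary": each word in A translates to its nearest neighbor in B.
mapped = X @ W
similarity = mapped @ Y.T / (
    np.linalg.norm(mapped, axis=1, keepdims=True) * np.linalg.norm(Y, axis=1)
)
predicted = similarity.argmax(axis=1)
print("induction accuracy on synthetic data:",
      (predicted == np.arange(n_words)).mean())
```

On real embeddings trained from monolingual text the same overlay idea applies, but the correspondence itself has to be inferred, which is what makes the fully unsupervised setting hard.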

 


The new systems, which use remarkably similar methods, can also translate at the sentence level. They both use two training strategies, called back translation and denoising. In back translation, a sentence in one language is roughly translated into the other, then translated back into the original language. If the back-translated sentence is not identical to the original, the neural networks are adjusted so that next time they’ll be closer. Denoising is similar to back translation, but instead of going from one language to another and back, it adds noise to a sentence (by rearranging or removing words) and tries to translate that back into the original. Together, these methods teach the networks the deeper structure of language.
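A hedged sketch of the two training signals: the noise function below drops words and locally shuffles the survivors, the kind of corruption a denoising objective asks the model to undo, while the round-trip function shows the shape of back translation, with toy dictionaries standing in for the two neural translators. The word tables and sentences are invented for illustration; the papers' actual noise parameters and models differ.

```python
import random

random.seed(0)

def add_noise(sentence, drop_prob=0.1, shuffle_window=3):
    """Corrupt a sentence the way a denoising objective does: randomly drop words
    and locally shuffle the survivors, so the model must reconstruct the original."""
    words = [w for w in sentence.split() if random.random() > drop_prob]
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(words))]
    return " ".join(w for _, w in sorted(zip(keys, words)))

# Toy dictionary "translators" standing in for the two neural models (invented for
# illustration; the real systems translate with neural networks, not lookup tables).
EN_TO_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}
FR_TO_EN = {value: key for key, value in EN_TO_FR.items()}

def translate(sentence, table):
    return " ".join(table.get(word, word) for word in sentence.split())

original = "the cat sleeps"
round_trip = translate(translate(original, EN_TO_FR), FR_TO_EN)

# In back translation, any mismatch between `original` and `round_trip` becomes the
# training signal used to adjust both translators; the toy tables here are already
# consistent, so the round trip comes back unchanged.
print(add_noise("the cat sleeps on the warm mat"))
print(original, "->", translate(original, EN_TO_FR), "->", round_trip)
```

In the actual systems both signals are applied to neural encoder-decoders and repeated over millions of monolingual sentences, which is where the "deeper structure of language" gets learned.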


There are slight differences between the techniques. The UPV system back translates more frequently during training. The other system, created by Facebook computer scientist Guillaume Lample, based in Paris, and collaborators, adds an extra step during translation. Both systems encode a sentence from one language into a more abstract representation before decoding it into the other language, but the Facebook system verifies that the intermediate “language” is truly abstract. Artetxe and Lample both say they could improve their results by applying techniques from the other’s system.
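The article says both systems encode a sentence into a more abstract intermediate representation, and that the Facebook system additionally checks that this representation is genuinely language-neutral. One common way to impose such a check, used here purely as an assumed mechanism rather than a description of either paper, is an adversarial discriminator: it tries to guess which language a latent code came from, while the encoder is trained to make that guess impossible. The sketch below uses random vectors as stand-in sentences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "sentences" are random vectors with different statistics per language.
# A shared encoder maps both into one latent space; a discriminator tries to tell
# which language a latent code came from, and the encoder is trained to fool it,
# pushing the intermediate representation toward being language-neutral.
vocab_size, latent_size, batch = 100, 16, 32
encoder = nn.Sequential(nn.Linear(vocab_size, 64), nn.ReLU(), nn.Linear(64, latent_size))
discriminator = nn.Sequential(nn.Linear(latent_size, 32), nn.ReLU(), nn.Linear(32, 1))

enc_opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
dis_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    lang_a = torch.rand(batch, vocab_size)        # fake batch of language-A inputs
    lang_b = torch.rand(batch, vocab_size) * 2.0  # fake batch of language-B inputs
    z_a, z_b = encoder(lang_a), encoder(lang_b)

    # 1) The discriminator learns to label the language of each latent code.
    dis_loss = bce(discriminator(z_a.detach()), torch.zeros(batch, 1)) + \
               bce(discriminator(z_b.detach()), torch.ones(batch, 1))
    dis_opt.zero_grad()
    dis_loss.backward()
    dis_opt.step()

    # 2) The encoder is updated with flipped labels, so the two languages become
    #    indistinguishable in the shared latent space.
    adv_loss = bce(discriminator(z_a), torch.ones(batch, 1)) + \
               bce(discriminator(z_b), torch.zeros(batch, 1))
    enc_opt.zero_grad()
    adv_loss.backward()
    enc_opt.step()

print("discriminator loss after training:", round(float(dis_loss), 3))
```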

 


In addition to translating between languages without many parallel texts, both Artetxe and Lample say their systems could help with common pairings like English and French if the parallel texts are all the same kind, like newspaper reporting, but you want to translate into a new domain, like street slang or medical jargon. But, “This is in infancy,” Artetxe’s co-author Eneko Agirre cautions. “We just opened a new research avenue, so we don’t know where it’s heading.”


One caveat? The systems are not as accurate as current parallel-text deep learning systems–but the fact that a computer can guess all this without any human guidance is, as Microsoft AI expert Di He points out, nothing short of incredible. We’re just scratching the surface of this new learning method. It seems very likely that sometime soon, a true universal translator that allows us to talk to anyone in their native tongue won’t just be the stuff of sci-fi.


A long-standing goal of artificial intelligence is to create algorithms that can learn challenging domains to a level beyond human mastery. Earlier, AlphaGo Zero started from a blank slate and quickly taught itself the game of Go without any human input. Perhaps in the future AI will become "self-taught" in many fields, and machine translation may achieve even greater breakthroughs as a result. What do you think? Feel free to leave a comment and join the discussion.


Source: the Yee World WeChat official account (YEEWORLD); compiled with reference to Leiphone (雷锋网) and China Science Daily (《中国科学报》).
