
Goslate: Free Google Translate API

2016-03-11 17:19


Google's new ticket mechanism now blocks offending IPs, so Goslate can no longer be used.

http://ju.outofmemory.cn/entry/37127

http://pythonhosted.org/goslate/

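Goslate: Free Google Translate

Important update: Google has just upgraded its online translation system. The newly added ticket mechanism effectively blocks simple crawlers such as goslate. Technically, a more sophisticated crawler could still fetch translations, but doing so would cross the line. Goslate will not be updated to break Google's ticket mechanism. The free lunch is over.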

Motivation

Usage

How It Works

Optimization

Design

Open Source

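Motivation

Machine translation is of poor quality, but it saves time and effort. Among the common online translation systems, Google's is one of the better ones. Google Translate offers not only a web interface but also an API that programs can call directly. The catch is that the Google Translate API has been a paid service since 2012. That does not stop a resourceful programmer: as long as translation on the Google website is free, you can always write a crawler that fetches results from the site automatically. I spent some time writing such a crawler and then wrapped it in a simple, efficient Python library for using Google Translate free of charge. That library is Goslate (Google Translate).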

Usage

Goslate supports Python 2.6 and later, including Python 3. You can install it with pip or easy_install:

$ pip install goslate

Goslate currently consists of a single Python file, so you can also download the latest goslate.py directly. Usage is simple. Here is an English-to-German example:

>>> import goslate
>>> gs = goslate.Goslate()
>>> print(gs.translate('hello world', 'de'))
hallo welt

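goslate.py is not only a Python module but also a command-line tool that you can run directly.

Translate standard input from English to Chinese and print the result to the screen: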

$ echo "hello world" | goslate.py -t zh-CN


Translate two files and save the result to out.txt in UTF-8 encoding:

$ goslate.py -t zh-CN -o utf-8 src/1.txt "src 2.txt" > out.txt


For more advanced usage, see the documentation.

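How It Works

Using the official Google Translate API requires paying for a key first. If the key is invalid, the API returns an error and refuses service. So how does Goslate get around key validation? Through a back door? Quite the opposite: Goslate walks straight through the front door by fetching results directly from the Google Translate website.

When we translate hello world with Google in a browser and capture the traffic, we find that the browser requests this URL: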


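It is easy to see that the source text is encoded directly in the URL as the text parameter. The tl parameter stands for translate language, here zh-CN (Simplified Chinese).

Google Translate returns: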

{"sentences":[{"trans":"世界,你好!","orig":"hello world!","translit":"Shìjiè, nǐ hǎo!","src_translit":""},{"trans":"认识你很高兴。","orig":"nice to meet you.","translit":"Rènshi nǐ hěn gāoxìng.","src_translit":""}],"src":"en","server_time":48}

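The format resembles JSON but is not standard JSON. Besides the translation, it includes extra information such as the pinyin and the language of the source text, presumably to support some special client-side features.

The process is simple. Our crawler logic is:

First, combine the source text and the target language into a URL like the one above

Then use Python's urllib2 to HTTP GET the result from the Google Translate site

Finally, extract the translation from the returned data

One thing to note: Google really dislikes Python crawlers :) It rejects every HTTP request whose User-Agent is Python-urllib/2.7. We have to pose as a browser with User-Agent: Mozilla/4.0 to put Google at ease. Another small trick: changing the client parameter in the URL from t to any other value makes the response standard JSON, which is easier to parse.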
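The three crawler steps can be sketched roughly as follows. This is only an illustration written for Python 3, not Goslate's actual code: the endpoint in BASE_URL is a placeholder because the real URL is not shown in this article, the helper names build_request and extract_translation are made up, and client="x" is just an arbitrary value other than t. The request is built but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (assumption): the real URL is not given in the article.
BASE_URL = "http://translate.example.invalid/translate"


def build_request(text, target_language, client="x"):
    """Step 1: encode the source text and target language into a GET request."""
    query = urlencode({"client": client, "text": text, "tl": target_language})
    # Send a browser-like User-Agent instead of the default Python-urllib one.
    return Request(BASE_URL + "?" + query, headers={"User-Agent": "Mozilla/4.0"})


def extract_translation(payload):
    """Step 3: join the "trans" field of every sentence in a JSON response."""
    data = json.loads(payload)
    return "".join(sentence["trans"] for sentence in data["sentences"])


# Step 2 (the HTTP GET itself) is omitted so the sketch runs offline.
request = build_request("hello world!", "zh-CN")
print(request.full_url)
print(extract_translation('{"sentences":[{"trans":"Hallo "},{"trans":"Welt"}],"src":"en"}'))
```

The field names sentences and trans are taken from the sample response shown above.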

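Optimization

The crawler works, but it has two problems:

Short: limited by URL length, it can only translate short texts of no more than 2000 bytes. Longer texts must be split by hand and translated in several requests

Slow: every translation costs one HTTP round trip of nearly 1 second, which is a heavy overhead. Translating 8000 short sentences, for example, takes nearly 2 hours

The length problem is solved by automatic splitting and multiple queries: for a long text, Goslate splits it at punctuation marks, line breaks and similar separators into several sub-texts of close to 2000 bytes each, queries them one by one, and returns the concatenated translations to the user. In this way Goslate breaks the length limit.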
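A minimal sketch of this kind of splitting, assuming a fixed set of ASCII separators and measuring size in UTF-8 bytes. The function name split_text and the separator set are illustrative choices, not Goslate's real implementation.

```python
import re

MAX_BYTES = 2000  # per-query limit quoted in the article

# A piece is a run of text ending in one separator, or a trailing run without one.
PIECE = re.compile(r"[^.!?;,\n]*[.!?;,\n]|[^.!?;,\n]+")


def split_text(text, max_bytes=MAX_BYTES):
    """Split text at separators into chunks of at most max_bytes UTF-8 bytes."""
    chunks, current = [], ""
    for piece in PIECE.findall(text):
        if len(piece.encode("utf-8")) > max_bytes:
            # A single piece with no separator inside cannot be split further.
            raise ValueError("input too large")
        if current and len((current + piece).encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


text = "one. two. three."
chunks = split_text(text, max_bytes=10)
print(chunks)
```

Because every piece keeps its trailing separator, joining the chunks gives back the original text, so the translated chunks can simply be concatenated in order.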

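The speed problem is harder, because performance is bound by network latency. The official Google API accepts several texts in one call for batch translation, which greatly reduces the number of HTTP round trips. Goslate supports batch translation too: since one query allows up to 2000 bytes of text, it uses as much of that as possible. When the user passes in several texts, Goslate joins small texts into a larger text of no more than 2000 bytes, translates it with a single HTTP request, and finally splits the result back into the corresponding translations.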
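The packing step might look like the sketch below. It assumes that texts are joined with a newline, that no input contains a newline, and that the translation preserves the separator; all three are simplifying assumptions, and the names pack_texts and translate_batch are hypothetical.

```python
def pack_texts(texts, max_bytes=2000, separator="\n"):
    """Group short texts into batches whose joined size stays within max_bytes."""
    batch, size = [], 0
    sep_size = len(separator.encode("utf-8"))
    for text in texts:
        text_size = len(text.encode("utf-8"))
        extra = text_size + (sep_size if batch else 0)
        if batch and size + extra > max_bytes:
            yield batch
            batch, size = [], 0
            extra = text_size
        batch.append(text)
        size += extra
    if batch:
        yield batch


def translate_batch(texts, translate_one, max_bytes=2000, separator="\n"):
    """Translate many texts using one call to translate_one per packed batch."""
    for batch in pack_texts(texts, max_bytes, separator):
        joined = translate_one(separator.join(batch))
        for part in joined.split(separator):
            yield part


# str.upper stands in for a real HTTP translation call.
print(list(pack_texts(["aa", "bb", "cc"], max_bytes=5)))
print(list(translate_batch(["aa", "bb", "cc"], str.upper, max_bytes=5)))
```

A text that is larger than max_bytes on its own ends up alone in a batch here; handling it is the job of the long-text splitting described earlier.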

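Note that batch querying and long-text support work in opposite directions: batch querying joins texts into a large block, translates it once, and splits the result; long-text support splits the text, translates the pieces separately, and joins the results. If one text in a batch is too long, it must itself be split first and then merged with the smaller texts around it. The logic looks complicated, but it only takes a sensible layering of functionality:

At the bottom, Goslate._basic_translate() translates a single text through an HTTP request and does not support splitting long texts

Goslate._translate_single_text() builds on _basic_translate() and supports long texts by splitting automatically and querying several times

Finally, the public API Goslate.translate() supports batch translation by joining texts and then calling _translate_single_text()

With these three layers, the complicated split-and-join logic is expressed cleanly. "All problems in computer science can be solved by another level of indirection": it holds true here.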

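Another way to speed things up is concurrency. Goslate uses a thread pool internally to issue multiple HTTP requests concurrently, which greatly speeds up querying. Rather than reinventing the wheel, it uses the thread pool provided by the futures library directly.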
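As a rough illustration of the thread-pool idea, the sketch below runs a stand-in translation function over several chunks with concurrent.futures. The function fake_translate only sleeps to imitate network latency, and the worker count is an arbitrary choice for the demo.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_translate(text):
    """Stand-in for one HTTP round trip (assumption: no real network call)."""
    time.sleep(0.05)
    return text.upper()


def translate_concurrently(chunks, translate_one, max_workers=8):
    """Run translate_one over all chunks in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Executor.map returns results in the order of the inputs.
        return list(executor.map(translate_one, chunks))


chunks = ["chunk %d" % i for i in range(16)]
start = time.time()
results = translate_concurrently(chunks, fake_translate)
print(results[:2], "%.2fs" % (time.time() - start))
```

Threads suit this workload because each request spends almost all of its time waiting on the network rather than computing.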

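With batching and concurrency combined, the effect is dramatic: the time to translate 8000 short sentences drops from 2 hours to 10 seconds, a speedup of roughly 700 times. It seems Goslate is not only free but also far more efficient than the official Google API.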

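Design

It works and it is fast; the next requirement is ease of use, which is a matter of API design. The overall design principle of the Goslate API is to be simple but not simplistic (Make things as simple as possible, but not simpler).

Goslate's functionality is simple, but the devil is in the details. How should the character encoding of inputs and outputs, the proxy settings, timeout handling, the number of concurrent threads and so on be organized sensibly? Following the design principle, we divide flexibility into three categories and treat each differently:

Essential flexibility, such as the source text to translate and the target language. These are basic functions and become parameters of the API

Flexibility needed only in advanced scenarios, such as the proxy address, the number of retries after an error, and the number of concurrent threads. Users normally do not care about these but want control in special cases. Goslate resolves this dilemma with default parameter values. For further simplicity and for performance, all of this flexibility is set once in the constructor, Goslate.__init__()

Pointless flexibility, such as the text encoding. One must dare to say no to this kind of flexibility; over-design only adds needless complexity. Goslate accepts only Unicode strings or UTF-8 encoded byte strings as input and always returns Unicode strings; convert to whatever encoding you need yourself. Why is such flexibility pointless? Because users already have a more natural way to achieve the same thing. Refusing pointless flexibility actually gives users real flexibility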

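There are other design considerations as well:

Eliminate global state: all state lives in the Goslate object. If you wish, you can even replace or customize the HTTP opener provided by urllib2 and the thread pool.

Apply dependency injection: every flexibility setting follows the principle of least knowledge (Law of Demeter) and depends on a direct external interface. For example, the proxy address is not passed to Goslate as a parameter; instead, you construct a urllib2 opener that supports the proxy and pass it to the Goslate constructor. The immediate benefit is fewer parameters. The deeper benefit is decoupling: the implementation depends on the opener's interface, not on a concrete opener implementation, and certainly not on proxy configuration. You can supply a proxy-aware opener to route requests through a proxy, or an opener with user authentication to cope with a complex network environment. In the extreme case you can even write your own opener from scratch to meet special needs

Batch queries take and return generators rather than lists. This lets you organize code as a pipeline: the source texts of a batch can come from a generator, and each result is available from the returned generator as soon as it is translated, so later processing need not wait for the whole batch to finish, which improves the efficiency of the overall flow.

A command-line interface is also provided, so simple translation tasks can be done directly from the command line. Its parameters follow the Unix philosophy: read the source text from standard input and write to standard output, which makes it easy to combine with other tools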
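Two of these points, the injected opener and the generator-based batch interface, can be illustrated with a toy class. Everything here is hypothetical: SketchTranslator, FakeOpener and the placeholder URL are invented for the example and say nothing about how Goslate is actually written. The only thing the translator relies on is an object with an open(url, timeout=...) method that returns a file-like response, which is the interface a urllib opener offers.

```python
import io
from urllib.parse import urlencode


class SketchTranslator:
    """Toy translator that receives its HTTP opener from the caller."""

    def __init__(self, opener, base_url="http://translate.example.invalid/t", timeout=4):
        # Depend only on the opener interface, not on how it was configured.
        self._opener = opener
        self._base_url = base_url
        self._timeout = timeout

    def translate(self, texts):
        """Lazily yield one decoded response per input text."""
        for text in texts:
            url = self._base_url + "?" + urlencode({"text": text})
            response = self._opener.open(url, timeout=self._timeout)
            yield response.read().decode("utf-8")


class FakeOpener:
    """Test double: records URLs and echoes them back without using the network."""

    def __init__(self):
        self.calls = []

    def open(self, url, timeout=None):
        self.calls.append(url)
        return io.BytesIO(url.encode("utf-8"))


opener = FakeOpener()
results = SketchTranslator(opener).translate(iter(["hello", "world"]))
first = next(results)  # the first request is issued only at this point
print(first, len(opener.calls))
```

Swapping FakeOpener for a proxy-aware or authenticating opener changes the network behaviour without touching the translator, and the generator lets a caller consume each result before the next request is made.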

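Open Source

If the library were only for my own use, the API would not need to be thought through so carefully. The reason for all the polish is that the goal was open source. The Python community supports open source very well; an open-source library only needs to follow certain conventions:

License: after a long search, I chose the liberal MIT license

Code hosting: Goslate is hosted on Bitbucket. It is less famous than GitHub, but it lets me use hg, which I prefer

Unit tests: automated unit tests are essential; they are a responsibility to users and also let me optimize with confidence. Python's standard unittest framework is easy to use. Goslate also includes doctests in its docstrings

Documentation: generated with Sphinx, the Python community's standard documentation tool, which can extract information from the code automatically. Once generated, the documentation can be uploaded to pythonhosted for free hosting

Deployment: fill in the metadata in setup.py as usual, then package the code and upload it to PyPI (detailed steps are here). Users all over the world can then install it with pip or easy_install. To reach a wider audience, Goslate also took some effort to support both Python 2 and Python 3

Promotion: even good wine needs advertising, let alone my small open-source library. That is the purpose of this article

Looking back, Goslate has 700 lines of code in total, of which only 300 are functional code; the remaining 400 comprise 150 lines of documentation comments and 250 lines of unit tests. The library is small, but doing every part well took far more work than expected; in the end, far less time went into the functional code than into the surrounding open-source work.

Great things are accomplished through attention to detail. How true!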

Goslate: Free Google Translate API

Note

Google has recently updated its translation service with a ticket mechanism to prevent simple crawler programs like goslate from accessing it. Though a more sophisticated crawler may still work technically, it would cross the fine line between using the service and breaking the service. goslate will not be updated to break Google's ticket mechanism. Free lunch is over. Thanks for using.

Simple Usage

Installation

Proxy Support

Romanization

Language Detection

Concurrent Querying

Batch Translation

Performance Consideration

Lookup Details in Dictionary

Query Error

API References

Command Line Interface

How to Contribute

What’s New

1.5.0

1.4.0

1.3.2

1.3.0

Reference

Donate

goslate provides you a free Python API to the Google translation service by querying the Google translation website.

It is:

Free: get translations through the public Google website without a fee

Fast: batch, cache and fetch concurrently

Simple: single-file module, just Goslate().translate('Hi!', 'zh')


Simple Usage

The basic usage is simple:

>>> import goslate
>>> gs = goslate.Goslate()
>>> print(gs.translate('hello world', 'de'))
hallo welt


Installation

goslate supports both Python 2 and Python 3. You can install it via:

$ pip install goslate


or just download the latest goslate.py directly and use it.

The futures package is optional but recommended for best performance in large text translation tasks.

Proxy Support

Proxy support can be added as follows:

import urllib2
import goslate

# Describe the proxy to use for HTTP requests
proxy_handler = urllib2.ProxyHandler({"http" : "http://proxy-domain.name:8080"})
# Pass the handler itself to build_opener so the proxy is actually installed
proxy_opener = urllib2.build_opener(proxy_handler)

# Goslate sends all of its queries through the injected opener
gs_with_proxy = goslate.Goslate(opener=proxy_opener)
translation = gs_with_proxy.translate("hello world", "de")


Romanization

Romanization or latinization (or romanisation, latinisation), in linguistics, is the conversion of writing from a different writing system to the Roman (Latin) script, or a system for doing so.

For example, pinyin is the default romanization method for the Chinese language.

You can get the translation in romanized writing as follows:

>>> import goslate
>>> roman_gs = goslate.Goslate(writing=goslate.WRITING_ROMAN)
>>> print(roman_gs.translate('China', 'zh'))
Zhōngguó


You can also get the translation in both the native writing system and the roman writing system:

>>> import goslate
>>> gs = goslate.Goslate(writing=goslate.WRITING_NATIVE_AND_ROMAN)
>>> gs.translate('China', 'zh')
('中国', 'Zhōngguó')


The result is a tuple in this case: (Translation-in-Native-Writing, Translation-in-Roman-Writing)


Language Detection

Sometimes all you need is to find out which language a text is in:

>>> import goslate
>>> gs = goslate.Goslate()
>>> language_id = gs.detect('hallo welt')
>>> language_id
'de'
>>> gs.get_languages()[language_id]
'German'


Concurrent Querying

It is not necessary to roll your own multi-thread solution to speed up massive translation. Goslate has already done it for you. It uses concurrent.futures for concurrent querying. The max worker number is 120 by default.

The worker number can be changed as follows:

>>> import goslate
>>> import concurrent.futures
>>> executor = concurrent.futures.ThreadPoolExecutor(max_workers=200)
>>> gs = goslate.Goslate(executor=executor)
>>> it = gs.translate(['text1', 'text2', 'text3'], 'de')
>>> list(it)
['translation1', 'translation2', 'translation3']


It is advised to install the concurrent.futures backport library in Python 2.7 (Python 3 has it by default) to enable concurrent querying.

The input can be a list, tuple or any iterator, even a file object, which iterates line by line:

>>> translated_lines = gs.translate(open('readme.txt'), 'de')
>>> translation = '\n'.join(translated_lines)


Do not worry that short texts will increase the query time. Internally, goslate joins small texts into one big text to reduce unnecessary query round trips.

Batch Translation

Google translation does not support very long text. goslate bypasses this limitation by splitting the long text internally before sending it to Google and joining the multiple results into one translation text for the end user.

>>> import goslate
>>> with open('the game of thrones.txt', 'r') as f:
...     novel_text = f.read()
>>> gs = goslate.Goslate()
>>> gs.translate(novel_text, 'de')


Performance Consideration

Goslate uses batching and concurrent fetching aggressively to maximize translation speed internally.

All you need to do is reduce the number of API calls by utilizing batch translation and concurrent querying.

For example, say you want to translate 3 big text files. Instead of manually translating them one by one, line by line:

import goslate

big_files = ['a.txt', 'b.txt', 'c.txt']
gs = goslate.Goslate()

translation = []
for big_file in big_files:
    with open(big_file, 'r') as f:
        translated_lines = []
        # One API call per line: this is the slow pattern to avoid
        for line in f:
            translated_line = gs.translate(line, 'de')
            translated_lines.append(translated_line)

    translation.append('\n'.join(translated_lines))


It is better to leave them to Goslate totally. The following code is not only simpler but also much faster (+100x) :

import goslate

big_files = ['a.txt', 'b.txt', 'c.txt']
gs = goslate.Goslate()

# Hand all texts to Goslate at once and let it batch and parallelize
translation_iter = gs.translate((open(big_file, 'r').read() for big_file in big_files), 'de')
translation = list(translation_iter)


Internally, goslate first adjusts the texts so that they are neither too big to fit the Google query API nor so small that they increase the total number of HTTP queries. Then it uses concurrent queries to speed things up even further.

Lookup Details in Dictionary

If you want a detailed dictionary explanation for a single word/phrase, you can:

>>> import goslate
>>> gs = goslate.Goslate()
>>> gs.lookup_dictionary('sun', 'de')
[[['Sonne', 'sun', 0]],
 [['noun',
   ['Sonne'],
   [['Sonne', ['sun', 'Sun', 'Sol'], 0.44374731, 'die']],
   'sun',
   1],
  ['verb',
   ['der Sonne aussetzen'],
   [['der Sonne aussetzen', ['sun'], 1.1544633e-06]],
   'sun',
   2]],
 'en',
 0.9447732,
 [['en'], [0.9447732]]]


There are 2 limitations for this API:

The result is a complex list structure which you have to parse for your own usage

The input must be a single word/phrase; batch translation and concurrent querying are not supported
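As an example of the parsing left to the caller, the sketch below pulls the candidate translations per part of speech out of the sample result shown above. The meaning of each position in the structure is inferred from that one sample, not from any documented schema, so treat the indices as assumptions. The helper name summarize_lookup is made up.

```python
# The sample lookup_dictionary('sun', 'de') result from the text above.
SAMPLE = [
    [["Sonne", "sun", 0]],
    [
        ["noun", ["Sonne"],
         [["Sonne", ["sun", "Sun", "Sol"], 0.44374731, "die"]], "sun", 1],
        ["verb", ["der Sonne aussetzen"],
         [["der Sonne aussetzen", ["sun"], 1.1544633e-06]], "sun", 2],
    ],
    "en",
    0.9447732,
    [["en"], [0.9447732]],
]


def summarize_lookup(result):
    """Map each part of speech to its list of candidate translations.

    Assumes result[1] holds one entry per part of speech, laid out as
    [part_of_speech, translations, details, source_word, index].
    """
    summary = {}
    for entry in result[1]:
        part_of_speech, translations = entry[0], entry[1]
        summary[part_of_speech] = list(translations)
    return summary


print(summarize_lookup(SAMPLE))
```

Since the layout comes from an undocumented web response, code like this should fail loudly on unexpected shapes rather than guess.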

Query Error

If you get an HTTP 5xx error, it is probably because Google has banned your client IP address from translation querying.

You can verify it by accessing the Google translation service in a browser manually.

You can try the following to overcome this issue:

query through an HTTP/SOCKS5 proxy, see Proxy Support

use another Google domain for translation: gs = Goslate(service_urls=['http://translate.google.de'])

wait for 3 seconds before issuing another query
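The "wait and try again" advice can be wrapped in a small helper like the one below. This is a generic sketch, not part of goslate: call_with_retry is a made-up name, the retry count mirrors the retry_times default of 4 mentioned in the API reference, and the sleep function is injectable so the demo runs without real delays.

```python
import time


def call_with_retry(func, retries=4, wait_seconds=3, sleep=time.sleep):
    """Call func(); on an I/O error wait and retry, up to `retries` extra times."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return func()
        except OSError as error:  # urllib's URLError and HTTPError derive from OSError
            last_error = error
            if attempt < retries:
                sleep(wait_seconds)
    raise last_error


# Demo with a stand-in query that fails twice before succeeding.
attempts = []


def flaky_query():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("HTTP 503")
    return "ok"


waits = []
result = call_with_retry(flaky_query, sleep=waits.append)
print(result, waits)
```

In real use, func would be something like a lambda wrapping a single translate call, and sleep would be left at its default.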

API References

Please check the API reference.

Command Line Interface

goslate.py is also a command line tool which you can use directly.

Translate stdin input into Chinese in GBK encoding:

$ echo "hello world" | goslate.py -t zh-CN -o gbk


Translate 2 text files into Chinese, output to UTF-8 file

$ goslate.py -t zh-CN -o utf-8 source/1.txt "source 2.txt" > output.txt


Use --help for detailed usage:

$ goslate.py -h


How to Contribute

Report issues & suggestions

Fork repository

Donation

What’s New

1.5.0

Add new API Goslate.lookup_dictionary() to get detailed information for a single word/phrase, thanks to Adam for the suggestion

Improve documentation with more user scenarios and performance considerations

1.4.0

[fix bug] update to adapt latest google translation service changes

1.3.2

[fix bug] fix compatibility issue with latest google translation service JSON format changes

[fix bug] unit test failure

1.3.0

[new feature] Translation in roman writing system (romanization), thanks to Javier del Alamo for the contribution.

[new feature] Customizable service URL. You can provide multiple google translation service URLs for better concurrency performance

[new option] roman writing translation option for CLI

[fix bug] Google translation may change normal space to no-break space

[fix bug] Google web API changed for getting supported language list

Reference

Goslate: Free Google Translate API

exception goslate.Error

Error type

class goslate.Goslate(writing=(u'trans', ), opener=None, retry_times=4, executor=None, timeout=4, service_urls=(u'http://translate.google.com', ), debug=False)

All goslate APIs live in this class.

You have to create an instance of Goslate first to use this API.

Parameters:

writing – The translation writing system. Currently 3 values are valid:

WRITING_NATIVE for the native writing system

WRITING_ROMAN for the roman writing system

WRITING_NATIVE_AND_ROMAN for both native and roman writing systems. The output will be a tuple in this case

opener (urllib2.OpenerDirector) – The URL opener to be used for HTTP/HTTPS queries. If not provided, a default opener will be used. For proxy support you should provide an opener with ProxyHandler

retry_times (int) – how many times to retry when a connection reset error occurs. Defaults to 4

timeout (int/float) – HTTP request timeout in seconds

debug (bool) – Turn on/off the debug output

service_urls (single string or a sequence of strings) – Google translate URL list. URLs will be used randomly for better concurrent performance. For example ['http://translate.google.com', 'http://translate.google.de']

executor (futures.ThreadPoolExecutor) – the multi-thread executor for handling batch input. Defaults to a global futures.ThreadPoolExecutor instance with 120 max thread workers if futures is available. Set to None to disable multi-thread support

Note

Multi-thread workers rely on futures. If it is not available, goslate will work in single-thread mode

Example:
>>> import goslate
>>>
>>> # Create a Goslate instance first
>>> gs = goslate.Goslate()
>>>
>>> # You could get all supported language list through get_languages
>>> languages = gs.get_languages()
>>> print(languages['en'])
English
>>>
>>> # Translate English into German
>>> print(gs.translate('Hello', 'de'))
Hallo
>>> # Detect the language of the text
>>> print(gs.detect('some English words'))
en
>>> # Get a goslate object dedicated to romanized translation (romanization)
>>> gs_roman = goslate.Goslate(goslate.WRITING_ROMAN)
>>> print(gs_roman.translate('hello', 'zh'))
Nín hǎo


detect(text)

Detect language of the input text

Note

Input all source strings at once. Goslate will detect concurrently to maximize speed.

futures is required for best performance.

It returns a generator on batch input in order to better fit a pipeline architecture.

Parameters: text (UTF-8 str; unicode; sequence of string) – The source text(s) whose language you want to identify. Batch detection is supported via sequence input

Returns: the language code(s)

unicode: on single string input

generator of unicode: on batch input of string sequence

Raises: Error if parameter type or value is not valid

Example:

>>> gs = Goslate()
>>> print(gs.detect('hello world'))
en
>>> for i in gs.detect([u'hello', 'Hallo']):
...     print(i)
...
en
de


get_languages()

Discover supported languages

It returns ISO 639-1 language codes for the languages supported for translation. Some language codes also include a country code, like zh-CN or zh-TW.

Note

It queries Google only once, the first time, and uses the cached result afterwards

Returns: a dict mapping every supported language code to its language name: {'language-code': 'Language name'}

Example:
>>> languages = Goslate().get_languages()
>>> assert 'zh' in languages
>>> print(languages['zh'])
Chinese


lookup_dictionary(text, target_language, source_language=u'auto', examples=False, etymology=False, pronunciation=False, related_words=False, synonyms=False, antonyms=False, output_language=None)

Look up the detailed meaning of a single word/phrase

Note

Do not input a sequence of texts

Parameters:

text (UTF-8 str) – The source word/phrase you want to look up.

target_language (str; unicode) – The language to translate the source text into. The value should be one of the language codes listed in get_languages()

source_language (str; unicode) – The language of the source text. The value should be one of the language codes listed in get_languages(). If a language is not specified, the system will attempt to identify the source language automatically.

examples – include example sentences or not

pronunciation – include pronunciation in roman writing or not

related_words – include related words or not

output_language – the dictionary’s own language, defaults to English.

Returns: a complex list structure containing multiple translation meanings for this word/phrase and detailed explanations.

translate(text, target_language, source_language=u'auto')

Translate text from source language to target language

Note

Input all source strings at once. Goslate will batch and fetch concurrently to maximize speed.

futures is required for best performance.

It returns a generator on batch input in order to better fit a pipeline architecture

Parameters:

text (UTF-8 str; unicode; string sequence (list, tuple, iterator, generator)) – The source text(s) to be translated. Batch translation is supported via sequence input

target_language (str; unicode) – The language to translate the source text into. The value should be one of the language codes listed in get_languages()

source_language (str; unicode) – The language of the source text. The value should be one of the language codes listed in get_languages(). If a language is not specified, the system will attempt to identify the source language automatically.

Returns: the translated text(s)

unicode: on single string input

generator of unicode: on batch input of string sequence

tuple: if WRITING_NATIVE_AND_ROMAN is specified, it returns a tuple, or a generator of tuples, of the form (u"native", u"roman format")

Raises:

Error('invalid target language') if the target language is not set

Error('input too large') if the input is a single large word without any punctuation or space in between

Example:
>>> gs = Goslate()
>>> print(gs.translate('Hello World', 'de'))
Hallo Welt
>>>
>>> for i in gs.translate(['good', u'morning'], 'de'):
...     print(i)
...
gut
Morgen


To output romanized translation

Example:
>>> gs_roman = Goslate(WRITING_ROMAN)
>>> print(gs_roman.translate('Hello', 'zh'))
Nín hǎo


goslate.WRITING_NATIVE = (u'trans',)

native target language writing system

goslate.WRITING_NATIVE_AND_ROMAN = (u'trans', u'translit')

both native and roman writing. The output will be a tuple

goslate.WRITING_ROMAN = (u'translit',)

romanized writing system. Only valid for some languages; otherwise it outputs an empty string
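The values of these constants match the trans and translit field names in the sample response shown in the article above, which suggests that a writing setting is simply the tuple of response fields to extract. The sketch below illustrates that reading; it is a guess at the relationship rather than goslate's confirmed implementation, and select_writing is a hypothetical helper.

```python
WRITING_NATIVE = ("trans",)
WRITING_ROMAN = ("translit",)
WRITING_NATIVE_AND_ROMAN = ("trans", "translit")


def select_writing(sentences, writing):
    """Join the requested field(s) across all sentences of a response.

    Returns a single string for one field and a tuple for several fields.
    Missing fields yield an empty string.
    """
    parts = tuple("".join(s.get(key, "") for s in sentences) for key in writing)
    return parts[0] if len(parts) == 1 else parts


sentences = [{"trans": "中国", "translit": "Zhōngguó"}]
print(select_writing(sentences, WRITING_ROMAN))
print(select_writing(sentences, WRITING_NATIVE_AND_ROMAN))
```

Under this reading, the empty string documented for languages without romanization is what falls out when the translit field is missing or empty.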