Goslate: a free Google Translate project (the article covers the complete design, code, open-sourcing, and deployment process)

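2014-01-28 14:54
Below is a project by a very capable developer. The project itself is not difficult, but it covers design, documentation, and deployment, so it is a complete project that is well worth studying.

The project lives at https://bitbucket.org/zhuoqiang/goslate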


Goslate: Free Google Translate

Motivation
Usage
How it works
Optimization
Design
Open source


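Motivation

Machine translation is poor in quality, but it saves time and effort. Among the common online translation systems, Google's is one of the better ones. Google Translate not only offers a web interface but also exposes an API that programs can call directly. The catch is that the Google Translate API became a paid service in 2012. That does not stop a resourceful programmer: as long as translation on the Google website is free, you can always write a crawler that fetches results from the site automatically. I spent some time writing such a crawler and then wrapped it in a simple, efficient Python library for using Google Translate for free. That library is Goslate (Google Translate).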


Usage

Goslate supports Python 2.6 and later, including Python 3. You can install it with pip or easy_install:

$ pip install goslate


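Goslate currently consists of a single Python file, so you can also download the latest goslate.py directly. It is simple to use. Here is an English-to-German example:

>>> import goslate
>>> gs = goslate.Goslate()
>>> print(gs.translate('hello world', 'de'))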
hallo welt


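goslate.py is not only a Python module but also a command-line tool that you can run directly.

Translate English from standard input into Chinese and print the result to the screen: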

$ echo "hello world" | goslate.py -t zh-CN


Translate two files and save the result to out.txt in UTF-8 encoding:

$ goslate.py -t zh-CN -o utf-8 src/1.txt "src 2.txt" > out.txt


See the documentation for more advanced usage.


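How it works

To use the official Google Translate API you must first pay for a key. If the key is invalid, the API returns an error and refuses service. So how does Goslate get around key validation? Does it use a back door? Quite the opposite: Goslate walks openly through the front door, that is, it fetches results directly from the Google Translate website.

If we translate hello world in the browser and capture the traffic, we see that the browser requests this URL: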

http://translate.google.com/translate_a/t?client=t&hl=en&sl=en&tl=zh-CN&ie=UTF-8&oe=UTF-8&multires=1&prev=conf&psl=en&ptl=en&otf=1&it=sel.2016&ssel=0&tsel=0&prev=enter&oc=3&ssel=0&tsel=0&sc=1&text=hello%20world

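It is easy to see that the source text is encoded directly in the URL as the text parameter. The tl parameter gives the target language, here zh-CN (Simplified Chinese).

Google Translate returns: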

{"sentences":[{"trans":"世界,你好!","orig":"hello world!","translit":"Shìjiè, nǐ hǎo!","src_translit":""},{"trans":"认识你很高兴。","orig":"nice to meet you.","translit":"Rènshi nǐ hěn gāoxìng.","src_translit":""}],"src":"en","server_time":48}

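The format resembles JSON but is not standard JSON. Besides the translation, it contains extra information such as the pinyin transliteration and the language of the source text, presumably to support special features of the client.

The process is simple. The crawler logic is:

First combine the source text and the target language into a URL like the one above
Then use Python's urllib2 to fetch the result from the Google Translate site with an HTTP GET
Finally extract the translation from the returned data

One thing to watch out for: Google does not like Python crawlers :) It blocks every HTTP request whose User-Agent is Python-urllib/2.7. We have to pose as a browser by sending User-Agent: Mozilla/4.0 to put Google at ease. There is one more trick: if the client parameter in the URL is changed from t to any other value, the response is standard JSON, which is easy to parse.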
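
The three crawler steps can be sketched roughly as follows. This is a minimal Python 3 sketch, not the library's code: build_request and extract_translation are names made up for illustration, the endpoint is the one from the captured request and may no longer respond, and the sample response is a hand-written string shaped like the one shown earlier. The request is built but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the captured browser request; it may no longer work.
TRANSLATE_URL = 'http://translate.google.com/translate_a/t'


def build_request(text, target_language, source_language=''):
    """Build (but do not send) the GET request for one piece of text."""
    params = {
        'client': 'z',            # any value other than 't' returned standard JSON
        'sl': source_language,    # empty string leaves detection to the server
        'tl': target_language,
        'ie': 'UTF-8',
        'oe': 'UTF-8',
        'text': text,
    }
    url = TRANSLATE_URL + '?' + urlencode(params)
    # The default Python-urllib User-Agent was rejected, so pose as a browser.
    return Request(url, headers={'User-Agent': 'Mozilla/4.0'})


def extract_translation(response_text):
    """Join the 'trans' field of every sentence in the JSON response."""
    data = json.loads(response_text)
    return ''.join(sentence['trans'] for sentence in data['sentences'])


request = build_request('hello world', 'zh-CN')
print(request.full_url)

# Hand-written sample response (an assumption, not captured output).
sample = '{"sentences":[{"trans":"Hallo Welt","orig":"hello world"}],"src":"en"}'
print(extract_translation(sample))
```

Sending the request would be a matter of passing it to an opener's open() method and decoding the body as UTF-8 before calling extract_translation.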


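Optimization

The crawler works, but it has two problems:

Short: because URL length is limited, it can only translate short texts of no more than 2000 bytes. Longer texts have to be split by hand and translated in several passes
Slow: every translation costs one HTTP round trip of nearly 1 second, which is a heavy overhead. Translating 8000 short sentences, for example, takes nearly 2 hours

The length problem is solved by automatic splitting and multiple queries: for a long text, Goslate splits it at separators such as punctuation and line breaks into several sub-texts of close to 2000 bytes each, queries them one by one, and returns the concatenated translations to the user. This is how Goslate gets past the text length limit.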
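
The splitting idea can be sketched as below. It is a simplified illustration rather than Goslate's own routine: split_text, SEPARATORS, and the max_bytes parameter are assumptions, only a handful of ASCII separators are considered, and the cut is placed after the right-most separator that still fits.

```python
SEPARATORS = '.!?,;\n '  # a small ASCII subset of what a real splitter would use


def split_text(text, max_bytes=2000):
    """Yield pieces of text whose UTF-8 encoding is at most max_bytes long.

    Each cut is made just after the last separator that still fits, so
    joining the pieces gives back the original text.
    """
    data = text.encode('utf-8')
    start = 0
    while len(data) - start > max_bytes:
        window = data[start:start + max_bytes]
        # Right-most separator in the window. ASCII bytes never occur inside
        # a multi-byte UTF-8 character, so the cut is always safe to decode.
        cut = max(window.rfind(s.encode('utf-8')) for s in SEPARATORS)
        if cut == -1:
            raise ValueError('no separator found; input too large')
        end = start + cut + 1
        yield data[start:end].decode('utf-8')
        start = end
    yield data[start:].decode('utf-8')


long_text = 'one two three. ' * 10
pieces = list(split_text(long_text, max_bytes=40))
print(len(pieces), 'pieces')
```

A single word longer than the limit cannot be split this way, which is why the sketch raises an error in that case.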

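The speed problem is harder, because performance is bound by network latency. The official Google API accepts several texts at once for batch translation, which greatly reduces the number of HTTP round trips. Goslate supports batch translation too: since a single query allows up to 2000 bytes of text, it makes the most of that allowance. When the user passes in several texts, Goslate joins small texts into larger ones of no more than 2000 bytes, translates each with a single HTTP request, and then splits the result back into the corresponding translations.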
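
Here is a small sketch of that packing step. The function names and the max_bytes parameter are illustrative assumptions. The boundary marker is the rarely used character U+26FF, which the source code further down also uses, and the approach relies on the translation service passing that marker through unchanged.

```python
JOINT = '\u26ff'  # rarely used character marking the boundary between texts


def join_texts(texts, max_bytes=2000):
    """Greedily pack small texts into batches of up to max_bytes UTF-8 bytes.

    A single text that is already over the limit is passed through as its
    own batch; it would need long-text splitting afterwards.
    """
    separator = '\n' + JOINT + '\n'
    batch = None
    for text in texts:
        candidate = text if batch is None else batch + separator + text
        if batch is None or len(candidate.encode('utf-8')) <= max_bytes:
            batch = candidate
        else:
            yield batch
            batch = text
    if batch is not None:
        yield batch


def split_result(translated_batch):
    """Recover the individual translations from one translated batch."""
    return [part.strip('\n') for part in translated_batch.split(JOINT)]


batches = list(join_texts(['thank', 'you', 'very', 'much'], max_bytes=20))
print(len(batches), 'requests instead of 4')
print(split_result(batches[0]))
```

Because join_texts is a generator, it can consume a lazy input sequence and hand out batches as soon as each one fills up.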

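Note that batch querying and long-text support are exact opposites: batch querying joins texts into a large block, translates it once, and splits the result, while long-text support splits the text, translates the pieces separately, and joins the results. If one text in a batch is too long, it must first be split and then merged with the small texts before and after it. The logic looks complicated, but it only takes sensible layering to implement:

The lowest layer, Goslate._basic_translate(), translates a single text through one HTTP request and does not split long texts
Goslate._translate_single_text() builds on _basic_translate() and supports long texts through automatic splitting and multiple queries
Finally the public API, Goslate.translate(), joins texts and then calls _translate_single_text() to support batch translation

With these three layers, the complicated split-and-join logic is expressed cleanly. "All problems in computer science can be solved by another level of indirection": how true.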
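
To make the layering concrete, here is a toy version in which each layer calls only the one below it. Everything in it is an assumption made for illustration: the class name, the character-based limit, splitting only at single spaces, and a backend function that stands in for the HTTP call.

```python
JOINT = '\u26ff'  # boundary marker between the texts of a batch


class LayeredTranslator:
    """Three layers, each built only on the one below it (illustrative only)."""

    def __init__(self, backend, max_chars=20):
        self._backend = backend      # stand-in for the HTTP request
        self._max_chars = max_chars

    def _basic_translate(self, text):
        # Layer 1: one request for one short text; refuses anything too long.
        if len(text) > self._max_chars:
            raise ValueError('input too large')
        return self._backend(text)

    def _translate_single_text(self, text):
        # Layer 2: split a long text at single spaces, translate, rejoin.
        pieces, current = [], ''
        for word in text.split(' '):
            candidate = word if not current else current + ' ' + word
            if not current or len(candidate) <= self._max_chars:
                current = candidate
            else:
                pieces.append(current)
                current = word
        pieces.append(current)
        return ' '.join(self._basic_translate(p) for p in pieces)

    def translate(self, texts):
        # Layer 3: pack the batch into one text, translate it, unpack it.
        joined = (' ' + JOINT + ' ').join(texts)
        translated = self._translate_single_text(joined)
        return [part.strip() for part in translated.split(JOINT)]


calls = []


def fake_backend(text):
    """Pretend translation: record the call and upper-case the text."""
    calls.append(text)
    return text.upper()


translator = LayeredTranslator(fake_backend, max_chars=12)
print(translator.translate(['hello world', 'good morning everyone']))
print(len(calls), 'backend calls')
```

The top layer never needs to know that its joined text was too long for one request; the middle layer absorbs that detail.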

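Another way to speed things up is concurrency. Goslate uses a thread pool internally to issue several HTTP requests concurrently, which greatly speeds up queries. Rather than reinvent the wheel, it uses the thread pool provided by the futures library.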
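
The effect of a thread pool on latency-bound work can be seen with a small experiment. This sketch uses concurrent.futures from the Python 3 standard library (the futures package mentioned above is its Python 2 backport). fake_fetch is a stand-in that sleeps instead of touching the network, and the worker count is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(text):
    """Stand-in for one HTTP round trip; sleeps instead of using the network."""
    time.sleep(0.05)
    return text.upper()


def translate_all(texts, max_workers=8):
    """Run one fetch per text on a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() submits every task up front and yields results in input order.
        return list(pool.map(fake_fetch, texts))


texts = ['text %d' % i for i in range(16)]
started = time.time()
results = translate_all(texts)
elapsed = time.time() - started
print(results[:2], 'in %.2f s' % elapsed)
```

Run one after another, the 16 simulated round trips would take about 0.8 seconds; on the pool they overlap, so the total should be much shorter.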

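Batching plus concurrency makes a dramatic difference: translating 8000 short sentences drops from 2 hours to 10 seconds, a speedup of roughly 700 times. It seems Goslate is not only free but also far more efficient than the official Google API.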


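Design

It works and it is fast; the next requirement is that it be easy to use. That is a matter of API design. The overall design principle of the Goslate API is simple but not simplistic ("Make things as simple as possible, but not simpler").

Goslate's functionality is simple, but the devil is in the details. How should things like the character encoding of inputs and outputs, proxy settings, timeout handling, and the number of concurrent threads be organized? Following the design principle, we divide flexibility into three categories and treat each differently:

Essential flexibility, such as the source text to translate and the target language. These are basic functionality and are passed as API arguments
Flexibility needed only in advanced scenarios, such as the proxy address, the number of retries after an error, and the number of concurrent threads. Users usually do not care about these but want control in special cases. Goslate resolves the dilemma with default parameter values. For further simplicity and for performance, all flexibility of this kind is set once, in the constructor Goslate.__init__()
Pointless flexibility, such as the text encoding. You have to be willing to say no to this kind of flexibility; over-design only adds unnecessary complexity. Goslate accepts only Unicode strings or UTF-8 encoded byte strings as input and always returns Unicode strings; if you need another encoding, convert it yourself. Why is this flexibility pointless? Because users already have a more natural way to get the same result. Refusing pointless flexibility actually gives users real flexibility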
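
The three categories map onto code roughly like this. The Translator class and to_text helper are hypothetical, and translate() only tags its input instead of translating it; the default values 4 and 4 mirror the constructor in the source code further down.

```python
def to_text(value):
    """Accept Unicode text or UTF-8 bytes; always hand back Unicode text."""
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return value


class Translator:
    """Hypothetical class showing where each kind of flexibility lives."""

    def __init__(self, retry_times=4, timeout=4):
        # Advanced knobs: set once, with defaults most callers never touch.
        self.retry_times = retry_times
        self.timeout = timeout

    def translate(self, text, target_language):
        # Essential inputs are plain arguments; the result is always Unicode.
        # Placeholder body: tags the text instead of translating it.
        return '[%s] %s' % (target_language, to_text(text))


default = Translator()
patient = Translator(timeout=10)           # only special cases override defaults
result = default.translate('caf\u00e9'.encode('utf-8'), 'de')
print(result)
# Callers who need another encoding convert the result themselves.
latin1_bytes = result.encode('latin-1')
```

No output-encoding parameter is needed, because str.encode already gives callers that choice.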

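There are other design considerations:

Eliminate global state: all state lives in the Goslate object. If you wish, you can replace or customize even the HTTP opener provided by urllib2 and the thread pool.
Apply dependency injection: every configurable setting depends on a direct external interface, in keeping with the principle of least knowledge (Law of Demeter). For example, the proxy address is not passed to Goslate as a parameter; instead you build a urllib2 opener that supports a proxy and pass it to the Goslate constructor. The immediate benefit is fewer parameters. The deeper benefit is decoupling: the implementation depends on the opener interface, not on a concrete opener implementation and certainly not on the proxy configuration. You can supply a proxy-aware opener to go through a proxy, or an opener that supports user authentication to cope with a complicated network environment. In extreme cases you can even write a custom opener from scratch to meet special needs
Batch queries take and return generators rather than lists. This lets you organize code as a pipeline: the sequence of source texts can be a generator, and as each text is translated its result becomes available from the returned generator for further processing, without waiting for the whole batch to finish, which makes the overall flow more efficient.
A command-line interface is also provided, so simple translation tasks can be done straight from the command line. The command-line design follows the Unix philosophy: read the source text from standard input and write to standard output, which makes it easy to combine with other tools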
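
The dependency-injection point can be illustrated with a short Python 3 sketch (urllib.request is the Python 3 home of urllib2's opener classes). Translator and FakeOpener are hypothetical stand-ins and the proxy address is a placeholder; with the real library, the equivalent step would be passing the opener to the Goslate constructor through its opener parameter.

```python
import io
from urllib.request import ProxyHandler, build_opener


class Translator:
    """Hypothetical stand-in that depends only on an opener's open() method."""

    def __init__(self, opener=None):
        # No proxy parameter here: callers inject a ready-made opener instead.
        self._opener = opener or build_opener()

    def fetch(self, url, timeout=4):
        return self._opener.open(url, timeout=timeout).read()


class FakeOpener:
    """Test double: anything with a compatible open() can be injected."""

    def open(self, url, timeout=None):
        return io.BytesIO(b'response for ' + url.encode('utf-8'))


# A proxy-aware opener; the proxy address is a placeholder, nothing is sent.
proxy_opener = build_opener(ProxyHandler({'http': 'http://proxy.example.com:8080'}))
via_proxy = Translator(opener=proxy_opener)

# The same class works offline with the fake opener.
offline = Translator(opener=FakeOpener())
print(offline.fetch('http://example.com/'))
```

Since the class only calls open(), swapping in a fake opener is also a convenient way to test without network access.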


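Open source

If the library were only for my own use, the API would not need this much thought. The reason for all the polish is that the goal was to open-source it. The Python community supports open source very well; an open-source library only needs to follow a few conventions:

License: after much deliberation I chose the permissive MIT license
Code hosting: Goslate is hosted on Bitbucket, which is less famous than GitHub but lets me use hg, which I prefer
Unit tests: automated unit tests are essential, both as a responsibility to users and so that I can optimize with confidence. Python's standard unittest framework works well. Goslate also includes doctests in its docstrings
Documentation: generated with sphinx, the standard documentation tool in the Python community, which can extract information from the code automatically. The generated documentation can be uploaded to pythonhosted for free hosting
Deployment: following convention, write the metadata in setup.py, then package the code and upload it to PyPI (detailed steps here). Users around the world can then install it with pip or easy_install. To reach a wider audience, Goslate also took some effort to support both Python 2 and Python 3
Publicity: even good wine goes unnoticed in a deep alley, let alone my small open-source library. That is the purpose of this article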
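
As a small illustration of the testing item above, the sketch below puts a doctest and a unittest case side by side. The function join_codes and both runner helpers are invented for this example and are not part of Goslate.

```python
import doctest
import unittest


def join_codes(codes):
    """Join language codes with commas.

    >>> join_codes(['en', 'de'])
    'en,de'
    """
    return ','.join(codes)


class JoinCodesTest(unittest.TestCase):
    def test_empty_input(self):
        self.assertEqual(join_codes([]), '')

    def test_single_code(self):
        self.assertEqual(join_codes(['zh-CN']), 'zh-CN')


def run_doctests(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # module=False skips module lookup, so this also works in a REPL.
    for test in finder.find(func, func.__name__, module=False,
                            globs={func.__name__: func}):
        runner.run(test)
    return runner.failures, runner.tries


def run_unittests(case):
    """Run one TestCase class; return True if every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


print(run_doctests(join_codes))
print(run_unittests(JoinCodesTest))
```

In a normal project layout the usual entry points would be doctest.testmod() and unittest.main(); the explicit runners here just keep the sketch self-contained.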

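Looking back, Goslate has 700 lines of code, of which only 300 are functional code; the remaining 400 comprise 150 lines of documentation comments and 250 lines of unit tests. The library is small, but doing everything properly takes far more work than expected: writing the functional code took much less time than the supporting work around open-sourcing it.

Great things are accomplished through attention to detail. How true!

Source code: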

#! /usr/bin/env python
# -*- coding: utf-8 -*-

'''Goslate: Free Google Translate API
'''
from __future__ import print_function
from __future__ import unicode_literals

import sys
import os
import json
import itertools
import functools
import time
import socket
import xml.etree.ElementTree

try:
    from urllib.request import build_opener, Request, HTTPHandler, HTTPSHandler
    from urllib.parse import quote_plus, urlencode, unquote_plus
except ImportError:
    from urllib2 import build_opener, Request, HTTPHandler, HTTPSHandler
    from urllib import urlencode, unquote_plus, quote_plus

try:
    import concurrent.futures
    _g_executor = concurrent.futures.ThreadPoolExecutor(max_workers=120)
except ImportError:
    _g_executor = None

__author__ = 'ZHUO Qiang'
__email__ = 'zhuo.qiang@gmail.com'
__copyright__ = "2013, http://zhuoqiang.me"
__license__ = "MIT"
__date__ = '2013-05-11'
__version_info__ = (1, 1, 2)
__version__ = '.'.join(str(i) for i in __version_info__)
__home__ = 'https://bitbucket.org/zhuoqiang/goslate'
__download__ = 'https://pypi.python.org/pypi/goslate'

try:
    unicode
except NameError:
    unicode = str

def _is_sequence(arg):
    return (not isinstance(arg, unicode)) and (
        not isinstance(arg, bytes)) and (
        hasattr(arg, "__getitem__") or hasattr(arg, "__iter__"))

def _is_bytes(arg):
    return isinstance(arg, bytes)

class Error(Exception):
    '''Error type
    '''
    pass

class Goslate(object):
    '''All goslate API lives in this class

    You have to first create an instance of Goslate to use this API

    :param opener: The url opener to be used for HTTP/HTTPS query.
                   If not provided, a default opener will be used.
                   For proxy support you should provide an ``opener`` with ``ProxyHandler``
    :type opener: `urllib2.OpenerDirector <http://docs.python.org/2/library/urllib2.html#urllib2.OpenerDirector>`_

    :param retry_times: how many times to retry when a connection reset error occurred. Default to 4
    :type retry_times: int

    :param executor: the multi thread executor for handling batch input, default to a global ``futures.ThreadPoolExecutor`` instance with 120 max thread workers if ``futures`` is available. Set to None to disable multi thread support

                     .. note:: multi thread worker relies on `futures <https://pypi.python.org/pypi/futures>`_, if it is not available, ``goslate`` will work under single thread mode

    :type executor: ``futures.ThreadPoolExecutor``

    :param timeout: HTTP request timeout in seconds
    :type timeout: int/float

    :param debug: Turn on/off the debug output
    :type debug: bool

    :Example:

        >>> import goslate
        >>>
        >>> # Create a Goslate instance to use first
        >>> gs = goslate.Goslate()
        >>>
        >>> # You could get all supported language list through get_languages
        >>> languages = gs.get_languages()
        >>> print(languages['en'])
        English
        >>>
        >>> # Translate the languages' name into Chinese
        >>> language_names = languages.values()
        >>> language_names_in_chinese = gs.translate(language_names, 'zh')
        >>>
        >>> # verify each Chinese name is really in Chinese using detect
        >>> language_codes = gs.detect(language_names_in_chinese)
        >>> for code in language_codes:
        ...     assert 'zh-CN' == code
        ...
        >>>
    '''

    _MAX_LENGTH_PER_QUERY = 1800

    def __init__(self, opener=None, retry_times=4, executor=_g_executor, timeout=4, debug=False):
        # Honor the debug argument (it was previously ignored and forced to False)
        self._DEBUG = debug
        self._MIN_TASKS_FOR_CONCURRENT = 2
        self._opener = opener
        self._languages = None
        self._TIMEOUT = timeout
        if not self._opener:
            debuglevel = self._DEBUG and 1 or 0
            self._opener = build_opener(
                HTTPHandler(debuglevel=debuglevel),
                HTTPSHandler(debuglevel=debuglevel))

        self._RETRY_TIMES = retry_times
        self._executor = executor

    def _open_url(self, url):
        if len(url) > self._MAX_LENGTH_PER_QUERY+100:
            raise Error('input too large')

        # Google forbids urllib2 User-Agent: Python-urllib/2.7
        request = Request(url, headers={'User-Agent':'Mozilla/4.0'})

        exception = None
        # retry when get (<class 'socket.error'>, error(54, 'Connection reset by peer')
        for i in range(self._RETRY_TIMES):
            try:
                response = self._opener.open(request, timeout=self._TIMEOUT)
                response_content = response.read().decode('utf-8')
                if self._DEBUG:
                    print(response_content)
                return response_content
            except socket.error as e:
                if self._DEBUG:
                    import threading
                    print(threading.currentThread(), e)
                if 'Connection reset by peer' not in str(e):
                    raise e
                exception = e
                time.sleep(0.0001)
        raise exception

    def _execute(self, tasks):
        first_tasks = [next(tasks, None) for i in range(self._MIN_TASKS_FOR_CONCURRENT)]
        tasks = (task for task in itertools.chain(first_tasks, tasks) if task)

        if not first_tasks[-1] or not self._executor:
            for each in tasks:
                yield each()
        else:
            exception = None
            for each in [self._executor.submit(t) for t in tasks]:
                if exception:
                    each.cancel()
                else:
                    exception = each.exception()
                    if not exception:
                        yield each.result()

            if exception:
                raise exception

    def _basic_translate(self, text, target_language, source_language=''):
        # assert _is_bytes(text)

        if not target_language:
            raise Error('invalid target language')

        if not text.strip():
            return u'', unicode(target_language)

        # Browser request for 'hello world' is:
        # http://translate.google.com/translate_a/t?client=t&hl=en&sl=en&tl=zh-CN&ie=UTF-8&oe=UTF-8&multires=1&prev=conf&psl=en&ptl=en&otf=1&it=sel.2016&ssel=0&tsel=0&prev=enter&oc=3&ssel=0&tsel=0&sc=1&text=hello%20world
        GOOGLE_TRASLATE_URL = 'http://translate.google.com/translate_a/t'
        GOOGLE_TRASLATE_PARAMETERS = {
            # 't' client will receive non-standard json format
            # change client to something other than 't' to get standard json response
            'client': 'z',
            'sl': source_language,
            'tl': target_language,
            'ie': 'UTF-8',
            'oe': 'UTF-8',
            'text': text
            }

        url = '?'.join((GOOGLE_TRASLATE_URL, urlencode(GOOGLE_TRASLATE_PARAMETERS)))
        response_content = self._open_url(url)
        data = json.loads(response_content)
        translation = u''.join(i['trans'] for i in data['sentences'])
        detected_source_language = data['src']
        return translation, detected_source_language

    def get_languages(self):
        '''Discover supported languages

        It returns iso639-1 language codes for
        `supported languages <https://developers.google.com/translate/v2/using_rest#language-params>`_
        for translation. Some language codes also include a country code, like zh-CN or zh-TW.

        .. note:: It only queries Google once for the first time and uses the cached result afterwards

        :returns: a dict of all supported language code and language name mapping ``{'language-code': 'Language name'}``

        :Example:

        >>> languages = Goslate().get_languages()
        >>> assert 'zh' in languages
        >>> print(languages['zh'])
        Chinese

        '''
        if self._languages:
            return self._languages

        GOOGLE_TRASLATOR_URL = 'http://translate.google.com/translate_a/l'
        GOOGLE_TRASLATOR_PARAMETERS = {
            'client': 't',
            }

        url = '?'.join((GOOGLE_TRASLATOR_URL, urlencode(GOOGLE_TRASLATOR_PARAMETERS)))
        response_content = self._open_url(url)
        root = xml.etree.ElementTree.fromstring(response_content)

        if root.tag != 'LanguagePairs':
            return {}

        languages = {}
        for i in root.findall('Pair'):
            languages[i.get('target_id')] = i.get('target_name')
            languages[i.get('source_id')] = i.get('source_name')

        if 'auto' in languages:
            del languages['auto']
        self._languages = languages
        return self._languages

    _SEPERATORS = [quote_plus(i.encode('utf-8')) for i in
                   u'.!?,;。,?!::"\'“”’‘#$%&()()*×+/<=>@#¥[\]…[]^`{|}{}~~\n\r\t ']

    def _translate_single_text(self, text, target_language='zh-CN', source_lauguage=''):
        assert _is_bytes(text)
        def split_text(text):
            start = 0
            text = quote_plus(text)
            length = len(text)
            while (length - start) > self._MAX_LENGTH_PER_QUERY:
                for seperator in self._SEPERATORS:
                    index = text.rfind(seperator, start, start+self._MAX_LENGTH_PER_QUERY)
                    if index != -1:
                        break
                else:
                    raise Error('input too large')
                end = index + len(seperator)
                yield unquote_plus(text[start:end])
                start = end

            yield unquote_plus(text[start:])

        def make_task(text):
            return lambda: self._basic_translate(text, target_language, source_lauguage)[0]

        return ''.join(self._execute(make_task(i) for i in split_text(text)))

    def translate(self, text, target_language, source_language=''):
        '''Translate text from source language to target language

        .. note::

         - Input all source strings at once. Goslate will batch and fetch concurrently to maximize speed.
         - `futures <https://pypi.python.org/pypi/futures>`_ is required for best performance.
         - It returns generator on batch input in order to better fit pipeline architecture

        :param text: The source text(s) to be translated. Batch translation is supported via sequence input
        :type text: UTF-8 str; unicode; string sequence (list, tuple, iterator, generator)

        :param target_language: The language to translate the source text into.
         The value should be one of the language codes listed in :func:`get_languages`
        :type target_language: str; unicode

        :param source_language: The language of the source text.
                                The value should be one of the language codes listed in :func:`get_languages`.
                                If a language is not specified,
                                the system will attempt to identify the source language automatically.
        :type source_language: str; unicode

        :returns: the translated text(s)

          - unicode: on single string input
          - generator of unicode: on batch input of string sequence

        :raises:
         - :class:`Error` ('invalid target language') if target language is not set
         - :class:`Error` ('input too large') if the input is a single large word without any punctuation or space in between

        :Example:

         >>> gs = Goslate()
         >>> print(gs.translate('Hello World', 'de'))
         Hallo Welt
         >>>
         >>> for i in gs.translate(['thank', u'you'], 'de'):
         ...     print(i)
         ...
         danke
         Sie

        '''

        if not target_language:
            raise Error('invalid target language')

        if not _is_sequence(text):
            if isinstance(text, unicode):
                text = text.encode('utf-8')
            return self._translate_single_text(text, target_language, source_language)

        JOINT = u'\u26ff'
        UTF8_JOINT = (u'\n%s\n' % JOINT).encode('utf-8')

        def join_texts(texts):
            def convert_to_utf8(texts):
                for i in texts:
                    if isinstance(i, unicode):
                        i = i.encode('utf-8')
                    yield i.strip()

            texts = convert_to_utf8(texts)
            text = next(texts)
            for i in texts:
                new_text = UTF8_JOINT.join((text, i))
                if len(quote_plus(new_text)) < self._MAX_LENGTH_PER_QUERY:
                    text = new_text
                else:
                    yield text
                    text = i
            yield text

        def make_task(text):
            return lambda: (i.strip('\n') for i in self._translate_single_text(text, target_language, source_language).split(JOINT))

        return itertools.chain.from_iterable(self._execute(make_task(i) for i in join_texts(text)))

    def _detect_language(self, text):
        if _is_bytes(text):
            text = text.decode('utf-8')
        return self._basic_translate(text[:50].encode('utf-8'), 'en')[1]

    def detect(self, text):
        '''Detect language of the input text

        .. note::

         - Input all source strings at once. Goslate will detect concurrently to maximize speed.
         - `futures <https://pypi.python.org/pypi/futures>`_ is required for best performance.
         - It returns generator on batch input in order to better fit pipeline architecture.

        :param text: The source text(s) whose language you want to identify.
                     Batch detection is supported via sequence input
        :type text: UTF-8 str; unicode; sequence of string
        :returns: the language code(s)

          - unicode: on single string input
          - generator of unicode: on batch input of string sequence

        :raises: Error if parameter type or value is not valid

        Example::

         >>> gs = Goslate()
         >>> print(gs.detect('hello world'))
         en
         >>> for i in gs.detect([u'hello', 'Hallo']):
         ...     print(i)
         ...
         en
         de

        '''
        if _is_sequence(text):
            return self._execute(functools.partial(self._detect_language, i) for i in text)
        return self._detect_language(text)

def _main(argv):
    import optparse

    usage = "usage: %prog [options] <file1 file2 ...>\n<stdin> will be used as input source if no file specified."

    parser = optparse.OptionParser(usage=usage, version="%%prog %s @ Copyright %s" % (__version__, __copyright__))
    parser.add_option('-t', '--target-language', metavar='zh-CN',
                      help='specify target language to translate the source text into')
    parser.add_option('-s', '--source-language', default='', metavar='en',
                      help='specify source language, if not provide it will identify the source language automatically')
    parser.add_option('-i', '--input-encoding', default=sys.getfilesystemencoding(), metavar='utf-8',
                      help='specify input encoding, default to current console system encoding')
    parser.add_option('-o', '--output-encoding', default=sys.getfilesystemencoding(), metavar='utf-8',
                      help='specify output encoding, default to current console system encoding')

    options, args = parser.parse_args(argv[1:])

    if not options.target_language:
        print('Error: missing target language!')
        parser.print_help()
        return

    gs = Goslate()
    import fileinput
    # inputs = fileinput.input(args, mode='rU', openhook=fileinput.hook_encoded(options.input_encoding))
    inputs = fileinput.input(args, mode='rb')
    inputs = (i.decode(options.input_encoding) for i in inputs)
    outputs = gs.translate(inputs, options.target_language, options.source_language)
    for i in outputs:
        sys.stdout.write((i+u'\n').encode(options.output_encoding))
        sys.stdout.flush()

if __name__ == '__main__':
    try:
        _main(sys.argv)
    except:
        error = sys.exc_info()[1]
        if len(str(error)) > 2:
            print(error)