您的位置:首页 > 编程语言 > Python开发

python爬虫中对含中文的url处理

2014-06-30 00:18 453 查看
在练习urllib操作中,遇到了url中含有中文字符的问题。比如http://dotamax.com/,看下源码的话,上方的搜索框的name=p,输入内容点击搜索以后,通过GET方法进行传递,比如我们搜索”意“,url变为http://dotamax.com/search/?q=意。但是url中是不允许出现中文字符的,这时候就改用urllib.parse.quote方法对中文字符进行转换。

url = "http://dotamax.com/"
search = "search/?q=" + urllib.parse.quote("意")
html = urllib.request.urlopen(url + search)


这样就可以正常获取页面了。

需要注意的是不能对整个url调用quote方法。

print(urllib.parse.quote("http://dotamax.com/search/?q=意"))


上面代码输出结果:

http%3A//dotamax.com/search/%3Fq%3D%E6%84%8F


可以看到,' : ', ' ? ', ' = '都被解码,因此需要将最后的中文字符部分调用quote方法后接在后面。

但是还有更方便的方法:

import urllib.parse

b = b'/:?='
print(urllib.parse.quote("http://dotamax.com/search/?q=意", b))
输出结果为:

http://dotamax.com/search/?q=%E6%84%8F
这就是我们想要的结果了。对quote方法是用help命令可以看到如下信息:

Help on function quote in module urllib.parse:

quote(string, safe='/', encoding=None, errors=None)
quote('abc def') -> 'abc%20def'

Each part of a URL, e.g. the path info, the query, etc., has a
different set of reserved characters that must be quoted.

RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
the following reserved characters.

reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","

Each of these characters is reserved in some component of a URL,
but not necessarily in all of them.

By default, the quote function is intended for quoting the path
section of a URL.  Thus, it will not encode '/'.  This character
is reserved, but in typical usage the quote function is being
called on a path where the existing slash characters are used as
reserved characters.

string and safe may be either str or bytes objects. encoding must
not be specified if string is a str.

The optional encoding and errors parameters specify how to deal with
non-ASCII characters, as accepted by the str.encode method.
By default, encoding='utf-8' (characters are encoded with UTF-8), and
errors='strict' (unsupported characters raise a UnicodeEncodeError).

None


safe为可以忽略的字符,可以str类型或者bytes类型。

更详细的一些用法可以看这里:
http://www.nowamagic.net/academy/detail/1302863
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: