Handling URLs that contain Chinese characters in a Python crawler

2014-06-30 00:18 453 views

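While practicing with urllib, I ran into the problem of URLs that contain Chinese characters. Take http://dotamax.com/ as an example: in the page source, the search box at the top has name=q. After you type something and click search, the value is passed with the GET method; for example, searching for "意" changes the URL to http://dotamax.com/search/?q=意. However, Chinese characters are not allowed to appear literally in a URL, so they have to be converted with urllib.parse.quote.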

import urllib.parse
import urllib.request

url = "http://dotamax.com/"
search = "search/?q=" + urllib.parse.quote("意")  # percent-encode only the query value
html = urllib.request.urlopen(url + search)

With this, the page can be fetched normally.
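As an alternative to concatenating the quoted value by hand, the standard library's urllib.parse.urlencode can build the whole query string from a dict. This is a minimal sketch; the names base, query and full_url are only illustrative, and no request is actually sent.

```python
from urllib.parse import urlencode

base = "http://dotamax.com/search/"

# urlencode percent-encodes each value (UTF-8 by default)
# and joins the pairs as key=value&key=value.
query = urlencode({"q": "意"})
full_url = base + "?" + query

print(full_url)  # http://dotamax.com/search/?q=%E6%84%8F
```

Note that urlencode encodes spaces as "+" (it uses quote_plus), which is the usual convention for query strings.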

Note that you cannot simply call quote on the entire URL.

print(urllib.parse.quote("http://dotamax.com/search/?q=意"))

The code above prints:

http%3A//dotamax.com/search/%3Fq%3D%E6%84%8F

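As you can see, ':', '?' and '=' have all been percent-encoded as well, so the result is no longer a usable URL. The Chinese part at the end therefore has to be passed to quote on its own and then appended to the rest.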
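To see exactly which of these delimiters the default safe='/' protects, each character can be quoted on its own. A small sketch; the dict name encoded is just for illustration.

```python
from urllib.parse import quote

# With the default safe='/', only the slash is left untouched;
# the other URL delimiters are percent-encoded.
encoded = {ch: quote(ch) for ch in ":?=/"}

for ch, result in encoded.items():
    print(repr(ch), "->", result)
# ':' -> %3A
# '?' -> %3F
# '=' -> %3D
# '/' -> /
```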

But there is a more convenient way:

import urllib.parse

# characters that quote should leave unencoded
safe_chars = b'/:?='
print(urllib.parse.quote("http://dotamax.com/search/?q=意", safe_chars))

The output is:

http://dotamax.com/search/?q=%E6%84%8F
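Passing a wider safe set works for this URL, but it applies the same rule to every part of the URL. A slightly more careful variant is to split the URL first and quote the path and the query separately. The helper below, quote_url, is a hypothetical name and only a sketch: it assumes the host name is already ASCII (it does no IDNA handling) and it treats '%' as safe so that already-encoded URLs are not encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_url(url):
    """Percent-encode non-ASCII characters in the path and query of url."""
    parts = urlsplit(url)
    # Keep '/' in the path, and '=', '&', '+' in the query, as delimiters.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(quote_url("http://dotamax.com/search/?q=意"))
# http://dotamax.com/search/?q=%E6%84%8F
```

Because '%' is kept, calling quote_url on its own output returns the same string.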

This is the result we want. Calling help on the quote function shows the following:

Help on function quote in module urllib.parse:

quote(string, safe='/', encoding=None, errors=None)
quote('abc def') -> 'abc%20def'

Each part of a URL, e.g. the path info, the query, etc., has a
different set of reserved characters that must be quoted.

RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
the following reserved characters.

reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","

Each of these characters is reserved in some component of a URL,
but not necessarily in all of them.

By default, the quote function is intended for quoting the path
section of a URL.  Thus, it will not encode '/'.  This character
is reserved, but in typical usage the quote function is being
called on a path where the existing slash characters are used as
reserved characters.

string and safe may be either str or bytes objects. encoding must
not be specified if string is a str.

The optional encoding and errors parameters specify how to deal with
non-ASCII characters, as accepted by the str.encode method.
By default, encoding='utf-8' (characters are encoded with UTF-8), and
errors='strict' (unsupported characters raise a UnicodeEncodeError).

None

safe specifies the characters that quote should leave unencoded; it can be either a str or a bytes object.
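The sketch below exercises the parameters described in the help text: safe given as str and as bytes, and encoding set to something other than UTF-8. The variable names are illustrative. In current CPython, encoding is accepted when string is a str and rejected with a TypeError when string is bytes.

```python
from urllib.parse import quote, unquote

url = "http://dotamax.com/search/?q=意"

# safe may be a str or a bytes object; both give the same result.
as_str = quote(url, safe="/:?=")
as_bytes = quote(url, safe=b"/:?=")
print(as_str == as_bytes)  # True

# encoding controls how non-ASCII characters become bytes before quoting.
# Some older sites expect GBK rather than UTF-8.
gbk = quote("意", encoding="gbk")
print(gbk, unquote(gbk, encoding="gbk"))

# encoding cannot be combined with bytes input.
try:
    quote(b"\xe6\x84\x8f", encoding="utf-8")
except TypeError as exc:
    bytes_error = str(exc)
    print("TypeError:", bytes_error)
```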

For more detailed usage, see:
http://www.nowamagic.net/academy/detail/1302863
