
A single-threaded web crawler in Python

2015-06-27 18:32
Source program, using the page http://jp.tingroom.com/yuedu/yd300p/ as an example:
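# -*- coding: utf-8 -*-
# Written for Python 2.7 (print statement, reload, sys.setdefaultencoding).
import requests
import re
import sys

# Python 2 only: make gb18030 the default codec for implicit str/unicode conversions.
reload(sys)
sys.setdefaultencoding("gb18030")
fs_encoding = sys.getfilesystemencoding()  # file-system encoding (not used below)

# Download the page and decode the response body as UTF-8.
html = requests.get('http://jp.tingroom.com/yuedu/yd300p/')
html.encoding = 'utf-8'
# Encode explicitly so that printing does not fail on a GBK console.
print html.text.encode("gb18030")

# Text inside the <span> elements styled "color:#666666;"; re.S lets '.' match newlines.
title = re.findall('color:#666666;">(.*?)</span>', html.text, re.S)
for each in title:
    print each

# Link text styled "color: #039;".
chinese = re.findall('color: #039;">(.*?)</a>', html.text, re.S)
for each in chinese:
    print each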
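The two re.findall calls above can be tried without network access. The following is a minimal sketch in Python 3 syntax (the program above targets Python 2.7). SAMPLE_HTML and the helper extract are hypothetical: the markup is shaped only to match the two patterns and is not copied from the real page, whose HTML may differ.

```python
import re

# Hypothetical markup, shaped only to match the two patterns used in the article;
# the real page's HTML may differ.
SAMPLE_HTML = '''
<li><span style="color:#666666;">first title</span>
<a href="#" style="color: #039;">first
translation</a></li>
<li><span style="color:#666666;">second title</span>
<a href="#" style="color: #039;">second translation</a></li>
'''


def extract(pattern, text):
    """Return every non-greedy capture of `pattern`; re.S lets '.' span newlines."""
    return re.findall(pattern, text, re.S)


titles = extract(r'color:#666666;">(.*?)</span>', SAMPLE_HTML)
translations = extract(r'color: #039;">(.*?)</a>', SAMPLE_HTML)

for each in titles:
    print(each)
for each in translations:
    print(each)
```

The non-greedy group (.*?) stops at the first closing tag, so each element yields its own match. The re.S flag matters when the captured text contains a line break, as the first link in the sample does; without it that link would not match.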
Problems encountered while coding, and their solutions:
Problem 1: character encoding mismatch
D:\Python27\python.exe D:/pycharm/class2/test.py
Traceback (most recent call last):
  File "D:/pycharm/class2/test.py", line 12, in <module>
    print html.text
UnicodeEncodeError: 'gbk' codec can't encode character u'\xa9' in position 28478: illegal multibyte sequence

Process finished with exit code 1
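Solution: the console codec here is GBK, which has no encoding for u'\xa9' (the © sign), so printing the unicode text fails. Encoding the output text as gb18030, which can represent every Unicode character, avoids the error. Code: print html.text.encode("gb18030")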
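The error and the fix above can be reproduced without downloading anything. This is a minimal sketch in Python 3 syntax (the article's program is Python 2.7); the sample string is made up and merely contains the same character, U+00A9, that the traceback names. The errors="replace" variant at the end is an alternative not used in the article.

```python
# Made-up sample text containing U+00A9, the character named in the traceback.
text = "Copyright \xa9 2015"

# GBK has no mapping for U+00A9, so a strict encode raises UnicodeEncodeError.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# GB18030 can represent the character, and decoding restores the original text.
encoded = text.encode("gb18030")
round_trip = encoded.decode("gb18030")

# Alternative: keep GBK but substitute '?' for anything it cannot encode.
lossy = text.encode("gbk", errors="replace")

print(gbk_ok, round_trip == text, lossy)
```

Encoding as gb18030 keeps the character in the byte stream, although a GBK console may still display those bytes incorrectly. The errors="replace" variant trades fidelity for output that is always valid GBK.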

                                            