
Differences Between Python 2 and Python 3, Seen Through a Simple Web Crawler

2018-01-07 11:18
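I used to write all my crawlers in Python 2.7. Last night I tried writing one in Python 3.6 and ran into a few problems along the way. Fortunately, after looking things up and getting pointers from more experienced people, I learned a lot as a beginner. Below I record what I learned over these two days.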

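The program is fairly simple: it scrapes the images from a website. Halfway through I hit a problem I could not solve, so I first wrote the whole thing in Python 2.7: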

# -*- coding: utf-8 -*-
# Python 2.7 version
import re
import urllib


# Fetch the page source
def page(pg):
    url = 'https://www.pengfu.com/index_%s.html' % pg
    html = urllib.urlopen(url).read()  # read the whole page source
    return html


# Extract the titles
def title(html):
    # .*? is a non-greedy match of any characters
    reg = re.compile(r'<h1 class="dp-b"><a href=".*?" target="_blank">(.*?)</a>')
    item = re.findall(reg, html)
    return item


# Extract the image URLs
def content(html):
    reg = r'<img src="(.*?)" width'
    item = re.findall(reg, html)
    return item


# Download one image
def download(url, name):
    # Raw string keeps the backslashes of the Windows path literal;
    # the title is re-encoded to GBK for the Windows file name
    path = r'H:\image\%s.jpg' % name.decode('utf-8').encode('gbk')
    urllib.urlretrieve(url, path)


if __name__ == '__main__':
    for pg in range(1, 3):
        html = page(pg)
        title_list = title(html)  # image names
        content_list = content(html)
        # Separate loop names so the page counter is not overwritten
        for name, img_url in zip(title_list, content_list):
            download(img_url, name)
            print(name, img_url)

Python 3 differs from Python 2 quite noticeably in a few places. Here is the Python 3.6 version:
# -*- coding: utf-8 -*-
# Python 3.6 version
import re
from urllib import request


# Fetch the page source (urlopen().read() returns bytes in Python 3)
def page(pg):
    url = 'https://www.pengfu.com/index_%s.html' % pg
    html = request.urlopen(url).read()  # read the whole page source
    return html


# Extract the titles (expects html as str, not bytes)
def title(html):
    # .*? is a non-greedy match of any characters
    reg = re.compile(r'<h1 class="dp-b"><a href=".*?" target="_blank">(.*?)</a>')
    item = re.findall(reg, html)
    return item


# Extract the image URLs
def content(html):
    reg = r'<img src="(.*?)" width'
    item = re.findall(reg, html)
    return item


# Download one image
def download(url, name):
    with request.urlopen(url) as web:
        with open('H:/pengfuimg/%s.jpg' % name, 'wb') as image:
            image.write(web.read())


if __name__ == '__main__':
    for pg in range(1, 6):
        html = page(pg)
        html = html.decode('utf-8')  # bytes -> str, required in Python 3
        title_l = title(html)  # image names
        content_l = content(html)
        # Separate loop names so the page counter is not overwritten
        for name, img_url in zip(title_l, content_l):
            download(img_url, name)
            print(name, img_url)



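First of all, the imports are different. In Python 3.x, urllib and urllib2 were merged into a single urllib package, whose functions now live in submodules such as urllib.request.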
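As a rough illustration of that reorganization, the sketch below imports a few commonly used names from their Python 3 locations and notes in comments where each lived in Python 2. The selection of names is mine and is not exhaustive; the query parameter is made up for the demo.

```python
# Python 3 locations of names that lived in urllib / urllib2 in Python 2
from urllib.request import urlopen, urlretrieve  # Py2: urllib.urlopen, urllib2.urlopen, urllib.urlretrieve
from urllib.parse import quote, urlencode        # Py2: urllib.quote, urllib.urlencode
from urllib.error import URLError, HTTPError     # Py2: urllib2.URLError, urllib2.HTTPError

# A call that needs no network access: build a query string
# ("page" is a made-up parameter name for illustration)
query = urlencode({"page": 1})
print(query)
```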
This is how I wrote the download function; I do not know whether there are other ways to write it.

# Python 2.7 version, as in the first listing
def download(url, name):
    path = r'H:\image\%s.jpg' % name.decode('utf-8').encode('gbk')
    urllib.urlretrieve(url, path)
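One other way to write it in Python 3, offered as a sketch rather than the recommended approach: urlretrieve still exists as urllib.request.urlretrieve, although the standard library documentation lists it as a legacy interface. The folder parameter and the file:// demo below are my own additions so the example runs without network access or an H: drive; no GBK re-encoding is needed because Python 3 file names are str.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download(url, name, folder):
    """Save the resource at url as <folder>/<name>.jpg and return the path.

    folder is a hypothetical extra parameter added for this sketch.
    """
    path = os.path.join(folder, '%s.jpg' % name)
    urlretrieve(url, path)
    return path


# Demo: a local file served through a file:// URL stands in for an image URL
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.bin'
    source.write_bytes(b'fake image bytes')
    saved = download(source.as_uri(), 'demo', tmp)
    saved_bytes = Path(saved).read_bytes()
```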


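html = html.decode('utf-8')  # Python 3.x
This is another difference. In Python 3, urlopen(url).read() returns bytes rather than str, so the page source has to be decoded with .decode('utf-8') before the str regular expressions can be applied to it.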
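To see why the decode step matters, here is a small sketch that needs no network access. The HTML fragment and the example.com URL are made up and merely stand in for what urlopen(url).read() would return.

```python
import re

# Made-up page fragment, encoded to bytes the way urlopen(url).read() returns it
raw = '<img src="https://example.com/a.jpg" width="100">'.encode('utf-8')
pattern = r'<img src="(.*?)" width'

# Applying a str pattern to bytes fails in Python 3
error = None
try:
    re.findall(pattern, raw)
except TypeError as exc:
    error = str(exc)

# After decoding, the same pattern works
text = raw.decode('utf-8')
links = re.findall(pattern, text)
print(error)
print(links)
```

An alternative is to keep the data as bytes and use a bytes pattern such as rb'<img src="(.*?)" width', but the matches are then bytes as well and still have to be decoded before being used in file names.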