
A First Look at Python Web Scraping

2018-12-12 19:22

Code for a site with no anti-scraping measures:

[code]from lxml import etree
import requests

url = 'https://movie.douban.com/chart'
# User-Agent copied from the browser's developer tools
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) .......'}
response = requests.get(url, headers=headers)  # send a GET request
html = etree.HTML(response.content)  # parse the raw response bytes
href_list = html.xpath('//a/img/@src')  # use XPath to pull the image URLs out of the HTML
for img in href_list:
    img_response = requests.get(img, headers=headers)
    # name each file after the last 15 characters of its URL
    with open(img[-15:], "wb") as f:
        f.write(img_response.content)
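
The script above names each file with the last 15 characters of the image URL (`img[-15:]`), which only works when every URL happens to end in a file name of that length. A minimal sketch of a sturdier alternative, using only the standard library, is below. The helper `filename_from_url`, its `default` parameter, and the example.com URLs are made up for illustration and are not part of the original script.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, ignoring any query string.

    Hypothetical helper for illustration; not part of the original script.
    """
    path = urlparse(url).path          # drops the scheme, host, query and fragment
    name = posixpath.basename(path)    # text after the final "/"
    return name or default             # fall back when the path ends in "/"


# Made-up URLs, for illustration only
print(filename_from_url("https://example.com/images/poster_01.jpg?size=small"))
print(filename_from_url("https://example.com/images/"))
```

In the download loop, `open(filename_from_url(img), "wb")` could then stand in for `open(img[-15:], "wb")`, assuming the image URLs end in a usable file name.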

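Code for a page that hides its data (the real video address is not in an HTML attribute, so it is pulled out of the page source with a regular expression):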

[code]import re  # re module, for regular-expression matching
import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) ...'}
url = "https://www.pearvideo.com/category_8"

response = requests.get(url, headers=headers)
html = etree.HTML(response.content)
# relative links to the individual video pages
video_url = html.xpath('//div[@class="vervideo-bd"]/a/@href')
print(video_url)
real_video_url = []
main_url = "https://www.pearvideo.com/"
for i in video_url:
    real_video_url.append(main_url + i)
print(real_video_url)

video_list = []
for video in real_video_url:
    video_response = requests.get(video, headers=headers)
    # regular expression for the string we are looking for
    reg = 'ldUrl="",srcUrl="(.*?)"'
    matches = re.findall(reg, video_response.content.decode())
    if matches:  # skip pages where the pattern is not found
        video_list.append(matches[0])

for index, i in enumerate(video_list):
    video_response = requests.get(i, headers=headers)
    # file name: the index plus the last 4 characters of the URL (the extension)
    with open(str(index) + i[-4:], "wb") as f:
        f.write(video_response.content)

print(video_list)
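
The regular-expression step can be tried without any network access. The sketch below runs the same pattern on a made-up string, returns `None` when nothing matches, and shows `urljoin` as one way to build the absolute page URL. The helper names, the sample text, the `video_123` path, and the example.com address are assumptions for illustration; the real page source may look different.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Same pattern as in the script above; (.*?) is non-greedy, so it stops
# at the first closing double quote after srcUrl="
SRC_URL_PATTERN = re.compile(r'ldUrl="",srcUrl="(.*?)"')


def extract_src_url(page_text):
    """Return the first srcUrl value in page_text, or None if absent (hypothetical helper)."""
    match = SRC_URL_PATTERN.search(page_text)
    return match.group(1) if match else None


def extension_of(url):
    """Return the file extension of the URL path, e.g. '.mp4' (hypothetical helper)."""
    return posixpath.splitext(urlparse(url).path)[1]


# Made-up page fragment, for illustration only
sample = 'ldUrl="",srcUrl="https://example.com/videos/clip.mp4",other="x"'

src = extract_src_url(sample)
print(src)
print(extension_of(src))
print(extract_src_url("a page without the pattern"))

# urljoin copes with a missing or doubled "/" between base and relative link
print(urljoin("https://www.pearvideo.com/", "video_123"))
```

Using `re.search` with an explicit `None` check avoids the `IndexError` that indexing an empty `re.findall` result with `[0]` would raise on a page where the pattern is missing.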

 
