A First Look at Python Web Scraping
2018-12-12 19:22
Code for a site with no anti-scraping measures:
```python
from lxml import etree
import requests

url = 'https://movie.douban.com/chart'
# copy a real User-Agent from the browser's developer tools
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) .......'}
response = requests.get(url, headers=headers)  # send the GET request
html = etree.HTML(response.content)  # parse the raw bytes into an element tree
href_list = html.xpath('//a/img/@src')  # extract the image URLs with XPath
for img in href_list:
    img_response = requests.get(img, headers=headers)
    # use the last 15 characters of the URL as the file name
    with open(img[-15:], "wb") as f:
        f.write(img_response.content)
```
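The XPath step above can be tried without touching the network. The sketch below parses an inline HTML snippet with `lxml` and applies the same `//a/img/@src` expression; the snippet and its URLs are invented for illustration:

```python
from lxml import etree

# a tiny stand-in for a real page; these URLs are made up
snippet = b'''
<div>
  <a href="/movie/1"><img src="https://img.example.com/poster1.jpg"></a>
  <a href="/movie/2"><img src="https://img.example.com/poster2.jpg"></a>
</div>
'''

html = etree.HTML(snippet)
# same XPath as in the scraper: the src of every <img> directly inside an <a>
srcs = html.xpath('//a/img/@src')
print(srcs)
```

The expression returns a plain Python list of attribute strings, which is why the scraper can loop over it directly.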
Code for a page that hides its data:
```python
import re  # regular-expression matching
import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) ...'}
url = "https://www.pearvideo.com/category_8"
response = requests.get(url, headers=headers)
html = etree.HTML(response.content)

# relative links to the individual video pages
video_url = html.xpath('//div[@class="vervideo-bd"]/a/@href')
print(video_url)

# prepend the site root to get absolute URLs
real_video_url = []
main_url = "https://www.pearvideo.com/"
for i in video_url:
    real_video_url.append(main_url + i)
print(real_video_url)

video_list = []
for video in real_video_url:
    video_response = requests.get(video, headers=headers)
    # the real video address is hidden in the page's JavaScript,
    # so pull it out with a regular expression
    reg = 'ldUrl="",srcUrl="(.*?)"'
    video_list.append(re.findall(reg, video_response.content.decode())[0])

for index, i in enumerate(video_list):
    video_response = requests.get(i, headers=headers)
    # name each file by its index plus the URL's extension
    with open(str(index) + i[-4:], "wb") as f:
        f.write(video_response.content)
print(video_list)
```
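The regex step can also be tested in isolation. The string below imitates the kind of JavaScript fragment the scraper sees in the page source (the variable names around the pattern and the URL are invented); the non-greedy group `(.*?)` captures everything between the two quotes after `srcUrl=`:

```python
import re

# imitation of the inline JavaScript that hides the real address; URL is made up
page = 'var contId="123", ldUrl="",srcUrl="https://video.example.com/demo.mp4",vdoUrl=srcUrl;'

reg = 'ldUrl="",srcUrl="(.*?)"'
matches = re.findall(reg, page)
print(matches[0])
```

`re.findall` returns only the captured group, so `matches[0]` is the bare video URL, ready to be downloaded with `requests.get`.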