Python Web Scraping: Week 1 Study Summary
2018-01-18 11:14
BeautifulSoup documentation (Chinese): http://beautifulsoup.readthedocs.io/zh_CN/latest/
Requests official documentation (Chinese): http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
1. Parsing elements in a web page
BeautifulSoup
In the browser's developer tools, right-click an element and choose "Copy selector" (or "Copy XPath") to get a path describing where the element sits in the page.
Understanding XPath
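BeautifulSoup's select() accepts CSS selectors only; to evaluate an XPath expression you can use lxml directly (the same library that serves as the parser below). A minimal sketch on a made-up inline fragment:

```python
from lxml import html

# a tiny stand-in for a real page
fragment = '<ul><li><a href="#"><img src="1.jpg"></a></li></ul>'
doc = html.fromstring(fragment)

# XPath walks the element tree; @src pulls out the attribute value
srcs = doc.xpath('//ul/li/a/img/@src')
print(srcs)  # ['1.jpg']
```

The `//ul/li/a/img` path here mirrors what the browser's "Copy XPath" would produce for the image elements.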
Exercise 1
from bs4 import BeautifulSoup
import urllib.request  # `import urllib` alone does not expose urllib.request

url = "http://www.mmjpg.com/"
html = urllib.request.urlopen(url)
response = html.read()
soup = BeautifulSoup(response, 'lxml')
# selectors copied from the browser's "Copy selector"
images = soup.select('body > div.main > div.pic > ul > li > a > img')
titles = soup.select('body > div.topbar > div.subnav > a')
# print(images)  # uncomment to inspect the raw tags
for image in images:
    print(image.get('src'))
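Since the target site may change or go offline, the selector logic above can also be checked offline against a static snippet (the markup below is a made-up stand-in for the real page):

```python
from bs4 import BeautifulSoup

snippet = '''<body><div class="main"><div class="pic"><ul>
<li><a href="#"><img src="1.jpg"></a></li>
<li><a href="#"><img src="2.jpg"></a></li>
</ul></div></div></body>'''

soup = BeautifulSoup(snippet, 'html.parser')  # stdlib parser, no lxml needed
srcs = [img.get('src')
        for img in soup.select('body > div.main > div.pic > ul > li > a > img')]
print(srcs)  # ['1.jpg', '2.jpg']
```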
Exercise 2
from bs4 import BeautifulSoup
import requests
import time

url_saves = 'http://www.tripadvisor.com/Saves#37685322'
url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
urls = ['https://cn.tripadvisor.com/Attractions-g60763-Activities-oa{}-New_York_City_New_York.html#ATTRACTION_LIST'.format(str(i)) for i in range(30, 930, 30)]

headers = {
    'User-Agent': '',
    'Cookie': ''
}

def get_attractions(url, data=None):
    wb_data = requests.get(url)
    time.sleep(4)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.property_title > a[target="_blank"]')
    imgs = soup.select('img[width="160"]')
    cates = soup.select('div.p13n_reasoning_v2')
    if data is None:
        for title, img, cate in zip(titles, imgs, cates):
            data = {
                'title': title.get_text(),            # text of the tag
                'img': img.get('src'),                # value of an attribute
                'cate': list(cate.stripped_strings),  # text of all child nodes, kept as a list for storage
            }
            print(data)

def get_favs(url, data=None):
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('a.location-name')
    imgs = soup.select('div.photo > div.sizedThumb > img.photo_image')
    metas = soup.select('span.format_address')
    if data is None:
        for title, img, meta in zip(titles, imgs, metas):
            data = {
                'title': title.get_text(),
                'img': img.get('src'),
                'meta': list(meta.stripped_strings)
            }
            print(data)

for single_url in urls:
    get_attractions(single_url)

# from the mobile web site
'''
headers = {
    'User-Agent': '',  # mobile device user agent from Chrome
}
mb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(mb_data.text, 'lxml')
imgs = soup.select('div.thumb.thumbLLR.soThumb > img')
for i in imgs:
    print(i.get('src'))
'''
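The `list(cate.stripped_strings)` trick used above collects the text of every child node with surrounding whitespace removed; a quick illustration on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="cate"> <span>Museums</span> <span>Tours</span> </div>',
                     'html.parser')
cate = soup.select_one('div.cate')
# .stripped_strings yields each text node stripped; whitespace-only nodes are skipped
print(list(cate.stripped_strings))  # ['Museums', 'Tours']
```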
2. Fetching asynchronously loaded data: in the browser's developer tools, open the Network panel's XHR tab, then scroll down the page; the requests that appear are the ones loading the extra data.
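Once the XHR request is spotted in the Network panel, its URL can usually be called directly with requests and the JSON response parsed, with no HTML involved. A sketch, where the endpoint URL and payload shape are hypothetical stand-ins for whatever the real page uses:

```python
import json

# In practice the payload would come from something like:
#   requests.get('https://example.com/api/list?page=2', headers=headers).json()
# where the URL is copied from the XHR tab (hypothetical endpoint here).
payload = '{"items": [{"title": "first", "img": "1.jpg"}, {"title": "second", "img": "2.jpg"}]}'

data = json.loads(payload)
for item in data['items']:
    print(item['title'], item['img'])
```

Each scroll typically increments a page or offset parameter in the URL, so paging through the data is a matter of looping over that parameter, much like the `urls` list in Exercise 2.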