A Beginner's Crawler: lxml + Requests + MongoDB
2019-03-26 21:55
A test crawl of the OnePlus community forum (oneplusbbs.com).
```python
import time

import requests
import pymongo
from lxml import etree

import proxyIP  # local helper module that supplies a proxy IP


def get_UrlInfos(url, proxyIp):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36'
    }
    response = requests.get(url, proxies=proxyIp, headers=header).text
    html = etree.HTML(response)
    # Each thread in the forum list sits inside its own <tbody>.
    items = html.xpath('//tbody')
    for item in items:
        info = {
            # Some threads carry no type tag; fall back to an empty string.
            'type': item.xpath('tr/th/div[2]/span/em/a')[0].text.strip()
                    if len(item.xpath('tr/th/div[2]/span/em/a')) > 0 else '',
            'title': item.xpath('tr/th/div[2]/a/text()')[0],
            'author': item.xpath('tr/th/div[2]/div/em[2]/a/text()')[0].strip(),
            # Recent posts show the time as text; older ones keep it in a title attribute.
            'time': item.xpath('tr/th/div[2]/div/em[3]/span/text()')[0].strip()
                    if len(item.xpath('tr/th/div[2]/div/em[3]/span/text()')) > 0
                    else item.xpath('tr/th/div[2]/div/em[3]/span/span/@title')[0],
            'view': int(item.xpath('tr/th/div[2]/div/em[1]/text()')[0].split(':')[1]),
            'reply': int(item.xpath('tr/th/div[2]/div/em[1]/a/text()')[0].strip()),
        }
        yijia.insert_one(info)  # 'yijia' is the global collection created in __main__


if __name__ == '__main__':
    start = time.time()
    mongoclient = pymongo.MongoClient('127.0.0.1', 27017)
    mydb = mongoclient['mydb']
    yijia = mydb['yijia']
    proxyIp = proxyIP.getIp()
    urls = ['http://www.oneplusbbs.com/forum-116-{}.html'.format(i) for i in range(2, 1000)]
    for url in urls:
        get_UrlInfos(url, proxyIp)
    end = time.time()
    print('Single-threaded crawl took: %d seconds' % (end - start))
```
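The defensive `... if len(...) > 0 else ...` pattern above guards against threads that lack an optional element, such as the type tag. A minimal, self-contained illustration of that pattern (the HTML snippet below is made up for the example, not the real forum markup):

```python
from lxml import etree

# Two hypothetical thread rows: the first has a type tag, the second does not.
doc = etree.HTML('''
<table><tbody>
  <tr><th><em><a>Announcement</a></em><a class="title">Post A</a></th></tr>
</tbody></table>
<table><tbody>
  <tr><th><a class="title">Post B</a></th></tr>
</tbody></table>
''')

for item in doc.xpath('//tbody'):
    tags = item.xpath('tr/th/em/a/text()')
    info = {
        # Fall back to an empty string when the optional element is missing.
        'type': tags[0] if len(tags) > 0 else '',
        'title': item.xpath('tr/th/a[@class="title"]/text()')[0],
    }
    print(info)
```

Without the length check, indexing `[0]` on an empty XPath result would raise an `IndexError` and kill the crawl on the first irregular row.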
To speed this up with multiple threads, you can experiment with the multiprocessing module yourself.
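One hedged sketch of that idea, using the thread-based `multiprocessing.dummy.Pool` (same API as `multiprocessing.Pool`). The `crawl_page` function here is a stand-in for the real `get_UrlInfos(url, proxyIp)` call, so the pattern runs without the network:

```python
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API

def crawl_page(url):
    # Stand-in for get_UrlInfos(url, proxyIp): just echo the URL instead of
    # fetching and storing it, so this sketch is self-contained.
    return url

urls = ['http://www.oneplusbbs.com/forum-116-{}.html'.format(i) for i in range(2, 12)]
with Pool(4) as pool:
    # map() blocks until all pages are processed, distributing them over 4 workers.
    results = pool.map(crawl_page, urls)
print(len(results))
```

Since the crawl is I/O-bound (waiting on HTTP responses), threads are enough to overlap the network waits; swapping in `multiprocessing.Pool` gives separate processes with the same code.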