
Python project in practice: crawling Taobao product listings concurrently with multiprocessing + threading and storing them in MongoDB

2019-02-22 14:57


Disclaimer: this article is for learning purposes only and is meant as a share.
Building on the earlier post "python in practice: adding cookies to requests.session to simulate a Taobao login", this time we crawl Taobao product information while in that logged-in state (taking food products as the example), request the paginated URLs concurrently, and store the scraped data in MongoDB.
**1.** First, analyze the pattern of the request URL. Open Chrome DevTools, refresh the page, and find which URL the listing data actually comes from. Analysis shows it is: https://s.taobao.com/search?data-key=s%2Cps&data-value=0%2C1&ajax=true&_ksTS=1550817297379_1096&callback=jsonp1097&q=美食&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2018.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=0&ntoffset=6&p4ppushleft=1%2C48&s=44 , and the returned data is JSON.
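Note that because this URL carries a callback=jsonp1097 parameter, the raw response body is actually JSONP (the JSON wrapped in a jsonp1097(...) call) rather than bare JSON. Below is a minimal sketch of unwrapping such a response before parsing it; the helper name jsonp_to_dict and the regular expression are my own assumptions, not part of the original code:

import json
import re

def jsonp_to_dict(jsonp_text):
    # Strip a JSONP wrapper such as 'jsonp1097({...});' and parse the JSON inside
    match = re.search(r'^\s*\w+\((.*)\)\s*;?\s*$', jsonp_text, re.S)
    if match is None:
        raise ValueError("response body is not JSONP")
    return json.loads(match.group(1))

The simplified URL derived in step 2 drops the callback parameter, so its response parses with json.loads directly.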

**2.** To work out the request pattern, check the Preserve log box in the Network panel and click through to the next page. The new request URL is https://s.taobao.com/search?data-key=s&data-value=44&ajax=true&_ksTS=1550817667669_1346&callback=jsonp1347&q=美食&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2018.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=0&p4ppushleft=1%2C48&s=0 . Comparing the two request URLs shows that only a few fields change between pages. Remove those fields one at a time, using whether the response still contains the data we need as the criterion; after a few rounds of testing, the request URL simplifies to: https://s.taobao.com/search?data-key=s&data-value={(page-1)*44}&ajax=true&ie=utf8&spm=aa21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5-e&commend=all&q=美食 . A sketch of how this URL paginates follows.
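A minimal sketch of the pagination logic behind the simplified URL; the helper build_search_url is hypothetical, but the 44-items-per-page offset matches the s / data-value values observed above:

from urllib.parse import quote

URL_TEMP = ('https://s.taobao.com/search?data-key=s&data-value={offset}'
            '&ajax=true&ie=utf8&spm=aa21bo.2017.201856-taobao-item.2'
            '&sourceId=tb.index&search_type=item&ssid=s5-e&commend=all&q={keyword}')

def build_search_url(keyword, page):
    # Taobao lists 44 items per page, so page n starts at offset (n - 1) * 44
    return URL_TEMP.format(offset=(page - 1) * 44, keyword=quote(keyword))

print(build_search_url('美食', 1))  # ...data-value=0...
print(build_search_url('美食', 2))  # ...data-value=44...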

**3.** With the pattern known, and again building on the earlier post "python in practice: adding cookies to requests.session to simulate a Taobao login", we can write the crawler.
Main program: taobao_spider.py

import requests
import threading
import multiprocessing
from config import *
import json
import pymongo
from urllib.parse import quote

client = pymongo.MongoClient(MONGO_URL, connect=False)

class TaoBao:
    def __init__(self):
        self.url_temp = 'https://s.taobao.com/search?data-key=s&data-value={}&ajax=true&ie=utf8&spm=aa21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5-e&commend=all&q='
        self.headers = {"Origin": "https://login.taobao.com",
                        "Upgrade-Insecure-Requests": "1",
                        "Content-Type": "application/x-www-form-urlencoded",
                        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                        "Referer": "https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fwww.taobao.com%2F",
                        "Accept-Encoding": "gzip, deflate, br",
                        "Accept-Language": "zh-CN,zh;q=0.9",
                        "User-Agent": set_user_agent()}
        self.cookies = {}           # dict holding the manually copied cookies
        self.res_cookies_txt = ""   # cookie string returned by the server, empty at first
        self.keyword = "美食"

    # Read the cookies stored in mycookies.txt
    def read_cookies(self):
        with open("mycookies.txt", 'r', encoding='utf-8') as f:
            cookies_txt = f.read().strip(';')
        # requests only keeps cookies of the CookieJar type, while the manually copied
        # cookie is a string, so convert it to a dict first and then to a CookieJar
        # via requests.utils.cookiejar_from_dict
        for cookie in cookies_txt.split(';'):
            name, value = cookie.strip().split('=', 1)  # split on the first '=' only
            self.cookies[name] = value                  # add the pair to the cookies dict
        # Convert the dict to a CookieJar:
        cookiesJar = requests.utils.cookiejar_from_dict(self.cookies, cookiejar=None, overwrite=True)
        return cookiesJar

    # Save the cookies the server returns after a successful simulated login;
    # comparing them shows they differ from the ones we sent
    def set_cookies(self, cookies):
        # Convert the CookieJar to a dict:
        res_cookies_dic = requests.utils.dict_from_cookiejar(cookies)
        # Merge the new cookie values into the manual cookies dict
        for i in res_cookies_dic.keys():
            self.cookies[i] = res_cookies_dic[i]

        # Rebuild the cookie string from scratch so repeated calls do not duplicate entries
        self.res_cookies_txt = ""
        for k in self.cookies.keys():
            self.res_cookies_txt += k + "=" + self.cookies[k] + ";"
        # Write the refreshed cookies back to mycookies.txt
        with open('mycookies.txt', "w", encoding="utf-8") as f:
            f.write(self.res_cookies_txt)

    def parse_url(self, url):
        # Open a session
        session = requests.session()
        # Set the request headers
        session.headers = self.headers
        # Attach the CookieJar to the session
        session.cookies = self.read_cookies()
        # Send the request to the target site
        response = session.get(url)
        self.set_cookies(response.cookies)
        return response.content.decode()

    def get_goods_list(self, json_str):
        dict_ret = json.loads(json_str)
        goods_list = dict_ret["mods"]["itemlist"]["data"]["auctions"]
        if goods_list:
            for goods in goods_list:
                goods_content = {}
                goods_content['title'] = goods['raw_title']              # product name
                goods_content['url'] = goods['detail_url']               # product detail page
                goods_content['price'] = goods['view_price']             # price
                goods_content['address'] = goods['item_loc']             # shipping origin
                goods_content['sales'] = goods['view_sales']             # number of buyers
                goods_content['shops'] = goods['nick']                   # shop name
                goods_content['comment_count'] = goods['comment_count']  # number of comments
                self.save_to_mongo(goods_content)

    def save_to_mongo(self, goods_content):
        db = client[TARGET_DB]
        # Upsert keyed on the detail-page URL so re-crawling updates instead of duplicating
        if db[TARGET_TABLE].update_one({'url': goods_content['url']}, {'$set': goods_content}, upsert=True):
            print('Successfully Saved to Mongo', goods_content)

    def run(self, page_index):
        offset = page_index * 44  # each page holds 44 items, so data-value = (page - 1) * 44
        url = self.url_temp.format(offset) + quote(self.keyword)
        json_str = self.parse_url(url)
        self.get_goods_list(json_str)

if __name__ == '__main__':
    page_num = 2  # total number of pages to crawl
    taobao = TaoBao()
    pool = multiprocessing.Pool()
    # Multiprocessing: run pool.map in a separate thread so the main thread stays free
    thread = threading.Thread(target=pool.map, args=(taobao.run, range(page_num)))
    thread.start()
    thread.join()
    pool.close()
    pool.join()
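Wrapping pool.map in a Thread keeps the main thread free while the process pool crawls the pages; for a job this small the pool alone would also do. An equivalent, simplified sketch (my own variant, not from the original post):

if __name__ == '__main__':
    taobao = TaoBao()
    with multiprocessing.Pool() as pool:
        # map blocks until every page index has been processed by a worker
        pool.map(taobao.run, range(2))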

Helper module: config.py, which holds the MongoDB connection settings and picks a random request User-Agent to mitigate anti-crawling measures.

import random

MONGO_URL = 'localhost'
TARGET_DB = "TAOBAO"
TARGET_TABLE = "TAOBAO"

def set_user_agent():
    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
    ]

    # Return a randomly chosen User-Agent for each call
    user_agent = random.choice(USER_AGENTS)
    return user_agent

Crawl results: each product is stored as one document in the TAOBAO collection, and can be inspected directly in MongoDB.
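A minimal sketch for checking what was written, assuming the MONGO_URL, TARGET_DB, and TARGET_TABLE values from config.py and a MongoDB instance running locally:

import pymongo

client = pymongo.MongoClient('localhost')
collection = client["TAOBAO"]["TAOBAO"]

print('documents saved:', collection.count_documents({}))
for goods in collection.find().limit(3):
    print(goods['title'], goods['price'], goods['sales'])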
