<四>、python爬虫抓取购物网站商品信息--图片价格名称
2017-09-06 15:08
1236 查看
本篇博客参考:python爬虫入门教程 http://blog.csdn.net/wxg694175346/article/category/1418998
一、项目分析
为了给我的出于实验目的网上商城批量增加商品信息,我需要自动从网上获取大量的商品名称、价格、图片信息保存到本地,再传到我自己的web应用中,为后续实验使用。
看完上面的参考博客就基本可以上手了,需要注意的一点是网上很多案例是python 2.X版本的,而现在一般是python 3.X版本的环境,有些地方代码需要调整,引用的包也有不同。
整个项目没使用scrapy、bs4,比较原生简单,最大的难点应该在于对网页源代码分析,通过正则表达式获取url,这里可能会出现两种和预期不同的错误场景:一是匹配不到,二是匹配过多,需要对正则表达式好好检查。
我选择爬取的是苏宁易购里面的大聚惠类似于聚划算,起点是 https://ju.suning.com/,分析源代码很简易找到各分类的URL
https://ju.suning.com/pc/column/products-1-0.html 就是大家电分类的显示页面,然后再对其进行源码分析
https://ju.suning.com/pc/brandComm-100036641-1.html
就是帅康品牌的所有商品的显示页面,然后再对其进行源码分析
https://ju.suning.com/pc/jusp/product-00010641eb93529d.html
就是帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐商品的显示页面,然后再对其进行源码分析就可以提取出商品的信息了,下面讲代码实现。
二、项目实现
我最开始完全按照我分析网页源代码的思路一层一级调用实现
Python爬虫爬取网页图片 http://www.cnblogs.com/abelsu/p/4540711.html
一、项目分析为了给我的出于实验目的网上商城批量增加商品信息,我需要自动从网上获取大量的商品名称、价格、图片信息保存到本地,再传到我自己的web应用中,为后续实验使用。
看完上面的参考博客就基本可以上手了,需要注意的一点是网上很多案例是python 2.X版本的,而现在一般是python 3.X版本的环境,有些地方代码需要调整,引用的包也有不同。
整个项目没使用scrapy、bs4,比较原生简单,最大的难点应该在于对网页源代码分析,通过正则表达式获取url,这里可能会出现两种和预期不同的错误场景:一是匹配不到,二是匹配过多,需要对正则表达式好好检查。
我选择爬取的是苏宁易购里面的大聚惠类似于聚划算,起点是 https://ju.suning.com/,分析源代码很简易找到各分类的URL
<!--商品列表一级导航栏 [[ --> <div class="ju-nav-wrapper"> <div class="ju-nav"> <table> <tr> <td class="active"><a name="columnId" id="0" value="1" href="/pc/new/home.html" name1="mps_index_qbsp_qb">全部商品</a> </td> <td><a name="categCode" href="/pc/column/products-1-0.html#refresh" value="1" name1="mps_index_qbsp_spml1">大家电</a></td> <td><a name="categCode" href="/pc/column/products-2-0.html#refresh" value="2" name1="mps_index_qbsp_spml2">电脑数码</a></td> <td><a name="categCode" href="/pc/column/products-17-0.html#refresh" value="17" name1="mps_index_qbsp_spml3">生活家电</a></td> <td><a name="categCode" href="/pc/column/products-733-0.html#refresh" value="733" name1="mps_index_qbsp_spml4">手机</a></td> <td><a name="categCode" href="/pc/column/products-81-0.html#refresh" value="81" name1="mps_index_qbsp_spml5">车品</a></td> <td><a name="categCode" href="/pc/column/products-11-0.html#refresh" value="11" name1="mps_index_qbsp_spml6">居家日用</a></td> <td><a name="categCode" href="/pc/column/products-10-0.html#refresh" value="10" name1="mps_index_qbsp_spml7">食品</a></td> <td><a name="categCode" href="/pc/column/products-8-0.html#refresh" value="8" name1="mps_index_qbsp_spml8">美妆</a></td> <td><a name="categCode" href="/pc/column/products-9-0.html#refresh" value="9" name1="mps_index_qbsp_spml9">母婴</a></td> <td><a name="categCode" href="/pc/column/products-464-0.html#refresh" value="464" name1="mps_index_qbsp_spml10">服饰鞋包</a></td> <td><a name="categCode" href="/pc/column/products-468-0.html#refresh" value="468" name1="mps_index_qbsp_spml11">纸品洗护</a></td> <td><a name="categCode" href="/pc/column/products-125-0.html#refresh" value="125" name1="mps_index_qbsp_spml12">家装</a></td> </tr> </table> </div> </div>将/pc/column/products-1-0.html改成
https://ju.suning.com/pc/column/products-1-0.html 就是大家电分类的显示页面,然后再对其进行源码分析
<a href="/pc/column/products-1-0.html#refresh" value="0" class="active" name1="mps_1_qbsp_ejqb">全 部</a> <input type="hidden" value="P" id="secCategCodeBrand"/> <a href="/pc/column/products-1-.html#P" value="P" class="floor" name1="mps_1_qbsp_ejml1">精选品牌</a> <a href="/pc/column/products-1-139.html#139" value="139" class="floor" name1="mps_1_qbsp_ejml1">厨卫</a> <a href="/pc/column/products-1-137.html#137" value="137" class="floor" name1="mps_1_qbsp_ejml2">冰箱</a> <a href="/pc/column/products-1-191.html#191" value="191" class="floor" name1="mps_1_qbsp_ejml3">彩电影音</a> <a href="/pc/column/products-1-138.html#138" value="138" class="floor" name1="mps_1_qbsp_ejml4">空调</a> <a href="/pc/column/products-1-410.html#410" value="410" class="floor" name1="mps_1_qbsp_ejml5">热水器</a> <a href="/pc/column/products-1-409.html#409" value="409" class="floor" name1="mps_1_qbsp_ejml6">洗衣机</a> <a href="/pc/column/products-1-552.html#552" value="552" class="floor" name1="mps_1_qbsp_ejml7">净水设备</a> <a href="/pc/column/products-1-617.html#617" value="617" class="floor" name1="mps_1_qbsp_ejml8">爆款预订</a>可以得到二级分类的URL,将/pc/column/products-1-.html 改为 https://ju.suning.com/pc/column/products-1-.html就是精选品牌显示页面,然后再对其进行源码分析
<!-- 精选品牌列表 --> <h5 id ="P" class="ju-prodlist-head"><span>精选品牌</span></h5> <ul class="ju-prodlist-floor1 ju-prodlist-lazyBrand clearfix"> <li class="ju-brandlist-item" name="brandCollect" value="100036641"> <a href="/pc/brandComm-100036641-1.html" title="帅康(sacon)" expotype="2" expo="mps_1_qbsp_jxpp1:帅康(sacon)" name1="mps_1_qbsp_jxpp1" target="_blank" shape="" class="brand-link"></a> <img orig-src-type="1-4" orig-src="//image3.suning.cn/uimg/nmps/PPZT/1000592621751_2_390x195.jpg" width="390" height="195" class="brand-pic lazy-loading" alt="帅康(sacon)"> <div class="sale clearfix"> <span class="brand-countdown ju-timer" data-time-now="" name="dateNow" data-time-end="2017-09-06 23:59:57.0"> </span> <span class="brand-buynum" id="100036641"></span> </div> <div class="border"></div> </li> <li class="ju-brandlist-item" name="brandCollect" value="100036910"> <a href="/pc/brandComm-100036910-1.html" title="富士通(FUJITSU)" expotype="2" expo="mps_1_qbsp_jxpp2:富士通(FUJITSU)" name1="mps_1_qbsp_jxpp2" target="_blank" shape="" class="brand-link"></a> <img orig-src-type="1-4" orig-src="//image1.suning.cn/uimg/nmps/PPZT/1000601630663_2_390x195.jpg" width="390" height="195" class="brand-pic lazy-loading" alt="富士通(FUJITSU)"> <div class="sale clearfix"> <span class="brand-countdown ju-timer" data-time-now="" name="dateNow" data-time-end="2017-09-06 23:59:55.0"> </span> <span class="brand-buynum" id="100036910"></span> </div> <div class="border"></div> </li>将/pc/brandComm-100036641-1.html改成
https://ju.suning.com/pc/brandComm-100036641-1.html
就是帅康品牌的所有商品的显示页面,然后再对其进行源码分析
<li class="ju-prodlist-item" id="6494577"> <div class="item-wrap"> <a title="帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐" expotype="1" expo="mpsblist_100036641_ppsp_mrsp1:0070068619|126962539" name1="mpsblist_100036641_ppsp_mrsp1" href="/pc/jusp/product-00010641eb93529d.html" target="_blank" shape="" class="prd-link"> </a> <img class="prd-pic lazy-loading" orig-src-type="0-1" orig-src="//image4.suning.cn/uimg/nmps/ZJYDP/100059262126962539picA_1_392x294.jpg" width="390" height="292"> <div class="detail"> <p class="prd-name fixed-height-name">帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐</p> <p class="prd-desp-items fixed-height-desp"> <span>17大吸力</span> <span>销量TOP</span> <span>一级能效</span> <span>限时抢烤箱!</span> </p> </div> <div class="sale clearfix"> <div class="prd-price clearfix"> <div class="sn-price"></div> <div class="discount"> <p class="full-price"></p> </div> </div> <div class="prd-sale"> <p class="prd-quan" id="000000000126962539-0070068619"> </p> <p class="sale-amount"> </p> </div> </div> </div> <div class="border"></div> </li>将/pc/jusp/product-00010641eb93529d.html改成
https://ju.suning.com/pc/jusp/product-00010641eb93529d.html
就是帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐商品的显示页面,然后再对其进行源码分析就可以提取出商品的信息了,下面讲代码实现。
二、项目实现
我最开始完全按照我分析网页源代码的思路一层一级调用实现
import urllib import urllib.parse import urllib.request import re import threading import queue import time q = queue.Queue() r = re.compile(r'href="(http://ju\.suning\.com/pc/jusp/product.+?)"') urls = [] #商品-四级分类 def save_products_from_url(contents): category_products = re.findall('href="/pc/jusp/product.+?.html"',contents,re.S) print('所有四级分类') print(category_products) for url_product in category_products: url_product = url_product.replace("\"","") url_product = url_product.replace("href=","") url_product = url_product.replace("/pc","http://ju.suning.com/pc") if url_product in urls: continue else: html = download_page(url_product) get_image(html) #设置sleep否则网站会认为是恶意访问而终止访问 time.sleep(1) return #品牌-三级分类 def save_brand_from_url(contents): category_brand = re.findall('href="/pc/brandComm.+?.html"',contents,re.S) print('所有三级分类') print(category_brand) for url_brand in category_brand: url_brand = url_brand.replace("\"","") url_brand = url_brand.replace("href=","") url_brand = url_brand.replace("/pc","http://ju.suning.com/pc") if url_brand in urls: continue else: urls.append(url_brand) q.put(url_brand) print('三级分类--:海信') print(url_brand) opener = urllib.request.urlopen(url_brand) contents = opener.read() contents = contents.decode("utf-8") opener.close() time.sleep(1) save_products_from_url(contents) def save_contents_from_url(contents): #二级分类:空调 regx = r'href="/pc/column/products-[\d]{1,3}-[\d][\d][\d].html' pattern = re.compile(regx) category_two = re.findall(pattern,repr(contents)) print('所有二级分类') print(category_two) for url_two in category_two: url_two = url_two.replace("\"","") url_two = url_two.replace("href=","") url_two = url_two.replace("/pc","http://ju.suning.com/pc") if url_two in urls: continue else: urls.append(url_two) q.put(url_two) print('二级分类--:空调') print(url_two) opener = urllib.request.urlopen(url_two) contents = opener.read() contents = contents.decode("utf-8") opener.close() time.sleep(1) save_brand_from_url(contents) def set_urls_from_contents(contents): #一级分类:大家电 g = re.findall('href="/pc/column/products.+?.html#refresh"',contents,re.S) print('所有一级分类') print(g) for url in g : print('一级分类--:大家电') print(url) url = url.replace("\"","") url = url.replace("#refresh","") url = url.replace("href=","") url = url.replace("/pc","http://ju.suning.com/pc") print(url) if url in urls: b9ea continue else: urls.append(url) q.put(url) opener = urllib.request.urlopen(url) contents = opener.read() contents = contents.decode("utf-8") opener.close() time.sleep(1) save_contents_from_url(contents) def save_contents(): url = "https://ju.suning.com/" opener = urllib.request.urlopen(url) contents = opener.read() contents = contents.decode("utf-8") opener.close() print('首页') print(url) set_urls_from_contents(contents) def download_page(url): request = urllib.request.Request(url) response = urllib.request.urlopen(request) data = response.read() return data #下载图片 def get_image(html): print('price') regx = r'sn.gbPrice ="\d*?.\d*?";' pattern = re.compile(regx) get_price = re.findall(pattern,repr(html)) print(get_price) for title in get_price: myindex = title.index('"') newprice = title[myindex+1:len(title)-2] print(newprice) print('title') regx = r'<title>.*?苏宁大聚惠</title>' pattern = re.compile(regx) html = html.decode('utf-8') get_title = re.findall(pattern,repr(html)) for title in get_title: myindex = title.index('【') newtitle = title[7:myindex] print(newtitle) regx = r'orig-src="//image[\d].suning.cn/uimg/nmps/ZJYDP/[\S]*\.jpg' pattern = re.compile(regx) get_img = re.findall(pattern,repr(html)) num = 1 for img in get_img: img = img.replace("\"","") img = img.replace("orig-src=","http:") print(img) index = img.index('picA') item_id = img[index-18:index] name = img[index-18:index]+'.jpg' print(name) image = download_page(img) with open(name,'wb') as fp: fp.write(image) print('正在下载第%s张图片'%num) num += 1 #将商品价格、名称、编号id写入文件 with open('items.txt','ab') as files: items = '|'+ newprice +'|'+ newtitle +'|'+ item_id + '\r\n' items = items.encode('utf-8') files.write(items) time.sleep(1) return q.put("https://ju.suning.com/") ts = [] t = threading.Thread(target=save_contents) t.start()这样写写比较清晰明了,便于理解,但太笨了都没有用到爬虫经常使用的递归,所以我后面修改了一版,难点是各层次的正则表达式不同,修改后终于使用递归了!
import urllib import urllib.parse import urllib.request import re import threading import queue import time q = queue.Queue() mylock = threading.RLock() urls = [] level = 0 category = 0 categorysed = 0 #层级与正则表达式映射 def numbers_to_strings(argument): switcher = { 1: regx_1, 2: regx_2, 3: regx_3, 4: regx_4, } return switcher.get(argument, "nothing") def set_urls_from_contents(contents): global level global category global categorysed #一级分类 regx_1 = r'href="/pc/column/products.+?.html#refresh"' #二级分类 regx_2 = r'href="/pc/column/products-[\d]{1,3}-[\d][\d][\d].html' #三级分类:品牌 regx_3 = r'href="/pc/brandComm.+?.html"' #四级分类:商品 regx_4 = r'href="/pc/jusp/product-.+?.html"' pattern = re.compile(regx_4) g = re.findall(pattern,repr(contents)) if len(g) >0: level = 4 else: level = 0 print('商品分类不匹配') print(str(level)+':1') if level == 0: pattern = re.compile(regx_3) g = re.findall(pattern,repr(contents)) if len(g) >0: level = 3 else: level = 0 print('品牌分类不匹配') else: print('品牌分类跳过') print(str(level)+':2') if level == 0: pattern = re.compile(regx_2) g = re.findall(pattern,repr(contents)) if len(g) >0: level = 2 else: level = 0 print('二级分类不匹配') else: print('二级分跳过') print(str(level)+':3') if level == 0: pattern = re.compile(regx_1) g = re.findall(pattern,repr(contents)) if len(g) >0: level = 1 else: level = 0 print('一级分类不匹配') else: print('一级分类跳过') print(str(level)+':4') print('所有分类明细') print(g) for url in g : #url = url.groups()[0] print(str(level)+'级分类:') print(url) if url.find('#refresh')>0: eindex = url.index('.html') print(eindex) sindex = url.index('s-') category = url[sindex+2:eindex-2] print('一级分类id') print(category) elif url.find('products-')>0: eindex = url.index('.html') #sindex = url.index('s-') categorysed = url[eindex-3:eindex] print('二级分类id') print(categorysed) url = url.replace("\"","") url = url.replace("#refresh","") url = url.replace("href=","") url = url.replace("/pc","http://ju.suning.com/pc") print(url) if url.find('product-') >0: level =4 else: level = level -1 if url in urls: continue else: urls.append(url) q.put(url) if level == 4: html = download_page(url) get_image(html,category,categorysed) else: opener = urllib.request.urlopen(url) contents = opener.read() contents = contents.decode("utf-8") opener.close() time.sleep(0.1) set_urls_from_contents(contents) def save_contents(): url = "https://ju.suning.com/" opener = urllib.request.urlopen(url) contents = opener.read() contents = contents.decode("utf-8") opener.close() print('首页') print(url) set_urls_from_contents(contents) #下载具体一个商品页面中的信息 def download_page(url): request = urllib.request.Request(url) response = urllib.request.urlopen(request) data = response.read() return data #下载图片 def get_image(html,category,categorysed): print('price') regx = r'sn.gbPrice ="\d*?.\d*?";' pattern = re.compile(regx) get_price = re.findall(pattern,repr(html)) print(get_price) #print(html) for title in get_price: myindex = title.index('"') newprice = title[myindex+1:len(title)-2] print(newprice) print('title') regx = r'<title>.*?苏宁大聚惠</title>' pattern = re.compile(regx) html = html.decode('utf-8') get_title = re.findall(pattern,repr(html)) #print(get_title) for title in get_title: myindex = title.index('【') newtitle = title[7:myindex] print(newtitle) regx = r'orig-src="//image[\d].suning.cn/uimg/nmps/ZJYDP/[\S]*\.jpg' pattern = re.compile(regx) get_img = re.findall(pattern,repr(html)) num = 1 for img in get_img: img = img.replace("\"","") img = img.replace("orig-src=","http:") print(img) index = img.index('pic') item_id = img[index-18:index] name = img[index-18:index]+'.jpg' print(name) image = download_page(img) with open(name,'wb') as fp: fp.write(image) print('正在下载第%s张图片'%num) num += 1 with open('items.txt','ab') as files: items = str(category)+'|'+str(categorysed)+'|'+ newprice +'|'+ newtitle +'|'+ item_id + '\r\n' items = items.encode('utf-8') files.write(items) time.sleep(1) return #首页入口 q.put("https://ju.suning.com/") ts = [] t = threading.Thread(target=save_contents) t.start()我在晚上睡觉前运行程序,第二天查看爬了 近三千条记录,程序没有报错,应该是电脑休眠网络中断了,不过数据已经足够了
相关文章推荐
- [Python爬虫] 之二十六:Selenium +phantomjs 利用 pyquery抓取智能电视网站图片信息
- 简单的抓取淘宝关键字信息、图片的Python爬虫|Python3中级玩家:淘宝天猫商品搜索爬虫自动化工具(第二篇)
- Python爬虫学习之抓取商品名称和价格
- Python入门-编写抓取网站图片的爬虫-正则表达式
- python抓取京东商城的商品名称和价格
- Python爬虫之一 PySpider 抓取淘宝MM的个人信息和图片
- Python爬虫小项目(1):抓取转转网西安二手商品的详细信息并导入mongo,绘制图表,慢更
- python 实现网站图片抓取小爬虫
- python 抓取凡客商品名称和价格demo
- python小爬虫—抓取pixabay网站的图片资源
- 【Python】爬虫入门--抓取网站图片
- [笔记]python爬虫:淘宝商品价格信息爬取示例
- java爬虫实战(1):抓取信息门户网站中的图片及其他文件并保存至本地
- java 输入输出--抓取凡客一万多商品名称和价格,图片~用到的
- Python爬虫---------------<妹子图>图片抓取(1)
- 抓取凡客一万多商品,链接,商品名称,价格,图片~~~版本1
- python3.x爬虫实战:阿里巴巴网站定向信息抓取
- python利用urllib实现爬取京东网站商品图片的爬虫实例
- Python爬虫(一)京东商品价格及详情页抓取
- 【python】100行代码python爬虫程序,抓取网站图片存储本地(附:中文链接解决)