
python-scrapy tutorial (3): a hands-on example, scraping variety-show information from Mango TV

2017-08-04 10:43

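Let's start by looking at the site we want to scrape.

Take the first page of the variety-show listing as an example: http://list.mgtv.com/1/---------2-1---.html. Each listing page holds 60 variety shows.

Information to collect:

Tutorial:

Create the project: the earlier tutorials in this series explain how. Once it is created, the overall layout looks roughly like this:

data.py is the main file that gets run. Its code is listed further down.

First, let's look at re_search.py: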

# -*- coding: utf-8 -*-
import re


def search(pattern, html):
    """Return the first capture group of pattern in html, or '' if nothing matches."""
    match = re.search(pattern, html)  # run the regex once and reuse the result
    return match.group(1) if match else ''

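It is just a thin wrapper around a regular-expression search: it returns the first capture group, or an empty string when nothing matches.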
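To see what the helper returns, here is a minimal sketch that runs it against a made-up HTML fragment. The fragment and its class names are assumptions for illustration, not the real mgtv.com markup, and the helper is repeated so the snippet runs on its own.

```python
import re


def search(pattern, html):
    """Return the first capture group of pattern in html, or '' when nothing matches."""
    match = re.search(pattern, html)
    return match.group(1) if match else ''


# Hypothetical fragment, not copied from the real site.
sample = '<li><a class="u-title" href="/b/1/2.html">Some Show</a></li>'

title = search(r'<a class="u-title"[\s\S]*?>([\s\S]*?)</a>', sample)
missing = search(r'<span class="u-none">([\s\S]*?)</span>', sample)

print(repr(title))    # 'Some Show'
print(repr(missing))  # ''
```

Returning an empty string instead of raising keeps the calling code free of per-field match checks, at the cost of silently producing blank fields when a pattern stops matching.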

Now for data.py:

# -*- coding: utf-8 -*-
import re

import scrapy

from re_search import search
from variety.items import VarietyItem

# Text of every <a> element inside a fragment.
LINK_TEXT = r'<a[\s\S]*?>([\s\S]*?)</a>'

# Several patterns below end in a bare lazy group ([\s\S]*?): their closing
# tags are missing from this listing, and as written such a group matches an
# empty string. Restore the terminating tag from the page markup before use.


class DataSpider(scrapy.Spider):
    name = "data"
    allowed_domains = ["mgtv.com"]
    start_urls = ['http://list.mgtv.com/1/---------2-1---.html']

    def parse(self, response):
        # response.text is the decoded body, so str patterns also work on Python 3
        html = response.text
        body = search(r'<div class="m-result-list">([\s\S]*?)<div class="m-result-list">', html)
        li_div = re.findall(r'<li[\s\S]*?>([\s\S]*?)</li>', body)[0:-1]
        for index, li in enumerate(li_div):
            release = search(r'([\s\S]*?)', li).strip()
            hosts = search(r'主持人：([\s\S]*?)', li)  # "主持人" = host
            title = search(r'<a class="u-title"[\s\S]*?>([\s\S]*?)</a>', li)
            if '暂无' in hosts:  # "暂无" = none listed
                hosts = ''
            else:
                hosts = '|'.join(re.findall(LINK_TEXT, hosts))
            # search() returns '' instead of raising when a pattern does not match
            img = search(r'<img[\s\S]*?src="([\s\S]*?)"', li)
            href = search(r'<a[\s\S]*?class="u-video[\s\S]*?href="([\s\S]*?)"', li)
            lable = search(r'style="background:#F06000;">([\s\S]*?)', li)
            if not href:
                continue  # no detail-page link in this entry

            yield scrapy.Request(
                url=response.urljoin(href),  # also handles relative links
                callback=self.parse_play,
                meta={'img': img, 'lable': lable, 'title': title,
                      'release': release, 'host': hosts},
            )

    def parse_play(self, response):
        html = response.text
        # "播出" = broadcaster, "类型" = genre, "地区" = region
        broadcast = search(r'播出：[\s\S]*?<a[\s\S]*?>([\s\S]*?)</a>', html)
        summary = search(r'([\s\S]*?)', html)
        types = '|'.join(re.findall(LINK_TEXT, search(r'类型：([\s\S]*?)', html).strip()))
        # the dot before "html" is escaped so it only matches a literal "."
        infoid = re.search(r'b/([\s\S]*?)/[\s\S]*?\.html', response.url).group(1)
        areas = search(r'地区：([\s\S]*?)', html).strip()
        # drop the generic "国内" (domestic) entry
        area = '|'.join(re.findall(LINK_TEXT, areas)).replace('国内', '')

        item = VarietyItem()
        item['url'] = 'http://www.mgtv.com/h/' + str(infoid) + '.html'
        item['infoid'] = infoid
        item['title'] = response.meta['title']
        item['img'] = response.meta['img']
        item['area'] = area
        item['release'] = response.meta['release']
        item['host'] = response.meta['host']
        item['broadcast'] = broadcast
        item['summary'] = summary
        item['types'] = types
        item['lable'] = response.meta['lable']

        # print every field for a quick visual check, then a blank separator line
        for field in ('url', 'infoid', 'title', 'img', 'area', 'release',
                      'host', 'broadcast', 'summary', 'types', 'lable'):
            print(item[field])
        print('')
        yield item
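The two string-handling steps in parse_play can be tried in isolation. The sketch below uses an invented detail-page URL and an invented "地区" (region) fragment; both are assumptions, as is the closing </p> tag used to terminate the region pattern. It shows how the ID is pulled out of the URL and how several link texts collapse into one '|'-separated field.

```python
import re

# Hypothetical inputs for illustration only.
page_url = 'http://www.mgtv.com/b/12345/67890.html'
region_html = '<p>地区：<a href="#">内地</a><a href="#">香港</a></p>'

# The ID is the path segment between "b/" and the next slash.
infoid = re.search(r'b/([\s\S]*?)/[\s\S]*?\.html', page_url).group(1)
canonical_url = 'http://www.mgtv.com/h/' + infoid + '.html'

# Take everything after the label, then join the text of each link with "|".
areas = re.search(r'地区：([\s\S]*?)</p>', region_html).group(1)
area = '|'.join(re.findall(r'<a[\s\S]*?>([\s\S]*?)</a>', areas))

print(canonical_url)  # http://www.mgtv.com/h/12345.html
print(area)           # 内地|香港
```

Joining with '|' keeps a multi-valued field in a single string column, which is convenient for flat storage but means the consumer has to split it again.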

Next, set up settings.py, pipelines.py and items.py:

settings: the configuration file
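The original settings are not shown, so the following is only a sketch of the entries a project like this one typically needs. The project name variety comes from the import in data.py; the pipeline class name and the download delay are assumptions.

```python
# settings.py (sketch): only the entries this tutorial relies on.
BOT_NAME = 'variety'

SPIDER_MODULES = ['variety.spiders']
NEWSPIDER_MODULE = 'variety.spiders'

# Register the item pipeline; the class name is hypothetical.
# The number sets the run order when several pipelines are enabled.
ITEM_PIPELINES = {
    'variety.pipelines.VarietyPipeline': 300,
}

# Assumed politeness setting: seconds to wait between requests.
DOWNLOAD_DELAY = 1
```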

pipelines: data storage
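The original pipeline is not shown either, so its storage backend is unknown. As one possible shape, the sketch below writes each item as a line of JSON. The class name, the output path and the choice of JSON lines are all assumptions; only the three method names are the hooks Scrapy calls on a pipeline.

```python
import json


class VarietyPipeline:
    """Write each scraped item as one JSON line (hypothetical storage choice)."""

    def __init__(self, path='variety.jl'):
        self.path = path  # assumed output file name
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and plain dicts alike.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # pass the item on to any later pipeline
```

Passing ensure_ascii=False keeps Chinese titles readable in the output file instead of escaping them as \uXXXX sequences.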

main: starts the spider by calling Scrapy's cmdline module

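I recommend using regular expressions when extracting data: they are tedious to write, but fairly precise. Also, when looping, use enumerate() instead of maintaining an index by hand. If anything is still unclear, leave a comment and @ me.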
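As a small illustration of both tips, the sketch below pulls list entries out of an invented HTML fragment with a regular expression and numbers them with enumerate(). The fragment is an assumption, not real page markup.

```python
import re

# Hypothetical list fragment, not the real page markup.
html = '<ul><li>Show A</li><li>Show B</li><li>Show C</li></ul>'

rows = []
# enumerate() yields (position, value) pairs; start=1 makes the count 1-based.
for index, li in enumerate(re.findall(r'<li[\s\S]*?>([\s\S]*?)</li>', html), start=1):
    rows.append('%d: %s' % (index, li))

print(rows)  # ['1: Show A', '2: Show B', '3: Show C']
```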
