Python + Scrapy: Scraping the Douban Movie Top 250
2016-09-08 11:17
Environment setup
Windows
Python 2.7
Scrapy
PyMongo
Create the project
scrapy startproject douban_movie
The directory structure is as follows:
|-- douban_movie
|   |-- __init__.py
|   |-- items.py
|   |-- middlewares.py
|   |-- pipelines.py
|   |-- settings.py
|   `-- spiders
|       |-- __init__.py
|       `-- spiders.py
|-- README.md
|-- run.py
`-- scrapy.cfg
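middlewares.py: sets the User-Agent
pipelines.py: processes the scraped items and inserts them into MongoDB
items.py: defines the structure of the data to scrape
spiders.py: contains the actual crawling logic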
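The post does not show middlewares.py or pipelines.py, so the following is only a minimal sketch of what the two files described above might look like. The class names, the User-Agent string, and the MongoDB URI, database, and collection names are all assumptions made for illustration, not the author's actual code.

```python
# Hypothetical sketch of middlewares.py and pipelines.py.
# Class names, the User-Agent string, and the MongoDB names are assumptions.


class UserAgentMiddleware(object):
    """Downloader middleware that sets a User-Agent on each outgoing request."""

    # Assumed value; any realistic browser string could be used instead.
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

    def process_request(self, request, spider):
        # Scrapy request headers can be assigned like dictionary entries.
        request.headers["User-Agent"] = self.user_agent
        # Returning None tells Scrapy to keep processing this request.
        return None


class MongoPipeline(object):
    """Item pipeline that inserts every scraped item into a MongoDB collection."""

    def __init__(self, uri="mongodb://localhost:27017",
                 db_name="douban", collection_name="movie"):
        self.uri = uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so the module loads even before PyMongo is needed.
        import pymongo
        self.client = pymongo.MongoClient(self.uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # The spider sets '_id' itself, so MongoDB uses it as the primary key.
        self.collection.insert_one(dict(item))
        return item
```

Classes like these only take effect once they are registered in settings.py under Scrapy's `DOWNLOADER_MIDDLEWARES` and `ITEM_PIPELINES` settings; the module paths used there would depend on the class names actually chosen.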
spiders.py:
from scrapy.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import Selector
from douban_movie.items import DoubanMovieItem
from bson import ObjectId
import logging

logger = logging.getLogger('doubanspider')


class Spiders(CrawlSpider):
    # No rules are defined, so parse() is overridden directly.
    name = "movie"
    start_urls = [
        "https://movie.douban.com/top250/"
    ]

    def parse(self, response):
        selector = Selector(response)
        # Each film on the page sits in its own <div class="item">.
        ol_li = selector.xpath('//div[@class="item"]')
        for li in ol_li:
            movie = DoubanMovieItem()
            # Generate the MongoDB primary key up front.
            movie['_id'] = str(ObjectId())
            movie['rank'] = li.xpath('div[@class="pic"]/em/text()').extract_first()
            movie['link'] = li.xpath('div[@class="pic"]/a/@href').extract_first()
            movie['img'] = li.xpath('div[@class="pic"]/a/img/@src').extract_first()
            movie['title'] = li.xpath('div[@class="pic"]/a/img/@alt').extract_first()
            movie['star'] = li.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
            movie['quote'] = li.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()').extract_first()
            yield movie
        # Follow the "next page" link until the last page is reached.
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = 'https://movie.douban.com/top250' + next_page[0].extract()
            yield Request(url=url, callback=self.parse)
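The pagination step in the spider builds the next URL by appending the extracted href to the list URL. As a small illustration of that step outside Scrapy, the sketch below resolves a relative href with the standard library. The helper name `next_page_url` is hypothetical, and the sample href `?start=25&filter=` is an assumption about what the "next" link looks like, since the post does not show the page's HTML.

```python
# Python 3 import; on Python 2.7 the equivalent is `from urlparse import urljoin`.
from urllib.parse import urljoin

BASE_URL = "https://movie.douban.com/top250"


def next_page_url(href, base=BASE_URL):
    """Resolve a relative 'next page' href against the list URL.

    Returns None when there is no href, i.e. on the last page.
    """
    if not href:
        return None
    return urljoin(base, href)


# Assumed example href; a query-only reference keeps the base path.
print(next_page_url("?start=25&filter="))
```

Inside a spider callback, Scrapy's `response.urljoin(href)` performs the same kind of resolution against the URL of the current response.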
For the full details, download the source code.