
Python + Scrapy: Scraping the Douban Movie Top 250

2016-09-08 11:17

Environment setup

Windows

Python 2.7

Scrapy

PyMongo

Creating the project

scrapy startproject douban_movie

The directory structure is as follows:

|-- douban_movie
|   |-- __init__.py
|   |-- items.py
|   |-- middlewares.py
|   |-- pipelines.py
|   |-- settings.py
|   `-- spiders
|       |-- __init__.py
|       `-- spiders.py
|-- README.md
|-- run.py
`-- scrapy.cfg

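middlewares.py: sets the User-Agent header on outgoing requests

pipelines.py: processes the scraped items and inserts them into MongoDB

items.py: defines the structure of the data to be scraped

spiders.py: contains the actual crawling logic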
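The article does not show middlewares.py or pipelines.py, so the following is only a minimal sketch of what the two files might contain, written as one listing for brevity. The class names UserAgentMiddleware and MongoPipeline, the User-Agent string, and the MongoDB URI, database name, and collection name are all assumptions made for illustration, not values from the original project.

```python
# --- middlewares.py (sketch) ---------------------------------------------
# Placeholder value (an assumption); substitute a real browser string.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0'


class UserAgentMiddleware(object):
    """Downloader middleware that sets the User-Agent on every request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = USER_AGENT
        # Returning None tells Scrapy to keep processing the request.
        return None


# --- pipelines.py (sketch) -----------------------------------------------
class MongoPipeline(object):
    """Item pipeline that stores each scraped item in a MongoDB collection."""

    # The connection details below are hypothetical defaults.
    def __init__(self, mongo_uri='mongodb://localhost:27017',
                 db_name='douban', collection_name='movie'):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported lazily so this module still loads where PyMongo is absent.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # dict(item) converts the Scrapy item into a plain document.
        self.collection.insert_one(dict(item))
        return item
```

For Scrapy to use classes like these, they would have to be registered in settings.py under DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES, keyed by their import paths within the douban_movie package.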

spiders.py:

# -*- coding: utf-8 -*-
import logging

from bson import ObjectId
from scrapy.http import Request
from scrapy.spiders import Spider

from douban_movie.items import DoubanMovieItem

logger = logging.getLogger('doubanspider')


class Spiders(Spider):
    # A plain Spider is used instead of CrawlSpider: CrawlSpider reserves
    # parse() for its rule handling, and this spider defines no rules.
    name = "movie"
    start_urls = [
        'https://movie.douban.com/top250/',
    ]

    def parse(self, response):
        # Each movie on the page sits in its own <div class="item">.
        ol_li = response.xpath('//div[@class="item"]')
        for li in ol_li:
            movie = DoubanMovieItem()
            # Generate a MongoDB-style id and store it as a string.
            movie['_id'] = str(ObjectId())
            movie['rank'] = li.xpath('div[@class="pic"]/em/text()').extract_first()
            movie['link'] = li.xpath('div[@class="pic"]/a/@href').extract_first()
            movie['img'] = li.xpath('div[@class="pic"]/a/img/@src').extract_first()
            movie['title'] = li.xpath('div[@class="pic"]/a/img/@alt').extract_first()
            movie['star'] = li.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
            movie['quote'] = li.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()').extract_first()
            yield movie

        # Follow the "next page" link; the selection is empty on the last page.
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = 'https://movie.douban.com/top250' + next_page[0].extract()
            yield Request(url=url, callback=self.parse)
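To see how the relative XPath expressions in the spider map onto the page markup, the extraction can be tried offline. The sketch below uses only the standard library's xml.etree.ElementTree rather than Scrapy. The HTML fragment is hypothetical: its nesting is inferred from the XPath expressions in spiders.py and every value is a placeholder. ElementTree supports only a subset of XPath (no text() or @attr steps), so two small helpers, first_text and first_attr (names invented here), stand in for extract_first().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; all values are placeholders.
SAMPLE_PAGE = """
<ol>
  <li>
    <div class="item">
      <div class="pic">
        <em>1</em>
        <a href="https://example.com/subject/1/">
          <img src="https://example.com/poster-1.jpg" alt="Example Title"/>
        </a>
      </div>
      <div class="info">
        <div class="bd">
          <div class="star"><span class="rating_num">9.0</span></div>
          <p class="quote"><span class="inq">An example quote.</span></p>
        </div>
      </div>
    </div>
  </li>
</ol>
"""


def first_text(node, path):
    """Return the text of the first match, or None (like extract_first())."""
    found = node.find(path)
    return found.text if found is not None else None


def first_attr(node, path, attr):
    """Return an attribute of the first match, or None if nothing matched."""
    found = node.find(path)
    return found.get(attr) if found is not None else None


def parse_items(markup):
    """Extract one dict per <div class="item">, mirroring the spider's fields."""
    root = ET.fromstring(markup.strip())
    movies = []
    for li in root.findall(".//div[@class='item']"):
        bd = "div[@class='info']/div[@class='bd']"
        movies.append({
            'rank': first_text(li, "div[@class='pic']/em"),
            'link': first_attr(li, "div[@class='pic']/a", 'href'),
            'img': first_attr(li, "div[@class='pic']/a/img", 'src'),
            'title': first_attr(li, "div[@class='pic']/a/img", 'alt'),
            'star': first_text(li, bd + "/div[@class='star']/span[@class='rating_num']"),
            'quote': first_text(li, bd + "/p[@class='quote']/span[@class='inq']"),
        })
    return movies


movies = parse_items(SAMPLE_PAGE)
```

Because the helpers return None when a path matches nothing, an entry without a quote still produces a record, which is the same behavior extract_first() gives the spider.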

For the full details, please download the source code.