
Scraping Douban Movie Information with Python 3

2017-12-17 00:00
Douban API
https://developers.douban.com/wiki/?title=movie_v2
Table of error codes returned by the API
http://blog.unvs.cn/archives/douban-oauth-2-0-error_code.html
The API limits requests to 40 per minute.

Exceeding the limit returns an error like this:

msg: "rate_limit_exceeded2: 1.85.33.69"
code: 112
request: "GET /v2/movie/26313744"
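
The error above can be detected in code so that a crawler backs off instead of continuing to send requests. The following is a minimal sketch, assuming the response body is JSON carrying the `code` and `msg` fields shown; the helper names `min_interval_seconds` and `is_rate_limited` are hypothetical and not part of any Douban client.

```python
import json

RATE_LIMIT_PER_MINUTE = 40   # limit stated above
RATE_LIMIT_CODE = 112        # error code shown in the sample response


def min_interval_seconds(limit_per_minute=RATE_LIMIT_PER_MINUTE):
    """Smallest pause between requests that stays within the limit."""
    return 60.0 / limit_per_minute


def is_rate_limited(body):
    """Return True if a response body looks like the rate-limit error above."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # not JSON, so not the error payload shown above
    return isinstance(payload, dict) and payload.get("code") == RATE_LIMIT_CODE


sample = ('{"msg": "rate_limit_exceeded2: 1.85.33.69", "code": 112, '
          '"request": "GET /v2/movie/26313744"}')
print(min_interval_seconds())   # 1.5
print(is_rate_limited(sample))  # True
```

At 40 requests per minute, the pause works out to 1.5 seconds between requests.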
URL to crawl

https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1&start=1

URL format

https://movie.douban.com/j/new_search_subjects?
sort=T
&range=0,10
&tags=%E7%94%B5%E5%BD%B1
&start=1

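sort=T sort by type
&range=0,10 rating range of the movies to select
&tags=%E7%94%B5%E5%BD%B1 the tag, here 电影 (movie), URL-encoded
&start=1 start index

Each request returns JSON data for 20 movies.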
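
The query string does not have to be assembled by hand; the standard library can do the percent-encoding. The following is a minimal sketch using `urllib.parse.urlencode`; the function name `build_search_url` and its parameter names are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://movie.douban.com/j/new_search_subjects'


def build_search_url(start, tag='电影', sort='T', rating_range='0,10'):
    """Build the search URL described above for a given start index."""
    query = {'sort': sort, 'range': rating_range, 'tags': tag, 'start': start}
    # safe=',' keeps the comma in range=0,10 unescaped, matching the URL above
    return SEARCH_ENDPOINT + '?' + urlencode(query, safe=',')


print(build_search_url(1))
# https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1&start=1
```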




Douban movie API

Returns the information for the movie with the given ID

https://api.douban.com/v2/movie/1295644




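Douban has anti-scraping measures, so it takes some thought to crawl all of this information...

First crawl the summary listing: 20 entries per page, fewer than 10,000 entries in total. Pause after every request (the code below sleeps 2 s) to avoid triggering the anti-scraping measures.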

import requests
import time

# {0} is the start offset; each page holds 20 movies
url_base = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1&start={0}'

all_page = 500
save_dir = 'D:/data/douban/movie_list'

for i in range(all_page):
    url = url_base.format(i * 20)
    print(url)
    text = requests.get(url).text
    path = save_dir + '/page_' + str(i) + '.json'
    with open(path, mode='w+', encoding='utf8') as f:
        f.write(text)
    # pause between requests to stay clear of the anti-scraping measures
    time.sleep(2)
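
A long crawl like this one will probably hit an occasional failed request. The following is a minimal sketch of retrying with an increasing delay, which is one common way to handle that and not something the original script does. The function `fetch_with_backoff` and the stand-in `flaky_fetch` are hypothetical; the fetcher is passed in so the retry logic can be exercised without touching the network.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay, 2*base_delay, ... and retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in fetcher that fails once, then succeeds.
calls = []
waits = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 2:
        raise IOError('temporary failure')
    return '{"data": []}'


# waits.append records the delay instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, 'https://example.invalid/page', sleep=waits.append)
print(result, waits)
```

In the crawl loop, the fetcher could be something like `lambda u: requests.get(u, timeout=10).text`.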

Merging the IDs

import csv
import json
import os

# Collect the id, title and URL of every movie

# Directory holding the page_*.json files saved by the previous step
movie_dir = 'D:/data/douban/movie_list'

movie_csv = []
for p in os.listdir(movie_dir):
    path = movie_dir + '/' + p
    with open(path, mode='r', encoding='utf8') as f:
        js = json.load(f)['data']
    for i in js:
        print(i)
        movie_csv.append(
            [i['id'], i['title'], i['url']]
        )

print(len(movie_csv))
# csv.writer quotes titles that contain commas
with open('movie_info.csv', mode='w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'title', 'url'])
    writer.writerows(movie_csv)
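
If the listing changes while a long crawl is running, the same movie could plausibly show up on two pages. That is an assumption, not something observed in the original run. The following is a minimal sketch of dropping repeated IDs before writing the CSV; `dedupe_by_id` is a hypothetical helper and the second title is a placeholder.

```python
def dedupe_by_id(rows):
    """Keep the first row for each movie id, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row[0] not in seen:
            seen.add(row[0])
            unique.append(row)
    return unique


rows = [
    ['1291546', '霸王别姬', 'https://movie.douban.com/subject/1291546/'],
    ['1295644', 'placeholder title', 'https://movie.douban.com/subject/1295644/'],
    ['1291546', '霸王别姬', 'https://movie.douban.com/subject/1291546/'],
]
unique_rows = dedupe_by_id(rows)
print(len(rows), len(unique_rows))  # 3 2
```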

Fetching the details of every movie

import time
import requests

# Load every movie id
movie_ids = []
with open('movie_info.csv', encoding='utf8') as f:
    f.readline()  # skip the header row
    for i in f.readlines():
        movie_ids.append(i.split(',')[0].strip())

print(len(movie_ids))

url_base = 'https://api.douban.com/v2/movie/{0}'
save_dir = 'd:/data/douban/movie_info'
for i in movie_ids:
    url = url_base.format(i)
    text = requests.get(url).text
    path = save_dir + '/m_' + str(i) + '.json'
    print(url)
    with open(path, mode='w+', encoding='utf8') as f:
        f.write(text)
    # the API allows 40 requests per minute, i.e. one every 1.5 s
    time.sleep(1.5)
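
With close to 10,000 IDs and a pause after each request, this loop runs for hours, so being able to resume after an interruption is useful. The following is a minimal sketch, assuming the `m_<id>.json` naming used above; `pending_ids` is a hypothetical helper that skips IDs whose file already exists and is not empty.

```python
import os
import tempfile


def pending_ids(movie_ids, save_dir, prefix='m_', suffix='.json'):
    """Return the ids whose output file does not exist yet (or is empty)."""
    todo = []
    for movie_id in movie_ids:
        path = os.path.join(save_dir, prefix + str(movie_id) + suffix)
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            todo.append(movie_id)
    return todo


# Demonstration in a temporary directory: one id is already downloaded.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'm_1291546.json'), 'w', encoding='utf8') as f:
        f.write('{}')
    remaining = pending_ids(['1291546', '1295644'], demo_dir)

print(remaining)  # ['1295644']
```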


Watch out for encoding: always pass encoding='utf8' when opening the files.
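
Without an explicit `encoding`, `open()` falls back to the locale's preferred encoding, which on a Simplified Chinese Windows system is typically GBK rather than UTF-8. The following is a minimal sketch of what goes wrong when the two are mixed; the file name is hypothetical.

```python
import os
import tempfile

title = '霸王别姬'

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, 'title.txt')
    # Write and read back with the same explicit encoding
    with open(path, mode='w', encoding='utf8') as f:
        f.write(title)
    with open(path, encoding='utf8') as f:
        round_tripped = f.read()
    with open(path, mode='rb') as f:
        raw = f.read()

print(round_tripped == title)  # True
print(len(raw))                # 12 (four characters, three bytes each in UTF-8)

# Decoding the same bytes as GBK does not give the title back
wrong = raw.decode('gbk', errors='replace')
print(wrong == title)          # False
```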

Crawling the movie detail pages

# https://movie.douban.com/subject/26378579/ 
import time
import requests

# Load every movie id
movie_ids = []
with open('movie_info.csv', encoding='utf8') as f:
    f.readline()  # skip the header row
    for i in f.readlines():
        movie_ids.append(i.split(',')[0].strip())

print(len(movie_ids))

url_base = 'https://movie.douban.com/subject/{0}/'
save_dir = 'd:/data/douban/movie_info_html'
for i in movie_ids:
    url = url_base.format(i)
    text = requests.get(url).text
    path = save_dir + '/m_' + str(i) + '.html'
    print(url)
    with open(path, mode='w+', encoding='utf8') as f:
        f.write(text)
    time.sleep(1)


Result from the Douban API



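Read a JSON file and pretty-print it. Passing ensure_ascii=False to json.dumps prints the Chinese text as-is instead of \uXXXX escape sequences.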

import json
with open('m_1291546.json', encoding='utf8') as f:
    js = json.load(f)
print(json.dumps(js, indent=2, ensure_ascii=False))

"D:\Program Files\py36\python3.exe" D:/code/pycharm/py36/db/t.py
{
  "rating": {
    "max": 10,
    "average": "9.5",
    "numRaters": 667821,
    "min": 0
  },
  "author": [
    {
      "name": "陈凯歌 Kaige Chen"
    }
  ],
  "alt_title": "再见,我的妾",
  "image": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p1910813120.jpg",
  "title": "霸王别姬",
  "summary": "段小楼(张丰毅)与程蝶衣(张国荣)是一对打小一起长大的师兄弟,两人一个演生,一个饰旦,一向配合天衣无缝,尤其一出《霸王别姬》,更是誉满京城,为此,两人约定合演一辈子《霸王别姬》。但两人对戏剧与人生关系的理解有本质不同,段小楼深知戏非人生,程蝶衣则是人戏不分。\n段小楼在认为该成家立业之时迎娶了名妓菊仙(巩俐),致使程蝶衣认定菊仙是可耻的第三者,使段小楼做了叛徒,自此,三人围绕一出《霸王别姬》生出的爱恨情仇战开始随着时代风云的变迁不断升级,终酿成悲剧。",
  "attrs": {
    "language": [
      "汉语普通话"
    ],
    "pubdate": [
      "1993-01-01(香港)"
    ],
    "title": [
      "霸王别姬"
    ],
    "country": [
      "中国大陆",
      "香港"
    ],
    "writer": [
      "芦苇 Wei Lu",
      "李碧华 Lillian Lee"
    ],
    "director": [
      "陈凯歌 Kaige Chen"
    ],
    "cast": [
      "张国荣 Leslie Cheung",
      "张丰毅 Fengyi Zhang",
      "巩俐 Li Gong",
      "葛优 You Ge",
      "英达 Da Ying",
      "蒋雯丽 Wenli Jiang",
      "吴大维 David Wu",
      "吕齐 Qi Lü",
      "雷汉 Han Lei",
      "尹治 Zhi Yin",
      "马明威 Mingwei Ma",
      "费振翔 Zhenxiang Fei",
      "智一桐 Yitong Zhi",
      "李春 Chun Li",
      "赵海龙 Hailong Zhao",
      "李丹 Dan Li",
      "童弟 Di Tong",
      "沈慧芬 Huifen Shen",
      "黄斐 Fei Huang"
    ],
    "movie_duration": [
      "171 分钟"
    ],
    "year": [
      "1993"
    ],
    "movie_type": [
      "剧情",
      "爱情",
      "同性"
    ]
  },
  "id": "https://api.douban.com/movie/1291546",
  "mobile_link": "https://m.douban.com/movie/subject/1291546/",
  "alt": "https://movie.douban.com/movie/1291546",
  "tags": [
    {
      "count": 119042,
      "name": "经典"
    },
    {
      "count": 60501,
      "name": "中国电影"
    },
    {
      "count": 57896,
      "name": "爱情"
    },
    {
      "count": 55252,
      "name": "文艺"
    },
    {
      "count": 52815,
      "name": "人性"
    },
    {
      "count": 48132,
      "name": "同志"
    },
    {
      "count": 42505,
      "name": "人生"
    },
    {
      "count": 32240,
      "name": "剧情"
    }
  ]
}

Process finished with exit code 0
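
Once the files are on disk, the interesting fields can be pulled out of each response. The following is a minimal sketch working on a trimmed copy of the output above; `summarize` is a hypothetical helper, and it assumes every response has the same shape as this one.

```python
import json

# Trimmed copy of the response shown above
sample = json.loads('''
{
  "rating": {"max": 10, "average": "9.5", "numRaters": 667821, "min": 0},
  "title": "霸王别姬",
  "attrs": {
    "year": ["1993"],
    "director": ["陈凯歌 Kaige Chen"],
    "movie_type": ["剧情", "爱情", "同性"]
  },
  "tags": [
    {"count": 119042, "name": "经典"},
    {"count": 60501, "name": "中国电影"}
  ]
}
''')


def summarize(movie):
    """Flatten one API response into a small dict of typed fields."""
    attrs = movie.get('attrs', {})
    tags = movie.get('tags', [])
    top_tag = max(tags, key=lambda t: t['count'], default=None)
    return {
        'title': movie['title'],
        # "average" arrives as a string, so convert it for numeric use
        'rating': float(movie['rating']['average']),
        'year': int(attrs['year'][0]) if attrs.get('year') else None,
        'genres': attrs.get('movie_type', []),
        'top_tag': top_tag['name'] if top_tag else None,
    }


summary = summarize(sample)
print(json.dumps(summary, indent=2, ensure_ascii=False))
```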