Scraping Douban Movie Information with Python 3
2017-12-17 00:00
Douban API
https://developers.douban.com/wiki/?title=movie_v2
Table of returned error codes
http://blog.unvs.cn/archives/douban-oauth-2-0-error_code.html
Requests are limited to 40 per minute.
Exceeding the limit returns an error like this:
msg | "rate_limit_exceeded2: 1.85.33.69" |
code | 112 |
request | "GET /v2/movie/26313744" |
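To stay under the 40-requests-per-minute limit, a crawler can space out its requests and check each response for error code 112. The following is only a minimal sketch, not documented Douban client behaviour: `min_interval` and `is_rate_limited` are hypothetical helper names, and the code assumes the error body is JSON with a numeric `code` field, as in the response shown above.

```python
import json

RATE_LIMIT_CODE = 112  # error code from the sample response above


def min_interval(requests_per_minute=40):
    """Smallest delay in seconds between requests that stays within the limit."""
    return 60.0 / requests_per_minute


def is_rate_limited(body_text):
    """Return True if a response body looks like the rate-limit error."""
    try:
        data = json.loads(body_text)
    except ValueError:
        return False  # not JSON (for example an HTML page), so not this error
    return isinstance(data, dict) and data.get('code') == RATE_LIMIT_CODE


if __name__ == '__main__':
    sample = ('{"msg": "rate_limit_exceeded2: 1.85.33.69", '
              '"code": 112, "request": "GET /v2/movie/26313744"}')
    print(min_interval())
    print(is_rate_limited(sample))
```

In a download loop, one option is to sleep for a minute and retry the same ID whenever `is_rate_limited` returns True, instead of saving the error body to disk.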
Search URL to crawl
https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1&start=1
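URL format
https://movie.douban.com/j/new_search_subjects?
    sort=T
    &range=0,10
    &tags=%E7%94%B5%E5%BD%B1
    &start=1
sort=T: sort by type
&range=0,10: rating range of the movies to select
&tags=%E7%94%B5%E5%BD%B1: the tag is 电影 ("movie"), URL-encoded
&start=1: starting index
Each request returns JSON data for 20 movies.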
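The same URL can be assembled from its parameters instead of being written out by hand, which makes it easier to change the tag or the offset. This is a small sketch using the standard library; `build_search_url` and its parameter names are made up for illustration, and the parameter meanings are those given in the description above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://movie.douban.com/j/new_search_subjects'


def build_search_url(start, tag='电影', rating_range='0,10', sort='T'):
    """Build the search URL for one page of results (hypothetical helper)."""
    params = {
        'sort': sort,
        'range': rating_range,
        'tags': tag,      # percent-encoded as UTF-8 by urlencode
        'start': start,   # index of the first movie on the page
    }
    # safe=',' keeps the comma in range=0,10 unescaped, as in the URL above
    return SEARCH_ENDPOINT + '?' + urlencode(params, safe=',')


if __name__ == '__main__':
    for page in range(3):
        print(build_search_url(page * 20))
```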
Douban movie API
Returns the information for the movie with the given ID
https://api.douban.com/v2/movie/1295644
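Because Douban has anti-crawler measures, some thought is needed on how to crawl all of this information.
Crawl the summary listings first: each page holds 20 entries, and there are just under 10,000 entries in total. Pause for at least 1 s after every request so the anti-crawler mechanism is not triggered.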
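import os
import time

import requests

# Search endpoint; {0} is the start offset
url_base = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1&start={0}'
all_page = 500  # 500 pages x 20 entries per page
save_dir = 'D:/data/douban/movie_list'
os.makedirs(save_dir, exist_ok=True)  # make sure the output directory exists

for i in range(all_page):
    url = url_base.format(i * 20)
    print(url)
    text = requests.get(url).text
    path = save_dir + '/page_' + str(i) + '.json'
    # Save the raw JSON of each page as UTF-8
    with open(path, mode='w+', encoding='utf8') as f:
        f.write(text)
    time.sleep(2)  # pause between requests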
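Merging the IDs
import csv
import json
import os

# Collect the id, title and URL of every movie from the saved list pages
movie_dir = 'd:/data/douban/movie_list'  # directory written by the list crawl above
movie_csv = []
for p in os.listdir(movie_dir):
    path = movie_dir + '/' + p
    with open(path, mode='r', encoding='utf8') as f:
        js = json.load(f)['data']
    for i in js:
        print(i)
        movie_csv.append([i['id'], i['title'], i['url']])
print(len(movie_csv))

# csv.writer quotes titles that contain commas, so the columns stay aligned
with open('movie_info.csv', mode='w', encoding='utf8', newline='') as f:
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(['id', 'title', 'url'])
    writer.writerows(movie_csv)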
Fetching the details of every movie
import os
import time

import requests

# Read all movie IDs (first CSV column), skipping the header line
movie_ids = []
with open('movie_info.csv', encoding='utf8') as f:
    f.readline()
    for i in f.readlines():
        movie_ids.append(i.split(',')[0].strip())
print(len(movie_ids))

url_base = 'https://api.douban.com/v2/movie/{0}'
save_dir = 'd:/data/douban/movie_info'
os.makedirs(save_dir, exist_ok=True)  # make sure the output directory exists

for i in movie_ids:
    url = url_base.format(i)
    text = requests.get(url).text
    path = save_dir + '/m_' + str(i) + '.json'
    print(url)
    # One JSON file per movie, saved as UTF-8
    with open(path, mode='w+', encoding='utf8') as f:
        f.write(text)
    time.sleep(1)  # pause between requests
Watch out for encoding issues (the files here are written as UTF-8).
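The encoding note matters most on Windows, where `open()` without an `encoding` argument falls back to the locale's code page (often GBK on Chinese systems) rather than UTF-8. The sketch below shows a round trip with explicit encodings and what happens when the same bytes are decoded with the wrong codec; the helper names and the demo file name are illustrative only. On the HTTP side, requests also lets you set `response.encoding` before reading `response.text` if the guessed encoding turns out to be wrong.

```python
import os
import tempfile


def save_text(path, text):
    """Write text to disk as UTF-8, regardless of the system locale."""
    with open(path, mode='w', encoding='utf8') as f:
        f.write(text)


def load_text(path):
    """Read a UTF-8 text file back into a str."""
    with open(path, mode='r', encoding='utf8') as f:
        return f.read()


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'm_demo.json')
    save_text(path, '{"title": "霸王别姬"}')
    print(load_text(path))

    # The same bytes decoded with the wrong codec come out garbled
    raw = '霸王别姬'.encode('utf8')
    print(raw.decode('gbk', errors='replace'))
```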
Crawling the movie detail pages
# Example page: https://movie.douban.com/subject/26378579/
import os
import time

import requests

# Read all movie IDs (first CSV column), skipping the header line
movie_ids = []
with open('movie_info.csv', encoding='utf8') as f:
    f.readline()
    for i in f.readlines():
        movie_ids.append(i.split(',')[0].strip())
print(len(movie_ids))

url_base = 'https://movie.douban.com/subject/{0}/'
save_dir = 'd:/data/douban/movie_info_html'
os.makedirs(save_dir, exist_ok=True)  # make sure the output directory exists

for i in movie_ids:
    url = url_base.format(i)
    text = requests.get(url).text
    path = save_dir + '/m_' + str(i) + '.html'
    print(url)
    # One HTML file per movie, saved as UTF-8
    with open(path, mode='w+', encoding='utf8') as f:
        f.write(text)
    time.sleep(1)  # pause between requests
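With close to 10,000 IDs and a pause after every request, the download loops above run for hours, and any interruption means starting over. One way to make them resumable is to skip IDs whose output file already exists. This is a sketch under that assumption; `pending_ids` is a hypothetical helper, and it only follows the `m_<id>` file naming used above.

```python
import os
import tempfile


def pending_ids(movie_ids, save_dir, prefix='m_', suffix='.json'):
    """Return the IDs whose output file is missing or empty."""
    todo = []
    for movie_id in movie_ids:
        path = os.path.join(save_dir, prefix + str(movie_id) + suffix)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            todo.append(movie_id)
    return todo


if __name__ == '__main__':
    # Demo in a temporary directory: one finished file, two IDs still to fetch
    demo_dir = tempfile.mkdtemp()
    with open(os.path.join(demo_dir, 'm_1291546.json'), 'w', encoding='utf8') as f:
        f.write('{}')
    print(pending_ids(['1291546', '1295644', '26378579'], demo_dir))
```

For the HTML crawl, pass `suffix='.html'` and iterate over the returned list instead of `movie_ids`.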
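Results from the Douban API
Read a JSON file and pretty-print it. Passing ensure_ascii=False to json.dumps prints Chinese characters as they are instead of as \uXXXX escapes.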
import json

# Open with an explicit encoding; the file was saved as UTF-8
with open('m_1291546.json', encoding='utf8') as f:
    js = json.load(f)
# ensure_ascii=False keeps Chinese characters readable in the output
print(json.dumps(js, indent=2, ensure_ascii=False))

"D:\Program Files\py36\python3.exe" D:/code/pycharm/py36/db/t.py
{
  "rating": {
    "max": 10,
    "average": "9.5",
    "numRaters": 667821,
    "min": 0
  },
  "author": [
    {
      "name": "陈凯歌 Kaige Chen"
    }
  ],
  "alt_title": "再见,我的妾",
  "image": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p1910813120.jpg",
  "title": "霸王别姬",
  "summary": "段小楼(张丰毅)与程蝶衣(张国荣)是一对打小一起长大的师兄弟,两人一个演生,一个饰旦,一向配合天衣无缝,尤其一出《霸王别姬》,更是誉满京城,为此,两人约定合演一辈子《霸王别姬》。但两人对戏剧与人生关系的理解有本质不同,段小楼深知戏非人生,程蝶衣则是人戏不分。\n段小楼在认为该成家立业之时迎娶了名妓菊仙(巩俐),致使程蝶衣认定菊仙是可耻的第三者,使段小楼做了叛徒,自此,三人围绕一出《霸王别姬》生出的爱恨情仇战开始随着时代风云的变迁不断升级,终酿成悲剧。",
  "attrs": {
    "language": [
      "汉语普通话"
    ],
    "pubdate": [
      "1993-01-01(香港)"
    ],
    "title": [
      "霸王别姬"
    ],
    "country": [
      "中国大陆",
      "香港"
    ],
    "writer": [
      "芦苇 Wei Lu",
      "李碧华 Lillian Lee"
    ],
    "director": [
      "陈凯歌 Kaige Chen"
    ],
    "cast": [
      "张国荣 Leslie Cheung",
      "张丰毅 Fengyi Zhang",
      "巩俐 Li Gong",
      "葛优 You Ge",
      "英达 Da Ying",
      "蒋雯丽 Wenli Jiang",
      "吴大维 David Wu",
      "吕齐 Qi Lü",
      "雷汉 Han Lei",
      "尹治 Zhi Yin",
      "马明威 Mingwei Ma",
      "费振翔 Zhenxiang Fei",
      "智一桐 Yitong Zhi",
      "李春 Chun Li",
      "赵海龙 Hailong Zhao",
      "李丹 Dan Li",
      "童弟 Di Tong",
      "沈慧芬 Huifen Shen",
      "黄斐 Fei Huang"
    ],
    "movie_duration": [
      "171 分钟"
    ],
    "year": [
      "1993"
    ],
    "movie_type": [
      "剧情",
      "爱情",
      "同性"
    ]
  },
  "id": "https://api.douban.com/movie/1291546",
  "mobile_link": "https://m.douban.com/movie/subject/1291546/",
  "alt": "https://movie.douban.com/movie/1291546",
  "tags": [
    {
      "count": 119042,
      "name": "经典"
    },
    {
      "count": 60501,
      "name": "中国电影"
    },
    {
      "count": 57896,
      "name": "爱情"
    },
    {
      "count": 55252,
      "name": "文艺"
    },
    {
      "count": 52815,
      "name": "人性"
    },
    {
      "count": 48132,
      "name": "同志"
    },
    {
      "count": 42505,
      "name": "人生"
    },
    {
      "count": 32240,
      "name": "剧情"
    }
  ]
}
Process finished with exit code 0