
Python web-scraping practice for beginners: scraping the Maoyan Movies TOP100 board with BeautifulSoup and saving it to a local Excel file with pandas

2019-07-11 23:13

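Previous article: [Python web-scraping practice for beginners] Scraping the Maoyan Movies TOP100 board with regular expressions and saving it to a local Excel file with openpyxl
Matching page content with regular expressions, as in the previous article, is rather tedious: any slip in a pattern means nothing matches at all. This article introduces a simpler and more widely used way to do the matching: BeautifulSoup.
First, install BeautifulSoup: pip install beautifulsoup4
Python ships with a standard parser (html.parser), but lxml is the better choice because it runs faster and is more tolerant of malformed documents. Install it with: pip install lxml
Import the class with: from bs4 import BeautifulSoup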
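If the script may run on a machine where lxml is not installed, one option is to fall back to the built-in parser. The helper name make_soup below is hypothetical and is only a sketch of that idea. bs4 raises FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    """Parse markup with lxml when available, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<dd><p class='name'>demo</p></dd>")
print(soup.find("p", class_="name").get_text())
```

Both parsers handle a small, well-formed fragment like this the same way. They can differ on badly broken HTML, so results on real pages are worth checking.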
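As before, we start by fetching the page source.

Each film's details sit inside a <dd></dd> element. We locate each field directly by its tag, and tell identical tags apart by adding an attribute condition: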

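from bs4 import BeautifulSoup

data = []  # one dict per film, shared by every call to page()

def page(text):
    soup = BeautifulSoup(text, 'lxml')
    # Each <dd> holds one film entry.
    for s in soup.find_all('dd'):
        # The keys double as Excel column headers: rank, title, cast, release date, score.
        # [3:] drops the "主演：" label and [5:] drops the "上映时间：" label.
        datalist = {'排名': s.find('i').get_text(), '电影名称': s.find('p', class_="name").get_text(),
                    '主演': s.find('p', class_="star").get_text().strip()[3:],
                    '上映时间': s.find('p', class_="releasetime").get_text().strip()[5:],
                    '评分': s.find('p', class_="score").get_text()}
        data.append(datalist)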
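To see what these lookups return without touching the network, here is a small sketch that runs the same find calls on a hand-written fragment. The fragment, including its class names on the <i> tags, is a hypothetical stand-in shaped after the selectors above, not a copy of the live page. html.parser is used instead of lxml only so that the sketch runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; the real page may differ.
SAMPLE = """
<dl>
  <dd>
    <i class="board-index">1</i>
    <p class="name">Sample Film</p>
    <p class="star">
        主演：Actor A,Actor B
    </p>
    <p class="releasetime">上映时间：1993-01-01</p>
    <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
  </dd>
</dl>
"""

def parse(markup):
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for dd in soup.find_all("dd"):
        rows.append({
            "排名": dd.find("i").get_text(),  # first <i> inside the entry
            "电影名称": dd.find("p", class_="name").get_text(),
            # strip() removes surrounding whitespace, then the slice drops the label
            "主演": dd.find("p", class_="star").get_text().strip()[3:],
            "上映时间": dd.find("p", class_="releasetime").get_text().strip()[5:],
            # get_text() joins the text of the nested <i> tags
            "评分": dd.find("p", class_="score").get_text(),
        })
    return rows

rows = parse(SAMPLE)
print(rows)
```

The fixed slices assume the labels are always present. If an entry lacks one, the slice silently removes the first characters of the real value.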

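Much clearer, isn't it? That takes care of the matching step. Now for the export. The previous article wrote the data to a local Excel file with openpyxl; here is another way, pandas, which is better suited to large data sets. Install it with: pip install pandas (writing .xlsx files from pandas still needs an Excel engine such as openpyxl to be installed).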

import pandas

def writeexcel(localfile, text):
    # '评分' (score) is listed as well: keys left out of columns are dropped from the sheet.
    pandas.DataFrame(text, columns=['排名', '电影名称', '主演', '上映时间', '评分']).to_excel(localfile, index=False)

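One line and it is done. A note on index=False: by default the exported file gets a row-number column as its first column, and passing index=False turns that off.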
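The effect of columns and index=False can be checked without writing a file. The sketch below uses made-up rows, and it calls to_csv only because its result can be printed directly. to_excel accepts the same index parameter.

```python
import pandas

# Made-up rows for illustration.
rows = [
    {"排名": "1", "电影名称": "Sample Film", "评分": "9.5"},
    {"排名": "2", "电影名称": "Another Film", "评分": "9.4"},
]

# Keys missing from `columns` are dropped; the list also fixes the column order.
frame = pandas.DataFrame(rows, columns=["排名", "电影名称"])
print(list(frame.columns))

with_index = frame.to_csv()                 # first column is the row index
without_index = frame.to_csv(index=False)   # index column suppressed
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

Note that "评分" does not survive here because it is not named in columns, so any field that should reach the sheet has to appear in that list.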
Here is the complete code:

import requests
import threading
from bs4 import BeautifulSoup
import pandas

def login(url):
    # Fetch one page; returns the decoded HTML, or None if the request failed.
    req = requests.get(url)
    if req.ok:
        return req.content.decode('utf-8')
    else:
        return None

data = []  # shared result list, filled by page()

def page(text):
    soup = BeautifulSoup(text, 'lxml')
    # Each <dd> holds one film entry.
    for s in soup.find_all('dd'):
        datalist = {'排名': s.find('i').get_text(), '电影名称': s.find('p', class_="name").get_text(),
                    '主演': s.find('p', class_="star").get_text().strip()[3:],
                    '上映时间': s.find('p', class_="releasetime").get_text().strip()[5:],
                    '评分': s.find('p', class_="score").get_text()}
        data.append(datalist)

def writeexcel(localfile, text):
    # '评分' (score) is listed as well: keys left out of columns are dropped from the sheet.
    pandas.DataFrame(text, columns=['排名', '电影名称', '主演', '上映时间', '评分']).to_excel(localfile, index=False)

def main(i):
    # Each board page lists 10 films; offset selects the page.
    html = login('https://maoyan.com/board/4?offset=' + str(10 * i))
    if html is not None:  # skip pages that failed to download
        page(html)

if __name__ == '__main__':
    Threadtest = [threading.Thread(target=main, args=[i, ]) for i in range(10)]
    for t in Threadtest:
        t.start()
        # Joining right after start() fetches the pages one at a time,
        # which keeps the rows in ranking order.
        t.join()
    excelpath = 'C:\\Users\\ll\\Desktop\\film.xlsx'
    writeexcel(excelpath, data)
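If join() is called right after each start(), as the loop above reads, the ten pages are downloaded one after another and the threads add no speed. Should real concurrency be wanted, one possible variant uses ThreadPoolExecutor.map from the standard library, which runs the calls in parallel but hands the results back in input order. In this sketch fetch_page is a hypothetical stand-in for login() plus page() that fabricates ranks instead of touching the network, and crawl and workers are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(i):
    """Hypothetical stand-in for login() + page(): returns the rows of page i.

    It fabricates ranks only and makes no network request.
    """
    return [{"排名": str(10 * i + k)} for k in range(1, 11)]

def crawl(fetch, pages=10, workers=5):
    # map() runs fetch concurrently but yields results in input order,
    # so no shared list is needed and the ranks stay sorted.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_page = pool.map(fetch, range(pages))
        return [row for rows in per_page for row in rows]

rows = crawl(fetch_page)
print(len(rows), rows[0]["排名"], rows[-1]["排名"])
```

Returning rows from each call, instead of appending to a module-level list, also removes the dependence on the order in which the threads happen to finish.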
