
Python Crawlers (15)_Case Study: A Crawler Using bs4

2017-12-19 20:27

This chapter works through a Python example: a simple crawler built with bs4. For more material, see the Python Learning Guide (Python学习指南).


Case study: a crawler using BeautifulSoup

We use the Tencent social recruitment page for the demonstration: http://hr.tencent.com/position.php?&start=10#a
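To see what the demo URL is made of, the sketch below splits it into its parts with the Python 3 standard library (`urllib.parse`; on Python 2 the same functions live in the `urlparse` module). The helper names `describe` and `with_start` are hypothetical, and treating `start` as a paging offset is an assumption that the page itself would have to confirm.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

DEMO_URL = "http://hr.tencent.com/position.php?&start=10#a"


def describe(url):
    """Split a URL into host, path, query parameters, and fragment."""
    parts = urlsplit(url)
    return {
        "host": parts.netloc,
        "path": parts.path,
        # parse_qs skips the empty pair produced by the leading "&"
        "query": parse_qs(parts.query),
        # the fragment ("#a") is handled by the browser, not sent to the server
        "fragment": parts.fragment,
    }


def with_start(url, start):
    """Return the URL with its `start` parameter replaced (hypothetical helper)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["start"] = [str(start)]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path,
         urlencode(query, doseq=True), parts.fragment)
    )


if __name__ == "__main__":
    print(describe(DEMO_URL))
    print(with_start(DEMO_URL, 20))
```

If `start` does turn out to be an offset, calling `with_start` in a loop would be one way to visit further pages of the listing.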



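Using BeautifulSoup4 as the parser, we extract and store each position's name, category, number of openings, work location, and publish time, together with the link to its detail page.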
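The extraction step can be tried offline first. The following is a minimal sketch that runs the same kind of CSS selectors against a small inline table. The markup in `SAMPLE_HTML`, the `parse_rows` helper, and the `href` values are hypothetical stand-ins for the real page, and the sketch uses the built-in `html.parser` backend so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

BASE_URL = "http://hr.tencent.com/"

# Hypothetical markup imitating two rows of the listing table.
SAMPLE_HTML = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2017-12-19</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Manager</a></td>
    <td>Product</td><td>1</td><td>Beijing</td><td>2017-12-18</td>
  </tr>
</table>
"""


def parse_rows(html_text):
    """Collect one dict per table row whose class is "even" or "odd"."""
    soup = BeautifulSoup(html_text, "html.parser")
    # All "even" rows come first, followed by all "odd" rows.
    rows = soup.select('tr[class="even"]') + soup.select('tr[class="odd"]')
    items = []
    for row in rows:
        cells = row.select("td")
        link = row.select("td a")[0]
        items.append({
            "name": link.get_text(),
            "detailLink": BASE_URL + link.attrs["href"],
            "catalog": cells[1].get_text(),
            "recruitNumber": cells[2].get_text(),
            "workLocation": cells[3].get_text(),
            "publishTime": cells[4].get_text(),
        })
    return items


if __name__ == "__main__":
    for item in parse_rows(SAMPLE_HTML):
        print(item)
```

Because the two selectors are queried separately, the result lists every "even" row before any "odd" row rather than following page order.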

# -*- coding: utf-8 -*-
# Python 2 script (it relies on urllib2)

from bs4 import BeautifulSoup

import urllib2
import json   # results are stored as JSON

def tencent():
    url = "http://hr.tencent.com/"

    request = urllib2.Request(url + "position.php?&start=10#a")
    response = urllib2.urlopen(request)

    resHtml = response.read()

    html = BeautifulSoup(resHtml, 'lxml')

    # CSS selectors: the listing rows carry the class "even" or "odd"
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2

    print(result)   # debug output: the matched rows
    items = []
    for site in result:
        item = {}

        # The first cell holds the position name and the link to its detail page
        name = site.select('td a')[0].get_text()
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()

        item['name'] = name
        item['detailLink'] = url + detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation
        item['publishTime'] = publishTime

        items.append(item)

    # Disable ASCII escaping so the Chinese text is stored as UTF-8
    line = json.dumps(items, ensure_ascii=False)

    with open('tencent.json', 'w') as output:
        output.write(line.encode('utf-8'))

if __name__ == '__main__':
    tencent()
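The effect of `ensure_ascii=False` in the listing is easiest to see side by side. The sketch below is written for Python 3, where `json.dumps` returns text and the file is opened with an explicit encoding instead of calling `.encode('utf-8')`. The sample record and the `save_items` helper are made up for illustration.

```python
import json

# Made-up record standing in for one scraped row.
items = [{"name": "后台开发工程师", "workLocation": "深圳"}]

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(items)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(items, ensure_ascii=False)


def save_items(records, path):
    """Write the records to `path` as human-readable UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as output:
        output.write(json.dumps(records, ensure_ascii=False))


if __name__ == "__main__":
    print(escaped)
    print(readable)
    save_items(items, "tencent.json")
```

Both strings decode back to the same data, so the option only changes how readable the stored file is, not what it contains.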