
Book recommendation: Web Scraping with Python (《Python网络数据采集》)

2017-04-19 20:28
I've recently been learning to collect data with Python web crawlers, and I came across a really good book, Web Scraping with Python (《Python网络数据采集》), which I recommend to everyone. Along the way I'll also write some small crawler programs; feel free to leave a comment and discuss.

Case study: to find a Python internship, I used a crawler to collect data. The script below walks the first nine pages of Python internship listings on shixiseng.com, extracts each posting's title, location, company, salary, and detail-page link, fetches the full job description from every detail page, and writes everything to an Excel file with xlwt.

import os

import requests
import xlwt
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from lxml import etree

# Pick a random browser User-Agent so the requests look less like a script.
ua = UserAgent()
headers = {'User-Agent': ua.random}  # the original quoted 'ua.random' and sent the literal string

# Parallel lists, one entry per listing.
job = []
location = []
company = []
salary = []
link = []

# Walk the first nine result pages of the Python internship search.
for k in range(1, 10):
    url = 'http://www.shixiseng.com/interns?k=python&p=' + str(k)
    r = requests.get(url, headers=headers).text
    s = etree.HTML(r)

    # Pull the listing fields out of the search-result page.
    job1 = s.xpath('//a/h3/text()')
    location1 = s.xpath('//span/span/text()')
    company1 = s.xpath('//p/a/text()')
    salary1 = s.xpath('//span[contains(@class,"money_box")]/text()')
    link1 = s.xpath('//div[@class="job_head"]/a/@href')

    # The hrefs are relative, so prepend the site root.
    for i in link1:
        link.append('http://www.shixiseng.com' + i)

    # The money_box span yields alternating whitespace and value nodes;
    # keep every second node and strip the stray newlines.
    salary11 = salary1[1::2]
    for i in salary11:
        salary.append(i.replace('\n\n', ''))

    job.extend(job1)
    location.extend(location1)
    company.extend(company1)
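    # (My addition, not in the original post:) to be gentle on the site, it is
    # worth pausing between page requests here, e.g. time.sleep(1) with
    # `import time` added at the top.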

# Visit every detail page and grab the full job description.
detail = []
for i in link:
    r = requests.get(i, headers=headers).text
    soup = BeautifulSoup(r, 'lxml')
    word = soup.find_all(class_="dec_content")
    for w in word:  # renamed from `i`, which shadowed the loop variable above
        detail.append(w.get_text())

# Write one listing per row into an Excel sheet with xlwt.
book = xlwt.Workbook()
sheet = book.add_sheet('sheet', cell_overwrite_ok=True)
path = 'D:\\Pycharm\\spider'
os.chdir(path)

j = 0
for i in range(len(job)):
    try:
        sheet.write(i + 1, j, job[i])
        sheet.write(i + 1, j + 1, location[i])
        sheet.write(i + 1, j + 2, company[i])
        sheet.write(i + 1, j + 3, salary[i])
        sheet.write(i + 1, j + 4, link[i])
        sheet.write(i + 1, j + 5, detail[i])
    except Exception as e:
        print('Exception: ' + str(e))
        continue

book.save('d:\\python.xls')
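
One fragile spot in the script above: the six parallel lists only stay aligned if every XPath matches exactly once per listing, so a single missed selector silently shifts all later rows. As an alternative for the Excel-writing step (a minimal sketch of my own, not from the original post; it assumes pandas and openpyxl are installed, and 'python.xlsx' is an illustrative filename), you could truncate the lists to a common length and let pandas write a labeled header row:

import pandas as pd  # assumption: pandas + openpyxl available

# Truncate to the shortest list so a missed XPath match cannot read past
# the end of another list and misalign columns.
n = min(len(job), len(location), len(company), len(salary), len(link), len(detail))
df = pd.DataFrame({
    'job': job[:n],
    'location': location[:n],
    'company': company[:n],
    'salary': salary[:n],
    'link': link[:n],
    'detail': detail[:n],
})
df.to_excel('python.xlsx', index=False)  # header row written automatically

This drops any trailing unmatched entries instead of writing them into the wrong columns, which is usually the safer failure mode for scraped data.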






