
Python 3 Web Scraping in Practice (3)

2017-04-07 20:07

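On a whim today I scraped the live job postings from a recruitment site. The address is http://xiaoyuan.zhaopin.com

The filters I chose were Beijing, intern, and computer software. After analyzing the page source the task turned out to be fairly easy, but I ran into quite a few small problems along the way, which I share here.

The target is form-like content: rows of fields held in the li elements of a ul.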

After parsing the page with BeautifulSoup, tem_ul = bsoj.find("ul",{"class":"terminal-ul clearfix"}) stores the whole ul element.

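To pull every li element out of tem_ul, use lis = str(tem_ul("li")). Calling tem_ul("li") is shorthand for tem_ul.find_all("li") and returns a list-like ResultSet; str() then turns it into a single string, so lis is a string rather than a list.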

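lisoup = BeautifulSoup(lis,'lxml') then parses lis again. Inspecting the markup shows that the text before each colon sits in li.span and the text after it in li.strong, but the two are interleaved: a strong may itself contain a span, for example.

That nesting is what misaligns the data. Once the cause is clear, del list[i] removes the (i+1)-th element of the list.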

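    spans = lisoup.find_all("span")
    del spans[3]  # drop the 4th span, the one nested inside a strong
    strongs = lisoup.find_all("strong")
    # pair each label (span) with its value (strong)
    d = dict(zip(spans, strongs))
    # note: dict only guarantees insertion order on Python 3.7+;
    # on older versions the rows may be written out of order
    for item1, item2 in d.items():
        fp.write(item1.get_text())
        fp.write(item2.get_text() + "\n")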

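As shown above, the span and strong elements are collected into two lists, the lists are paired up, and the pairs are written to the file. A check of the output confirms the format is correct.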
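The hard-coded index in del spans[3] only works while the nested span stays in the fourth position. As a minimal sketch of an alternative, the label and value can be paired inside each li instead, so no index has to be deleted. The sample markup, its labels, and the function name pair_fields are made up for illustration (the real page will differ), and the built-in "html.parser" backend is used here only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described above:
# the third value wraps its text in an extra span inside the strong.
SAMPLE = """
<ul class="terminal-ul clearfix">
  <li><span>Salary:</span><strong>Negotiable</strong></li>
  <li><span>Location:</span><strong>Beijing</strong></li>
  <li><span>Posted:</span><strong><span>2017-04-07</span></strong></li>
  <li><span>Type:</span><strong>Internship</strong></li>
</ul>
"""


def pair_fields(html):
    """Return (label, value) tuples, one per li that has both parts."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", {"class": "terminal-ul clearfix"})
    pairs = []
    for li in ul.find_all("li"):
        label = li.find("span")    # first span in the li is the label
        value = li.find("strong")  # the strong holds the value
        if label is None or value is None:
            continue  # skip rows that do not follow the pattern
        pairs.append((label.get_text(strip=True), value.get_text(strip=True)))
    return pairs


for label, value in pair_fields(SAMPLE):
    print(label, value)
```

Because li.find("span") returns the first span in document order, the span nested inside the strong is never mistaken for a label, provided the label always comes before the value as it does in this sample.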

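Other information of interest, such as the company profile, can be obtained in much the same way.

Having covered how to extract the details of a single job, the next step is to collect every job on a listing page and open each corresponding page to extract its details.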

The a element of each link looks like this: <a joburl="" href="http://xiaoyuan.zhaopin.com/other/page?r=http://jobs.zhaopin.com/135232279253833.htm"...(rest omitted, not relevant)>

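This is a typical wrapped URL: the real address is carried after ?r=, so that is where we split. What we actually need to open is the URL after r=. Opening the wrapper link directly returns a page that is loaded by JavaScript, so the fetched content differs from the page source. Extracting the trailing URL is easy.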

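def deal_page(lists):
    urls = []
    for item in lists:
        # the real job URL follows "?r=" in the href; split once and keep the last part
        # (falls back to the whole href when "?r=" is absent)
        tem = item.a.attrs["href"].split("?r=", 1)
        urls.append(tem[-1])
    return urls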

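Pass in the list of elements that wrap the a links on the page. Each href is split and the result is stored in a new list, which is the list of URLs we need to open and extract information from.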
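Splitting on the literal "?r=" works for the link shown above. As a hedged alternative, the standard library's urllib.parse can read the r parameter directly, which also copes with a percent-encoded target. The helper name extract_target is an assumption for this sketch and is not part of the original script.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target(href, param="r"):
    """Return the URL carried in the `param` query field, or href itself if absent."""
    query = urlsplit(href).query          # text after the first "?"
    values = parse_qs(query).get(param)   # list of decoded values, or None
    return values[0] if values else href


wrapped = ("http://xiaoyuan.zhaopin.com/other/page"
           "?r=http://jobs.zhaopin.com/135232279253833.htm")
print(extract_target(wrapped))
```

One trade-off to be aware of: if the embedded URL contains its own unencoded "&", parse_qs cuts the value at that point, whereas split("?r=") keeps everything after the marker.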

Full code:

import os
# import the submodule explicitly: a plain "import urllib" does not
# guarantee that urllib.request is loaded
import urllib.request

from bs4 import BeautifulSoup

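targetDir = r'D:\temp\zhaopin'

def destPath(name):
    # makedirs also creates missing parent folders and tolerates an existing one
    os.makedirs(targetDir, exist_ok=True)
    # job titles may contain characters that are not allowed in file names
    for ch in '\\/:*?"<>|':
        name = name.replace(ch, "_")
    t = os.path.join(targetDir, "%s.txt" % name.strip())
    return t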

def get_url(url):
    # fetch the listing page and return the p elements that wrap each job link
    page = urllib.request.urlopen(url).read()
    bsoj = BeautifulSoup(page, 'lxml')
    lists = bsoj.find_all("p", {"class": "searchResultJobName clearfix"})
    return lists

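def deal_page(lists):
    urls = []
    for item in lists:
        # the real job URL follows "?r=" in the href; split once and keep the last part
        # (falls back to the whole href when "?r=" is absent)
        tem = item.a.attrs["href"].split("?r=", 1)
        urls.append(tem[-1])
    return urls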

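def file_deal(url):
    # send a browser-like User-Agent with the request for the detail page
    headers = {
        "User-agent": 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0',
    }
    req = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(req).read().decode('utf-8')
    bsoj = BeautifulSoup(page, 'lxml')
    # check the container before touching .h1; a missing div would
    # otherwise raise AttributeError before any None check could run
    title_div = bsoj.find("div", {"class": "inner-left fl"})
    if title_div is None or title_div.h1 is None:
        print("目标为空")  # "target is empty"
        return
    file_name = title_div.h1.get_text()
    print(file_name)
    addr = destPath(file_name)
    # the with block closes the file even if a later lookup fails
    with open(addr, 'a', encoding="utf-8") as fp:
        fp.write("职位描述\n")  # heading: job description
        tem_ul = bsoj.find("ul", {"class": "terminal-ul clearfix"})
        # re-parse only the li elements so spans and strongs can be collected separately
        lis = str(tem_ul("li"))
        lisoup = BeautifulSoup(lis, 'lxml')
        spans = lisoup.find_all("span")
        del spans[3]  # drop the span nested inside a strong so labels and values line up
        strongs = lisoup.find_all("strong")
        # dict only guarantees insertion order on Python 3.7+
        d = dict(zip(spans, strongs))
        for item1, item2 in d.items():
            fp.write(item1.get_text())
            fp.write(item2.get_text() + "\n")
        detail = bsoj.find("div", {"class": "tab-inner-cont"})
        fp.write(detail.get_text())
        fp.write("公司介绍\n")  # heading: company profile
        tem_ul2 = bsoj.find("div", {"class": "company-box"})
        fp.write(tem_ul2.get_text() + "\n")
        tem_p = bsoj.find("div", {"style": "display:none;word-wrap:break-word;"})
        fp.write(tem_p.get_text() + "\n")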



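# get_url returns the p elements wrapping each link on the first listing page;
# deal_page turns them into the detail-page URLs
urls = get_url("http://xiaoyuan.zhaopin.com/part/industry/160400/530_299_0_0_-1_0_1_0")
pages = deal_page(urls)
for page in pages:
    file_deal(page)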

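This stores the extracted information in one folder, automatically creating a txt file for each job and writing the details into it, and the resulting format is quite good.

Add a .bat script and set it to run at startup, and the postings you want are fetched every day.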
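When the script runs unattended, a single detail page that fails to download or parse would stop the whole loop. A minimal sketch of a more tolerant driver follows; the name run_all is an assumption for illustration, and catching Exception broadly is a deliberate choice here to keep the batch going rather than a general recommendation.

```python
def run_all(urls, handler):
    """Call handler(url) for every URL; return a list of (url, error) for failures."""
    failures = []
    for url in urls:
        try:
            handler(url)
        except Exception as exc:  # broad on purpose: one bad page must not end the run
            failures.append((url, exc))
    return failures


# Demonstration with a stand-in handler that fails for one input.
demo_failures = run_all(["a", "bad", "c"], {"a": 1, "c": 2}.__getitem__)
for url, exc in demo_failures:
    print("failed:", url, type(exc).__name__)
```

With the functions from the full listing, the final loop could then be written as run_all(pages, file_deal), and the returned list shows which pages need another look.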

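For other postings you are interested in, just change the URL.

Here I only extract the first page; fetching every page did not seem necessary. If you need that, process the pagination list and store the results in a new list.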
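As a rough sketch of that last step, assume the listing-page URLs have already been gathered (how the site marks up its pagination links is not shown in this post, so that part is left out). The function below walks those pages and merges the detail URLs into one new list without duplicates. The names collect_detail_urls and fetch_links, and the optional delay, are assumptions for illustration.

```python
import time


def collect_detail_urls(page_urls, fetch_links, delay=0.0):
    """Merge the detail URLs of several listing pages, keeping first-seen order."""
    seen = set()
    result = []
    for i, page_url in enumerate(page_urls):
        if i and delay:
            time.sleep(delay)  # optional pause between listing pages
        for link in fetch_links(page_url):
            if link not in seen:  # the same job can appear on two pages
                seen.add(link)
                result.append(link)
    return result


# Demonstration with a stand-in for the real page fetcher.
fake_pages = {"p1": ["x", "y"], "p2": ["y", "z"]}
print(collect_detail_urls(["p1", "p2"], fake_pages.__getitem__))
```

In the script above, fetch_links could be lambda u: deal_page(get_url(u)), so each listing page goes through the same two functions as the first page does.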
