
Python Development Notes, Part 2: Web Scraping and Plotting with Python

2017-01-23 19:23

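We often need to pull one business metric from our internal monitoring platform and summarize it for the boss. Because the platform team does not expose an API to the business side, compiling these statistics used to mean collecting the data by hand, one day at a time, which is a mechanical, repetitive, and painful chore. This time I bit the bullet and scripted the whole process. I am writing these notes so the script does not get lost.

The process boils down to two simple steps: collecting the data and plotting it.

1. Data collection

The data is generated daily, and the URLs differ only in the date, like this: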

http://######/###/##/####?op=view&query1=2016-12-25;1;30;;;;;;;&query2=&type=delay&device1=&device2=
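
Since only the date changes, the list of URLs can be generated from a date range. Here is a minimal Python 3 sketch of that idea. It keeps the masked URL exactly as shown above (the real host and path are not given), and it assumes the server expects zero-padded dates such as 2016-12-05, which the sample URL suggests but does not prove. The names URL_TEMPLATE and daily_urls are made up for this illustration.

```python
from datetime import date, timedelta

# Masked URL copied from the article; the real host and path are not given.
URL_TEMPLATE = ('http://######/###/##/####?op=view&query1={day};1;30;;;;;;;'
                '&query2=&type=delay&device1=&device2=')


def daily_urls(start, end):
    """Return one URL per day from start to end, inclusive."""
    urls = []
    current = start
    while current <= end:
        # date.isoformat() yields zero-padded YYYY-MM-DD (an assumption about
        # what the server expects).
        urls.append(URL_TEMPLATE.format(day=current.isoformat()))
        current += timedelta(days=1)
    return urls


urls = daily_urls(date(2016, 12, 1), date(2016, 12, 30))
print(len(urls))   # 30
print(urls[0])
```

Working with date objects instead of a bare day counter also means a range that crosses a month boundary needs no special handling.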

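So the first step is to scrape one field from a single day's URL. This is ordinary web scraping: first download the page's HTML, then pull the data out of the relevant tag. After that, just loop over the array of URLs: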

def download(url, num_retries=2):
    """Fetch a page and return its HTML, or None if the download fails."""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        # Retry only on 5xx server errors, which may be temporary.
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html

def urlDataSource():
    """Build the list of daily URLs for December 1 through 30, 2016."""
    urlArray = []
    for i in range(1, 31):
        url = 'http://######/###/##/####?op=view&query1=2016-12-%s;1;30;;;;;;;&query2=&type=delay&device1=&device2=' % i
        urlArray.append(url)
    return urlArray

def arr1DataSource(urlArr):
    """Collect the fourth <td> of each <tbody> on every page as an int."""
    array1 = []
    for url in urlArr:
        html = download(url)
        soup = BeautifulSoup(html, "html.parser")
        for tbody in soup.find_all('tbody'):
            count = 0
            for td in tbody.find_all('td'):
                count = count + 1
                if count == 4:
                    print td.text
                    # Strip thousands separators before converting to int.
                    array1.append(int(td.text.replace(',', '')))
    return array1
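
The extraction step above keeps the fourth td inside each tbody and strips the thousands separators. As a rough illustration of the same idea that runs on Python 3 without extra packages, the sketch below swaps BeautifulSoup for the standard library's html.parser. This is a substitute technique, not what the script above uses. The class name FourthCellParser, the helper fourth_cells, and the sample HTML are all made up for the example.

```python
from html.parser import HTMLParser


class FourthCellParser(HTMLParser):
    """Collect the text of the fourth <td> inside every <tbody> as an int."""

    def __init__(self, wanted=4):
        super().__init__()
        self.wanted = wanted      # 1-based index of the cell to keep
        self.in_tbody = False
        self.cell_count = 0       # <td> tags seen in the current <tbody>
        self.capturing = False    # True while inside the wanted <td>
        self.buffer = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tbody':
            self.in_tbody = True
            self.cell_count = 0
        elif tag == 'td' and self.in_tbody:
            self.cell_count += 1
            if self.cell_count == self.wanted:
                self.capturing = True
                self.buffer = []

    def handle_data(self, data):
        if self.capturing:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self.capturing:
            self.capturing = False
            text = ''.join(self.buffer).strip()
            # Drop thousands separators before converting: "1,234" -> 1234.
            self.values.append(int(text.replace(',', '')))
        elif tag == 'tbody':
            self.in_tbody = False


def fourth_cells(html):
    """Return the fourth-cell value of each <tbody> in the given HTML."""
    parser = FourthCellParser()
    parser.feed(html)
    return parser.values


# Made-up sample page, only to show the shape of the input.
sample = ('<table><tbody><tr><td>a</td><td>b</td><td>c</td>'
          '<td>1,234</td><td>e</td></tr></tbody></table>')
print(fourth_cells(sample))   # [1234]
```

For real pages BeautifulSoup is more forgiving of malformed markup, so this sketch is mainly useful for seeing what the counting loop does.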

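2. Plotting

This step requires matplotlib and its dependencies; see the installation guide on the official matplotlib site. The implementation is as follows:

First build an array of X-axis coordinates, then plot directly with draw(), as follows: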

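def xDataSource(begin, end):
    """Return the X-axis values begin, begin + 1, ..., end - 1."""
    array3 = []
    for i in range(begin, end):
        array3.append(i)
    return array3

def draw(x, y):
    plt.figure(figsize=(10, 4))   # create the figure
    plt.plot(x, y, 'b*')          # blue star markers
    plt.plot(x, y, 'b')           # blue connecting line
    plt.xlabel("Date(December)")  # X-axis label
    plt.ylabel("Ratio(%)")        # Y-axis label
    plt.show()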
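
plt.plot() needs the X and Y lists to have the same length, so a day whose page could not be scraped would break the chart. The Python 3 sketch below shows one possible way to cope, assuming missing days are recorded as None (the script above does not do this). It also saves the chart to a file instead of opening a window. The names drop_missing and draw_to_file and the sample numbers are invented for illustration.

```python
def drop_missing(xs, ys):
    """Keep only the (x, y) pairs whose y value is present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    kept_x = [x for x, _ in pairs]
    kept_y = [y for _, y in pairs]
    return kept_x, kept_y


def draw_to_file(xs, ys, path):
    """Plot the series as a blue line with star markers and save it to path."""
    # Imported here so drop_missing() stays usable without matplotlib installed.
    import matplotlib
    matplotlib.use('Agg')            # non-interactive backend, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(xs, ys, 'b*-')           # 'b*-' = blue, star markers, solid line
    ax.set_xlabel('Date(December)')
    ax.set_ylabel('Ratio(%)')
    fig.savefig(path)
    plt.close(fig)


days = list(range(1, 6))
values = [12, None, 15, 11, None]    # made-up numbers; None marks a failed day
x, y = drop_missing(days, values)
print(x, y)                          # [1, 3, 4] [12, 15, 11]
# draw_to_file(x, y, 'december.png')  # uncomment to write the chart
```

The single format string 'b*-' draws the markers and the connecting line in one call, which gives the same picture as the two plt.plot() calls in draw().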

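3. Result

To produce a chart in a different style, adjust the arguments to plt.figure() (the figure size, for example); marker and line styles are set by the format strings passed to plt.plot().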

4. Complete code

#!/usr/bin/env python
# coding: utf-8
# Python 2 script (uses urllib2 and the print statement).
import urllib2
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

def download(url, num_retries=2):
    """Fetch a page and return its HTML, or None if the download fails."""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        # Retry only on 5xx server errors, which may be temporary.
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html

def urlDataSource():
    """Build the list of daily URLs for December 1 through 30, 2016."""
    urlArray = []
    for i in range(1, 31):
        url = 'http://######/###/##/####?op=view&query1=2016-12-%s;1;30;;;;;;;&query2=&type=delay&device1=&device2=' % i
        urlArray.append(url)
    return urlArray

def arrDataSource(urlArr):
    """Collect the fourth <td> of each <tbody> on every page as an int."""
    array = []
    for url in urlArr:
        html = download(url)
        soup = BeautifulSoup(html, "html.parser")
        for tbody in soup.find_all('tbody'):
            count = 0
            for td in tbody.find_all('td'):
                count = count + 1
                if count == 4:
                    print td.text
                    # Strip thousands separators before converting to int.
                    array.append(int(td.text.replace(',', '')))
    return array

def xDataSource(begin, end):
    """Return the X-axis values begin, begin + 1, ..., end - 1."""
    array = []
    for i in range(begin, end):
        array.append(i)
    return array

def draw(x, y):
    plt.figure(figsize=(10, 4))   # create the figure
    plt.plot(x, y, 'b*')          # blue star markers
    plt.plot(x, y, 'b')           # blue connecting line
    plt.xlabel("Date(December)")  # X-axis label
    plt.ylabel("Ratio(%)")        # Y-axis label
    plt.show()

if __name__ == "__main__":
    draw(xDataSource(1, 31), arrDataSource(urlDataSource()))
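
The script is written for Python 2: urllib2 and the print statement do not exist in Python 3. For anyone on Python 3, the download step might look roughly like the sketch below, which uses urllib.request and urllib.error. The opener parameter is a hypothetical addition (it is not in the original script) so the retry logic can be tried out without a network connection.

```python
import urllib.error
import urllib.request


def download(url, num_retries=2, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the download fails.

    opener is a hypothetical hook (not in the original script) that lets the
    retry logic be exercised without touching the network.
    """
    print('Downloading:', url)
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        # Retry only on 5xx responses, which tend to be temporary.
        code = getattr(e, 'code', None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download(url, num_retries - 1, opener)
    return None


# Example call (needs network access, so it is left commented out):
# html = download('http://example.invalid/')
```

Note that read() returns bytes in Python 3; BeautifulSoup accepts bytes directly, so the parsing step would not need a separate decode.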
