Python 3: Several Ways to Scrape the Latest 30 Double Color Ball (双色球) Draws into CSV
2018-02-07 23:24
411 views
The Double Color Ball history page is a bit of a trap. First, it is served over HTTPS; second, and more importantly, the response body does not look like text in any common encoding.
What urllib's urlopen returns looks like raw binary, like this:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xad[\xdb\x8e\x1b\xc7\x11}\x96\xbf\x82a\x90\xc0\x02f\xb3\xd3\xdds\x95v\xe7\xc5A~c\xc1%\xa9]&\x14\xb9 \xa9\x8b\xf3$\xc3\x08\x0c\x1b\x06l#\xb0\x13\x046\xe4\x181\x90
I tried all kinds of conversions and decode() calls, none of which worked. The real reason is that this is not an encoding problem at all: the leading bytes \x1f\x8b\x08 are the gzip magic number, i.e. the server returns a gzip-compressed body, which has to be decompressed before it can be decoded.
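If you want to keep using urllib, decompressing the body by hand should be enough. A minimal sketch (reusing the same URL, User-Agent header and unverified SSL context as the full script below):

import gzip
import ssl
import urllib.request as req

url = 'https://datachart.500.com/ssq/history/newinc/history.php?limit=30&sort=0'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

context = ssl._create_unverified_context()  # skip certificate verification, as in the script below
request = req.Request(url, headers=headers, method='GET')
raw = req.urlopen(request, timeout=10, context=context).read()

if raw[:2] == b'\x1f\x8b':      # gzip magic number
    raw = gzip.decompress(raw)  # inflate the body first
html = raw.decode()             # now it decodes as normal HTML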
Since plain urllib did not work for me at the time, I switched to third-party libraries. Both requests and httplib2 work, because they handle the gzip-compressed response transparently. If you hit the same problem, the snippets below may help:
## Works -- httplib2 (decompresses the gzip response for you; '.cache' is a local cache directory)
h = httplib2.Http('.cache')
resp, content = h.request(url)
# print(content.decode())
return content.decode()
## Works -- requests (also decompresses gzip; verify=False skips SSL certificate verification)
resp = requests.get(url, headers=headers,
                    timeout=10,
                    verify=False)
contents = resp.content
# print(type(contents))
# print(contents)
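One side note on the requests version: with verify=False, requests (via urllib3) emits an InsecureRequestWarning on every call. If that noise bothers you, it can be silenced, for example:

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)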
Full code:
import requests
import csv
import socket
import urllib.request as req
import urllib
from bs4 import BeautifulSoup
import gzip
import ssl
import httplib2


# Fetch historical Double Color Ball draw data
# @author puck
# @Date 2018-02-01
# https://datachart.500.com/ssq/history/history.shtml
# https://datachart.500.com/ssq/history/newinc/history.php?limit=30&sort=0
def get_content(url):
    try:
        context = ssl._create_unverified_context()
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

        # request = req.Request(url, headers=headers, method='GET')
        # result = req.urlopen(request, timeout=10, context=context).read()  # body comes back gzip-compressed, looks like binary gibberish
        # rep.encoding = 'utf-8'
        # return rep.text
        # print(type(result))
        # print(result.decode())

        ## Works -- httplib2
        # h = httplib2.Http('.cache')
        # resp, content = h.request(url)
        # print(content.decode())
        # return content.decode()

        ## Works -- requests
        resp = requests.get(url, headers=headers,
                            timeout=10,
                            verify=False)  # verify=False skips SSL certificate verification
        contents = resp.content
        # print(type(contents))
        # print(contents)
        if contents:
            return contents.decode()
        else:
            print("can't get data from url:" + url)
    except req.HTTPError as e:
        print('HTTPError:', e)
    except req.URLError as e:
        # Raised when SSL certificate verification fails, e.g.
        # URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)>
        print('URLError:', e)
    except socket.timeout as e:
        print('socket.timeout:', e)
    except BaseException as e:
        print('BaseException:', e)


# Parse the draw table (tbody#tdata): column 0 is the draw number,
# columns 1-6 are the red balls, column 7 is the blue ball
def get_data(html):
    if html is None:
        return
    final = []
    bs = BeautifulSoup(html, "html.parser")
    # print(bs)  # debug: dump the parsed page
    datas = bs.find("tbody", {'id': 'tdata'})
    trs = datas.find_all('tr')
    for tr in trs:
        temp = []
        tds = tr.find_all('td')
        for td in tds[:8]:
            temp.append(td.string)
        final.append(temp)
    print(final)
    return final


# Write the rows to a CSV file
def write_data(data, name):
    with open(name, 'w', errors='ignore', newline='') as f:
        csvFile = csv.writer(f)
        csvFile.writerows(data)


if __name__ == '__main__':
    url = 'https://datachart.500.com/ssq/history/newinc/history.php?limit=30&sort=0'
    html = get_content(url)
    data = get_data(html)
    write_data(data, 'TwoColorBallData30.csv')
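To sanity-check the output, the CSV can be read straight back with the csv module (a quick sketch; the filename matches the one written above):

import csv

with open('TwoColorBallData30.csv', newline='') as f:
    for row in csv.reader(f):
        issue, reds, blue = row[0], row[1:7], row[7]
        print(issue, reds, blue)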
Scraped result:
18016,01,11,12,18,25,27,16
18015,11,15,20,21,26,33,15
18014,09,12,20,24,28,31,07
18013,06,08,13,15,22,33,06
18012,11,12,13,19,26,28,12
18011,03,10,21,23,27,33,11
18010,01,08,17,20,21,22,03
18009,05,10,17,23,26,32,07
18008,05,09,10,12,17,19,13
18007,13,14,20,25,27,31,12
18006,02,07,08,09,17,29,11
18005,02,20,21,28,31,33,06
18004,14,18,19,26,30,31,11
18003,01,14,16,17,20,31,04
18002,07,18,24,29,31,33,16
18001,01,08,11,26,28,31,04
17154,05,09,13,15,18,26,05
17153,07,11,12,13,18,19,16
17152,06,10,23,25,26,29,05
17151,02,05,07,09,11,27,16
17150,06,14,19,20,21,23,08
17149,05,08,15,20,27,30,13
17148,04,07,11,14,29,32,12
17147,03,07,20,21,25,31,14
17146,01,19,25,26,27,33,10
17145,02,06,12,17,25,28,12
17144,03,14,16,20,31,32,09
17143,04,06,09,14,20,29,14
17142,08,13,14,18,23,33,06
17141,01,06,07,11,13,15,05
That's all.