
Python 3.5 crawler: scraping all People's Daily news from 1946 to 2003

2017-10-17 22:13, 489 views

Scraping People's Daily news (1946-2003) from the Ziliaoku (资料库) website

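The starting URL is:
http://www.ziliaoku.org/rmrb?from=groupmessage&isappinstalled=0
The crawl starts from this page, follows its links one level down, follows those links one level further, and scrapes the articles found there.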

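I am still learning, so there are faster and more convenient ways to do this, and some crawler features are not used yet. As a result, two things still need improvement. The code is posted below; discussion is welcome.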

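1. The output files (txt) are not produced in chronological order.

2. Because of how the site's pages are built, I have not yet worked out how to scrape every edition (版) of a given day, so the script only scrapes the first edition of each day. All editions of one day share the same URL, and every news item in an edition points to that URL.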
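
On the first point, the order is lost because the links are collected in a set, which has no defined iteration order. Below is a minimal sketch of one possible fix. It assumes that each day-level URL ends in a date of the form YYYY-MM-DD (as in http://www.ziliaoku.org/rmrb/1984-06-21); the helper names url_date and sort_by_date are hypothetical and not part of the script.

```python
import re
from datetime import date

# Assumption: a day-level URL ends in YYYY-MM-DD, e.g. .../rmrb/1984-06-21
DATE_RE = re.compile(r'(\d{4})-(\d{2})-(\d{2})$')


def url_date(url):
    """Return the date at the end of the URL, or None if there is none."""
    m = DATE_RE.search(url)
    if m is None:
        return None
    year, month, day = map(int, m.groups())
    return date(year, month, day)


def sort_by_date(urls):
    """Return the dated URLs in chronological order; undated URLs are dropped."""
    dated = []
    for u in urls:
        d = url_date(u)
        if d is not None:
            dated.append((d, u))
    return [u for d, u in sorted(dated)]
```

With something like this, the final loop could iterate over sort_by_date(url_set1) instead of the raw set, so that the file numbers follow the dates.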

This scraper is one step in my learning about crawlers. Next time I will post an example that uses scrapy.

# coding=utf-8
import codecs          # codecs.open writes files with an explicit encoding
import re              # regular expressions
import urllib.request  # fetches the index pages

import bs4             # Beautiful Soup 4, HTML parsing
import requests        # fetches the article pages

import News            # the author's own module defining the News structure

# Collect every matching link on the home page into url_set
def GetAllUrl(home):
    html = urllib.request.urlopen(home).read().decode('utf8')
    soup = bs4.BeautifulSoup(html, 'html.parser')
    pattern = r'http://www.ziliaoku.org/rmrb/[\d\S].*?'
    links = soup.find_all('a', href=re.compile(pattern))
    for link in links:
        url_set.add(link['href'])


# Same as GetAllUrl, but for a second-level page; results go into url_set1
def GetAllUrlL(home):
    html = urllib.request.urlopen(home).read().decode('utf8')
    soup = bs4.BeautifulSoup(html, 'html.parser')
    pattern = r'http://www.ziliaoku.org/rmrb/[\d\S].*?'
    links = soup.find_all('a', href=re.compile(pattern))
    for link in links:
        url_set1.add(link['href'])


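# Download one article page and save its titles and text to a numbered txt file
def GetNews(url, i):
    response = requests.get(url)
    html = response.text
    article = News.News()
    try:
        article.title = re.findall(r'<h2 id=".*?">(.+?)</h2>', html)
        article.content = re.findall(r'<div class="article">([\w\W]*?)</div>', html)

        t = ""
        for j in article.title:
            t += '标题：' + j + '\n'  # '标题：' means 'Title:'
        c = ""
        for m in article.content:
            c += str(m)
        # The separator passed to split() is missing from the posted code;
        # an HTML line-break tag (<br>, <br/>, <br />) is assumed here.
        article.content1 = '　' + '\n'.join(re.split(r'<br\s*/?>', c)).strip()

        # Explicit UTF-8 so the Chinese text is written the same way everywhere
        with codecs.open('/tmp/luo/news ' + str(i) + '.txt', 'w+', encoding='utf-8') as file:
            file.write(t + "\t" + article.content1)
        print('ok')
    except Exception as e:
        print('Error1:', e)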


# Follow the first matching link on a day page and save that article
def GetAllUrlK(home, i):
    html = urllib.request.urlopen(home).read().decode('utf8')
    soup = bs4.BeautifulSoup(html, 'html.parser')
    pattern = r'http://www.ziliaoku.org/rmrb/[\d\S].*?'
    link = soup.find('a', href=re.compile(pattern))
    if link is None:  # no matching link on this page, nothing to save
        return
    link1 = link['href']
    print(link1)
    GetNews(link1, i)

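
url_set = set()   # links found on the home page
url_set1 = set()  # links found on the second-level pages
home = 'http://www.ziliaoku.org/rmrb?from=groupmessage&isappinstalled=0'
GetAllUrl(home)
try:
    # First pass: visit every link from the home page and collect the day pages
    for d in url_set:
        GetAllUrlL(d)
        print(d)
    # Second pass: save one article per day page, numbering the files from 1
    i = 0
    for b in url_set1:
        i = i + 1
        print(b)
        GetAllUrlK(b, i)
except Exception as e:
    print('Error:', e)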

# home = 'http://www.ziliaoku.org/rmrb/1984-06-21'
# i = 10
# GetAllUrlK(home,i)

The txt files contain the news; their format can be tidied up with regular expressions as needed.
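
As one possible way to tidy the saved text, the sketch below turns line-break tags into newlines, strips any remaining HTML tags, decodes entities, and drops blank lines. The function name clean_text and the exact rules are illustrative assumptions; the markup actually left in the files may need different patterns.

```python
import html
import re


def clean_text(raw):
    """Turn an HTML fragment into plain text, one non-empty line per paragraph."""
    # Line-break and paragraph-end tags become newlines
    text = re.sub(r'<br\s*/?>|</p>', '\n', raw, flags=re.IGNORECASE)
    # Any remaining tags are dropped
    text = re.sub(r'<[^>]+>', '', text)
    # Entities such as &amp; or &nbsp; become ordinary characters
    text = html.unescape(text)
    # Trim each line (including full-width spaces) and skip empty ones
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)
```

It could be applied to the article text before it is written to the file, or afterwards to the contents of each txt file.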
