
[3] Python Study Notes: Writing a Python Program That Saves Every Article in a CSDN Column as a Local Web Page

2017-07-25 23:55 · 741 views

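I have been learning Python for a little while now, so it is about time to write a program for practice. Python has a large number of third-party libraries and also supports object-oriented programming. As someone who has spent nearly 10 years writing Java, I should not find its object-oriented model hard to understand and use. Below, I use downloading all the articles in my PowerShell DSC column to local disk as an example of how simple this is in Python. The code is as follows:

```python
import os
import urllib.request

from bs4 import BeautifulSoup


class CSDNSpecialTopic():
    def __init__(self,sepcialTopicIndexURLs=[],rootFolder="c:/test/"):
        self.sepcialTopicIndexURLs = sepcialTopicIndexURLs
        self.rootFolder = rootFolder

    def openURLAndSave(self):
        """Download every article found in the column and save it as <title>.html."""
        csdnList, csdnDic = self.getCSDNSpecialTopicPages()
        # Create the target folder once, before any file is written.
        if not os.path.exists(self.rootFolder):
            os.mkdir(self.rootFolder)
        for title, tempPageURL in csdnDic.items():
            with urllib.request.urlopen(tempPageURL) as html:
                s = html.read()
            # Write the raw bytes so the page's own encoding is preserved.
            with open(os.path.join(self.rootFolder, title + ".html"), "wb") as outfile:
                outfile.write(s)

    def getCSDNSpecialTopicPages(self):
        """Return (list of article URLs, dict mapping article title to URL)."""
        csdnList = []
        csdnDic = {}
        for sepcialTopicIndexURL in self.sepcialTopicIndexURLs:
            with urllib.request.urlopen(sepcialTopicIndexURL) as f:
                htmlstr = f.read().decode('utf-8')
            # Name the parser explicitly; otherwise bs4 warns and picks one itself.
            soup = BeautifulSoup(htmlstr, "html.parser")
            # On the column index page, each article link sits inside an <h4>.
            h4s = soup.find_all('h4')
            for h4 in h4s:
                href = h4.a
                if href is not None:
                    if href.get("href") is not None:
                        csdnList.append(href.get("href"))
                        csdnDic[href.get_text()] = href.get("href")
        print(len(csdnList))
        print(len(csdnDic))
        print(csdnDic)
        return csdnList, csdnDic


sepcialTopicIndexURLs = ["http://blog.csdn.net/column/details/14191.html",
                         "http://blog.csdn.net/column/details/14191.html?&page=2"]
csdnSpecialTopic = CSDNSpecialTopic(sepcialTopicIndexURLs, "c:/test3")
csdnList, csdnDic = csdnSpecialTopic.getCSDNSpecialTopicPages()
# Note: openURLAndSave() fetches the index pages again internally.
csdnSpecialTopic.openURLAndSave()
```

The program uses urllib for HTTP access (in Python 3 it is part of the standard library, not a third-party package) and the BeautifulSoup class from bs4 to parse the HTML pages. First, a brief description of urllib.

The first case is a GET request. Option one: urllib.request.urlopen(url). Option two: create a urllib.request.Request object first, then pass it to urlopen: urllib.request.urlopen(Request object).

The second case is a POST request, which carries data along, for example a login user name and password. Create a urllib.request.Request and put the request data into it: urllib.request.Request(url, data). Note that the data has to be converted beforehand with data = urllib.parse.urlencode(values) and, in Python 3, encoded to bytes.

The third case is adding request headers, for example headers = {'User-Agent':user_agent}. The headers are then passed as the third argument of Request: urllib.request.Request(url, data.encode("utf-8"), headers).

Below is a simple example I found online.

```python
import urllib.parse
import urllib.request

url = "http://www.baidu.com"
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

# urlencode() returns a str; Request needs bytes, hence the encode().
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data.encode("utf-8"), headers)
response = urllib.request.urlopen(req)
the_page = response.read().decode("utf-8")
print(the_page)
```

That covers the basics, so now to the concrete example. CSDNSpecialTopic is a class I defined; it provides fetching pages, parsing them, and saving them to disk. It accepts two parameters: a list of the column's index URLs (there can be more than one), and the directory in which the page files should be stored. If that directory does not exist yet, a new one is created. Both parameters are passed in through the constructor, def __init__(self,sepcialTopicIndexURLs=[],rootFolder="c:/test/"):, and the default path is the c:/test folder.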
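Two details of that constructor are worth a second look. In Python, a list default such as sepcialTopicIndexURLs=[] is created once and shared by every instance that relies on it. Also, os.mkdir fails when the parent folder is missing. Here is a minimal sketch of one alternative, not the class used in this post; the names ColumnSaver, index_urls, root_folder and ensure_folder are made up for illustration:

```python
import os
import tempfile


class ColumnSaver:
    """Hypothetical variant of the CSDNSpecialTopic constructor logic."""

    def __init__(self, index_urls=None, root_folder="c:/test/"):
        # None as the default, so each instance gets its own fresh list.
        self.index_urls = list(index_urls) if index_urls is not None else []
        self.root_folder = root_folder

    def ensure_folder(self):
        # makedirs also creates missing parent folders; exist_ok=True makes
        # repeated calls harmless.
        os.makedirs(self.root_folder, exist_ok=True)
        return self.root_folder


if __name__ == "__main__":
    # Use a throwaway directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        saver = ColumnSaver(root_folder=os.path.join(tmp, "a", "b"))
        print(os.path.isdir(saver.ensure_folder()))
```

With this shape, two instances created without arguments no longer share one URL list, and the nested target folder is created in a single call.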

Below is the result of running it; Chinese titles are handled well.
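Because the article titles are used directly as file names, Chinese characters are fine, but a title that happens to contain characters such as / or : would make open() fail on Windows. Here is a small sketch of one way to guard against that. safe_filename is a hypothetical helper, not part of the original program, and the set of characters it replaces is my assumption based on what Windows rejects (it is not exhaustive; reserved device names, for example, are not handled):

```python
import re

# Characters that Windows does not allow in file names, plus control chars.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, replacement="_"):
    """Return a title usable as a file name; non-ASCII text is kept as-is."""
    cleaned = _ILLEGAL.sub(replacement, title).strip().rstrip(".")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"


if __name__ == "__main__":
    print(safe_filename("PowerShell DSC: 入门/配置?"))
```

In the program above, the helper would be applied to title just before the output path is built.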


Tags: python, constructor, csdn
