
Learning Python, Day 16: Using BeautifulSoup and Tkinter

2016-11-27 11:24
https://github.com/A-lPha/duzhe.py.git

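Yesterday, while browsing Zhihu, I came across a thread titled "What does everyone use Python for?" (大家都用 Python 来做什么啊?). Tsing's answer caught my eye, and I took the chance to study (that is, copy) a piece of Python crawler code from it. What puzzles me is that the code has been published, yet the answer is set to disallow reposting. Is that meant to make us beginners practice our typing? Anyway, enough chatter; on to the topic.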

Batch-downloading every article from one issue of Duzhe (读者) magazine

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Save every article from one issue of Duzhe magazine as TXT
# By Tsing
# Python 2.7.9

import urllib2
import os
from bs4 import BeautifulSoup

def urlBS(url):
    # Fetch a page and parse it into a BeautifulSoup tree
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)
    return soup

def main(url):
    soup = urlBS(url)
    link = soup.select('.booklist a')  # article links on the issue's index page
    path = os.getcwd()+u'/读者文章保存/'
    if not os.path.isdir(path):
        os.mkdir(path)
    for item in link:
        newurl = baseurl + item['href']
        result = urlBS(newurl)
        title = result.find("h1").string
        writer = result.find(id="pub_date").string.strip()
        filename = path + title + '.txt'
        print filename.encode("gbk")
        new=open(filename,"w")
        # Note: the title and date are written as GBK but the body as UTF-8,
        # so each saved file mixes two encodings
        new.write("<<" + title.encode("gbk") + ">>\n\n")
        new.write(writer.encode("gbk")+"\n\n")
        text = result.select('.blkContainerSblkCon p')
        for p in text:
            context = p.text
            new.write(context.encode("utf-8"))
        new.close()

if __name__ == '__main__':
    year = raw_input("Please enter the year:")
    month = raw_input("Please enter the month:")
    time = year + "_" + month
    baseurl = 'http://www.52duzhe.com/' + time +'/'
    firsturl = baseurl + 'index.html'
    main(firsturl)


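That is the complete code. I naively assumed that as long as I typed it out without mistakes I could rest easy, but the errors came one after another. I'll list the errors I ran into first, so that you have a clear path to fixing them. Since they have all been resolved by now, I won't reproduce the Python error messages here.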

from bs4 import BeautifulSoup reported that bs4 could not be found


new.write(context.encode("utf-8")) reported a gbk encoding error

The remaining errors came up when I tried to build a GUI for this script, mainly because I didn't know how to use Tkinter.

Solutions

1. from bs4 import BeautifulSoup reports that bs4 cannot be found

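The root cause was simply that I had not installed BeautifulSoup. It sounds silly, but at the time it really didn't occur to me; I had always assumed that every Python library came pre-installed and ready to use. Below I describe how to install the BeautifulSoup library. See also: the official BeautifulSoup documentation.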


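First, download the latest BeautifulSoup release from the official website's download page. I saved it straight into the Python installation directory and unpacked it there. Then open cmd and enter cd C:\python2.7\beautifulsoup4-4.5.1, where cd is the change-directory command, used to switch the current path.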




Then run these two commands in order:

setup.py build
setup.py install


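Next, type python to start the interpreter and enter from bs4 import BeautifulSoup. If you simply get the ">>>" prompt back with no error, the installation succeeded.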
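The same check can also be done from a script rather than at the interactive prompt. The snippet below is only a minimal sketch: the helper name `is_installed` is made up for illustration, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
def is_installed(module_name):
    """Return True if `module_name` can be imported, False otherwise."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


# "bs4" is the package name that provides BeautifulSoup.
if is_installed("bs4"):
    print("bs4 is ready to use")
else:
    print("bs4 is missing; install it first")
```

Catching ImportError is enough here, because a missing package is exactly the situation the "cannot find bs4" message describes.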

2. new.write(context.encode("utf-8")) reports a gbk encoding error

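The problem is the encoding. The fix is for this line to encode the text as UTF-8, that is, new.write(context.encode("utf-8")). The listing above already shows the corrected line; the version that failed presumably encoded the text as gbk, which is what the error message complained about.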
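As a rough illustration of why switching to UTF-8 helps: web pages often contain characters that GBK cannot represent, for example the non-breaking space (U+00A0). Whether that particular character was the culprit here is an assumption, since the original error message was not kept. The helper `can_encode` and the sample string are invented for this sketch.

```python
# -*- coding: utf-8 -*-

def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Made-up sample: two Chinese words joined by a non-breaking space (U+00A0).
sample = u"\u8bfb\u8005\xa0\u6587\u7ae0"

print(can_encode(sample, "gbk"))    # GBK has no code for the non-breaking space
print(can_encode(sample, "utf-8"))  # UTF-8 can represent it
```

Encoding everything written to the file with the same codec also avoids ending up with a file that mixes GBK and UTF-8 bytes.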

3. GUI

Because I wanted to give the script a user interface, I used Tkinter, which ships with Python.


The code below, which I pieced together on my own, brings up a bare-bones window and may be useful as a reference.

# -*- coding: utf-8 -*-
from Tkinter import *

top = Tk()
top.geometry('650x100')

def main():
    # Print whatever is currently typed into the two entry boxes
    print var1.get()
    print var2.get()

var1=StringVar()
var2=StringVar()
label1 = Label(top,text='读者文章下载器',font='Helvetica -20 bold')
label2 = Label(text='请输入年份:',font='Helvetica -16 bold')
label3 = Label(text='请输入月份:',font='Helvetica -16 bold')
entry1 = Entry(top,text='',textvariable=var1,font='Helvetica -16 bold')
entry2 = Entry(top,text='',textvariable=var2,font='Helvetica -16 bold')
button1 = Button(text=' 确认 ',command=main,font='Helvetica -16 bold')
button2 = Button(text=' 退出 ',command=top.quit,activeforeground='white',activebackground='red',font='Helvetica -16 bold')
label1.pack()
label2.pack(side='left')
entry1.pack(side='left')
label3.pack(side='left')
entry2.pack(side='left')
button1.pack()
button2.pack()
top.mainloop()


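When run, it looks like this:

[Screenshot: the resulting window]

Admittedly it looks a bit ugly; I'll deal with making it prettier some other time.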
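One way the layout could be tidied up is Tkinter's grid geometry manager, which places widgets in rows and columns instead of stacking them the way pack does. The sketch below is an assumption about how that might look, not the layout used in this post: it targets Python 3 (where the module is named `tkinter`), and `read_fields` and `build_window` are names invented for the example.

```python
def read_fields(year_var, month_var):
    """Return the (year, month) text with surrounding whitespace removed."""
    return year_var.get().strip(), month_var.get().strip()


def build_window():
    """Lay the widgets out with grid() and return the root window."""
    import tkinter as tk  # imported here so read_fields works without a display

    root = tk.Tk()
    root.title("Duzhe downloader")
    year_var = tk.StringVar()
    month_var = tk.StringVar()

    def on_confirm():
        # Button callbacks receive no arguments, so the values are read here.
        print(read_fields(year_var, month_var))

    tk.Label(root, text="Year:").grid(row=0, column=0, padx=5, pady=5, sticky="e")
    tk.Entry(root, textvariable=year_var).grid(row=0, column=1, padx=5, pady=5)
    tk.Label(root, text="Month:").grid(row=1, column=0, padx=5, pady=5, sticky="e")
    tk.Entry(root, textvariable=month_var).grid(row=1, column=1, padx=5, pady=5)
    tk.Button(root, text="Confirm", command=on_confirm).grid(row=2, column=0, pady=5)
    tk.Button(root, text="Quit", command=root.destroy).grid(row=2, column=1, pady=5)
    return root
```

Calling `build_window().mainloop()` would show the window; each label then sits directly to the left of its entry box, with the two buttons on the row below.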

Later, after a good deal of studying and tinkering, the code ended up like this:

# -*- coding: utf-8 -*-
from Tkinter import *
import urllib2
import os
from bs4 import BeautifulSoup
import webbrowser

top = Tk()
top.geometry('650x150')

def open_web():
    # Open the index page of the chosen issue in the default browser
    year = var1.get()
    month = var2.get()
    time = year + "_" + month
    baseurl = 'http://www.52duzhe.com/' + time +'/'
    firsturl = baseurl + 'index.html'
    webbrowser.open(firsturl)

def download_txt():
    # Download every article of the chosen issue into the save folder
    year = var1.get()
    month = var2.get()
    time = year + "_" + month
    baseurl = 'http://www.52duzhe.com/' + time +'/'
    firsturl = baseurl + 'index.html'
    def urlBS(url):
        response = urllib2.urlopen(url)
        html = response.read()
        soup = BeautifulSoup(html)
        return soup
    soup = urlBS(firsturl)
    link = soup.select('.booklist a')
    path = os.getcwd()+u'/读者文章保存/'
    if not os.path.isdir(path):
        os.mkdir(path)
    for item in link:
        newurl = baseurl + item['href']
        result = urlBS(newurl)
        title = result.find("h1").string
        writer = result.find(id="pub_date").string.strip()
        filename = path + title + '.txt'
        print filename.encode("gbk")
        new=open(filename,"w")
        # Note: title and date are written as GBK, the body as UTF-8
        new.write("<<" + title.encode("gbk") + ">>\n\n")
        new.write(writer.encode("gbk")+"\n\n")
        text = result.select('.blkContainerSblkCon p')
        for p in text:
            context = p.text
            new.write(context.encode("utf-8"))
        new.close()

var1=StringVar()
var2=StringVar()
label1 = Label(top,text='读者文章下载器',font='Helvetica -20 bold')
label2 = Label(text='请输入年份:',font='Helvetica -16 bold')
label3 = Label(text='请输入月份:',font='Helvetica -16 bold')
entry1 = Entry(top,text='',textvariable=var1,font='Helvetica -16 bold')
entry2 = Entry(top,text='',textvariable=var2,font='Helvetica -16 bold')
button1 = Button(text=' 打开 ',command=open_web,font='Helvetica -16 bold')
button2 = Button(text=' 退出 ',command=top.quit,activeforeground='white',activebackground='red',font='Helvetica -16 bold')
button3 = Button(text=' 下载 ',command=download_txt,font='Helvetica -16 bold')
label1.pack()
label2.pack(side='left')
entry1.pack(side='left')
label3.pack(side='left')
entry2.pack(side='left')
button1.pack()
button2.pack()
button3.pack()
top.mainloop()


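Clicking "打开" (Open) opens the Duzhe web page for the chosen year and month; clicking "下载" (Download) saves that issue's articles into the "读者文章保存" folder.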
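Both buttons build the same address from the year and month, so that step can be pulled out and checked on its own. The following is a minimal sketch rather than the code used in this post: `issue_urls` and `article_url` are hypothetical helper names, and "example.html" in the usage note is a made-up link. It uses `urljoin` from the standard library, with a fallback import so it should also run on Python 2.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

SITE = "http://www.52duzhe.com/"


def issue_urls(year, month):
    """Return (base URL, index URL) for the issue of the given year and month."""
    base = SITE + year + "_" + month + "/"
    return base, urljoin(base, "index.html")


def article_url(base, href):
    """Resolve a link taken from the index page against the issue's base URL."""
    return urljoin(base, href)
```

For instance, `issue_urls("2016", "11")` yields the base URL and index page for that issue, and `article_url(base, "example.html")` places the hypothetical link under the same directory. Unlike plain string concatenation, `urljoin` also copes with an href that is already an absolute URL.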

This part of the code took me quite a long time:

def download_txt():
    year = var1.get()
    month = var2.get()
    time = year + "_" + month
    baseurl = 'http://www.52duzhe.com/' + time +'/'
    firsturl = baseurl + 'index.html'
    def urlBS(url):
        response = urllib2.urlopen(url)
        html = response.read()
        soup = BeautifulSoup(html)
        return soup
    soup = urlBS(firsturl)
    link = soup.select('.booklist a')
    path = os.getcwd()+u'/读者文章保存/'
    if not os.path.isdir(path):
        os.mkdir(path)
    for item in link:
        newurl = baseurl + item['href']
        result = urlBS(newurl)
        title = result.find("h1").string
        writer = result.find(id="pub_date").string.strip()
        filename = path + title + '.txt'
        print filename.encode("gbk")
        new=open(filename,"w")
        new.write("<<" + title.encode("gbk") + ">>\n\n")
        new.write(writer.encode("gbk")+"\n\n")
        text = result.select('.blkContainerSblkCon p')
        for p in text:
            context = p.text
            new.write(context.encode("utf-8"))
        new.close()


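Because I didn't really understand how Tkinter works, my function definitions went wrong. The "open the corresponding URL" function gave me the idea, though, so I merged the two functions from the original code into one, and to my surprise it worked. I hope my experience helps you! See also: GUI programming (Tkinter)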
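For context, Tkinter calls a button's command with no arguments, which may be why handing a URL to a separate function was awkward; that is a guess at the cause, since the original error was not recorded. Besides merging the functions, another common approach is to bind the arguments ahead of time with `functools.partial`. In the sketch below, `download_issue` is a hypothetical stand-in that only records the address it was asked to fetch.

```python
from functools import partial

requested = []


def download_issue(baseurl, index_name):
    """Stand-in for the real download: just remember the URL it was given."""
    requested.append(baseurl + index_name)


# partial() returns a callable that needs no arguments, which is the
# shape a Tkinter Button expects for its command option.
callback = partial(download_issue, "http://www.52duzhe.com/2016_11/", "index.html")
callback()
```

Values typed into the entry boxes are only known at click time, so they still have to be read inside the callback itself, as download_txt does with var1.get() and var2.get().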