
[python] Fetching the content of a web page

2011-10-21 20:17, 316 views
#-*- coding: UTF-8 -*-

import urllib2, BeautifulSoup
# @param url: complete URL of the page to fetch
# @param usr, pwd: credentials for HTTP basic authentication;
#        used only when the page requires an account
# @return: the page content as a string formatted by BeautifulSoup
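def getWebPage(url, usr=None, pwd=None):
    if not usr and not pwd:
        # No credentials given: fetch the page directly.
        page = urllib2.urlopen(url).read()
    else:
        # Register the credentials for this URL (any realm) and
        # fetch the page through an opener that sends them.
        pwdMgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
        pwdMgr.add_password(None, url, usr, pwd)
        handler = urllib2.HTTPBasicAuthHandler(pwdMgr)
        opener = urllib2.build_opener(handler)
        page = opener.open(url).read()
    # Format the HTML in both cases, as the @return comment promises.
    content = BeautifulSoup.BeautifulSoup(page).prettify()
    return content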

url='http://www.csdn.net/'
print getWebPage(url)
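
The listing above targets Python 2 (urllib2 and BeautifulSoup 3). As a rough sketch only, the same fetch-with-optional-basic-auth approach might look like this in Python 3 using just the standard library's urllib.request. The function name fetch_page and the data: URL in the demo are illustrative assumptions and not part of the original post. The BeautifulSoup formatting step is left out here, so the function returns the raw bytes of the page.

```python
import urllib.request


def fetch_page(url, usr=None, pwd=None):
    """Return the raw bytes of `url`, using HTTP basic auth if credentials are given.

    Hypothetical Python 3 counterpart of getWebPage; it does not prettify the HTML.
    """
    if not usr and not pwd:
        # No credentials: an opener with only the default handlers.
        opener = urllib.request.build_opener()
    else:
        # Register the credentials for this URL (any realm).
        pwd_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        pwd_mgr.add_password(None, url, usr, pwd)
        handler = urllib.request.HTTPBasicAuthHandler(pwd_mgr)
        opener = urllib.request.build_opener(handler)
    with opener.open(url) as response:
        return response.read()


if __name__ == "__main__":
    # A data: URL keeps the demo offline; replace it with a real http(s) URL.
    print(fetch_page("data:text/plain,hello").decode("utf-8"))
```

Building an opener in both branches keeps the read path identical whether or not credentials are supplied, which avoids the two branches drifting apart.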