
Python crawler: scraping news from Hupu

2017-09-16 22:48
Code I wrote a while back fades from memory quickly, so it needs regular review.

With some free time recently, I wrote a simple crawler. Sadly, I had forgotten so much that it still took me quite a while.

My girlfriend likes Stephen Curry, but there is too little Curry-only news, so I scraped a batch of Golden State Warriors news instead.

(She is never going to read news out of a text file... but I'm writing it anyway.)

# -*- coding:utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup


def get_content(hupu_url):
    """Fetch a single Hupu article and return (title, body text)."""
    headers = {'User-agent': 'Mozilla/5.0'}
    try:
        page = requests.get(hupu_url, headers=headers, timeout=3)
    except requests.RequestException:
        # Signal failure the same way as a normal return.
        return 'False', ''
    soup = BeautifulSoup(page.text, 'lxml')
    title = soup.title
    # The article body sits inside the "artical-main-content" block (sic).
    bodys = soup.find_all(class_="artical-main-content")
    body = BeautifulSoup(str(bodys[0]), 'lxml')
    tips = body.find_all("p")
    main_content = ""
    for tip in tips:
        # Keep plain-text paragraphs only; skip ones carrying images or links.
        if "img" not in str(tip) and "href" not in str(tip):
            text = re.findall(r'<p>(.*?)</p>', str(tip))
            if text:
                main_content += '\t' + text[0] + '\n'
    return title.contents[0], main_content


def writetofile(title, content):
    """Append one article to ToWenWen.txt, followed by a separator line."""
    with open('ToWenWen.txt', 'a', encoding='utf-8') as f:
        news = '<title>' + '\n' + '\t' + str(title) + '\n' + '<content>' + '\n' + str(content)
        f.write(news)
        f.write('==' * 100)
        f.write('\n')


if __name__ == '__main__':
    # Truncate the output file before a new run.
    open('ToWenWen.txt', 'wt').close()

    # Tag 2982 is the Warriors news feed; walk list pages 1-14.
    for j in range(1, 15):
        warriors_url = 'https://voice.hupu.com/nba/tag/2982-' + str(j) + '.html'
        headers = {'User-agent': 'Mozilla/5.0'}
        try:
            page = requests.get(warriors_url, headers=headers, timeout=15)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, 'lxml')
        hupu = soup.find_all(class_="list-content")
        hupu2 = BeautifulSoup(str(hupu), 'lxml')
        hupu3 = hupu2.find_all(class_="n1")
        # Pull the article URLs out of the stringified <a> tags.
        news_urls = re.findall('<a href="(.*?)" target="_blank">', str(hupu3))
        for hupu_url in news_urls:
            print('search url', hupu_url)
            try:
                title, content = get_content(hupu_url)
                writetofile(title, content)
            except Exception:
                continue
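
After a run, every saved article sits in ToWenWen.txt in the layout produced by writetofile: a <title> marker, the indented headline, a <content> marker, the tab-indented paragraphs, and a separator of 200 '=' characters. Illustratively (placeholder text, not real scraped output):

    <title>
        (article headline)
    <content>
        (first paragraph of the article)
        (second paragraph of the article)
    ======== ... ========

One thing worth noting about the link extraction: the script stringifies the parsed tags and then runs a regex over the result. BeautifulSoup can hand back the href attributes directly, which skips the re-parsing step. A minimal sketch of that alternative, assuming the same .list-content / .n1 structure (class names taken from the script above, not re-verified against the live site; get_news_urls is my own helper name):

    from bs4 import BeautifulSoup

    def get_news_urls(list_page_html):
        """Collect article links from one list page via tag attributes."""
        soup = BeautifulSoup(list_page_html, 'lxml')
        urls = []
        for block in soup.find_all(class_='list-content'):
            for holder in block.find_all(class_='n1'):
                # href=True keeps only <a> tags that actually carry a link.
                for a in holder.find_all('a', href=True):
                    urls.append(a['href'])
        return urls

The returned list could be fed straight into the get_content/writetofile loop in place of the regex-built news_urls.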


Tags: web crawler