
Python Crawlers: A Single-Threaded Crawler

2015-08-14 19:47
1. Fetching the page source directly

>>> import requests
>>> url = 'http://www.wutnews.net/'
>>> html = requests.get(url)
>>> print(html.text)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta name="Keywords" content="经纬,经纬网,武汉理工大学,武汉理工大学门户,新闻经纬,时政视窗,校园文化,皮壳网,选修客,Token,拓垦团队" /><meta name="Description" content="武汉理工大学门户网站" /><meta name="robots" content="index, follow" /><meta name="googlebot" content="index, follow" /><meta name="author" content="Token Team" /><title>
武汉理工大学经纬网
</title>
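In practice it helps to check the response before trusting its body. A minimal sketch, assuming the same URL as above (`fetch` is a hypothetical helper, not part of the requests API):

```python
import requests

def fetch(url, timeout=10):
    """Fetch a page and return its decoded text; raise on HTTP errors."""
    resp = requests.get(url, timeout=timeout)  # timeout keeps the crawler from hanging forever
    resp.raise_for_status()                    # raise requests.HTTPError on 4xx/5xx
    return resp.text
```

`raise_for_status()` turns, say, a 404 into an exception instead of letting the crawler silently parse an error page.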


2. Modifying the HTTP headers to fetch the source

Some sites inspect the program sending the request, for example allowing only browsers and rejecting crawlers. In that case, we can add an HTTP User-Agent header so the site mistakes our crawler for a browser.

>>> import requests
>>> url = 'http://www.wutnews.net/'
>>> headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4'}
>>> html = requests.get(url, headers=headers)
>>> print(html.text)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta name="Keywords" content="经纬,经纬网,武汉理工大学,武汉理工大学门户,新闻经纬,时政视窗,校园文化,皮壳网,选修客,Token,拓垦团队" /><meta name="Description" content="武汉理工大学门户网站" /><meta name="robots" content="index, follow" /><meta name="googlebot" content="index, follow" /><meta name="author" content="Token Team" /><title>
武汉理工大学经纬网
</title>


3. Requests and regular expressions

The basic principle of a simple single-threaded crawler: use Requests to fetch the page source, then use a regular expression to match out the content of interest.

import requests
import re

url = 'http://www.wutnews.net/'
html = requests.get(url)
# Capture the src attribute of every <img> tag on the page
navTitles = re.findall('<img src="(.*?)"', html.text)
for each in navTitles:
    print(each)


Partial output:

>>>
images/news-more.jpg
images/news-more.jpg
images/subject.jpg
images/newsphoto-more.jpg
http://www.wutnews.net/thumbnail.aspx?width=84&height=80&path=http://www.wutnews.net/uploads/2015-07-26/23393633491043.jpg
http://www.wutnews.net/thumbnail.aspx?width=84&height=80&path=http://www.wutnews.net/uploads/2015-07-17/17164432633597.jpg
http://www.wutnews.net/thumbnail.aspx?width=84&height=80&path=http://www.wutnews.net/uploads/2015-07-17/08452651575529.jpg
images/culture-more.jpg
images/token.jpg
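The non-greedy `(.*?)` is what keeps each match inside a single tag. Its behavior can be seen on a tiny inline snippet (hypothetical HTML, no network needed):

```python
import re

snippet = '<img src="images/a.jpg" /><img src="images/b.jpg" />'
# Non-greedy .*? stops at the first closing quote, so each src is captured separately
print(re.findall('<img src="(.*?)"', snippet))  # → ['images/a.jpg', 'images/b.jpg']
```

With a greedy `(.*)` instead, the single match would run all the way to the last quote and swallow both tags.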


4. Submitting data to a page

Method: requests.post

import requests
import re

url = 'https://www.crowdfunder.com/browse/deals&template=false'
data = {
    'entities_only': 'true',
    'page': '1'
}
html = requests.post(url, data=data)
# re.S lets . match newlines, so titles split across lines are still captured
title = re.findall('"card-title">(.*?)</div>', html.text, re.S)
for each in title:
    print(each)


Partial output:

>>>
Electric World Carnival
Intox-Detox
Aquavert by EIJ Industries
CafeBellas, Inc.
SU Labs Accelerator Seed Fund
Net Zero Urban Greens
SixthContinent Inc.
Paul Davis Restoration of Western Michigan
Vinavanti Urban Winery
Pipeline Wizard
Lavon Estates - Cavender Real Estate Group LLC
Jibehealth.com
EM&N8, Controllers Incorporated
AxCent Tuning Systems
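To collect more than one page, the same POST can be repeated while varying the 'page' field. A hedged sketch, assuming the site paginates through this form parameter as the example suggests (`page_payload` and `scrape_pages` are hypothetical helpers):

```python
import re
import requests

def page_payload(page):
    """Form data for one result page (fields taken from the example above)."""
    return {'entities_only': 'true', 'page': str(page)}

def scrape_pages(url, n_pages):
    """Fetch n_pages of results and collect every matched card title."""
    titles = []
    for page in range(1, n_pages + 1):
        html = requests.post(url, data=page_payload(page))
        titles.extend(re.findall('"card-title">(.*?)</div>', html.text, re.S))
    return titles
```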