
Web Scraping Preparation

2017-02-06
Abstract: notes from "Web Scraping with Python"

1. What crawlers are for
Web crawling automates repetitive manual workflows that would otherwise be done by hand.
2. Preparation before crawling
a. Check robots.txt
Append /robots.txt to the site's root URL to see whether the site states any rules or restrictions:
User-agent : names the crawler the following rules apply to (paired with Disallow lines that block paths for that agent)
Crawl-delay : the delay, in seconds, the site requests between successive downloads
Sitemap : a link to the site's sitemap file
Example: the sitemap provided by Jobbole (伯乐在线)
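These fields can also be checked programmatically with Python's standard urllib.robotparser. A minimal sketch against a made-up robots.txt (the agent name, paths, and delay below are illustrative, not taken from any real site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# For a live site you would call rp.set_url(...) and rp.read() instead.
robots_txt = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The blocked agent may fetch nothing; everyone else must skip /private/
print(rp.can_fetch('BadCrawler', 'http://example.webscraping.com/'))
print(rp.can_fetch('*', 'http://example.webscraping.com/private/x'))
print(rp.can_fetch('*', 'http://example.webscraping.com/index.html'))
print(rp.crawl_delay('*'))  # the requested delay between requests
```

Honoring can_fetch and crawl_delay in your crawl loop keeps the crawler within the site's stated rules.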
b. Estimate the site's size
Search for site: followed by the domain or a URL path (use Google).
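When a site publishes a sitemap (advertised by the Sitemap: line in robots.txt), counting its <loc> entries gives another rough size estimate. A sketch over a hypothetical sitemap snippet (a real one would be downloaded from the advertised URL):

```python
import re

# Hypothetical sitemap fragment, for illustration only
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/1</loc></url>
  <url><loc>http://example.webscraping.com/view/2</loc></url>
  <url><loc>http://example.webscraping.com/view/3</loc></url>
</urlset>"""

# Pull every URL out of its <loc> element and count them
urls = re.findall(r'<loc>(.*?)</loc>', sitemap_xml)
print(len(urls))
```

Large sites split their sitemap into an index of multiple files, so the per-file count is a lower bound, not an exact figure.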
c. Identify the technologies the site uses
i. Run pip in Windows PowerShell to check whether pip is installed
ii. Install the builtwith module: pip install builtwith
iii. Pass a URL to the module as an argument to analyze it:
>>> import builtwith
>>> builtwith.parse('http://example.webscraping.com')
{u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'],
u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'],
u'programming-languages': [u'Python'],
u'web-servers': [u'Nginx']}
>>> builtwith.parse('http://jianshu.com')
{u'javascript-frameworks': [u'Prototype', u'RequireJS'], u'web-frameworks': [u'Twitter Bootstrap', u'Ruby on Rails'], u'programming-languages': [u'Ruby'], u'web-servers': [u'Tengine']}
>>> builtwith.parse('http://chinadaily.com.cn')
{u'javascript-frameworks': [u'jQuery'], u'web-servers': [u'Nginx']}
>>> builtwith.parse('http://oschina.net')
{u'javascript-frameworks': [u'jQuery', u'Vue.js'], u'web-servers': [u'Tengine']}
d. Find the site's owner
i. Install a wrapper library for the WHOIS protocol:
pip install python-whois
ii. Usage:
>>> import whois
>>> print whois.whois('jianshu.com')
{
"updated_date": [
"2016-04-06 00:00:00",
"2016-04-06 10:24:47"
],
"status": [
"clientTransferProhibited https://icann.org/epp#clientTransferProhibited", "clientTransferProhibited"
],
"name": "Shanghai Bai Ji Information Technology Inc. Ltd,",
"dnssec": "Unsigned",
"city": "Shanghai",
"expiration_date": [
"2020-03-20 00:00:00",
"2020-03-20 18:28:58"
],
"zipcode": "200433",
"domain_name": "JIANSHU.COM",
"country": "CN",
"whois_server": "whois.name.com",
"state": "Shanghai",
"registrar": "Name.com, Inc.",
"referral_url": "http://www.name.com",
"address": "Innospace 2, B1, Building #5, KIC, No.316 Songhu Road , Yangpu District",
"name_servers": [
"F1G1NS1.DNSPOD.NET",
"F1G1NS2.DNSPOD.NET",
"f1g1ns1.dnspod.net",
"f1g1ns2.dnspod.net"
],
"org": "Shanghai Bai Ji Information Technology Inc. Ltd,",
"creation_date": [
"2008-03-20 00:00:00",
"2008-03-20 18:28:58"
],
"emails": [
"contact@jianshu.com",
"abuse@name.com"
]
}
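The fields of a record like the one above can then be post-processed, e.g. to read off creation and expiration dates. A sketch assuming the record is available as a plain dict (python-whois actually returns a dict-like object; the subset and parsing here are illustrative):

```python
from datetime import datetime

# A subset of the WHOIS record printed above, as a plain dict
record = {
    "domain_name": "JIANSHU.COM",
    "creation_date": ["2008-03-20 00:00:00", "2008-03-20 18:28:58"],
    "expiration_date": ["2020-03-20 00:00:00", "2020-03-20 18:28:58"],
    "emails": ["contact@jianshu.com", "abuse@name.com"],
}

def first_date(value):
    """WHOIS fields are often lists of variants; parse the first entry."""
    if isinstance(value, list):
        value = value[0]
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")

created = first_date(record["creation_date"])
expires = first_date(record["expiration_date"])
print(record["domain_name"], created.year, expires.year)
```

Registrar details and contact emails like these help judge whether a site's owner is likely to tolerate crawling, which is the point of this step.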
Tags: web scraping