
Python web crawlers: the robots protocol

2020-06-29 04:46 · 204 views

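Requests library: small scale, small data volume, crawl speed not critical.

Scrapy: medium scale, larger data volume, crawl speed matters.

Custom development (Google/Firefox): large scale, search engines, whole-web crawling, crawl speed is critical.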

Robots: the Robots Exclusion Standard. A website uses it to tell crawlers which pages may be crawled and which may not. Form: a robots.txt file in the site's root directory.
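Because the file always sits at the site root, its address can be derived from any page URL on the same site. A minimal sketch using only the standard library; the helper name `robots_url` and the sample page path are hypothetical, chosen for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the page's own path and query,
    # because robots.txt lives at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path on a site mentioned in this post.
print(robots_url("http://www.jd.com/pop/item.html?id=1"))
```

A crawler would typically request this URL once per site before fetching any other page.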

e.g. http://www.jd.com/robots.txt

http://www.moe.edu.cn/robots.txt # no robots protocol (the site has no robots.txt)
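One way to see the difference between the two cases is with Python's standard `urllib.robotparser`. The sketch below stays offline: an empty rule list stands in for a site without robots.txt (which crawlers conventionally treat as "everything allowed"), and the crawler names `BadBot` and `MyCrawler` are hypothetical, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_lines):
    """Build a parser from robots.txt lines without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


# Assumption: an empty rule set models a site that publishes no robots.txt.
no_rules = build_parser([])

# Hypothetical rule set that bans one named crawler from the whole site.
blocked = build_parser(["User-agent: BadBot", "Disallow: /"])

print(no_rules.can_fetch("MyCrawler", "http://www.moe.edu.cn/index.html"))
print(blocked.can_fetch("BadBot", "http://www.jd.com/index.html"))
print(blocked.can_fetch("MyCrawler", "http://www.jd.com/index.html"))
```

Against a live site, `set_url()` followed by `read()` would download the file instead of the hard-coded lines used here.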

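[code]User-agent: * # applies to crawlers from any source
Disallow: /?* # disallow any path that starts with /?
Disallow: /pop/*.html
Disallow: /pinpai/*.html?* # anything matching these wildcard patterns is disallowed
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider # malicious crawler: denied access to everything on JD
Disallow: / # all directories
[/code]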
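The wildcard rules in the `User-agent: *` block can be checked with a small matcher. This is a sketch, not JD's or any library's implementation: it assumes the common wildcard convention (`*` matches any run of characters, a trailing `$` anchors the end, and a rule otherwise matches as a prefix). The helper names `rule_matches` and `is_allowed` and the sample paths are hypothetical.

```python
import re

# Disallow patterns copied from the "User-agent: *" block above.
DISALLOWED = ["/?*", "/pop/*.html", "/pinpai/*.html?*"]


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the given URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each "*" match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start only, which gives prefix matching.
    return re.match(regex, path) is not None


def is_allowed(path: str) -> bool:
    """Return True if no Disallow pattern matches the path."""
    return not any(rule_matches(rule, path) for rule in DISALLOWED)


# Hypothetical paths, for illustration only.
for sample in ["/index.html", "/?q=phone", "/pop/detail/1.html", "/pinpai/1.html?x=1"]:
    print(sample, is_allowed(sample))
```

A full crawler would also pick the rule group by user agent first; the four named spiders above fall under their own `Disallow: /` groups and are blocked everywhere.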
