
Python crawler: collecting app URLs

2017-12-25 16:36

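1. Set up the environment: Python 2.7

2. Install Scrapy

Run pip2.7 install scrapy. If Scrapy was installed any other way, the scrapy command may not work on Windows; in that case run pip2.7 uninstall scrapy and then install it again.

3. Type scrapy
If the command help is printed, the installation is correct.
4. On Windows, change to the crawler workspace with cd D:\PythonWorkspace\spider, then run: scrapy startproject tutorial
5. The command generates the following project files: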

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

6. Write the spider script

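# -*- coding: utf-8 -*-
import sys
import scrapy

# Python 2 only: make UTF-8 the default codec for implicit str/unicode conversions.
reload(sys)
sys.setdefaultencoding('utf-8')


class WanDouJia_browser_Spider(scrapy.Spider):
    name = "Spider-appLabel_URL"

    def start_requests(self):
        # Entry point for the top-level app categories (unused):
        # url = "http://www.wandoujia.com/category/app"

        # The first column of each line in appLabels.csv is used as the search keyword.
        with open(u"D:\\PythonWorkspace\\spider\\Resources\\appLabels.csv", 'r') as labels:
            for line in labels:
                keyWord = line.split(",")[0].strip().decode('utf-8')
                if not keyWord:
                    continue  # skip blank lines
                url = "http://www.wandoujia.com/search?key=" + keyWord
                yield scrapy.Request(url=url, meta={'appname': keyWord}, callback=self.parse_big_class)

    # Extract the app's entry URL from the search results page
    def parse_big_class(self, response):
        appName = response.xpath('//h2[@class="app-title-h2"]/a/text()').extract_first()
        url = response.xpath('//h2[@class="app-title-h2"]/a/@href').extract_first()
        if appName is None or url is None:
            # Nothing matched the XPath; skip instead of failing on None + str.
            return
        print appName + " " + url

        # Append the result; the with-block closes the file after each write.
        with open("appListURLs.csv", 'a') as f:
            f.write(str(appName) + "," + str(url) + "\n")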
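
The spider above concatenates the raw keyword into the search URL. Below is a minimal sketch of one way to percent-encode the keyword first and to pull keywords out of CSV lines, using only the standard library. The helper names build_search_url and read_keywords are hypothetical and are not part of the original script; the import fallback is there so the sketch runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers (not in the original spider); standard library only.
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2.7

SEARCH_URL = "http://www.wandoujia.com/search?key="


def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded as UTF-8."""
    if not isinstance(keyword, bytes):
        keyword = keyword.encode("utf-8")
    return SEARCH_URL + quote(keyword)


def read_keywords(lines):
    """Yield the first comma-separated field of each non-empty line."""
    for line in lines:
        keyword = line.split(",")[0].strip()
        if keyword:
            yield keyword
```

In the spider, start_requests could then loop over read_keywords(labels) and pass build_search_url(keyWord) to scrapy.Request, which keeps spaces and non-ASCII app names from producing a malformed URL.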

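7. Configure debugging.

Open the PyCharm Run/Debug Configurations dialog (Run -> Edit Configurations).

Select the project: choose the project to debug, Spider.
Set the script to run (Script) to D:\Python27\Lib\site-packages\scrapy\cmdline.py. cmdline.py is the command-line entry script that Scrapy provides; with it as the startup script, the project to debug is passed to it as arguments.
Set the script parameters (Script parameters) to crawl Spider-appLabel_URL. The arguments follow the crawl command described in the official documentation.
Set the working directory (Working directory) to the project root, D:\PythonWorkspace\spider\tutorial, which contains the crawler configuration file scrapy.cfg.

[Screenshot: the completed Run/Debug configuration]

Once the configuration is saved, set breakpoints and run it in debug mode; the breakpoints are hit and debug information is printed to the console.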
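
As an alternative to pointing PyCharm at cmdline.py inside site-packages, a small launcher script can call the same entry point through scrapy.cmdline.execute. This is only a sketch, assuming Scrapy is installed and the script sits next to scrapy.cfg; the file name run_debug.py and the helper names build_argv and run_spider are hypothetical.

```python
# run_debug.py (hypothetical file name), placed next to scrapy.cfg.


def build_argv(spider_name):
    """Build the argument list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name):
    """Start the crawl in-process so IDE breakpoints in the spider are hit."""
    # Imported lazily so build_argv can be used without Scrapy installed.
    from scrapy import cmdline
    # Note: execute() ends the process via sys.exit() when the crawl finishes.
    cmdline.execute(build_argv(spider_name))
```

Adding a final line run_spider("Spider-appLabel_URL") and debugging this file should behave like the cmdline.py configuration, provided the working directory is still the project root.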

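8. Debugging may fail with ImportError: No module named win32api

The win32api module is provided by the pywin32 package, so install that: pip install pywin32 (older setups may need pip install pypiwin32 instead).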
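
Before starting a debug session, it can help to check which modules the interpreter selected in PyCharm can actually import. A minimal sketch; the function name missing_modules is hypothetical:

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

For example, missing_modules(["scrapy", "win32api"]) returns an empty list when both are installed for that interpreter.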

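9. You may also see the error Unknown command: crawl

This means the working directory is set incorrectly, so Scrapy cannot find the project and does not recognize the crawl command. As described above, set the working directory to the folder that contains scrapy.cfg and run again.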
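
To confirm which directory should be used as the working directory, one option is to search upward from a given path for scrapy.cfg. A minimal sketch using only the standard library; the function name find_scrapy_root is hypothetical:

```python
import os


def find_scrapy_root(start_dir):
    """Walk upward from start_dir and return the first directory that
    contains scrapy.cfg, or None if no such directory exists."""
    current = os.path.abspath(start_dir)
    while True:
        if os.path.isfile(os.path.join(current, "scrapy.cfg")):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent
```

Called with the spiders folder of the project from this article, it would be expected to return D:\PythonWorkspace\spider\tutorial, the value to enter as the working directory.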
