Python Scraping Tools
2016-01-05 06:52
1. Tools Introduction
scrapy: application framework for web scraping and crawling
beautifulsoup: library for parsing HTML
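mechanize: library for stateful programmatic web browsing (forms, cookies, links)
lxml: library for processing XML and HTML, built on libxml2 and libxslt
selenium/PhantomJS/CasperJS: browser automation tools, for pages whose scripts must be executed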
2. Install Scrapy on Windows 7 (32-bit)
(1) Install Python 2.7.
note: this installer needs administrator privileges.
(2) Install Microsoft Visual C++ Compiler for Python 2.7 (http://www.microsoft.com/en-us/download/details.aspx?id=44266).
note one: without this package, you will get an error like: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat)
note two: this installer does not need administrator privileges.
(3) Install lxml: download the installer from https://pypi.python.org/pypi/lxml/3.5.0 and run it.
note: if lxml is not installed in advance, the Scrapy installation will complain that it cannot find libxml2.
(4) From the Scripts directory of the Python 2.7 installation, run: pip.exe install Scrapy
(5) Install pywin32 (otherwise, you will get the error: no module named win32api).
Following Reference [2], run from the command line: easy_install-2.7.exe e:\software\pywin32-219.win32-py2.7.exe
download location: http://sourceforge.net/projects/pywin32/files/pywin32/
(6) Install selenium: pip.exe install selenium
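Once steps (1) to (6) are done, a quick import check can confirm that nothing is missing. The following is only a minimal sketch: the helper name check_modules and the REQUIRED list are made up for this illustration, and the list assumes the usual import names of the packages above (lxml, scrapy, win32api for pywin32, selenium).

```python
from __future__ import print_function


def check_modules(names):
    """Map each module name to True if it imports cleanly, else False."""
    results = {}
    for name in names:
        try:
            __import__(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


# Assumed import names for the packages installed in steps (3) to (6).
REQUIRED = ["lxml", "scrapy", "win32api", "selenium"]

# Print one status line per package.
for name, ok in sorted(check_modules(REQUIRED).items()):
    print("%-10s %s" % (name, "OK" if ok else "MISSING"))
```

Any line reported as MISSING points back to the corresponding install step; for example, a missing win32api means step (5) was skipped or failed.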
3. Set Up Eclipse PyDev for Scrapy
(1) Download Eclipse Luna (4.4).
(2) Install the PyDev plugin for Eclipse 4.4 and configure PyDev in the Eclipse preferences.
(3) Configure the project by following Reference [1]:
step 1: create a Scrapy project with the scrapy command (scrapy startproject)
step 2: create a PyDev project in Eclipse
step 3: copy the Scrapy project files into the PyDev project folder
after this step, the PyDev project shows a four-level folder hierarchy, because the Scrapy project itself has three levels.
step 4: in Eclipse, open Run -> Debug Configurations and fill in the Main tab
Name: any configuration name
Project: choose the Scrapy project (the PyDev project created in step 2)
Main Module: don't browse, just enter the full path of cmdline.py (in my case: D:\Python\Python27\Lib\site-packages\scrapy\cmdline.py)
step 5: in the same dialog, fill in the Arguments tab
Program arguments: crawl spidername
Working directory -> Other: choose the spider's working directory (typically the Scrapy project directory that contains scrapy.cfg)
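The Main Module path differs from machine to machine, so it can help to ask Python where cmdline.py lives rather than guessing. The sketch below is illustrative only: module_file and debug_command are hypothetical helper names, not part of Scrapy or PyDev. The second helper just spells out the command line that the debug configuration in steps 4 and 5 corresponds to.

```python
from __future__ import print_function
import os
import sys


def module_file(module_name):
    """Return the absolute source path of an importable module."""
    __import__(module_name)
    path = sys.modules[module_name].__file__
    # Python 2 may report the compiled file; point at the .py source instead.
    if path.endswith((".pyc", ".pyo")):
        path = path[:-1]
    return os.path.abspath(path)


def debug_command(main_module, spider_name):
    """Build the command equivalent to the Main + Arguments settings."""
    return [sys.executable, main_module, "crawl", spider_name]


# With Scrapy installed, the value for the Main Module field could be found with:
#     print(module_file("scrapy.cmdline"))
# and the equivalent command line with:
#     print(debug_command(module_file("scrapy.cmdline"), "spidername"))
```

Here "spidername" is the same placeholder used in step 5; replace it with the name of your own spider.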
References:
[1] https://www.zhihu.com/question/28565716/answer/53736780
[2] http://stackoverflow.com/questions/26689371/scrapy-no-module-named-win32api-windows