Python Scraping Tools

2016-01-05 06:52
1. Tools Introduction

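Scrapy: an application framework for web scraping and crawling

BeautifulSoup: a library for parsing HTML

mechanize: a library for stateful programmatic web browsing

lxml: a library for processing XML and HTML, built on libxml2 and libxslt

Selenium/PhantomJS/CasperJS: tools that drive a browser, for pages whose scripts must be executed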

2. Install Scrapy on Windows 7 (32-bit)

(1) Install Python 2.7

note: installing Python requires administrator privileges

(2) Install Microsoft Visual C++ Compiler for Python 2.7 (http://www.microsoft.com/en-us/download/details.aspx?id=44266)

note one: without this package you will get an error like: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat)

note two: installing this package does not require administrator privileges

(3) Install lxml (https://pypi.python.org/pypi/lxml/3.5.0): download the installer and run it.

note: if lxml is not installed in advance, the Scrapy installation will complain that it cannot find libxml2

(4) Install Scrapy by running this command from the Python 2.7 Scripts directory: pip.exe install Scrapy

(5) Install pywin32 (otherwise you will get the error: No module named win32api)

Following reference [2], run from the command line: easy_install-2.7.exe e:\software\pywin32-219.win32-py2.7.exe

The installer can be downloaded from: http://sourceforge.net/projects/pywin32/files/pywin32/

(6) Install selenium: pip.exe install selenium
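
To confirm that steps (3) to (6) worked, a quick import check can help. The following is only a minimal sketch: the function name check_modules is made up for illustration, and the import names (lxml, scrapy, win32api, selenium) are assumed to match the packages installed above. It uses only importlib.import_module, so it should run on Python 2.7 as well as Python 3.

```python
import importlib


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # The module is missing, or one of its own dependencies failed to load.
            results[name] = False
    return results


if __name__ == "__main__":
    # Assumed import names for the packages from steps (3) to (6).
    wanted = ["lxml", "scrapy", "win32api", "selenium"]
    for name, ok in sorted(check_modules(wanted).items()):
        print("%-10s %s" % (name, "OK" if ok else "MISSING"))
```

Any module reported as MISSING points back to the corresponding install step; for example, a missing win32api means step (5) needs to be repeated.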

3. Set up Eclipse PyDev for Scrapy

(1) Download Eclipse Luna (4.4)

(2) Install the PyDev plugin for Eclipse 4.4 and configure PyDev in the Eclipse preferences.

(3) Set up the Scrapy project in PyDev, following reference [1]:

step 1: create a Scrapy project with the scrapy command (scrapy startproject)

step 2: create a PyDev project in Eclipse

step 3: copy the Scrapy project files into the PyDev project folder

After this step you will see a four-level folder hierarchy, because the Scrapy project itself already has three levels.

step 4: in Eclipse, open Run -> Debug Configurations -> Main and set:

Name: any configuration name you like

Project: choose the Scrapy project

Main Module: do not browse; type the full path of cmdline.py (in my case: D:\Python\Python27\Lib\site-packages\scrapy\cmdline.py)

step 5: in Eclipse, open Run -> Debug Configurations -> Arguments and set:

Program arguments: crawl spidername

Working directory -> Other: choose the spider's working directory
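
The debug configuration above amounts to running cmdline.py with the arguments "crawl spidername". The same thing can be sketched as a small runner script that calls scrapy.cmdline.execute directly. This is an alternative sketch, not the method from reference [1]; the function names build_argv and run_spider are made up for illustration, and "spidername" is the same placeholder used in step 5.

```python
def build_argv(spider_name):
    """Build the argument list equivalent to the command `scrapy crawl <spider_name>`."""
    # The first item plays the role of the program name; the rest match
    # the "Program arguments" field from step 5.
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name):
    """Start a crawl the same way the scrapy command line tool would."""
    # Imported inside the function so this file can be loaded even when
    # Scrapy is not installed.
    from scrapy.cmdline import execute

    # execute() hands control to Scrapy, which exits the process when the crawl ends.
    execute(build_argv(spider_name))
```

Calling run_spider("spidername") from a file placed in the spider's working directory (assumed here to be the folder that contains scrapy.cfg) should start the same crawl, and that file can then be chosen as the Main Module instead of typing the path to cmdline.py.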

References:

[1] https://www.zhihu.com/question/28565716/answer/53736780
[2] http://stackoverflow.com/questions/26689371/scrapy-no-module-named-win32api-windows