Configuring Nutch (with MySQL as the data store)
2017-11-27 15:25
First, download the release package from http://www.apache.org/dyn/closer.cgi/nutch/

The rest of this post assumes the Nutch root directory is ${APACHE_NUTCH_HOME}.

Edit ${APACHE_NUTCH_HOME}/ivy/ivy.xml so that Nutch uses MySQL as its data store. Change

    <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>

to

    <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

Uncomment the following line:

    <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />

Also uncomment the following line so that MySQL is used as the Gora store:

    <!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Edit ${APACHE_NUTCH_HOME}/conf/gora.properties and add the following to activate the MySQL configuration:

    ###############################
    # MySQL properties            #
    ###############################
    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
    gora.sqlstore.jdbc.user=xxxxx
    gora.sqlstore.jdbc.password=xxxxx

Edit ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml and change the primary key length from 512 to 767:

    <primarykey column="id" length="767"/>

Edit ${APACHE_NUTCH_HOME}/conf/nutch-site.xml. Give the http.agent.name property a value; it can be anything, but it must not be empty. If needed, add extra languages to http.accept.language (for example, en for English), and set the default encoding to utf-8. The storage.data.store.class property selects the Gora SQL store:

    <property>
      <name>http.agent.name</name>
      <value>YourNutchSpider</value>
    </property>
    <property>
      <name>http.accept.language</name>
      <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
      <description>Value of the "Accept-Language" request header field.
      This allows selecting non-English language as default one to retrieve.
      It is a useful setting for search engines build for certain national group.
      </description>
    </property>
    <property>
      <name>parser.character.encoding.default</name>
      <value>utf-8</value>
      <description>The character encoding to fall back to when no other information
      is available</description>
    </property>
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.sql.store.SqlStore</value>
      <description>The Gora DataStore class for storing and retrieving data.
      Currently the following stores are available: ....
      </description>
    </property>

Install ant from the command line:

    sudo apt-get install ant

Then cd to the Nutch root directory and build Nutch there; the runtime/local directory used below is produced by the ant runtime target.

Enter the following commands in a terminal to set up your first crawl:

    cd ${APACHE_NUTCH_HOME}/runtime/local
    mkdir -p urls
    echo 'http://nutch.apache.org/' > urls/seed.txt

In Nutch 2.2, start the crawl with the following command, which sets the number of threads to 30:

    bin/nutch crawl urls -threads 30

To look at the crawled data, log in to MySQL:

    mysql -u xxxxx -p

and run the following statements at the mysql prompt:

    use nutch;
    SELECT * FROM nutch.webpage;

Translated from: http://wiki.apache.org/nutch/#Nutch_2.X_tutorial.28s.29
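A crawl fails early if any of the four gora.sqlstore.jdbc.* settings is missing or still commented out in gora.properties, so it can be worth checking the file before running bin/nutch. The following is only a minimal sketch: check_gora_props is a hypothetical helper name, not something Nutch or Gora ships, and it assumes the properties are written as key=value with no spaces around the equals sign, as in this post.

```shell
#!/bin/sh
# check_gora_props is a hypothetical helper, not part of Nutch or Gora.
# It reports which of the four gora.sqlstore.jdbc.* keys are absent
# (or commented out) in the given properties file.
# Assumption: entries are written as key=value with no spaces around '='.
check_gora_props() {
    props_file=$1
    missing=0
    for key in driver url user password; do
        # Match only active lines: the key must start the line and be
        # followed directly by '='. A leading '#' makes the line not match.
        if ! grep -q "^gora\.sqlstore\.jdbc\.${key}=" "$props_file"; then
            echo "missing: gora.sqlstore.jdbc.${key}" >&2
            missing=1
        fi
    done
    # Exit status 0 means all four keys were found, 1 means at least one is missing.
    return "$missing"
}
```

One possible use is to gate the crawl on the check: check_gora_props "${APACHE_NUTCH_HOME}/conf/gora.properties" && bin/nutch crawl urls -threads 30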