Crawling Chinese Websites with Nutch 2.1
2014-05-18 09:51
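This post adds Chinese-website crawling support to Nutch.

1. Crawling Chinese web pages

A. Adjust the MySQL configuration so that Chinese text stored in MySQL is not garbled. Edit ${APACHE_NUTCH_HOME}/runtime/local/conf/gora.properties: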
################################ MySQL properties ################################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://10.10.11.252:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=devuser
gora.sqlstore.jdbc.password=devuser

B. Edit ${APACHE_NUTCH_HOME}/runtime/local/conf/nutch-site.xml and add:

<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us, zh-cn,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group.</description>
</property>
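To see what the http.accept.language value in step B asks a server for, the header can be broken into its language tags and quality values. The following is only an illustrative sketch in Python, not part of Nutch: the function name parse_accept_language is hypothetical, and it assumes the usual HTTP convention that an entry without an explicit q parameter has q=1.0.

```python
from typing import List, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept-Language value into (language, q) pairs, highest q first.

    Hypothetical helper for illustration; Nutch itself only sends the header.
    """
    entries = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        # Each entry is "tag" or "tag;q=0.7"; a missing q is treated as 1.0.
        tag, _, params = part.partition(";")
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.lower() == "q" and value:
                q = float(value)
        entries.append((tag.strip().lower(), q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


if __name__ == "__main__":
    # The value configured in nutch-site.xml above.
    header = "ja-jp, en-us, zh-cn,en-gb,en;q=0.7,*;q=0.3"
    for tag, q in parse_accept_language(header):
        print(f"{tag}\tq={q}")
```

With the value above, ja-jp, en-us, zh-cn and en-gb all carry the same weight (q=1.0), followed by en (0.7) and the wildcard (0.3). How a server breaks ties among equally weighted languages depends on the server, so if Chinese pages are the priority, listing zh-cn first is a reasonable adjustment to try.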