
Crawling Chinese Websites with Nutch 2.1

2014-05-18 09:51
This post describes how to add Chinese website crawling support to Nutch.

1. Crawling Chinese web pages

A. Adjust the MySQL configuration so that Chinese text stored in MySQL is not garbled. Edit ${APACHE_NUTCH_HOME}/runtime/local/conf/gora.properties:

    ################################ MySQL properties ################################
    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://10.10.11.252:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
    gora.sqlstore.jdbc.user=devuser
    gora.sqlstore.jdbc.password=devuser

The useUnicode=true and characterEncoding=utf8 parameters in the JDBC URL are what prevent the garbling. There must be no whitespace between the "?" and the first parameter.

B. Edit ${APACHE_NUTCH_HOME}/runtime/local/conf/nutch-site.xml and add the following property:

    <property>
      <name>http.accept.language</name>
      <value>ja-jp, en-us, zh-cn,en-gb,en;q=0.7,*;q=0.3</value>
      <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group.</description>
    </property>
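The JDBC URL from step A is easy to break when it is copied around, for example by a stray space after the "?". The following is a minimal sketch, not part of Nutch or Gora, of a check you could run on the URL before starting a crawl. The class name `JdbcUrlCheck` and its methods are hypothetical helpers introduced here for illustration. It inspects the URL string only and does not connect to MySQL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Nutch/Gora API): inspects the query string of a
// JDBC URL to confirm the Unicode-related parameters from step A are present.
final class JdbcUrlCheck {

    /** Splits everything after '?' into key/value pairs, trimming whitespace. */
    static Map<String, String> parseParams(String jdbcUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = jdbcUrl.indexOf('?');
        if (q < 0) {
            return params; // no query string at all
        }
        for (String pair : jdbcUrl.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) { // skip empty or malformed fragments
                params.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return params;
    }

    /** True when useUnicode=true and characterEncoding is utf8 (or UTF-8). */
    static boolean isUnicodeSafe(String jdbcUrl) {
        Map<String, String> p = parseParams(jdbcUrl);
        String enc = p.getOrDefault("characterEncoding", "");
        return "true".equalsIgnoreCase(p.get("useUnicode"))
                && (enc.equalsIgnoreCase("utf8") || enc.equalsIgnoreCase("UTF-8"));
    }

    public static void main(String[] args) {
        // Same URL as in gora.properties above.
        String url = "jdbc:mysql://10.10.11.252:3306/nutch?useUnicode=true&characterEncoding=utf8"
                + "&autoReconnect=true&zeroDateTimeBehavior=convertToNull";
        System.out.println(parseParams(url));
        System.out.println("unicode-safe: " + isUnicodeSafe(url));
    }
}
```

Passing this check is not sufficient on its own: it is an assumption of this sketch that the database and its tables also use a UTF-8 character set, which has to be verified separately on the MySQL side.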