
Building a Web Search Engine on Windows with Nutch + Lucene + MySQL + Tomcat (Part 1)

2017-11-10 17:40
Environment:

(All of these tools are available from their official sites; download and install them yourself.)

nutch 2.2.0

lucene 7.1.0

apache-ant-1.10.1

apache-ivy-2.4.0

apache-tomcat-9.0.1

mysql

jdk-9.0.1_windows-x64_bin

I. Nutch + MySQL development environment under Eclipse

1. Use Nutch to crawl the web and store the crawled page content in MySQL.

2. Edit ivy/ivysettings.xml in the Nutch 2.2.1 source

Add a repository property:

<property name="org.restlet"
value="http://maven.restlet.org"
override="false"/>


Then find the following block and add any resolver that is not already listed:

<chain name="default" dual="true">
<resolver ref="local"/>
<resolver ref="maven2"/>
<resolver ref="apache-snapshot"/>
<resolver ref="sonatype"/>
<resolver ref="restlet"/>
</chain>


In my testing, some packages could not be downloaded without this resolver; it may depend on your network.

3. Edit ivy/ivy.xml

Enable the following two dependencies:

<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />

<dependency org="mys
4000
ql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>


4. Open a command prompt, change to the Nutch directory, and run:

ant eclipse -verbose

Because of limited network bandwidth, the whole process took about half an hour.

When it finishes, the result looks like the screenshot below.



During the build I ran into the following problems; the fixes are given below.

(1) ant: Unable to find a javac compiler

The Java environment is not configured correctly:

Unable to find a javac compiler;
com.sun.tools.javac.Main is not on the classpath.
Perhaps JAVA_HOME does not point to the JDK
org.apache.tools.ant.taskdefs.compilers.CompilerAdapterFactory.getCompiler(CompilerAdapterFactory.java:106)
org.apache.tools.ant.taskdefs.Javac.compile(Javac.java:935)
org.apache.tools.ant.taskdefs.Javac.execute(Javac.java:764)
org.apache.jasper.compiler.Compiler.generateClass(Compiler.java:382)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:472)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:451)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:439)
org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:511)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:295)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

note The full stack trace of the root cause is available in the Apache Tomcat/5.0.28 logs.


Fix:

First, check that your environment variables are set correctly; everyone knows how to do this, it is just easy to forget one, so double-check them.

Next, copy the tools.jar file from the JDK's lib directory into Tomcat's common/lib directory; that is usually enough.

Finally, if it still fails, open Tomcat's "Configure Tomcat" dialog, go to the Java tab, and add -Djava.home=C:/Program Files/Java/jdk1.5.0_04 to the Java Options.

(2) ivy:resolve doesn't support the "log" attribute

Add apache-nutch-2.2.1\ivy\ivy-2.2.0.jar to the CLASSPATH.

5. The build folder now contains much more content than before.



6. Open Eclipse and import the Nutch project via Import.



7. Configure conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>http.agent.name</name>
<value>YourNutchSpider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>plugin.folders</name>
<value>D:\APP\apache-nutch-2.2\src\plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….</description>
</property>

<property>
<name>generate.batch.id</name>
<value>*</value>
</property>

</configuration>


8. Configure gora.properties

gora.datastore.default=org.apache.gora.sql.store.SqlStore
gora.datastore.autocreateschema=true
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root
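
Before wiring this into Nutch, it is worth checking that the JDBC URL and credentials in gora.properties actually work. Below is a minimal sketch using plain JDBC; it assumes mysql-connector-java is on the classpath and reuses the root/root credentials and database name from the properties above.

import java.sql.Connection;
import java.sql.DriverManager;

// Quick sanity check for the JDBC settings used in gora.properties.
public class GoraJdbcCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true"
                + "&useUnicode=true&characterEncoding=utf8";
        try (Connection conn = DriverManager.getConnection(url, "root", "root")) {
            System.out.println("Connected to MySQL " + conn.getMetaData().getDatabaseProductVersion());
        }
    }
}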


9. Create the MySQL database and table structure

(1) When the crawled pages were written to the MySQL table, varchar(256) turned out to be too small, so I enlarged the columns to varchar(768).

(2) Crawled pages can contain emoji and other four-byte characters, and MySQL then reports:

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x90\xBB' for column


Fix:

1) Add the following when creating the table: ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;


2) Edit C:\ProgramData\MySQL\MySQL Server 5.7\my.ini as follows (ProgramData is a hidden folder):
------------------my.ini------------------------------------------------------
# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.6/en/server-configuration-defaults.html
[client]
default-character-set=utf8mb4
[mysql]
default-character-set = utf8mb4
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M

# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
# These are commonly set, remove the # and set as required.
# basedir = .....
# datadir = .....
# port = .....
# server_id = .....
# socket = .....

# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M

sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
log-error=/var/log/mysqld.log
long_query_time=3

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
init_connect='SET NAMES utf8mb4'

#log-slow-queries= /usr/local/mysql/log/slowquery.log
------------------------------------------------------------------------


3) Restart the MySQL service (service mysql stop; service mysql start; on Windows, restart the MySQL service from the Services panel) and the problem is solved.
The cause (as explained online):
MySQL's utf8 charset stores at most 3 bytes per character, but some Unicode characters take 4 bytes in UTF-8, which triggers the error.
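
A quick way to see the problem: the bytes in the error message ('\xF0\x9F\x90\xBB') are the UTF-8 encoding of U+1F43B (a bear emoji), which needs four bytes, one more than MySQL's legacy utf8 charset can store per character. A small sketch:

import java.nio.charset.StandardCharsets;

// Shows that the character from the error message takes 4 bytes in UTF-8,
// exceeding the 3-byte limit of MySQL's legacy "utf8" charset.
public class Utf8ByteCheck {
    public static void main(String[] args) {
        String bear = new String(Character.toChars(0x1F43B)); // decoded from '\xF0\x9F\x90\xBB'
        int length = bear.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("UTF-8 byte length: " + length);   // prints 4 -> utf8mb4 is required
    }
}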


(3) While crawling you will find that Nutch's default maximum content size is 65536 bytes:

Content of size 79091 was truncated to 65536

Fix:

Edit conf/nutch-default.xml (-1 means downloaded content is not truncated at all):

<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>


<property>
<name>parser.skip.truncated</name>
<value>false</value>
<description>Boolean value for whether we should skip parsing for truncated documents. By default this
property is activated due to extremely high levels of CPU which parsing can sometimes take.
</description>
</property>


With the changes above (varchar(768) columns and the utf8mb4 charset), the webpage table is created as follows:

CREATE TABLE nutch.webpage(

id varchar(768) NOT NULL,
headers blob,
text longtext DEFAULT NULL,
status int(11) DEFAULT NULL,
markers blob,
parseStatus blob,
modifiedTime bigint(20) DEFAULT NULL,
prevModifiedTime bigint(20) DEFAULT NULL,
score float DEFAULT NULL,
typ varchar(768) CHARACTER SET latin1 DEFAULT NULL,
batchId varchar(768) CHARACTER SET latin1 DEFAULT NULL,
baseUrl varchar(768) DEFAULT NULL,
content longblob,
title text DEFAULT NULL,
reprUrl varchar(768) DEFAULT NULL,
fetchInterval int(11) DEFAULT NULL,
prevFetchTime bigint(20) DEFAULT NULL,
inlinks mediumblob,
prevSignature blob,
outlinks mediumblob,
fetchTime bigint(20) DEFAULT NULL,
retriesSinceFetch int(11) DEFAULT NULL,
protocolStatus blob,
signature blob,
metadata blob,
PRIMARY KEY (id)
) ENGINE=InnoDB CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;


10. Configure the run arguments for Crawler.java

Right-click Crawler.java → Run As → Run Configurations → Java Application → Arguments.

If you do not need to crawl that deep, just reduce the depth and topN values.

Program arguments: urls -depth 10 -topN 500
VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
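
For reference, the same crawl can also be kicked off programmatically instead of through an Eclipse run configuration. This is only a sketch: it assumes the class lives at org.apache.nutch.crawl.Crawler (as in the Nutch 2.x tree) and exposes the main(String[]) entry point that the run configuration above invokes.

// Hypothetical launcher mirroring the Eclipse run configuration above.
public class CrawlLauncher {
    public static void main(String[] args) throws Exception {
        // Same effect as the -D VM arguments.
        System.setProperty("hadoop.log.dir", "logs");
        System.setProperty("hadoop.log.file", "hadoop.log");
        // Same effect as the program arguments "urls -depth 10 -topN 500".
        org.apache.nutch.crawl.Crawler.main(new String[] { "urls", "-depth", "10", "-topN", "500" });
    }
}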




However, running Crawler.java usually fails with the following error:

Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Administrator\mapred\staging\Administrator606301699.staging to 0700

This is a Hadoop-on-Windows problem: edit FileUtil.java and comment out the two statements below, otherwise the Crawl run aborts with the path-permission error.

private static void checkReturnValue(boolean rv, File p, FsPermission permission)
throws IOException
{
//if (!rv)
//  throw new IOException(new StringBuilder().append("Failed to set permissions of path: ").append(p).append(" to ").append(String.format("%04o", new Object[] { Short.valueOf(permission.toShort()) })).toString());
}


Since patching the class yourself requires decompiling, I provide an already patched hadoop-core-1.2.0 here:

a hadoop-core-1.2.0.jar with checkReturnValue commented out, plus the compiled FileUtil class file, which can be used to replace the corresponding class inside hadoop-core-1.2.0.jar.

11. Create the urls seed list to crawl

In the project directory, create a urls folder and put a seed.txt file inside it.

Add the URLs of the sites you want to crawl, e.g. http://www.cnblogs.com/

Note: this urls folder corresponds to the urls argument in the Crawler run configuration (a small helper for creating it is sketched below).
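
If you prefer to generate the seed list from code, here is a minimal sketch; the folder and file names must match the urls run argument and the seed.txt convention above.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Creates urls/seed.txt in the project directory with one start URL per line.
public class SeedWriter {
    public static void main(String[] args) throws Exception {
        Path urlsDir = Paths.get("urls");   // matches the "urls" run argument
        Files.createDirectories(urlsDir);
        Files.write(urlsDir.resolve("seed.txt"), Arrays.asList("http://www.cnblogs.com/"));
    }
}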

12. Run Crawler.java and check the data in MySQL (a quick JDBC check is sketched below).
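
Any MySQL client works for this, of course; the following JDBC sketch simply lists a few rows from the webpage table defined above, reusing the credentials from gora.properties.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Prints status, id and title for the first few crawled pages.
public class CrawlResultCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/nutch?useUnicode=true&characterEncoding=utf8";
        try (Connection conn = DriverManager.getConnection(url, "root", "root");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, title, status FROM webpage LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt("status") + "  " + rs.getString("id") + "  " + rs.getString("title"));
            }
        }
    }
}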

13. In most cases a site publishes a robots.txt with anti-crawling rules.

Nutch honors this protocol, but you can bypass it by modifying the Nutch source.

Just comment out the following code in the FetcherReducer class:

/*
if (!rules.isAllowed(fit.u.toString())) {
// unblock
fetchQueues.finishFetchItem(fit, true);
if (LOG.isDebugEnabled()) {
LOG.debug("Denied by robots.txt: " + fit.url);
}
output(fit, null, ProtocolStatusUtils.STATUS_ROBOTS_DENIED,
CrawlStatus.STATUS_GONE);
continue;
}
*/

