How to Improve Team Collaboration Efficiency
2013-11-14 11:35
Note: the Tomcat and Nutch paths below must be changed to your own.

# Nutch root directory
NUTCH_HOME=/cygdrive/e/java/CoreJava/IndexSearchAbout/nutch-1.0
# Tomcat directory
CATALINA_HOME=/cygdrive/d/JavaTools/apache-tomcat-6.0.14

Also do a batch replace of crawled/ with your own index storage directory.
Save the shell code below into your Nutch root directory under any file name you like (e.g. runbot),
then type the file name in Cygwin to run it.
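The setup steps above (edit the paths, batch-replace the crawled/ prefix, then run the script from Cygwin) can be sketched as follows. This is a minimal sketch: the script name runbot and the replacement directory mycrawl/ are assumptions, not names from the original; substitute your own.

```shell
# Hypothetical values -- substitute your own Nutch root, script name,
# and index storage directory.
NUTCH_HOME=/cygdrive/e/java/CoreJava/IndexSearchAbout/nutch-1.0
cd "$NUTCH_HOME"

# Batch-replace the crawled/ prefix with your index storage directory
# (here assumed to be mycrawl/). '#' is used as the sed delimiter so
# the slashes in the paths need no escaping.
sed -i 's#crawled/#mycrawl/#g' runbot

# Make the script executable, then run it from the Cygwin shell
chmod +x runbot
./runbot
```

The sed delimiter trick is the only non-obvious part: with the default `s/…/…/` form, every `/` inside the paths would have to be escaped.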
#!/bin/sh
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal
#
# Incremental-crawl caveat: if Crawl and Searcher run on the same
# machine, a running Tomcat holds the index files open. During an
# incremental crawl the spider has to delete the old index and
# regenerate it (the crawl/index folder); since Tomcat has that folder
# locked, the operation fails and the program throws an error.
# A simple way to check whether crawl/index is locked is to try
# deleting it manually (back the folder up first); if the delete
# fails, it is in use.
#
# The incremental-crawl script below handles the lock problem in its
# logic, but because the machine may not shut down java.exe promptly,
# it still often throws a "dir ... exists" style exception.
# 1. Inject the crawl seed URLs
# 2. Crawl depth by depth
# 3. Merge the fetched segments
# 4. Invert segment links into linkdb
# 5. Generate indexes
# 6. Deduplicate
# 7. Merge indexes
# 8. Stop Tomcat to release the index, swap in the new index, restart Tomcat
#
# Parameters
depth=5
threads=10
adddays=1
topN=30 # Comment this statement if you don't want to set topN value

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
# Mode: "yes" backs up the indexes before replacing them; otherwise
# the index is updated in place.
safe=yes

# Nutch root directory
NUTCH_HOME=/cygdrive/e/java/CoreJava/IndexSearchAbout/nutch-1.0
# Tomcat directory
CATALINA_HOME=/cygdrive/d/JavaTools/apache-tomcat-6.0.14

if [ -z "$NUTCH_HOME" ]
then
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8

# 1. Inject the crawl seed URLs
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawled/crawldb urls/url.txt

# 2. Crawl depth by depth
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawled/crawldb crawled/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi

  segment=`ls -d crawled/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawled/crawldb $segment
done

# 3. Merge the fetched segments
echo "----- Merge Segments (Step 3 of $steps) -----"
# Merge the segments into a single one saved as MERGEDsegments
$NUTCH_HOME/bin/nutch mergesegs crawled/MERGEDsegments crawled/segments/*
#rm $RMARGS crawled/segments
rm $RMARGS crawled/BACKUPsegments
mv $MVARGS crawled/segments crawled/BACKUPsegments
mkdir crawled/segments
mv $MVARGS crawled/MERGEDsegments/* crawled/segments
rm $RMARGS crawled/MERGEDsegments

# 4. Invert segment links into linkdb
echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawled/linkdb crawled/segments/*

# 5. Generate indexes
echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawled/NEWindexes crawled/crawldb \
    crawled/linkdb crawled/segments/*

# 6. Deduplicate
echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawled/NEWindexes

# 7. Merge indexes
echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawled/NEWindex crawled/NEWindexes

# 8. Stop Tomcat to release the threads holding the index, update the
#    index, then restart Tomcat. Tomcat must be stopped first; otherwise
#    it keeps the index folder open and the index files cannot be
#    updated (exceptions like "dir exists" / "file already exists").
echo "----- Loading New Index (Step 8 of $steps) -----"
#${CATALINA_HOME}/bin/shutdown.sh

# In safe mode, back the indexes up before deleting them
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawled/NEWindexes
  rm $RMARGS crawled/index
else
  rm $RMARGS crawled/BACKUPindexes
  rm $RMARGS crawled/BACKUPindex
  mv $MVARGS crawled/NEWindexes crawled/BACKUPindexes
  mv $MVARGS crawled/index crawled/BACKUPindex
  rm $RMARGS crawled/NEWindexes
  rm $RMARGS crawled/index
fi

# The old index must be deleted first (done above) before the new
# index takes its place
mv $MVARGS crawled/NEWindex crawled/index

# Restart Tomcat once the index update is complete
#${CATALINA_HOME}/bin/startup.sh

echo "runbot: FINISHED: Crawl completed!"
echo ""
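The script's header suggests checking whether the crawl/index folder is still held open by Tomcat by trying to delete it after making a backup. That manual check can be sketched as below; it is a sketch under the assumption that the index lives at crawled/index relative to the Nutch root, and note that the lock behavior it probes is Windows/Cygwin-specific (on Linux, rm usually succeeds even while files are open).

```shell
# Back the index up, then try to remove it; if the removal fails,
# some process (typically Tomcat) still holds the directory open.
cp -r crawled/index crawled/index.bak

if rm -rf crawled/index 2>/dev/null && [ ! -d crawled/index ]; then
    echo "index directory is free; restoring the backup"
    mv crawled/index.bak crawled/index
else
    echo "index directory appears to be locked by a running process"
fi
```

If the check reports the directory as locked, shut Tomcat down (bin/shutdown.sh) before re-running the crawl, which is exactly what step 8 of the script automates.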