
Introduction to Nutch, Part 2: Searching



Thu, 2006-02-16

Tom White

In part one of this two-part series on Nutch, the open-source Java search engine, we looked at how to crawl websites. Recall that the Nutch crawler system produces three key data structures:

The WebDB, containing the web graph of pages and links.

A set of segments containing the raw data retrieved from the Web by the fetchers.

The merged index created by indexing and de-duplicating parsed data from the segments.

In this article, we turn to searching. The Nutch search system uses the index and segments generated during the crawling process to answer users' search queries. We shall see how to get the Nutch search application up and running, and how to customize and extend it for integration into an existing website. We'll also look at how to re-crawl sites to keep your index up to date--a requirement of all real-world search engines.

Running the Search Application

Without further ado, let's run a search using the results of the crawl we did last time. Tomcat seems to be the most popular servlet container for running Nutch, so let's assume you have it installed (although there is some guidance on the Nutch wiki for Resin). The first step is to install the Nutch web app. There are some reported problems with running Nutch (version 0.7.1) as a non-root web app, so it is currently safest to install it as the root web app. This is what the Nutch tutorial advises. If Tomcat's web apps are in ~/tomcat/webapps/, then type the following in the directory where you unpacked Nutch:

rm -rf ~/tomcat/webapps/ROOT*
cp nutch*.war ~/tomcat/webapps/ROOT.war

The second step is to ensure that the web app can find the index and segments that we generated last time. Nutch looks for these in the index and segments subdirectories of the directory defined in the searcher.dir property. The default value for searcher.dir is the current directory (.), which is where you started Tomcat. While this may be convenient during development, often you don't have so much control over the directory in which Tomcat starts up, so you want to be explicit about where the index and segments may be found. Recall from part one that Nutch's configuration files are found in the conf subdirectory of the Nutch distribution. For the web app, these files can be found in WEB-INF/classes/. So we simply create a file called nutch-site.xml in this directory (of the unpacked web app) and set searcher.dir to be the crawl directory containing the index and segments.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>


<!-- Put site-specific property overrides in this file. -->

<nutch-conf>
<property>
<name>searcher.dir</name>
<value>/Users/tom/Applications/nutch-0.7.1/crawl-tinysite</value>
</property>
</nutch-conf>

After restarting Tomcat, enter the URL of the root web app in your browser (in this example, I'm running Tomcat on port 80, but the default is port 8080) and you should see the Nutch home page. Do a search and you will get a page of search results like Figure 1.



Figure 1. Nutch search results for the query "animals"

The search results are displayed using the format used by all mainstream search engines these days. The explain and anchors links that are shown for each hit are unusual and deserve further comment.

Score Explanation
Clicking the explain link for the page A hit brings up the page shown in Figure 2. It shows some metadata for the page hit (page A), and a score explanation. The score explanation is a Lucene feature that shows all of the factors that contribute to the calculated score for a particular hit. The formula for score calculation is rather technical, so it is natural to ask why this page is provided by Nutch when it is clearly unsuitable for the average user.



Figure 2. Nutch's score explanation page for page A, matching the query "animals"

One of Nutch's key selling points is its transparency. Its ranking algorithms are open source, so anyone can see them. Nutch's ability to "explain" its rankings online--via the explain link--takes this one step further and allows an (expert) user to see why one particular hit ranked above another for a given search. In practice, this page is only really useful for diagnostic purposes for people running a Nutch search engine, so there is no need to expose it publicly, except perhaps for PR reasons.

Anchors
The anchors page (not illustrated here) provides a list of the incoming anchor text for the pages that link to the page of interest. In this case, the link to page A from page B had the anchor text "A." Again, this is a feature for Nutch site maintainers rather than the average user of the site.

Integrating Nutch Search

While the Nutch web app is a great way to get started with search, most projects using Nutch require the search function to be more tightly integrated with their application. There are various ways to achieve this, depending on the application. The two ways we'll look at here are using the Nutch API and using the OpenSearch API.

Using the Nutch API
If your application is written in Java, then it is worth considering using Nutch's API directly to add a search capability. Of course, the Nutch web app is written using the Nutch API, so you may find it fruitful to use it as a starting point for your application. If you take this approach, the files to take a look at first are the JSPs in src/web/jsp in the Nutch distribution.

To demonstrate Nutch's API, we'll write a minimal command-line program to perform a search. We'll run the program using Nutch's launcher, so for the search we did above, for the term "animals," we type:

bin/nutch org.tiling.nutch.intro.SearchApp animals

And the output is as follows.

'A' is for Alligator (http://keaton/tinysite/A.html)
<b> ... </b>Alligators' main prey are smaller <b>animals</b> that they can kill and<b> ... </b>


'C' is for Cow (http://keaton/tinysite/C.html)
<b> ... </b>leather and as draught <b>animals</b> (pulling carts, plows and<b> ... </b>

Here's the program that achieves this. To get it to run, the compiled class is packaged in a .jar file, which is then placed in Nutch's lib directory. See the Resources section to obtain the .jar file.

package org.tiling.nutch.intro;

import java.io.IOException;

import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;

public class SearchApp {

  private static final int NUM_HITS = 10;

  public static void main(String[] args) throws IOException {

    if (args.length == 0) {
      String usage = "Usage: SearchApp query";
      System.err.println(usage);
      System.exit(-1);
    }

    NutchBean bean = new NutchBean();
    Query query = Query.parse(args[0]);
    Hits hits = bean.search(query, NUM_HITS);

    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);

      String title = details.getValue("title");
      String url = details.getValue("url");
      String summary = bean.getSummary(details, query);

      System.out.print(title);
      System.out.print(" (");
      System.out.print(url);
      System.out.println(")");
      System.out.println("\t" + summary);
    }

  }

}

Although it's a short and simple program, Nutch is doing lots of work for us, so we'll examine it in some detail. The central class here is NutchBean--it orchestrates the search for us. Indeed, the doc comment for NutchBean states that it provides "One-stop shopping for search-related functionality."

Upon construction, the NutchBean object opens the index it is searching against in read-only mode, and reads the set of segment names and filesystem locations into memory. The index and segments locations are configured in the same way as they were for the web app: via the searcher.dir property.

Before we can perform the search, we parse the query string given as the first parameter on the command line (args[0]) into a Nutch Query object. The Query.parse() method invokes Nutch's specialized parser (org.apache.nutch.analysis.NutchAnalysis), which is generated from a grammar using the JavaCC parser generator. Although Nutch relies heavily on Lucene for its text indexing, analysis, and searching capabilities, there are many places where Nutch enhances or provides different implementations of core Lucene functions. This is the case for Query, so be careful not to confuse Lucene's org.apache.lucene.search.Query with Nutch's org.apache.nutch.searcher.Query. The types represent the same concept (a user's query), but they are not type-compatible with one another.

With a Query object in hand, we can now ask the bean to do the search for us. It does this by translating the Nutch Query into an optimized Lucene Query, then carrying out a regular Lucene search. Finally, a Nutch Hits object is returned, which represents the top matches for the query. This object only contains index and document identifiers. To return useful information about each hit, we go back to the bean to get a HitDetails object for each hit we are interested in, which contains the data from the index. We retrieve only the title and URL fields here, but there are more fields available: the field names may be found using the getField(int i) method of HitDetails.

The last piece of information that is displayed by the application is a short HTML summary that shows the context of the query terms in each matching document. The summary is constructed by the bean's getSummary() method. The HitDetails argument is used to find the segment and document number for retrieving the document's parsed text, which is then processed to find the first occurrence of any of the terms in the Query argument. Note that the amount of context to show in the summary--that is, the number of terms before and after the matching query terms--and the maximum summary length are both Nutch configuration properties (searcher.summary.context and searcher.summary.length, respectively).
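Both properties can be overridden in the web app's nutch-site.xml, alongside searcher.dir. Here is a sketch; the values are illustrative, not the defaults:

```xml
<?xml version="1.0"?>
<nutch-conf>
  <!-- Show more words of context around each matching term -->
  <property>
    <name>searcher.summary.context</name>
    <value>10</value>
  </property>
  <!-- Allow longer summaries overall (measured in terms) -->
  <property>
    <name>searcher.summary.length</name>
    <value>40</value>
  </property>
</nutch-conf>
```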

That's the end of the example, but you may not be surprised to learn that NutchBean provides access to more of the data stored in the segments, such as cached content and fetch date. Take a look at the API documentation for more details.

Using the OpenSearch API
OpenSearch is an extension of RSS 2.0 for publishing search engine results, and was developed by A9.com, the search engine owned by Amazon.com. Nutch supports OpenSearch 1.0 out of the box. The OpenSearch results for the search in Figure 1 can be accessed by clicking on the RSS link in the bottom right-hand corner of the page. This is the XML that is returned:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:nutch="http://www.nutch.org/opensearchrss/1.0/"
     xmlns:opensearch="http://a9.com/-/spec/opensearchrss/1.0/">

  <channel>
    <title>Nutch: animals</title>
    <description>Nutch search results for query: animals</description>
    <link>http://localhost/search.jsp?query=animals&amp;start=0&amp;hitsPerDup=2&amp;hitsPerPage=10</link>

    <opensearch:totalResults>2</opensearch:totalResults>
    <opensearch:startIndex>0</opensearch:startIndex>
    <opensearch:itemsPerPage>10</opensearch:itemsPerPage>

    <nutch:query>animals</nutch:query>

    <item>
      <title>'A' is for Alligator</title>
      <description><b> ... </b>Alligators'
        main prey are smaller <b>animals</b>
        that they can kill and<b> ... </b></description>
      <link>http://keaton/tinysite/A.html</link>

      <nutch:site>keaton</nutch:site>
      <nutch:cache>http://localhost/cached.jsp?idx=0&amp;id=0</nutch:cache>
      <nutch:explain>http://localhost/explain.jsp?idx=0&amp;id=0&amp;query=animals</nutch:explain>
      <nutch:docNo>0</nutch:docNo>
      <nutch:segment>20051025121334</nutch:segment>
      <nutch:digest>fb8b9f0792e449cda72a9670b4ce833a</nutch:digest>
      <nutch:boost>1.3132616</nutch:boost>
    </item>

    <item>
      <title>'C' is for Cow</title>
      <description><b> ... </b>leather
        and as draught <b>animals</b>
        (pulling carts, plows and<b> ... </b></description>
      <link>http://keaton/tinysite/C.html</link>

      <nutch:site>keaton</nutch:site>
      <nutch:cache>http://localhost/cached.jsp?idx=0&amp;id=2</nutch:cache>
      <nutch:explain>http://localhost/explain.jsp?idx=0&amp;id=2&amp;query=animals</nutch:explain>
      <nutch:docNo>1</nutch:docNo>
      <nutch:segment>20051025121339</nutch:segment>
      <nutch:digest>be7e0a5c7ad9d98dd3a518838afd5276</nutch:digest>
      <nutch:boost>1.3132616</nutch:boost>
    </item>

  </channel>
</rss>

This document is an RSS 2.0 document, where each hit is represented by an item element. Notice the two extra namespaces, opensearch and nutch, which allow search-specific data to be included in the RSS document. For example, the opensearch:totalResults element tells you the number of search results available (not just those returned in this page). Nutch also defines its own extensions, allowing consumers of this document to access page metadata or related resources, such as the cached content of a page, via the URL in the nutch:cache element.

Using OpenSearch to integrate Nutch is a great fit if your front-end application is not written in Java. For example, you could write a PHP front end to Nutch by writing a PHP search page that calls the OpenSearch servlet and then parses the RSS response and displays the results.
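Because the response is plain RSS 2.0, any XML parser can consume it. The sketch below uses only the JDK's built-in DOM parser to pull the title and link out of each hit. The SAMPLE string is an abbreviated, hand-made version of the response shown above, and OpenSearchClient is a hypothetical class name, not part of Nutch:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OpenSearchClient {

  // Abbreviated OpenSearch RSS response, hand-made for illustration.
  static final String SAMPLE =
      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
    + "<rss version=\"2.0\""
    + " xmlns:nutch=\"http://www.nutch.org/opensearchrss/1.0/\""
    + " xmlns:opensearch=\"http://a9.com/-/spec/opensearchrss/1.0/\">"
    + "<channel><title>Nutch: animals</title>"
    + "<item><title>'A' is for Alligator</title>"
    + "<link>http://keaton/tinysite/A.html</link></item>"
    + "<item><title>'C' is for Cow</title>"
    + "<link>http://keaton/tinysite/C.html</link></item>"
    + "</channel></rss>";

  // Extract "title (link)" strings from the item elements of an
  // OpenSearch RSS document.
  static List<String> parseItems(String rss) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(rss.getBytes("UTF-8")));
    List<String> hits = new ArrayList<String>();
    NodeList items = doc.getElementsByTagName("item");
    for (int i = 0; i < items.getLength(); i++) {
      // getElementsByTagName on an item only sees that item's children,
      // so the channel's own title element is not picked up here.
      Element item = (Element) items.item(i);
      String title =
          item.getElementsByTagName("title").item(0).getTextContent();
      String link =
          item.getElementsByTagName("link").item(0).getTextContent();
      hits.add(title + " (" + link + ")");
    }
    return hits;
  }

  public static void main(String[] args) throws Exception {
    for (String hit : parseItems(SAMPLE)) {
      System.out.println(hit);
    }
  }
}
```

A real client would, of course, fetch the XML from the OpenSearch servlet's URL rather than a hard-coded string.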

Real-World Nutch Search

The examples we have looked at so far have been very simple in order to demonstrate the concepts behind Nutch. In a real Nutch setup, other considerations come into play. One of the most frequently asked questions on the Nutch newsgroups concerns keeping the index up to date. The rest of this article looks at how to re-crawl pages to keep your search results fresh and relevant.

Re-Crawling
Unfortunately, re-crawling is not as simple as re-running the crawl tool that we saw in part one. Recall that this tool creates a pristine WebDB each time it is run, and starts compiling lists of URLs to fetch from a small set of seed URLs. A re-crawl starts with the WebDB structure from the previous crawl and constructs the fetchlist from there. This is generally a good idea, as most sites have a relatively static URL scheme. It is, however, possible to filter out the transient portions of a site's URL space that should not be crawled by editing the conf/regex-urlfilter.txt configuration file. Don't be confused by the similarity between conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt--while they both have the same syntax, the former is used only by the crawl tool, and the latter by all other tools.
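For example, to keep re-crawls away from a transient part of a site's URL space, you could add an exclusion rule before the catch-all accept rule in conf/regex-urlfilter.txt. The patterns below are hypothetical; tailor them to your own site:

```
# Skip URLs containing a session ID or a per-day calendar view
# (hypothetical patterns)
-.*(sessionid|calendar).*

# Accept everything else
+.
```

Rules are applied in order, so the exclusion must come before the final accept-everything line.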

The re-crawl amounts to running the generate/fetch/update cycle, followed by index creation. To accomplish this, we employ the lower-level Nutch tools to which the crawl tool delegates. Here is a simple shell script to do it:

#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

webdb_dir=$crawl_dir/db
segments_dir=$crawl_dir/segments
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
mkdir tmp
bin/nutch updatesegs $webdb_dir $segments_dir tmp
rm -R tmp

# Index segments
for segment in `ls -d $segments_dir/* | tail -$depth`
do
  bin/nutch index $segment
done

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $segments_dir bogus

# Merge indexes
ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
To re-crawl the toy site we crawled in part one, we would run:

./recrawl crawl-tinysite 3

The script is practically identical to the crawl tool, except that it doesn't create a new WebDB or inject it with seed URLs. Like crawl, the script takes an optional second argument, depth, which controls the number of iterations of the generate/fetch/update cycle to run (the default is five). Here we have specified a depth of three. This allows us to pick up new links that may have been created since the last crawl.

The script supports a third argument, adddays, which is useful for forcing pages to be retrieved even if they are not yet due to be re-fetched. The page re-fetch interval in Nutch is controlled by the configuration property db.default.fetch.interval, and defaults to 30 days. The adddays argument can be used to advance the clock for fetchlist generation (but not for calculating the next fetch time), thereby fetching pages early.
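For example, a frequently changing site could be re-fetched weekly by overriding this property (its value is in days) in nutch-site.xml. A sketch, following the configuration format shown earlier:

```xml
<?xml version="1.0"?>
<nutch-conf>
  <!-- Re-fetch pages after seven days instead of the default 30 -->
  <property>
    <name>db.default.fetch.interval</name>
    <value>7</value>
  </property>
</nutch-conf>
```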

Updating the Live Search Index
Even with the re-crawl script, we have a problem with updating the live search index. As mentioned above, the NutchBean class opens the index to search when it is initialized. Since the Nutch web app caches the NutchBean in the application servlet context, updates to the index will never be picked up as long as the servlet container is running.

This problem is recognized by the Nutch community, so it will likely be fixed in an upcoming release (Nutch 0.7.1 was the stable release at the time of writing). Until Nutch provides a way to do it, you can work around the problem--possibly the simplest way is to reload the Nutch web app after the re-crawl completes. More sophisticated ways of solving the problem are discussed on the newsgroups. These typically involve modifying NutchBean and the search JSP to pick up changes to the index.
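One low-tech way to trigger that reload, assuming the web app is deployed as ROOT under a standard Tomcat layout with a reloadable context (the TOMCAT_HOME location below is an assumption), is to touch the web app's web.xml at the end of the re-crawl; Tomcat notices the timestamp change and redeploys the context, constructing a fresh NutchBean against the updated index:

```shell
#!/bin/bash
# Sketch: prompt Tomcat to reload the root web app after a re-crawl.
# TOMCAT_HOME is an assumed location; adjust for your installation.
TOMCAT_HOME=${TOMCAT_HOME:-"$HOME/tomcat"}
WEB_XML="$TOMCAT_HOME/webapps/ROOT/WEB-INF/web.xml"

if [ -f "$WEB_XML" ]
then
  # Updating the timestamp causes Tomcat to redeploy the context.
  touch "$WEB_XML"
  echo "touched $WEB_XML; Tomcat will reload the web app"
else
  echo "no deployed web app found at $WEB_XML"
fi
```

This could be appended to the recrawl script so the live index is refreshed as soon as the new segments are merged.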

Conclusion

In this two-article series, we introduced Nutch and discovered how to crawl a small collection of websites and run a Nutch search engine using the results of the crawl. We covered the basics of Nutch, but there are many other aspects to explore, such as the numerous plugins available to customize your setup, the tools for maintaining the search index (type bin/nutch to get a list), or even whole-web crawling and searching. Possibly the best thing about Nutch, though, is its vibrant user and developer community, which is continually coming up with new ideas and ways to do all things search-related.

Resources

Download the code supporting this article.

Part one of this series covers the Nutch crawler system. It also lists a number of useful resources.