Building Site Search for an SSH System with Lucene + Paoding

2010-08-22 18:08

Keywords: lucene paoding search

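Goal: build a highly portable site search that rebuilds its index on a schedule.
Approach: keep both the dictionary (dic) and the index inside the application itself.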

Preparation:
1 Lucene
The current release of Lucene Java (Lucene below) is 2.4.0. For details see http://lucene.apache.org/java/docs/index.html.

2 Paoding
An excellent Chinese word-segmentation component for Lucene, written by Qieqie. The current version is paoding-analysis-2.0.4-beta, which targets Lucene 2.2. For details see http://code.google.com/p/paoding/.

3 Download the latest paoding-analysis-2.0.4-beta package (it bundles lucene-core-2.2.0.jar, lucene-analyzers-2.2.0.jar, lucene-highlighter-2.2.0.jar, junit.jar and commons-logging.jar).

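Getting started:
   1 Trial run
Open the examples folder in the download package and run the examples (watch the file encoding).

   2 Integrating into the SSH2 system (layering: Action -> service -> dao)
1) Because an SSH2 system is a web application, configuring Paoding may differ somewhat from the trial run in step 1.
Copy every file under the src folder of the paoding package, together with the dic folder, into your project. Open paoding-dic-home.properties and set paoding.dic.home.config-fisrt=this (copy the key exactly as it is spelled in that file) so that the program uses this configuration file, then set paoding.dic.home=classpath:dic so that the dictionary lives inside the project. Save the file.
I use classpath:dic here for portability. An absolute path needs no further work, but if you specify classpath:dic you have to patch Paoding's code. In PaodingMaker.java, find the setDicHomeProperties method and replace the line File dicHomeFile = getFile(dicHome); with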

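Java code

// Needs java.net.URLDecoder and java.io.UnsupportedEncodingException imported in PaodingMaker.java
File dicHomeFile2 = getFile(dicHome);
String path = "";
try {
    // Decode percent-escapes (spaces, Chinese characters) left in the resolved path
    path = URLDecoder.decode(dicHomeFile2.getPath(), "UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
File dicHomeFile = new File(path);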

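The point is to decode the path. Otherwise, if the dictionary path contains spaces or Chinese characters, Paoding fails with a dictionary-not-found exception.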
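As a rough illustration of why the decoding step matters, the sketch below runs URLDecoder over a percent-encoded path of the kind a classpath URL can produce. The class name PathDecodeSketch and the sample path are made up for this example and are not part of Paoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PathDecodeSketch {

    // Turns percent-escapes such as %20 or %E8%AF%8D back into the real characters.
    // Note that URLDecoder also maps '+' to a space, so a literal plus sign in a
    // directory name would be altered by this approach.
    static String decodePath(String rawPath) {
        try {
            return URLDecoder.decode(rawPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical path with an encoded space and two encoded Chinese characters.
        String raw = "/C:/Program%20Files/app/WEB-INF/classes/%E8%AF%8D%E5%85%B8";
        System.out.println(raw);
        System.out.println(decodePath(raw));
    }
}
```

Without the decoding, new File would look for a directory literally named "Program%20Files", which is why the dictionary cannot be found.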

2) Table structure

Sql code

CREATE TABLE `news` (
  `id` int(11) NOT NULL auto_increment,
  `title` varchar(255) default NULL,
  `details` mediumtext,
  `author` varchar(255) default NULL,
  `publisher` varchar(100) default NULL,
  `clicks` int(11) default NULL,
  `source` varchar(255) default NULL,
  `addtime` datetime default NULL,
  `category` varchar(100) default NULL,
  `keywords` varchar(255) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=gbk;

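3 Implementation
The site search is written in two steps: building the index and searching it. The classes needed are SearchAction.java and TaskAction.java (in the same directory).
1) Building the index
Main task: read from an existing txt file the id of the last news item indexed by the previous run, ask the business layer for every news item with a larger id and index those, then write the id of the last item indexed in this run back to the txt file. Take care with paths here. All the txt files that record ids are placed in the action directory.
Create TaskAction and add the following methods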

Java code

// Imports for TaskAction; News, Constants and newsManageService come from the project itself.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URLDecoder;
import java.util.Iterator;
import java.util.List;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Entry point invoked by the scheduler.
public void createIndex() {
    try {
        // Two arguments: the index location, and the file holding the last news id indexed by the previous run
        createNewsIndex(getPath(TaskAction.class, "date/index/news"), "newsid.txt");
    } catch (Exception e) {
        e.printStackTrace();
    }
}

// Resolves a resource next to clazz to a file-system path.
// substring(6) strips the leading "file:/", which assumes a Windows-style URL such as file:/C:/...;
// on Unix-like systems substring(5) keeps the leading slash.
public String getPath(Class clazz, String textName) throws IOException {
    String path = (URLDecoder.decode(
            clazz.getResource(textName).toString(), "UTF-8")).substring(6);
    return path;
}

public void createNewsIndex(String path, String textName) throws Exception {
    String newsId = readText(TaskAction.class, textName);
    // An empty file means nothing has been indexed yet
    if (null == newsId || "".equals(newsId.trim()))
        newsId = "0";

    // Use the Paoding Chinese analyzer
    Analyzer analyzer = new PaodingAnalyzer();
    FSDirectory directory = FSDirectory.getDirectory(path);
    System.out.println(directory.toString());
    // The third argument is true (create a new index) only while no id has been recorded
    IndexWriter writer = new IndexWriter(directory, analyzer, isEmpty(TaskAction.class, textName));

    // Ask the business layer for the news items whose id is greater than the recorded one
    List list = newsManageService.getNewsBigId(Integer.parseInt(newsId.trim()));
    Iterator iterator = list.iterator();
    while (iterator.hasNext()) {
        News news = (News) iterator.next();
        Document doc = new Document();
        doc.add(new Field("id", "" + news.getId(), Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        doc.add(new Field("title", "" + news.getTitle(), Field.Store.YES,
                Field.Index.TOKENIZED));
        doc.add(new Field("author", "" + news.getAuthor(), Field.Store.YES,
                Field.Index.TOKENIZED));
        // Term vectors with positions and offsets are stored for the body text
        doc.add(new Field("details", ""
                + Constants.splitAndFilterString(news.getDetails()),
                Field.Store.YES, Field.Index.TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("addtime", "" + news.getAddtime(),
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("keywords", "" + news.getKeywords(),
                Field.Store.YES, Field.Index.TOKENIZED));
        System.out.println("Indexing news " + news.getTitle() + "...");
        // Remember the id of the item just indexed (assumes the list is ordered by ascending id)
        newsId = String.valueOf(news.getId());
        try {
            writer.addDocument(doc);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    // Optimize and close
    writer.optimize();
    writer.close();

    // Write the id of the last indexed item back to the file
    WriteText(TaskAction.class, textName, newsId);
}

// Returns true when no id has been recorded yet, i.e. the index has to be created from scratch.
public boolean isEmpty(Class clazz, String textName) throws Exception {
    String articleId = readText(clazz, textName);
    if (null == articleId || "".equals(articleId.trim()))
        articleId = "0";
    boolean isEmpty = articleId.trim().equals("0");
    System.out.println(clazz.getName() + " " + isEmpty);
    return isEmpty;
}

// Adapted from a method in the Paoding examples.
public String readText(Class clazz, String textName) throws IOException {
    InputStream in = clazz.getResourceAsStream(textName);
    Reader re = new InputStreamReader(in, "UTF-8");
    try {
        char[] chs = new char[1024];
        int count;
        StringBuilder content = new StringBuilder();
        while ((count = re.read(chs)) != -1) {
            content.append(chs, 0, count);
        }
        return content.toString();
    } finally {
        re.close();
    }
}

// Overwrites the resource file next to clazz with text and returns what was written.
public String WriteText(Class clazz, String textName, String text) throws IOException {
    String path = getPath(clazz, textName);
    System.out.println(path);
    BufferedWriter bw = new BufferedWriter(new FileWriter(new File(path)));
    try {
        bw.write(text);
    } finally {
        bw.close();
    }
    return text;
}

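The checkpoint idea behind createNewsIndex (read the last id, process only newer items, write the new last id back) can be tried in isolation. The sketch below is a Lucene-free stand-in under stated assumptions: the class IndexCheckpointSketch and its methods are hypothetical, a temporary file plays the role of newsid.txt, and plain integers stand in for news rows.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IndexCheckpointSketch {

    // Creates an empty checkpoint file, standing in for newsid.txt.
    static Path newCheckpointFile() {
        try {
            return Files.createTempFile("newsid", ".txt");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the last indexed id; a blank file means nothing has been indexed yet.
    static int readLastId(Path file) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? 0 : Integer.parseInt(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One incremental pass: collects the ids newer than the checkpoint, then advances it.
    static List<Integer> runOnce(Path file, List<Integer> allIds) {
        int lastId = readLastId(file);
        int maxId = lastId;
        List<Integer> pending = new ArrayList<>();
        for (int id : allIds) {
            if (id > lastId) {
                pending.add(id); // a real pass would add a Lucene document here
                maxId = Math.max(maxId, id);
            }
        }
        if (maxId > lastId) {
            try {
                Files.write(file, String.valueOf(maxId).getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        Path file = newCheckpointFile();
        System.out.println(runOnce(file, Arrays.asList(1, 2, 3)));       // prints [1, 2, 3]
        System.out.println(runOnce(file, Arrays.asList(1, 2, 3, 4, 5))); // prints [4, 5]
    }
}
```

Tracking the maximum id rather than the last element seen means the checkpoint stays correct even if the business layer does not return rows in id order.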

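2) Searching

Java code

// Imports for SearchAction; NewsDTO and searchResultItem come from the project itself.
import java.io.StringReader;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;

public void searchIndex(String path, String keywords) throws Exception {
    // Search the title and the body; a hit in either field is enough (SHOULD)
    String[] fields = { "title", "details" };
    BooleanClause.Occur[] flags = new BooleanClause.Occur[] {
            BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD };

    Analyzer analyzer = new PaodingAnalyzer();
    FSDirectory directory = FSDirectory.getDirectory(path);
    IndexReader reader = IndexReader.open(directory);
    Searcher searcher = new IndexSearcher(directory);
    try {
        Query query = MultiFieldQueryParser.parse(keywords, fields, flags, analyzer);
        // Rewrite into primitive queries so the highlighter sees the actual terms
        query = query.rewrite(reader);
        System.out.println("Searching for: " + query.toString());
        Hits hits = searcher.search(query);

        // The pre and post tags are empty strings here, so matches are not visibly marked;
        // pass the markup you want to appear around each hit
        SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("", "");
        Highlighter highlighter = new Highlighter(simpleHTMLFormatter,
                new QueryScorer(query));
        // Return fragments of roughly 200 characters
        highlighter.setTextFragmenter(new SimpleFragmenter(200));

        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String contents1 = doc.get("details");
            // Reset for every hit so text from the previous hit is never reused
            String highLightText = "";
            if (contents1 != null) {
                TokenStream tokenStream = analyzer.tokenStream("details",
                        new StringReader(contents1));
                // May be null when only the title matched
                highLightText = highlighter.getBestFragment(tokenStream, contents1);
            }
            NewsDTO news = new NewsDTO();
            news.setId(Integer.parseInt(doc.get("id")));
            news.setName(doc.get("title"));
            news.setDetails(highLightText);
            news.setAddtime(doc.get("addtime"));
            news.setAuthor(doc.get("author"));
            searchResultItem.add(news);
        }
    } finally {
        searcher.close();
        reader.close();
    }
}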

The core code is now essentially complete, including hit highlighting, which works nicely.
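To make the role of the two SimpleHTMLFormatter arguments concrete: in the listing above they are empty strings, so a matched term comes back without any visible markup. The sketch below is not Lucene's Highlighter. It is a plain string-substitution stand-in, with a hypothetical class name and `<b>` tags chosen only for illustration, that shows what a pre tag and a post tag do to a matched term.

```java
final class HighlightSketch {

    // Wraps every occurrence of term in text with the given pre and post tags.
    static String highlight(String text, String term, String pre, String post) {
        if (term.isEmpty()) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = text.indexOf(term, from)) >= 0) {
            // Copy the text before the hit, then the hit wrapped in the tags
            out.append(text, from, hit).append(pre).append(term).append(post);
            from = hit + term.length();
        }
        return out.append(text.substring(from)).toString();
    }

    public static void main(String[] args) {
        String text = "Lucene with Paoding";
        System.out.println(highlight(text, "Paoding", "<b>", "</b>")); // prints Lucene with <b>Paoding</b>
        System.out.println(highlight(text, "Paoding", "", ""));        // prints Lucene with Paoding
    }
}
```

Unlike this sketch, the real Highlighter works on the analyzer's token stream, so it marks terms as Paoding segments them rather than by raw substring matching.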

3) Finally, build the index on a schedule:
Define the beans

Xml code

<bean id="myTask" class="edu.cumt.jnotnull.action.TaskAction">
    <property name="newsManageService">
        <ref bean="newsManageService" />
    </property>
</bean>

<bean id="entity"
    class="org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean">
    <property name="targetObject">
        <ref local="myTask" />
    </property>
    <property name="targetMethod">
        <value>createIndex</value>
    </property>
</bean>

<bean id="cron"
    class="org.springframework.scheduling.quartz.CronTriggerBean">
    <property name="jobDetail">
        <ref bean="entity" />
    </property>
    <property name="cronExpression">
        <value>0 0-5 2 * * ?</value>
    </property>
</bean>

<bean autowire="no"
    class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
    <property name="triggers">
        <list>
            <ref local="cron" />
        </list>
    </property>
</bean>
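Quartz cron expressions list seconds, minutes, hours, day of month, month and day of week, so 0 0-5 2 * * ? matches second 0 of minutes 0 through 5 of hour 2, every day. The sketch below is a hand-rolled check of just that time-of-day part, not Quartz's own parser, and the class name CronWindowSketch is made up for this example.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

final class CronWindowSketch {

    // True when t matches second 0, minutes 0-5, hour 2:
    // the time-of-day part of the expression "0 0-5 2 * * ?".
    static boolean fires(LocalTime t) {
        return t.getSecond() == 0
                && t.getMinute() >= 0 && t.getMinute() <= 5
                && t.getHour() == 2;
    }

    // Walks every second of a day and collects the matching times.
    static List<LocalTime> fireTimesPerDay() {
        List<LocalTime> times = new ArrayList<>();
        for (int s = 0; s < 24 * 60 * 60; s++) {
            LocalTime t = LocalTime.ofSecondOfDay(s);
            if (fires(t)) {
                times.add(t);
            }
        }
        return times;
    }

    public static void main(String[] args) {
        List<LocalTime> times = fireTimesPerDay();
        System.out.println(times.size() + " runs per day: " + times);
    }
}
```

The trigger therefore fires six times each night, once a minute from 02:00 to 02:05. Because indexing is incremental, the later runs will usually find nothing new. A single nightly run would be written as 0 0 2 * * ?.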

Original post: http://jnotnull.javaeye.com/blog/275327
