Building Site Search for an SSH System with Lucene + Paoding
2010-08-22 18:08
Keywords: lucene, paoding, search. Goal: build a highly portable site search that rebuilds its index on a schedule.
Approach: ship both the dic (dictionary) folder and the index inside the application itself.
Prerequisites:
1 Lucene
Lucene Java (hereafter just Lucene) is currently at version 2.4.0. For details see http://lucene.apache.org/java/docs/index.html.
2 Paoding
Qieqie's excellent work: a first-rate Chinese word-segmentation component for Lucene. The current release is paoding-analysis-2.0.4-beta, built against Lucene 2.2. For details see http://code.google.com/p/paoding/.
3 Download the latest paoding-analysis-2.0.4-beta release (it bundles lucene-core-2.2.0.jar, lucene-analyzers-2.2.0.jar, lucene-highlighter-2.2.0.jar, junit.jar and commons-logging.jar).
Getting started:
1 Trial run
Open the examples folder in the download and run it (pay attention to the character encoding).
2 Integrating into an SSH2 system (architecture: Action -> Service -> DAO)
1) Since an SSH2 system is a web application, configuring Paoding may differ somewhat from the trial run above.
Copy everything under paoding's src folder, together with the dic folder, into your project. Open paoding-dic-home.properties and set paoding.dic.home.config-fisrt=this so the program picks up this config file, then set paoding.dic.home=classpath:dic so the dictionaries live inside the project, and save. I use classpath:dic here for portability. With an absolute path there is nothing more to do, but if you specify classpath:dic you need a small change to the Paoding source: in the setDicHomeProperties method of PaodingMaker.java, replace File dicHomeFile = getFile(dicHome); with
Java code
// URL-decode the dictionary-home path so spaces and Chinese characters survive
File dicHomeFile2 = getFile(dicHome);
String path = "";
try {
    path = URLDecoder.decode(dicHomeFile2.getPath(), "UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
File dicHomeFile = new File(path);
The point is to URL-decode the path; otherwise, if your dictionary path contains spaces or Chinese characters, Paoding throws a dictionary-not-found exception.
2) Table structure
SQL code
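The decoding issue can be seen in isolation. Below is a hypothetical demo class (not part of Paoding) showing why the fix is needed: a classpath resource URL percent-escapes spaces, so the raw path contains "%20" and the file cannot be found until it is decoded, exactly as the patched line in PaodingMaker does.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DicPathDemo {
    // Mirrors the decode step added to PaodingMaker.setDicHomeProperties
    public static String decode(String rawPath) {
        try {
            return URLDecoder.decode(rawPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available; fall back to the raw path just in case
            return rawPath;
        }
    }

    public static void main(String[] args) {
        // a path as it comes back from getResource(...): space escaped as %20
        String raw = "/opt/my%20app/dic";
        System.out.println(decode(raw)); // -> /opt/my app/dic
    }
}
```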
CREATE TABLE `news` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(255) default NULL,
`details` mediumtext,
`author` varchar(255) default NULL,
`publisher` varchar(100) default NULL,
`clicks` int(11) default NULL,
`source` varchar(255) default NULL,
`addtime` datetime default NULL,
`category` varchar(100) default NULL,
`keywords` varchar(255) default NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=gbk;
3 Writing the code
Building the site search takes two steps, creating the index and searching it, implemented in two classes in the same package: SearchAction.java and TaskAction.java.
1) Creating the index
The main task: read the id of the last news item indexed in the previous run from a txt file, ask the service layer for every news item with a larger id, index those, and finally write the id of the last item from this run back into the txt file. Be careful with paths here; all the id-tracking txt files live under the action package.
Create a TaskAction class and add the following methods:
Java code
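The id bookkeeping described above can be sketched on its own (hypothetical class and file names, using java.nio for brevity where the article's code uses streams resolved relative to the action class):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LastIdStore {
    // Read the high-water mark: the id of the last news item already indexed.
    public static int readLastId(Path file) throws IOException {
        if (!Files.exists(file)) return 0;            // no file yet: start from 0
        String s = new String(Files.readAllBytes(file), "UTF-8").trim();
        return s.isEmpty() ? 0 : Integer.parseInt(s); // empty file also means 0
    }

    // Persist the new high-water mark after an indexing run.
    public static void writeLastId(Path file, int id) throws IOException {
        Files.write(file, String.valueOf(id).getBytes("UTF-8"));
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("newsid", ".txt");
        writeLastId(f, 42);
        System.out.println(readLastId(f)); // -> 42
        Files.delete(f);
    }
}
```

Only items with id greater than this value need to be fetched and indexed, which keeps each scheduled run incremental.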
public void createIndex() {
    try {
        // two arguments: where to build the index, and the file holding
        // the id of the last news item indexed in the previous run
        createNewsIndex(getPath(TaskAction.class, "date/index/news"), "newsid.txt");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
public String getPath(Class clazz, String textName)
        throws IOException {
    // substring(6) strips the leading "file:/" from the resource URL
    String path = (URLDecoder.decode(
            clazz.getResource(textName).toString(), "UTF-8")).substring(6);
    return path;
}
public void createNewsIndex(String path, String textName) throws Exception {
    String newsId = readText(TaskAction.class, textName);
    if (null == newsId || "".equals(newsId))
        newsId = "0";
    // use the Paoding Chinese analyzer
    Analyzer analyzer = new PaodingAnalyzer();
    FSDirectory directory = FSDirectory.getDirectory(path);
    System.out.println(directory.toString());
    // third argument: true creates a fresh index, false appends to an existing one
    IndexWriter writer = new IndexWriter(directory, analyzer, isEmpty(TaskAction.class, textName));
Document doc = new Document();
// fetch every item whose id is greater than the last indexed id from the service layer
List list = newsManageService.getNewsBigId(Integer.parseInt(newsId));
Iterator iterator = list.iterator();
News news = new News();
while (iterator.hasNext()) {
doc = new Document();
news = (News) iterator.next();
doc.add(new Field("id", "" + news.getId(), Field.Store.YES,
Field.Index.UN_TOKENIZED));
doc.add(new Field("title", "" + news.getTitle(), Field.Store.YES,
Field.Index.TOKENIZED));
doc.add(new Field("author", "" + news.getAuthor(), Field.Store.YES,
Field.Index.TOKENIZED));
doc.add(new Field("details", ""
+ Constants.splitAndFilterString(news.getDetails()),
Field.Store.YES, Field.Index.TOKENIZED,
Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("addtime", "" + news.getAddtime(),
Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("keywords", "" + news.getKeywords(),
Field.Store.YES, Field.Index.TOKENIZED));
System.out.println("Indexing news " + news.getTitle() + "...");
// remember the id of the most recently indexed item
newsId = String.valueOf(news.getId());
try {
writer.addDocument(doc);
} catch (IOException e) {
e.printStackTrace();
}
}
// optimize and close the writer
writer.optimize();
writer.close();
// persist the id of the last article indexed in this run
WriteText(TaskAction.class, textName, newsId);
}
public boolean isEmpty(Class clazz, String textName) throws Exception {
String articleId = "0";
boolean isEmpty = true;
articleId = readText(clazz, textName);
if (null == articleId || "".equals(articleId))
articleId = "0";
if (!articleId.equals("0"))
isEmpty = false;
System.out.println(clazz.getName()+" "+isEmpty);
return isEmpty;
}
// adapted from one of the methods in the paoding examples.
public String readText(Class clazz, String textName)
        throws IOException {
    InputStream in = clazz.getResourceAsStream(textName);
    Reader re = new InputStreamReader(in, "UTF-8");
    char[] chs = new char[1024];
    int count;
    StringBuilder content = new StringBuilder();
    while ((count = re.read(chs)) != -1) {
        content.append(chs, 0, count);
    }
    re.close();
    return content.toString();
}
public String WriteText(Class clazz, String textName, String text)
        throws IOException {
    // resolve the txt file on the classpath, decoding the URL as above
    String path = (URLDecoder.decode(
            clazz.getResource(textName).toString(), "UTF-8")).substring(6);
    File file = new File(path);
    BufferedWriter bw = new BufferedWriter(new FileWriter(file));
    bw.write(text);
    bw.close();
    return text;
}
2) Searching
Java code
public void searchIndex(String path, String keywords) throws Exception {
String[] FIELD = { "title", "details" };
String QUERY = keywords;
Analyzer analyzer = new PaodingAnalyzer();
FSDirectory directory = FSDirectory.getDirectory(path);
IndexReader reader = IndexReader.open(directory);
String queryString = QUERY;
BooleanClause.Occur[] flags = new BooleanClause.Occur[] {
BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD };
Query query = MultiFieldQueryParser.parse(queryString, FIELD, flags,
analyzer);
Searcher searcher = new IndexSearcher(directory);
query = query.rewrite(reader);
System.out.println("Searching for: " + query.toString());
Hits hits = searcher.search(query);
NewsDTO news = new NewsDTO();
String highLightText = "";
for (int i = 0; i < hits.length(); i++) {
Document doc = hits.doc(i);
String contents1 = doc.get("details");
// the two tag arguments appear empty here, most likely because the blog
// engine stripped the HTML; in practice pass markup such as "<b>" and "</b>"
SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter(
        "", "");
Highlighter highlighter = new Highlighter(simpleHTMLFormatter,
new QueryScorer(query));
highlighter.setTextFragmenter(new SimpleFragmenter(200));
if (contents1 != null) {
TokenStream tokenStream = analyzer.tokenStream("details",
new StringReader(contents1));
highLightText = highlighter.getBestFragment(tokenStream,
contents1);
}
news = new NewsDTO();
news.setId(Integer.parseInt(doc.get("id")));
news.setName(doc.get("title"));
news.setDetails(highLightText);
news.setAddtime(doc.get("addtime"));
news.setAuthor(doc.get("author"));
searchResultItem.add(news);
}
searcher.close();
reader.close();
}
That covers the core code, including hit highlighting, which works quite nicely.
3) Finally, schedule index creation:
Define the beans:
Spring configuration (XML)
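The highlighting idea can be illustrated with plain java.lang (this is not the Lucene Highlighter API, just the concept it applies: wrap each matched term in the formatter's prefix/suffix tags, which is why empty tags above produce no visible highlighting):

```java
public class HighlightSketch {
    // Wrap every occurrence of term in text with the given pre/post tags.
    public static String highlight(String text, String term, String pre, String post) {
        return text.replace(term, pre + term + post);
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene site search", "Lucene", "<b>", "</b>"));
        // -> <b>Lucene</b> site search
    }
}
```

Lucene's real Highlighter additionally scores fragments and truncates to the fragmenter size (200 characters in the listing above), but the tag-wrapping behavior is the same.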
<bean id="myTask" class="edu.cumt.jnotnull.action.TaskAction">
<property name="newsManageService">
<ref bean="newsManageService" />
</property>
</bean>
<bean id="entity"
class="org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean">
<property name="targetObject">
<ref local="myTask" />
</property>
<property name="targetMethod">
<value>createIndex</value>
</property>
</bean>
<bean id="cron"
class="org.springframework.scheduling.quartz.CronTriggerBean">
<property name="jobDetail">
<ref bean="entity" />
</property>
<property name="cronExpression">
<!-- fires once a minute from 02:00 through 02:05 every day -->
<value>0 0-5 2 * * ?</value>
</property>
</bean>
<bean autowire="no"
class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
<property name="triggers">
<list>
<ref local="cron" />
</list>
</property>
</bean>
Original post: http://jnotnull.javaeye.com/blog/275327