
Reading notes on 《开发自己的搜索引擎》 (Developing Your Own Search Engine): a simple filtering example

2017-04-15 11:03
All filters in Lucene derive from one abstract base class, org.apache.lucene.search.Filter, which defines the basic behavior of a filter.

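A Filter expresses a filtering behavior. During a Lucene search this behavior amounts to "turning a blind eye": when the searcher reaches a document and finds that it has been filtered out, it simply skips it. In the Lucene 2.x API used in this post, a filter reports its decision as a java.util.BitSet, a set of bits in which each bit is either true or false, and Lucene uses these two values to indicate whether a document has been filtered out. When Lucene returns results, it first checks the BitSet and returns only the documents whose bit is true. The index of each bit in the BitSet is treated as the internal ID of the corresponding document.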
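To make the "one bit per document ID" idea concrete, here is a minimal sketch that uses only java.util.BitSet and no Lucene classes. The class name BitSetMaskSketch, the helper visibleDocs, and the seven-document index are assumptions made up for this illustration; they are not part of Lucene or of the book's code.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustration only: plain java.util.BitSet, no Lucene classes involved.
final class BitSetMaskSketch {

    // Returns the IDs in [0, maxDoc) whose bit is set, i.e. the documents that pass the mask.
    static List<Integer> visibleDocs(BitSet mask, int maxDoc) {
        List<Integer> visible = new ArrayList<>();
        for (int id = mask.nextSetBit(0); id >= 0 && id < maxDoc; id = mask.nextSetBit(id + 1)) {
            visible.add(id);
        }
        return visible;
    }

    public static void main(String[] args) {
        int maxDoc = 7;                  // hypothetical index holding seven documents
        BitSet mask = new BitSet(maxDoc);
        mask.set(0, maxDoc);             // start with every document visible
        mask.clear(0);                   // hide documents 0, 2 and 4
        mask.clear(2);
        mask.clear(4);
        System.out.println(visibleDocs(mask, maxDoc)); // prints [1, 3, 5, 6]
    }
}
```

The bit position plays the role of the internal document ID, so hiding a document is nothing more than clearing one bit.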

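For example, in a book retrieval system, some confidential books or papers may be accessible only to users with high-level permissions, so a Filter has to be designed for them. When a user with lower permissions issues a search request, this Filter is applied to remove the confidential books and papers from the results.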
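As a rough sketch of how a mask could be derived from a user's permission level, the following plain-Java example assumes three ordered levels (normal < middle < advanced), matching the three levels used later in this post. The class ClearanceMaskSketch, the Level enum, and maskFor are hypothetical names invented here; the Lucene filter in this post takes a simpler approach and hides only the advanced level.

```java
import java.util.BitSet;

// Illustration only: derives a visibility mask from a user's clearance, without Lucene.
final class ClearanceMaskSketch {

    // Assumed ordering: a user may see documents at or below their own level.
    enum Level { NORMAL, MIDDLE, ADVANCED }

    // Security levels of the seven books, in the order they are added to the index in this post.
    static final Level[] BOOK_LEVELS = {
        Level.ADVANCED, Level.MIDDLE, Level.ADVANCED, Level.NORMAL,
        Level.ADVANCED, Level.MIDDLE, Level.NORMAL
    };

    // Sets bit i when document i is visible to a user with the given clearance.
    static BitSet maskFor(Level userClearance, Level[] docLevels) {
        BitSet mask = new BitSet(docLevels.length);
        for (int id = 0; id < docLevels.length; id++) {
            if (docLevels[id].compareTo(userClearance) <= 0) {
                mask.set(id);
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        System.out.println(maskFor(Level.MIDDLE, BOOK_LEVELS)); // prints {1, 3, 5, 6}
        System.out.println(maskFor(Level.NORMAL, BOOK_LEVELS)); // prints {3, 6}
    }
}
```

For a middle-level user the mask is the same one the filter in this post produces, because only the advanced books are hidden.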

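Below is an implementation of such an example, in which every book is assigned one of three security levels.

The code that builds the index is as follows: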

package Filter;

import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;

/**
 * Filter test: builds a small index of books with three security levels.
 * @author sdu20
 */
public class FilterTest {

    public static final String INDEX_STORE_PATH = "E:\\编程局\\Java编程处\\Index\\ch5001\\";
    public static final String SECURITY_ADVANCED = "advanced";
    public static final String SECURITY_MIDDLE = "middle";
    public static final String SECURITY_NORMAL = "normal";

    private IndexWriter writer;

    public static void main(String[] args) {
        // Intentionally empty: ShowInfo and UseMyFilter build the index themselves.
    }

    public FilterTest() {
        try {
            // true: create a new index, overwriting any index already at this path
            writer = new IndexWriter(INDEX_STORE_PATH, new StandardAnalyzer(), true);
            writer.setUseCompoundFile(false);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Builds one book document; every field is stored and indexed as a single untokenized term.
    private Document aDocument(String number, String name, String date, String security) {
        Document doc1 = new Document();
        Field f1 = new Field("bookNumber", number, Field.Store.YES, Field.Index.UN_TOKENIZED);
        Field f2 = new Field("bookname", name, Field.Store.YES, Field.Index.UN_TOKENIZED);
        Field f3 = new Field("publishdate", date, Field.Store.YES, Field.Index.UN_TOKENIZED);
        Field f4 = new Field("securitylevel", security, Field.Store.YES, Field.Index.UN_TOKENIZED);
        doc1.add(f1);
        doc1.add(f2);
        doc1.add(f3);
        doc1.add(f4);
        return doc1;
    }

    /**
     * Builds the index.
     */
    public void createIndex() {
        try {
            Document doc1 = aDocument("0000003", "自然哲学的数学原理", "1999-01-01", SECURITY_ADVANCED);
            Document doc2 = aDocument("0000005", "微积分", "1995-07-01", SECURITY_MIDDLE);
            Document doc3 = aDocument("0000001", "氢弹研究", "1963-02-11", SECURITY_ADVANCED);
            Document doc4 = aDocument("0000006", "太平广记", "1988-05-11", SECURITY_NORMAL);
            Document doc5 = aDocument("0000004", "弹道导弹轨迹研究", "1959-10-22", SECURITY_ADVANCED);
            Document doc6 = aDocument("0000007", "乡土中国", "1970-01-11", SECURITY_MIDDLE);
            Document doc7 = aDocument("0000002", "三国演义", "1977-09-07", SECURITY_NORMAL);

            writer.addDocument(doc1);
            writer.addDocument(doc2);
            writer.addDocument(doc3);
            writer.addDocument(doc4);
            writer.addDocument(doc5);
            writer.addDocument(doc6);
            writer.addDocument(doc7);
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

We can inspect everything stored in the index with the following code:
package Filter;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.*;

/**
 * Prints the details of every document in the index.
 * @author sdu20
 */
public class ShowInfo {

    public static void main(String[] args) {
        FilterTest test = new FilterTest();
        test.createIndex();
        try {
            IndexReader reader = IndexReader.open(FilterTest.INDEX_STORE_PATH);
            // Document IDs run from 0 to maxDoc() - 1; skip any slot marked as deleted.
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) {
                    continue;
                }
                Document doc = reader.document(i);
                System.out.println("Book number: " + doc.get("bookNumber"));
                System.out.println("Title: " + doc.get("bookname"));
                System.out.println("Publication date: " + doc.get("publishdate"));
                System.out.print("Security level: ");
                String level = doc.get("securitylevel");
                switch (level) {
                    case FilterTest.SECURITY_ADVANCED:
                        System.out.println("Advanced");
                        break;
                    case FilterTest.SECURITY_MIDDLE:
                        System.out.println("Intermediate");
                        break;
                    case FilterTest.SECURITY_NORMAL:
                        System.out.println("Normal");
                        break;
                }
                System.out.println("========================");
            }
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

Running this prints the details of all seven documents in the index.

We can define a custom filter that removes the advanced-level results:
package Filter;

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.*;
import org.apache.lucene.search.*;

/**
 * A simple Filter that hides every document whose security level is "advanced".
 * @author sdu20
 */
public class AdvancedFilter extends Filter {

    @Override
    public BitSet bits(IndexReader reader) throws IOException {
        // One bit per document ID; start with every document visible.
        // The end index of set(from, to) is exclusive, so pass maxDoc() directly.
        final BitSet bits = new BitSet(reader.maxDoc());
        bits.set(0, reader.maxDoc());

        // Clear the bit of each document whose securitylevel term is "advanced".
        Term term = new Term("securitylevel", FilterTest.SECURITY_ADVANCED);
        TermDocs termDocs = reader.termDocs(term);
        try {
            while (termDocs.next()) {
                bits.set(termDocs.doc(), false);
            }
        } finally {
            termDocs.close();
        }

        return bits;
    }

}
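One detail worth checking when initialising such a mask: BitSet.set(fromIndex, toIndex) treats toIndex as exclusive, and BitSet.size() reports the allocated capacity rather than the number of documents. The sketch below uses plain Java only; the class BitSetRangeSketch, the helper allVisible, and the value maxDoc = 64 are assumptions chosen for illustration.

```java
import java.util.BitSet;

// Illustration only: how the end index and size() behave when marking all documents visible.
final class BitSetRangeSketch {

    // Marks IDs 0 .. maxDoc-1 as visible; the end index of set(from, to) is exclusive.
    static BitSet allVisible(int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        bits.set(0, maxDoc);
        return bits;
    }

    public static void main(String[] args) {
        int maxDoc = 64; // hypothetical document count

        // Using size() - 1 as the end index relies on the allocated capacity.
        // On OpenJDK size() is 64 here, so bit 63 is left unset and this prints false.
        BitSet viaSize = new BitSet(maxDoc);
        viaSize.set(0, viaSize.size() - 1);
        System.out.println(viaSize.get(63));

        // Using the document count as the exclusive end index covers every ID.
        BitSet viaMaxDoc = allVisible(maxDoc);
        System.out.println(viaMaxDoc.get(63));        // prints true
        System.out.println(viaMaxDoc.cardinality());  // prints 64
    }
}
```

With only seven documents either form happens to cover IDs 0 to 6, which is why the difference does not show up in this post's example.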

Now we apply this filter to a query for books published between January 1, 1900 and January 1, 2000:
package Filter;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;

/**
 * Applies our simple filter at search time.
 * @author sdu20
 */
public class UseMyFilter {

    public static void main(String[] args) {
        // Build the index
        FilterTest test = new FilterTest();
        test.createIndex();
        System.out.println("Index created");

        try {
            // Inclusive range over the publishdate terms
            Term begin = new Term("publishdate", "1900-01-01");
            Term end = new Term("publishdate", "2000-01-01");
            RangeQuery q = new RangeQuery(begin, end, true);

            IndexSearcher searcher = new IndexSearcher(FilterTest.INDEX_STORE_PATH);
            Hits hits = searcher.search(q, new AdvancedFilter());
            System.out.println(hits.length());
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                System.out.println("Book number: " + doc.get("bookNumber"));
                System.out.println("Title: " + doc.get("bookname"));
                System.out.println("Publication date: " + doc.get("publishdate"));
                System.out.print("Security level: ");
                String level = doc.get("securitylevel");
                switch (level) {
                    case FilterTest.SECURITY_ADVANCED:
                        System.out.println("Advanced");
                        break;
                    case FilterTest.SECURITY_MIDDLE:
                        System.out.println("Intermediate");
                        break;
                    case FilterTest.SECURITY_NORMAL:
                        System.out.println("Normal");
                        break;
                }
                System.out.println("========================");
            }
            searcher.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

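The query returns only four results: all seven books fall inside the date range, but the three with the advanced security level have been filtered out.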
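The count of four can be checked without Lucene. The sketch below replays the same data in plain Java, assuming that the range check is a string comparison on the publishdate term (which orders yyyy-MM-dd dates chronologically) and that the filter hides one security level. The class RangeAndFilterSketch and the search helper are made up for this illustration and are not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: replays the range query plus filter on the post's seven books.
final class RangeAndFilterSketch {

    // {bookNumber, publishdate, securitylevel}, copied from the indexing code in this post.
    static final String[][] BOOKS = {
        {"0000003", "1999-01-01", "advanced"},
        {"0000005", "1995-07-01", "middle"},
        {"0000001", "1963-02-11", "advanced"},
        {"0000006", "1988-05-11", "normal"},
        {"0000004", "1959-10-22", "advanced"},
        {"0000007", "1970-01-11", "middle"},
        {"0000002", "1977-09-07", "normal"},
    };

    // Returns the book numbers whose date lies in [begin, end] and whose level is not hidden.
    static List<String> search(String begin, String end, String hiddenLevel) {
        List<String> hits = new ArrayList<>();
        for (String[] book : BOOKS) {
            boolean inRange = book[1].compareTo(begin) >= 0 && book[1].compareTo(end) <= 0;
            boolean visible = !book[2].equals(hiddenLevel);
            if (inRange && visible) {
                hits.add(book[0]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> hits = search("1900-01-01", "2000-01-01", "advanced");
        System.out.println(hits.size()); // prints 4
        System.out.println(hits);        // prints [0000005, 0000006, 0000007, 0000002]
    }
}
```

The sketch lists hits in insertion order; the order in which Lucene returns them may differ, so only the set of book numbers is meant to match.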