
Lucene: Building an Index

2012-10-22 23:53
The indexing process:

Extract the text and create a Document
Analyze the Document
Add the Document to the index (building the inverted index)

Basic index operations

Adding documents to the index
addDocument(Document) -- add a document using the default analyzer
addDocument(Document, Analyzer) -- add a document using the specified analyzer

Deleting documents from the index
---deleteDocuments(Term)
---deleteDocuments(Term[])
---deleteDocuments(Query)
---deleteDocuments(Query[])
The difference between maxDoc() and numDocs(): maxDoc() returns the total number of documents in the index, including deleted ones, while numDocs() returns only the number of documents that have not been deleted.

Updating documents in the index
Steps: the old document is deleted first, then the new one is inserted; Lucene cannot update individual fields in place. The new document must therefore contain all the fields of the old one.
---updateDocument(Term, Document)
---updateDocument(Term, Document, Analyzer)

Field options
Field indexing options (Field.Index.*)
Index.ANALYZED: use the analyzer to break the field value into a stream of tokens and make each token separately searchable. Suitable for ordinary text fields.
Index.NOT_ANALYZED: index the field, but do not analyze the String value. Suitable for field values that must not be split up, such as URLs, file paths, dates, and person names; used for "exact match" searches.
Index.ANALYZED_NO_NORMS: a variant of ANALYZED that does not store norms in the index. Norms record the index-time boost information.
Index.NOT_ANALYZED_NO_NORMS: like NOT_ANALYZED, but norms are not stored.
Index.NO: the field is not indexed.
Field storage options (Field.Store.*)
Store.YES: store the field value, so it can be displayed in search results.
Store.NO: do not store the field value.
CompressionTools: provides static methods for compressing and decompressing byte arrays.
Field term-vector options (Field.TermVector.*)

Other Field constructors
Field(String name, Reader value, TermVector termVector): supply the field value through a Reader
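The Reader-based constructor combines naturally with the term-vector option above. A minimal sketch (the field name and sample text are made up for illustration); note that a field built from a Reader is always tokenized and never stored:

```java
import java.io.StringReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ReaderFieldDemo {
    public static void main(String[] args) {
        // A Reader supplies the field value lazily; such fields are
        // analyzed, but their value cannot be stored in the index.
        Field contents = new Field("contents",
                new StringReader("Amsterdam has lots of canals"),
                Field.TermVector.WITH_POSITIONS_OFFSETS);

        Document doc = new Document();
        doc.add(contents);

        System.out.println(contents.isStored());    // false
        System.out.println(contents.isTokenized()); // true
    }
}
```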

Field sorting options

Boosting documents and fields
Document boost: Document.setBoost(float)
Field boost: Field.setBoost(float)
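A small sketch of both calls (the boost values are arbitrary). In Lucene 3.x the document-level boost is folded into every field's boost at index time:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostDemo {
    public static void main(String[] args) {
        Field title = new Field("title", "Lucene in Action",
                Field.Store.YES, Field.Index.ANALYZED);
        title.setBoost(1.5f);   // boost this one field

        Document doc = new Document();
        doc.add(title);
        doc.setBoost(2.0f);     // boost applies to every field of the document

        System.out.println(doc.getBoost());   // 2.0
        System.out.println(title.getBoost()); // 1.5
    }
}
```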

Indexing numbers: new NumericField(name).set<Type>Value(...), where <Type> is Int, Long, Float, or Double.
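For example (the field names are illustrative), each set<Type>Value call returns the NumericField itself, so it chains directly with Document.add():

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;

public class NumericFieldDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // set<Type>Value returns the field, so the calls chain with add()
        doc.add(new NumericField("pubYear").setIntValue(2012));

        NumericField price = new NumericField("price").setDoubleValue(19.99);
        doc.add(price);

        System.out.println(price.getNumericValue()); // 19.99
    }
}
```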

Near-real-time search
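In Lucene 3.x this is done by opening an IndexReader directly from the IndexWriter, which makes buffered changes searchable without a full commit. A minimal sketch using the same RAMDirectory setup as the test class below:

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36)));

        Document doc = new Document();
        doc.add(new Field("city", "Venice", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // The NRT reader sees the buffered document with no commit() yet.
        IndexReader reader = IndexReader.open(writer, true);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("city", "Venice")), 1);
        System.out.println(hits.totalHits); // 1

        searcher.close();
        reader.close();
        writer.close();
    }
}
```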

Index optimization
IndexWriter optimization methods:
optimize(): merge the index down to a single segment, returning only once the operation completes
optimize(int maxNumSegments): partial optimization; merge the index down to at most maxNumSegments segments
optimize(boolean doWait): with true the call runs and waits for completion; with false the merges run on background threads
optimize(int maxNumSegments, boolean doWait): a combination of the two above
Optimization consumes large amounts of CPU and I/O resources.

Directory subclasses
SimpleFSDirectory: file access via java.io.*
NIOFSDirectory: file access via java.nio.*
MMapDirectory: memory-mapped I/O for file access; a good choice on 64-bit JREs
RAMDirectory: keeps the index in memory
FileSwitchDirectory: uses two directories and switches between them based on file extension
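Rather than picking a subclass by hand, FSDirectory.open(File) returns whichever implementation Lucene considers best for the current platform (the index path below is just an example):

```java
import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DirectoryDemo {
    public static void main(String[] args) throws Exception {
        // open() chooses e.g. MMapDirectory on 64-bit JREs and
        // NIOFSDirectory on most other non-Windows platforms.
        Directory dir = FSDirectory.open(new File("lucene-index"));
        System.out.println(dir.getClass().getSimpleName());
        dir.close();
    }
}
```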

package com.bit.section2;

import static org.junit.Assert.*;

import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;

import com.bit.util.TestUtil;

public class IndexingTest {

	protected String[] ids = {"1", "2"};
	protected String[] unindexed = {"Netherlands", "Italy"};
	protected String[] unsorted = {"Amsterdam has lots of canals", "Venice has lots of canals"};
	protected String[] text = {"Amsterdam", "Venice"};

	private Directory directory;

	@Before
	public void setUp() throws Exception {
		directory = new RAMDirectory();

		IndexWriter writer = getWriter();

		for (int i = 0; i < ids.length; i++) {
			Document doc = new Document();
			doc.add(new Field("id", ids[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
			doc.add(new Field("country", unindexed[i], Field.Store.YES, Field.Index.NO));
			doc.add(new Field("contents", unsorted[i], Field.Store.NO, Field.Index.ANALYZED));
			doc.add(new Field("city", text[i], Field.Store.YES, Field.Index.ANALYZED));
			writer.addDocument(doc);
		}
		writer.close();
	}

	/**
	 * Creates an IndexWriter over the test directory.
	 */
	private IndexWriter getWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
		return new IndexWriter(directory,
				new IndexWriterConfig(Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36)));
	}

	protected int getHitCount(String fieldName, String searchString) throws Exception {
		IndexSearcher searcher = new IndexSearcher(directory);
		Term term = new Term(fieldName, searchString);
		Query query = new TermQuery(term);
		int hitCount = TestUtil.hitCount(searcher, query);
		searcher.close();
		return hitCount;
	}

	@Test
	public void testIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
		IndexWriter writer = getWriter();
		assertEquals(ids.length, writer.numDocs());
		writer.close();
	}

	@Test
	public void testIndexReader() throws CorruptIndexException, IOException {
		IndexReader reader = IndexReader.open(directory);
		assertEquals(ids.length, reader.maxDoc());
		assertEquals(ids.length, reader.numDocs());
		reader.close();
	}

	/**
	 * Deleting a document before optimizing.
	 */
	@Test
	public void testDeleteBeforeOptimize() throws IOException {
		IndexWriter writer = getWriter();
		assertEquals(2, writer.numDocs());

		// mark one document as deleted
		writer.deleteDocuments(new Term("id", "1"));
		writer.commit();
		assertTrue(writer.hasDeletions());

		// the deleted document still counts toward maxDoc until the segments are merged
		assertEquals(2, writer.maxDoc());
		assertEquals(1, writer.numDocs());
		writer.close();
	}

	/**
	 * Deleting a document and then optimizing.
	 */
	@Test
	public void testDeleteAfterOptimize() throws IOException {
		IndexWriter writer = getWriter();
		assertEquals(2, writer.numDocs());

		writer.deleteDocuments(new Term("id", "1"));
		writer.optimize();
		writer.commit();

		assertFalse(writer.hasDeletions());
		assertEquals(1, writer.maxDoc());
		assertEquals(1, writer.numDocs());
		writer.close();
	}

	/**
	 * Updating a document.
	 */
	@Test
	public void testUpdate() throws Exception {
		assertEquals(1, getHitCount("city", "Amsterdam"));

		IndexWriter writer = getWriter();

		Document doc = new Document();
		doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
		doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
		doc.add(new Field("contents", "Den Haag has lots of museums", Field.Store.NO, Field.Index.ANALYZED));
		doc.add(new Field("city", "DenHaag", Field.Store.YES, Field.Index.ANALYZED));

		writer.updateDocument(new Term("id", "1"), doc);
		writer.close();

		assertEquals(0, getHitCount("city", "Amsterdam"));
		assertEquals(1, getHitCount("city", "DenHaag"));
	}

}




                                            