
Lucene Study Notes: Building an Index

2012-09-11 14:32

Building an Index
2.2 Understanding the Indexing Process

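Text is first extracted from the raw data and used to create a Document instance, which contains several Field instances that hold the raw data. The analysis step that follows breaks the field text into a stream of tokens, and finally the tokens are added to the segment structure.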

2.2.1 Extracting Text and Creating Documents

The details of text extraction are covered in Chapter 7, together with the Tika framework.

2.2.2 Analyzing Documents

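During indexing, Lucene first analyzes the text, splitting it into a stream of tokens. For Chinese this mainly means word segmentation and stop-word removal. The resulting tokens are then written to the index files.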
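
As a rough illustration of the analysis step, the sketch below splits text on whitespace, lowercases each token and drops stop words. It uses only the JDK. The class name ToyAnalyzer and the stop-word list are made up for this example, and real Lucene analyzers (and real Chinese word segmentation) are considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer; NOT Lucene's implementation.
class ToyAnalyzer {
    // Hypothetical stop-word list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "of", "is"));

    // Splits on whitespace, lowercases each token, and drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The structure of a Lucene index"));
    }
}
```

Only the tokens that survive this step would go on to be written into the index.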

2.2.3 Adding Documents to the Index

A Lucene index directory is built from a single kind of structure: the index segment.

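Index segments: every Lucene index consists of one or more segments. Each segment is a self-contained index holding a subset of all the indexed documents. A new segment is created whenever the writer flushes its buffered added documents and pending deletions to the directory. At search time each segment is visited separately, but the results are combined before being returned.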
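
To make the segment idea concrete, here is a minimal JDK-only sketch in which each flush produces a new, independent inverted index and a search visits every segment before combining the hits. ToySegmentedIndex and its methods are hypothetical names for this illustration; this is not how Lucene stores segments on disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a multi-segment index; NOT Lucene's data structures.
class ToySegmentedIndex {
    // Each flushed segment is a small inverted index: term -> document names.
    private final List<Map<String, List<String>>> segments =
            new ArrayList<Map<String, List<String>>>();
    // Documents added since the last flush live only in this buffer.
    private Map<String, List<String>> buffer = new HashMap<String, List<String>>();

    // Buffers one document; it is not searchable until flush() is called.
    void addDocument(String name, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<String> docs = buffer.get(term);
            if (docs == null) {
                docs = new ArrayList<String>();
                buffer.put(term, docs);
            }
            if (!docs.contains(name)) {
                docs.add(name);
            }
        }
    }

    // Flushing turns the buffered documents into one new segment.
    void flush() {
        if (!buffer.isEmpty()) {
            segments.add(buffer);
            buffer = new HashMap<String, List<String>>();
        }
    }

    int segmentCount() {
        return segments.size();
    }

    // Queries every segment on its own, then combines the hits.
    List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        for (Map<String, List<String>> segment : segments) {
            List<String> docs = segment.get(term.toLowerCase());
            if (docs != null) {
                hits.addAll(docs);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToySegmentedIndex index = new ToySegmentedIndex();
        index.addDocument("a.txt", "lucene index");
        index.flush(); // first segment
        index.addDocument("b.txt", "lucene segment");
        index.flush(); // second segment
        System.out.println(index.segmentCount() + " segments, hits: " + index.search("lucene"));
    }
}
```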

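Each segment consists of several files named _X.<ext>, where X is the segment name and <ext> is an extension identifying which part of the index the file holds (term vectors, stored fields, the inverted index, and so on). With the compound file format (Lucene's default, which can be changed through IndexWriter.setUseCompoundFile), these files are packed into a single file, _X.cfs. This reduces the number of files that have to be open during searching.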

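There is also one special file, the segments file, named segments_<N>, which references all live segments. Lucene opens this file first and then opens the files it points to. N is incremented by one every time a change is committed to the index.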
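
The naming scheme can be sketched with a few JDK-only helpers that find the most recent segments_<N> file in a directory listing. The class and method names are hypothetical. The sketch also assumes that N is written in base 36 (so segments_a follows segments_9), which is my understanding of Lucene's convention rather than something stated in these notes.

```java
import java.util.Arrays;
import java.util.List;

// Helper names are hypothetical; only the "segments_<N>" pattern comes from Lucene.
class SegmentsFileNames {
    private static final String PREFIX = "segments_";

    // Builds the file name for a commit generation (assumes N is written in base 36).
    static String fileNameForGeneration(long generation) {
        return PREFIX + Long.toString(generation, Character.MAX_RADIX);
    }

    // Parses N back out of a segments_<N> file name.
    static long generationOf(String fileName) {
        return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
    }

    // Returns the segments_<N> name with the highest N, or null if there is none.
    static String latest(List<String> fileNames) {
        String best = null;
        for (String name : fileNames) {
            if (name.startsWith(PREFIX)
                    && (best == null || generationOf(name) > generationOf(best))) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("_0.cfs", "segments_9", "segments_a", "segments.gen");
        System.out.println(latest(files));
    }
}
```

Picking the highest N mirrors the idea that the newest commit is the one a reader should open first.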

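Over time an index accumulates many segments, especially when the application opens and closes the writer frequently. IndexWriter periodically selects some of these segments and merges them into a new one.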

2.3 Basic Index Operations

2.3.1 Adding Documents to an Index

There are two methods for adding a document:

addDocument(Document): adds the document using the default analyzer, the one specified when the IndexWriter was created, to tokenize it.

addDocument(Document, Analyzer): adds the document using the specified analyzer to tokenize it.

The complete indexing code is as follows:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneIndex {

    public static void main(String[] args) throws Exception {
        // Directory where the Lucene index is stored
        File indexDir = new File("F:\\ntcr_index");
        // Directory containing the files to index
        File dataDir = new File("F:\\NTCR_ChangeCodeToUTF");
        long start = new Date().getTime();
        int numIndexed = index(indexDir, dataDir);
        long end = new Date().getTime();
        System.out.println("Indexed " + numIndexed + " files in " + (end - start) + " ms.");
    }

    // Opens the index and starts traversing the data directory
    public static int index(File indexDir, File dataDir) throws IOException {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir + " does not exist or is not a directory.");
        }
        // IndexWriter usage for Lucene 3.2.0 and later.
        // The analyzer and the writer configuration use the same Version.
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_34);
        Directory directory = FSDirectory.open(indexDir);
        IndexWriterConfig indexConfig = new IndexWriterConfig(Version.LUCENE_34, analyzer);
        indexConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(directory, indexConfig);
        try {
            indexDirectory(writer, dataDir);
            // Total documents in the index; with CREATE_OR_APPEND this
            // includes documents left over from earlier runs.
            int numIndexed = writer.numDocs();
            System.out.println("Optimizing, please wait...");
            writer.optimize();
            return numIndexed;
        } finally {
            // Close the writer even if indexing fails, so the write lock is released
            writer.close();
        }
    }

    // Recursive: calls itself whenever it finds a subdirectory
    private static void indexDirectory(IndexWriter writer, File dir) throws IOException {
        File[] files = dir.listFiles();
        if (files == null) {
            return; // not a directory, or an I/O error occurred
        }
        for (File f : files) {
            if (f.isDirectory()) {
                indexDirectory(writer, f);
            } else if (f.getName().endsWith(".txt")) {
                indexFile(writer, f);
            }
        }
    }

    // Indexes a single file
    private static void indexFile(IndexWriter writer, File f) throws IOException {
        if (f.isHidden() || !f.exists() || !f.canRead()) {
            return;
        }
        System.out.println("Indexing " + f.getCanonicalPath());

        // Read the whole file. FileReader decodes with the platform default charset.
        StringBuilder text = new StringBuilder();
        BufferedReader reader = new BufferedReader(new FileReader(f));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } finally {
            reader.close();
        }

        Document doc = new Document();
        doc.add(new Field("FilePath", f.getCanonicalPath(), Field.Store.YES,
                Field.Index.ANALYZED, Field.TermVector.YES));
        doc.add(new Field("FileName", f.getName(), Field.Store.YES,
                Field.Index.ANALYZED, Field.TermVector.YES));
        // Stored, analyzed, with term vectors
        doc.add(new Field("textField", text.toString(), Field.Store.YES,
                Field.Index.ANALYZED, Field.TermVector.YES));
        // Add the document to the Lucene index
        writer.addDocument(doc);
    }
}

2.13.1 Deleting Documents with IndexReader

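1) IndexReader can delete a document by its document number.

2) IndexReader can delete documents by Term, much like IndexWriter, but IndexReader also returns the number of documents deleted.

3) If the application searches with the same reader, IndexReader's deletions take effect immediately, whereas deletions made through IndexWriter are not visible until the application opens a new reader.

4) IndexWriter can delete documents by Query, but IndexReader cannot.

5) IndexReader offers a sometimes very useful method, undeleteAll, which reverses all pending deletions in the index. It can only undelete documents whose segments have not yet been merged, because IndexWriter merely marks a deleted document as deleted; the document is actually removed when the segment containing it is merged.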
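
The mark-then-merge behavior in point 5 can be sketched with a toy segment that keeps a bit per document. Everything here (ToySegment and its methods) is hypothetical and uses only the JDK. It illustrates why undeleting works before a merge and not after, and it is not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of deletion by marking; NOT Lucene's implementation.
class ToySegment {
    private final List<String> docs = new ArrayList<String>();
    private final BitSet deleted = new BitSet();

    void add(String doc) {
        docs.add(doc);
    }

    // Deleting only sets a mark; the document itself is still stored.
    void delete(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("no such document: " + docId);
        }
        deleted.set(docId);
    }

    // Clears every pending deletion mark in this segment.
    void undeleteAll() {
        deleted.clear();
    }

    // Live documents are those without a deletion mark.
    List<String> liveDocs() {
        List<String> live = new ArrayList<String>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        return live;
    }

    // Merging copies only live documents, so marked ones are dropped for good.
    static ToySegment merge(ToySegment first, ToySegment second) {
        ToySegment merged = new ToySegment();
        for (String doc : first.liveDocs()) {
            merged.add(doc);
        }
        for (String doc : second.liveDocs()) {
            merged.add(doc);
        }
        return merged;
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        segment.add("a");
        segment.add("b");
        segment.delete(1);
        segment.undeleteAll();
        System.out.println("before merge: " + segment.liveDocs());

        segment.delete(1);
        ToySegment merged = merge(segment, new ToySegment());
        merged.undeleteAll(); // nothing left to restore
        System.out.println("after merge: " + merged.liveDocs());
    }
}
```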
