您的位置：首页 > 编程语言 > Java开发

Lucene入门教程（二）- 理解索引过程的核心类

2014-06-22 15:56 323 查看

在上一篇博客，我们了解了索引，并使用代码创建了索引，下面，我们说一下，索引过程中用到的核心类。

回顾一下，上一篇博客中我们的IndexFiles，这个类可以创建一个索引，看一下和Lucene有关的代码，有多少，发现了什么，

是不是有很多关于IO操作的代码，而真正和Lucene相关的代码不是很多。

[java]
view plaincopy

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open("/tmp/testindex");
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

上面这几行代码找不多就够了，下面，我们了解一下，这些核心类是干什么的。

1. IndexWriter

用于创建一个新的索引，并文档加到已有的索引当中去。可以把IndexWriter当做这样的对象：他可以为你提供对索引的写入操作，但不能用于读取或搜索。

2. Directory

描述了索引存放的位置，FSDirectory将索引存放在真实的文件系统中。

3. Analyzer

在文本被索引之前，需要经过分析器（Analyzer）的处理。在IndexWriter初始化是，我们指定了要使用的分析器，

它负责从将要被索引的的文本文件中提取词汇单元（tokens），并剔除剩下的无用信息。

4. Document

Document代表一些域（Field）的集合。可以把Document对象当成一个虚拟的文档-比如一个Web界面、文本文件等，可以从中取出大量的数据。

5. Field

索引中，每一个Document对象都包含一个或多个不同命名的域（Field），这些域包含于Field类中。每个域都对应一段数据，这些数据在搜索过程中，

可能会被查询或者在索引中被检索。

我们了解了核心的类，下面，我有重构了一下IndexFiles类，其中，主要是删除了一些参数校验什么的，让Lucene的代码更加的明显。

[java]
view plaincopy

package org.ygy.lucene;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
* 建立索引
*
* @author yuguiyang
*
*/
public class IndexFiles {

    /**
     * 创建索引
     *
     * @param indexPath
     *            存放索引的路径
     * @param docDir
     *            文档所在的文件
     * @throws IOException
     */
    private static void createIndex(String indexPath, String docsPath) throws IOException {

        // 验证资源文件地址
        File docDir = new File(docsPath);
        if ((!docDir.exists()) || (!docDir.canRead())) {
            System.out.println("Document directory '" + docDir.getAbsolutePath()
                    + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }

        // 存放索引的目录
        Directory dir = FSDirectory.open(new File(indexPath));

        // 标准分析器
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);

        // 索引配置
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);

        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        // 初始化索引
        IndexWriter writer = new IndexWriter(dir, iwc);

        // 遍历文件，将文件加入到索引中
        indexDocs(writer, docDir);

        // 关闭索引
        writer.close();
    }

    // 递归的方式，遍历每一个文件
    private static void indexDocs(IndexWriter writer, File file) throws IOException {

        if (file.canRead()) {
            if (file.isDirectory()) {
                String[] files = file.list();

                if (files != null) {
                    for (int i = 0; i < files.length; i++)
                        indexDocs(writer, new File(file, files[i]));
                }
            } else {
                createDocument(writer , file);
            }
        }

    }

    //创建Document，并添加到索引
    private static void createDocument(IndexWriter writer , File file) throws IOException {
        FileInputStream fis;
        try {
            fis = new FileInputStream(file);
        } catch (FileNotFoundException fnfe) {
            return;
        }

        try {
            // 建立一个文档
            Document doc = new Document();

            // 保存字符串的域，保存了文件路径
            Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
            doc.add(pathField);

            // Long型域，保存文件上次修改时间
            doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

            // Text型域，保存文件内容
            doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

            System.out.println("adding " + file);

            // 将文档加入到索引中
            writer.addDocument(doc);
        } finally {
            fis.close();
        }
    }

    // -index F:\Lucene_index -docs F:\Lucene_data
    public static void main(String[] args) {

        String indexPath = "F:\\Lucene_index"; // 索引存放的路径
        String docsPath = "F:\\Lucene_data"; // 需要给哪些文件建立索引（即资源库的地址）

        Date start = new Date();
        try {
            System.out.println("Indexing to directory '" + indexPath + "'...");

            // 创建索引
            createIndex(indexPath, docsPath);

            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }
    }

}

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签： java Lucene

相关文章推荐

新的分享

章节导航