
Lucene in Action, Chapter 2 (1): Diving into the Index

2012-12-26 12:39
Lucene's index

A Document is the atomic unit of indexing and searching in Lucene. Each Document contains a number of Fields, and each Field holds the actual content.

I. There are three decisions to make for each Field:

new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED, TermVector.WITH_POSITIONS);


1. Whether (and how) to index it, expressed with Field.Index.ANALYZED, Field.Index.NO, and so on:

ANALYZED: run the value through the analyzer and index the resulting tokens.

ANALYZED_NO_NORMS: a variant of ANALYZED. The difference is that ANALYZED stores norms (index-time boost information and the like) while ANALYZED_NO_NORMS does not, which saves memory at search time.

NO: this field is not indexed and cannot be searched.

NOT_ANALYZED: index the value as a single token without running the analyzer; commonly used for exact matching on things like filenames and IDs.

NOT_ANALYZED_NO_NORMS: likewise, the no-norms variant of NOT_ANALYZED.

2. If it is indexed, whether to store a term vector.

A term is a token produced by the analyzer. For each document, an indexed field can store a term vector, which records the unique terms the field contains (a term that occurs several times is stored once as a key), along with each term's positions within the field and its character offsets. This information can later be used, for example, to highlight a matched term.
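What a term vector records can be sketched in plain Java. This is only an illustration of the idea, using a naive whitespace tokenizer; it is not Lucene's implementation. For each unique (lowercased) term we keep its token positions and its [start, end) character offsets:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermVectorSketch {

    // For each lowercased term: a list of {position, startOffset, endOffset} triples.
    static Map<String, List<int[]>> termVector(String fieldText) {
        Map<String, List<int[]>> vector = new LinkedHashMap<>();
        int position = 0;
        int searchFrom = 0;
        for (String token : fieldText.split("\\s+")) {
            int start = fieldText.indexOf(token, searchFrom);
            int end = start + token.length();
            vector.computeIfAbsent(token.toLowerCase(), k -> new ArrayList<>())
                  .add(new int[] {position, start, end});
            position++;
            searchFrom = end;
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, List<int[]>> v = termVector("Den Haag den");
        // "den" is stored once as a key, with both of its occurrences listed under it
        for (int[] occ : v.get("den")) {
            System.out.println("position=" + occ[0] + " offsets=[" + occ[1] + "," + occ[2] + ")");
        }
    }
}
```

Positions plus offsets are exactly what a highlighter needs to locate a matched term inside the original text.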

3. Whether the field's value is stored in the index, expressed with Field.Store.YES or Field.Store.NO.


II. The Document's flexible schema:

Unlike a database table, where every row (the analogue of a document here) must have the same columns, Lucene does not require every document to have the same fields. For example, doc1 can have field1 and field2 while doc2 has field1, field2, and field3, and they can still be added to the same index.


III. Lucene's inverted index:

What is an inverted index?

Lucene uses the analyzed terms as the lookup keys. Rather than answering "which words does this document contain?", the index answers "which documents does this word appear in?".
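The difference can be sketched in a few lines of plain Java (a toy illustration, not Lucene's on-disk format): instead of mapping each document to its words, the inverted index maps each word to the ascending list of document IDs that contain it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // word -> ascending list of document IDs containing that word
    static Map<String, List<Integer>> build(String[] docs) {
        Map<String, List<Integer>> inverted = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String word : docs[docId].toLowerCase().split("\\s+")) {
                List<Integer> postings =
                        inverted.computeIfAbsent(word, k -> new ArrayList<>());
                // a document is listed once per word, however often the word occurs in it
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build(new String[] {
            "Amsterdam has lots of bridges",   // doc 0
            "Venice has lots of canals"        // doc 1
        });
        // "which documents does 'lots' appear in?" is now a single lookup
        System.out.println(index.get("lots"));    // [0, 1]
        System.out.println(index.get("canals"));  // [1]
    }
}
```

Answering the query is a map lookup instead of a scan over every document, which is what makes search fast.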


IV. Index segments

Every Lucene index consists of one or more index segments.

Each segment is a standalone index that holds a subset of all the documents.

Whenever the IndexWriter writes changes to the index, a new segment is created.

At search time, the search runs against each segment separately, and the per-segment results are merged into one overall result for the user.

Each segment consists of several files, such as _0.fdt, _0.fdx, and so on, named in the _X.<ext> pattern.

For example, the figure below shows an index with two segments, segment0 and segment1.



[Figure: an index directory containing two segments, segment0 and segment1]


The segments_<NUM> file is very important: it holds references to all the other segments. Lucene opens this file first, then opens the segments it points to (each segment being an index in its own right).

In the figure, for example, the file is segments_2, where 2 is the "generation": the IndexWriter increments this number on every commit.

When there are too many segments, opening the index consumes a lot of resources (file descriptors, for example), so the IndexWriter uses a MergeScheduler to merge segments at appropriate times and keep their number down.
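Conceptually, the heart of a segment merge is folding sorted per-segment structures into one larger sorted structure. Here is a minimal sketch of that core step in plain Java, merging two ascending posting lists; Lucene's real merge code handles far more (deletions, stored fields, norms, and so on):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeSketch {

    // Merge two ascending lists of document IDs into one ascending list,
    // the way a segment merge folds smaller segments into a larger one (simplified).
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList(0, 3, 7), Arrays.asList(1, 2, 9)));
        // [0, 1, 2, 3, 7, 9]
    }
}
```

Because both inputs are already sorted, the merged segment is produced in a single linear pass, which is why merging is cheap relative to rebuilding the index.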

V. Modifying the index:

First decide which IndexWriter constructor to use; here are two of them:

IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl)

IndexWriter(Directory d, Analyzer a, IndexWriter.MaxFieldLength mfl)

With the first constructor and create = true, every newly opened IndexWriter creates a brand-new index, so documents added in earlier sessions are discarded. To modify an existing index, use create = false, or use the second constructor (it has no create flag, but first checks whether an index already exists, opens it if so, and creates one otherwise).

Adding a document to the index:

Document doc = new Document();
doc.add(new Field("id", ids[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("country", unindexed[i], Field.Store.YES, Field.Index.NO));
doc.add(new Field("contents", unstored[i], Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("city", text[i], Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);

Deleting documents

The following delete methods are available:

deleteDocuments(Term) deletes all documents containing the provided term.
deleteDocuments(Term[]) deletes all documents containing any of the terms in the provided array.
deleteDocuments(Query) deletes all documents matching the provided query.
deleteDocuments(Query[]) deletes all documents matching any of the queries in the provided array.
deleteAll() deletes all documents in the index. This is exactly the same as closing the writer and opening a new writer with create=true, without having to close your writer.

public void deleteDocuments(Term[] terms) throws CorruptIndexException, IOException {
    // deletes all documents containing any of the terms in the provided array
    this.writer.deleteDocuments(terms);
    System.out.println("docs = " + writer.numDocs());
}

Updating a document

Remember that the basic unit of an index operation is a document: update replaces a whole document, never a single field. Under the hood, an update is a delete followed by an add.

The update methods are:

updateDocument(Term, Document) first deletes all documents containing the provided term and then adds the new document using the writer's default analyzer.

updateDocument(Term, Document, Analyzer) does the same but uses the provided analyzer instead of the writer's default analyzer.

-------------------------------------------------------------------------

package charpter2;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ChangeIndex {

    private IndexWriter writer;

    protected String[] ids = {"1", "2"};
    protected String[] unindexed = {"Netherlands", "Italy"};
    protected String[] unstored = {"Amsterdam has lots of bridges",
                                   "Venice has lots of canals"};
    protected String[] text = {"Amsterdam", "Venice"};

    Directory dir = null;

    public ChangeIndex(String indexDir) throws IOException {
        dir = FSDirectory.open(new File(indexDir));
        // The "create" argument of the three-argument constructor
        // IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl)
        // would have to be false; this two-argument form opens the existing
        // index, or creates one if none exists.
        this.writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_36),
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void addDocuments() throws CorruptIndexException, IOException {
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();
            doc.add(new Field("id", ids[i],
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("country", unindexed[i],
                    Field.Store.YES, Field.Index.NO));
            doc.add(new Field("contents", unstored[i],
                    Field.Store.NO, Field.Index.ANALYZED));
            doc.add(new Field("city", text[i],
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        System.out.println("docs = " + writer.numDocs());
    }

    public void deleteDocuments(Term[] terms) throws CorruptIndexException, IOException {
        // deletes all documents containing any of the terms in the provided array
        this.writer.deleteDocuments(terms);
        System.out.println("docs = " + writer.numDocs());
    }

    public void updateDocuments(Term term) throws CorruptIndexException, IOException {
        Document doc = new Document();
        doc.add(new Field("id", "1",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("country", "Netherlands",
                Field.Store.YES, Field.Index.NO));
        doc.add(new Field("contents", "Den Haag has a lot of museums",
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("city", "Den Haag",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.updateDocument(new Term("id", "1"), doc);
        System.out.println("docs = " + writer.numDocs());
    }

    public void search(String fieldName, String q)
            throws CorruptIndexException, IOException, ParseException {
        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser(Version.LUCENE_36, "contents",
                new StandardAnalyzer(Version.LUCENE_36));
        Query query = parser.parse(q);
        TopDocs hits = searcher.search(query, 20);
        System.out.println("search result:");
        for (ScoreDoc doc : hits.scoreDocs) {
            // fetch the matching document
            Document d = searcher.doc(doc.doc);
            System.out.println(d.get("contents"));
        }
        searcher.close();
    }

    public void commit() throws CorruptIndexException, IOException {
        this.writer.commit();
    }

    public static void main(String[] args) throws IOException, ParseException {
        ChangeIndex ci = new ChangeIndex("charpter2-1");

        // test adding to the index
        ci.addDocuments();
        ci.commit();

        // test deleting from the index
        // the terms to delete:
        // Term[] terms = {new Term("id", "1"), new Term("id", "10")};
        // ci.deleteDocuments(terms);

        // test updating the index
        System.out.println("before update");
        ci.search("contents", "Haag");
        ci.updateDocuments(new Term("id", "1"));
        ci.commit();
        System.out.println("after update");
        ci.search("contents", "Haag");
    }
}