
Lucene in Action, Chapter 2 (3) (index…)

2012-12-26 12:40
1. Indexing numeric values

In early versions of Lucene, a numeric value such as "1900" was treated as plain text: it was just a string, with no magnitude and no range. In practice, though, we often need to index numbers, for example book prices, or the send/receive times of emails.

Since Lucene 2.9, numeric indexing is supported.

This is done with the NumericField class, a numeric counterpart to Field; its setXxxValue methods set the field's numeric value.

doc.add(new NumericField("price").setDoubleValue(19.99));

The price field can then be used for searching and sorting, and can be matched exactly just like a textual value.

A single document may also contain several NumericFields with the same name.

For example:

doc.add(new NumericField("price").setDoubleValue(19.99));
doc.add(new NumericField("price").setDoubleValue(9.99));

At search time, NumericRangeQuery and NumericRangeFilter treat multiple values under the same field name as if they were OR'ed together. Sorting on such a multi-valued field, however, is undefined.

NumericField is also the way to index dates and times; the basic approach is to convert them to numbers first.

doc.add(new NumericField("timestamp").setLongValue(new Date().getTime()));

Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth").setIntValue(cal.get(Calendar.DAY_OF_MONTH)));
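The same range-query machinery then works over time values. A small sketch (again assuming Lucene 3.6; the fixed epoch-millisecond values stand in for real Date.getTime() results):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TimestampRangeDemo {
    // Index documents with long timestamps, then count hits inside [from, to].
    public static int countBetween(long from, long to) throws IOException {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_36),
                IndexWriter.MaxFieldLength.UNLIMITED);
        long[] timestamps = { 1000L, 2000L, 3000L };  // stand-ins for Date.getTime() values
        for (long t : timestamps) {
            Document doc = new Document();
            doc.add(new NumericField("timestamp").setLongValue(t));
            writer.addDocument(doc);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        NumericRangeQuery<Long> query =
                NumericRangeQuery.newLongRange("timestamp", from, to, true, true);
        int hits = searcher.search(query, 10).totalHits;
        searcher.close();
        return hits;
    }
}
```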

2. Field truncation

When instantiating an IndexWriter, we pass an IndexWriter.MaxFieldLength argument, for example UNLIMITED:

new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);

This parameter controls field truncation. When indexing a field value we don't know in advance how long it will be: it may be very long, or it may be large non-textual binary content, either of which can cause problems. This parameter caps how many terms are indexed.

Alternatively, call setMaxFieldLength on the IndexWriter:

writer.setMaxFieldLength(maxFieldLength)

Here maxFieldLength means the maximum number of terms indexed per field: "The maximum number of terms that will be indexed for a single field in a document."


If the writer starts out without truncation and a truncation length is set later, documents already indexed are unaffected.


Use this feature with care: it can leave the index incomplete and hurt the search experience.
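A short sketch of the effect (assuming Lucene 3.6; WhitespaceAnalyzer and the synthetic "w0 … w99" terms are just for illustration). With the limit set to 10, only the first 10 terms of the field are searchable:

```java
import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TruncationDemo {
    // Index one document whose field contains "w0 w1 ... w99", truncated to the
    // first 10 terms, then report whether a given word made it into the index.
    public static boolean isIndexed(String word) throws IOException {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new WhitespaceAnalyzer(Version.LUCENE_36),
                new IndexWriter.MaxFieldLength(10)); // keep only the first 10 terms
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 100; i++) text.append('w').append(i).append(' ');
        Document doc = new Document();
        doc.add(new Field("body", text.toString(), Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        int hits = searcher.search(new TermQuery(new Term("body", word)), 1).totalHits;
        searcher.close();
        return hits > 0;
    }
}
```

Searching for "w5" finds the document; searching for "w50" does not, because everything past the tenth term was silently dropped.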


3. Near-real-time search

When we are constantly modifying an index through a writer and want to search the latest content, we can call the writer's getReader method. It folds the writer's changes, committed or not, into the index view and returns a read-only IndexReader:

IndexReader reader = writer.getReader();

Remember to close this reader once you are done with it.

Here is Lucene's own explanation:

"near real-time" searching, in that changes made during an IndexWriter session can be quickly made available for searching without closing the writer nor calling commit(). Note that this is functionally equivalent to calling {#flush} and then using IndexReader.open(org.apache.lucene.store.Directory) to open a new reader. But the turnaround time of this method should be faster since it avoids the potentially costly commit(). You must close the IndexReader returned by this method once you are done using it.
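A minimal sketch of the round trip (assuming Lucene 3.6; the field name and text are illustrative). The added document is visible through the NRT reader even though commit() is never called before the search:

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NearRealTimeDemo {
    // Add a document, then search it via writer.getReader() without committing.
    public static int searchUncommitted() throws IOException {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_36),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", "hello lucene", Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);                  // not committed yet

        IndexReader reader = writer.getReader();  // sees the uncommitted change
        IndexSearcher searcher = new IndexSearcher(reader);
        int hits = searcher.search(new TermQuery(new Term("title", "hello")), 1).totalHits;
        searcher.close();
        reader.close();                           // remember to close the NRT reader
        writer.close();
        return hits;
    }
}
```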

4. Optimizing the index

Why optimize: frequent updates through an IndexWriter produce many segments (each segment consists of several files). At search time Lucene searches every segment and merges the results, which slows things down; too many segments also consume more file descriptors. Hence the need to optimize.

The goal of optimization is to merge the scattered segments down to N segments, speeding up search.

IndexWriter provides four optimize methods:

- optimize() reduces the index to a single segment, not returning until the operation is finished.
- optimize(int maxNumSegments), also known as partial optimize, reduces the index to at most maxNumSegments segments. Because the final merge down to one segment is the most costly, optimizing to, say, five segments should be quite a bit faster than optimizing down to one segment, allowing you to trade less optimization time for slower search speed.
- optimize(boolean doWait) is just like optimize, except if doWait is false then the call returns immediately while the necessary merges take place in the background. Note that doWait=false only works for a merge scheduler that runs merges in background threads, such as the default ConcurrentMergeScheduler. Section 2.13.6 describes merge schedulers in more detail.
- optimize(int maxNumSegments, boolean doWait) is a partial optimize that runs in the background if doWait is false.

Note! Optimization needs extra disk space: the merged segments are written as new temporary files, and the original segments are deleted only after the merge completes, so there must be enough free disk space available.
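The methods above can be sketched as follows (assuming Lucene 3.6; committing after every document is an artificial way to produce several segments). After a partial optimize the index holds at most maxNumSegments segments:

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class OptimizeDemo {
    // Create several segments (one commit per document), merge them down,
    // and return how many segments remain.
    public static int segmentCountAfterOptimize(int maxNumSegments) throws IOException {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_36),
                IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < 10; i++) {
            Document doc = new Document();
            doc.add(new Field("id", "doc" + i, Field.Store.NO, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.commit();              // each commit can leave a new segment behind
        }
        writer.optimize(maxNumSegments);  // blocking partial optimize
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        IndexReader[] segments = reader.getSequentialSubReaders();
        int count = (segments == null) ? 1 : segments.length;
        reader.close();
        return count;
    }
}
```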

5. Lucene's Directory

Directory dir = FSDirectory.open(new File(indexDir));
writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), false, IndexWriter.MaxFieldLength.UNLIMITED);

TODO.

6. IndexReader and IndexWriter under multiple threads


1. Any number of IndexReaders can be open on the same index, whether in one JVM or across several JVMs. Within a single JVM, though, having multiple threads share one IndexReader makes better use of resources. An IndexReader can even be opened while an IndexWriter is modifying the index.

2. Only one IndexWriter can be open on an index at a time. When an IndexWriter is instantiated it acquires a write lock, which is released only when that writer is closed; until then no other IndexWriter can open the same index.

3. Any number of threads can share a single IndexReader or IndexWriter, because both classes are thread-safe.

4. The IndexWriter lock: at any moment only one IndexWriter may be open on an index, and this is enforced through the write.lock file.

For example, opening two IndexWriters on the same index throws a LockObtainFailedException:
{

IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);

IndexWriter writer2 = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);

}

Exception in thread "main" org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@D:\work\eclipseworkpace\lucenelearn\charpter2-1\write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1098)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:953)
at charpter2.ChangeIndex.<init>(ChangeIndex.java:46)
at charpter2.ChangeIndex.main(ChangeIndex.java:177)

The locking mechanism can be customized with Directory's setLockFactory method:

dir = FSDirectory.open(new File(indexDir));
dir.setLockFactory(lockFactory);
this.writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);

Note that setLockFactory must be called before the IndexWriter constructor.

Lucene ships four LockFactory implementations:

NativeFSLockFactory: This is the default locking for FSDirectory, using java.nio native OS locking, which will never leave leftover lock files when the JVM exits. But this locking implementation may not work correctly over certain shared file systems, notably NFS.

SimpleFSLockFactory: Uses Java's File.createNewFile API, which may be more portable across different file systems than NativeFSLockFactory. Be aware that if the JVM crashes or IndexWriter isn't closed before the JVM exits, this may leave a leftover write.lock file, which you must manually remove.

SingleInstanceLockFactory: Creates a lock entirely in memory. This is the default locking implementation for RAMDirectory. Use this when you know all IndexWriters will be instantiated in a single JVM.

NoLockFactory: Disables locking entirely. Be careful! Only use this when you are absolutely certain that Lucene's normal locking safeguard isn't necessary, for example, when using a private RAMDirectory with a single IndexWriter instance.

IndexWriter's isLocked(Directory) method checks whether an index is locked.

IndexWriter's unlock(Directory) method forcibly unlocks an index, but it is best avoided: it can render an index unusable.
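The lock life cycle can be observed directly (a sketch assuming Lucene 3.6; RAMDirectory uses SingleInstanceLockFactory by default, so no files are involved). The lock is held while the writer is open and released when it closes:

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LockDemo {
    // Return {locked while the writer is open, locked after it is closed}.
    public static boolean[] lockStates() throws IOException {
        RAMDirectory dir = new RAMDirectory();  // SingleInstanceLockFactory by default
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_36),
                IndexWriter.MaxFieldLength.UNLIMITED);
        boolean lockedWhileOpen = IndexWriter.isLocked(dir);
        writer.close();                         // closing releases the write lock
        boolean lockedAfterClose = IndexWriter.isLocked(dir);
        return new boolean[] { lockedWhileOpen, lockedAfterClose };
    }
}
```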

7. Debugging Lucene

this.writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), IndexWriter.MaxFieldLength.UNLIMITED);
this.writer.setInfoStream(System.out);

With setInfoStream(PrintStream) enabled, Lucene prints detailed information about the entire indexing process, as shown below:
docs = 4
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: start
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: enter lock
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: now prepare
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: prepareCommit: flush
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: now trigger flush reason=explicit flush
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: start flush: applyAllDeletes=true
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: index before flush _0(3.6.1):C2/1 _2(3.6.1):C2/1 _4(3.6.1):C1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush postings as segment _5 numDocs=1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: new segment has no vectors
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flushedFiles=[_5.fdt, _5.prx, _5.fnm, _5.nrm, _5.tis, _5.fdx, _5.frq, _5.tii]
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush: segment=_5(3.6.1):C1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: ramUsed=0.095 MB newFlushedSize=0 MB (0 MB w/o doc stores) docs/MB=3,898.052 new/old=0.191%
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush: push buffered deletes startSize=98 frozenSize=1024
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: push deletes 1 deleted terms (unique count=1) bytesUsed=1024 delGen=1 packetCount=1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush: delGen=1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: DW: flush time 314 msec
IFD [Sun Mar 18 17:54:58 CST 2012; main]: now checkpoint "segments_5" [4 segments ; isCommit = false]
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: apply all deletes during flush
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: applyDeletes: infos=[_0(3.6.1):C2/1, _2(3.6.1):C2/1, _4(3.6.1):C1, _5(3.6.1):C1] packetCount=1
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_5(3.6.1):C1 segGen=1 segDeletes=[ 1 deleted terms (unique count=1) bytesUsed=1024]; coalesced deletes=[null] delCount=0
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_4(3.6.1):C1/1 segGen=0 coalesced deletes=[CoalescedDeletes(termSets=1,queries=0)] delCount=1 100% deleted
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_2(3.6.1):C2/1 segGen=0 coalesced deletes=[CoalescedDeletes(termSets=1,queries=0)] delCount=0
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: seg=_0(3.6.1):C2/1 segGen=0 coalesced deletes=[CoalescedDeletes(termSets=1,queries=0)] delCount=0
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: applyDeletes took 27 msec
IFD [Sun Mar 18 17:54:58 CST 2012; main]: now checkpoint "segments_5" [4 segments ; isCommit = false]
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: drop 100% deleted segments: [_4(3.6.1):C1/1]
IFD [Sun Mar 18 17:54:58 CST 2012; main]: now checkpoint "segments_5" [3 segments ; isCommit = false]
IFD [Sun Mar 18 17:54:58 CST 2012; main]: delete "_4_1.del"
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: prune sis=org.apache.lucene.index.SegmentInfos@d08633 minGen=2 packetCount=1
BD 0 [Sun Mar 18 17:54:58 CST 2012; main]: pruneDeletes: prune 1 packets; 0 packets remain
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: clearFlushPending
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: findMerges: 3 segments
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: seg=_0(3.6.1):C2/1 level=2.240549 size=0.000 MB
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: seg=_2(3.6.1):C2/1 level=2.240549 size=0.000 MB
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: seg=_5(3.6.1):C1 level=2.429752 size=0.000 MB
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: LMP: level -1.0 to 2.429752: 3 segments
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: CMS: now merge
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: CMS: index: _0(3.6.1):C2/1 _2(3.6.1):C2/1 _5(3.6.1):C1
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: CMS: no more merges pending; now return
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: startCommit(): start
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: startCommit index=_0(3.6.1):C2/1 _2(3.6.1):C2/1 _5(3.6.1):C1 changeCount=4
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: done all syncs
IW 0 [Sun Mar 18 17:54:58 CST 2012; main]: commit: pendingCommit != null
IW 0 [Sun Mar 18 17:54:59 CST 2012; main]: commit: wrote segments file "segments_6"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: now checkpoint "segments_6" [3 segments ; isCommit = true]
IFD [Sun Mar 18 17:54:59 CST 2012; main]: deleteCommits: now decRef commit "segments_5"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.prx"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.prx": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.prx; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.fnm"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.fdx"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.fdx": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.fdx; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.frq"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.frq": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.frq; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.tis"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.tis": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.tis; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.tii"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.nrm"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.nrm": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.nrm; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "_4.fdt"
IFD [Sun Mar 18 17:54:59 CST 2012; main]: unable to remove file "_4.fdt": java.io.IOException: Cannot delete D:\work\eclipseworkpace\lucenelearn\charpter2-1\_4.fdt; Will re-try later.
IFD [Sun Mar 18 17:54:59 CST 2012; main]: delete "segments_5"
IW 0 [Sun Mar 18 17:54:59 CST 2012; main]: commit: done