Lucene's search process (index files)

2015-08-28 17:56

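Broadly speaking, searching means reading the term dictionary and the postings lists from the index, merging the postings lists according to the user's query to obtain the set of result documents, and then scoring those documents.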

The original post shows a diagram of this flow here. It consists of the following steps:

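The IndexReader opens the index files, opening and reading the streams that point to them.

The user enters a query string.

The query string is converted into a tree of Query objects. (You can inspect this tree in Luke.)

A tree of Weight objects is built to compute term weights, that is, the part of the scoring formula that depends on the query but not on the document (the part marked in red in the original formula).

A tree of Scorer objects is built to compute scores.

While the Scorer tree is being built, its leaf nodes, the TermScorers, read the term dictionary and the postings lists from the index.

A tree of SumScorer objects is built. It is a reorganization of the Scorer tree that makes merging postings lists easier; its leaf nodes are still TermScorers, which hold the term dictionary and the postings lists. This step merges the postings lists to obtain the result document set and computes, for each result document, the part of the scoring formula marked in blue. The summation sign in the scoring formula is not a plain addition: the scores of the sub-queries are combined according to how their postings lists are merged (AND, OR, NOT) to produce the score of the parent query.

The collected result set is returned to the user together with its scores.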
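The merging step can be pictured with plain sorted lists of document ids. The sketch below is not Lucene code: `PostingsMerge` and its three methods are hypothetical names chosen for illustration, and scoring is left out entirely. It only shows how AND, OR and NOT each turn two postings lists into one result list.

```java
import java.util.ArrayList;
import java.util.List;

class PostingsMerge {
    // AND: keep doc ids that occur in both sorted lists.
    static List<Integer> and(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) { out.add(x); i++; j++; }
            else if (x < y) i++;
            else j++;
        }
        return out;
    }

    // OR: keep doc ids that occur in either sorted list, without duplicates.
    static List<Integer> or(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j >= b.size() || (i < a.size() && a.get(i) < b.get(j))) {
                out.add(a.get(i++));
            } else if (i >= a.size() || b.get(j) < a.get(i)) {
                out.add(b.get(j++));
            } else {               // same id in both lists
                out.add(a.get(i));
                i++;
                j++;
            }
        }
        return out;
    }

    // NOT: keep doc ids of a that do not occur in b.
    static List<Integer> not(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int j = 0;
        for (int x : a) {
            while (j < b.size() && b.get(j) < x) j++;
            if (j >= b.size() || b.get(j) != x) out.add(x);
        }
        return out;
    }
}
```

Because both inputs are sorted, each merge is a single linear pass, which is the reason postings lists are kept in document-id order.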

The Lucene search process in detail:

To trace how Lucene searches the index files, the following files were indexed beforehand:

file01.txt: apple apples cat dog

file02.txt: apple boy cat category

file03.txt: apply dog eat etc

file04.txt: apply cat foods
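To make "term dictionary" and "postings list" concrete for these four files, here is a toy in-memory inverted index. It is only a sketch under stated assumptions: `ToyInvertedIndex` is a hypothetical class, tokens are split on whitespace with no analyzer or stemming, and the document ids 1 to 4 are simply assigned in file order. Lucene's on-disk format is far more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class ToyInvertedIndex {
    // Sorted term dictionary: term -> postings list (ascending doc ids).
    final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    // Documents must be added in increasing docId order so lists stay sorted.
    void add(int docId, String text) {
        for (String term : text.split("\\s+")) {
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    // Index the four sample files from the text (ids are an assumption).
    static ToyInvertedIndex sample() {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(1, "apple apples cat dog");
        idx.add(2, "apple boy cat category");
        idx.add(3, "apply dog eat etc");
        idx.add(4, "apply cat foods");
        return idx;
    }

    public static void main(String[] args) {
        // Print the dictionary in term order, each term with its postings.
        sample().postings.forEach((term, docs) -> System.out.println(term + " -> " + docs));
    }
}
```

A query such as "apple AND cat" would then be answered by intersecting the lists stored under `apple` and `cat`.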

Open an IndexReader on the index directory

IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));

This actually calls DirectoryReader.open(Directory, IndexDeletionPolicy, IndexCommit, boolean, int), whose main job is to create a SegmentInfos.FindSegmentsFile object and use it to find all the segments in the index.

Tracing the source:

IndexReader reader = IndexReader.open(indexpath);

|__ the open method

public static IndexReader open(final Directory directory) throws CorruptIndexException, IOException {
  return open(directory, null, null, true, DEFAULT_TERMS_INDEX_DIVISOR);
}

    |__ step into the open() overload called in the return statement; parameters the caller did not supply are filled in with null or a default value

private static IndexReader open(final Directory directory, final IndexDeletionPolicy deletionPolicy, final IndexCommit commit, final boolean readOnly, int termInfosIndexDivisor) throws CorruptIndexException, IOException {
  return DirectoryReader.open(directory, deletionPolicy, commit, readOnly, termInfosIndexDivisor);
}

At this point execution reaches DirectoryReader.open(directory, deletionPolicy, commit, readOnly, termInfosIndexDivisor);

So a call to IndexReader.open(...) ultimately executes DirectoryReader.open(), whose main job is to create a SegmentInfos.FindSegmentsFile object, use it to find all the segments in this index, and open them.

The source follows, from DirectoryReader.open() down to SegmentInfos.FindSegmentsFile. This is the method that DirectoryReader.open() runs:

static IndexReader open(final Directory directory, final IndexDeletionPolicy deletionPolicy, final IndexCommit commit, final boolean readOnly,
                        final int termInfosIndexDivisor) throws CorruptIndexException, IOException {
  return (IndexReader) new SegmentInfos.FindSegmentsFile(directory) {
    @Override
    protected Object doBody(String segmentFileName) throws CorruptIndexException, IOException {
      SegmentInfos infos = new SegmentInfos();
      infos.read(directory, segmentFileName);
      if (readOnly)
        return new ReadOnlyDirectoryReader(directory, infos, deletionPolicy, termInfosIndexDivisor);
      else
        return new DirectoryReader(directory, infos, deletionPolicy, false, termInfosIndexDivisor);
    }
  }.run(commit);
}
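The shape of this code is the template-method pattern: an abstract class owns the logic for choosing a file name, and an anonymous subclass supplies what to do with that name. A stripped-down sketch of that shape follows. Every name in it (`FindFileTemplate`, `FindFile`, the `latestName` argument, the returned strings) is hypothetical and exists only to show the control flow, not Lucene's actual classes.

```java
class FindFileTemplate {
    // Hypothetical stand-in for SegmentInfos.FindSegmentsFile: run() decides
    // WHICH file name to use, the subclass decides WHAT to do with it.
    abstract static class FindFile<T> {
        protected abstract T doBody(String fileName);

        T run(String explicitName, String latestName) {
            // An explicitly requested name wins; otherwise use the latest one.
            return doBody(explicitName != null ? explicitName : latestName);
        }
    }

    // Mirrors the structure of open(): build an anonymous subclass whose
    // doBody() captures the caller's arguments, then immediately run it.
    static String open(String explicitName, final boolean readOnly) {
        return new FindFile<String>() {
            @Override
            protected String doBody(String fileName) {
                return (readOnly ? "read-only" : "writable") + " reader on " + fileName;
            }
        }.run(explicitName, "segments_3"); // "segments_3" is a made-up latest name
    }

    public static void main(String[] args) {
        System.out.println(open(null, true));
        System.out.println(open("segments_1", false));
    }
}
```

The benefit of this split is that the file-lookup logic in `run` can be reused by any caller that needs the current segments file, each with its own `doBody`.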

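new SegmentInfos.FindSegmentsFile(directory) instantiates FindSegmentsFile, declared as public abstract static class FindSegmentsFile, a nested abstract class of SegmentInfos. It is a utility class for code that needs to work with the current segments file. This is necessary with lock-less commits: between the moment you locate the current segments file name and the moment you open it, read its contents, or check whether it has been modified, the file may already have been deleted because a writer finished a commit. (Paraphrased from the source comment.)

The abstract class has a run method: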

public Object run(IndexCommit commit) throws CorruptIndexException, IOException {
  if (commit != null) {
    if (directory != commit.getDirectory())
      throw new IOException("the specified commit does not match the specified Directory");
    return doBody(commit.getSegmentsFileName());
  }

  String segmentFileName = null;
  long lastGen = -1;
  long gen = 0;
  int genLookaheadCount = 0;
  IOException exc = null;
  boolean retry = false;

  int method = 0;

  // Loop until we succeed in calling doBody() without
  // hitting an IOException.  An IOException most likely
  // means a commit was in process and has finished, in
  // the time it took us to load the now-old infos files
  // (and segments files).  It's also possible it's a
  // true error (corrupt index).  To distinguish these,
  // on each retry we must see "forward progress" on
  // which generation we are trying to load.  If we
  // don't, then the original error is real and we throw
  // it.

  // We have three methods for determining the current
  // generation.  We try the first two in parallel, and
  // fall back to the third when necessary.

  while(true) {

    if (0 == method) {

      // Method 1: list the directory and use the highest
      // segments_N file.  This method works well as long
      // as there is no stale caching on the directory
      // contents (NOTE: NFS clients often have such stale
      // caching):
      String[] files = null;

      long genA = -1;

      files = directory.listAll();

      if (files != null)
        genA = getCurrentSegmentGeneration(files);

      message("directory listing genA=" + genA);

      // Method 2: open segments.gen and read its
      // contents.  Then we take the larger of the two
      // gen's.  This way, if either approach is hitting
      // a stale cache (NFS) we have a better chance of
      // getting the right generation.
      long genB = -1;
      for(int i=0;i<defaultGenFileRetryCount;i++) {
        IndexInput genInput = null;
        try {
          genInput = directory.openInput(IndexFileNames.SEGMENTS_GEN);
        } catch (FileNotFoundException e) {
          message("segments.gen open: FileNotFoundException " + e);
          break;
        } catch (IOException e) {
          message("segments.gen open: IOException " + e);
        }

        if (genInput != null) {
          try {
            int version = genInput.readInt();
            if (version == FORMAT_LOCKLESS) {
              long gen0 = genInput.readLong();
              long gen1 = genInput.readLong();
              message("fallback check: " + gen0 + "; " + gen1);
              if (gen0 == gen1) {
                // The file is consistent.
                genB = gen0;
                break;
              }
            }
          } catch (IOException err2) {
            // will retry
          } finally {
            genInput.close();
          }
        }
        try {
          Thread.sleep(defaultGenFileRetryPauseMsec);
        } catch (InterruptedException ie) {
          throw new ThreadInterruptedException(ie);
        }
      }

      message(IndexFileNames.SEGMENTS_GEN + " check: genB=" + genB);

      // Pick the larger of the two gen's:
      if (genA > genB)
        gen = genA;
      else
        gen = genB;

      if (gen == -1) {
        // Neither approach found a generation
        throw new FileNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
      }
    }

    // Third method (fallback if first & second methods
    // are not reliable): since both directory cache and
    // file contents cache seem to be stale, just
    // advance the generation.
    if (1 == method || (0 == method && lastGen == gen && retry)) {

      method = 1;

      if (genLookaheadCount < defaultGenLookaheadCount) {
        gen++;
        genLookaheadCount++;
        message("look ahead increment gen to " + gen);
      }
    }

    if (lastGen == gen) {

      // This means we're about to try the same
      // segments_N last tried.  This is allowed,
      // exactly once, because writer could have been in
      // the process of writing segments_N last time.

      if (retry) {
        // OK, we've tried the same segments_N file
        // twice in a row, so this must be a real
        // error.  We throw the original exception we
        // got.
        throw exc;
      } else {
        retry = true;
      }

    } else if (0 == method) {
      // Segment file has advanced since our last loop, so
      // reset retry:
      retry = false;
    }

    lastGen = gen;

    segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,
                                                            "",
                                                            gen);

    try {
      Object v = doBody(segmentFileName);
      if (exc != null) {
        message("success on " + segmentFileName);
      }
      return v;
    } catch (IOException err) {

      // Save the original root cause:
      if (exc == null) {
        exc = err;
      }

      message("primary Exception on '" + segmentFileName + "': " + err + "'; will retry: retry=" + retry + "; gen = " + gen);

      if (!retry && gen > 1) {

        // This is our first time trying this segments
        // file (because retry is false), and, there is
        // possibly a segments_(N-1) (because gen > 1).
        // So, check if the segments_(N-1) exists and
        // try it if so:
        String prevSegmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,
                                                                           "",
                                                                           gen-1);

        final boolean prevExists;
        prevExists = directory.fileExists(prevSegmentFileName);

        if (prevExists) {
          message("fallback to prior segment file '" + prevSegmentFileName + "'");
          try {
            Object v = doBody(prevSegmentFileName);
            if (exc != null) {
              message("success on fallback " + prevSegmentFileName);
            }
            return v;
          } catch (IOException err2) {
            message("secondary Exception on '" + prevSegmentFileName + "': " + err2 + "'; will retry");
          }
        }
      }
    }
  }
}

　　

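run first checks whether a specific commit was passed in. If commit is not null, it verifies that the commit's Directory is the very one this FindSegmentsFile was created for (throwing an IOException otherwise), takes segmentFileName from the commit, and calls doBody(segmentFileName) directly. If commit is null, it enters the retry loop, which looks for the newest segments_N file itself.

A note on IndexCommit: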
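The two paths through run can be condensed into a few lines. The following is a deliberately simplified sketch, not the real method: `SegmentsFileLookup`, `Body` and the `existingFiles` set are hypothetical, the generation is passed in rather than discovered, and the retry counting and look-ahead of the real loop are omitted. It keeps only the explicit-commit shortcut and the single fallback to segments_(N-1).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Set;

class SegmentsFileLookup {
    // Hypothetical stand-in for doBody(): loads a segments file or fails.
    interface Body {
        String load(String fileName) throws IOException;
    }

    // Builds "segments_N"; the radix-36 suffix follows Lucene's naming scheme.
    static String name(long gen) {
        return "segments_" + Long.toString(gen, Character.MAX_RADIX);
    }

    static String run(String commitFileName, long gen, Set<String> existingFiles, Body body) {
        try {
            if (commitFileName != null) {
                return body.load(commitFileName);  // explicit commit: no searching
            }
            return body.load(name(gen));           // newest generation first
        } catch (IOException first) {
            String prev = name(gen - 1);
            if (commitFileName == null && gen > 1 && existingFiles.contains(prev)) {
                try {
                    return body.load(prev);        // fall back once to segments_(N-1)
                } catch (IOException ignored) {
                    // fall through and report the original failure
                }
            }
            throw new UncheckedIOException(first); // keep the original root cause
        }
    }
}
```

As in the real code, the first exception is the one that is reported, since a failure on the fallback file says less about what went wrong than the failure on the file that was expected to be current.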

public abstract class IndexCommit {

  public abstract String getSegmentsFileName();

  public abstract Collection<String> getFileNames() throws IOException;

  public abstract Directory getDirectory();

  public void delete() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }

  public boolean isDeleted() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }

  public boolean isOptimized() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }

  @Override
  public boolean equals(Object other) {
    if (other instanceof IndexCommit) {
      IndexCommit otherCommit = (IndexCommit) other;
      return otherCommit.getDirectory().equals(getDirectory()) && otherCommit.getVersion() == getVersion();
    } else
      return false;
  }

  @Override
  public int hashCode() {
    return getDirectory().hashCode() + getSegmentsFileName().hashCode();
  }

  public long getVersion() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }

  public long getGeneration() {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }

  public long getTimestamp() throws IOException {
    return getDirectory().fileModified(getSegmentsFileName());
  }

  public Map<String,String> getUserData() throws IOException {
    throw new UnsupportedOperationException("This IndexCommit does not support this method.");
  }
}

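For our purposes IndexCommit acts as a value object exposing getSegmentsFileName() and getDirectory(), from which the segments file name and the directory can be obtained.

Back in DirectoryReader.open(): it creates an anonymous subclass, new SegmentInfos.FindSegmentsFile(dir) { doBody() { ... } }, which implements doBody, and then calls run(commit); run is what passes the segments file name to doBody().

The main flow for locating the segment information is as follows:

Find the newest segments_N

segments_N holds the metadata for the entire index, so choosing the right segments_N is critical.

The index may live on a distributed or shared file system used by several machines (the source comments mention stale NFS caches), so the lookup has to guard against seeing outdated information.

First, list the segments_N files in the directory and take the largest N as genA: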

String[] files = directory.listAll();

long genA = getCurrentSegmentGeneration(files);

long getCurrentSegmentGeneration(String[] files) {
  long max = -1;
  for (int i = 0; i < files.length; i++) {
    String file = files[i];
    if (file.startsWith(IndexFileNames.SEGMENTS) //"segments_N"
        && !file.equals(IndexFileNames.SEGMENTS_GEN)) { //"segments.gen"
      long gen = generationFromSegmentsFileName(file);
      if (gen > max) {
        max = gen;
      }
    }
  }
  return max;
}
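The helper generationFromSegmentsFileName is not shown above. A standalone sketch of the whole lookup, runnable without Lucene, might look like this. The class `SegmentGenerations` and the method `generationOf` are hypothetical names; parsing the suffix in radix 36 (`Character.MAX_RADIX`) matches how Lucene names its segments_N files, but treat the details as an approximation rather than the library's exact code.

```java
class SegmentGenerations {
    static final String SEGMENTS = "segments";
    static final String SEGMENTS_GEN = "segments.gen";

    // Extract N from "segments_N"; N is written in radix 36.
    static long generationOf(String fileName) {
        if (fileName.equals(SEGMENTS)) {
            return 0; // bare "segments": oldest, pre-lockless file name
        }
        return Long.parseLong(fileName.substring(SEGMENTS.length() + 1), Character.MAX_RADIX);
    }

    // Largest generation among the segments_N files, or -1 if there are none.
    static long currentGeneration(String[] files) {
        long max = -1;
        for (String file : files) {
            if (file.startsWith(SEGMENTS) && !file.equals(SEGMENTS_GEN)) {
                max = Math.max(max, generationOf(file));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        String[] files = {"_0.cfs", "segments_2", "segments_a", "segments.gen"};
        System.out.println(currentGeneration(files));
    }
}
```

Note that the comparison has to be numeric: comparing the file names as strings would rank "segments_9" above "segments_a", although the latter is generation 10.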

Second, open segments.gen and read genB from it. The larger of genA and genB becomes gen, and gen is then used to build the name of the segments_N file to open.
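The segments.gen check in the run method reads one int (a format marker) followed by two longs, and trusts the file only if the two longs agree. The sketch below imitates that with `java.io` streams over a byte array instead of Lucene's `IndexInput`. The class `SegmentsGenFile`, its methods, and the marker value `-2` used for `FORMAT_LOCKLESS` are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class SegmentsGenFile {
    static final int FORMAT_LOCKLESS = -2; // assumed marker value, for illustration

    // Produce the bytes of a segments.gen-like file: marker, then the
    // generation written twice so a reader can detect a torn write.
    static byte[] write(long gen0, long gen1) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(FORMAT_LOCKLESS);
            out.writeLong(gen0);
            out.writeLong(gen1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Return genB, or -1 if the content is truncated, has an unknown
    // marker, or the two copies of the generation disagree.
    static long readGenB(byte[] content) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(content))) {
            if (in.readInt() != FORMAT_LOCKLESS) {
                return -1;
            }
            long gen0 = in.readLong();
            long gen1 = in.readLong();
            return gen0 == gen1 ? gen0 : -1;
        } catch (IOException e) {
            return -1;
        }
    }

    // Pick the larger of the directory-listing and segments.gen generations.
    static long chooseGeneration(long genA, long genB) {
        return Math.max(genA, genB);
    }
}
```

Taking the maximum means a stale directory listing and a stale segments.gen would both have to be out of date at the same time before an old generation is chosen, which is the "better chance of getting the right generation" that the source comment refers to.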
