2013-10-11 17:30
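In Lucene, the retrieved documents are represented as posting lists (inverted lists), with one posting list per query term. The length of each list then …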




For an OR operation the result is the union. The original DisjunctionSumScorer has a member variable List<Scorer> subScorers, which is a Scorer …




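(1) Suppose minimumNrMatchers = 4, and the posting lists initially look as follows:

(2) In the DisjunctionSumScorer constructor, the posting lists are placed in a priority queue, scorerDocQueue (implemented as a min-heap), in which the Scorers are ordered by the number of their first document.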

private void initScorerDocQueue() throws IOException {
  scorerDocQueue = new ScorerDocQueue(nrScorers);
  for (Scorer se : subScorers) {
    if (se.nextDoc() != NO_MORE_DOCS) { // nextDoc() here positions each Scorer on its first document number.
      scorerDocQueue.insert(se);
    }
  }
}
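The queue initialization can be illustrated outside Lucene with a minimal sketch. It assumes a hypothetical `Cursor` class standing in for `Scorer` and uses `java.util.PriorityQueue` in place of Lucene's `ScorerDocQueue`; neither name comes from the original code.

```java
import java.util.*;

public class ScorerQueueSketch {
    /** Hypothetical stand-in for a Scorer: a cursor over a sorted posting list. */
    static final class Cursor {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        final int id;
        final int[] docs;
        int pos = -1; // not yet positioned on any document

        Cursor(int id, int... docs) {
            this.id = id;
            this.docs = docs;
        }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() {
            pos++;
            return docID();
        }
    }

    /** Positions every cursor on its first document and queues the non-empty ones. */
    static PriorityQueue<Cursor> init(List<Cursor> subScorers) {
        // Min-heap ordered by the document each cursor is currently on.
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>(Comparator.comparingInt(Cursor::docID));
        for (Cursor c : subScorers) {
            if (c.nextDoc() != Cursor.NO_MORE_DOCS) {
                queue.add(c);
            }
        }
        return queue;
    }
}
```

Cursors over empty posting lists never enter the queue, so the queue size directly gives the number of sub-scorers that can still match.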





(3) When nextDoc() is called for the first time in BooleanScorer2.score(Collector), advanceAfterCurrent is invoked.

public int nextDoc() throws IOException {
  if (scorerDocQueue.size() < minimumNrMatchers || !advanceAfterCurrent()) {
    currentDoc = NO_MORE_DOCS;
  }
  return currentDoc;
}


protected boolean advanceAfterCurrent() throws IOException {
  do {
    currentDoc = scorerDocQueue.topDoc(); // the current document number is the one at the top of the heap
    currentScore = scorerDocQueue.topScore(); // score of the current document
    nrMatchers = 1; // number of sub-clauses the current document satisfies, i.e. the number of Scorers containing this document number
    do {
      if (!scorerDocQueue.topNextAndAdjustElsePop()) {
        if (scorerDocQueue.size() == 0) {
          break; // nothing more to advance, check for last match.
        }
      }
      if (scorerDocQueue.topDoc() != currentDoc) {
        break; // All remaining subscorers are after currentDoc.
      }
      currentScore += scorerDocQueue.topScore();
      nrMatchers++;
    } while (true);

    if (nrMatchers >= minimumNrMatchers) {
      return true;
    } else if (scorerDocQueue.size() < minimumNrMatchers) {
      return false;
    }
  } while (true);
}




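The topmost Scorer 0 advances to its next document, document 3, and the min-heap is re-adjusted as shown in the figure below. currentDoc now equals the first document number of the new top, Scorer 1 (both are 2), so nrMatchers for document 2 is 2.

The topmost Scorer 1 advances to its next document, document 8, and the min-heap is re-adjusted as shown in the figure below. currentDoc now equals the first document number of the new top, Scorer 3 (both are 2), so nrMatchers for document 2 is 3.

The topmost Scorer 3 advances to its next document, document 7, and the min-heap is re-adjusted as shown in the figure below. currentDoc is still 2, which differs from the first document of the new top, Scorer 2 (document 3), so the inner loop exits. The check then finds that nrMatchers for document 2 is 3, less than minimumNrMatchers, so the condition is not met. currentDoc is set to 3, the first document of the topmost Scorer 2, nrMatchers is reset to 1, and the next round of the loop begins.

The topmost Scorer 2 advances to its next document, document 5, and the min-heap is re-adjusted as shown in the figure below. currentDoc now equals the first document number of the new top, Scorer 4 (both are 3), so nrMatchers for document 3 is 2.

The topmost Scorer 4 advances to its next document, document 7, and the min-heap is re-adjusted as shown in the figure below. currentDoc now equals the first document number of the new top, Scorer 0 (both are 3), so nrMatchers for document 3 is 3.

The topmost Scorer 0 advances to its next document, document 5, and the min-heap is re-adjusted as shown in the figure below. currentDoc is still 3, which differs from the first document of the topmost Scorer 0 (document 5), so the inner loop exits. The check then finds that nrMatchers for document 3 is 3, less than minimumNrMatchers, so the condition is not met. currentDoc is set to 5, the first document of the topmost Scorer 0, nrMatchers is reset to 1, and the next round of the loop begins.

The topmost Scorer 0 advances to its next document, document 7, and the min-heap is re-adjusted as shown in the figure below. currentDoc now equals the first document number of the new top, Scorer 2 (both are 5), so nrMatchers for document 5 is 2.

The topmost Scorer 2 advances to its next document, document 7, and the min-heap is re-adjusted as shown in the figure below. currentDoc is still 5, which differs from the first document of the topmost Scorer 2 (document 7), so the inner loop exits. The check then finds that nrMatchers for document 5 is 2, less than minimumNrMatchers, so the condition is not met. currentDoc is set to 7, the first document of the topmost Scorer 2, nrMatchers is reset to 1, and the next round of the loop begins.

The topmost Scorer 2 advances to its next document, document 8, and the min-heap is re-adjusted as shown in the figure below. currentDoc now equals the first document number of the new top, Scorer 3 (both are 7), so nrMatchers for document 7 is 2.

The topmost Scorer 3 advances to its next document, document 9, and the min-heap is re-adjusted as shown in the figure below. currentDoc now equals the first document number of the new top, Scorer 4 (both are 7), so nrMatchers for document 7 is 3.

The topmost Scorer 4 tries to advance to its next document but has none left. All of Scorer 4's documents have been traversed, so it is popped from the queue and the min-heap is re-adjusted as shown in the figure below. currentDoc now equals the first document number of the new top, Scorer 0 (both are 7), so nrMatchers for document 7 is 4.

The topmost Scorer 0 advances to its next document, document 9, and the min-heap is re-adjusted as shown in the figure below. currentDoc is still 7, which differs from the first document of the new top, Scorer 1 (document 8), so the inner loop exits. The check then finds that nrMatchers for document 7 is 4, which is greater than or equal to minimumNrMatchers, so the condition is met, true is returned, and the outer loop exits.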

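(4) currentDoc is now 7. While documents are being collected, DisjunctionSumScorer.docID() is called and returns currentDoc, so the current document number is 7.

(5) When nextDoc() is called again, documents 8, 9, and 11 all fail the requirement, so NO_MORE_DOCS is finally returned and the merging of the posting lists ends.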
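The walkthrough above can be replayed with a small stand-alone simulation. This is only a sketch: `Cursor` and `matches` are hypothetical names, `java.util.PriorityQueue` replaces Lucene's ScorerDocQueue, and the posting lists used in the usage note are reconstructed from the steps above, with the tail of each list (for example, which list holds document 11) being an assumption.

```java
import java.util.*;

public class MinMatchUnionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a Scorer over a sorted posting list. */
    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) {
            this.docs = docs;
        }

        int doc() {
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Returns all documents contained in at least minimumNrMatchers posting lists. */
    static List<Integer> matches(int minimumNrMatchers, int[][] postings) {
        if (minimumNrMatchers < 1) {
            throw new IllegalArgumentException("minimumNrMatchers must be positive");
        }
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>(Comparator.comparingInt(Cursor::doc));
        for (int[] p : postings) {
            if (p.length > 0) {
                queue.add(new Cursor(p));
            }
        }
        List<Integer> result = new ArrayList<>();
        // Once fewer than minimumNrMatchers lists remain, nothing can match anymore.
        while (queue.size() >= minimumNrMatchers) {
            int currentDoc = queue.peek().doc();
            int nrMatchers = 0;
            // Advance every cursor sitting on currentDoc, counting as we go.
            while (!queue.isEmpty() && queue.peek().doc() == currentDoc) {
                Cursor top = queue.poll();
                nrMatchers++;
                top.pos++;
                if (top.doc() != NO_MORE_DOCS) {
                    queue.add(top); // re-insert at its new position
                }               // otherwise the exhausted cursor is dropped
            }
            if (nrMatchers >= minimumNrMatchers) {
                result.add(currentDoc);
            }
        }
        return result;
    }
}
```

Called with the lists {2,3,5,7,9}, {2,8,11}, {3,5,7,8}, {2,7,9}, {3,7} and minimumNrMatchers = 4, the sketch reports document 7 as the only match, in line with steps (3) to (5).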



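Stefan Pohl observed that skip lists already make ConjunctionScorer (intersection) very efficient, so could the MinShouldMatch condition be exploited to do as much intersection as possible inside DisjunctionSumScorer (union)? His approach is to first take MinShouldMatch-1 scorers out of the subScorers list and store them in an array, mmStack, and to put the rest into a heap. A candidate docid is first obtained from the heap, and the number of heap scorers currently positioned on that docid is counted. The docid is then intersected with the posting list of each scorer in mmStack to determine whether MinShouldMatch is satisfied.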


class MinShouldMatchSumScorer extends Scorer {

  /** The overall number of non-finalized scorers */
  private int numScorers;
  /** The minimum number of scorers that should match */
  private final int mm;

  /** A static array of all subscorers sorted by decreasing cost */
  private final Scorer sortedSubScorers[];
  /** A monotonically increasing index into the array pointing to the next subscorer that is to be excluded */
  private int sortedSubScorersIdx = 0;

  private final Scorer subScorers[]; // the first numScorers-(mm-1) entries are valid
  private int nrInHeap; // 0..(numScorers-(mm-1)-1)

  /** mmStack is supposed to contain the most costly subScorers that still did
   *  not run out of docs, sorted by increasing sparsity of docs returned by that subScorer.
   *  For now, the cost of subscorers is assumed to be inversely correlated with sparsity.
   */
  private final Scorer mmStack[]; // of size mm-1: 0..mm-2, always full

  /** The document number of the current match. */
  private int doc = -1;
  /** The number of subscorers that provide the current match. */
  protected int nrMatchers = -1;
  private double score = Float.NaN;

  /**
   * Construct a <code>MinShouldMatchSumScorer</code>.
   * @param weight The weight to be used.
   * @param subScorers A collection of at least two subscorers.
   * @param minimumNrMatchers The positive minimum number of subscorers that should
   * match to match this query.
   * <br>When <code>minimumNrMatchers</code> is bigger than
   * the number of <code>subScorers</code>, no matches will be produced.
   * <br>When minimumNrMatchers equals the number of subScorers,
   * it is more efficient to use <code>ConjunctionScorer</code>.
   */
  public MinShouldMatchSumScorer(List<Scorer> subScorersList, int minimumNrMatchers) throws IOException {
    this.nrInHeap = this.numScorers = subScorersList.size();
    if (minimumNrMatchers <= 0) {
      throw new IllegalArgumentException("Minimum nr of matchers must be positive");
    }
    if (numScorers <= 1) {
      throw new IllegalArgumentException("There must be at least 2 subScorers");
    }

    this.mm = minimumNrMatchers;
    this.sortedSubScorers = subScorersList.toArray(new Scorer[this.numScorers]);
    // sorting by decreasing subscorer cost should be inversely correlated with
    // next docid (assuming costs are due to generating many postings)
    // This sort is one of the optimization steps. Why sort by cost()?
    ArrayUtil.mergeSort(sortedSubScorers, new Comparator<Scorer>() {
      public int compare(Scorer o1, Scorer o2) {
        //return Long.signum(o2.docID() - o1.docID());
        return Long.signum(o2.cost() - o1.cost());
      }
    });
    // take mm-1 most costly subscorers aside
    this.mmStack = new Scorer[mm-1];
    for (int i = 0; i < mm-1; i++) {
      mmStack[i] = sortedSubScorers[i];
    }
    nrInHeap -= mm-1;
    this.sortedSubScorersIdx = mm-1;
    // take remaining into heap, if any, and heapify
    this.subScorers = new Scorer[nrInHeap];
    for (int i = 0; i < nrInHeap; i++) {
      this.subScorers[i] = this.sortedSubScorers[mm-1+i];
    }
    minheapHeapify();
    assert minheapCheck();
  }

  public int nextDoc() throws IOException {
    assert doc != NO_MORE_DOCS;
    while (true) {
      // to remove current doc, call next() on all subScorers on current doc within heap
      while (subScorers[0].docID() == doc) {
        if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
          minheapSiftDown(0);
        } else {
          minheapRemoveRoot();
          numScorers--;
          if (numScorers < mm) {
            return doc = NO_MORE_DOCS;
          }
        }
        //assert minheapCheck();
      }

      evaluateSmallestDocInHeap();

      if (nrMatchers >= mm) { // doc satisfies mm constraint
        break;
      }
    }
    return doc;
  }

  private void evaluateSmallestDocInHeap() throws IOException {
    // within heap, subScorer[0] now contains the next candidate doc
    doc = subScorers[0].docID();
    if (doc == NO_MORE_DOCS) {
      nrMatchers = Integer.MAX_VALUE; // stop looping
      return;
    }
    // 1. score and count number of matching subScorers within heap
    score = subScorers[0].score();
    nrMatchers = 1;
    countMatches(1);
    countMatches(2);
    // 2. score and count number of matching subScorers within stack,
    // short-circuit: stop when mm can't be reached for current doc, then perform on heap next()
    // TODO instead advance() might be possible, but complicates things
    for (int i = mm-2; i >= 0; i--) { // first advance sparsest subScorer
      if (mmStack[i].docID() >= doc || mmStack[i].advance(doc) != NO_MORE_DOCS) {
        if (mmStack[i].docID() == doc) { // either it was already on doc, or got there via advance()
          nrMatchers++;
          score += mmStack[i].score();
        } else { // scorer advanced to next after doc, check if enough scorers left for current doc
          if (nrMatchers + i < mm) { // too few subScorers left, abort advancing
            return; // continue looping TODO consider advance() here
          }
        }
      } else { // subScorer exhausted
        numScorers--;
        if (numScorers < mm) { // too few subScorers left
          doc = NO_MORE_DOCS;
          nrMatchers = Integer.MAX_VALUE; // stop looping
          return;
        }
        if (mm-2-i > 0) {
          // shift RHS of array left
          System.arraycopy(mmStack, i+1, mmStack, i, mm-2-i);
        }
        // find next most costly subScorer within heap TODO can this be done better?
        while (!minheapRemove(sortedSubScorers[sortedSubScorersIdx++])) {
          //assert minheapCheck();
        }
        // add the subScorer removed from heap to stack
        mmStack[mm-2] = sortedSubScorers[sortedSubScorersIdx-1];

        if (nrMatchers + i < mm) { // too few subScorers left, abort advancing
          return; // continue looping TODO consider advance() here
        }
      }
    }
  }

  // ... (heap helper methods and the remaining Scorer methods are not shown)
}
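To make the control flow easier to follow, here is a simplified, self-contained simulation of the idea. It is a sketch under several assumptions rather than the Lucene implementation: `Cursor` and `matches` are hypothetical names, cost is taken to be the posting-list length, `advance` is a linear scan instead of a skip-list jump, and an exhausted mmStack entry is simply left in place instead of being replaced by a scorer taken from the heap.

```java
import java.util.*;

public class MinShouldMatchSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a Scorer over a sorted posting list. */
    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) {
            this.docs = docs;
        }

        int doc() {
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        long cost() {
            return docs.length; // assumption: cost == list length
        }

        /** Moves to the first doc >= target (a real scorer would use skip lists). */
        int advance(int target) {
            while (doc() < target) {
                pos++;
            }
            return doc();
        }
    }

    static List<Integer> matches(int mm, int[][] postings) {
        if (mm < 1) {
            throw new IllegalArgumentException("mm must be positive");
        }
        List<Cursor> sorted = new ArrayList<>();
        for (int[] p : postings) {
            sorted.add(new Cursor(p));
        }
        sorted.sort((a, b) -> Long.compare(b.cost(), a.cost())); // decreasing cost

        // The mm-1 most costly lists are set aside; candidates come from the heap only.
        int stackSize = Math.min(mm - 1, sorted.size());
        List<Cursor> mmStack = sorted.subList(0, stackSize);
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparingInt(Cursor::doc));
        heap.addAll(sorted.subList(stackSize, sorted.size()));

        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty() && heap.peek().doc() != NO_MORE_DOCS) {
            int doc = heap.peek().doc();
            int nrMatchers = 0;
            // 1. count heap cursors on doc and move them past it
            while (heap.peek().doc() == doc) {
                Cursor top = heap.poll();
                nrMatchers++;
                top.pos++;
                heap.add(top); // exhausted cursors sink to the bottom as NO_MORE_DOCS
            }
            // 2. intersect with the stack, sparsest (last) entry first
            for (int i = stackSize - 1; i >= 0; i--) {
                if (mmStack.get(i).advance(doc) == doc) {
                    nrMatchers++;
                } else if (nrMatchers + i < mm) {
                    break; // the i remaining entries cannot lift the count to mm
                }
            }
            if (nrMatchers >= mm) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

Taking candidates only from the heap is safe because a document that matches at least mm lists cannot be confined to the mm-1 lists in mmStack, so it must appear in at least one heap list.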



1. First, the subScorers are sorted by cost() in descending order. The 4.3.0 API describes the cost() function as follows:

  /**
   * Returns the estimated cost of this {@link DocIdSetIterator}.
   * <p>
   * This is generally an upper bound of the number of documents this iterator
   * might match, but may be a rough heuristic, hardcoded value, or otherwise
   * completely inaccurate.
   */
  public abstract long cost();
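As a rough illustration of how the cost ordering decides which sub-scorers are set aside, the sketch below sorts hypothetical (term, cost) pairs by decreasing cost and returns the first mm-1 of them. The `Sub` class and `setAside` method are invented for this example, and `Long.compare` is used here as an alternative to taking `Long.signum` of a difference.

```java
import java.util.*;

public class CostSplitSketch {
    /** Hypothetical stand-in for a sub-scorer: only its estimated cost matters here. */
    static final class Sub {
        final String term;
        final long cost;

        Sub(String term, long cost) {
            this.term = term;
            this.cost = cost;
        }
    }

    /** Returns the terms of the mm-1 most costly sub-scorers, most costly first. */
    static List<String> setAside(List<Sub> subs, int mm) {
        List<Sub> sorted = new ArrayList<>(subs);
        sorted.sort((a, b) -> Long.compare(b.cost, a.cost)); // decreasing cost
        List<String> aside = new ArrayList<>();
        for (int i = 0; i < mm - 1 && i < sorted.size(); i++) {
            aside.add(sorted.get(i).term);
        }
        return aside;
    }
}
```

With costs a=10, b=1000, c=50, d=3 and mm = 3, the two terms set aside are b and c, which leaves the short lists a and d to generate candidates.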

- 1.

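3. When intersecting with mmStack, the Scorers in mmStack are visited from the shortest posting list to the longest. The benefit is that as soon as the number of remaining Scorers plus the current match count (nrMatchers) falls below the required minimum, the loop exits, and the remaining Scorers, which have the longer lists, no longer need to be compared.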
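The early exit in point 3 can be isolated in a few lines. In this sketch, `examined` is a hypothetical helper: `stackMatches[i]` states whether the i-th mmStack entry contains the candidate document, with the highest index being the sparsest entry, and `heapMatches` is the count already obtained from the heap.

```java
public class ShortCircuitSketch {
    /** Returns how many mmStack entries are examined before the loop stops. */
    static int examined(boolean[] stackMatches, int heapMatches, int mm) {
        int nrMatchers = heapMatches;
        int examined = 0;
        for (int i = stackMatches.length - 1; i >= 0; i--) { // sparsest entry first
            examined++;
            if (stackMatches[i]) {
                nrMatchers++;
            } else if (nrMatchers + i < mm) {
                // Even if all i remaining entries matched, mm would not be reached.
                break;
            }
        }
        return examined;
    }
}
```

For mm = 4, one heap match, and a miss on the sparsest of three stack entries, only one entry is examined, because 1 + 2 remaining entries cannot reach 4. The two longer lists are never touched.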


