
Lucene Source Code Analysis, Part 11

Lucene Source Code Analysis: The BooleanQuery Scoring Process

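Earlier chapters analyzed the query process of BooleanQuery but covered scoring only briefly. This chapter goes back and looks at the BooleanQuery scoring process in detail, starting from its score function.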

BooleanScorer::score

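  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {

    ...

    // Position on the clause whose next candidate document is the smallest one >= min.
    BulkScorerAndDoc top = advance(min);
    // Score one window at a time until the next candidate reaches max.
    while (top.next < max) {
      top = scoreWindow(top, collector, singleClauseCollector, acceptDocs, min, max);
    }

    return top.next;
  }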

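The advance function first obtains the BulkScorerAndDoc for the first matching document; its next member is the document ID. The loop then calls scoreWindow repeatedly to process the matching documents, and by default each scoreWindow call handles at most 2048 documents.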

BooleanScorer::score->advance

  private BulkScorerAndDoc advance(int min) throws IOException {
    final HeadPriorityQueue head = this.head;
    final TailPriorityQueue tail = this.tail;
    BulkScorerAndDoc headTop = head.top();
    BulkScorerAndDoc tailTop = tail.top();
    // Keep advancing until the top of head is no longer behind min.
    while (headTop.next < min) {
      ...

      headTop.advance(min);
      headTop = head.updateTop();

      ...

    }
    return headTop;
  }

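head is a HeadPriorityQueue whose top function returns a BulkScorerAndDoc. After the top element has been advanced, updateTop restores the heap order so that the BulkScorerAndDoc with the smallest next document ID is at the front, and returns it.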
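
The "advance the top, then restore heap order" loop can be illustrated with a small self-contained sketch. It is not Lucene's code: ClauseCursor and HeadQueueSketch are hypothetical stand-ins, and because java.util.PriorityQueue has no updateTop, the sketch polls and re-adds the element instead.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for BulkScorerAndDoc: one cursor per clause.
final class ClauseCursor {
    final int[] docs; // sorted doc IDs matched by this clause
    int pos = 0;
    int next;         // next candidate doc ID, Integer.MAX_VALUE when exhausted

    ClauseCursor(int... docs) {
        this.docs = docs;
        this.next = docs.length > 0 ? docs[0] : Integer.MAX_VALUE;
    }

    // Move forward to the first doc ID >= min.
    void advance(int min) {
        while (pos < docs.length && docs[pos] < min) {
            pos++;
        }
        next = pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
    }
}

final class HeadQueueSketch {
    // Build a queue ordered by the next candidate doc ID of each cursor.
    static PriorityQueue<ClauseCursor> newHead(ClauseCursor... cursors) {
        PriorityQueue<ClauseCursor> head =
            new PriorityQueue<>(Comparator.comparingInt((ClauseCursor c) -> c.next));
        for (ClauseCursor c : cursors) {
            head.add(c);
        }
        return head;
    }

    // Advance every cursor that is behind min; return the one with the smallest next.
    // The queue must not be empty.
    static ClauseCursor advance(PriorityQueue<ClauseCursor> head, int min) {
        while (head.peek().next < min) {
            ClauseCursor top = head.poll(); // remove before mutating the sort key
            top.advance(min);
            head.add(top);                  // re-insert so heap order is restored
        }
        return head.peek();
    }
}
```

With cursors over {1, 9, 40}, {3, 12} and {7}, calling advance(head, 8) leaves the first cursor on top with next equal to 9.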

BooleanScorer::score->advance->BulkScorerAndDoc::advance

    // BulkScorerAndDoc
    void advance(int min) throws IOException {
      // min == max is an empty range: nothing is collected, only next is updated.
      score(orCollector, null, min, min);
    }

    void score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
      next = scorer.score(collector, acceptDocs, min, max);
    }

    // DefaultBulkScorer
    public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
      collector.setScorer(scorer);
      if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
        ...
      } else {
        int doc = scorer.docID();
        if (doc < min) {
          doc = iterator.advance(min);
        }
        return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
      }
    }

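When a document ID is fetched for the first time, docID returns -1 and min is 0, so the iterator's advance function is called to obtain the document ID. The iterator is a BlockDocsEnum, and its advance function reads document information from the corresponding .doc file.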

BooleanScorer::score->advance->BulkScorerAndDoc::advance->score->DefaultBulkScorer::score->BlockDocsEnum::advance

    public int advance(int target) throws IOException {

      if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {
        ...
      }

      if (docUpto == docFreq) {
        return doc = NO_MORE_DOCS;
      }

      if (docBufferUpto == BLOCK_SIZE) {
        refillDocs();
      }

      while (true) {
        accum += docDeltaBuffer[docBufferUpto];
        docUpto++;

        if (accum >= target) {
          break;
        }
        docBufferUpto++;
        if (docUpto == docFreq) {
          return doc = NO_MORE_DOCS;
        }
      }

      freq = freqBuffer[docBufferUpto];
      docBufferUpto++;
      return doc = accum;
    }

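docUpto counts how many documents of the posting list have been consumed so far, and docBufferUpto is the read position inside the current in-memory block. BLOCK_SIZE is the number of entries per block; once the current block has been fully consumed (docBufferUpto == BLOCK_SIZE), refillDocs reads the next block from the .doc file into the buffers. docDeltaBuffer stores document IDs in delta-encoded form, meaning each entry is the gap to the previous document ID, which is why the loop accumulates entries into accum. freqBuffer stores the corresponding term frequencies. The function returns the first document ID that is greater than or equal to target.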
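
To make the delta encoding concrete, here is a minimal sketch of an advance over a single in-memory block. It leaves out skip data and block refilling, and the class DeltaPostingsSketch with its encode helper is hypothetical, not part of Lucene.

```java
final class DeltaPostingsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docDeltas; // gap to the previous doc ID (first entry is the doc ID itself)
    private final int[] freqs;     // term frequency per document, stored as-is
    private int upto = 0;          // number of entries consumed so far
    private int accum = 0;         // running sum of deltas, i.e. the current doc ID
    int freq;                      // frequency of the document last visited

    DeltaPostingsSketch(int[] docDeltas, int[] freqs) {
        this.docDeltas = docDeltas;
        this.freqs = freqs;
    }

    // Turn sorted doc IDs into gaps: {3, 7, 20, 21} becomes {3, 4, 13, 1}.
    static int[] encode(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int prev = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - prev;
            prev = sortedDocIds[i];
        }
        return deltas;
    }

    // Return the first doc ID >= target, or NO_MORE_DOCS if the list is exhausted.
    int advance(int target) {
        while (upto < docDeltas.length) {
            accum += docDeltas[upto];
            freq = freqs[upto];
            upto++;
            if (accum >= target) {
                return accum;
            }
        }
        return NO_MORE_DOCS;
    }
}
```

For the doc IDs {3, 7, 20, 21} with frequencies {2, 1, 5, 1}, advance(5) returns 7 and a following advance(20) returns 20 with freq equal to 5.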

After the first document ID has been obtained, BooleanScorer's score function goes on to process the matching documents through scoreWindow.

BooleanScorer::score->scoreWindow

  private BulkScorerAndDoc scoreWindow(BulkScorerAndDoc top, LeafCollector collector,
      LeafCollector singleClauseCollector, Bits acceptDocs, int min, int max) throws IOException {
    final int windowBase = top.next & ~MASK;
    final int windowMin = Math.max(min, windowBase);
    final int windowMax = Math.min(max, windowBase + SIZE);

    // Pop every clause whose next document falls inside this window.
    leads[0] = head.pop();
    int maxFreq = 1;
    while (head.size() > 0 && head.top().next < windowMax) {
      leads[maxFreq++] = head.pop();
    }

    if (minShouldMatch == 1 && maxFreq == 1) {

      ...

    } else {
      scoreWindowMultipleScorers(collector, acceptDocs, windowBase, windowMin, windowMax, maxFreq);
      return head.top();
    }
  }

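scoreWindow handles at most SIZE documents per call. windowMin and windowMax bound the document IDs handled in the current window (windowMax itself is excluded). Next, every BulkScorerAndDoc whose next document falls inside the window is popped from head and stored in the leads array, and finally scoreWindowMultipleScorers is called to continue the processing.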
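
The window bounds are plain bit arithmetic. The sketch below assumes SIZE is 2048 (the default mentioned earlier) and MASK is SIZE - 1; the class WindowMath and the constant name SHIFT are introduced here only for illustration.

```java
final class WindowMath {
    static final int SHIFT = 11;        // assumption: 2^11 = 2048 docs per window
    static final int SIZE = 1 << SHIFT; // 2048
    static final int MASK = SIZE - 1;   // 2047, the low 11 bits

    // Clear the low bits: the first doc ID of the window that contains doc.
    static int windowBase(int doc) {
        return doc & ~MASK;
    }

    // Lower bound (inclusive), never below the caller's min.
    static int windowMin(int doc, int min) {
        return Math.max(min, windowBase(doc));
    }

    // Upper bound (exclusive), never above the caller's max.
    static int windowMax(int doc, int max) {
        return Math.min(max, windowBase(doc) + SIZE);
    }
}
```

For doc 5000 the window base is 4096, so with min = 4500 and max = 5500 the window is clipped to [4500, 5500) instead of the full [4096, 6144).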

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers

  private void scoreWindowMultipleScorers(LeafCollector collector, Bits acceptDocs,
      int windowBase, int windowMin, int windowMax, int maxFreq) throws IOException {

    ...

    if (maxFreq >= minShouldMatch) {

      ...

      scoreWindowIntoBitSetAndReplay(collector, acceptDocs, windowBase, windowMin, windowMax, leads, maxFreq);
    }

    ...
  }

scoreWindowMultipleScorers in turn calls scoreWindowIntoBitSetAndReplay.

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay

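  private void scoreWindowIntoBitSetAndReplay(LeafCollector collector, Bits acceptDocs,
      int base, int min, int max, BulkScorerAndDoc[] scorers, int numScorers) throws IOException {
    // Phase 1: every clause records its matches and scores for this window.
    for (int i = 0; i < numScorers; ++i) {
      final BulkScorerAndDoc scorer = scorers[i];
      scorer.score(orCollector, acceptDocs, min, max);
    }

    // Phase 2: replay the recorded matches into the real collector, then reset the bit set.
    scoreMatches(collector, base);
    Arrays.fill(matching, 0L);
  }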

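scoreWindowIntoBitSetAndReplay iterates over the current BulkScorerAndDoc array and calls score on each entry to compute scores; that call eventually reaches OrCollector's collect function. scoreMatches then does the final processing of this window's results. Finally the matching array is cleared so that the next window of 2048 documents can be analyzed.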

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay->BulkScorerAndDoc::score->DefaultBulkScorer::score->scoreRange->OrCollector::collect

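    public void collect(int doc) throws IOException {
      final int i = doc & MASK;   // offset of doc inside the current window
      final int idx = i >>> 6;    // index of the 64-bit word that holds the bit
      matching[idx] |= 1L << i;   // mark the document as matched
      final Bucket bucket = buckets[i];
      bucket.freq++;
      bucket.score += scorer.score();
    }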

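collect is called once for every document a clause matches inside the current window of at most 2048 documents. The member variable matching records as bits which documents matched, and buckets stores, for each of those documents, the number of matching clauses and the sum of the scores obtained from each clause's score function. The 2048 documents are split into 32 groups, and each group uses the 64 bits of one long to record which of its documents matched.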
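
The bookkeeping can be tried out in isolation with the sketch below. It assumes a window of 2048 documents; WindowBitsSketch is a hypothetical class, and plain arrays replace the Bucket objects to keep it short.

```java
final class WindowBitsSketch {
    static final int SIZE = 2048;
    static final int MASK = SIZE - 1;

    final long[] matching = new long[SIZE / 64]; // 32 words of 64 bits each
    final int[] freq = new int[SIZE];            // number of clauses that matched each doc
    final double[] score = new double[SIZE];     // sum of clause scores for each doc

    // Record that one clause matched doc with the given score.
    void collect(int doc, float clauseScore) {
        final int i = doc & MASK; // offset inside the window
        final int idx = i >>> 6;  // which 64-bit word
        // Java uses only the low 6 bits of a long shift count,
        // so 1L << i is the same as 1L << (i & 63).
        matching[idx] |= 1L << i;
        freq[i]++;
        score[i] += clauseScore;
    }
}
```

Collecting document 4096 + 70 twice sets bit 6 of matching[1] once, raises freq[70] to 2, and sums both clause scores in score[70].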

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay->scoreMatches

  private void scoreMatches(LeafCollector collector, int base) throws IOException {
    long matching[] = this.matching;
    for (int idx = 0; idx < matching.length; idx++) {
      long bits = matching[idx];
      while (bits != 0L) {
        // Position of the lowest set bit in this word.
        int ntz = Long.numberOfTrailingZeros(bits);
        int doc = idx << 6 | ntz;
        scoreDocument(collector, base, doc);
        // Clear that bit and continue with the next one.
        bits ^= 1L << ntz;
      }
    }
  }

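scoreMatches walks the set bits to recover which document numbers matched, then calls scoreDocument for each of them to compute the final score and hand the document on for ranking.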
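
The bit-walking idiom also works outside Lucene. A minimal sketch, with a hypothetical class BitReplaySketch that returns the positions of all set bits in ascending order:

```java
import java.util.ArrayList;
import java.util.List;

final class BitReplaySketch {
    // List the positions of all set bits, lowest first.
    static List<Integer> setBits(long[] words) {
        List<Integer> out = new ArrayList<>();
        for (int idx = 0; idx < words.length; idx++) {
            long bits = words[idx];
            while (bits != 0L) {
                int ntz = Long.numberOfTrailingZeros(bits); // lowest set bit
                out.add(idx << 6 | ntz);                    // word index * 64 + bit index
                bits ^= 1L << ntz;                          // clear it
            }
        }
        return out;
    }
}
```

Each iteration of the inner loop costs one numberOfTrailingZeros call, so empty words are skipped with a single comparison and the work is proportional to the number of matches rather than to the window size.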

BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay->scoreMatches->scoreDocument

  private void scoreDocument(LeafCollector collector, int base, int i) throws IOException {
    final FakeScorer fakeScorer = this.fakeScorer;
    final Bucket bucket = buckets[i];
    if (bucket.freq >= minShouldMatch) {
      fakeScorer.freq = bucket.freq;
      fakeScorer.score = (float) bucket.score * coordFactors[bucket.freq];
      final int doc = base | i;
      fakeScorer.doc = doc;
      collector.collect(doc);
    }
    bucket.freq = 0;
    bucket.score = 0;
  }

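This function computes the document ID, takes the previously accumulated result from the buckets array, and then calls collect to handle the final result. The collector here is the SimpleTopScoreDocCollector that was created at the very beginning; its collect function compares scores and orders the documents that will finally be returned.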
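
The two decisions made per bucket, the minShouldMatch filter and the scaling by coordFactors, can be isolated as follows. BucketScoreSketch is a hypothetical helper and the coordFactors values in the usage note are made up; how Lucene fills that array is not covered in this chapter.

```java
final class BucketScoreSketch {
    // Return the final score of a bucket, or Float.NaN when too few clauses matched.
    static float finalScore(int freq, double scoreSum, int minShouldMatch, float[] coordFactors) {
        if (freq < minShouldMatch) {
            return Float.NaN; // the document is dropped and never reaches the collector
        }
        // The summed clause scores are scaled by a factor chosen by the number of matching clauses.
        return (float) scoreSum * coordFactors[freq];
    }
}
```

With the made-up factors {0f, 0.5f, 1f}, a document that matched one clause with a summed score of 3.0 ends up with 1.5, while the same sum with two matching clauses stays at 3.0.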
