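Lucene source code analysis: the BooleanQuery scoring process
Earlier chapters analyzed the BooleanQuery query process but only touched briefly on scoring. This chapter goes back over how BooleanQuery scores documents, starting from its score function.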
BooleanScorer::score
public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
    ...
    BulkScorerAndDoc top = advance(min);
    // score one window at a time until the next candidate document reaches max
    while (top.next < max) {
        top = scoreWindow(top, collector, singleClauseCollector, acceptDocs, min, max);
    }
    return top.next;
}
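The advance function first obtains the BulkScorerAndDoc for the first matching document; its next member holds that document's ID. The loop then calls scoreWindow repeatedly until top.next reaches max, and each scoreWindow call handles a window of at most 2048 documents.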
BooleanScorer::score->advance
private BulkScorerAndDoc advance(int min) throws IOException {
    final HeadPriorityQueue head = this.head;
    final TailPriorityQueue tail = this.tail;
    BulkScorerAndDoc headTop = head.top();
    BulkScorerAndDoc tailTop = tail.top();
    // keep advancing the head entry until its next document is at least min
    while (headTop.next < min) {
        ...
        headTop.advance(min);
        headTop = head.updateTop();
        ...
    }
    return headTop;
}
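head is a HeadPriorityQueue whose top function returns a BulkScorerAndDoc. After the top entry has been advanced, updateTop restores the queue order, which puts the BulkScorerAndDoc with the smallest next document ID first, and returns it.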
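The ordering of the head queue can be tried out in isolation. The following is a minimal sketch under stated assumptions, not Lucene's implementation: HeadQueueDemo, Cursor, and headOf are hypothetical names, each clause is an in-memory sorted array of doc IDs, and java.util.PriorityQueue stands in for Lucene's own queue, so the updateTop step is emulated by removing and re-inserting the top entry. The tail queue is left out.

```java
import java.util.PriorityQueue;

final class HeadQueueDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical clause cursor; next is the next doc ID this clause can match. */
    static final class Cursor {
        final int[] docs; // sorted doc IDs matched by this clause
        int pos = 0;
        int next;

        Cursor(int... docs) {
            this.docs = docs;
            this.next = docs.length > 0 ? docs[0] : NO_MORE_DOCS;
        }

        /** Moves to the first doc ID >= min. */
        void advance(int min) {
            while (pos < docs.length && docs[pos] < min) {
                pos++;
            }
            next = pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Builds a queue ordered by the smallest next doc ID. */
    static PriorityQueue<Cursor> headOf(Cursor... cursors) {
        PriorityQueue<Cursor> head =
            new PriorityQueue<>((a, b) -> Integer.compare(a.next, b.next));
        for (Cursor c : cursors) {
            head.add(c);
        }
        return head;
    }

    /** Advances cursors until the top one is at a doc ID >= min; returns that doc ID. */
    static int advance(PriorityQueue<Cursor> head, int min) {
        while (head.peek().next < min) {
            Cursor top = head.poll(); // no updateTop here: remove, mutate, re-insert
            top.advance(min);
            head.add(top);
        }
        return head.peek().next;
    }
}
```

With three clauses matching {1, 9, 30}, {4, 12}, and {2}, advancing to 5 leaves the first clause on top because 9 is the smallest doc ID that is not below 5.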
BooleanScorer::score->advance->BulkScorerAndDoc::advance
// BulkScorerAndDoc
void advance(int min) throws IOException {
    score(orCollector, null, min, min);
}
void score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
    next = scorer.score(collector, acceptDocs, min, max);
}
// DefaultBulkScorer
public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
    collector.setScorer(scorer);
    if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
        ...
    } else {
        int doc = scorer.docID();
        if (doc < min) {
            doc = iterator.advance(min);
        }
        return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
    }
}
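On the first call no document has been read yet, so docID returns -1. Because BulkScorerAndDoc::advance passes min as both bounds, max equals min rather than NO_MORE_DOCS, and the else branch runs; doc (-1) is smaller than min, so iterator.advance is called to find the first document ID. Here the iterator is a BlockDocsEnum, whose advance function reads the document information from the corresponding .doc file.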
BooleanScorer::score->advance->BulkScorerAndDoc::advance->score->DefaultBulkScorer::score->BlockDocsEnum::advance
public int advance(int target) throws IOException {
    if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {
        ...
    }
    if (docUpto == docFreq) {
        return doc = NO_MORE_DOCS;
    }
    if (docBufferUpto == BLOCK_SIZE) {
        refillDocs();
    }
    while (true) {
        accum += docDeltaBuffer[docBufferUpto];
        docUpto++;
        if (accum >= target) {
            break;
        }
        docBufferUpto++;
        if (docUpto == docFreq) {
            return doc = NO_MORE_DOCS;
        }
    }
    freq = freqBuffer[docBufferUpto];
    docBufferUpto++;
    return doc = accum;
}
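docUpto counts how many documents of the posting list have been consumed, and docBufferUpto is the read position inside the current block buffer. BLOCK_SIZE is the buffer size: once docBufferUpto reaches BLOCK_SIZE the buffered block has been used up, and refillDocs reads the next block from the .doc file. docDeltaBuffer holds document IDs as deltas from the previous document, which accum sums back into absolute IDs, and freqBuffer holds the matching term frequencies. The function returns the first document ID that is greater than or equal to target.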
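The delta decoding inside advance can be shown on its own. This is a minimal sketch, not Lucene's code: DeltaPostings is a hypothetical class, the whole posting list sits in two in-memory arrays, and block refills and skip data are ignored.

```java
/** Hypothetical stand-in for a postings iterator whose doc IDs are stored as deltas. */
final class DeltaPostings {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docDeltas; // gap between consecutive doc IDs
    private final int[] freqs;     // term frequency per document, stored as-is
    private int upto = 0;          // number of entries consumed so far
    private int accum = 0;         // running sum of deltas = last decoded doc ID
    int freq;                      // frequency of the document returned by advance

    DeltaPostings(int[] docDeltas, int[] freqs) {
        this.docDeltas = docDeltas;
        this.freqs = freqs;
    }

    /** Returns the first not-yet-consumed doc ID >= target, or NO_MORE_DOCS. */
    int advance(int target) {
        while (upto < docDeltas.length) {
            accum += docDeltas[upto]; // rebuild the absolute doc ID
            freq = freqs[upto];
            upto++;
            if (accum >= target) {
                return accum;
            }
        }
        return NO_MORE_DOCS;
    }
}
```

For the doc IDs 3, 7, 12, 20 the stored deltas are 3, 4, 5, 8; a call to advance(5) skips doc 3 and stops at doc 7.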
With the first document ID in hand, BooleanScorer's score function goes on to process the matching documents through scoreWindow.
BooleanScorer::score->scoreWindow
private BulkScorerAndDoc scoreWindow(BulkScorerAndDoc top, LeafCollector collector,
        LeafCollector singleClauseCollector, Bits acceptDocs, int min, int max) throws IOException {
    final int windowBase = top.next & ~MASK;
    final int windowMin = Math.max(min, windowBase);
    final int windowMax = Math.min(max, windowBase + SIZE);
    leads[0] = head.pop();
    int maxFreq = 1;
    // gather every scorer whose next document lies in this window
    while (head.size() > 0 && head.top().next < windowMax) {
        leads[maxFreq++] = head.pop();
    }
    if (minShouldMatch == 1 && maxFreq == 1) {
        ...
    } else {
        scoreWindowMultipleScorers(collector, acceptDocs, windowBase, windowMin, windowMax, maxFreq);
        return head.top();
    }
}
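scoreWindow handles at most SIZE documents per call. windowBase is top.next rounded down to a multiple of SIZE, and windowMin (inclusive) and windowMax (exclusive) bound the document IDs handled in this window. Every BulkScorerAndDoc whose next document falls inside the window is popped from head into the leads array, and scoreWindowMultipleScorers continues from there.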
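The window arithmetic is easy to check by hand. Below is a small sketch, assuming SIZE is 2048 (a shift of 11) to match the window size mentioned in the text; WindowMath and its window method are hypothetical names used only for illustration.

```java
final class WindowMath {
    static final int SHIFT = 11;
    static final int SIZE = 1 << SHIFT; // 2048 documents per window (assumed)
    static final int MASK = SIZE - 1;   // low 11 bits = offset inside the window

    /** Returns {windowBase, windowMin, windowMax} for the window containing nextDoc. */
    static int[] window(int nextDoc, int min, int max) {
        int windowBase = nextDoc & ~MASK;                 // round down to a multiple of SIZE
        int windowMin = Math.max(min, windowBase);        // never go below the requested min
        int windowMax = Math.min(max, windowBase + SIZE); // never go past the requested max
        return new int[] {windowBase, windowMin, windowMax};
    }
}
```

For example, document 5000 falls in the window starting at 4096; with min = 4500 and max = 6000 only the range from 4500 up to 6000 of that window is scored.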
BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers
private void scoreWindowMultipleScorers(LeafCollector collector, Bits acceptDocs, int windowBase, int windowMin, int windowMax, int maxFreq) throws IOException {
    ...
    if (maxFreq >= minShouldMatch) {
        ...
        scoreWindowIntoBitSetAndReplay(collector, acceptDocs, windowBase, windowMin, windowMax, leads, maxFreq);
    }
    ...
}
scoreWindowMultipleScorers in turn calls scoreWindowIntoBitSetAndReplay.
BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay
private void scoreWindowIntoBitSetAndReplay(LeafCollector collector, Bits acceptDocs,
        int base, int min, int max, BulkScorerAndDoc[] scorers, int numScorers) throws IOException {
    for (int i = 0; i < numScorers; ++i) {
        final BulkScorerAndDoc scorer = scorers[i];
        scorer.score(orCollector, acceptDocs, min, max);
    }
    scoreMatches(collector, base);
    Arrays.fill(matching, 0L);
}
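scoreWindowIntoBitSetAndReplay walks the BulkScorerAndDoc array and calls score on each entry; each of those calls eventually reaches OrCollector's collect function. scoreMatches then replays what was gathered for this window, and finally the matching array is cleared so that it can be reused for the next window of 2048 documents.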
BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay->BulkScorerAndDoc::score->DefaultBulkScorer::score->scoreRange->OrCollector::collect
public void collect(int doc) throws IOException {
    final int i = doc & MASK;
    final int idx = i >>> 6;
    matching[idx] |= 1L << i;
    final Bucket bucket = buckets[i];
    bucket.freq++;
    bucket.score += scorer.score();
}
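Each call to collect records one matching document of the current window, which spans at most 2048 documents. The matching member marks with one bit per document which documents matched, and buckets accumulates, for each of those documents, the number of matching clauses and the sum of their scores as returned by score. The 2048 documents are split into 32 groups, each stored as one 64-bit long.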
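The bookkeeping done by collect can be reproduced with plain arrays. This is a sketch under assumptions, not the Lucene class: WindowAccumulator is a hypothetical name, the per-clause score is passed in as an argument instead of being read from a scorer, and the Bucket objects are replaced by two parallel arrays.

```java
final class WindowAccumulator {
    static final int SIZE = 2048;     // window size, as stated in the text
    static final int MASK = SIZE - 1;

    final long[] matching = new long[SIZE / 64]; // 32 words, one bit per window slot
    final int[] freq = new int[SIZE];            // number of clauses that matched each slot
    final double[] score = new double[SIZE];     // summed clause scores per slot

    /** Records that one clause matched doc with the given score. */
    void collect(int doc, float clauseScore) {
        int i = doc & MASK;           // slot of the document inside the window
        matching[i >>> 6] |= 1L << i; // a long shift uses only the low 6 bits of i
        freq[i]++;
        score[i] += clauseScore;
    }
}
```

Collecting document 4166 (window base 4096, slot 70) twice sets bit 6 of matching[1] once, raises freq[70] to 2, and adds both clause scores into score[70].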
BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay->scoreMatches
private void scoreMatches(LeafCollector collector, int base) throws IOException {
    long matching[] = this.matching;
    for (int idx = 0; idx < matching.length; idx++) {
        long bits = matching[idx];
        while (bits != 0L) {
            int ntz = Long.numberOfTrailingZeros(bits);
            int doc = idx << 6 | ntz;
            scoreDocument(collector, base, doc);
            bits ^= 1L << ntz;
        }
    }
}
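scoreMatches reads the set bits to recover which window slots matched, then calls scoreDocument for each one to compute the final score and hand the document to the collector.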
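The bit-scanning loop works on any long array, so it can be tried outside Lucene. The sketch below uses a hypothetical BitReplay class that returns the slot numbers in a list instead of calling scoreDocument.

```java
import java.util.ArrayList;
import java.util.List;

final class BitReplay {
    /** Returns the slot indexes of all set bits, in increasing order. */
    static List<Integer> setSlots(long[] matching) {
        List<Integer> slots = new ArrayList<>();
        for (int idx = 0; idx < matching.length; idx++) {
            long bits = matching[idx];
            while (bits != 0L) {
                int ntz = Long.numberOfTrailingZeros(bits); // position of the lowest set bit
                slots.add(idx << 6 | ntz);                  // word index * 64 + bit position
                bits ^= 1L << ntz;                          // clear that bit and continue
            }
        }
        return slots;
    }
}
```

Because each iteration jumps straight to the lowest set bit, the cost of the inner loop depends on the number of matches in a word rather than on its 64 bit positions.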
BooleanScorer::score->scoreWindow->scoreWindowMultipleScorers->scoreWindowIntoBitSetAndReplay->scoreMatches->scoreDocument
private void scoreDocument(LeafCollector collector, int base, int i) throws IOException {
    final FakeScorer fakeScorer = this.fakeScorer;
    final Bucket bucket = buckets[i];
    if (bucket.freq >= minShouldMatch) {
        fakeScorer.freq = bucket.freq;
        fakeScorer.score = (float) bucket.score * coordFactors[bucket.freq];
        final int doc = base | i;
        fakeScorer.doc = doc;
        collector.collect(doc);
    }
    bucket.freq = 0;
    bucket.score = 0;
}
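This computes the document ID and takes the accumulated result out of the buckets array, then calls collect to handle the final result. The collector here is the SimpleTopScoreDocCollector created at the very beginning; its collect function compares scores and keeps the documents to be returned in ranked order.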
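How a collector can keep only the best hits by comparing scores is sketched below. This is an illustration of the general top-k technique with a min-heap, not the SimpleTopScoreDocCollector source: TopScores, Hit, and topDocs are hypothetical names, java.util.PriorityQueue is used in place of Lucene's own queue, and k is assumed to be at least 1.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class TopScores {
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int k; // number of hits to keep, assumed >= 1
    // Min-heap on score: the head is the weakest hit currently kept.
    private final PriorityQueue<Hit> heap =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopScores(int k) {
        this.k = k;
    }

    /** Keeps the hit only if it beats the weakest of the k hits kept so far. */
    void collect(int doc, float score) {
        if (heap.size() < k) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new Hit(doc, score));
        }
    }

    /** Returns the kept doc IDs ordered from highest to lowest score. */
    int[] topDocs() {
        Hit[] hits = heap.toArray(new Hit[0]);
        Arrays.sort(hits, (a, b) -> Float.compare(b.score, a.score));
        int[] docs = new int[hits.length];
        for (int i = 0; i < hits.length; i++) {
            docs[i] = hits[i].doc;
        }
        return docs;
    }
}
```

Most documents are rejected with a single comparison against the head of the heap, so the full ordering is only produced once, when the results are read out.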