Lucene Hack: Improving Performance by Shrinking the Search Result Set (2)

Posted by: job2672488 | Source: Internet | 2023-06-01 14:15

Author: caocao (网络隐士, "the Net Hermit"), [url]http://www.caocao.name[/url], [url]http://www.caocao.mobi[/url]

When reposting, please credit the source: [url]http://www.iteye.com/topic/80073[/url]

Continuing from the previous post ([url]http://www.iteye.com/topic/78884[/url]): last time I covered the general principle, and this time we get to the code.

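V. Principles

1. Do not modify the lucene-core code

Modifying lucene-core at will is bad practice, and it causes a lot of trouble for later maintenance and upgrades. If the need really is that urgent, it is better to join the Lucene development team and contribute there. You may ask why I have not done so myself. Alas, the code is rather ugly and I am too embarrassed to show it to them; more on that later.

2. Do not change the Lucene index file format

Same reasoning as above.

3. Replace as few of the regular search interfaces as possible

This makes it easy to switch back and forth between the standard search and this one, and it keeps the cost of changing and maintaining code low.

4. Naming convention

Every added class name starts with Inaccurate; everything else follows the Lucene naming conventions.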

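VI. Limitations

1. I only wrote a replacement for BooleanWeight2. If the Weight is not a BooleanWeight2, the search behaves exactly like a regular search.

2. If the result set is no larger than the maximum allowed result set, the search behaves exactly like a regular search.

VII. Files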

[code]
org.apache.lucene.search
    InaccurateBooleanScorer2.java // replacement for BooleanScorer2
    InaccurateBooleanWeight2.java // replacement for BooleanWeight2
    InaccurateHit.java // replacement for Hit
    InaccurateHitIterator.java // replacement for HitIterator
    InaccurateHits.java // replacement for Hits
    InaccurateIndexSearcher.java // replacement for IndexSearcher

org.apache.lucene.util
    InaccurateResultAggregation.java // value object holding the search statistics
[/code]

VIII. In Practice

1. InaccurateIndexSearcher

InaccurateIndexSearcher extends IndexSearcher. Its structure is simple: it adds two member variables, maxNumberOfDocs and inaccurateResultAggregation, plus a few methods.

Here comes the ugly part:

[code]
public void search(Weight weight, Filter filter, final HitCollector results, boolean ascending) throws IOException {
    ...
    if (weight.getClass().getSimpleName().equals("BooleanWeight2")) { // hook BooleanWeight2
        InaccurateBooleanWeight2 inaccurateBooleanWeight2 = new InaccurateBooleanWeight2(
                this, weight.getQuery());
        float sum = inaccurateBooleanWeight2.sumOfSquaredWeights();
        float norm = this.getSimilarity().queryNorm(sum);
        inaccurateBooleanWeight2.normalize(norm); // bad smell
        InaccurateBooleanScorer2 inaccurateBooleanScorer2 = inaccurateBooleanWeight2
                .getInaccurateBooleanScorer2(reader, maxNumberOfDocs);
        if (inaccurateBooleanScorer2 != null) {
            inaccurateResultAggregation = inaccurateBooleanScorer2
                    .getInaccurateTopAggregation(collector, ascending);
        }
    } else {
        Scorer scorer = weight.scorer(reader);
        if (scorer != null) {
            scorer.score(collector);
        }
    }
    ...
}
[/code]

Because lucene-core hides BooleanWeight2, not even instanceof can be used, so I had to resort to the ugly weight.getClass().getSimpleName().equals("BooleanWeight2").
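To illustrate why a name comparison is the only option here, the following is a minimal sketch that does not use Lucene at all. WeightLike, Library and HiddenWeight are hypothetical stand-ins for a public interface, the library, and a class the library keeps private. Outside Library the type HiddenWeight cannot be named, so an instanceof test against it would not compile, while getSimpleName() still works.

```java
// Sketch only: WeightLike, Library and HiddenWeight are hypothetical
// stand-ins, not Lucene classes.
interface WeightLike {
    float value();
}

final class Library {
    // Private nested class: code outside Library cannot name this type,
    // so "w instanceof Library.HiddenWeight" does not compile there.
    private static final class HiddenWeight implements WeightLike {
        public float value() {
            return 1.0f;
        }
    }

    static WeightLike hidden() {
        return new HiddenWeight();
    }

    static WeightLike ordinary() {
        // Anonymous class: its simple name is the empty string.
        return new WeightLike() {
            public float value() {
                return 2.0f;
            }
        };
    }
}

final class HiddenClassCheck {
    // Falls back to comparing the simple class name because the type
    // itself is not accessible from here.
    static boolean isHiddenWeight(WeightLike w) {
        return w.getClass().getSimpleName().equals("HiddenWeight");
    }

    public static void main(String[] args) {
        System.out.println(isHiddenWeight(Library.hidden()));
        System.out.println(isHiddenWeight(Library.ordinary()));
    }
}
```

The check is fragile: it breaks silently if the library renames the class, and it also matches any unrelated class that happens to share the simple name. Comparing getClass().getName() against the full binary name would be somewhat stricter.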

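After I replaced BooleanWeight2 with InaccurateBooleanWeight2, the code kept returning no results at all. Only after a lot of painful debugging did I find that constructing a BooleanWeight2 is not the end of it: you also have to obtain sum and norm and then call normalize, which is a bit of a bad smell.

Next, the InaccurateBooleanScorer2 is obtained from the InaccurateBooleanWeight2, and getInaccurateTopAggregation is called to run the search. The ascending parameter has no effect here, and the reason is rather involved. I introduced ascending to control how Lucene scans the index, from small docID to large or from large docID to small. Later I changed the way the index is built and no longer needed it, so I only kept the parameter for future use, in case lucene-core ever supports scanning the index in both directions.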

2. InaccurateHits

Calling search on InaccurateIndexSearcher really just calls new InaccurateHits(this, query, null, sort, ascending). getMoreDocs in turn calls back into the newly written search method.

The code:

[code]
...
TopDocs topDocs = (sort == null)
        ? searcher.search(weight, filter, n, ascending)
        : searcher.search(weight, filter, n, sort, ascending);
length = topDocs.totalHits;
InaccurateResultAggregation inaccurateResultAggregation = searcher
        .getInaccurateResultAggregation();
if (inaccurateResultAggregation == null) {
    totalLength = length;
} else {
    accurate = inaccurateResultAggregation.isAccurate();
    if (inaccurateResultAggregation.isAccurate()) {
        totalLength = inaccurateResultAggregation.getNumberOfRecordsFound();
    } else {
        int maxDocID = searcher.maxDoc();
        totalLength = 1000 * ((int) Math.ceil(0.001 * maxDocID
                / (inaccurateResultAggregation.getLastDocID() + 1)
                * inaccurateResultAggregation.getNumberOfRecordsFetched())); // guessing how many records there are
    }
}
...
[/code]

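There is nothing special in this code except the algorithm that guesses the total number of records. Lucene scans from small docIDs to large ones, and as described last time the scan bails out partway through, so the ratio between the last scanned docID (lastDocID) and maxDocID gives an estimate of how many records there are in total. It is not very precise, but it is reliable to the order of magnitude. Users can generally see only the first 1000 records anyway, so the exact total hardly matters to them.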
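The estimate can be tried in isolation. The sketch below copies the arithmetic from the snippet above into a standalone method; the class name TotalHitsEstimate and the sample numbers are made up for illustration. It assumes, as the post does implicitly, that matches are spread roughly evenly over the docID range.

```java
// Sketch only: TotalHitsEstimate is a hypothetical helper, not part of the
// uploaded code. The arithmetic is copied from the InaccurateHits snippet.
final class TotalHitsEstimate {
    // Extrapolates the hits fetched in docIDs [0, lastDocID] to the whole
    // docID range [0, maxDoc) and rounds up to a multiple of 1000.
    static int estimate(int maxDoc, int lastDocID, int numberOfRecordsFetched) {
        double thousands = 0.001 * maxDoc / (lastDocID + 1) * numberOfRecordsFetched;
        return 1000 * (int) Math.ceil(thousands);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: the scan stopped at docID 123456 after
        // collecting 5000 hits, in an index with 1,000,000 documents.
        System.out.println(estimate(1_000_000, 123_456, 5_000));
    }
}
```

With these sample numbers the scan covered about 12% of the docID range, so the 5000 fetched hits extrapolate to roughly 40,500, which the rounding turns into 41000. If matching documents cluster at the start or end of the index, the guess can be off by much more.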

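3. InaccurateBooleanWeight2

There is not much to say about InaccurateBooleanWeight2; it is just a stepping stone for getting hold of an InaccurateBooleanScorer2.

4. InaccurateBooleanScorer2

All the code in InaccurateBooleanScorer2 comes from BooleanScorer2. Because BooleanScorer2 was not designed to be subclassed, I had to start over from a copy, which is another bad smell. I did not modify any of the code taken from BooleanScorer2; I only added getMaxNumberOfDocs, getInaccurateTopAggregation and getAccurateBottomAggregation. getInaccurateTopAggregation bails out as soon as it has scanned maxNumberOfDocs documents, so its results are somewhat inaccurate. getAccurateBottomAggregation always keeps the last maxNumberOfDocs results, so its result list is also incomplete, but its counts are accurate because it walks through every match each time. It follows from this difference that getAccurateBottomAggregation is somewhat slower: you cannot have both accuracy and performance.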

[code]
public InaccurateResultAggregation getInaccurateTopAggregation(
        HitCollector hc, boolean ascending) throws IOException {
    // DeltaTime dt = new DeltaTime();
    if (countingSumScorer == null) {
        initCountingSumScorer();
    }
    int lastDocID = 0;
    boolean reachedTheEnd = true;
    int numberOfRecordsFetched = 0;
    while (countingSumScorer.next()) {
        lastDocID = countingSumScorer.doc();
        float score = score();
        hc.collect(lastDocID, score);
        numberOfRecordsFetched++;
        if (numberOfRecordsFetched >= maxNumberOfDocs) {
            // Peek one step ahead to learn whether anything was left behind.
            reachedTheEnd = !countingSumScorer.next();
            break;
        }
    }
    // System.out.println(dt.getTimeElasped());
    /*
     * This method may discard the remaining matches, so the result may be
     * inaccurate.
     */
    return new InaccurateResultAggregation(lastDocID, ascending,
            reachedTheEnd, numberOfRecordsFetched, numberOfRecordsFetched);
}

public InaccurateResultAggregation getAccurateBottomAggregation(
        HitCollector hc, boolean ascending) throws IOException {
    // DeltaTime dt = new DeltaTime();
    if (countingSumScorer == null) {
        initCountingSumScorer();
    }
    LinkedList<ResultNode> resultNodes = new LinkedList<ResultNode>();
    boolean isFull = false;
    int lastDocID = 0;
    int index = 0;
    int numberOfRecordsFound = 0;
    while (countingSumScorer.next()) {
        lastDocID = countingSumScorer.doc();
        float score = score();
        resultNodes.add(new ResultNode(lastDocID, score));
        if (isFull) {
            // Window is full: drop the oldest entry to keep the last maxNumberOfDocs.
            resultNodes.removeFirst();
        }
        index++;
        numberOfRecordsFound++;
        if (index >= maxNumberOfDocs) {
            isFull = true;
            index = 0;
            // break;
        }
    }
    for (ResultNode resultNode : resultNodes) {
        hc.collect(resultNode.getDoc(), resultNode.getScore());
    }
    // System.out.println(dt.getTimeElasped());
    /*
     * Since this method runs a full scan over all matched docs, the counts
     * are fully accurate.
     */
    return new InaccurateResultAggregation(lastDocID, ascending, true,
            resultNodes.size(), numberOfRecordsFound);
}
[/code]
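The difference between the two methods can be seen without Lucene. The following is a rough sketch of the same two strategies over a plain iterator of matching docIDs; CappedCollection, Result, topCapped and bottomWindow are hypothetical names, and scores are left out to keep it short. It assumes max is at least 1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Sketch only: hypothetical names, no Lucene dependency, scores omitted.
final class CappedCollection {
    // Outcome of one pass: the docIDs kept, how many matches were counted,
    // and whether that count covers every match.
    static final class Result {
        final List<Integer> kept;
        final int found;
        final boolean accurate;

        Result(List<Integer> kept, int found, boolean accurate) {
            this.kept = kept;
            this.found = found;
            this.accurate = accurate;
        }
    }

    // Early termination: keep the first max matches, then stop scanning.
    // The count is only accurate if nothing was left unread.
    static Result topCapped(Iterator<Integer> matches, int max) {
        List<Integer> kept = new ArrayList<>();
        boolean reachedTheEnd = true;
        while (matches.hasNext()) {
            kept.add(matches.next());
            if (kept.size() >= max) {
                reachedTheEnd = !matches.hasNext();
                break;
            }
        }
        return new Result(kept, kept.size(), reachedTheEnd);
    }

    // Full scan: visit every match, but keep only the last max of them.
    // The count is always accurate, at the cost of reading everything.
    static Result bottomWindow(Iterator<Integer> matches, int max) {
        Deque<Integer> window = new ArrayDeque<>();
        int found = 0;
        while (matches.hasNext()) {
            window.addLast(matches.next());
            if (window.size() > max) {
                window.removeFirst();
            }
            found++;
        }
        return new Result(new ArrayList<>(window), found, true);
    }

    public static void main(String[] args) {
        List<Integer> docIDs = Arrays.asList(2, 5, 7, 11, 13);
        Result top = topCapped(docIDs.iterator(), 3);
        Result bottom = bottomWindow(docIDs.iterator(), 3);
        System.out.println(top.kept + " found=" + top.found + " accurate=" + top.accurate);
        System.out.println(bottom.kept + " found=" + bottom.found + " accurate=" + bottom.accurate);
    }
}
```

The early-terminating variant does work proportional to max, while the window variant always does work proportional to the number of matches. That is the accuracy-versus-performance trade-off described above.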

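IX. Summary

The code has been packaged and uploaded, with my brief comments included. How to call it is described in readme.txt; only a few lines of code need to be replaced.

In short, all you need to do is

1. Replace Searcher searcher = new IndexSearcher(reader); with InaccurateIndexSearcher searcher = new InaccurateIndexSearcher(reader, 5000);

2. Replace Hits hits = searcher.search(query); with InaccurateHits hits = searcher.search(query, sort, ascending);

and that is it. Everyone is welcome to try it out. If you make improvements, please be sure to open-source the improved code as well, so that we can all learn from and push each other.

Because the code has several bad smells, I really do not have the nerve to go and announce it to the Lucene development team.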
