I have a few million records that I need to index in Solr. Once indexed, they will not change, and the collections are used only for reads. I currently post the XML docs to the REST API, and it works fine, although it takes some time (the configs are optimized for reads and caching).

But I was wondering: is there a better/faster approach, maybe one that avoids the HTTP/network layer? Something like building the collection locally, copying it to the Solr server, and then adding/swapping the collection?

One option could be a custom DIH (Data Import Handler) for a second/backup core, swapping when done. But that would mean I'd have to "eat" memory on the Solr server that is otherwise used for caching, which would slow down searches.

I am hoping for a disconnected solution: something like a command-line tool running on a different machine, with a configuration optimized for writing, after which I copy the core to production and swap the old one for the new one.

Any ideas?


1 Solution



A few million records should not be an issue.


Check how often you commit, and maybe disable soft commits or make the soft-commit interval much longer.


You can also send documents to one Solr instance from multiple clients and get some multi-threading benefits.
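
As a rough sketch of the multi-client idea, the documents can be split into batches and handed to a small thread pool. Everything named here (`ParallelIndexer`, `partition`, `indexAll`, the `sender` callback) is hypothetical and not part of Solr or SolrJ; the callback stands in for whatever actually posts one batch to the update handler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper: none of these names come from Solr or SolrJ.
final class ParallelIndexer {

    /** Splits docs into consecutive batches of at most batchSize items. */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    /**
     * Sends every batch through `sender` on a fixed-size thread pool and
     * returns the number of batches sent. `sender` is an assumption: it is
     * where the real HTTP post (or SolrJ call) for one batch would go.
     */
    static <T> int indexAll(List<T> docs, int batchSize, int threads,
                            Consumer<List<T>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                pending.add(pool.submit(() -> sender.accept(batch)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the batch and surface any failure
            }
            return pending.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In line with the commit advice above, the sender would ideally post each batch without committing, leaving a single explicit commit for the very end of the run.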


And you can certainly write a small SolrJ client to index into a local/embedded core and then swap that core into production.
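
For the final swap step, one possible sketch uses the CoreAdmin API's `SWAP` action over plain HTTP. It assumes a standalone (non-SolrCloud) Solr, that the freshly built core has already been copied over and loaded on the production instance, and that the base URL and core names are placeholders; `CoreSwap` itself is a hypothetical helper, not a Solr class.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper around the CoreAdmin SWAP action.
final class CoreSwap {

    /** Builds the CoreAdmin URL that swaps liveCore with freshCore. */
    static URI swapUri(String solrBaseUrl, String liveCore, String freshCore) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        String query = "action=SWAP"
                + "&core=" + URLEncoder.encode(liveCore, StandardCharsets.UTF_8)
                + "&other=" + URLEncoder.encode(freshCore, StandardCharsets.UTF_8);
        return URI.create(base + "/admin/cores?" + query);
    }

    /** Issues the swap request and returns the HTTP status code. */
    static int swap(String solrBaseUrl, String liveCore, String freshCore)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(swapUri(solrBaseUrl, liveCore, freshCore))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

A call such as `CoreSwap.swap("http://localhost:8983/solr", "products", "products_new")` (both core names made up for illustration) would then make the new index live while the old one stays available under the other name as a fallback.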


