I just started using Nutch 1.11 and Solr 5.3.1.
I want to crawl data with Nutch, then index it and make it searchable with Solr.
I know how to crawl data from the web using Nutch's bin/crawl command, and I successfully crawled a lot of data from a website on my local machine.
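For reference, a typical bin/crawl invocation looks roughly like this (the urls/ seed directory, crawl/ output directory, and round count are placeholders, and the exact arguments may differ slightly between Nutch versions):

bin/crawl urls/ crawl/ 2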
I also started a new Solr server locally with the command below, run from the Solr root folder:
bin/solr start
Then I created the files core, using the config under the example folder, with this command:
bin/solr create -c files -d example/files/conf
I can open the admin URL below and manage the files core:
http://localhost:8983/solr/#/files
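As a quick sanity check (assuming the default port and the files core name above), the core can also be queried directly with curl; a normal JSON response means the core itself is reachable:

curl "http://localhost:8983/solr/files/select?q=*:*&rows=0&wt=json"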
So I believe Solr is running correctly, and I went on to post the Nutch data into Solr with Nutch's bin/nutch index command:
bin/nutch index crawl/crawldb \
-linkdb crawl/linkdb \
-params solr.server.url=127.0.0.1:8983/solr/files \
-dir crawl/segments
I was hoping that with Solr 5's new auto schema feature I could take it easy, but instead I got the error below (copied from the log file):
WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1.
INFO segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2.
INFO segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3.
INFO indexer.IndexingJob - Indexer: starting at 2015-12-14 15:21:39
INFO indexer.IndexingJob - Indexer: deleting gone documents: false
INFO indexer.IndexingJob - Indexer: URL filtering: false
INFO indexer.IndexingJob - Indexer: URL normalizing: false
INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO indexer.IndexingJob - Active IndexWriters :
SolrIndexWriter
solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
solr.server.url : URL of the Solr instance (mandatory)
solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.commit.size : buffer size when sending to Solr (default 1000)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1
INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2
INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO solr.SolrMappingReader - source: content dest: content
INFO solr.SolrMappingReader - source: title dest: title
INFO solr.SolrMappingReader - source: host dest: host
INFO solr.SolrMappingReader - source: segment dest: segment
INFO solr.SolrMappingReader - source: boost dest: boost
INFO solr.SolrMappingReader - source: digest dest: digest
INFO solr.SolrMappingReader - source: tstamp dest: tstamp
INFO solr.SolrIndexWriter - Indexing 250 documents
INFO solr.SolrIndexWriter - Deleting 0 documents
INFO solr.SolrIndexWriter - Indexing 250 documents
WARN mapred.LocalJobRunner - job_local117437667_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html.
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
Powered by Jetty://
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html.
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
Powered by Jetty://
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:134)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:356)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
I remember that this error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html.
is something related to the Solr URL, but I double-checked the URL I used, 127.0.0.1:8983/solr/files, and I think it is correct.
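For what it's worth, the 404 in the log is for /solr/update, with no core name in the path, which suggests the indexer fell back to a default URL rather than the one passed via -params. The difference is easy to see with curl (paths assumed from the core name above; the second request should be answered by the files core instead of returning 404):

curl -i "http://127.0.0.1:8983/solr/update?commit=true"
curl -i "http://127.0.0.1:8983/solr/files/update?commit=true"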
Does anyone know what the problem is? I searched the web and this site, but found nothing useful.
Note: I also tried disabling Solr 5's auto schema feature in example/files/conf/solrconfig.xml and replacing example/files/conf/managed-schema.xml with Nutch's conf/schema.xml, but I still hit the same error.
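As a side check (this assumes the files core and the default port), the Schema API can list the fields the running core actually has, which can be compared against the fields Nutch's mapping log mentions (content, title, host, segment, boost, digest, tstamp):

curl "http://localhost:8983/solr/files/schema/fields?wt=json"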
Update: after trying the deprecated command bin/nutch solrindex (thanks to Thangaperumal), the previous error is gone, but I hit another error:
bin/nutch solrindex http://127.0.0.1:8983/solr/files crawl/crawldb -linkdb crawl/linkdb crawl/segments/s1
Error message:
INFO solr.SolrIndexWriter - Indexing 250 documents
INFO solr.SolrIndexWriter - Deleting 0 documents
INFO solr.SolrIndexWriter - Indexing 250 documents
INFO solr.SolrIndexWriter - Deleting 0 documents
INFO solr.SolrIndexWriter - Indexing 250 documents
WARN mapred.LocalJobRunner - job_local1306504137_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unable to invoke function processAdd in script: update-script.js: Can't unambiguously select between fixed arity signatures [(java.lang.String, java.io.Reader), (java.lang.String, java.lang.String)] of the method org.apache.solr.analysis.TokenizerChain.tokenStream for argument types [java.lang.String, null]
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unable to invoke function processAdd in script: update-script.js: Can't unambiguously select between fixed arity signatures [(java.lang.String, java.io.Reader), (java.lang.String, java.lang.String)] of the method org.apache.solr.analysis.TokenizerChain.tokenStream for argument types [java.lang.String, null]
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:134)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:356)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Instead, try this statement to integrate Solr and Nutch:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
Have you tried specifying the Solr URL using:
-D solr.server.url=http://localhost:8983/solr/files
instead of the -params approach? At least that is the right syntax for the crawl script, and since both commands invoke the same underlying Java class to do the work, it should work:
bin/nutch index crawl/crawldb \
-linkdb crawl/linkdb \
-D solr.server.url=http://127.0.0.1:8983/solr/files \
-dir crawl/segments