I just started using Nutch 1.11 and Solr 5.3.1.
I want to crawl data with Nutch, then index it and make it searchable with Solr.
I know how to crawl data from the web using Nutch's bin/crawl command, and I successfully crawled a lot of data from a website on my local machine.
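For reference, a typical bin/crawl invocation looks roughly like this (the urls/ seed directory, crawl/ output directory, and round count are placeholders, and the exact arguments may differ slightly between Nutch versions):

bin/crawl urls/ crawl/ 2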
I also started a new Solr server locally with the command below, run from the Solr root folder:
bin/solr start
Then I created the files core, using the config under the example folder, with this command:
bin/solr create -c files -d example/files/conf
I can open the admin URL below and manage the files core:
http://localhost:8983/solr/#/files
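As a quick sanity check (assuming the default port and the files core name above), the core can also be queried directly with curl; a normal JSON response means the core itself is reachable:

curl "http://localhost:8983/solr/files/select?q=*:*&rows=0&wt=json"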
So I believe Solr is running correctly, and I went on to post the Nutch data into Solr with Nutch's bin/nutch index command:
bin/nutch index crawl/crawldb \
-linkdb crawl/linkdb \
-params solr.server.url=127.0.0.1:8983/solr/files \
-dir crawl/segments
I was hoping that with Solr 5's new auto schema feature I could take it easy, but instead I got the error below (copied from the log file):
WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1.
INFO segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2.
INFO segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3.
INFO indexer.IndexingJob - Indexer: starting at 2015-12-14 15:21:39
INFO indexer.IndexingJob - Indexer: deleting gone documents: false
INFO indexer.IndexingJob - Indexer: URL filtering: false
INFO indexer.IndexingJob - Indexer: URL normalizing: false
INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO indexer.IndexingJob - Active IndexWriters :
SolrIndexWriter
solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
solr.server.url : URL of the Solr instance (mandatory)
solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.commit.size : buffer size when sending to Solr (default 1000)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1
INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2
INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
WARN conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO solr.SolrMappingReader - source: content dest: content
INFO solr.SolrMappingReader - source: title dest: title
INFO solr.SolrMappingReader - source: host dest: host
INFO solr.SolrMappingReader - source: segment dest: segment
INFO solr.SolrMappingReader - source: boost dest: boost
INFO solr.SolrMappingReader - source: digest dest: digest
INFO solr.SolrMappingReader - source: tstamp dest: tstamp
INFO solr.SolrIndexWriter - Indexing 250 documents
INFO solr.SolrIndexWriter - Deleting 0 documents
INFO solr.SolrIndexWriter - Indexing 250 documents
WARN mapred.LocalJobRunner - job_local117437667_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html.
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
Powered by Jetty://
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html.
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
Powered by Jetty://
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:134)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:356)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
I remember that this error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html.
is something related to the Solr URL, but I double-checked the URL I used, 127.0.0.1:8983/solr/files, and I think it is correct.
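For what it's worth, the 404 in the log is for /solr/update, with no core name in the path, which suggests the indexer fell back to a default URL rather than the one passed via -params. The difference is easy to see with curl (paths assumed from the core name above; the second request should be answered by the files core instead of returning 404):

curl -i "http://127.0.0.1:8983/solr/update?commit=true"
curl -i "http://127.0.0.1:8983/solr/files/update?commit=true"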
Does anyone know what the problem is? I searched the web and this site, but found nothing useful.
Note: I also tried disabling Solr 5's auto schema feature in example/files/conf/solrconfig.xml and replacing example/files/conf/managed-schema.xml with Nutch's conf/schema.xml, but I still hit the same error.
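As a side check (this assumes the files core and the default port), the Schema API can list the fields the running core actually has, which can be compared against the fields Nutch's mapping log mentions (content, title, host, segment, boost, digest, tstamp):

curl "http://localhost:8983/solr/files/schema/fields?wt=json"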
Update: after trying the deprecated command bin/nutch solrindex (thanks to Thangaperumal), the previous error is gone, but I hit another error:
bin/nutch solrindex http://127.0.0.1:8983/solr/files crawl/crawldb -linkdb crawl/linkdb crawl/segments/s1
Error message:
INFO solr.SolrIndexWriter - Indexing 250 documents
INFO solr.SolrIndexWriter - Deleting 0 documents
INFO solr.SolrIndexWriter - Indexing 250 documents
INFO solr.SolrIndexWriter - Deleting 0 documents
INFO solr.SolrIndexWriter - Indexing 250 documents
WARN mapred.LocalJobRunner - job_local1306504137_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unable to invoke function processAdd in script: update-script.js: Can't unambiguously select between fixed arity signatures [(java.lang.String, java.io.Reader), (java.lang.String, java.lang.String)] of the method org.apache.solr.analysis.TokenizerChain.tokenStream for argument types [java.lang.String, null]
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unable to invoke function processAdd in script: update-script.js: Can't unambiguously select between fixed arity signatures [(java.lang.String, java.io.Reader), (java.lang.String, java.lang.String)] of the method org.apache.solr.analysis.TokenizerChain.tokenStream for argument types [java.lang.String, null]
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:134)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:356)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Instead, try this statement to integrate Solr and Nutch:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
Have you tried specifying the Solr URL using:
-D solr.server.url=http://localhost:8983/solr/files
instead of the -params approach? At least that is the right syntax for the crawl script, and since both commands invoke the same underlying Java class to do the work, it should work:
bin/nutch index crawl/crawldb \
-linkdb crawl/linkdb \
-D solr.server.url=http://127.0.0.1:8983/solr/files \
-dir crawl/segments