
Configuring and debugging Nutch 1.2 in Eclipse on Windows


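Reposted and compiled by 守望者MS.

Note: this article has two parts. Part 1 is the English setup guide and Part 2 is the Chinese one. Following the English steps is recommended: the Chinese guide omits the cygwin step, which causes some problems later on. The fix will be posted in a separate article.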

  • Part 1


 

Tested with

  • Nutch release 1.0
  • Eclipse 3.3 (Europa) and 3.4 (Ganymede)
  • Java 1.6
  • Ubuntu (should work on most platforms though)
  • Windows XP and Vista

 

Before you start

Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, being able to debug Nutch in Eclipse is very useful. Sometimes, though, examining the logs (logs/hadoop.log) is the quicker way to track down a problem.

 

Steps

 

For Windows Users

If you are running Windows (tested on Windows XP) you must first install cygwin. Download it from http://www.cygwin.com/setup.exe

Install cygwin and set the PATH environment variable for it. You can set it from the Control Panel, System, Advanced Tab, Environment Variables and edit/add PATH.

Example PATH:

 

C:/Sun/SDK/bin;C:/cygwin/bin

If you run "bash" from the Windows command line (Start > Run... > cmd.exe) it should successfully run cygwin.
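The same check can be made from Java, which is closer to what code launched from Eclipse actually sees. The sketch below is not part of Nutch; the class name BashOnPath and the helper findOnPath are made up for illustration. It only scans the entries of a PATH-style string for bash.exe or bash.

```java
import java.io.File;

// Illustrative helper (not part of Nutch): looks for an executable in a
// PATH-style string such as "C:/Sun/SDK/bin;C:/cygwin/bin".
class BashOnPath {

    // Returns the first matching file, or null if none of the given names
    // exists in any directory listed in pathValue.
    static File findOnPath(String pathValue, String... names) {
        if (pathValue == null) {
            return null;
        }
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        File bash = findOnPath(System.getenv("PATH"), "bash.exe", "bash");
        System.out.println(bash == null
                ? "bash not found on PATH; check the cygwin bin entry"
                : "bash found at " + bash.getAbsolutePath());
    }
}
```

Running it as a Java Application from inside Eclipse shows the PATH that Eclipse inherited, which can differ from the one in a freshly opened command prompt.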

If you are running Eclipse on Vista, you will need to either give cygwin administrative privileges or turn off Vista's User Account Control (UAC). Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler:

 

org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of ... Permission denied

See this for more information about the UAC issue.

 

Install Nutch

  • Grab a fresh release of Nutch 1.0 or download and untar the official 1.0 release.

  • Do not build Nutch yet. Make sure you have no .project and .classpath files in the Nutch directory

 

Create a new Java Project in Eclipse

  • File > New > Project > Java project > click Next

  • Name the project (Nutch_Trunk for instance)
  • Select "Create project from existing source" and use the location where you downloaded Nutch
  • Click on Next, and wait while Eclipse is scanning the folders
  • Add the folder "conf" to the classpath (Right-click on the project, select "properties" then "Java Build Path" tab (left menu) and then the "Libraries" tab. Click "Add Class Folder..." button, and select "conf" from the list)
  • Go to the "Order and Export" tab, find the entry for the added "conf" folder and move it to the top (by checking it and clicking the "Top" button). This is required so that Eclipse takes the config resources (nutch-default.xml, nutch-site.xml, etc.) from our "conf" folder and not from somewhere else.
  • Eclipse should have guessed all the Java files that must be added to your classpath. If that's not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
  • Click the "Source" tab and set the default output folder to "Nutch_Trunk/bin/tmp_build". (You may need to create the tmp_build folder.)
  • Click the "Finish" button
  • DO NOT add "build" to classpath

 

Configure Nutch

  • See the Tutorial

  • Change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-default.xml
  • Make sure Nutch is configured correctly before testing it in Eclipse ;-)
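To double-check which value a config file really defines for "plugin.folders", the file can be read with the JDK's own XML parser. This is only a sketch: it does not use Nutch's or Hadoop's configuration classes, the class name ConfProperty is hypothetical, and it assumes the usual configuration/property/name/value layout of nutch-default.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative reader (not part of Nutch) for Hadoop-style config XML.
class ConfProperty {

    // Returns the <value> of the first <property> whose <name> equals name,
    // or null if the property is not defined in the given XML text.
    static String read(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                if (name.equals(childText(prop, "name"))) {
                    return childText(prop, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse config XML", e);
        }
    }

    // Trimmed text of the first descendant element with the given tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Inline sample standing in for the contents of conf/nutch-default.xml.
        String sample = "<configuration><property>"
                + "<name>plugin.folders</name><value>./src/plugin</value>"
                + "</property></configuration>";
        System.out.println("plugin.folders = " + read(sample, "plugin.folders"));
    }
}
```

In practice the string passed to read would be the contents of the config file rather than the inline sample.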

 

Missing org.farng and com.etranslate

Eclipse will complain about some import statements in parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code.

Download them here:

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder > Build Path > Configure Build Path... Then select the Libraries tab, click "Add JARs..." and add each .jar file individually. If that does not work, you may try clicking "Add External JARs..." and then pointing to the two directories above).

 

Two Errors with RTFParseFactory

If you are trying to build the official 1.0 release, Eclipse will complain about 2 errors regarding the RTFParseFactory (this is after adding the RTF jar file from the previous step). This problem was fixed (see NUTCH-644 and NUTCH-705) but the fix was not included in the official 1.0 release because of licensing issues. So you will need to alter the code manually to remove these 2 build errors.

In RTFParseFactory.java:

  1. Add the following import statement: import org.apache.nutch.parse.ParseResult;

  2. Change

public Parse getParse(Content content) {

to

public ParseResult getParse(Content content) {

  3. In the getParse function, replace

return new ParseStatus(ParseStatus.FAILED,
        ParseStatus.FAILED_EXCEPTION,
        e.toString()).getEmptyParse(conf);

with

return new ParseStatus(ParseStatus.FAILED,
        ParseStatus.FAILED_EXCEPTION,
        e.toString()).getEmptyParseResult(content.getUrl(), getConf());

  4. In the getParse function, replace

return new ParseImpl(text,
        new ParseData(ParseStatus.STATUS_SUCCESS,
                title,
                OutlinkExtractor.getOutlinks(text, this.conf),
                content.getMetadata(),
                metadata));

with

return ParseResult.createParseResult(content.getUrl(),
        new ParseImpl(text,
                new ParseData(ParseStatus.STATUS_SUCCESS,
                        title,
                        OutlinkExtractor.getOutlinks(text, this.conf),
                        content.getMetadata(),
                        metadata)));

In TestRTFParser.java, replace

 

parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);

with

 

parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);

Once you have made these changes and saved the files, Eclipse should build with no errors.

 

Build Nutch

If you set up the project correctly, Eclipse will build Nutch for you into "tmp_build". See below for problems you could run into.

 

Create Eclipse launcher

  • Menu Run > "Run..."

  • create "New" for "Java Application"
  • set in Main class

 

org.apache.nutch.crawl.Crawl
  • on tab Arguments, Program Arguments

 

urls -dir crawl -depth 3 -topN 50
  • in VM arguments

 

-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
  • click on "Run"
  • if all works, you should see Nutch getting busy at crawling :-)

 

Debug Nutch in Eclipse (not yet tested for 0.9)

  • Set breakpoints and debug a crawl
  • It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints:

 

Fetcher [line: 371] - run
Fetcher [line: 438] - fetch
Fetcher$FetcherThread [line: 149] - run()
Generator [line: 281] - generate
Generator$Selector [line: 119] - map
OutlinkExtractor [line: 111] - getOutlinks

 

If things do not work...

Yes, Nutch and Eclipse can be a difficult companionship sometimes ;-)

 

Java Heap Size problem

If the crawler throws an IOException early in the crawl (Exception in thread "main" java.io.IOException: Job failed!), check the logs/hadoop.log file for further information. If you find lines similar to this in hadoop.log:

 

2009-04-13 13:41:06,105 WARN mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space

then you should increase the amount of memory available to applications launched from Eclipse.

Set it in:

Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments

I've set mine to

 

-Xms5m -Xmx150m

because I have about 200MB of RAM left after running all my other applications.

-Xms sets the initial Java heap size and -Xmx sets the maximum.
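To confirm that the VM arguments really reach the JVM that Eclipse launches, a tiny program can print the heap limit it sees. This is a minimal sketch with a made-up class name (HeapCheck); the value reported by Runtime.maxMemory() may be somewhat lower than the -Xmx setting, depending on the JVM.

```java
// Illustrative check (not part of Nutch): prints the maximum heap size
// that the current JVM will attempt to use.
class HeapCheck {

    // Maximum heap in whole megabytes, as reported by the running JVM.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMb() + " MB");
    }
}
```

Run it with the same launch configuration (or the same default VM arguments) as the crawler and compare the printed figure with the -Xmx value you set.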

 

Eclipse: Cannot create project content in workspace

The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) into my workspace. When I tried to create the project from existing source, Eclipse would not let me do it with source code located inside the workspace. With the source code outside my workspace, it works fine.

 

plugin dir not found

Make sure you set your plugin.folders property correctly. Instead of a relative path you can also use an absolute one, either in nutch-default.xml or, perhaps better, in nutch-site.xml

 


<property>
  <name>plugin.folders</name>
  <value>/home/....../nutch-0.9/src/plugin</value>
</property>
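A relative plugin.folders value only works if it resolves to a real directory from wherever the JVM was started. The sketch below (hypothetical class PluginDirCheck, not Nutch code) shows how such a value resolves against the working directory. It assumes a single folder entry and covers only the plain file-system case; Nutch's own lookup may also consult the classpath.

```java
import java.io.File;

// Illustrative check (not part of Nutch): resolves a plugin.folders value
// the way a plain relative file path would be resolved.
class PluginDirCheck {

    // Absolute values are returned unchanged; relative ones are resolved
    // against the given working directory.
    static File resolve(String pluginFolders, File workingDir) {
        File folder = new File(pluginFolders);
        return folder.isAbsolute() ? folder : new File(workingDir, pluginFolders);
    }

    public static void main(String[] args) {
        String value = args.length > 0 ? args[0] : "./src/plugin";
        File workingDir = new File(System.getProperty("user.dir"));
        File resolved = resolve(value, workingDir);
        System.out.println("Working directory: " + workingDir.getAbsolutePath());
        System.out.println("plugin.folders resolves to: " + resolved.getAbsolutePath());
        System.out.println("Is an existing directory: " + resolved.isDirectory());
    }
}
```

When run from an Eclipse launch configuration, the working directory defaults to the project folder, so "./src/plugin" should point inside the Nutch checkout.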

 

No plugins loaded during unit tests in Eclipse

During unit testing, Eclipse ignored conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.

 

NOTE: Additional note for people who want to run eclipse with latest nutch code

If you get the following exception: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer

  1. Execute 'ant job' (which is the default) after downloading nutch through SVN
  2. Update "plugin.folders" (under nutch-default.xml) to build/plugins (where ant builds plugins)
  3. If it still fails increase your memory allocation or find a simpler website to crawl.

 

Unit tests work in eclipse but fail when running ant in the command line

Suppose your unit tests work perfectly in Eclipse, but every one of them fails when you run ant test from the command line, including the ones you haven't modified. Check whether you defined the plugin.folders property in hadoop-site.xml. If so, try removing it from that file and adding it directly to nutch-site.xml

Run ant test again. That should have solved the problem.

If that didn't solve the problem, are you testing a plugin? If so, did you add the plugin to the list of packages in plugin/build.xml, on the test target?

 

classNotFound

  • open the class itself, rightclick
  • refresh the build dir

 

debugging hadoop classes

  • Sometimes it makes sense to also have the Hadoop classes available during debugging. To do so, you can check out the Hadoop sources on your machine and attach them to the hadoop-xxx.jar. Alternatively, you can:
    • Remove the hadoopXXX.jar from your classpath libraries
    • Check out the Hadoop branch that is used within Nutch
    • Configure a Hadoop project in Eclipse, similar to the Nutch project
    • Add the Hadoop project as a dependent project of the Nutch project
    • You can now also set breakpoints within Hadoop classes, such as InputFormat implementations.

 

Failed to get the current user's information

On Windows, if the crawler throws an exception complaining it "Failed to get the current user's information" or 'Login failed: Cannot run program "bash"', it is likely you forgot to set the PATH to point to cygwin. Open a new command line window (All Programs > Accessories > Command Prompt) and type "bash". This should start cygwin. If it doesn't, type "path" to see your path. The cygwin bin directory (e.g., C:/cygwin/bin) should appear in it. See the steps for adding it to your PATH at the top of the article under "For Windows Users". After setting the PATH, you will likely need to restart Eclipse so that it picks up the new value.

Original credits: RenaudRichardet

Updated by: Zeeshan

  • Part 2

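2. Installation
Nutch only needs to be unpacked (assume it is unpacked to D:\nutch-0.9). Installing the other two is just as simple and is omitted here.

3. Configuration
Open Eclipse and choose New -> Java Project. Enter the project name nutch-0.9 (it must match the name of the unpacked directory), select "Create project from existing source", choose the directory D:\nutch-0.9, click Next, and configure the settings as shown in the figure below:

Note: follow the settings in the figure above exactly. First select the conf folder, then click "Add folder 'conf' to build path" below it, and then select conf as the default output folder. Finally click Finish. When asked whether to delete the contents of bin, you can choose No.

After these steps the project has been created, but Eclipse still reports that org.farng and com.etranslate are missing. You need to
download two jar files, jid3lib-0.5.1.jar and rtf-parser.jar. Try downloading them from:

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

After downloading, put both jars into the project's \nutch-0.9\lib directory and add them to the project as follows:
Right-click the project name and choose Refresh, then choose Project -> Properties -> Java Build Path -> Libraries -> Add JARs.... In the dialog, open lib under nutch-0.9, select jid3lib-0.5.1.jar and rtf-parser.jar, and confirm.

This completes the project configuration.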

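4. Run and test
Verify the environment configured above. Configure Nutch as follows:

  1. For convenience, create a file named url.txt directly in the nutch-0.9 project and add the URLs to crawl, for example http://www.sina.com.cn/. Note that the trailing "/" is required, and so is the leading "http://".

  2. Configure crawl-urlfilter.txt

    Open conf/crawl-urlfilter.txt in the project and find these two lines:

    # accept hosts in MY.DOMAIN.NAME

    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    The line beginning with "+" is a regular expression. Rewrite it as follows:

    +^http://([a-z0-9]*\.)*com.cn/
    +^http://([a-z0-9]*\.)*cn/
    +^http://([a-z0-9]*\.)*com/

    Note: there must be no space before the "+" sign.

  3. Change conf\nutch-site.xml to contain the following; otherwise nothing will be crawled.

    <property>
      <name>http.agent.name</name>
      <value>*</value>
    </property>

  4. In conf/nutch-default.xml, change the value of the "plugin.folders" property from "plugins" to "./src/plugin"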
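The effect of the crawl-urlfilter.txt rules in step 2 can be tried out in isolation. The sketch below uses java.util.regex rather than Nutch's own filter classes, and the class name UrlFilterSketch is hypothetical. It assumes the behaviour described in the comments of crawl-urlfilter.txt itself: each rule starts with "+" (accept) or "-" (reject), the first rule whose pattern is found in the URL decides, and a URL that matches no rule is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative stand-in (not Nutch code) for a list of +/- regex URL rules.
class UrlFilterSketch {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    // Each non-empty, non-comment line must begin with '+' or '-'.
    UrlFilterSketch(String... lines) {
        for (String line : lines) {
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                // A leading space ends up here, which is why the "+" must come first.
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides the outcome.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(
                "# accept hosts ending in com.cn, cn or com",
                "+^http://([a-z0-9]*\\.)*com.cn/",
                "+^http://([a-z0-9]*\\.)*cn/",
                "+^http://([a-z0-9]*\\.)*com/");
        System.out.println(filter.accepts("http://www.sina.com.cn/"));
        System.out.println(filter.accepts("http://www.example.org/"));
    }
}
```

Because every pattern ends in "/", a seed URL written without the trailing slash would not match any of these rules, which is consistent with the advice in step 1.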

After making these additions and changes, you can run the crawl. To run it:

  • Menu Run > "Run..."

  • create "New" for "Java Application"

  • set in Main class

org.apache.nutch.crawl.Crawl

  • on tab Arguments, Program Arguments

url.txt -dir sinaweb -depth 3 -topN 50 -threads 3

  • in VM arguments (note: this sets the log file and its directory)

-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

  • click on "Run"

  • if all works, you should see Nutch getting busy at crawling


