xtfncel

Importing Nutch 1.1 into Eclipse and running it


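Importing Nutch into Eclipse

    I recently started studying Nutch and have just got it running in Eclipse, which makes it easier to read the source code. This article applies to Nutch 1.1. Corrections are welcome.

Importing directly into Eclipse

1. Download the latest version of Nutch from http://apache.etoak.com/nutch/. At the time of writing the latest version was 1.1, so everything below refers to Nutch 1.1. (Note: when I downloaded it, the src package of this version was broken; switching to the bin package worked.)

2. In Eclipse, create a new Java Project with a name of your choice (for example, Nutch). Select "Create project from existing source" and point it at your nutch-1.1 directory.

3. Click Finish. The whole Nutch project is now imported into Eclipse.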

 

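4. All configuration files under the conf folder must also be added to the classpath:

Right-click conf -> Build Path -> Use as Source Folder

5. Edit the configuration files.

conf/nutch-site.xml: add the following inside <configuration>.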

<property>
  <name>http.agent.name</name>
  <value>test</value>
  <description>
    HTTP 'User-Agent' request header. MUST NOT be empty - please
    set this to a single word uniquely related to your
    organization. NOTE: You should also check other related
    properties: http.robots.agents http.agent.description
    http.agent.url http.agent.email http.agent.version and set
    their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>test</value>
  <description>
    Further description of our bot- this text is used in
    the User-Agent header. It appears in parenthesis after the
    agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>www.163.com</value>
  <description>
    A URL to advertise in the User-Agent header. This will
    appear in parenthesis after the agent name. Custom dictates
    that this should be a URL of a page explaining the purpose and
    behavior of this crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>xxx@126.com</value>
  <description>
    An email address to advertise in the HTTP 'From' request
    header and User-Agent header. A good practice is to mangle this
    address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

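conf/nutch-default.xml (only the value of plugin.folders changes)

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value><!-- modified -->
</property>

conf/crawl-urlfilter.txt

# accept hosts in MY.DOMAIN.NAME
# (replace the line below with a regex for the site you want to crawl)
+^http://([a-z0-9]*\.)*163.com/

6. In the project root, create a folder named urls and, inside it, a new file url.txt. Put the URLs of the sites you want to crawl in that file, one per line, for example: http://www.163.com/

7. Run the Nutch crawl command. In Eclipse this means running the main class org.apache.nutch.crawl.Crawl (the entry point that appears in the stack traces in the comments) with the urls folder as its first program argument.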
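For step 7, here is a minimal sketch of the program arguments for an Eclipse run configuration. It assumes the option names of the Nutch 1.x crawl tool (`<urlDir> -dir -depth -topN`). The output folder name crawl, the values 3 and 50, and the helper class CrawlArgs are illustrative and not taken from the article.

```java
/** Builds an argument list for the Nutch 1.x crawl tool (hypothetical helper, not part of Nutch). */
final class CrawlArgs {

    /**
     * @param urlDir    folder holding the seed list (the "urls" folder from step 6)
     * @param outputDir folder the crawl writes to (assumed name, pick any)
     * @param depth     number of link levels to follow from the seeds
     * @param topN      maximum number of pages to fetch per level
     */
    static String[] build(String urlDir, String outputDir, int depth, int topN) {
        if (depth < 1 || topN < 1) {
            throw new IllegalArgumentException("depth and topN must be positive");
        }
        return new String[] {
            urlDir,
            "-dir", outputDir,
            "-depth", Integer.toString(depth),
            "-topN", Integer.toString(topN)
        };
    }

    public static void main(String[] args) {
        // Paste the printed line into Run Configurations > Arguments > Program arguments.
        System.out.println(String.join(" ", build("urls", "crawl", 3, 50)));
    }
}
```

Running it prints `urls -dir crawl -depth 3 -topN 50`, which can be used as the program arguments of a run configuration whose main class is org.apache.nutch.crawl.Crawl.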

Nutch is now imported into Eclipse and runs successfully.
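The crawl-urlfilter.txt rule from step 5 can be tested outside Nutch. The sketch below assumes that the text after the leading + is an ordinary Java regular expression and that a URL is accepted when the pattern is found in it; that is my understanding of the Nutch regex URL filter, not something the article states. It also shows that the unescaped dot in 163.com matches any character. The class name UrlFilterDemo and the test URLs are made up for illustration.

```java
import java.util.regex.Pattern;

/** Tries the article's URL filter pattern against a few sample URLs (illustrative only). */
final class UrlFilterDemo {

    // The pattern from crawl-urlfilter.txt, without the leading '+'.
    static final Pattern ORIGINAL = Pattern.compile("^http://([a-z0-9]*\\.)*163.com/");

    // Stricter variant (my suggestion): the dot before "com" is escaped.
    static final Pattern ESCAPED = Pattern.compile("^http://([a-z0-9]*\\.)*163\\.com/");

    /** Returns true when the pattern occurs in the URL (assumed filter semantics). */
    static boolean accepts(Pattern pattern, String url) {
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.163.com/",
            "http://news.163.com/index.html",
            "http://163xcom/",
            "https://www.163.com/"
        };
        for (String url : urls) {
            System.out.println(url
                + "  original=" + accepts(ORIGINAL, url)
                + "  escaped=" + accepts(ESCAPED, url));
        }
    }
}
```

Both patterns accept http://www.163.com/ and reject the https URL. Only the unescaped original accepts http://163xcom/.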

 

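Manually copying the Nutch code into Eclipse (cleaner directory structure)

The approach above does get Nutch into Eclipse, but the resulting project layout is awkward to work with. I therefore also imported the Nutch source by hand to get a cleaner directory structure. The steps are:

1. In Eclipse, create a new Java Project with a name of your choice (for example, Nutch). Select "Create New project in WorkSpace" and click Finish.

2. Copy all the code under \src\java\ in the unpacked Nutch directory into src in the new project.

Copy the three folders lib, plugins and conf from the unpacked Nutch directory into the root of the new project (at the same level as src).

3. Right-click the project and open Properties, go to Java Build Path, switch to the "Libraries" tab and click "Add Class Folder...". Select "conf" from the list so that conf is added to the classpath.

4. Edit the configuration files.

(1) conf/nutch-site.xml: same as above.

(2) conf/nutch-default.xml

<property>
  <name>plugin.folders</name>
  <value>./plugins</value><!-- Careful: the path is different here. I lost most of a day to this. -->
</property>

(3) conf/crawl-urlfilter.txt: same as above.

(4) Create the urls folder: same as above.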
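Because the plugin.folders value differs between the two layouts (./src/plugin versus ./plugins), it is worth checking what a relative value resolves to. The sketch below assumes the program is launched with the project root as the working directory, which is the Eclipse default for a Java launch. It covers only the plain relative-path case and does not reproduce how Nutch itself looks the folder up. PluginFolderCheck is a hypothetical helper name.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Shows where a relative plugin.folders value points (illustrative, not part of Nutch). */
final class PluginFolderCheck {

    /** Resolves a possibly relative plugin.folders value against a working directory. */
    static Path resolve(Path workingDir, String configuredValue) {
        // trim() guards against stray whitespace inside <value>...</value>
        return workingDir.resolve(configuredValue.trim()).normalize();
    }

    public static void main(String[] args) {
        // Assumption: the launch working directory is the project root.
        Path workingDir = Paths.get("").toAbsolutePath();
        for (String value : new String[] {"./src/plugin", "./plugins"}) {
            Path folder = resolve(workingDir, value);
            String state = Files.isDirectory(folder) ? "found" : "missing";
            System.out.println(value + " -> " + folder + " (" + state + ")");
        }
    }
}
```

Run from the project root, it reports which of the two values matches the layout actually in use.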

 

 

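Deploying Nutch search to Tomcat

1. Install the WAR file
     Copy the WAR file $nutch$/nutch-*.war to the directory $tomcat$/webapps/, saving it as nutch.war so that the context path is /nutch. The search home page can then be opened at http://127.0.0.1:8080/nutch

Note: if Tomcat's default JDK is not 1.6, Tomcat reports a "wrong version" exception on startup. In that case, change the JDK that Tomcat uses.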

 

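Making Tomcat use a specific Java SDK version is simple:

1. Edit tomcat/bin/catalina.bat and add set JAVA_HOME=XXXXXX, where XXXXXX is the JDK path, e.g. c:\j2sdk1_6

2. Edit tomcat/bin/setclasspath.bat and add the same set JAVA_HOME=XXXXXX line.

This is a basic problem, but people often forget step 2. The result is that Tomcat is started with the chosen Java version while the JDK it uses is still the system default (the one set in the system-wide JAVA_HOME).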
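A "wrong version" error means that a class was compiled for a newer Java release than the JVM running it. The sketch below reads the version field from a class file header, using the standard layout of a magic number, then a minor version, then a major version. Major version 50 corresponds to Java 6. It also prints which JVM is running. The class name ClassVersion and the optional file argument are assumptions for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

/** Reports the running JVM and the Java release a .class file was compiled for. */
final class ClassVersion {

    /** Extracts the major version from the first 8 bytes of a class file. */
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("not a class file: too short");
        }
        int magic = ((classBytes[0] & 0xFF) << 24) | ((classBytes[1] & 0xFF) << 16)
                  | ((classBytes[2] & 0xFF) << 8) | (classBytes[3] & 0xFF);
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file: bad magic number");
        }
        // Bytes 4-5 hold the minor version, bytes 6-7 the major version.
        return ((classBytes[6] & 0xFF) << 8) | (classBytes[7] & 0xFF);
    }

    /** Maps a major version to the oldest Java release that can load it (50 -> "6"). */
    static String minimumJava(int major) {
        if (major < 45) {
            throw new IllegalArgumentException("unknown class file version " + major);
        }
        int release = major - 44;
        return release <= 4 ? "1." + release : Integer.toString(release);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("java.home    = " + System.getProperty("java.home"));
        System.out.println("java.version = " + System.getProperty("java.version"));
        if (args.length > 0) {
            int major = majorVersion(Files.readAllBytes(Paths.get(args[0])));
            System.out.println(args[0] + " needs Java " + minimumJava(major) + " or newer");
        }
    }
}
```

Running it with the same JAVA_HOME that the Tomcat scripts use, and comparing java.version with the release required by a class extracted from the WAR, shows whether the mismatch is still present.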

 

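2. Specify the search data directory

The search application needs to be told where the data files are.

Assuming the WAR file was saved as nutch.war, restart Tomcat so that it unpacks the archive into the directory $tomcat$/webapps/nutch/

Open the file $tomcat$/webapps/nutch/WEB-INF/classes/nutch-site.xml and add the searcher.dir property. For example, if the data files are stored in /local/nutch/crawl, add:

   <property>
      <name>searcher.dir</name>
      <value>/local/nutch/crawl</value>
   </property>

   search.jsp then knows where the data files are.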
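A wrong searcher.dir is easy to miss, so it helps to read the value back and check that the directory exists. The sketch below parses a configuration document with the JDK's own XML parser and returns the first property with the given name. It is a simplified reading of the file format, not the Hadoop configuration loader. The class name SearcherDirCheck and the inline XML string are illustrative.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Reads one property from nutch-site.xml style XML (illustrative helper). */
final class SearcherDirCheck {

    /** Returns the trimmed value of the first property called wantedName, or null if absent. */
    static String readProperty(String xml, String wantedName) {
        try {
            NodeList properties = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                if (wantedName.equals(childText(property, "name"))) {
                    return childText(property, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration XML", e);
        }
    }

    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<configuration><property>"
                + "<name>searcher.dir</name><value>/local/nutch/crawl</value>"
                + "</property></configuration>";
        String dir = readProperty(xml, "searcher.dir");
        boolean exists = dir != null && new File(dir).isDirectory();
        System.out.println("searcher.dir = " + dir + ", directory exists: " + exists);
    }
}
```

In practice the XML string would be replaced by the contents of the deployed WEB-INF/classes/nutch-site.xml.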

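Comments
#4 攻城小狮 2015-07-13
leeking888 wrote:
org.apache.hadoop.mapred.JobClient
Does Hadoop need to be configured for this to run? Or do I need to install Cygwin?

Having the corresponding JAR on the classpath should be enough. You were looking into this very early on!
#3 淘宝王挺 2012-07-19
I am using the latest version, managed with Maven. What does this error mean?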
12/07/19 16:12:43 WARN mapred.LocalJobRunner: job_local_0001
java.lang.Exception: java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:372)
Caused by: java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:101)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:385)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:239)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:96)
	... 11 more
Caused by: java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:101)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:72)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
	at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
	... 16 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:96)
	... 19 more
Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
	at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:79)
	at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:72)
	at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
	at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:118)
	at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
	... 24 more
12/07/19 16:12:44 INFO mapreduce.Job:  map 0% reduce 0%
12/07/19 16:12:44 INFO mapreduce.Job: Job complete: job_local_0001
12/07/19 16:12:44 INFO mapreduce.Job: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:784)
	at org.apache.nutch.crawl.Injector.inject(Injector.java:257)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
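#2 leeking888 2010-10-20
org.apache.hadoop.mapred.JobClient
Does Hadoop need to be configured for this to run? Or do I need to install Cygwin?
#1 leeking888 2010-10-20
I followed the post but it never runs. I get the error below.
Is it because I have not configured Hadoop?
Exception in thread "main" java.io.IOException: Cannot run program "chmod": CreateProcess error=2, The system cannot find the file specified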
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
	at org.apache.hadoop.util.Shell.run(Shell.java:134)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
	at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:266)
	at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:573)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.crawl.Injector.inject(Injector.java:211)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
	at java.lang.ProcessImpl.create(Native Method)
	at java.lang.ProcessImpl.<init>(ProcessImpl.java:81)
	at java.lang.ProcessImpl.start(ProcessImpl.java:30)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	... 15 more
