
Nutch distributed indexing (crawling)

 

In fact, whole-web crawling differs from intranet crawling in that:

  the former provides a much larger set of entry URLs;

  crawl-urlfilter.txt places no restriction on which URLs may be fetched (when the crawl command is not used);

  and the whole process is handled step by step, which keeps it controllable;

 

In 1.3 there are further differences:

  the default fetcher.parse is false, so every fetch must be followed by a separate parse step; at first I kept failing to see why the tutorial did it that way..

  also, this version no longer ships crawl-urlfilter.txt, which has been replaced by regex-urlfilter.txt.
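If you want fetching to parse in the same step as before, the fetcher.parse property can be switched back on in conf/nutch-site.xml. A sketch of the override, using the standard Hadoop-style configuration format (the description text here is my own, not copied from nutch-default.xml):

```xml
<!-- conf/nutch-site.xml fragment -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, the fetcher parses content as it fetches,
  removing the need for a separate parse step (the 1.3 default is false).</description>
</property>
```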

 

For the differences when recrawling, see "nutch incremental data update" (nutch 数据增量更新).

 

In my view, this process is the deepest demonstration of how Nutch exploits Hadoop. Consider that Hadoop was originally embedded in Nutch as one of its modules. Current versions of Nutch have split Hadoop out, yet for distributed crawling you must put it (config files, jars, etc.) back under Nutch. At first I kept wondering how Nutch combines with Hadoop for distributed crawling; distributed search is somewhat different, because even though it is also distributed, the HDFS it uses is transparent to Nutch.

 

install process:

a. configure Hadoop to run in cluster mode;

b. copy all the Hadoop config files (master and slaves) into the conf dir of each Nutch installation respectively;

c. execute the crawl steps (you SHOULD use the individual commands INSTEAD OF 'crawl', as 'crawl' is usually meant for intranet crawling)
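A minimal sketch of step c, assuming a Nutch 1.x layout with bin/nutch on the master. The command names (inject, generate, fetch, parse, updatedb, invertlinks, index, dedup) match the job listing below, but the exact paths and the hard-coded segment name are taken from this post's run; on a real run you would look up the newest segment with `hadoop fs -ls`. The script is written as a dry run that prints each command instead of executing it:

```shell
#!/bin/sh
# Dry-run sketch: NUTCH echoes each command; set NUTCH="bin/nutch" to run for real.
NUTCH="echo bin/nutch"
CRAWLDB=crawl/dist/crawldb
SEGDIR=crawl/dist/segments
SEG=$SEGDIR/20111107205746    # segment name taken from the listing below

PLAN=$(
  $NUTCH inject $CRAWLDB crawl-url                    # seed the crawldb with the url dir
  $NUTCH generate $CRAWLDB $SEGDIR                    # select a fetch list + partition it
  $NUTCH fetch $SEG                                   # fetch; fetcher.parse=false, so...
  $NUTCH parse $SEG                                   # ...a separate parse step is needed
  $NUTCH updatedb $CRAWLDB $SEG                       # merge fetch results into the crawldb
  $NUTCH invertlinks crawl/dist/linkdb -dir $SEGDIR   # build the linkdb
  $NUTCH index crawl/dist/indexes $CRAWLDB crawl/dist/linkdb $SEG
  $NUTCH dedup crawl/dist/indexes                     # the three dedup jobs in the listing
)
printf '%s\n' "$PLAN"
```

Each echoed line is one MapReduce-backed command; together they correspond to the jobs the job tracker shows below.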

 

here are the jobs belonging to this step:

Available Jobs
Job tracker host | Start time | Job ID | Name | User
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0001 | inject crawl-url | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0002 | crawldb crawl/dist/crawldb | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0003 | generate: select from crawl/dist/crawldb | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0004 | generate: partition crawl/dist/segments/2011110720 | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0005 | fetch crawl/dist/segments/20111107205746 | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0006 | crawldb crawl/dist/crawldb (update db actually) | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0007 | linkdb crawl/dist/linkdb | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0008 | index-lucene crawl/dist/indexes | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0009 | dedup 1: urls by time | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0010 | dedup 2: content by hash | hadoop
master | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0011 | dedup 3: delete from index(es) | hadoop

 

 

* jobs above with the same color belong to ONE step of the crawl command;

* job 2: takes the sort job's output as input (merging it with the existing 'current' data) to generate a new crawldb; so there can be duplicate URLs, which are presumably deduplicated in the reduce phase?

* job 4: since there are multiple crawlers, partitioning is used to divide the URLs (by host by default), so that the same URLs are not fetched redundantly by one machine;
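Job 4's by-host partitioning can be pictured with a small shell sketch (my own illustration, not Nutch's actual implementation): hash each URL's host and take it modulo the number of fetchers, so that every URL of a given host lands on the same machine.

```shell
#!/bin/sh
# Illustration only: assign each URL to one of NUM_FETCHERS partitions by host,
# mirroring the idea of Nutch's default host-based partitioning.
NUM_FETCHERS=2

partition_of() {
  host=$(printf '%s' "$1" | awk -F/ '{print $3}')        # host part of the URL
  sum=$(printf '%s' "$host" | cksum | awk '{print $1}')  # stable numeric hash of the host
  echo $((sum % NUM_FETCHERS))
}

partition_of "http://example.com/a.html"
partition_of "http://example.com/b.html"   # same host => same partition
partition_of "http://example.org/c.html"
```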

 

 

here is the resulting output:

hadoop@leibnitz-laptop:/xxxxxxxxx$ hadoop fs -lsr crawl/dist/
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00000
-rw-r--r--   2 hadoop supergroup       6240 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00000/data
-rw-r--r--   2 hadoop supergroup        215 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00001
-rw-r--r--   2 hadoop supergroup       7779 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00001/data
-rw-r--r--   2 hadoop supergroup        218 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:07 /user/hadoop/crawl/dist/index
-rw-r--r--   2 hadoop supergroup        369 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.fdt
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.fdx
-rw-r--r--   2 hadoop supergroup         71 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.fnm
-rw-r--r--   2 hadoop supergroup       1836 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.frq
-rw-r--r--   2 hadoop supergroup         14 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.nrm
-rw-r--r--   2 hadoop supergroup       4922 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.prx
-rw-r--r--   2 hadoop supergroup        171 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.tii
-rw-r--r--   2 hadoop supergroup      11234 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.tis
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:07 /user/hadoop/crawl/dist/index/segments.gen
-rw-r--r--   2 hadoop supergroup        284 2011-11-07 21:07 /user/hadoop/crawl/dist/index/segments_2

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000
-rw-r--r--   2 hadoop supergroup        223 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.fdt
-rw-r--r--   2 hadoop supergroup         12 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.fdx
-rw-r--r--   2 hadoop supergroup         71 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.fnm
-rw-r--r--   2 hadoop supergroup        991 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.frq
-rw-r--r--   2 hadoop supergroup          9 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.nrm
-rw-r--r--   2 hadoop supergroup       2813 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.prx
-rw-r--r--   2 hadoop supergroup        100 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.tii
-rw-r--r--   2 hadoop supergroup       5169 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.tis
-rw-r--r--   2 hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/index.done
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/segments.gen
-rw-r--r--   2 hadoop supergroup        240 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/segments_2
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001
-rw-r--r--   2 hadoop supergroup        150 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.fdt
-rw-r--r--   2 hadoop supergroup         12 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.fdx
-rw-r--r--   2 hadoop supergroup         71 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.fnm
-rw-r--r--   2 hadoop supergroup        845 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.frq
-rw-r--r--   2 hadoop supergroup          9 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.nrm
-rw-r--r--   2 hadoop supergroup       2109 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.prx
-rw-r--r--   2 hadoop supergroup        106 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.tii
-rw-r--r--   2 hadoop supergroup       6226 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.tis
-rw-r--r--   2 hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/index.done
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/segments.gen
-rw-r--r--   2 hadoop supergroup        240 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/segments_2

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00000
-rw-r--r--   2 hadoop supergroup       8131 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00000/data
-rw-r--r--   2 hadoop supergroup        215 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00001
-rw-r--r--   2 hadoop supergroup      11240 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00001/data
-rw-r--r--   2 hadoop supergroup        218 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00000
-rw-r--r--   2 hadoop supergroup      13958 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00001
-rw-r--r--   2 hadoop supergroup       6908 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00000
-rw-r--r--   2 hadoop supergroup        255 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00001
-rw-r--r--   2 hadoop supergroup        266 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:58 /user/hadoop/crawl/dist/segments/20111107205746/crawl_generate
-rw-r--r--   2 hadoop supergroup        255 2011-11-07 20:58 /user/hadoop/crawl/dist/segments/20111107205746/crawl_generate/part-00000
-rw-r--r--   2 hadoop supergroup         86 2011-11-07 20:58 /user/hadoop/crawl/dist/segments/20111107205746/crawl_generate/part-00001

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_parse
-rw-r--r--   2 hadoop supergroup       6819 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_parse/part-00000
-rw-r--r--   2 hadoop supergroup       8302 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_parse/part-00001

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00000
-rw-r--r--   2 hadoop supergroup       2995 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00001
-rw-r--r--   2 hadoop supergroup       1917 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00001/index

drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00000
-rw-r--r--   2 hadoop supergroup       3669 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00001
-rw-r--r--   2 hadoop supergroup       2770 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00001/index

 

From the listing above, every directory except the merged index exists in two parts, corresponding to the two crawlers.

Using these two indexes, distributed search can be implemented.
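To wire the two indexes into Nutch 1.x distributed search, each node serves its local index with `bin/nutch server`, and the front end lists the servers in a search-servers.txt file inside the directory that searcher.dir points to. A sketch, where the hostnames, port, and index path are placeholders for this setup:

```shell
#!/bin/sh
# On each search node (run for real there; shown as a comment here):
#   bin/nutch server 9999 /path/to/local/crawl
# On the front end, create <searcher.dir>/search-servers.txt with one
# "host port" line per search node:
cat > search-servers.txt <<'EOF'
master 9999
slave1 9999
EOF
```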


Remaining question: why do none of the step-by-step guides found online use the dedup command?

  From "nutch incremental data update" (nutch 数据增量更新) we know that distributed crawling should also use the dedup command.

 

see also

http://wiki.apache.org/nutch/NutchTutorial

http://mr-lonely-hp.iteye.com/blog/1075395
