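Just as we were worrying about Nutch's architecture, the Nutch developers delivered nutchbase. My own simple tests show that it runs on hadoop 0.20.1 and hbase 0.20.2 after minor modifications.
Its advantage is obvious: a sound architecture.
Here is how the developer describes it, quoted from JIRA: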
http://issues.apache.org/jira/browse/NUTCH-650
A) Why integrate with hbase?
All your data in a central location
No more segment/crawldb/linkdb merges.
No more "missing" data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata. This will no longer be necessary with hbase integration.
A much simpler data model. If you want to update a small part in a single record, now you have to write a MR job that reads the relevant directory, change the single record, remove old directory and rename new directory. With hbase, you can just update that record. Also, hbase gives us access to Yahoo! Pig, which I think, with its SQL-ish language may be easier for people to understand and use.
B) Design
Design is actually rather straightforward.
We store everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in hbase. I have written a small utility class that creates "webtable" with necessary columns.
So now most jobs just take the name of the table as input.
There are two main classes for interfacing with hbase. ImmutableRowPart wraps around a RowResult and has helper getters (getStatus(), getContent(), etc.). RowPart is similar to ImmutableRowPart but also has setters. The idea is that RowPart also wraps RowResult but also keeps a list of updates done to that row. So when getSomething is called, it first checks if Something is already updated (if so then returns the updated version) or returns from RowResult. RowPart can also create a BatchUpdate from its list of updates.
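The read-through behaviour described above (check pending updates first, then fall back to the stored row) can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain Map stands in for HBase's RowResult, toBatch() stands in for building a BatchUpdate, the class names are hypothetical, and the "content:" column name is assumed (only "status:" is mentioned in the quoted text).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Read-only view of a row. The Map is a stand-in for HBase's RowResult. */
class ImmutableRowPartSketch {
    private final Map<String, String> stored;

    ImmutableRowPartSketch(Map<String, String> stored) {
        // Defensive copy so later changes to the caller's map do not leak in.
        this.stored = Collections.unmodifiableMap(new LinkedHashMap<>(stored));
    }

    String get(String column) {
        return stored.get(column);
    }

    String getStatus() { return get("status:"); }
    String getContent() { return get("content:"); } // column name is an assumption
}

/** Adds setters and remembers every update made to the row. */
class RowPartSketch extends ImmutableRowPartSketch {
    private final Map<String, String> updates = new LinkedHashMap<>();

    RowPartSketch(Map<String, String> stored) {
        super(stored);
    }

    @Override
    String get(String column) {
        // A pending update wins; otherwise fall back to the stored value.
        return updates.containsKey(column) ? updates.get(column) : super.get(column);
    }

    void setStatus(String status) { updates.put("status:", status); }
    void setContent(String content) { updates.put("content:", content); }

    /** Stand-in for creating a BatchUpdate: returns only the changed columns. */
    Map<String, String> toBatch() {
        return Collections.unmodifiableMap(new LinkedHashMap<>(updates));
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("status:", "unfetched");
        RowPartSketch row = new RowPartSketch(stored);
        row.setStatus("fetched");
        System.out.println(row.getStatus());
        System.out.println(row.toBatch());
    }
}
```

Keeping the updates separate from the stored values means only the changed columns need to be written back, which is presumably why RowPart tracks them as a list.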
URLs are stored in reversed host order. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the same tld/host/domain are stored closer to each other. TableUtil has methods for reversing and unreversing URLs.
CrawlDatum statuses are simplified. Since everything is in a central location now, there is no point in having separate DB and FETCH statuses.
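A minimal sketch of the reversal, written only to reproduce the example above. It is not the actual TableUtil code: the class and method names are hypothetical, and it assumes the key layout is reversed host, then scheme, then optional port, then path and query. It drops any URL fragment, normalizes an empty path to "/", and does not handle IPv6 literal hosts.

```java
import java.net.URI;

/** Illustrative stand-in for the URL helpers; not Nutch's actual TableUtil. */
final class UrlReverseSketch {

    /** Reverses the dot-separated labels of a host: bar.foo.com -> com.foo.bar. */
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    /** Builds "<reversed host>:<scheme>[:<port>]<path>[?<query>]". */
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("URL needs a scheme and a host: " + url);
        }
        StringBuilder sb = new StringBuilder(reverseHost(uri.getHost()));
        sb.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) sb.append(':').append(uri.getPort());
        String path = uri.getRawPath();
        // An empty path becomes "/" so that unreverseUrl can find where the prefix ends.
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
        return sb.toString();
    }

    /** Inverse of reverseUrl for keys produced by it. */
    static String unreverseUrl(String reversed) {
        int pathStart = reversed.indexOf('/');
        if (pathStart == -1) pathStart = reversed.length();
        String[] prefix = reversed.substring(0, pathStart).split(":");
        if (prefix.length < 2 || prefix.length > 3) {
            throw new IllegalArgumentException("Not a reversed URL: " + reversed);
        }
        StringBuilder sb = new StringBuilder(prefix[1]).append("://").append(reverseHost(prefix[0]));
        if (prefix.length == 3) sb.append(':').append(prefix[2]);
        return sb.append(reversed.substring(pathStart)).toString();
    }

    public static void main(String[] args) {
        String key = reverseUrl("http://bar.foo.com:8983/to/index.html?a=b");
        System.out.println(key);
        System.out.println(unreverseUrl(key));
    }
}
```

Because row keys are sorted lexicographically, keys built this way place all pages of one host, and all hosts of one domain, next to each other in the table.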
Jobs:
Each job marks rows so that the next job knows which rows to read. For example, if GeneratorHbase decides that a URL should be generated it marks the URL with a TMP_FETCH_MARK (Marking a url is simply creating a special metadata field.) When FetcherHbase runs, it skips over anything without this special mark.
InjectorHbase: First, a job runs where injected urls are marked. Then in the next job, if a row has the mark but nothing else (here, I assumed that if a row has "status:" column, that it already exists), InjectorHbase initializes the row.
GeneratorHbase: Supports max-per-host configuration and topN. Marks generated urls with a marker.
FetcherHbase: Very similar to original Fetcher. Marks urls successfully fetched. Skips over URLs not marked by GeneratorHbase.
ParseTable: Similar to original Parser. Outlinks are stored "outlinks:<fromUrl>" -> "anchor".
UpdateTable: Does updatedb's and invertlink's job. Also clears any markers.
IndexerHbase: Indexes the entire table. Skips over URLs not parsed successfully.
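The marker hand-off between jobs can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: a Map of Maps stands in for the webtable, the column name "metadata:TMP_FETCH_MARK" is hypothetical (the quoted text only names the mark itself), and rows are selected in key order rather than by score, which is probably not how GeneratorHbase picks its topN.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of job-to-job markers; not the nutchbase implementation. */
final class MarkerPipelineSketch {
    // Hypothetical column used to hold the mark.
    static final String FETCH_MARK = "metadata:TMP_FETCH_MARK";

    /** Row keys are reversed URLs, so the host is everything before the first ':'. */
    static String hostOf(String rowKey) {
        int colon = rowKey.indexOf(':');
        return colon == -1 ? rowKey : rowKey.substring(0, colon);
    }

    /** Generator step: mark at most topN rows, and at most maxPerHost rows per host. */
    static int generate(Map<String, Map<String, String>> table, int topN, int maxPerHost) {
        Map<String, Integer> perHost = new HashMap<>();
        int marked = 0;
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            if (marked >= topN) break;
            String host = hostOf(row.getKey());
            int seen = perHost.getOrDefault(host, 0);
            if (seen >= maxPerHost) continue; // host quota used up
            row.getValue().put(FETCH_MARK, "1");
            perHost.put(host, seen + 1);
            marked++;
        }
        return marked;
    }

    /** Fetcher step: skip every row that does not carry the mark. */
    static List<String> selectForFetch(Map<String, Map<String, String>> table) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            if (row.getValue().containsKey(FETCH_MARK)) selected.add(row.getKey());
        }
        return selected;
    }

    /** Update step: clear the marks so the next cycle starts clean. */
    static void clearMarks(Map<String, Map<String, String>> table) {
        for (Map<String, String> columns : table.values()) {
            columns.remove(FETCH_MARK);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table = new TreeMap<>();
        table.put("com.foo.a:http/1", new HashMap<>());
        table.put("com.foo.a:http/2", new HashMap<>());
        table.put("org.bar:http/1", new HashMap<>());
        generate(table, 10, 1);
        System.out.println(selectForFetch(table));
        clearMarks(table);
        System.out.println(selectForFetch(table));
    }
}
```

Since each job only reads rows carrying the previous job's mark, no data has to be copied between separate structures to hand work from one job to the next.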