This article describes how to recrawl with the Nutch open-source search engine on the Hadoop distributed computing framework, that is, how to solve Nutch's incremental indexing problem. None of the articles I found through Google explained the whole process in detail, so after a round of painful research I finally arrived at a working solution.
First, write a recrawl shell script that matches your own Nutch deployment. Note: if the index lives on the local filesystem, the script calls ordinary shell commands such as rm and cp; if the index lives on HDFS, it must call hadoop dfs -rmr and hadoop dfs -cp instead, and it should also check the return status of those commands. Once the script is written, run it directly, or schedule it in crontab.
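The substitutions amount to something like this (a minimal sketch using the paths from my deployment; the explicit error check is the part the wiki script lacks):
# Local index: ordinary shell commands are enough.
rm -rf crawl/newindexes
cp -R crawl/mergesegs_dir/* crawl/segments

# Index on HDFS: the same steps go through the hadoop dfs client instead,
# and the exit status should be checked, since a silently failed rm or cp
# leaves the crawl directories in an inconsistent state.
hadoop=/nutch/search/bin/hadoop
if ! $hadoop dfs -rmr /user/nutch/crawl10/newindexes; then
  echo "dfs -rmr failed, aborting recrawl" >&2
  exit 1
fi
$hadoop dfs -cp /user/nutch/crawl10/mergesegs_dir/* /user/nutch/crawl10/segments

# To schedule the finished script, a crontab entry like this runs it nightly at 2am:
# 0 2 * * * /nutch/search/bin/recrawl /nutch/tomcat/webapps/cse /user/nutch/crawl10 10 31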
A page on the Nutch wiki provides such a shell script:
http://wiki.apache.org/nutch/IntranetRecrawl#head-93eea6620f57b24dbe3591c293aead539a017ec7
I downloaded it, happily dropped it into nutch/bin, and ran:
/nutch/search/bin/recrawl /nutch/tomcat/webapps/cse /user/nutch/crawl10 10 31
The parameters are: the Tomcat servlet home, the crawl directory on HDFS, a crawl depth of 10, and an adddays of 31.
The run reported errors, roughly that the mergesegs_dir directory could not be found, but the MapReduce jobs kept running, so I did not pay much attention and let it finish. When it finished, the index had not grown at all, and an extra mergesegs_dir had appeared under the nutch directory. Only then did I examine recrawl.sh and realize the wiki script was written for a local index, so I set about modifying it, replacing the rm and cp commands with their hadoop equivalents.
Running the same command again, Hadoop failed right at the generate step and could not continue. Fortunately Hadoop's logs are very detailed: under the failed job there were piles of "Too many open files" exceptions. After more googling, it turned out that on the datanode side the open-file limits in /etc/security/limits.conf need to be raised by adding:
nutch soft nofile 4096
nutch hard nofile 63536
nutch soft nproc 2047
nutch hard nproc 16384
After adjusting the limits, Hadoop must be restarted. This step is important; otherwise the same error will recur.
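A quick way to confirm the change took effect and bounce the cluster (a sketch: paths follow my /nutch/search layout, and stop-all.sh/start-all.sh are the stock Hadoop control scripts shipped in the same bin directory):
# The soft limit should now report 4096 for the nutch user:
su - nutch -c 'ulimit -n'

# Restart all Hadoop daemons so they pick up the new limits:
/nutch/search/bin/stop-all.sh
/nutch/search/bin/start-all.sh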
With all that done, I ran the command again and everything worked.
Finally, here is my modified recrawl.sh. My shell skills are not great, so it is rough, but it gets the job done, haha.
#!/bin/bash
# Nutch recrawl script.
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
#
# The script merges the new segments all into one segment to prevent redundant
# data. However, if your crawl/segments directory is becoming very large, I
# would suggest you delete it completely and generate a new crawl. This probably
# needs to be done every 6 months.
#
# Modified by Matthew Holt
# mholt at elon dot edu
if [ -n "$1" ]
then
tomcat_dir=$1
else
echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomc
at/webapps/ROOT)"
echo "crawl_dir - Path of the directory the crawl is located in. (full path, i
e: /home/user/nutch/crawl)"
echo "depth - The link depth from the root page that should be crawled."
echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n
one]"
echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
exit 1
fi
if [ -n "$2" ]
then
crawl_dir=$2
else
echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomc
at/webapps/ROOT)"
echo "crawl_dir - Path of the directory the crawl is located in. (full path, i
e: /home/user/nutch/crawl)"
echo "depth - The link depth from the root page that should be crawled."
echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n
one]"
echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
exit 1
fi
if [ -n "$3" ]
then
depth=$3
else
echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomc
at/webapps/ROOT)"
echo "crawl_dir - Path of the directory the crawl is located in. (full path, i
e: /home/user/nutch/crawl)"
echo "depth - The link depth from the root page that should be crawled."
echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n
one]"
echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
exit 1
fi
if [ -n "$4" ]
then
adddays=$4
else
echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)"
echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)"
echo "depth - The link depth from the root page that should be crawled."
echo "adddays - Advance the clock # of days for fetchlist generation. [0 for n
one]"
echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
exit 1
fi
if [ -n "$5" ]
then
topn="-topN $5"
else
topn=""
fi
#Sets the path to bin
nutch_dir=`dirname $0`
echo "nutch directory :$nutch_dir"
# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index
hadoop="/nutch/search/bin/hadoop" # hadoop command
# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
$nutch_dir/nutch generate $webdb_dir $segments_dir $topn -adddays $adddays
#segment=`ls -d $segments_dir/* | tail -1`
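# NOTE: the old "hadoop dfs -ls" output appears to append " <dir>" (6
# characters) after each directory path, so the expr calls below cut the
# last 6 characters to recover the bare segment path.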
segment_tmp=`$hadoop dfs -ls $segments_dir | tail -1`
segment_tmp_len=`expr length "$segment_tmp"`
segment_tmp_end=`expr $segment_tmp_len - 6`
segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
echo "fetch update segment :$segment"
echo "fetch update segment_tmp :$segment_tmp"
$nutch_dir/nutch fetch $segment
$nutch_dir/nutch updatedb $webdb_dir $segment
done
# Merge segments and cleanup unused segments
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir
#for segment in `ls -d $segments_dir/* | tail -$depth`
for segment_tmp in `$hadoop dfs -ls $segments_dir | tail -$depth`
do
segment_tmp_len=`expr length "$segment_tmp"`
segment_tmp_end=`expr $segment_tmp_len - 6`
segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
echo "Removing Temporary Segment: $segment"
#rm -rf $segment
$hadoop dfs -rmr $segment
done
#cp -R $mergesegs_dir/* $segments_dir
#rm -rf $mergesegs_dir
$hadoop dfs -cp $mergesegs_dir/* $segments_dir
$hadoop dfs -rmr $mergesegs_dir
# Update segments
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
# Index segments
new_indexes=$crawl_dir/newindexes
#segment=`ls -d $segments_dir/* | tail -1`
segment_tmp=`$hadoop dfs -ls $segments_dir | tail -1`
segment_tmp_len=`expr length "$segment_tmp"`
segment_tmp_end=`expr $segment_tmp_len - 6`
segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
echo "Index segment :$segment"
$nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segment
# De-duplicate indexes
$nutch_dir/nutch dedup $new_indexes
# Merge indexes
$nutch_dir/nutch merge $index_dir $new_indexes
# Tell Tomcat to reload index
touch $tomcat_dir/WEB-INF/web.xml
# Clean up
#rm -rf $new_indexes
$hadoop dfs -rmr $new_indexes
echo "FINISHED: Recrawl completed. To conserve disk space, I would suggest"
echo " that the crawl directory be deleted once every 6 months (or more"
echo " frequent depending on disk constraints) and a new crawl generated."
Comments
#6
lovepoem
2011-06-15
Can it really be incremental? Doesn't it still enumerate all the URLs and compare them against the previous crawl? Does that count as incremental indexing?
#5
freespace
2010-08-25
So far I am only using a local index; I have not tried the distributed setup yet.
#4
SeanHe
2009-11-09
libinwalan wrote:
Also, the official site says "Setting adddays at 31 causes all pages will to be recrawled." That confuses me even more. I assumed it meant "within so many days", so why does 31 mean all of them? I just don't get it, sigh.

The adddays argument is useful for forcing pages to be retrieved even if they are not yet due to be re-fetched. The page re-fetch interval in Nutch is controlled by the configuration property db.default.fetch.interval, and defaults to 30 days. The adddays argument can be used to advance the clock for fetchlist generation (but not for calculating the next fetch time), thereby fetching pages early.
Put simply, "adddays" is added to the current time when generating the fetchlist, but is not used when calculating the next fetch time. By default Nutch only re-fetches a page after 30 days; the value is configured in nutch-default.xml:
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page.</description>
</property>
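Concretely: since the default interval is 30 days, advancing the generator's clock by 31 days makes every previously fetched page look overdue, which is why adddays=31 recrawls everything. A hypothetical illustration (paths are placeholders, assuming all pages were fetched within the last 30 days):
# Nothing is due yet, so this generates an empty fetchlist:
bin/nutch generate crawl/crawldb crawl/segments -adddays 0
# Advancing the clock past the 30-day interval marks every page as due:
bin/nutch generate crawl/crawldb crawl/segments -adddays 31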
#3
libinwalan
2009-02-27
Also, the official site says "Setting adddays at 31 causes all pages will to be recrawled." That confuses me even more. I assumed it meant "within so many days", so why does 31 mean all of them? I just don't get it, sigh.
#2
libinwalan
2009-02-27
Hello. The adddays parameter is glossed as "the number of days to add to the current time", but what does it actually mean? I have never quite understood it. Could you explain? Thanks.
#1
a496649849
2009-02-27
May I ask why running the script above keeps failing with:
Fetcher: java.io.IOException: Segment already fetched!at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(Fetcher
Thanks!!