org.apache.nutch.crawl.Injector is the URL injector and the entry point of a Nutch crawl: it takes a flat file of seed URLs and injects them into the crawl db.
The code is as follows:
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.crawl;

import java.io.*;
import java.text.SimpleDateFormat;
import java.util.*;

// Commons Logging imports
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

import org.apache.nutch.net.*;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.ScoringFilters;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;
import org.apache.nutch.util.TimingUtil;

/** This class takes a flat file of URLs and adds them to the list of pages to be
 * crawled. Useful for bootstrapping the system.
 * The URL files contain one URL per line, optionally followed by custom metadata
 * separated by tabs with the metadata key separated from the corresponding value by '='. <br>
 * Note that some metadata keys are reserved : <br>
 * - <i>nutch.score</i> : allows to set a custom score for a specific URL <br>
 * - <i>nutch.fetchInterval</i> : allows to set a custom fetch interval for a specific URL <br>
 * e.g. http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
 **/
public class Injector extends Configured implements Tool {
  public static final Logger LOG = LoggerFactory.getLogger(Injector.class);

  /** metadata key reserved for setting a custom score for a specific URL */
  public static String nutchScoreMDName = "nutch.score";
  /** metadata key reserved for setting a custom fetchInterval for a specific URL */
  public static String nutchFetchIntervalMDName = "nutch.fetchInterval";

  /** Normalize and filter injected urls. */
  public static class InjectMapper implements Mapper<WritableComparable, Text, Text, CrawlDatum> {
    private URLNormalizers urlNormalizers;
    private int interval;
    private float scoreInjected;
    private JobConf jobConf;
    private URLFilters filters;
    private ScoringFilters scfilters;
    private long curTime;

    public void configure(JobConf job) {
      this.jobConf = job;
      urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_INJECT);
      interval = jobConf.getInt("db.fetch.interval.default", 2592000);
      filters = new URLFilters(jobConf);
      scfilters = new ScoringFilters(jobConf);
      scoreInjected = jobConf.getFloat("db.score.injected", 1.0f);
      curTime = job.getLong("injector.current.time", System.currentTimeMillis());
    }

    public void close() {}

    public void map(WritableComparable key, Text value,
                    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
      String url = value.toString();              // value is line of text

      // ignore comment lines that start with '#'
      if (url != null && url.trim().startsWith("#")) {
        return;
      }

      // if tabs : metadata that could be stored
      // must be name=value and separated by \t
      float customScore = -1f;
      int customInterval = interval;
      Map<String, String> metadata = new TreeMap<String, String>();

      // split off optional per-URL metadata (tab-separated name=value pairs)
      if (url.indexOf("\t") != -1) {
        String[] splits = url.split("\t");
        url = splits[0];
        for (int s = 1; s < splits.length; s++) {
          // find separation between name and value
          int indexEquals = splits[s].indexOf("=");
          if (indexEquals == -1) {
            // skip anything without a =
            continue;
          }
          String metaname = splits[s].substring(0, indexEquals);
          String metavalue = splits[s].substring(indexEquals + 1);
          if (metaname.equals(nutchScoreMDName)) {
            try { customScore = Float.parseFloat(metavalue); } catch (NumberFormatException nfe) {}
          } else if (metaname.equals(nutchFetchIntervalMDName)) {
            try { customInterval = Integer.parseInt(metavalue); } catch (NumberFormatException nfe) {}
          } else metadata.put(metaname, metavalue);
        }
      }

      try {
        // normalize the URL (scope "inject"); see http://peigang.iteye.com/blog/1468984 for URLNormalizers details
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        // filter the url; see http://peigang.iteye.com/blog/1469108 for URLFilters details
        url = filters.filter(url);
      } catch (Exception e) {
        if (LOG.isWarnEnabled()) { LOG.warn("Skipping " + url + ":" + e); }
        url = null;
      }

      if (url != null) {                          // if it passes
        value.set(url);                           // collect it
        CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, customInterval);
        datum.setFetchTime(curTime);

        // now add the metadata
        Iterator<String> keysIter = metadata.keySet().iterator();
        while (keysIter.hasNext()) {
          String keymd = keysIter.next();
          String valuemd = metadata.get(keymd);
          datum.getMetaData().put(new Text(keymd), new Text(valuemd));
        }

        if (customScore != -1) datum.setScore(customScore);
        else datum.setScore(scoreInjected);

        try {
          // let scoring filter plugins adjust the injected score; see http://peigang.iteye.com/blog/1469143
          scfilters.injectedScore(value, datum);
        } catch (ScoringFilterException e) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Cannot filter injected score for url " + url +
                     ", using default (" + e.getMessage() + ")");
          }
        }

        // emit <url, CrawlDatum>
        output.collect(value, datum);
      }
    }
  }

  /** Combine multiple new entries for a url. */
  public static class InjectReducer implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {
    public void configure(JobConf job) {}
    public void close() {}

    private CrawlDatum old = new CrawlDatum();
    private CrawlDatum injected = new CrawlDatum();

    public void reduce(Text key, Iterator<CrawlDatum> values,
                       OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
      boolean oldSet = false;
      // walk all values for this URL and set the CrawlDatum status
      while (values.hasNext()) {
        CrawlDatum val = values.next();
        if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
          injected.set(val);
          injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        } else {
          old.set(val);
          oldSet = true;
        }
      }
      CrawlDatum res = null;
      if (oldSet) res = old;                      // don't overwrite existing value
      else res = injected;
      output.collect(key, res);
    }
  }

  public Injector() {}

  public Injector(Configuration conf) {
    setConf(conf);
  }

  /**
   * Initialize the crawl database.
   * @param crawlDb the crawl db directory
   * @param urlDir  directory of flat files listing the URLs to crawl
   * @throws IOException
   */
  public void inject(Path crawlDb, Path urlDir) throws IOException {
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    long start = System.currentTimeMillis();
    if (LOG.isInfoEnabled()) {
      LOG.info("Injector: starting at " + sdf.format(start));
      LOG.info("Injector: crawlDb: " + crawlDb);
      LOG.info("Injector: urlDir: " + urlDir);
    }

    // temporary directory: rooted at mapred.temp.dir if set, otherwise at "." (the current directory)
    Path tempDir =
        new Path(getConf().get("mapred.temp.dir", ".") +
                 "/inject-temp-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    // map text input file to a <url,CrawlDatum> file
    if (LOG.isInfoEnabled()) {
      LOG.info("Injector: Converting injected urls to crawl db entries.");
    }
    JobConf sortJob = new NutchJob(getConf());                 // create the job
    sortJob.setJobName("inject " + urlDir);                    // job name
    FileInputFormat.addInputPath(sortJob, urlDir);             // input path (seed URL directory)
    sortJob.setMapperClass(InjectMapper.class);                // mapper class
    FileOutputFormat.setOutputPath(sortJob, tempDir);          // map output path
    sortJob.setOutputFormat(SequenceFileOutputFormat.class);   // output format
    sortJob.setOutputKeyClass(Text.class);                     // map output key type
    sortJob.setOutputValueClass(CrawlDatum.class);             // map output value type
    sortJob.setLong("injector.current.time", System.currentTimeMillis());
    JobClient.runJob(sortJob);

    /*
     * sortJob reads the seed URLs and writes the converted entries to tempDir;
     * mergeJob then reads tempDir and merges them into the existing crawl db.
     */
    // merge with existing crawl db
    if (LOG.isInfoEnabled()) {
      LOG.info("Injector: Merging injected urls into crawl db.");
    }
    JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
    FileInputFormat.addInputPath(mergeJob, tempDir);           // read the sortJob output
    mergeJob.setReducerClass(InjectReducer.class);             // reducer class
    JobClient.runJob(mergeJob);
    CrawlDb.install(mergeJob, crawlDb);

    // clean up: delete the temporary directory once the merge is done
    FileSystem fs = FileSystem.get(getConf());
    fs.delete(tempDir, true);
    long end = System.currentTimeMillis();
    LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: " +
             TimingUtil.elapsedTime(start, end));
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
    System.exit(res);
  }

  public int run(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: Injector <crawldb> <url_dir>");
      return -1;
    }
    try {
      inject(new Path(args[0]), new Path(args[1]));
      return 0;
    } catch (Exception e) {
      LOG.error("Injector: " + StringUtils.stringifyException(e));
      return -1;
    }
  }
}
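To make the metadata handling in InjectMapper concrete, here is what a seed URL file can look like. Each line holds one URL, optionally followed by tab-separated name=value pairs ("\t" below stands for a literal tab); nutch.score and nutch.fetchInterval are the reserved keys handled above, while userType is just an arbitrary example key that ends up as CrawlDatum metadata:

    # lines starting with '#' are ignored by InjectMapper
    http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
    http://www.apache.org/

From the command line the injector is normally driven through the Nutch script, e.g. bin/nutch inject crawl/crawldb urls (the crawl db and seed-directory paths here are only examples), which ends up in the run() method shown above.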
The public void inject(Path crawlDb, Path urlDir) method is what the Crawl class calls when it kicks off a full crawl.
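For illustration, a minimal standalone driver doing the same thing against the Nutch 1.4 API might look like the sketch below; the class name InjectDriver and the two paths are made up for this example, only the Injector and NutchConfiguration calls come from the code above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.nutch.crawl.Injector;
    import org.apache.nutch.util.NutchConfiguration;

    /** Minimal sketch of driving the injector programmatically (hypothetical driver class). */
    public class InjectDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        Path crawlDb = new Path("crawl/crawldb");   // example crawl db path
        Path rootUrlDir = new Path("urls");         // example seed directory
        // runs the sort job and the merge job described above
        new Injector(conf).inject(crawlDb, rootUrlDir);
      }
    }

After inject() returns, the crawl db holds one STATUS_DB_UNFETCHED entry per accepted seed URL, ready for the generate/fetch steps that follow in the crawl cycle.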