From the Crawl walkthrough in the previous post, we saw that a crawl runs in discrete stages, one after another. So let's start with the Injector (org.apache.nutch.crawl.Injector):
```java
// initialize crawlDb
injector.inject(crawlDb, rootUrlDir);
```
As the code makes clear, Nutch is built on top of Hadoop, though it still uses the old MapReduce API.
The Injector does two things:
1. Normalize and filter the URLs in the seed file, writing the result to a temporary directory.
2. Merge that result with the existing crawldb/current, producing a new CrawlDb that replaces the old one.
```java
public void inject(Path crawlDb, Path urlDir) throws IOException {
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  long start = System.currentTimeMillis();
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: starting at " + sdf.format(start));
    LOG.info("Injector: crawlDb: " + crawlDb);
    LOG.info("Injector: urlDir: " + urlDir);
  }

  // create a temporary directory for the map-reduce output
  Path tempDir = new Path(getConf().get("mapred.temp.dir", ".")
      + "/inject-temp-"
      + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  // map text input file to a <url,CrawlDatum> file
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Converting injected urls to crawl db entries.");
  }
  JobConf sortJob = new NutchJob(getConf());
  sortJob.setJobName("inject " + urlDir);
  FileInputFormat.addInputPath(sortJob, urlDir);
  sortJob.setMapperClass(InjectMapper.class);

  FileOutputFormat.setOutputPath(sortJob, tempDir);
  sortJob.setOutputFormat(SequenceFileOutputFormat.class);
  sortJob.setOutputKeyClass(Text.class);
  // the output value type is CrawlDatum
  sortJob.setOutputValueClass(CrawlDatum.class);
  sortJob.setLong("injector.current.time", System.currentTimeMillis());
  // submit the job
  RunningJob mapJob = JobClient.runJob(sortJob);

  long urlsInjected = mapJob.getCounters()
      .findCounter("injector", "urls_injected").getValue();
  long urlsFiltered = mapJob.getCounters()
      .findCounter("injector", "urls_filtered").getValue();
  LOG.info("Injector: total number of urls rejected by filters: " + urlsFiltered);
  LOG.info("Injector: total number of urls injected after normalization and filtering: "
      + urlsInjected);

  // merge with the existing crawl db
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Merging injected urls into crawl db.");
  }
  JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
  FileInputFormat.addInputPath(mergeJob, tempDir);
  mergeJob.setReducerClass(InjectReducer.class);
  JobClient.runJob(mergeJob);
  CrawlDb.install(mergeJob, crawlDb);

  // clean up: delete the temporary directory
  FileSystem fs = FileSystem.get(getConf());
  fs.delete(tempDir, true);

  long end = System.currentTimeMillis();
  LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: "
      + TimingUtil.elapsedTime(start, end));
}
```
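The two phases above can be reduced to a toy model in plain Java. This is only an illustrative sketch: the rules, class name, and status strings below are mine, while the real Nutch delegates normalization and filtering to its plugin chain and runs the merge as a MapReduce job.

```java
import java.util.*;

// Toy model of the two Injector phases: phase 1 normalizes/filters seed
// URLs, phase 2 merges survivors into an existing "crawldb" map.
public class InjectSketch {

    // Phase 1: normalize a raw seed line and filter out invalid entries
    // (made-up rules, not Nutch's plugin-driven ones).
    static String normalizeAndFilter(String raw) {
        String url = raw.trim();
        if (url.isEmpty() || url.startsWith("#")) return null;  // blank/comment line
        if (!url.startsWith("http://") && !url.startsWith("https://"))
            return null;                                        // unsupported scheme
        return url.toLowerCase();                               // crude normalization
    }

    // Phase 2: merge injected URLs into the current db; an existing
    // entry for a URL wins over the newly injected one.
    static Map<String, String> merge(Map<String, String> current, List<String> injected) {
        Map<String, String> merged = new HashMap<>(current);
        for (String url : injected) merged.putIfAbsent(url, "STATUS_INJECTED");
        return merged;
    }

    public static void main(String[] args) {
        List<String> kept = new ArrayList<>();
        for (String s : Arrays.asList(" http://Example.com/ ", "ftp://host/x", "# seeds")) {
            String u = normalizeAndFilter(s);
            if (u != null) kept.add(u);
        }
        Map<String, String> db = new HashMap<>();
        db.put("http://example.com/", "STATUS_FETCHED");
        Map<String, String> merged = merge(db, kept);
        System.out.println(kept);                              // only the http seed survives
        System.out.println(merged.get("http://example.com/")); // existing entry is kept
    }
}
```

The putIfAbsent in the merge phase mirrors the key property of the second job: injecting a URL that is already in the CrawlDb must not reset its fetch state.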
Next, let's look at InjectMapper, which processes the seed URL file:
```java
public void configure(JobConf job) {
  this.jobConf = job;
  // initialize the URL normalizers for the inject scope
  urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_INJECT);
  interval = jobConf.getInt("db.fetch.interval.default", 2592000);
  // initialize the URL filters
  filters = new URLFilters(jobConf);
  // initialize the scoring filters
  scfilters = new ScoringFilters(jobConf);
  // initial score assigned to newly injected urls
  scoreInjected = jobConf.getFloat("db.score.injected", 1.0f);
  curTime = job.getLong("injector.current.time", System.currentTimeMillis());
}
```
The findExtensions method loads the urlnormalizer plugins for a given scope; each scope can be configured with its own set and ordering:
```java
/**
 * Searches a list of suitable url normalizer plugins for the given scope.
 *
 * @param scope
 *          Scope for which we seek a url normalizer plugin.
 * @return List - List of extensions to be used for this scope. If none,
 *         returns null.
 * @throws PluginRuntimeException
 */
private List<Extension> findExtensions(String scope) {
  String[] orders = null;
  String orderlist = conf.get("urlnormalizer.order." + scope);
  if (orderlist == null)
    orderlist = conf.get("urlnormalizer.order");
  if (orderlist != null && !orderlist.trim().equals("")) {
    orders = orderlist.trim().split("\\s+");
  }
  String scopelist = conf.get("urlnormalizer.scope." + scope);
  Set<String> impls = null;
  if (scopelist != null && !scopelist.trim().equals("")) {
    String[] names = scopelist.split("\\s+");
    impls = new HashSet<String>(Arrays.asList(names));
  }
  Extension[] extensions = this.extensionPoint.getExtensions();
  HashMap<String, Extension> normalizerExtensions = new HashMap<String, Extension>();
  for (int i = 0; i < extensions.length; i++) {
    Extension extension = extensions[i];
    if (impls != null && !impls.contains(extension.getClazz()))
      continue;
    normalizerExtensions.put(extension.getClazz(), extension);
  }
  List<Extension> res = new ArrayList<Extension>();
  if (orders == null) {
    res.addAll(normalizerExtensions.values());
  } else {
    // first add those explicitly named in correct order
    for (int i = 0; i < orders.length; i++) {
      Extension e = normalizerExtensions.get(orders[i]);
      if (e != null) {
        res.add(e);
        normalizerExtensions.remove(orders[i]);
      }
    }
    // then add all others in random order
    res.addAll(normalizerExtensions.values());
  }
  return res;
}
```
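The ordering logic is easy to isolate: names explicitly listed in the order property come first, in that order, and any other registered extensions follow. A standalone sketch (class and method names here are mine, not Nutch's):

```java
import java.util.*;

public class NormalizerOrderSketch {
    // Same selection logic as findExtensions(): names listed in "orders"
    // come first, in that order; remaining registered normalizers follow
    // in whatever order the map yields them.
    static List<String> ordered(String[] orders, Collection<String> registered) {
        Map<String, String> remaining = new LinkedHashMap<>();
        for (String r : registered) remaining.put(r, r);
        List<String> res = new ArrayList<>();
        if (orders != null) {
            for (String o : orders) {
                if (remaining.remove(o) != null) res.add(o); // explicitly ordered
            }
        }
        res.addAll(remaining.values()); // everything else, unspecified order
        return res;
    }

    public static void main(String[] args) {
        List<String> registered = Arrays.asList(
            "RegexURLNormalizer", "BasicURLNormalizer", "PassURLNormalizer");
        System.out.println(ordered(
            new String[] { "BasicURLNormalizer", "RegexURLNormalizer" }, registered));
        // BasicURLNormalizer first, then RegexURLNormalizer, then the rest
    }
}
```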
The relevant urlnormalizer configuration (from nutch-default.xml):
```xml
<!-- URL normalizer properties -->
<property>
  <name>urlnormalizer.order</name>
  <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
  <description>Order in which normalizers will run. If any of these isn't
  activated it will be silently skipped. If other normalizers not on the
  list are activated, they will run in random order after the ones
  specified here are run.</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer
  class.</description>
</property>

<property>
  <name>urlnormalizer.loop.count</name>
  <value>1</value>
  <description>Optionally loop through normalizers several times, to make
  sure that all transformations have been performed.</description>
</property>
```
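To make the configuration concrete, here is a toy regex-driven normalizer in the spirit of RegexURLNormalizer: an ordered list of pattern-to-substitution rules applied to each URL. The rules below are invented for illustration; they are not the ones shipped in regex-normalize.xml.

```java
import java.util.regex.*;

public class RegexNormalizeSketch {
    // Illustrative pattern -> substitution rules, applied in order,
    // roughly what regex-normalize.xml drives in the real plugin.
    static final String[][] RULES = {
        { "#.*$", "" },            // strip fragment/anchor
        { "\\?$", "" },            // strip a trailing empty query string
        { "(?<!:)/{2,}", "/" },    // collapse duplicate slashes in the path
    };

    static String normalize(String url) {
        for (String[] rule : RULES) {
            url = url.replaceAll(rule[0], rule[1]);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com//a//b?#top"));
        // -> http://example.com/a/b
    }
}
```

The negative lookbehind `(?<!:)` keeps the `//` of the scheme separator intact while collapsing duplicates elsewhere; `urlnormalizer.loop.count` exists for exactly this kind of rule set, where one pass may expose new matches for another rule.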
URLFilters initialization works the same way: the filter extensions are obtained from the plugin repository:
```java
ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(
    URLFilter.X_POINT_ID);
if (point == null)
  throw new RuntimeException(URLFilter.X_POINT_ID + " not found.");
```
```java
/**
 * @return a cached instance of the plugin repository
 */
public static synchronized PluginRepository get(Configuration conf) {
  String uuid = NutchConfiguration.getUUID(conf);
  if (uuid == null) {
    uuid = "nonNutchConf@" + conf.hashCode(); // fallback
  }
  PluginRepository result = CACHE.get(uuid);
  // not cached yet: build a repository for this configuration
  if (result == null) {
    result = new PluginRepository(conf);
    CACHE.put(uuid, result);
  }
  return result;
}
```
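The caching pattern here is worth noting: one repository instance per configuration, created lazily under a lock and reused on every later call. Stripped of the Nutch types, it reduces to the following sketch (names are assumed, not Nutch's):

```java
import java.util.*;

public class RepoCacheSketch {
    // One cached instance per configuration key, built lazily.
    static final Map<String, Object> CACHE = new HashMap<>();
    static int created = 0;

    static synchronized Object get(String confKey) {
        Object repo = CACHE.get(confKey);
        if (repo == null) {            // first request for this conf: build it
            repo = new Object();
            created++;
            CACHE.put(confKey, repo);
        }
        return repo;                   // later requests reuse the instance
    }

    public static void main(String[] args) {
        Object a = get("conf-1");
        Object b = get("conf-1");
        Object c = get("conf-2");
        System.out.println(a == b);    // same conf -> same instance
        System.out.println(a == c);    // different conf -> different instance
        System.out.println(created);   // only two repositories were built
    }
}
```

Building a PluginRepository means scanning directories and parsing every plugin.xml, so doing it once per configuration rather than once per caller matters.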
```java
public PluginRepository(Configuration conf) throws RuntimeException {
  // map of currently activated plugins
  fActivatedPlugins = new HashMap<String, Plugin>();
  // map of extension points
  fExtensionPoints = new HashMap<String, ExtensionPoint>();
  this.conf = conf;
  // whether plugins are activated automatically
  this.auto = conf.getBoolean("plugin.auto-activation", true);
  // directories where plugins are stored
  String[] pluginFolders = conf.getStrings("plugin.folders");
  // helper that walks the plugin folders and parses each plugin.xml
  // (one per plugin) into a PluginDescriptor
  PluginManifestParser manifestParser = new PluginManifestParser(conf, this);
  Map<String, PluginDescriptor> allPlugins = manifestParser
      .parsePluginFolder(pluginFolders);
  // regex of plugins to exclude
  Pattern excludes = Pattern.compile(conf.get("plugin.excludes", ""));
  // regex of plugins to include
  Pattern includes = Pattern.compile(conf.get("plugin.includes", ""));
  // drop the plugins we are not going to use
  Map<String, PluginDescriptor> filteredPlugins = filter(excludes,
      includes, allPlugins);
  // check plugin dependencies
  fRegisteredPlugins = getDependencyCheckedPlugins(filteredPlugins,
      this.auto ? allPlugins : filteredPlugins);
  // install the extension points
  installExtensionPoints(fRegisteredPlugins);
  try {
    installExtensions(fRegisteredPlugins);
  } catch (PluginRuntimeException e) {
    LOG.error(e.toString());
    throw new RuntimeException(e.getMessage());
  }
  displayStatus();
}
```
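The include/exclude step can be sketched on its own: a plugin id survives only if it matches plugin.includes and does not match plugin.excludes. The patterns and ids below are toy values for illustration, not Nutch's shipped defaults:

```java
import java.util.*;
import java.util.regex.*;

public class PluginFilterSketch {
    // A plugin id is kept iff it matches the includes pattern and does
    // not match the excludes pattern (the spirit of the filter() step).
    static List<String> filter(String includes, String excludes, List<String> ids) {
        Pattern inc = Pattern.compile(includes);
        Pattern exc = Pattern.compile(excludes);
        List<String> kept = new ArrayList<>();
        for (String id : ids) {
            if (!inc.matcher(id).matches()) continue; // not on the include list
            if (exc.matcher(id).matches()) continue;  // explicitly excluded
            kept.add(id);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList(
            "urlfilter-regex", "urlfilter-suffix", "parse-html");
        System.out.println(filter("urlfilter-.*", "urlfilter-suffix", ids));
        // only urlfilter-regex survives both checks
    }
}
```

Note that the default excludes value is the empty string, and `Pattern.compile("")` matches only an empty input, so by default nothing is excluded.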
```java
/**
 * Returns a list of all found plugin descriptors.
 *
 * @param pluginFolders
 *          folders to search plugins from
 * @return A {@link Map} of all found {@link PluginDescriptor}s.
 */
public Map<String, PluginDescriptor> parsePluginFolder(String[] pluginFolders) {
  Map<String, PluginDescriptor> map = new HashMap<String, PluginDescriptor>();
  if (pluginFolders == null) {
    throw new IllegalArgumentException("plugin.folders is not defined");
  }
  for (String name : pluginFolders) {
    File directory = getPluginFolder(name);
    if (directory == null) {
      continue;
    }
    LOG.info("Plugins: looking in: " + directory.getAbsolutePath());
    for (File oneSubFolder : directory.listFiles()) {
      if (oneSubFolder.isDirectory()) {
        String manifestPath = oneSubFolder.getAbsolutePath()
            + File.separator + "plugin.xml";
        try {
          LOG.debug("parsing: " + manifestPath);
          PluginDescriptor p = parseManifestFile(manifestPath);
          map.put(p.getPluginId(), p);
        } catch (MalformedURLException e) {
          LOG.warn(e.toString());
        } catch (SAXException e) {
          LOG.warn(e.toString());
        } catch (IOException e) {
          LOG.warn(e.toString());
        } catch (ParserConfigurationException e) {
          LOG.warn(e.toString());
        }
      }
    }
  }
  return map;
}
```