An Analysis of the Nutch Plugin System

iammonster

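Plugins are one of the highlights of the Nutch architecture. Borrowing from this design, we can build flexible architectures for our own systems, so let's take a closer look at how the Nutch plugin system works.

For background on Nutch, see http://lucene.apache.org/nutch/. At the time of writing, the latest version is 1.0: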

    23 March 2009 - Apache Nutch 1.0 Released

    Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
    For more information about Nutch, please see the Nutch wiki: http://wiki.apache.org/nutch/

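I. Terminology used in the Nutch plugin architecture:

   1. Extension point (ExtensionPoint)

      An extension point is a class or interface in the system that can be extended further. Defining extension points makes the system's execution flow pluggable, so it can be varied freely.

   2. Extension (Extension)

      An extension is a property defined inside a plugin. Each extension is one implementation of a particular extension point, and each can carry additional attributes of its own, which are used to tell apart different implementations of the same extension point. Extensions must be defined inside a plugin.

   3. Plugin (Plugin)

      A plugin is essentially a virtual container. It holds several extensions (Extension), the plugins it depends on (RequirePlugins), and the libraries it publishes (Runtime). A plugin can be started and stopped.

    To support extension, Nutch reserves many extension points (ExtensionPoint) and ships basic implementations (Extension) for them, while plugins (Plugin) organize these extensions. All of this is controlled through configuration files: several files that define the extension points and the plugins (extensions), plus one file that controls which plugins are loaded. The architecture is shown in the following diagram:

    [Figure: plugin architecture diagram]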

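II. Internal structure of a plugin, as shown in the figure below:

   1. The runtime attribute describes the JARs the plugin needs and the JARs it publishes
   2. The requires attribute describes the plugins it depends on
   3. extension-point describes the extension points this plugin declares as extensible
   4. The extension attribute describes implementations of extension points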

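III. A plugin is defined as follows:

<!-- Attributes of <plugin>: id is the plugin ID, name the plugin name,
     version the plugin version, provider-name the ID of the plugin provider. -->
<plugin
   id="urlfilter-suffix"
   name="Suffix URL Filter"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <!-- library: a JAR the plugin depends on -->
      <library name="urlfilter-suffix.jar">
         <!-- export: what the JAR publishes to other plugins -->
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <!-- import: a plugin this plugin depends on -->
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <!-- Attributes of <extension>: id is the extension's ID, name its name,
        point the ID of the extension point it implements. -->
   <extension id="org.apache.nutch.net.urlfilter.suffix"
              name="Nutch Suffix URL Filter"
              point="org.apache.nutch.net.URLFilter">
      <!-- implementation: id is the implementation ID, class the implementing class -->
      <implementation id="SuffixURLFilter"
                      class="org.apache.nutch.urlfilter.suffix.SuffixURLFilter"/>
      <!-- by default, attribute "file" is undefined, to keep classic behavior.
      <implementation id="SuffixURLFilter"
                      class="org.apache.nutch.net.SuffixURLFilter">
        <parameter name="file" value="urlfilter-suffix.txt"/>
      </implementation>
      -->
   </extension>

</plugin>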
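To make the shape of the descriptor concrete, here is a minimal sketch that reads a trimmed copy of the plugin.xml above with the JDK's DOM parser and lists which class implements which extension point. It is only an illustration: the class PluginXmlSketch and its method names are hypothetical, and Nutch's own descriptor parsing may well be organized differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Nutch.
class PluginXmlSketch {

  // A trimmed-down descriptor modeled on the urlfilter-suffix example.
  static final String SAMPLE =
      "<plugin id=\"urlfilter-suffix\" name=\"Suffix URL Filter\""
      + " version=\"1.0.0\" provider-name=\"nutch.org\">"
      + "<runtime><library name=\"urlfilter-suffix.jar\">"
      + "<export name=\"*\"/></library></runtime>"
      + "<requires><import plugin=\"nutch-extensionpoints\"/></requires>"
      + "<extension id=\"org.apache.nutch.net.urlfilter.suffix\""
      + " name=\"Nutch Suffix URL Filter\""
      + " point=\"org.apache.nutch.net.URLFilter\">"
      + "<implementation id=\"SuffixURLFilter\""
      + " class=\"org.apache.nutch.urlfilter.suffix.SuffixURLFilter\"/>"
      + "</extension></plugin>";

  // Returns one "extension point -> implementing class" entry per <implementation>.
  static List<String> describeExtensions(String xml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("not a readable plugin.xml", e);
    }
    List<String> result = new ArrayList<>();
    NodeList extensions = doc.getElementsByTagName("extension");
    for (int i = 0; i < extensions.getLength(); i++) {
      Element extension = (Element) extensions.item(i);
      NodeList impls = extension.getElementsByTagName("implementation");
      for (int j = 0; j < impls.getLength(); j++) {
        Element impl = (Element) impls.item(j);
        result.add(extension.getAttribute("point") + " -> " + impl.getAttribute("class"));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    describeExtensions(SAMPLE).forEach(System.out::println);
  }
}
```

The point attribute is what ties an extension to its extension point, which is why the sketch keys its output on it.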

IV. The main plugin settings, found in nutch-default.xml:

<!-- plugin properties -->

<!-- Directories that contain the plugins; the default is the plugins directory. -->
<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

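<!-- Whether a plugin that the settings filter out (i.e. do not load) is still activated automatically when another plugin depends on it; the default is true. -->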
<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines if some plugins that are not activated regarding
  the plugin.includes and plugin.excludes properties must be automaticaly
  activated if they are needed by some actived plugins.
  </description>
</property>

<!-- The plugins to include; the value may be a regular expression. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<!-- The plugins to exclude; the value may be a regular expression. -->
<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude. 
  </description>
</property>
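The way these three settings interact can be sketched in a few lines of Java. This is a hedged illustration, not Nutch's implementation: the class PluginSelectionSketch, the select method, and the plugin-to-dependencies map are all hypothetical, and the sketch assumes that an expression has to match the whole plugin directory name.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration of the selection rules; not Nutch's actual code.
class PluginSelectionSketch {

  // plugins maps each plugin id to the ids of the plugins it requires.
  static Set<String> select(Map<String, List<String>> plugins,
      String includes, String excludes, boolean autoActivation) {
    Pattern include = Pattern.compile(includes);
    Pattern exclude = Pattern.compile(excludes);
    Set<String> active = new LinkedHashSet<>();
    for (String id : plugins.keySet()) {
      // Assumption: the whole directory name must match the expression.
      if (include.matcher(id).matches() && !exclude.matcher(id).matches()) {
        active.add(id);
      }
    }
    if (autoActivation) {
      // Pull in filtered-out plugins that an active plugin depends on,
      // following dependencies transitively.
      Deque<String> pending = new ArrayDeque<>(active);
      while (!pending.isEmpty()) {
        for (String dep : plugins.getOrDefault(pending.pop(), List.of())) {
          if (plugins.containsKey(dep) && active.add(dep)) {
            pending.push(dep);
          }
        }
      }
    }
    return active;
  }

  public static void main(String[] args) {
    Map<String, List<String>> plugins = new LinkedHashMap<>();
    plugins.put("nutch-extensionpoints", List.of());
    plugins.put("protocol-http", List.of("nutch-extensionpoints"));
    plugins.put("parse-html", List.of("nutch-extensionpoints"));
    plugins.put("parse-pdf", List.of("nutch-extensionpoints"));
    System.out.println(select(plugins, "protocol-http|parse-(text|html|js)", "", true));
  }
}
```

Note that the plugin.includes value shown above does not name nutch-extensionpoints. In this sketch that plugin becomes active only because auto-activation is on and the selected plugins depend on it; with auto-activation off it would have to be listed explicitly.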

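V. UML diagram of the main plugin classes:

The classes are:

   1. PluginRepository is the plugin repository, initialized from the loaded configuration (a Configuration object). It keeps track of every extension point (ExtensionPoint) and every plugin (Plugin) instance in the system.
   2. ExtensionPoint represents an extension point. Only once an extension point has been defined can a plugin (Plugin) define an actual extension (Extension) for it, and that is how extension is achieved. Each ExtensionPoint instance keeps track of the extensions (Extension) that declare they implement it.
   3. Plugin is a virtual organizing unit. It provides a start method and a shutdown method, which is how plugins are started and stopped. It also has a descriptor object, PluginDescriptor, that holds the plugin's configuration, and a PluginClassLoader that loads the plugin's classes and libraries.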

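VI. Sequence diagram of the plugin loading process:

    The sequence diagram shows that, when Nutch loads plugins, the actor calls each associated object directly throughout the process and ends up with the plugin's implementation object. In detail:

   1. First, the repository is obtained through PluginRepository.get(conf), which loads the configuration: the plugin directories, plugin configuration such as plugin.properties, and so on. The PluginRepository then uses that configuration to load each plugin's plugin.xml and, based on plugin.xml, the classes the plugin depends on.
   2. When the actor needs the plugins for a given extension point, it can:
         1. Look up the extension point by name through the PluginRepository, obtaining an ExtensionPoint instance;
         2. Call getExtensions on the ExtensionPoint object, which returns the list of extensions that implement this extension point (Extension[]);
         3. For each Extension, call its getExtensionInstance() method to obtain an instance of the actual implementation class, typed as Object at this point;
         4. Cast the Object to the actual class type as appropriate and call its implementation methods, for example a helloworld method.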
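The four lookup steps can be walked through with a self-contained toy model. Everything in it is a stand-in: ToyRepository, ToyExtensionPoint, ToyExtension, and the Greeter interface are hypothetical classes that only mirror the roles described above, so the sketch runs without Nutch on the classpath and makes no claim about how the real classes are implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-ins that mirror the roles described above; these are NOT the Nutch classes.
class LoadFlowSketch {

  // The contract published by an extension point.
  interface Greeter {
    String helloworld();
  }

  // One implementation that a plugin would contribute.
  public static class EnglishGreeter implements Greeter {
    public String helloworld() {
      return "hello, world";
    }
  }

  // An extension stores only the class name; the instance is created on demand.
  static class ToyExtension {
    private final String className;

    ToyExtension(String className) {
      this.className = className;
    }

    Object getExtensionInstance() {
      try {
        return Class.forName(className).getDeclaredConstructor().newInstance();
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException("cannot instantiate " + className, e);
      }
    }
  }

  static class ToyExtensionPoint {
    private final List<ToyExtension> extensions = new ArrayList<>();

    void add(ToyExtension extension) {
      extensions.add(extension);
    }

    ToyExtension[] getExtensions() {
      return extensions.toArray(new ToyExtension[0]);
    }
  }

  static class ToyRepository {
    private final Map<String, ToyExtensionPoint> points = new HashMap<>();

    void register(String pointId, String implClass) {
      points.computeIfAbsent(pointId, k -> new ToyExtensionPoint())
          .add(new ToyExtension(implClass));
    }

    ToyExtensionPoint getExtensionPoint(String pointId) {
      return points.get(pointId);
    }
  }

  // Follows the four steps: point, extensions, instance, cast and call.
  static List<String> greetAll(ToyRepository repository, String pointId) {
    List<String> greetings = new ArrayList<>();
    ToyExtensionPoint point = repository.getExtensionPoint(pointId); // step 1
    if (point == null) {
      return greetings; // nothing registered under this id
    }
    for (ToyExtension extension : point.getExtensions()) {          // step 2
      Object instance = extension.getExtensionInstance();            // step 3
      greetings.add(((Greeter) instance).helloworld());              // step 4
    }
    return greetings;
  }

  public static void main(String[] args) {
    ToyRepository repository = new ToyRepository();
    repository.register("greeter", EnglishGreeter.class.getName());
    System.out.println(greetAll(repository, "greeter"));
  }
}
```

Keeping only the class name in the extension and instantiating it reflectively is what lets the caller depend on the extension point's interface alone.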

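VII. A typical way of calling a plugin:

  // Returns the first extension whose "protocolName" attribute lists the
  // given name, or null if no extension matches.
  private Extension findExtension(String name) throws PluginRuntimeException {

    Extension[] extensions = this.extensionPoint.getExtensions();

    for (Extension extension : extensions) {
      if (contains(name, extension.getAttribute("protocolName")))
        return extension;
    }
    return null;
  }

  // Checks whether "what" is one of the comma- or space-separated tokens in "where".
  boolean contains(String what, String where) {
    if (where == null) return false; // the extension does not declare the attribute
    String[] parts = where.split("[, ]");
    for (String part : parts) {
      if (part.equals(what)) return true;
    }
    return false;
  }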
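The matching rule in contains is easy to misread as a substring test, so here is a small standalone sketch of it. The class name AttributeMatchSketch and the sample attribute value "http, https" are assumptions made for the demonstration; the null check is also an addition, covering an extension that does not declare the attribute.

```java
// Standalone demonstration of token matching; AttributeMatchSketch is a hypothetical name.
class AttributeMatchSketch {

  // True only if "what" equals one whole token of "where",
  // where tokens are separated by commas or spaces.
  static boolean contains(String what, String where) {
    if (where == null) {
      return false; // attribute not declared
    }
    for (String part : where.split("[, ]")) {
      if (part.equals(what)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String attribute = "http, https"; // assumed sample value
    System.out.println(contains("https", attribute)); // whole token
    System.out.println(contains("htt", attribute));   // only a prefix of a token
    System.out.println(contains("http", null));       // attribute missing
  }
}
```

Because tokens are compared with equals, a request for "htt" does not select an extension registered for "http".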

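VIII. The plugin class-loading mechanism:

In a system built on the plugin architecture, plugin classes are loaded by PluginClassLoader. Each Plugin has its own class loader, which extends URLClassLoader and adds no behavior of its own: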

import java.net.URL;
import java.net.URLClassLoader;

public class PluginClassLoader extends URLClassLoader {
    /**
    * Constructor
    *
    * @param urls
    *          Array of urls with own libraries and all exported libraries of
    *          plugins that are required to this plugin
    * @param parent
    */
    public PluginClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }
}

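This class loader belongs to the plugin alone. It only loads the plugin's own classes, its local libraries, and the exported libraries of the plugins it depends on, along with basic configuration files such as .properties files.

How the class loader is instantiated: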

  /**
   * Returns a cached classloader for a plugin. Until classloader creation all
   * needed libraries are collected. A classloader use as first the plugins own
   * libraries and add then all exported libraries of dependend plugins.
   *
   * @return PluginClassLoader the classloader for the plugin
   */
  public PluginClassLoader getClassLoader() {
    // Reuse the loader if it has already been created.
    if (fClassLoader != null)
      return fClassLoader;
    // Collect the plugin's own libraries, then those exported by its dependencies.
    ArrayList<URL> arrayList = new ArrayList<URL>();
    arrayList.addAll(fExportedLibs);
    arrayList.addAll(fNotExportedLibs);
    arrayList.addAll(getDependencyLibs());
    File file = new File(getPluginPath());
    try {
      // Add the plugin directory itself so that its .properties files can be found.
      for (File file2 : file.listFiles()) {
        if (file2.getAbsolutePath().endsWith("properties"))
          arrayList.add(file2.getParentFile().toURL());
      }
    } catch (MalformedURLException e) {
      LOG.debug(getPluginId() + " " + e.toString());
    }
    URL[] urls = arrayList.toArray(new URL[arrayList.size()]);
    // The parent is the loader that loaded PluginDescriptor.
    fClassLoader = new PluginClassLoader(urls, PluginDescriptor.class
        .getClassLoader());
    return fClassLoader;
  }

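    * First check whether a cached class loader already exists
    * Collect the JARs the plugin exports, the JARs it needs only for itself, and the JARs exported by the plugins it depends on
    * Add the plugin's local properties files
    * Construct the class loader; its parent is the loader that loaded PluginDescriptor, which is usually the contextClassLoader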
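The effect of these steps can be tried out with a plain URLClassLoader. The sketch below is an assumption-laden illustration rather than Nutch code: SketchPluginClassLoader, loaderFor, and the file name sketch.properties are hypothetical, and a temporary directory plays the role of the plugin folder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a per-plugin class loader; not the Nutch class.
class PluginLoaderSketch {

  // Like the class shown earlier, it adds nothing to URLClassLoader.
  static class SketchPluginClassLoader extends URLClassLoader {
    SketchPluginClassLoader(URL[] urls, ClassLoader parent) {
      super(urls, parent);
    }
  }

  // Builds a loader whose only URL is the plugin directory itself.
  static URLClassLoader loaderFor(Path pluginDir) {
    try {
      URL[] urls = {pluginDir.toUri().toURL()};
      return new SketchPluginClassLoader(urls, PluginLoaderSketch.class.getClassLoader());
    } catch (MalformedURLException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Checks two things: the loader finds a properties file in its own
  // directory, and core classes still come from the parent chain.
  static boolean findsOwnResourceAndDelegates() {
    try {
      Path dir = Files.createTempDirectory("plugin-sketch");
      Files.writeString(dir.resolve("sketch.properties"), "greeting=hello\n");
      try (URLClassLoader loader = loaderFor(dir)) {
        boolean own = loader.getResource("sketch.properties") != null;
        boolean delegated = loader.loadClass("java.lang.String") == String.class;
        return own && delegated;
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(findsOwnResourceAndDelegates());
  }
}
```

Putting the directory on the URL list is what makes the properties file visible through the loader, which matches the reason the listing above adds the parent directory of each properties file.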

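IX. Summary:
    Nutch is an excellent open-source search framework, and its plugin architecture is one of its technical highlights. This architecture lets Nutch be extended flexibly without modifying existing code, and a configuration file is all it takes to control which plugins are loaded. None of this requires support from an additional container. These are useful lessons to study and draw on when designing our own system architectures.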