A brief introduction to the kinds of sorting in Hadoop

leibnitz


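In the MapReduce framework, besides the usual distributed computation, sorting is also a fairly important part. It matters as much as ORDER BY does in SQL queries.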

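I. No sorting

When writing the job code, set mapred.reduce.tasks=0 (same effect as setNumReduceTasks(0)). That is all it takes.

With no reduce phase there is no shuffle or sort: each map task writes its own part file directly (so there is a single part file only when there is a single map task), and the entries in it are unordered.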

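II. Default sorting (sorted only within each partition)

This is also called "local sorting". The job produces several part files; each file is sorted internally, but there is no ordering between the files.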
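The effect can be reproduced without Hadoop. The sketch below is only a simulation: the class name LocalSortSketch and its methods are hypothetical, and the partition formula is the same non-negative hash modulo that the default HashPartitioner uses. It shows that every partition ends up sorted while the partitions overlap in key range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, Hadoop-free simulation of "sorted within each partition only".
class LocalSortSketch {

    // Non-negative hash modulo the number of partitions.
    static int partitionOf(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Distribute the keys over the partitions, then sort each partition on its own.
    static List<List<String>> partitionAndSort(List<String> keys, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partitionOf(key, numPartitions)).add(key);
        }
        for (List<String> part : parts) {
            Collections.sort(part);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("pear", "apple", "kiwi", "fig", "melon", "grape");
        List<List<String>> parts = partitionAndSort(keys, 2);
        for (int i = 0; i < parts.size(); i++) {
            // Each line is sorted, but the two lines are not ordered relative to each other.
            System.out.println("part-" + i + ": " + parts.get(i));
        }
    }
}
```

Because a key's partition depends only on its hash, concatenating the part files does not give a globally sorted result.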

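III. Total-order sorting

Using TotalOrderPartitioner as the partitioner is enough to get this (note that in the version I am reading it exists only in the old mapred lib and not in the mapreduce lib; later Hadoop releases do provide it under org.apache.hadoop.mapreduce.lib.partition). You also need to call its setPartitionFile(xx) so that it can load the boundary keys estimated from the samples (there are numReduces - 1 of them). The samples are usually taken with the InputSampler.RandomSampler implementation.

The sampling algorithm is as follows: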

/**
     * Randomize the split order, then take the specified number of keys from
     * each split sampled, where each key is selected with the specified
     * probability and possibly replaced by a subsequently selected key when
     * the quota of keys from that split is satisfied.
     */
public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>(numSamples);
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);  // how many splits to sample

      Random r = new Random();
      long seed = r.nextLong();
      r.setSeed(seed);
      LOG.debug("seed: " + seed);
      // shuffle splits: randomly swap them so that the sampled splits are spread more evenly
      for (int i = 0; i < splits.length; ++i) {
        InputSplit tmp = splits[i];
        int j = r.nextInt(splits.length);
        splits[i] = splits[j];
        splits[j] = tmp;
      }
      // our target rate is in terms of the maximum number of sample splits,
      // but we accept the possibility of sampling additional splits to hit
      // the target sample keyset
      for (int i = 0; i < splitsToSample ||
                     (i < splits.length && samples.size() < numSamples); ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i], job,
            Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          if (r.nextDouble() <= freq) {    // accept this key with probability freq
            if (samples.size() < numSamples) {  // below the limit: add the sample directly
              samples.add(key);
            } else {
              // When exceeding the maximum number of samples, replace a
              // random element with this one, then adjust the frequency
              // to reflect the possibility of existing elements being
              // pushed out
              int ind = r.nextInt(numSamples); // otherwise overwrite a randomly chosen sample
              if (ind != numSamples) {
                samples.set(ind, key);
              }
              freq *= (numSamples - 1) / (double) numSamples; // after a replacement, lower the acceptance probability so later replacements become less frequent
            }
            key = reader.createKey();
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }

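From the returned samples, Hadoop works out how the keys are distributed. I have not found where the actual algorithm is implemented, but I guess it goes roughly like this:

1. Compute the average share per reducer as 100 / numReduces / 100, i.e. 1 / numReduces;

2. Sort the samples;

3. Going from low to high (the reverse works too), accumulate the share of keys interval by interval; when the average share is reached (some deviation is allowed), stop adding to this interval and take its largest key as the first boundary;

4. Handle the remaining keys in the same way;

5. This may end up producing many candidate boundaries, so a further optimization step is needed to filter them.

I did try to implement this, and found the computation rather complicated: you do not know when to stop, and you have to remember the boundaries for different cases.

I think Hadoop also sets an offset value and limits the number of optimization passes. TODO: I will keep looking through the source when I have time.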
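A much simpler way to get the numReduces - 1 boundaries is to sort the samples and pick keys at evenly spaced positions. This appears to be roughly what InputSampler.writePartitionFile does in the Hadoop source, but the sketch below is a Hadoop-free simplification: the class SplitPointSketch and its methods are hypothetical names, it assumes at least as many samples as partitions, and it ignores duplicate boundary keys.

```java
import java.util.Arrays;

// Hypothetical sketch: derive boundary keys from samples and look up a key's partition.
class SplitPointSketch {

    // Sort the samples and take numPartitions - 1 evenly spaced keys as boundaries.
    static String[] splitPoints(String[] samples, int numPartitions) {
        if (numPartitions < 1 || samples.length < numPartitions) {
            throw new IllegalArgumentException("need at least one sample per partition");
        }
        String[] sorted = samples.clone();
        Arrays.sort(sorted);
        String[] points = new String[numPartitions - 1];
        float step = sorted.length / (float) numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[Math.round(step * i)];
        }
        return points;
    }

    // Binary search over the boundaries: keys below the first boundary go to
    // partition 0, keys equal to or above boundary i go to partition i + 1.
    static int partitionOf(String key, String[] points) {
        int pos = Arrays.binarySearch(points, key) + 1;
        return pos < 0 ? -pos : pos;
    }

    public static void main(String[] args) {
        String[] samples = {"m", "c", "x", "a", "t", "h", "q", "e"};
        String[] points = splitPoints(samples, 4);
        System.out.println("boundaries: " + Arrays.toString(points));
        for (String key : new String[] {"b", "e", "n", "z"}) {
            System.out.println(key + " -> partition " + partitionOf(key, points));
        }
    }
}
```

Since every key in partition i sorts before every key in partition i + 1, sorting inside each partition is enough to make the concatenated part files globally ordered.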

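IV. Grouping (secondary sort)

This works much like the GROUP BY clause in SQL: it further de-duplicates already sorted data by key.

The implementation is simple too. The process is roughly:

1. Build a composite key;

A composite key is needed because Hadoop ultimately sorts on the key, not on the value. I had exactly this requirement when computing statistics over logs.

That is, append the other values to be grouped to the usual key.

2. Sort by the composite key;

This step basically compares all fields of the composite key, because your final goal is to keep only one record per original key (the largest or the smallest, which is the nature of grouping).

Of course you also have to write a partitioner; otherwise records with the same original key end up in different reduces.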

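3. Group the composite keys;

In this Comparator, just put the filter condition inside. For example, to keep only the first record per original key:

return composedKeyObject.key.compareTo(that.key);

This way, records with the same original key are merged into a single reduce call before they reach the reducer, which can take the first one and ignore the rest.

It is also possible to combine III and IV, so that the part files are ordered among themselves and grouping is achieved as well.

The key to secondary sort is really understanding that the grouping comparator groups keys logically.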
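The three steps can be simulated in plain Java. This is only a sketch under assumptions: SecondarySortSketch, CompositeKey, SORT, GROUP and firstPerGroup are hypothetical names, and the two Comparator fields stand in for what a real job would register as its sort comparator and grouping comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical, Hadoop-free simulation of secondary sort with grouping.
class SecondarySortSketch {

    // Step 1: composite key = original key plus the value that should be sorted.
    static final class CompositeKey {
        final String key;
        final int value;

        CompositeKey(String key, int value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + ":" + value;
        }
    }

    // Step 2: the sort comparator uses every field of the composite key.
    static final Comparator<CompositeKey> SORT =
        Comparator.comparing((CompositeKey k) -> k.key).thenComparingInt(k -> k.value);

    // Step 3: the grouping comparator looks at the original key only.
    static final Comparator<CompositeKey> GROUP =
        Comparator.comparing((CompositeKey k) -> k.key);

    // Partition on the original key so equal keys meet in the same reduce.
    static int partitionOf(CompositeKey k, int numPartitions) {
        return (k.key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Simulates one reduce: sort, then keep the first record of every group.
    static List<CompositeKey> firstPerGroup(List<CompositeKey> records) {
        List<CompositeKey> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        List<CompositeKey> result = new ArrayList<>();
        CompositeKey previous = null;
        for (CompositeKey current : sorted) {
            if (previous == null || GROUP.compare(previous, current) != 0) {
                result.add(current); // first (smallest) record of a new group
            }
            previous = current;
        }
        return result;
    }

    public static void main(String[] args) {
        List<CompositeKey> records = Arrays.asList(
            new CompositeKey("a", 3), new CompositeKey("b", 5),
            new CompositeKey("a", 1), new CompositeKey("b", 2));
        System.out.println(firstPerGroup(records));
    }
}
```

Reversing the value order in SORT would keep the largest record per key instead of the smallest, without touching GROUP or the partitioning.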


2011-12-16 21:52