[Hadoop] A Simple Top K Implementation

RangerWolf


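A side note:

"Hadoop in Action" is a very good introductory book for learning Hadoop, and I recommend reading the English edition. The author's English is simple and easy to follow, so anyone with reasonable English reading ability should be able to get started directly from it.

On to the main topic. This problem is an exercise from "Hadoop in Action": find the Top K values.

I put together a small input file of my own, with one tab-separated name and count per line.

Here is my approach:

For a Top K problem, the first step is to find the Top K within each block/input split. Because each mapper should emit its result only once, after it has seen all of its records, the output is written in the cleanup method. For simplicity I use Java's TreeMap, which keeps its keys sorted. The Reducer works exactly the same way as the Mapper. The map step alone is not enough, because each mapper only sees its own split, so a reduce step is added to merge the per-split results into the global Top K.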
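The two-stage idea above can be tried outside Hadoop with a small plain-Java sketch. The class and method names (TopKSketch, offer, localTopK, mergeTopK) and the sample data are made up for illustration and are not from the book. The sketch assumes that no two names share the same count, since a TreeMap keyed by count keeps only one name per count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopKSketch {

    // Insert one record, then evict the smallest count if more than k are held.
    static void offer(TreeMap<Integer, String> top, int k, int num, String name) {
        top.put(num, name);
        if (top.size() > k) {
            top.remove(top.firstKey());
        }
    }

    // Stage 1 (the mapper's role): Top K of a single split.
    static TreeMap<Integer, String> localTopK(Map<String, Integer> split, int k) {
        TreeMap<Integer, String> top = new TreeMap<>();
        for (Map.Entry<String, Integer> e : split.entrySet()) {
            offer(top, k, e.getValue(), e.getKey());
        }
        return top;
    }

    // Stage 2 (the single reducer's role): merge all per-split results.
    static TreeMap<Integer, String> mergeTopK(List<TreeMap<Integer, String>> partials, int k) {
        TreeMap<Integer, String> top = new TreeMap<>();
        for (TreeMap<Integer, String> partial : partials) {
            for (Map.Entry<Integer, String> e : partial.entrySet()) {
                offer(top, k, e.getKey(), e.getValue());
            }
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for two input splits.
        Map<String, Integer> split1 = new HashMap<>();
        split1.put("x", 5);
        split1.put("y", 40);
        split1.put("z", 12);
        Map<String, Integer> split2 = new HashMap<>();
        split2.put("p", 33);
        split2.put("q", 7);
        split2.put("r", 90);

        List<TreeMap<Integer, String>> partials = new ArrayList<>();
        partials.add(localTopK(split1, 2));
        partials.add(localTopK(split2, 2));

        // Print the global Top 2, largest count first.
        System.out.println(mergeTopK(partials, 2).descendingMap());
    }
}
```

Any record in the global Top K is also in the Top K of its own split, which is why merging the per-split results loses nothing.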

The final output has the following format (K=2):

1117    a
456    g

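Because each output line carries both the count and the name, a map is needed. If only the counts had to be output, a TreeSet would be enough, and slightly leaner since no names are stored.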
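As a rough illustration of the TreeSet variant, here is a minimal plain-Java sketch. The class name TopKValues, the method topK and the sample numbers are hypothetical. Note that a TreeSet stores each value only once, so duplicate counts collapse into a single entry.

```java
import java.util.TreeSet;

class TopKValues {

    // Keep the k largest distinct values seen so far.
    static TreeSet<Integer> topK(int[] nums, int k) {
        TreeSet<Integer> top = new TreeSet<>();
        for (int n : nums) {
            top.add(n);
            if (top.size() > k) {
                top.pollFirst(); // drop the current smallest value
            }
        }
        return top;
    }

    public static void main(String[] args) {
        int[] nums = {5, 40, 12, 33, 7, 90}; // made-up sample values
        // Print the two largest values, largest first.
        System.out.println(topK(nums, 2).descendingSet());
    }
}
```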

Here is the implementation:

package hadoop_in_action_exersice;

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopK {

	public static final int K = 2;
	
	public static class KMap extends Mapper<LongWritable, Text, IntWritable, Text> {
		
		// Holds at most K entries sorted by count; names with equal counts overwrite each other.
		TreeMap<Integer, String> map = new TreeMap<Integer, String>(); 
		
		@Override
		public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			
			// Each input line is expected to be "name<TAB>count"; other lines are skipped.
			String line = value.toString();
			if(line.trim().length() > 0 && line.indexOf("\t") != -1) {
				
				String[] arr = line.split("\t", 2);
				String name = arr[0];
				Integer num = Integer.parseInt(arr[1].trim());
				map.put(num, name);
				
				// Evict the smallest count once more than K entries are held.
				if(map.size() > K) {
					map.remove(map.firstKey());
				}
			}
		}

		@Override
		protected void cleanup(
				Mapper<LongWritable, Text, IntWritable, Text>.Context context)
				throws IOException, InterruptedException {
			
			// Emit this split's Top K once, after all records have been seen.
			for(Integer num : map.keySet()) {
				context.write(new IntWritable(num), new Text(map.get(num)));
			}
			
		}
		
	}
	
	
	public static class KReduce extends Reducer<IntWritable, Text, IntWritable, Text> {
		
		// Same bounded TreeMap as in the mapper, now fed with the per-split candidates.
		TreeMap<Integer, String> map = new TreeMap<Integer, String>();
		
		@Override
		public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
				
			// If several names share this count, only the first one is kept.
			map.put(key.get(), values.iterator().next().toString());
			if(map.size() > K) {
				map.remove(map.firstKey());
			}
		}

		@Override
		protected void cleanup(
				Reducer<IntWritable, Text, IntWritable, Text>.Context context)
				throws IOException, InterruptedException {
			// Write the largest count first, matching the sample output shown earlier.
			for(Integer num : map.descendingKeySet()) {
				context.write(new IntWritable(num), new Text(map.get(num)));
			}
		}
	}

	public static void main(String[] args) {
		
		Configuration conf = new Configuration();
		try {
			Job job = new Job(conf, "top k");
			job.setJarByClass(TopK.class);
			job.setMapperClass(KMap.class);
			// Optional: each mapper already emits at most K records, so the combiner saves little.
			job.setCombinerClass(KReduce.class);
			job.setReducerClass(KReduce.class);
			// A single reducer must see every candidate to produce the global Top K.
			job.setNumReduceTasks(1);
			job.setOutputKeyClass(IntWritable.class);
			job.setOutputValueClass(Text.class);
			FileInputFormat.setInputPaths(job, new Path("/home/hadoop/DataSet/Hadoop/WordCount-Result"));
			FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/DataSet/Hadoop/TopK-output1"));
			// Prints true if the job succeeded, false otherwise.
			System.out.println(job.waitForCompletion(true));
		} catch (IOException e) {
			e.printStackTrace();
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
		} catch (InterruptedException e) {
			e.printStackTrace();
		} 
	}
}
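
One caveat of keying a TreeMap by count is that two names with the same count overwrite each other. A possible alternative, sketched here in plain Java, is a size-bounded min-heap (PriorityQueue) of (name, count) pairs. This is a different technique from the TreeMap used in this post, and the class name TopKWithTies and the sample rows are made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKWithTies {

    // Keep the k rows with the largest counts; rows with equal counts are kept separately.
    static List<Map.Entry<String, Integer>> topK(List<Map.Entry<String, Integer>> rows, int k) {
        // Min-heap ordered by count: the head is the smallest retained row.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> row : rows) {
            heap.add(row);
            if (heap.size() > k) {
                heap.poll(); // evict the row with the smallest count
            }
        }
        // Sort the survivors so the largest count comes first.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // Made-up rows; "a" and "b" share the same count.
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        rows.add(new SimpleEntry<>("a", 10));
        rows.add(new SimpleEntry<>("b", 10));
        rows.add(new SimpleEntry<>("c", 3));
        System.out.println(topK(rows, 2));
    }
}
```

Which of several tied rows survives at the cut-off is arbitrary in this sketch; a secondary ordering on the name would make it deterministic.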


2014-09-22 11:54