Another exercise from "Hadoop in Action".
Exercise:
Compute the inner product of two vectors, for example:
v1 = [1 2 3]
v2 = [2 3 4]
inner product = 1*2 + 2*3 + 3*4 = 2 + 6 + 12 = 20
My input file (one element of each vector per line):
1.0 2.0
3.0 4.0
1 1
That is:
v1 = [1 3 1]
v2 = [2 4 1]
Result: 15
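Before running the job, the expected answer can be checked with a few lines of plain Java. (`DotCheck` is just an illustrative helper, not part of the MapReduce code.)

```java
public class DotCheck {
    // Inner product: one product per "input line", summed up.
    static double dot(double[] v1, double[] v2) {
        double sum = 0.0;
        for (int i = 0; i < v1.length; i++) {
            sum += v1[i] * v2[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // 1*2 + 3*4 + 1*1 = 15.0
        System.out.println(dot(new double[]{1, 3, 1}, new double[]{2, 4, 1}));
    }
}
```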
Approach:
Each input line holds one element from each vector. The mapper multiplies the pair and accumulates the products, and the sums are combined in Reduce.
Note:
The Reduce class keeps its running total in a static field, so it must run only once over each value. If main sets both setCombinerClass(Reduce.class) and setReducerClass(Reduce.class), the partial sums are added twice and the output becomes 30 instead of 15!
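The double-counting can be reproduced outside Hadoop. Because the total lives in a static field, the same accumulator is reused when the class runs first as a combiner and then again as a reducer. A minimal sketch (hypothetical `StaticSumBug` class, simplified from the real Reduce):

```java
public class StaticSumBug {
    // Same flaw as the real reducer: the total survives across calls.
    static double sum = 0.0;

    // Mimics Reduce.reduce(): adds all values into the static total.
    static double reduce(double[] values) {
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Combine phase: two mappers emitted partial sums 14.0 and 1.0.
        double combined = reduce(new double[]{14.0, 1.0}); // sum is now 15.0
        // Reduce phase: the combiner's output is added to the SAME static sum.
        double result = reduce(new double[]{combined});    // 15.0 + 15.0
        System.out.println(result); // 30.0, not 15.0
    }
}
```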
The code:
```java
package hadoop_in_action_exersice;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InnerProduct {

    private static final Text SUM = new Text("sum");

    public static class Map extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        // Running sum of the products seen by this mapper.
        private static double map_sum = 0.0;

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line holds one element of v1 and one element of v2.
            String[] arr = value.toString().split(" ");
            try {
                double v1 = Double.parseDouble(arr[0]);
                double v2 = Double.parseDouble(arr[1]);
                map_sum += v1 * v2;
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit the partial sum once, after all lines have been mapped.
            context.write(SUM, new DoubleWritable(map_sum));
        }
    }

    public static class Reduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

        private static double sum = 0;

        public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            if (key.toString().equals(SUM.toString())) {
                for (DoubleWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new DoubleWritable(sum));
            }
        }
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
            Job job = new Job(conf, "inner product");
            job.setJarByClass(InnerProduct.class);
            job.setMapperClass(Map.class);
            job.setCombinerClass(Reduce.class);
            // job.setReducerClass(Reduce.class);
            // Do not set this as well: Reduce keeps its total in a static field,
            // so running it as both combiner and reducer would sum twice.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.setInputPaths(job,
                    new Path("/home/hadoop/DataSet/Hadoop/Exercise/InnerProduct"));
            FileOutputFormat.setOutputPath(job,
                    new Path("/home/hadoop/DataSet/Hadoop/Exercise/InnerProduct-output"));
            System.out.println(job.waitForCompletion(true));
        } catch (IOException | ClassNotFoundException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}
```
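A cleaner way out of the combiner problem would be to drop the static field and keep the total in a local variable inside reduce(); then the same class could safely serve as both combiner and reducer, since each call starts from zero. A plain-Java analogue of that fix (hypothetical `LocalSumFix`, a sketch rather than the book's code):

```java
public class LocalSumFix {
    // Summation with a local accumulator: calling it twice no longer double-counts.
    static double reduce(double[] values) {
        double sum = 0.0; // local, fresh on every call
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Combine phase over the mappers' partial sums, then the reduce phase
        // over the combiner's output: the result stays 15.0.
        double combined = reduce(new double[]{14.0, 1.0});
        double result = reduce(new double[]{combined});
        System.out.println(result); // 15.0
    }
}
```

This also matches the combiner contract: Hadoop may run a combiner zero or more times, so combiner logic must not rely on state carried over between runs.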