引用
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
Reducer的数量
可通过以下方法设置
JobConf.setNumReduceTasks(int);
可以修改mapred.reduce.tasks参数,默认值为1。
官网推荐计算方法
- 0.95 * NUMBER_OF_NODES * mapred.tasktracker.reduce.tasks.maximum
- 1.75 * NUMBER_OF_NODES * mapred.tasktracker.reduce.tasks.maximum
其中选择0.95时,所有的reducer task都会在map task结束时立即启动;选择1.75时有部分reducer task需要等到第二轮执行(从先结束的节点开始执行),这样可以更好的做到负载均衡。
引用
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
Reducer阶段的工作
引用
Reducer has 3 primary phases: shuffle, sort and reduce.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
Secondary Sort
If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.
辅助排序可用于values需要排序的场合,如果将key对应的所有values直接在内存中排序可能会造成OOM问题,通过辅助排序将key和value封装成复合key,JobConf.setOutputKeyComparatorClass(Class)控制排序;JobConf.setOutputValueGroupingComparator(Class)控制分组;同时需要自定义Partitioner,将每个group下的数据分到同一个reducer中,而每个group会形成一个part文件,在key很多的场景下可能会需要文件的拼接。(我的理解一般key对应的value比较少的场景直接在reducer中排序,如果key对应的value非常多,需要使用辅助排序。)
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
分享到:
相关推荐
hadoop-annotations-3.1.1.jar hadoop-common-3.1.1.jar hadoop-mapreduce-client-core-3.1.1.jar hadoop-yarn-api-3.1.1.jar hadoop-auth-3.1.1.jar hadoop-hdfs-3.1.1.jar hadoop-mapreduce-client-hs-3.1.1.jar ...
赠送jar包:hbase-hadoop2-compat-1.2.12.jar; 赠送原API文档:hbase-hadoop2-compat-1.2.12-javadoc.jar; 赠送源代码:hbase-hadoop2-compat-1.2.12-sources.jar; 赠送Maven依赖信息文件:hbase-hadoop2-compat-...
赠送jar包:hbase-hadoop2-compat-1.1.3.jar; 赠送原API文档:hbase-hadoop2-compat-1.1.3-javadoc.jar; 赠送源代码:hbase-hadoop2-compat-1.1.3-sources.jar; 赠送Maven依赖信息文件:hbase-hadoop2-compat-...
hadoop2.6-common-bin 解决在Windows上操作hadoop出现 Could not locate executable问题
flink-1.0.3-bin-hadoop27-scala_2flink-1.0.3-bin-hadoop27-scala_2
hadoop3.3.0-winutils所有bin文件,亲测有效
赠送jar包:hadoop-mapreduce-client-jobclient-2.6.5.jar; 赠送原API文档:hadoop-mapreduce-client-jobclient-2.6.5-javadoc.jar; 赠送源代码:hadoop-mapreduce-client-jobclient-2.6.5-sources.jar; 赠送...
hadoop-eclipse-plugin-2.7.3和2.7.7的jar包 hadoop-eclipse-plugin-2.7.3和2.7.7的jar包 hadoop-eclipse-plugin-2.7.3和2.7.7的jar包 hadoop-eclipse-plugin-2.7.3和2.7.7的jar包
Eclipse集成Hadoop2.10.0的插件,使用`ant`对hadoop的jar包进行打包并适应Eclipse加载,所以参数里有hadoop和eclipse的目录. 必须注意对于不同的hadoop版本,` HADDOP_INSTALL_PATH/share/hadoop/common/lib`下的jar包...
赠送jar包:hadoop-yarn-client-2.6.5.jar; 赠送原API文档:hadoop-yarn-client-2.6.5-javadoc.jar; 赠送源代码:hadoop-yarn-client-2.6.5-sources.jar; 赠送Maven依赖信息文件:hadoop-yarn-client-2.6.5.pom;...
flink-1.7.2-bin-hadoop27-scala_2.11.tgz
spark-3.2.4-bin-hadoop3.2-scala2.13 安装包
spark-1.6.3-bin-hadoop2.4-without-hive.tgz 经测试,hadoop 2.8.2下可用。hive2.1.1 可用
Hadoop Real-World Solutions Cookbook source code 源代码
hadoop-eclipse-plugin-2.7.4.jar和hadoop-eclipse-plugin-2.7.3.jar还有hadoop-eclipse-plugin-2.6.0.jar的插件都在这打包了,都可以用。
赠送jar包:hadoop-yarn-common-2.6.5.jar 赠送原API文档:hadoop-yarn-common-2.6.5-javadoc.jar 赠送源代码:hadoop-yarn-common-2.6.5-sources.jar 包含翻译后的API文档:hadoop-yarn-common-2.6.5-javadoc-...
hadoop-eclipse-plugin-3.1.1, hadoop eclipse 插件 3.1.1
hadoop-eclipse-plugin-1.2.1hadoop-eclipse-plugin-1.2.1hadoop-eclipse-plugin-1.2.1hadoop-eclipse-plugin-1.2.1
hadoop-eclipse-plugin-3.1.3,eclipse版本为eclipse-jee-2020-03
赠送jar包:hadoop-mapreduce-client-common-2.6.5.jar; 赠送原API文档:hadoop-mapreduce-client-common-2.6.5-javadoc.jar; 赠送源代码:hadoop-mapreduce-client-common-2.6.5-sources.jar; 赠送Maven依赖信息...