`

InputFormat牛逼(3)org.apache.hadoop.mapreduce.InputFormat<K, V>

 
阅读更多


@Public
@Stable

InputFormat describes the input-specification for a Map-Reduce job.

The Map-Reduce framework relies on the InputFormat of the job to:


Validate the input-specification of the job.
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormats, typically sub-classes of FileInputFormat, is to split the input into logical InputSplits based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.

Clearly, logical splits based on input-size is insufficient for many applications since record boundaries are to respected. In such cases, the application has to also implement a RecordReader on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task.

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class InputFormat<K, V> {

  /** 
   * Logically split the set of input files for the job.  
   * 
   * <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.</p>
   *
   * <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
   * input files are not physically split into chunks. For e.g. a split could
   * be <i>&lt;input-file-path, start, offset&gt;</i> tuple. The InputFormat
   * also creates the {@link RecordReader} to read the {@link InputSplit}.
   * 
   * @param context job configuration.
   * @return an array of {@link InputSplit}s for the job.
   */
  public abstract 
    List<InputSplit> getSplits(JobContext context
                               ) throws IOException, InterruptedException;
  
  /**
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
   * @param split the split to be read
   * @param context the information about the task
   * @return a new record reader
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract 
    RecordReader<K,V> createRecordReader(InputSplit split,
                                         TaskAttemptContext context
                                        ) throws IOException, 
                                                 InterruptedException;

}




  • 大小: 69.8 KB
分享到:
评论

相关推荐

    CustomInputFormatCollection:Hadoop Mapreduce InputFormat 集合

    Hadoop 代码使用方式 job.setInputFormatClass(SmallFileCombineTextInputFormat.class);...org.apache.hadoop.mapreduce.sample.SmallFileWordCount -Dmapreduce.input.fileinputformat.split.maxsize=10

    wonderdog:批量加载以进行弹性搜索

    您可以在自己的Hadoop MapReduce作业中使用的 ,可从轻松使用这些InputFormat和OutputFormat类 从 LOAD和STORE到ElasticSearch的 一些用于与ElasticSearch进行交互的 &lt; project&gt; ... &lt; dependencies&gt; &lt; ...

    自定义MapReduce的InputFormat

    自定义MapReduce的InputFormat,实现提取指定开始与结束限定符的内容。

    Hadoop源码解析---MapReduce之InputFormat

    结合Hadoop源码,详细讲解了MapReduce开发中的InputFormat,很详细。

    Hadoop技术内幕:深入解析MapReduce架构设计与实现原理

    MapReduce设计理念与基本架构2.1 Hadoop发展史2.1.1 Hadoop产生背景2.1.2 Apache Hadoop新版本的特性2.1.3 Hadoop版本变迁2.2 Hadoop MapReduce设计目标2.3 MapReduce编程模型概述2.3.1 MapReduce编程模型...

    Hadoop实战中文版.PDF

    1649.2.2 获得命令行工具 1669.2.3 准备SSH密钥对 1689.3 在EC2上安装Hadoop 1699.3.1 配置安全参数 1699.3.2 配置集群类型 1699.4 在EC2上运行MapReduce程序 1719.4.1 将代码转移到Hadoop集群上 1719...

    Hadoop实战

    433.3 读和写 433.3.1 InputFormat 443.3.2 OutputFormat 493.4 小结 50第二部分 实战第4章 编写MapReduce基础程序 524.1 获得专利数据集 524.1.1 专利引用数据 534.1.2 专利描述数据 544.2 构建MapReduce程序的基础...

    Hadoop实战中文版

    《Hadoop实战》分为3个部分,深入浅出地介绍了Hadoop框架、编写和运行Hadoop数据处理程序所需的实践技能及Hadoop之外更大的生态系统。《Hadoop实战》适合需要处理大量离线数据的云计算程序员、架构师和项目经理阅读...

    ExcelRecordReaderMapReduce:可以读取Excel文件的MapReduce InputFormat

    ExcelRecordReaderMapReducehadoop mapreduce的MapReduce输入格式以读取Microsoft Excel电子表格执照Apache许可。用法1.下载并运行ant。 2.在您的环境中包括ExcelRecordReaderMapReduce-0.0.1-SNAPSHOT.jar 3.使用...

    Hadoop实战(陆嘉恒)译

    Hadoop组件3.1 HDFS 文件操作3.1.1 基本文件命令3.1.2 编程读写HDFS3.2 剖析MapReduce 程序3.2.1 Hadoop数据类型3.2.2 Mapper3.2.3 Reducer3.2.4 Partitioner:重定向Mapper输出3.2.5 Combiner:本地reduce3.2.6 ...

    大数据学习(九):mapreduce编程模型及具体框架实现

    map reduce编程模型把数据运算流程分成2个阶段  阶段1:读取原始数据,形成key-value数据(map方法)  阶段2:将阶段1的key-value数据按照相同key分组聚合(reduce方法) ... 读数据:InputFormat–&gt;TextInputFormat

    mapreduce_training:用于教学目的的MapReduce应用程序集

    MapReduce自定义InputFormat和RecordReader实现 MapReduce自定义OutputFormat和RecordWriter实现 Pig自定义LoadFunc加载和解析Apache HTTP日志事件 Pig的自定义EvalFunc使用MaxMind GEO API将IP地址转换为位置 另外...

    mapfileinputformat:MapFiles 的 Hadoop InputFormat,它在将任何内容传递给映射器之前过滤不相关的 FileSplits

    映射文件输入格式MapFiles 的 Hadoop InputFormat,它在将任何内容传递给映射器之前过滤不相关的 FileSplits。目的假设您的文件系统中有一些带有排序键的非常大的文件,并且键已排序。 在编写 MapReduce 作业时,您...

    FocusBigData:【大数据成神之路学习路径+面经+简历】

    FocusBigData :elephant:Hadoop分布存储框架 Hadoop篇 HDFS篇 ...MapReduce之InputFormat数据输入 MapReduce之OutputFormat数据输出 MapReduce之Shuffle机制 MapReduce之MapJoin和ReduceJoin MapReduce之

    Spark RDD详解

    Spark与Apache Hadoop有何关系? Spark是与Hadoop数据兼容的快速通用处理引擎。它可以通过YARN或Spark的Standalone在Hadoop集群中运行,并且可以处理HDFS、Hbase、Cassandra、Hive和任何Hadoop InputFormat中的数据...

    ecosystem:TensorFlow与其他开源框架的集成

    TensorFlow生态系统 该存储库包含将TensorFlow与其他开源框架集成的示例。... -Hadoop MapReduce和Spark的TFRecord文件InputFormat / OutputFormat。 -Spark TensorFlow连接器 -Python软件包,可帮助用户使用TensorF

Global site tag (gtag.js) - Google Analytics