MapReduce: Multi-Directory / Multi-File Output

wangjin161


I. Introduction

1. The old API provides org.apache.hadoop.mapred.lib.MultipleOutputFormat and org.apache.hadoop.mapred.lib.MultipleOutputs

MultipleOutputFormat allows writing the output data to different output files.

MultipleOutputs creates multiple OutputCollectors. Each OutputCollector can have its own OutputFormat and types for the key/value pair. Your MapReduce program will decide what to output to each OutputCollector.

2. The new API provides org.apache.hadoop.mapreduce.lib.output.MultipleOutputs

It combines the functionality of the two old-API classes above; the new API has no MultipleOutputFormat.

  The MultipleOutputs class simplifies writing output data to multiple outputs.

  Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own OutputFormat, with its own key class and with its own value class.

  Case two: writing data to different files provided by the user.

The following passage is from Hadoop: The Definitive Guide (3rd edition, Early Release), p. 251:

  "In the old MapReduce API there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API."

II. Usage

1. Writing to multiple files or multiple directories:

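If you only need to write to multiple directories, the driver needs no extra changes; just add the following code to the MapClass or Reduce class:

    private MultipleOutputs<Text, IntWritable> mos;

    // Create the MultipleOutputs instance once per task.
    public void setup(Context context) throws IOException, InterruptedException {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    // Close it when the task finishes so that all record writers are flushed.
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }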

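    String baseOutputPath = context.getConfiguration().get("path"); // parent directory

Then mos.write(key, value, baseOutputPath) can be used in place of context.write(key, value).

When this is used in MapClass or Reduce, the default files part-m-00* or part-r-00* are still created, but they are empty (size 0). Also note that only records the mapper emits through context.write are passed on to Reduce; records written with mos.write in the mapper go straight to the output files.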
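To make the resulting file layout concrete, here is a small plain-Java sketch (no Hadoop dependency) of how the final file name is derived from baseOutputPath. It mirrors the naming rule <baseOutputPath>-<m|r>-<5-digit partition number>; the names OutputNameSketch and uniqueFile are hypothetical and are not part of the Hadoop API.

```java
import java.util.Locale;

// Hypothetical helper, not part of Hadoop: it only mimics the naming rule
// <baseOutputPath>-<m|r>-<5-digit partition number>.
class OutputNameSketch {

    // isMapTask selects the "m" or "r" marker; partition is the task's partition number.
    static String uniqueFile(String baseOutputPath, boolean isMapTask, int partition) {
        String taskType = isMapTask ? "m" : "r";
        return String.format(Locale.ROOT, "%s-%s-%05d", baseOutputPath, taskType, partition);
    }

    public static void main(String[] args) {
        // A plain name becomes a file directly under the job output directory.
        System.out.println(uniqueFile("MOSText", true, 0));    // MOSText-m-00000
        // A name containing "/" creates a subdirectory under the output directory.
        System.out.println(uniqueFile("2014/part", false, 3)); // 2014/part-r-00003
        // A name ending in "/" yields a directory holding a file named "-m-00000".
        System.out.println(uniqueFile("a/", true, 0));         // a/-m-00000
    }
}
```

A base path containing "/" is what makes multi-directory output possible. If the empty part-m-00* or part-r-00* files are unwanted, calling LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) in the driver is the usual way to suppress them.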

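2. Writing output in multiple formats: this kind of output needs a named output (which also becomes the file name), chosen according to your business requirements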

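import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestwithMultipleOutputs extends Configured implements Tool {

    public static class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {

        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            mos = new MultipleOutputs<Text, IntWritable>(context);
        }

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line is expected to look like key-int-text.
            String line = value.toString();
            String[] tokens = line.split("-");

            mos.write("MOSInt", new Text(tokens[0]), new IntWritable(Integer.parseInt(tokens[1]))); // (1)
            // Values are wrapped in Text to match the value class declared for "MOSText".
            mos.write("MOSText", new Text(tokens[0]), new Text(tokens[2])); // (2)
            mos.write("MOSText", new Text(tokens[0]), new Text(line), tokens[0] + "/"); // (3) can also write to a chosen file or directory
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        // new Job(conf, name) is deprecated in Hadoop 2; Job.getInstance(conf, name) replaces it.
        Job job = new Job(conf, "word count with MultipleOutputs");
        job.setJarByClass(TestwithMultipleOutputs.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setMapperClass(MapClass.class);
        job.setNumReduceTasks(0); // map-only job

        // Register each named output with its own format and key/value classes.
        MultipleOutputs.addNamedOutput(job, "MOSInt", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(job, "MOSText", TextOutputFormat.class, Text.class, Text.class);

        // Return the status instead of calling System.exit here; main() exits with it.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TestwithMultipleOutputs(), args);
        System.exit(res);
    }
}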
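As an illustration of where the three mos.write calls above send a record, here is a plain-Java simulation that does not use Hadoop. It assumes input lines of the form key-int-text, a map task with partition 0, and the default tab separator of TextOutputFormat. RoutingSketch and route are hypothetical names used only for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation, not Hadoop code: maps output file name -> lines written.
class RoutingSketch {

    static Map<String, List<String>> route(String line) {
        String[] tokens = line.split("-");
        if (tokens.length < 3) {
            throw new IllegalArgumentException("expected key-int-text, got: " + line);
        }
        String suffix = "-m-00000"; // assumed: map task, partition 0
        Map<String, List<String>> files = new LinkedHashMap<>();
        // (1) named output "MOSInt": key and integer value
        add(files, "MOSInt" + suffix, tokens[0] + "\t" + Integer.parseInt(tokens[1]));
        // (2) named output "MOSText": key and text value
        add(files, "MOSText" + suffix, tokens[0] + "\t" + tokens[2]);
        // (3) base output path "<key>/": whole line, stored in a per-key directory
        add(files, tokens[0] + "/" + suffix, tokens[0] + "\t" + line);
        return files;
    }

    private static void add(Map<String, List<String>> files, String name, String record) {
        files.computeIfAbsent(name, k -> new ArrayList<>()).add(record);
    }

    public static void main(String[] args) {
        // Print each simulated file with the records it would receive.
        route("a-1-hello").forEach((file, records) -> System.out.println(file + " <- " + records));
    }
}
```

The third call shows why a base path ending in "/" produces a directory per key whose file is named only by the task suffix.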

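3. For more complex cases you can subclass MultipleTextOutputFormat (an old-API class in org.apache.hadoop.mapred.lib); the various target directories can all be read from the job configuration (context.getConfiguration())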

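import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class MultiOutputFormatByFileName extends MultipleTextOutputFormat<Text, Text> {

    // Place each file in a directory named after the part of the leaf name before the first "-".
    @Override
    protected String generateLeafFileName(String name) {
        System.out.println(name); // debug: show the incoming leaf name
        String[] names = name.split("-");
        // Path.SEPARATOR ("/") replaces File.separator, since HDFS paths always use "/".
        return names[0] + Path.SEPARATOR + name;
    }

    // Left at the default: the file name does not depend on the key or the value.
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return super.generateFileNameForKeyValue(key, value, name);
    }
}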
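The effect of generateLeafFileName can be checked outside Hadoop. Below is a minimal sketch, assuming the leaf name arrives in the form <namedOutput>-<m|r>-<nnnnn> (for example q-r-00000). LeafNameSketch is a hypothetical stand-in for the override above, not a Hadoop class.

```java
// Hypothetical stand-in for the generateLeafFileName override; no Hadoop dependency.
class LeafNameSketch {

    // Prefix the leaf name with a directory taken from the text before the first "-".
    static String generateLeafFileName(String name) {
        String[] names = name.split("-");
        return names[0] + "/" + name;
    }

    public static void main(String[] args) {
        System.out.println(generateLeafFileName("q-r-00000"));  // q/q-r-00000
        System.out.println(generateLeafFileName("bi-m-00002")); // bi/bi-m-00002
    }
}
```

Under that assumption each named output ends up in a directory of its own. A leaf name with a hyphen before the task marker would be cut at the first hyphen, so the directory would be only the leading fragment.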

In main (init is presumably the job's JobConf, since MultipleTextOutputFormat belongs to the old API):

MultipleOutputs.addNamedOutput(init, "q", MultiOutputFormatByFileName.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(init, "x", MultiOutputFormatByFileName.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(init, "bi", MultiOutputFormatByFileName.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(init, "bu", MultiOutputFormatByFileName.class, Text.class, Text.class);

2014-08-28 10:34