oozie - Workflow Map-Reduce Action

3.2.2 Map-Reduce Action


A map-reduce action can be configured to perform file system cleanup and directory creation before starting the map reduce job. This capability enables Oozie to retry a Hadoop job in the situation of a transient failure (Hadoop checks the non-existence of the job output directory and then creates it when the Hadoop job is starting, thus a retry without cleanup of the job output directory would fail).

The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the workflow execution path.

The counters of the Hadoop job and the job exit status (FAILED, KILLED or SUCCEEDED) must be available to the workflow job after the Hadoop job ends. This information can be used from within decision nodes and other actions' configurations.
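For example, a downstream decision node can branch on a counter of the completed action via the hadoop:counters() EL function. The following is a minimal sketch; the action and node names, and the counter group/name, are illustrative:

<decision name="check-mr-output">
    <switch>
        <!-- route forward only if the 'myfirstHadoopJob' action emitted reduce output records -->
        <case to="myNextAction">
            ${hadoop:counters('myfirstHadoopJob')['org.apache.hadoop.mapred.Task$Counter']['REDUCE_OUTPUT_RECORDS'] gt 0}
        </case>
        <default to="errorCleanup"/>
    </switch>
</decision>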

The map-reduce action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop map/reduce job.

Hadoop JobConf properties can be specified in a JobConf XML file bundled with the workflow application or they can be indicated inline in the map-reduce action configuration.

The configuration properties are loaded in the following order: streaming, job-xml and configuration; later values override earlier values.

Streaming and inline property values can be parameterized (templatized) using EL expressions.

The Hadoop mapred.job.tracker and fs.default.name properties must not be present in the job-xml and inline configuration.
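Instead, the JobTracker and NameNode are supplied through the action's own job-tracker and name-node elements, for example (host names are illustrative):

<map-reduce>
    <!-- JobTracker and NameNode are set here, never in job-xml or the inline configuration -->
    <job-tracker>foo:9001</job-tracker>
    <name-node>bar:9000</name-node>
    ...
</map-reduce>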


3.2.2.1 Adding Files and Archives for the Job

The file and archive elements make files and archives available to map-reduce jobs. If the specified path is relative, the file or archive is assumed to be within the application directory, in the corresponding sub-path. If the path is absolute, the file or archive is expected at the given absolute path.

Files specified with the file element will be symbolic links in the home directory of the task.

If a file is a native library (an '.so' or a '.so.#' file), it will be symlinked as an '.so' file in the task running directory, thus available to the task JVM.

To force a symlink for a file in the task running directory, append a '#' followed by the symlink name. For example, 'mycat.sh#cat'.

Refer to the Hadoop DistributedCache documentation for more details on files and archives.
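For instance, the following hypothetical fragment ships a script and an archive with the job, symlinking them into the task running directory under shorter names:

<!-- relative path: resolved against the workflow application directory -->
<file>scripts/mycat.sh#cat</file>
<!-- absolute path: used as-is on HDFS -->
<archive>/user/tucu/lib/mylib.jar#mylib</archive>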


3.2.2.2 Streaming

Streaming information can be specified in the streaming element.

The mapper and reducer elements are used to specify the executable/script to be used as mapper and reducer.

User-defined scripts must be bundled with the workflow application and declared in the files element of the streaming configuration. If they are not declared in the files element of the configuration, it is assumed they are already available (and in the command PATH) on the Hadoop slave machines.

Some streaming jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using the file and archive elements described in the previous section.

The mapper/reducer can be overridden by the mapred.mapper.class or mapred.reducer.class properties in the job-xml file or configuration element.
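For example, a property such as the following in the configuration element would replace the streaming mapper with a Java mapper class (a sketch using the standard Hadoop IdentityMapper):

<configuration>
    <property>
        <!-- overrides the <mapper> declared in the streaming element -->
        <name>mapred.mapper.class</name>
        <value>org.apache.hadoop.mapred.lib.IdentityMapper</value>
    </property>
</configuration>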


3.2.2.3 Pipes

Pipes information can be specified in the pipes element.

A subset of the command line options that can be used with the Hadoop Pipes Submitter can be specified via elements: map, reduce, inputformat, partitioner, writer and program.

The program element is used to specify the executable/script to be used.

The user-defined program must be bundled with the workflow application.

Some pipes jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using the file and archive elements described in the previous section.

Pipes properties can be overridden by specifying them in the job-xml file or configuration element.
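For example, the following sketch sets two of the properties used by the old Hadoop Pipes submitter inline (property names are assumptions based on that submitter):

<configuration>
    <property>
        <!-- use the Java RecordReader instead of a C++ one -->
        <name>hadoop.pipes.java.recordreader</name>
        <value>true</value>
    </property>
    <property>
        <!-- use the Java RecordWriter instead of a C++ one -->
        <name>hadoop.pipes.java.recordwriter</name>
        <value>true</value>
    </property>
</configuration>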

3.2.2.4 Syntax

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <map-reduce>
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <streaming>
                <mapper>[MAPPER-PROCESS]</mapper>
                <reducer>[REDUCER-PROCESS]</reducer>
                <record-reader>[RECORD-READER-CLASS]</record-reader>
                <record-reader-mapping>[NAME=VALUE]</record-reader-mapping>
                ...
                <env>[NAME=VALUE]</env>
                ...
            </streaming>
<!-- Either streaming or pipes can be specified for an action, not both -->
            <pipes>
                <map>[MAPPER]</map>
                <reduce>[REDUCER]</reduce>
                <inputformat>[INPUTFORMAT]</inputformat>
                <partitioner>[PARTITIONER]</partitioner>
                <writer>[OUTPUTFORMAT]</writer>
                <program>[EXECUTABLE]</program>
            </pipes>
            <job-xml>[JOB-XML-FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </map-reduce>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
The prepare element, if present, specifies a list of paths to delete before starting the job. This should be used exclusively for directory cleanup for the job to be executed. The delete operation will be performed in the fs.default.name filesystem.

The job-xml element, if present, must refer to a Hadoop JobConf job.xml file bundled in the workflow application. The job-xml element is optional and, if present, there can be only one.

The configuration element, if present, contains JobConf properties for the Hadoop job.

Properties specified in the configuration element override properties specified in the file specified in the job-xml element.
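As a sketch of this precedence rule, if a bundled job.xml (hypothetical content below) sets mapred.reduce.tasks to 1, the inline value is the one the job actually uses:

<!-- bundled job.xml (referenced by <job-xml>) -->
<configuration>
    <property>
        <name>mapred.reduce.tasks</name>
        <value>1</value>
    </property>
</configuration>

<!-- inline <configuration> in the action: this value (5) overrides the job.xml value -->
<configuration>
    <property>
        <name>mapred.reduce.tasks</name>
        <value>5</value>
    </property>
</configuration>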

The file element, if present, must specify the target symbolic link for binaries by separating the original file and target with a # (file#target-sym-link). This is not required for libraries.

The mapper and reducer elements for streaming jobs should specify the executable command with URL encoding, e.g. '%' should be replaced by '%25'.
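For example, a mapper command containing a literal '%' would be written with that character encoded (the script name is hypothetical):

<streaming>
    <!-- runs: /bin/bash sampler.sh 10% -->
    <mapper>/bin/bash sampler.sh 10%25</mapper>
</streaming>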

Example:

<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirstHadoopJob">
        <map-reduce>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <prepare>
                <delete path="hdfs://foo:9000/usr/tucu/output-data"/>
            </prepare>
            <job-xml>/myfirstjob.xml</job-xml>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/usr/tucu/input-data</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/usr/tucu/output-data</value>
                </property>
                <property>
                    <name>mapred.reduce.tasks</name>
                    <value>${firstJobReducers}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="myNextAction"/>
        <error to="errorCleanup"/>
    </action>
    ...
</workflow-app>
In the above example, the number of Reducers to be used by the Map/Reduce job has to be specified as a parameter of the workflow job configuration when creating the workflow job.
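A minimal sketch of such a workflow job configuration is a job.properties file passed at submission time (the application path and reducer count are illustrative):

# hypothetical job.properties, submitted with: oozie job -run -config job.properties
oozie.wf.application.path=hdfs://bar:9000/user/tucu/foo-wf
firstJobReducers=10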

Streaming Example:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="firstjob">
        <map-reduce>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <prepare>
                <delete path="${output}"/>
            </prepare>
            <streaming>
                <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper>
                <reducer>/bin/bash testarchive/bin/reducer.sh</reducer>
            </streaming>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${input}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${output}</value>
                </property>
                <property>
                    <name>stream.num.map.output.key.fields</name>
                    <value>3</value>
                </property>
            </configuration>
            <file>/users/blabla/testfile.sh#testfile</file>
            <archive>/users/blabla/testarchive.jar#testarchive</archive>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
  ...
</workflow-app>
Pipes Example:

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="firstjob">
        <map-reduce>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <prepare>
                <delete path="${output}"/>
            </prepare>
            <pipes>
                <program>bin/wordcount-simple#wordcount-simple</program>
            </pipes>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${input}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${output}</value>
                </property>
            </configuration>
            <archive>/users/blabla/testarchive.jar#testarchive</archive>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
  ...
</workflow-app>