Cascading Terminology and Concepts

 

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or on a distributed computing cluster. On a single node, Cascading's "local mode" can be used to efficiently test code and process local files before deploying to a cluster. On a distributed cluster running the Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling. Java developers can leverage Cascading to develop robust data analytics and data management applications on Apache Hadoop. You can find a kick-start example in this blog post.

Terminology of Cascading

The Cascading processing model is based on a metaphor of pipes (data streams) and filters (data operations). Thus the Cascading API allows the developer to assemble pipe assemblies that split, merge, group, or join streams of data while applying operations to each data record or groups of records.
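As a taste of what that looks like in code, here is a minimal sketch of a split and merge; the pipe names, field names, and regex patterns are illustrative assumptions, not from the original post:

    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.tuple.Fields;

    Pipe logs = new Pipe( "logs" );
    // Split: two branches consume the same upstream pipe, each keeping
    // only the tuples whose "line" field matches its pattern.
    Pipe errors = new Each( new Pipe( "errors", logs ), new Fields( "line" ), new RegexFilter( "ERROR" ) );
    Pipe warns = new Each( new Pipe( "warns", logs ), new Fields( "line" ), new RegexFilter( "WARN" ) );
    // Merge: GroupBy recombines branches with identical fields into one stream.
    Pipe merged = new GroupBy( "merged", Pipe.pipes( errors, warns ), new Fields( "line" ) );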

 

In Cascading, we call a data record a tuple, a simple chain of pipes without forks or merges a branch, an interconnected set of pipe branches a pipe assembly, and a series of tuples passing through a pipe branch or assembly a tuple stream.
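These terms map directly onto classes: a tuple is a cascading.tuple.Tuple, and field names come from cascading.tuple.Fields. A tiny sketch, with hypothetical field names:

    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;
    import cascading.tuple.TupleEntry;

    // A Tuple is an ordered list of values; Fields names those positions.
    Fields fields = new Fields( "name", "age" );
    Tuple tuple = new Tuple( "alice", 30 );
    // A TupleEntry pairs the two, allowing lookup of values by field name.
    TupleEntry entry = new TupleEntry( fields, tuple );
    String name = entry.getString( "name" ); // "alice"
    int age = entry.getInteger( "age" );     // 30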

 

Pipe assemblies are specified independently of the data sources they are to process. So before a pipe assembly can be executed, it must be bound to taps, i.e., data sources and sinks. The result of binding one or more pipe assemblies to taps is a flow, which can be executed on a single computer in local mode or on a cluster via the Hadoop framework.
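To make the binding concrete, here is a minimal word-count flow in local mode; the file paths and field names are assumptions for illustration:

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    // Taps: where the data comes from and where it goes.
    Tap source = new FileTap( new TextLine( new Fields( "line" ) ), "input.txt" );
    Tap sink = new FileTap( new TextLine(), "output.txt", SinkMode.REPLACE );

    // Pipe assembly: split lines into words, group by word, count each group.
    Pipe pipe = new Each( "wordcount", new Fields( "line" ),
        new RegexSplitGenerator( new Fields( "word" ), "\\s+" ) );
    pipe = new GroupBy( pipe, new Fields( "word" ) );
    pipe = new Every( pipe, new Count( new Fields( "count" ) ), Fields.ALL );

    // Binding the assembly to taps yields an executable Flow.
    FlowDef flowDef = FlowDef.flowDef()
        .setName( "wc" )
        .addSource( "wordcount", source )
        .addTailSink( pipe, sink );
    Flow flow = new LocalFlowConnector().connect( flowDef );
    flow.complete();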

 

 

Multiple flows can be grouped together and executed as a single process. In this context, if one flow depends on the output of another, it is not executed until all of its data dependencies are satisfied. Such a collection of flows is called a cascade.

 

Concepts of Cascading

  • Pipe Assemblies:  Pipe assemblies define what work should be done against tuple streams, which are read from tap sources and written to tap sinks. The work performed on the data stream may include actions such as filtering, transforming, organizing, and calculating. Pipe assemblies may use multiple sources and multiple sinks, and may define splits, merges, and joins to manipulate the tuple streams.
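    For instance, a join of two branches on a shared key can be sketched with CoGroup; the pipe and field names here are assumptions:

      import cascading.pipe.CoGroup;
      import cascading.pipe.Pipe;
      import cascading.tuple.Fields;

      Pipe users = new Pipe( "users" );   // assumed fields: userId, name
      Pipe orders = new Pipe( "orders" ); // assumed fields: userId, total
      // Join the two streams on userId; the declared output fields must be
      // unique, hence the second key is renamed to userId_.
      Pipe joined = new CoGroup( users, new Fields( "userId" ),
          orders, new Fields( "userId" ),
          new Fields( "userId", "name", "userId_", "total" ) );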
  • Pipes:  The base class is cascading.pipe.Pipe; its subclasses include Each, Merge, GroupBy, Every, CoGroup, HashJoin, and SubAssembly.

    The following table summarizes the different types of pipes.

    Pipe         The base class; mostly used to name a branch of a pipe assembly.
    SubAssembly  The base class for encapsulating reusable pipe assemblies.
    Each         Applies a Function or Filter to each individual tuple.
    Merge        Combines two or more streams with identical fields, without grouping.
    GroupBy      Groups the stream on the given grouping fields; can also merge streams that share the same fields.
    Every        Applies an Aggregator or Buffer to every group of tuples.
    CoGroup      Joins two or more streams on common field values.
    HashJoin     Joins streams by keeping all but the largest stream in memory; suited to joining a large stream with small ones.

    We will talk more about pipes in another blog post.
  • Connector: Cascading supports pluggable planners that allow it to execute on differing platforms. Planners are invoked by an associated FlowConnector subclass. Currently, only two planners are provided: LocalFlowConnector and HadoopFlowConnector.

    LocalFlowConnector provides a local-mode planner for running Cascading completely in memory on the current computer, while HadoopFlowConnector provides a planner for running Cascading on an Apache Hadoop cluster.
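    A minimal sketch of picking a planner; the Properties setup and Main class are assumptions:

      import java.util.Properties;
      import cascading.flow.Flow;
      import cascading.flow.FlowConnector;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.property.AppProps;

      Properties properties = new Properties();
      // Tell Hadoop which jar contains the application classes (Main is hypothetical).
      AppProps.setApplicationJarClass( properties, Main.class );
      FlowConnector connector = new HadoopFlowConnector( properties );
      Flow flow = connector.connect( flowDef ); // flowDef assembled elsewhere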
  • Tap:  All input data comes in from, and all output data goes out to, some instance of cascading.tap.Tap. A tap can be read from, which makes it a source, or written to, which makes it a sink. More commonly, taps act as both sources and sinks when shared between flows. Cascading ships with many Tap implementations, such as Hfs for Hadoop filesystem paths and FileTap for local mode, plus composite taps like MultiSourceTap and MultiSinkTap.

    We won't cover all of the taps here; please refer to the Cascading javadoc for details on these classes.
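    For example, on the Hadoop platform an Hfs tap pairs a path with a Scheme; the HDFS paths below are assumptions:

      import cascading.scheme.hadoop.TextLine;
      import cascading.tap.SinkMode;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;

      // Source: read lines from HDFS; sink: overwrite any previous output.
      Tap source = new Hfs( new TextLine(), "hdfs://namenode/data/in" );
      Tap sink = new Hfs( new TextLine(), "hdfs://namenode/data/out", SinkMode.REPLACE );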
  • Scheme: If the Tap is about where the data is and how to access it, the Scheme is about what the data is and how to read it. Every Tap must have a Scheme that describes the data. Cascading provides four core Scheme classes: TextLine, TextDelimited, SequenceFile, and WritableSequenceFile.
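    For example, TextDelimited can describe a tab-separated file with a header row; the field layout is an assumption:

      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tuple.Fields;

      // hasHeader = true: the first line names the columns and is skipped on read.
      Fields fields = new Fields( "user", "age" );
      TextDelimited scheme = new TextDelimited( fields, true, "\t" );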

     
  • Field Set:  Cascading applications can perform complex manipulation or "field algebra" on the fields stored in tuples, using Fields sets, a feature of the Fields class that provides a sort of wildcard tool for referencing sets of field values. These predefined Fields sets are constant values on the Fields class. They can be used in many places where the Fields class is expected. 
    /** Field UNKNOWN represents a collection of fields whose names and size are unknown */
    public static final Fields UNKNOWN = new Fields( Kind.UNKNOWN );
    /** Field NONE represents a wildcard for no fields */
    public static final Fields NONE = new Fields( Kind.NONE );
    /** Field ALL represents a wildcard for all fields */
    public static final Fields ALL = new Fields( Kind.ALL );
    /** Field GROUP represents all fields used as the key for the last grouping */
    public static final Fields GROUP = new Fields( Kind.GROUP );
    /** Field VALUES represents all fields used as values for the last grouping */
    public static final Fields VALUES = new Fields( Kind.VALUES );
    /** Field ARGS represents all fields used as the arguments for the current operation */
    public static final Fields ARGS = new Fields( Kind.ARGS );
    /** Field RESULTS represents all fields returned by the current operation */
    public static final Fields RESULTS = new Fields( Kind.RESULTS );
    /** Field REPLACE represents all incoming fields, and allows their values to be replaced by the current operation's results */
    public static final Fields REPLACE = new Fields( Kind.REPLACE );
    /** Field SWAP represents all fields not used as arguments for the current operation, plus the operation's results */
    public static final Fields SWAP = new Fields( Kind.SWAP );
    /** Field FIRST represents the first field position, 0 */
    public static final Fields FIRST = new Fields( 0 );
    /** Field LAST represents the last field position, -1 */
    public static final Fields LAST = new Fields( -1 );
    Common ways to merge input and result fields into the desired output fields, and their relationships to the Each and Every pipes and to Functions, Aggregators, and Buffers, are charted in the "Each and Every Pipes" chapter of the Cascading user guide; a few minutes with that chart may help clarify the discussion of fields, tuples, and pipes.
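    As a small illustration, the output selector on an Each pipe decides which incoming and result fields survive; the regex and field names are assumptions:

      import cascading.operation.regex.RegexParser;
      import cascading.pipe.Each;
      import cascading.pipe.Pipe;
      import cascading.tuple.Fields;

      // Parse the leading IP address out of each log line.
      Pipe pipe = new Each( "parse", new Fields( "line" ),
          new RegexParser( new Fields( "ip" ), "^([^ ]*)" ),
          Fields.RESULTS ); // keep only the operation's results: "ip"
      // With Fields.ALL as the output selector, the output would instead
      // be the incoming "line" plus the parsed "ip".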

     
  • Flow:  When pipe assemblies are bound to source and sink taps, a Flow is created. Flows are executable in the sense that, once they are created, they can be started and will execute on the specified platform. If the Hadoop platform is specified, the Flow will execute on a Hadoop cluster. A Flow is essentially a data processing pipeline that reads data from sources, processes the data as defined by the pipe assembly, and writes data to the sinks. 
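    A sketch of connecting and running a Flow; flowDef and properties are assembled elsewhere, and the DOT file path is an assumption:

      import cascading.flow.Flow;
      import cascading.flow.hadoop.HadoopFlowConnector;

      Flow flow = new HadoopFlowConnector( properties ).connect( flowDef );
      flow.writeDOT( "build/flow.dot" ); // render the planned pipeline for inspection
      flow.complete();                   // start the flow and block until it finishes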

     
  • Cascade:  A Cascade allows multiple Flow instances to be executed as a single logical unit. If there are dependencies between the Flows, they are executed in the correct order. Further, Cascades act like Ant builds or Unix makefiles; that is, a Cascade only executes Flows that have stale sinks (i.e., output data that is older than the input data).
    CascadeConnector connector = new CascadeConnector();
    // Flows run in dependency order; only flows with stale sinks execute.
    Cascade cascade = connector.connect( flowFirst, flowSecond, flowThird );
    cascade.complete(); // start the cascade and block until it finishes
     

Reference: http://docs.cascading.org/cascading/2.1/userguide/pdf/userguide.pdf
