RPC and Serialization with Hadoop, Thrift, and Protocol Buffers

This post was originally written by Tom White and published at http://www.cloudera.com/resources/rpc-and-serialization-in-hadoop.

Hadoop and related projects like Thrift provide a choice of protocols and formats for doing RPC and serialization. In this post I'll briefly run through them and explain where they came from, how they relate to each other and how Google's newly released Protocol Buffers might fit in.

RPC and Writables
Hadoop has its own RPC mechanism that dates back to when Hadoop was a part of Nutch. It's used throughout Hadoop as the mechanism by which daemons talk to each other. For example, a DataNode communicates with the NameNode using the RPC interface DatanodeProtocol.

Protocols are defined using Java interfaces whose method arguments and return types are primitives, Strings, Writables, or arrays. These types can all be serialized using Hadoop's specialized serialization format, based on Writable. Combined with the magic of Java dynamic proxies, we get a simple RPC mechanism which, to the caller, looks like an ordinary Java interface.
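As a rough sketch (the interface and method names below are invented for illustration, not the real DatanodeProtocol), a protocol is nothing more than a Java interface obeying those type restrictions:

import java.io.IOException;

// A hypothetical protocol in the style of Hadoop's RPC interfaces. Method
// arguments and return types are restricted to primitives, Strings, Writables
// and arrays of these, so Hadoop knows how to put them on the wire.
public interface ExampleDatanodeProtocol {
  long versionID = 1L;  // Hadoop RPC protocols carry a version number

  void sendHeartbeat(String datanodeId, long capacity, long remaining)
      throws IOException;

  String[] getStoredBlocks(String datanodeId) throws IOException;
}

On the client side, Hadoop's RPC machinery hands back a dynamic-proxy implementation of such an interface, so calling a remote daemon reads like calling a local object.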

MapReduce and Writables
Hadoop uses Writables for another, quite different, purpose: as a serialization format for MapReduce programs. If you've ever written a Hadoop MapReduce program you will have used Writables for the key and value types. For example:

public class MapClass implements Mapper<LongWritable, Text, Text, IntWritable> {
  // ...
}

(Text is just a Writable version of Java String.)

The primary benefit of using Writables is in their efficiency. Compared to Java serialization, which would have been an obvious alternative choice, they have a more compact representation. Writables don't store their type in the serialized representation, since at the point of deserialization it is known which type is expected. For the MapReduce code above, the input key is a LongWritable, so an empty LongWritable instance is asked to populate itself from the input data stream.
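To make that concrete, here is a minimal sketch of the round trip (the class name WritableRoundTrip is just for this example): only the eight bytes of the long go on the wire, and the reader supplies the expected type itself.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;

public class WritableRoundTrip {
  public static void main(String[] args) throws IOException {
    // Serialize: just the raw long is written; no class name or type tag.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    new LongWritable(163).write(new DataOutputStream(bytes));

    // Deserialize: the reader already knows a LongWritable is expected, so an
    // empty instance is asked to populate itself from the input stream.
    LongWritable value = new LongWritable();
    value.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println(value.get());  // prints 163
  }
}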

More flexible MapReduce
There are downsides to having to use Writables for MapReduce types, however. For a newcomer to Hadoop it's another hurdle: something else to learn ("why can't I just use a String?"). More serious, perhaps, is that it's hard to use different binary storage formats for MapReduce input and output. For example, Apache Thrift (see below) is an increasingly popular way of storing binary data. It's possible, but cumbersome and inefficient, to read or write Thrift data from MapReduce.

From Hadoop 0.17.0 onwards you no longer have to use Writables for key and value types in MapReduce programs: you can use any serialization framework. (Note that this change is completely independent of Hadoop's RPC mechanism, which still uses Writables - and can only use Writables - as its on-wire format.) So it's easier to use Thrift types, say, throughout your MapReduce program. Or you can even use Java serialization (with some limitations, which will be fixed). What's more, you can add your own serialization framework if you like.
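For example, to let a job use plain java.io.Serializable keys and values, you register an extra serialization framework through the io.serializations property. The class below is a sketch (JavaSerializationJob is a made-up name; check which Serialization implementations ship with your Hadoop version):

import org.apache.hadoop.mapred.JobConf;

public class JavaSerializationJob {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // Hadoop walks this comma-separated list of Serialization implementations
    // to find one that accepts each key/value class. WritableSerialization is
    // the default; adding JavaSerialization makes Serializable types usable too.
    conf.set("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization,"
        + "org.apache.hadoop.io.serializer.JavaSerialization");
    return conf;
  }
}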

Record I/O, Thrift and Protocol Buffers

Another problem with Writables, at least for the MapReduce programmer, is that creating new types is a burden. You have to implement the Writable interface, which means designing the on-wire format, and writing two methods: one to write the data in that format and one to read it back.
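Concretely, a hand-rolled Writable looks something like this (an illustrative sketch, not a type from Hadoop itself):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A hypothetical record holding a count and a label.
public class CountedLabel implements Writable {
  private long count;
  private String label;

  public void write(DataOutput out) throws IOException {
    // You design the on-wire format yourself: here, the count then the label.
    out.writeLong(count);
    out.writeUTF(label);
  }

  public void readFields(DataInput in) throws IOException {
    // Fields must be read back in exactly the order they were written.
    count = in.readLong();
    label = in.readUTF();
  }
}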

Hadoop's Record I/O was created to solve this problem. You write a definition of your types using a record definition language, then run a record compiler to generate Java source code representations of your types. All Record I/O types are Writable, so they plug into Hadoop very easily. As a bonus, you can generate bindings for other languages, so it's easy to read your data files from other programs.

For whatever reason, Record I/O never really took off. It's used in ZooKeeper, but that's about it (and ZooKeeper will move away from it someday). Momentum has switched to Thrift (from Facebook, now in the Apache Incubator), which offers a very similar proposition, but in more languages. Thrift also makes it easy to build a (cross-language) RPC mechanism.

Yesterday, Google open sourced Protocol Buffers, its "language-neutral, platform-neutral, extensible mechanism for serializing structured data". Record I/O, Thrift and Protocol Buffers are really solving the same problem, so it will be interesting to see how this develops. Of course, since we're talking about persistent data formats, nothing's going to go away in the short or medium term while people have significant amounts of data locked up in these formats.

That's why it makes sense to add support in Hadoop for MapReduce using Thrift and Protocol Buffers: so people can process data in the format they have it in. This will be a relatively simple addition.

What Next?
For RPC, where a message is short-lived, changing the mechanism is more viable in the short term. Going back to Hadoop's RPC mechanism, now that both Thrift and Protocol Buffers offer an alternative, it may well be time to evaluate them to see if either can offer a performance boost. It would be a big job to retrofit RPC in Hadoop with another implementation, but if there are significant performance gains to be had, then it would be worth doing.