1. Background
1. When processes communicate remotely, they can exchange data of many types, but whatever the type, the data travels over the network as a binary sequence. The sender must turn an object into a byte sequence before it can be transmitted; this is called object serialization. The receiver must restore the byte sequence back into an object; this is called object deserialization.
2. In Hive, deserialization turns each key/value pair into the column values of a Hive table row.
3. Hive can load data into tables without first transforming the data, which saves a great deal of time when processing massive datasets.
SerDe
SerDe is short for Serializer/Deserializer; it handles both serialization and deserialization. Serialization formats include:
- delimited text (tab, comma, CTRL-A)
- the Thrift protocol
Deserialized (in-memory) representations include:
- Java Integer/String/ArrayList/HashMap
- Hadoop Writable classes
- user-defined classes
(The original post listed the existing SerDes in a figure, not reproduced here.) Among them, LazyObject deserializes a column only when that column is accessed, and BinarySortable is a binary format that preserves sort order.
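As a toy illustration of the delimited formats above, the sketch below splits one CTRL-A separated line into column values, which is in essence what a delimited-text SerDe does on the read path. The class name `DelimitedDeserializer` and its methods are illustrative only, not Hive's real API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Toy analogue of a delimited-text SerDe's read path: split one raw
// line into column values. Hive's LazySimpleSerDe does this (lazily)
// for tables stored as delimited text.
public class DelimitedDeserializer {
    // CTRL-A (\u0001) is Hive's default field delimiter.
    private final char delimiter;

    public DelimitedDeserializer(char delimiter) {
        this.delimiter = delimiter;
    }

    // "Deserialize" one record: raw line -> list of column strings.
    // The -1 limit keeps trailing empty columns.
    public List<String> deserialize(String line) {
        return Arrays.asList(line.split(Pattern.quote(String.valueOf(delimiter)), -1));
    }

    public static void main(String[] args) {
        DelimitedDeserializer serde = new DelimitedDeserializer('\u0001');
        System.out.println(serde.deserialize("1\u0001alice\u000130"));
    }
}
```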
Input processing
- Hive's execution engine (henceforth simply "the engine") first uses the configured InputFormat to read in a record of data (the value object returned by the RecordReader of the InputFormat).
- The engine then invokes Serde.deserialize() to deserialize the record. There is no strict requirement that the object returned by this method be fully deserialized. For instance, Hive has a LazyStruct object, used by LazySimpleSerde to represent the deserialized record: it does not deserialize its bytes up front, but only at the point a field is accessed.
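The lazy behavior described above can be sketched with a toy analogue of LazyStruct (class and method names here are mine, not Hive's): the raw record is kept as-is, and a field is only parsed out when some operator actually asks for it.

```java
// Toy analogue of Hive's LazyStruct: hold the undeserialized record and
// defer all parsing until a field is accessed.
public class LazyRecord {
    private final String raw;   // raw, undeserialized record
    private String[] fields;    // filled in on first field access only

    public LazyRecord(String raw) {
        this.raw = raw;
    }

    public boolean isDeserialized() {
        return fields != null;
    }

    // Deserialization happens at the point of field access, not up front.
    public String getField(int i) {
        if (fields == null) {
            fields = raw.split("\u0001", -1);
        }
        return fields[i];
    }

    public static void main(String[] args) {
        LazyRecord r = new LazyRecord("a\u0001b");
        System.out.println(r.isDeserialized()); // false: nothing parsed yet
        System.out.println(r.getField(1));      // "b": parsed on access
    }
}
```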
- The engine also obtains the ObjectInspector to use by invoking Serde.getObjectInspector(). This must be a subclass of StructObjectInspector, since a record representing a row of input data is essentially a struct type.
- The engine passes the deserialized object (e.g. a LazyStruct) and the object inspector to all operators, which use them to get the data they need from the record. The object inspector knows how to construct individual fields out of a deserialized record. For example, StructObjectInspector has a method called getStructFieldData() which returns a given field of the record; this is the mechanism for accessing individual fields. For instance, the ExprNodeColumnEvaluator class, which extracts a column from the input row, uses this mechanism to get the real column object from the deserialized row object. That column object may in turn be of a complex type (such as a struct). To access subfields of such complex-typed objects, an operator uses the object inspector associated with that field (the top-level StructObjectInspector for the row maintains a list of field-level object inspectors, which can be used to interpret individual fields).
Note: ExprNodeColumnEvaluator is where the data actually gets deserialized, while the ObjectInspector works with the already-deserialized result.
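The field-access mechanism can be sketched with a toy analogue of StructObjectInspector (names are illustrative; Hive's real inspectors work with Object and StructField handles): the inspector knows the row layout and pulls an individual field out of a deserialized row.

```java
import java.util.Arrays;
import java.util.List;

// Toy analogue of StructObjectInspector.getStructFieldData(): the
// inspector knows the row layout (field names -> positions) and extracts
// one field from a deserialized row object.
public class SimpleStructInspector {
    private final List<String> fieldNames;

    public SimpleStructInspector(List<String> fieldNames) {
        this.fieldNames = fieldNames;
    }

    // Return one field of the row by name -- the mechanism an operator
    // like a column evaluator would use to get a column's value.
    public Object getStructFieldData(List<Object> row, String fieldName) {
        return row.get(fieldNames.indexOf(fieldName));
    }

    public static void main(String[] args) {
        SimpleStructInspector oi =
            new SimpleStructInspector(Arrays.asList("id", "name"));
        List<Object> row = Arrays.<Object>asList(1, "alice");
        System.out.println(oi.getStructFieldData(row, "name")); // alice
    }
}
```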
For UDFs, the new GenericUDF abstract class provides the ObjectInspectors associated with the UDF's arguments in its initialize() method. The engine therefore first initializes the UDF by calling this method. The UDF can then use these ObjectInspectors to interpret complex arguments (for simple arguments, the object handed to the UDF is already the right primitive object, such as a LongWritable or IntWritable).
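That two-phase contract can be sketched with a toy analogue (this is not the real GenericUDF API from org.apache.hadoop.hive.ql.udf.generic; all names below are mine): initialize() receives the argument type information once, and evaluate() then uses it for every row.

```java
import java.util.List;

// Toy analogue of the GenericUDF contract: the engine calls initialize()
// once with the "inspectors" for the argument types, then evaluate() per
// row. This sketch renders the fields of a struct argument as text.
public class ToyConcatUdf {
    private List<String> fieldNames; // stands in for the struct's ObjectInspector

    // The engine hands over argument type information up front.
    public void initialize(List<String> structFieldNames) {
        this.fieldNames = structFieldNames;
    }

    // Per-row evaluation interprets the argument via the stored layout.
    public String evaluate(List<Object> structValue) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fieldNames.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(fieldNames.get(i)).append('=').append(structValue.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyConcatUdf udf = new ToyConcatUdf();
        udf.initialize(java.util.Arrays.asList("id", "name"));
        // prints "id=1,name=alice"
        System.out.println(udf.evaluate(java.util.Arrays.<Object>asList(1, "alice")));
    }
}
```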
Output processing
Output is analogous to input. The engine passes the deserialized object representing a record, together with the corresponding ObjectInspector, to Serde.serialize(). In this context, serialization means converting the record object into an object of the type expected by the OutputFormat, which performs the write. To do this conversion, the serialize() method can use the passed ObjectInspector to get at the individual fields of the record and convert the record to the appropriate type.
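The write path above can be sketched as the mirror image of the read path (again a toy, not Hive's real Serde.serialize() signature): walk the record's fields, in the order the inspector would expose them, and build the delimited text the OutputFormat will write.

```java
import java.util.List;

// Toy analogue of Serde.serialize(): convert a record's fields back into
// the form the OutputFormat expects -- here, CTRL-A delimited text.
public class DelimitedSerializer {
    private final char delimiter;

    public DelimitedSerializer(char delimiter) {
        this.delimiter = delimiter;
    }

    // Join the row's field values with the delimiter.
    public String serialize(List<Object> row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.size(); i++) {
            if (i > 0) sb.append(delimiter);
            sb.append(row.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        DelimitedSerializer s = new DelimitedSerializer('\u0001');
        // Fields joined by the (non-printable) CTRL-A delimiter.
        System.out.println(s.serialize(java.util.Arrays.<Object>asList(1, "alice", 30)));
    }
}
```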