samuschen

hive serde

    Blog category:
  • hive

I. Background

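1. When processes communicate remotely, they can send each other data of any type, but whatever the type, the data travels over the network as a binary sequence. The sender must convert an object into a byte sequence before it can be transmitted; this is called object serialization. The receiver must restore the byte sequence into an object; this is called object deserialization.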
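
As a minimal sketch of this round trip, the Java snippet below serializes an object into bytes and restores it. The UserRecord class and its byte layout (a 4-byte id, a 4-byte length, then the UTF-8 bytes of the name) are assumptions made up for this illustration; they are not a Hive format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical record type, used only for this illustration.
class UserRecord {
    final int id;
    final String name;

    UserRecord(int id, String name) {
        this.id = id;
        this.name = name;
    }

    // Serialization: object -> byte sequence.
    // Assumed layout: [int id][int nameLength][name bytes in UTF-8].
    byte[] toBytes() {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(4 + 4 + nameBytes.length);
        buffer.putInt(id).putInt(nameBytes.length).put(nameBytes);
        return buffer.array();
    }

    // Deserialization: byte sequence -> object, reading the same layout back.
    static UserRecord fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        int id = buffer.getInt();
        byte[] nameBytes = new byte[buffer.getInt()];
        buffer.get(nameBytes);
        return new UserRecord(id, new String(nameBytes, StandardCharsets.UTF_8));
    }
}
```

The sender would call toBytes() before writing to the network, and the receiver would call fromBytes() on what it reads. Both sides have to agree on the layout, which is the role a SerDe plays for a Hive table.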

2. Deserialization in Hive turns a key/value pair into the values of the individual columns of a Hive table.

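3. Hive can load data into a table without converting it first, because the SerDe interprets the stored bytes when they are read. This saves a great deal of time when processing massive data sets.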

SerDe

SerDe is short for Serialize/Deserialize; its purpose is serialization and deserialization. Serialized formats include:

  • Delimited text (tab, comma, CTRL-A)
  • The Thrift protocol

Deserialized (in-memory) formats:

  • Java Integer/String/ArrayList/HashMap
  • Hadoop Writable classes
  • User-defined classes
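
To make the two sides concrete, here is a small sketch that converts between a CTRL-A delimited row (a serialized format) and a Java list of Integer and String values (an in-memory format). DelimitedRowCodec and its fixed three-column schema (int, string, int) are hypothetical and exist only for this example; this is not how Hive's own classes are written.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Hive class.
class DelimitedRowCodec {
    // CTRL-A, the field delimiter used in this sketch.
    static final char CTRL_A = '\u0001';

    // Deserialize: delimited bytes -> [Integer, String, Integer].
    // The three-column schema is an assumption of this example.
    static List<Object> deserialize(byte[] row) {
        String text = new String(row, StandardCharsets.UTF_8);
        String[] parts = text.split(String.valueOf(CTRL_A), -1);
        List<Object> columns = new ArrayList<>();
        columns.add(Integer.valueOf(parts[0]));
        columns.add(parts[1]);
        columns.add(Integer.valueOf(parts[2]));
        return columns;
    }

    // Serialize: in-memory columns -> delimited bytes.
    static byte[] serialize(List<Object> columns) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) {
                line.append(CTRL_A);
            }
            line.append(columns.get(i));
        }
        return line.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

The same in-memory list could be written out with a different delimiter or protocol without changing the code that consumes the columns, which is why the serialized and in-memory formats are listed separately.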

The SerDes that currently exist are shown in the figure below:

Among these, LazyObject deserializes a column only when that column is accessed. BinarySortable is a binary format that preserves sort order.
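
The idea of a binary format that preserves sort order can be illustrated with a common general technique for signed integers: write the value big-endian with its sign bit flipped, so that comparing the bytes as unsigned values gives the same order as comparing the numbers. The sketch below shows that technique only; SortableIntCodec is a hypothetical name, and the byte layout Hive's BinarySortable SerDe actually uses is not reproduced here.

```java
// Hypothetical helper illustrating an order-preserving encoding for int values.
class SortableIntCodec {
    // Encode: flip the sign bit, then write the four bytes big-endian.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    // Decode: reassemble the big-endian int and flip the sign bit back.
    static int decode(byte[] bytes) {
        int flipped = ((bytes[0] & 0xFF) << 24)
                | ((bytes[1] & 0xFF) << 16)
                | ((bytes[2] & 0xFF) << 8)
                | (bytes[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    // Lexicographic comparison that treats every byte as unsigned,
    // which is how raw keys are typically compared without decoding them.
    static int compareBytes(byte[] a, byte[] b) {
        int shared = Math.min(a.length, b.length);
        for (int i = 0; i < shared; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With this encoding, keys can be sorted by comparing raw bytes, without deserializing them first.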


Input processing

 

  • Hive's execution engine (referred to as just engine henceforth) first uses the configured InputFormat to read in a record of data (the value object returned by the RecordReader of the InputFormat).

  • The engine then invokes SerDe.deserialize() to deserialize the record. Nothing requires the object returned by this method to be fully deserialized. For instance, Hive has a LazyStruct object, which LazySimpleSerDe uses to represent the deserialized object. This object does not deserialize the bytes up front; it does so at the point where a field is accessed.

  • The engine also obtains the ObjectInspector to use by invoking SerDe.getObjectInspector(). This has to be a subclass of StructObjectInspector, since a record representing a row of input data is essentially a struct type.

  • The engine passes the deserialized object (e.g. LazyStruct) and the object inspector to all operators, which use them to get the data they need from the record. The object inspector knows how to construct individual fields out of a deserialized record. For example, StructObjectInspector has a method called getStructFieldData() which returns a given field of the record. This is the mechanism for accessing individual fields. For instance, the ExprNodeColumnEvaluator class, which extracts a column from the input row, uses this mechanism to get the real column object from the serialized row object. This real column object can in turn be a complex type (like a struct). To access sub-fields in such complex-typed objects, an operator uses the object inspector associated with that field (the top-level StructObjectInspector for the row maintains a list of field-level object inspectors that can be used to interpret individual fields).
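
The flow in the bullets above can be sketched with simplified stand-ins. SketchSerDe, SketchLazyRow and SketchStructInspector are hypothetical classes written for this example; they are not Hive's SerDe, LazyStruct or StructObjectInspector. Only the method names deserialize(), getObjectInspector() and getStructFieldData() are borrowed from the text, and their signatures here are assumptions. Unlike Hive's lazy objects, this sketch parses the whole row on first access rather than field by field.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Stand-in for a lazily deserialized row: keeps the raw bytes until a field is read.
class SketchLazyRow {
    private final byte[] raw;
    private String[] fields;   // filled in on first field access
    int parseCount = 0;        // exposed only to make the laziness observable

    SketchLazyRow(byte[] raw) {
        this.raw = raw;
    }

    String field(int index) {
        if (fields == null) {
            // Assumption of this sketch: fields are tab-delimited UTF-8 text.
            fields = new String(raw, StandardCharsets.UTF_8).split("\t", -1);
            parseCount++;
        }
        return fields[index];
    }
}

// Stand-in for a struct inspector: knows the column names and how to pull one
// field out of the row object.
class SketchStructInspector {
    private final List<String> columnNames;

    SketchStructInspector(String... columnNames) {
        this.columnNames = Arrays.asList(columnNames);
    }

    Object getStructFieldData(SketchLazyRow row, String column) {
        int index = columnNames.indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("unknown column: " + column);
        }
        return row.field(index);
    }
}

// Stand-in for the SerDe side: deserialize() only wraps the bytes.
class SketchSerDe {
    private final SketchStructInspector inspector = new SketchStructInspector("id", "name");

    SketchLazyRow deserialize(byte[] record) {
        return new SketchLazyRow(record);
    }

    SketchStructInspector getObjectInspector() {
        return inspector;
    }
}
```

An operator in this sketch would hold both the row and the inspector and call getStructFieldData(row, "name"); the bytes are parsed only at that moment, and a second field access reuses the parsed result.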

PS: ExprNodeColumnEvaluator is where the data actually gets deserialized, whereas the ObjectInspector works on the result returned by deserialize().

For UDFs, the new GenericUDF abstract class provides the ObjectInspectors associated with the UDF arguments in the initialize() method. The engine therefore first initializes the UDF by calling this method. The UDF can then use these ObjectInspectors to interpret complex arguments (for simple arguments, the object handed to the UDF is already the right primitive object, such as LongWritable or IntWritable).
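
The initialize-then-evaluate life cycle can be sketched as follows. Everything here is a simplified stand-in: SketchListInspector, JavaListInspector, CsvListInspector and SketchSizeUdf are hypothetical, and the method signatures are deliberately reduced, so they do not match Hive's GenericUDF. The point is only that the UDF receives an inspector once and then uses it to interpret a complex argument without knowing how that argument is stored.

```java
import java.util.List;

// Hypothetical minimal inspector for list-typed values.
interface SketchListInspector {
    int getListLength(Object data);
}

// Inspector for lists held as java.util.List.
class JavaListInspector implements SketchListInspector {
    public int getListLength(Object data) {
        return ((List<?>) data).size();
    }
}

// Inspector for lists held as a comma-separated String.
class CsvListInspector implements SketchListInspector {
    public int getListLength(Object data) {
        String text = (String) data;
        return text.isEmpty() ? 0 : text.split(",", -1).length;
    }
}

// Hypothetical UDF that returns the number of elements in its list argument.
class SketchSizeUdf {
    private SketchListInspector argumentInspector;

    // Called once by the engine before any row is processed.
    void initialize(SketchListInspector argumentInspector) {
        this.argumentInspector = argumentInspector;
    }

    // Called per row; the argument is interpreted only through the inspector.
    int evaluate(Object argument) {
        if (argumentInspector == null) {
            throw new IllegalStateException("initialize() was not called");
        }
        return argumentInspector.getListLength(argument);
    }
}
```

The same SketchSizeUdf works for both representations because only the inspector passed to initialize() changes.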

 

Output processing

Output is analogous to input. The engine passes the deserialized object representing a record, together with the corresponding ObjectInspector, to SerDe.serialize(). In this context, serialization means converting the record object into an object of the type expected by the OutputFormat that will perform the write. To do this conversion, the serialize() method can use the passed ObjectInspector to get the individual fields of the record and convert the record to the appropriate type.
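
A minimal sketch of that output path, under stated assumptions: SketchRowInspector, ArrayRowInspector and SketchTextSerializer are hypothetical stand-ins rather than Hive classes, and the target type is assumed to be a tab-delimited text line with nulls written as \N (the marker Hive's text format uses by default).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical inspector: exposes the fields of a row without revealing how the row is stored.
interface SketchRowInspector {
    int fieldCount();

    Object getFieldData(Object row, int index);
}

// Inspector for rows held as Object[].
class ArrayRowInspector implements SketchRowInspector {
    private final int fieldCount;

    ArrayRowInspector(int fieldCount) {
        this.fieldCount = fieldCount;
    }

    public int fieldCount() {
        return fieldCount;
    }

    public Object getFieldData(Object row, int index) {
        return ((Object[]) row)[index];
    }
}

// Hypothetical serializer: converts a row object into the text line that a
// text-based output format would write.
class SketchTextSerializer {
    static String serialize(Object row, SketchRowInspector inspector) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < inspector.fieldCount(); i++) {
            // Fields are read only through the inspector, never by casting the row here.
            Object value = inspector.getFieldData(row, i);
            parts.add(value == null ? "\\N" : value.toString());
        }
        return String.join("\t", parts);
    }
}
```

Because the serializer touches the row only through the inspector, the same serialize() call would work for a lazily deserialized row if a matching inspector were supplied.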
