Table creation script in the HBase shell:
create 'HisDiagnose',{ NAME => 'diagnoseFamily'}
Creating an external table in Hive that maps onto an existing HBase table makes the HBase data queryable through Hive QL, giving an otherwise NoSQL store SQL capabilities. The creation script:
CREATE EXTERNAL TABLE HisDiagnose(key string, doctorId int, patientId int, description string, rtime int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,diagnoseFamily:doctorId,diagnoseFamily:patientId,diagnoseFamily:description,diagnoseFamily:rtime")
TBLPROPERTIES("hbase.table.name" = "HisDiagnose");
Problem description:
Data was inserted into the HBase table HisDiagnose through the HBase client API. The doctorId, patientId, and rtime columns are declared as ints in Hive, yet select * from HisDiagnose in Hive returns null for all three. The insert code:
import java.io.IOException;
import java.util.Date;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Insert data
 * @param tablename name of the HBase table to write to
 */
public static void insertData(String tablename) {
    System.out.println("Start inserting data ....");
    HTablePool pool = new HTablePool(conf, 1000); // conf is the class-level HBase Configuration
    HTableInterface table = pool.getTable(tablename);
    try {
        for (int i = 1; i <= 1; i++) {
            // One Put is one row; constructing another Put would start a second row.
            // The row key is the value passed to the Put constructor and must be unique per row.
            Put put = new Put(("2013-03-0" + i).getBytes());
            put.add("diagnoseFamily".getBytes(), "doctorId".getBytes(), new Date().getTime(), Bytes.toBytes(i));
            put.add("diagnoseFamily".getBytes(), "patientId".getBytes(), new Date().getTime(), Bytes.toBytes(i));
            put.add("diagnoseFamily".getBytes(), "description".getBytes(), new Date().getTime(), "描述".getBytes());
            // Note: this writes an 8-byte long, not a 4-byte int.
            put.add("diagnoseFamily".getBytes(), "rtime".getBytes(), new Date().getTime(), Bytes.toBytes(new Date().getTime()));
            table.put(put);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            table.close(); // returns the table to the pool
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    System.out.println("Done inserting data ....");
}
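To see why the string mapping fails, it helps to read a cell back and decode it both ways: Bytes.toBytes(i) stores the four raw bytes of the int (for i = 1, that is 0x00 0x00 0x00 0x01), not the text "1". A minimal sketch using the same 0.94-era client API as above; verifyRow is a hypothetical helper, not part of the original code, and conf is the same class-level Configuration:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Read one row back and decode doctorId both as text and as an int.
 */
public static void verifyRow(String tablename) throws IOException {
    HTablePool pool = new HTablePool(conf, 1000);
    HTableInterface table = pool.getTable(tablename);
    try {
        Result r = table.get(new Get("2013-03-01".getBytes()));
        byte[] raw = r.getValue("diagnoseFamily".getBytes(), "doctorId".getBytes());
        // Hive's string storage type effectively does this: the 4 raw bytes of
        // an int are not a textual number, so the string-to-int cast yields NULL.
        System.out.println("as string: " + Bytes.toString(raw));
        // Decoding the bytes the way they were written recovers the value.
        System.out.println("as int: " + Bytes.toInt(raw));
    } finally {
        table.close();
    }
}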
Solution:
The Column Mapping section of the official wiki, https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration, says the following:
There are two SERDEPROPERTIES that control the mapping of HBase columns to Hive:
(1) hbase.columns.mapping
(2) hbase.table.default.storage.type: can have a value of either string (the default) or binary; this option is only available as of Hive 0.9, and the string behavior is the only one available in earlier versions
The column mapping support currently available is somewhat cumbersome and restrictive:
(1) for each Hive column, the table creator must specify a corresponding entry in the comma-delimited hbase.columns.mapping string (so for a Hive table with n columns, the string should have n entries); whitespace should not be used between entries, since it will be interpreted as part of the column name, which is almost certainly not what you want
(2) a mapping entry must be either :key or of the form column-family-name:[column-name][#(binary|string)] (the type specification delimited by # was added in Hive 0.9.0; earlier versions interpreted everything as strings)
(3) if no type specification is given, the value from hbase.table.default.storage.type will be used
(4) any prefix of the valid values is valid too (e.g. #b instead of #binary)
(5) if you specify a column as binary, the bytes in the corresponding HBase cells are expected to be of the form that HBase's Bytes class yields
(6) there must be exactly one :key mapping (we don't support compound keys yet)
(7) (note that before HIVE-1228 in Hive 0.6, :key was not supported, and the first Hive column implicitly mapped to the key; as of Hive 0.6, it is now strongly recommended that you always specify the key explicitly; we will drop support for implicit key mapping in the future)
(8) if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns
(9) there is currently no way to access the HBase timestamp attribute; queries always access data with the latest timestamp
(10) since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column
(11) it is not necessary to reference every HBase column family, but those that are not mapped will be inaccessible via the Hive table; it is possible to map multiple Hive tables to the same HBase table
The wiki goes on to give detailed examples of the kinds of column mappings currently possible.
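As an aside, points (2) and (4) above also allow pinning the storage type per column with a #binary suffix (or any prefix of it, such as #b) instead of changing the table-wide default. A sketch of that variant, assuming Hive 0.9.0 or later and a bigint rtime to match the 8-byte long the insert code writes (see the note below):

CREATE EXTERNAL TABLE HisDiagnose(key string, doctorId int, patientId int, description string, rtime bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,diagnoseFamily:doctorId#b,diagnoseFamily:patientId#b,diagnoseFamily:description#s,diagnoseFamily:rtime#b")
TBLPROPERTIES("hbase.table.name" = "HisDiagnose");

The table-wide default used in the fix below is equivalent here, since a Hive string column stores the same UTF-8 bytes under either representation.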
From the above: when the Hive external table was created over the existing HBase table, hbase.table.default.storage.type defaulted to string, while the doctorId, patientId, and rtime values had been written to HBase as raw binary bytes via Bytes.toBytes. The string serde cannot parse those bytes as numbers, so the mapped values come back null. The fix is to drop the external table in Hive, set hbase.table.default.storage.type to binary, and recreate it:
CREATE EXTERNAL TABLE HisDiagnose(key string, doctorId int, patientId int, description string, rtime bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,diagnoseFamily:doctorId,diagnoseFamily:patientId,diagnoseFamily:description,diagnoseFamily:rtime","hbase.table.default.storage.type"="binary")
TBLPROPERTIES("hbase.table.name" = "HisDiagnose");
Note that rtime is now declared bigint rather than int: the insert code writes it with Bytes.toBytes(new Date().getTime()), an 8-byte long, and with binary storage the Hive column's byte width must match what was written; an int (4 bytes) would still read back as null.
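After recreating the table, a quick check in Hive should now return the values written by the insert code instead of NULL for the three numeric columns:

select key, doctorId, patientId, rtime from HisDiagnose;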