
hbase-hfile format

 

what is

  Every database has a storage layer, either in memory or on a file system. HBase is no different: its underlying data structure is the HFile, which is stored on a local file system or a distributed file system (DFS).

  A well-designed data structure must also be paired with appropriate algorithms. Of course, different requirements lead to very different database designs.

  Like some other indexing tools, HFile is an indexed storage structure built mainly for fast access. The idea of HFile comes from TFile, with some improvements, and its internal structure is shown in the figures below:



 *Note: this file layout (not the data block layout) is for HFile v1; in v2 the File Info part is placed after the Meta Index.



 

 

why

  As described above, HBase uses its own file format instead of an existing Hadoop one (e.g. SequenceFile) in order to speed up reads by row key, which is the purpose of the index blocks in the previous figures:

a. it builds a reversed (key-to-block) index for queries, mainly to speed up reads

b. it supplies a Hadoop-independent way to read and write the file format, without having to consider the DFS's compatibility
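
To make point (a) concrete, here is a minimal sketch of how a sorted block index can narrow a row key lookup down to a single data block. The index layout (first key of each block plus its file offset), the name `find_block_offset` and the sample keys are assumptions for illustration only; the real HFile index entries carry more fields.

```python
from bisect import bisect_right

# Hypothetical block index: (first key in block, offset of block in file),
# sorted by first key. Real HFile index entries carry more fields.
BLOCK_INDEX = [
    (b"row-0000", 0),
    (b"row-0100", 65536),
    (b"row-0200", 131072),
    (b"row-0300", 196608),
]


def find_block_offset(index, row_key):
    """Return the offset of the only block that may contain row_key, or None."""
    first_keys = [first_key for first_key, _ in index]
    # Position of the last block whose first key is <= row_key.
    pos = bisect_right(first_keys, row_key) - 1
    if pos < 0:
        return None  # row_key sorts before the first block
    return index[pos][1]
```

With such an index a reader does a binary search in memory and then reads one block, instead of scanning the whole file as it would have to with an unindexed format.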

 

differences among certain files

file | structure | compression | embedded in
SequenceFile | 3 file types: 1) uncompressed; 2) record-compressed (only values are compressed); 3) block-compressed (both keys and values are compressed, similar to HFile v2) | yes | hadoop
HFile v1 | similar to v2, but some block indexes are different | yes | hbase
HFile v2 | 1) indexed, 'reversed index' style; 2) most data in the file (except the trailer) is compressed if a compressor is configured | yes | hbase, new file format (for 0.94+)

 

 

how to 

The flush follows this process model:

1. iterate over all KeyValues in the snapshot and write them to a memory buffer
2. start a new data block once the buffer exceeds the 'block size', which is set when creating the table
3. repeat 1 and 2 until there is no more data in the snapshot
4. flush the buffer to an in-memory compressed stream
5. flush that stream to the output stream (the HFile stream) and clear the temporary buffer to avoid huge memory usage
6. in the same way as the data block flush, flush the meta blocks, data index, meta index, file info and finally the trailer
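
The steps above can be sketched roughly as follows. This is not HBase's writer: the function name `write_hfile_like`, the record layout, the use of zlib and the 12-byte trailer are all assumptions chosen to keep the example small, and meta blocks and file info are left out.

```python
import io
import struct
import zlib


def write_hfile_like(key_values, out, block_size=64 * 1024):
    """Write sorted (key, value) byte pairs as compressed blocks, then index and trailer."""
    index = []          # (first key of block, offset, compressed length)
    buf = io.BytesIO()  # temporary buffer for the current data block
    first_key = None

    def flush_block():
        nonlocal first_key
        raw = buf.getvalue()
        if not raw:
            return
        compressed = zlib.compress(raw)             # step 4: compress in memory
        index.append((first_key, out.tell(), len(compressed)))
        out.write(compressed)                       # step 5: write to the file stream
        buf.seek(0)
        buf.truncate()                              # step 5: clear the temporary buffer
        first_key = None

    for key, value in key_values:                   # step 1: iterate the snapshot
        if first_key is None:
            first_key = key
        buf.write(struct.pack(">II", len(key), len(value)))
        buf.write(key)
        buf.write(value)
        if buf.tell() >= block_size:                # step 2: block is full
            flush_block()
    flush_block()                                   # step 3: last, partially filled block

    # step 6 (simplified): data index first, then a fixed-size trailer
    index_offset = out.tell()
    for key, offset, length in index:
        out.write(struct.pack(">IQI", len(key), offset, length))
        out.write(key)
    out.write(struct.pack(">QI", index_offset, len(index)))
    return index
```

Only one block is held in the buffer at a time, so memory use stays close to the block size no matter how large the snapshot is.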

 

 

why are the index and trailer placed at the end of an HFile?

I think some points are obvious:

a. the index and trailer hold statistics about the blocks and the index, so they can only be completed after the blocks are written; flushing the memory buffer to the HFile one block (a part of the HFile) at a time minimizes the memory overhead per region.

b. a reader can find the trailer at a fixed position from the end of the file and from there locate the offset of a specific data block, without scanning from the top of the HFile.
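
As a rough illustration of point (b), the sketch below builds a tiny file with the blocks first and the index and trailer last, then reads a block by starting from the end of the file. The `TRAILER` and `ENTRY` layouts and the function names are hypothetical and much simpler than the real HFile trailer.

```python
import io
import struct

TRAILER = struct.Struct(">QI")  # hypothetical trailer: index offset, entry count
ENTRY = struct.Struct(">QI")    # hypothetical index entry: block offset, block length


def build_file(blocks):
    """Append the blocks first, then the index, then the fixed-size trailer."""
    out = io.BytesIO()
    entries = []
    for block in blocks:
        entries.append((out.tell(), len(block)))
        out.write(block)
    index_offset = out.tell()
    for offset, length in entries:
        out.write(ENTRY.pack(offset, length))
    out.write(TRAILER.pack(index_offset, len(entries)))
    return out.getvalue()


def read_block(f, block_number):
    """Locate a block by reading the trailer from the end of the file first."""
    f.seek(-TRAILER.size, io.SEEK_END)
    index_offset, count = TRAILER.unpack(f.read(TRAILER.size))
    if not 0 <= block_number < count:
        raise IndexError(block_number)
    f.seek(index_offset + block_number * ENTRY.size)
    offset, length = ENTRY.unpack(f.read(ENTRY.size))
    f.seek(offset)
    return f.read(length)
```

The writer never has to go back and patch earlier bytes, and the reader needs only three seeks (trailer, index entry, block) to reach any block.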

 

others

Using the org.apache.hadoop.hbase.io.hfile.HFile tool, you can look at the details of an HFile like below:

hbase hfile -f path-to-file

   For a compressed file that occupies 18644309 bytes, the result looks like this:

 

Stats:
Key length: count: 496573	min: 48	max: 53	mean: 49.44025551127427
Val length: count: 496573	min: 1	max: 915	mean: 34.053575204451306
Row size (bytes): count: 28930	min: 814	max: 3190	mean: 1570.45855513308
Row size (columns): count: 28930	min: 12	max: 21	mean: 17.16463878326996
Key of biggest row: 94678d0589778ade561378ac26dfd791
    Key length count: the number of keys (composite keys, i.e. rowkey + family + column + timestamp + type)
    Val length count: same as above, for values
    Row size (bytes) count: the number of rows in this HFile; 'mean' is the average size of a row in bytes
    Row size (columns) count: same as above, but 'mean' is the average number of qualifiers (columns) per row
   So the actual uncompressed row size is:
18644309 / compression-ratio (0.42) / 28930 = 1534 ~ 1570 (mean)
and
key length count = val length count = 496573 ~ (28930 * 17.16 = 496438.8)
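
The two estimates can be re-checked with a few lines of Python. The numbers are copied from the output above; the 0.42 compression ratio is the assumption already used in the text, and the variable names are only illustrative.

```python
file_size = 18644309               # compressed file size in bytes
compression_ratio = 0.42           # compressed / uncompressed, as assumed in the text
rows = 28930                       # "Row size" count
mean_columns = 17.16463878326996   # "Row size (columns)" mean
key_values = 496573                # "Key length" count

# Estimated uncompressed bytes per row, to compare with the reported mean of about 1570.
uncompressed_row = file_size / compression_ratio / rows

# Rows times average columns per row, to compare with the key/value count.
estimated_key_values = rows * mean_columns

print(round(uncompressed_row))       # 1534
print(round(estimated_key_values))   # 496573
```

Using the full-precision mean instead of the rounded 17.16 recovers the key count exactly, which is expected because the mean is simply the key count divided by the row count.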
  

  The key length count corresponds to the 'entries' value that the flush log reports (note that the sample line below reports 496275):

 

2014-11-04 00:24:08,299 INFO [regionserver60020.cacheFlusher] Store.java:817 Added hdfs://xxx:54310/hbase/gggg/f95f2a0fa04b5c13747bafa22bd610d9/f1/0a5ac38c99da4244ac6925beb51c96b8, entries=496275, sequenceid=96412594, filesize=18.0m

  

 

 TODO: things to check

- cache data on write: also put blocks into the block cache when writing

- data block encoding: use an encoding such as fast diff

- data block / meta block index structure

 

ref:

hbase-memstore flush -1 overview

hbase guide

TFile
