SequenceFile Compression and Splitting


Compressed Data Storage

 

Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage, both in terms of disk usage and query performance.

You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on the fly during query execution. For example:

 

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
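
Note that LOAD DATA LOCAL INPATH simply copies the file into the table's directory in HDFS; the .gz file is not decompressed at load time. As a quick check (a sketch, assuming the default warehouse location /user/hive/warehouse), you can list the table's directory from the Hive CLI and confirm the file kept its .gz extension:

-- Path assumes the default Hive warehouse location; adjust for your setup.
dfs -ls /user/hive/warehouse/raw/;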

 

The table 'raw' is stored as a TextFile, which is the default storage format. However, because the file is gzip-compressed, Hadoop will not be able to split it into chunks/blocks and run multiple maps in parallel. This can cause under-utilization of your cluster's 'mapping' power.
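
As a quick way to observe this (a sketch, assuming the 'raw' table loaded above), run a full scan over the gzip-backed table and watch the job information Hive prints: the map stage should run with a single mapper regardless of the file's size.

-- Because a .gz file cannot be split, this scan runs as a single map task.
SELECT count(1) FROM raw;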

The recommended practice is to insert the data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map tasks, even when its contents are compressed. For example:

 

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

CREATE TABLE raw_sequence (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
   STORED AS SEQUENCEFILE;

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

SET hive.exec.compress.output=true; 
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT line FROM raw;

-- If selecting the column by name fails in your Hive version, SELECT * is
-- equivalent for this single-column table:
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

The value of io.seqfile.compression.type determines how the compression is performed. With RECORD, each record (value) is compressed individually; with BLOCK, many records are buffered and compressed together, which usually gives a noticeably better compression ratio. The tradeoff is between granularity and compression ratio: RECORD keeps each record independently readable but compresses poorly, while BLOCK compresses larger chunks and therefore compresses better, and the resulting SequenceFile remains splittable across map tasks because sync markers are written between blocks.
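
Putting the pieces together, a minimal sketch of a session that writes block-compressed SequenceFiles might look like the following. The codec property and the choice of GzipCodec are assumptions (any CompressionCodec installed on your cluster can be substituted), and the verification path assumes the default warehouse location.

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
-- Assumed codec; DefaultCodec, BZip2Codec, etc. also work if installed.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE TABLE raw_sequence SELECT line FROM raw;

-- Verify the output files (path assumes the default warehouse location).
dfs -ls /user/hive/warehouse/raw_sequence/;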
