Compressed Data Storage
Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage, in terms of both disk usage and query performance.
You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
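Once loaded, the compressed file can be queried like any other text file. A quick sanity check (the LIMIT clause here is just for illustration):
SELECT line FROM raw LIMIT 5; -- Hive decompresses the gzip'd file transparently at query time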
The table 'raw' is stored as a TextFile, which is the default storage format. However, in this case Hadoop will not be able to split your file into chunks/blocks and run multiple maps in parallel, because a Gzip stream can only be decompressed from the beginning. This can cause under-utilization of your cluster's 'mapping' power.
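To see the effect, note that a full scan of the table above runs as a single map task, no matter how large the file is (a minimal illustration; the query itself is arbitrary):
SELECT COUNT(1) FROM raw; -- handled by one mapper, since the gzip file cannot be split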
The recommended practice is to insert the data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map tasks, because it contains sync markers at regular intervals. For example:
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
CREATE TABLE raw_sequence (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
The value of io.seqfile.compression.type determines how the compression is performed. If you set it to RECORD, each record is compressed individually; if you set it to BLOCK, records are buffered and compressed together in blocks. There is a tradeoff involved here: RECORD keeps individual records independently readable but yields a lower compression ratio, while BLOCK usually compresses considerably better and is the common choice for Hive tables.
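If you also want to choose which codec is used rather than relying on the Hadoop default, you can set it before the insert. A minimal sketch, assuming the classic mapred property name and Hadoop's built-in Gzip codec:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; -- classic mapred API property; GzipCodec ships with Hadoop
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;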