http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
(namely: making Hadoop support splittable LZO compression)
- Very basic question about Hadoop and compressed input files
- Hadoop gzip input file using only one mapper
- Why can't hadoop split up a large text file and then compress the splits using gzip?
- what is the flow of uploading a gzip file to DFS?
the file is split into HDFS chunks block by block, so every block except the last is full (the last may not be), and the bytes inside these blocks are not changed at all:
hadoop@host-08:~$ hadoop fsck /user/hadoop/mr-test-data.zj.tar.gz -blocks -locations -files
FSCK started by hadoop from /192.168.12.108 for path /user/hadoop/mr-test-data.zj.tar.gz at Mon Oct 26 17:22:24 CST 2015
/user/hadoop/mr-test-data.zj.tar.gz 173826303 bytes, 2 block(s): OK
0. blk_-6142856910439989465_2680086 len=134217728 repl=3 [192.168.12.148:50010, 192.168.12.110:50010, 192.168.12.132:50010]
1. blk_-9182536886628119965_2680086 len=39608575 repl=3 [192.168.12.110:50010, 192.168.12.134:50010, 192.168.12.140:50010]
compared with the raw file:
hadoop@host-08:~$ ls -l
-rw-r--r-- 1 hadoop hadoopgrp 173826303 Apr 23 2014 mr-test-data.zj.tar.gz
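The byte arithmetic above can be checked directly: with a 128 MB block size, the block lengths fall out of the file size alone, which is exactly what the fsck report shows. A minimal sketch (plain Python, numbers taken from the output above):

```python
# HDFS splits a file into fixed-size byte blocks, regardless of whether
# the content is compressed. Block size and file size are from the
# fsck output above.
BLOCK = 134217728          # 128 MB HDFS block size
size = 173826303           # bytes in mr-test-data.zj.tar.gz

lengths = [min(BLOCK, size - off) for off in range(0, size, BLOCK)]
print(lengths)             # [134217728, 39608575] -> matches blk 0 and blk 1
```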
- what is the order of writing a gzip file to DFS? split -> compress, or compress -> split?
TODO: see HBase's source
conclusion:
- a new record does -not- always mean one line of text per split; it may be a key/value pair, etc.
- the HDFS block level is unrelated to whether a format is 'splittable'
- an LZO file is splittable only if an LZO index file has been generated for it; the index records the compressed block offsets
this is similar to HBase's HFile format
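The conclusions above can be illustrated with a small sketch. It is not Hadoop code: Python's gzip stands in for the codec, and the sizes and variable names are made up for illustration. Part 1 shows why a single gzip stream forces one reader; part 2 mimics the LZO-index idea of compressing chunks independently and keeping a side index of offsets.

```python
import gzip
import zlib

data = b"some text line\n" * 20000

# 1) A plain gzip stream cannot be entered mid-file: decompressing from
#    an arbitrary split offset fails the gzip header check, so one
#    mapper must read the whole file.
gz = gzip.compress(data)
try:
    zlib.decompress(gz[len(gz) // 2:], wbits=31)   # wbits=31: expect a gzip wrapper
    splittable = True
except zlib.error:
    splittable = False
# splittable is False

# 2) The LZO-index idea: compress fixed-size chunks independently and
#    keep a side index of their byte offsets, so any worker can seek to
#    a chunk boundary and start decompressing there.
CHUNK = 64 * 1024
offsets, blob = [], b""
for i in range(0, len(data), CHUNK):
    offsets.append(len(blob))                      # the "index file"
    blob += gzip.compress(data[i:i + CHUNK])
offsets.append(len(blob))

k = 2                                              # any chunk, not just the first
piece = gzip.decompress(blob[offsets[k]:offsets[k + 1]])
assert piece == data[k * CHUNK:(k + 1) * CHUNK]
```

The side index is exactly what hadoop-lzo's indexer produces for real LZO files: without it the file is one opaque stream; with it, each compressed block becomes an independent split point.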
gzip is not splittable, so only one map task processes it:
job_201411101612_0397 NORMAL hadoop word count 100.00% 1 0
but for an HBase HFile with Snappy compression there is more than one mapper:
hadoop@host-08:/usr/local/hadoop/hadoop-1.0.3$ hbase hfile -s -f /hbase/archive/f63235f4a6d84c84722f82ffd8122206/fml/b7e2701a60764f9a940912743b55d4e0
15/10/26 17:51:56 INFO util.ChecksumType: Checksum can use java.util.zip.CRC32
15/10/26 17:51:56 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 3.7g
15/10/26 17:51:57 WARN snappy.LoadSnappy: Snappy native library is available
job_201411101612_0396 NORMAL hadoop word count 100.00% 51 51
so you can treat an HFile (with Snappy compression) much like a plain file: Snappy only compresses the key/value data bytes streamed into it, block by block, instead of producing a real standalone *.snappy file.
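That per-block layout can be sketched as well. This is a toy model of the HFile idea, not HBase code: zlib stands in for Snappy, and the record framing (2-byte length prefixes) and block size are invented for illustration. Key/value pairs are buffered into fixed-size data blocks, and each block is compressed on its own as it is written out.

```python
import zlib

records = [(f"row{i}".encode(), f"val{i}".encode()) for i in range(1000)]

# Serialize key/value pairs into ~4 KB data blocks, compressing each
# block independently -- the HFile approach, with zlib standing in for
# Snappy. The framing (2-byte big-endian length prefixes) is made up.
blocks, buf = [], b""
for k, v in records:
    buf += len(k).to_bytes(2, "big") + k + len(v).to_bytes(2, "big") + v
    if len(buf) >= 4096:
        blocks.append(zlib.compress(buf))
        buf = b""
if buf:
    blocks.append(zlib.compress(buf))

hfile_body = b"".join(blocks)
# The concatenation is not one valid compressed stream (no *.snappy-style
# whole-file format), but each block decompresses independently:
first = zlib.decompress(blocks[0])     # starts with the "row0"/"val0" record
```

Because every block is self-contained, a reader holding the block offsets can start at any block, which is why more than one mapper can work on the file.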