这是对 gzip格式的读取设置:
conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
如果源文件就是 backend_userlog_2017092200_192.168.201.4.1506010201501.4968.log.gz
这种的, 那么 即使 不设置上面读取的编码集, hadoop也会自动读取:
因为源代码会自动设置:
从配置文件里,拿不到编码相关的配置,就会默认把GzipCodec,DefaultCodec加进去。
/** * Find the codecs specified in the config value io.compression.codecs * and register them. Defaults to gzip and zip. */ public CompressionCodecFactory(Configuration conf) { codecs = new TreeMap<String, CompressionCodec>(); List<Class<? extends CompressionCodec>> codecClasses = getCodecClasses(conf); if (codecClasses == null) { addCodec(new GzipCodec()); addCodec(new DefaultCodec()); } else { Iterator<Class<? extends CompressionCodec>> itr = codecClasses.iterator(); while (itr.hasNext()) { CompressionCodec codec = ReflectionUtils.newInstance(itr.next(), conf); addCodec(codec); } } }
而针对 .gz格式的hdfs文件, 如果过滤查看文件内容的话,可以直接通过命令:
hadoop fs -text /collect_data/teach/20180825/*.gz | grep "1800066" | grep "41783251"
而如果通过
hadoop fs -cat 的方式,会出现乱码
其他的待补充
相关推荐
Hadoop 3.x(MapReduce)----【Hadoop 序列化】---- 代码 Hadoop 3.x(MapReduce)----【Hadoop 序列化】---- 代码 Hadoop 3.x(MapReduce)----【Hadoop 序列化】---- 代码 Hadoop 3.x(MapReduce)----【Hadoop ...
赠送jar包:hadoop-mapreduce-client-jobclient-2.6.5.jar; 赠送原API文档:hadoop-mapreduce-client-jobclient-2.6.5-javadoc.jar; 赠送源代码:hadoop-mapreduce-client-jobclient-2.6.5-sources.jar; 赠送...
赠送jar包:hadoop-mapreduce-client-common-2.6.5.jar; 赠送原API文档:hadoop-mapreduce-client-common-2.6.5-javadoc.jar; 赠送源代码:hadoop-mapreduce-client-common-2.6.5-sources.jar; 赠送Maven依赖信息...
必须注意对于不同的hadoop版本,` HADDOP_INSTALL_PATH/share/hadoop/common/lib`下的jar包版本都不同,需要一个个调整 - `hadoop2x-eclipse-plugin-master/ivy/library.properties` - `hadoop2x-eclipse-plugin-...
hadoop-annotations-3.1.1.jar hadoop-common-3.1.1.jar hadoop-mapreduce-client-core-3.1.1.jar hadoop-yarn-api-3.1.1.jar hadoop-auth-3.1.1.jar hadoop-hdfs-3.1.1.jar hadoop-mapreduce-client-hs-3.1.1.jar ...
赠送jar包:hadoop-yarn-common-2.6.5.jar 赠送原API文档:hadoop-yarn-common-2.6.5-javadoc.jar 赠送源代码:hadoop-yarn-common-2.6.5-sources.jar 包含翻译后的API文档:hadoop-yarn-common-2.6.5-javadoc-...
hadoop-eclipse2.7.1、hadoop-eclipse2.7.2、hadoop-eclipse2.7.3
# 解压命令 tar -zxvf flink-shaded-hadoop-2-uber-3.0.0-cdh6.2.0-7.0.jar.tar.gz # 介绍 用于CDH部署 Flink所依赖的jar包
赠送jar包:hadoop-yarn-client-2.6.5.jar; 赠送原API文档:hadoop-yarn-client-2.6.5-javadoc.jar; 赠送源代码:hadoop-yarn-client-2.6.5-sources.jar; 赠送Maven依赖信息文件:hadoop-yarn-client-2.6.5.pom;...
基于Hadoop的在线文件管理系统-开题报告.pdf基于Hadoop的在线文件管理系统-开题报告.pdf基于Hadoop的在线文件管理系统-开题报告.pdf基于Hadoop的在线文件管理系统-开题报告.pdf基于Hadoop的在线文件管理系统-开题...
kettle 9.1 连接hadoop clusters (CDH 6.2) 驱动
赠送jar包:hadoop-mapreduce-client-core-2.5.1.jar; 赠送原API文档:hadoop-mapreduce-client-core-2.5.1-javadoc.jar; 赠送源代码:hadoop-mapreduce-client-core-2.5.1-sources.jar; 赠送Maven依赖信息文件:...
flink-shaded-hadoop-2-uber-2.7.5-10.0.jar
hadoop-common-2.2.0-bin-master(包含windows端开发Hadoop和Spark需要的winutils.exe),Windows下IDEA开发Hadoop和Spark程序会报错,原因是因为如果本机操作系统是windows,在程序中使用了hadoop相关的东西,比如写入...
赠送jar包:hadoop-yarn-server-resourcemanager-2.6.0.jar; 赠送原API文档:hadoop-yarn-server-resourcemanager-2.6.0-javadoc.jar; 赠送源代码:hadoop-yarn-server-resourcemanager-2.6.0-sources.jar; 赠送...
hadoop-eclipse-plugin-1.2.1hadoop-eclipse-plugin-1.2.1hadoop-eclipse-plugin-1.2.1hadoop-eclipse-plugin-1.2.1
Hadoop Eclipse是Hadoop开发环境的插件,用户在创建Hadoop程序时,Eclipse插件会自动导入Hadoop编程接口的jar文件,这样用户就可以在Eclipse插件的图形界面中进行编码、调试和运行Hadop程序,也能通过Eclipse插件...
hadoop-eclipse2.5.2、hadoop-eclipse2.6.0、hadoop-eclipse2.6.5
hadoop-eclipse-plugin-2.7.3和2.7.7的jar包 hadoop-eclipse-plugin-2.7.3和2.7.7的jar包 hadoop-eclipse-plugin-2.7.3和2.7.7的jar包 hadoop-eclipse-plugin-2.7.3和2.7.7的jar包
hadoop-eclipse-plugin-2.7.2.jar,hadoop远程调试eclipse插件。