1. Avro primitive types

Type     Description                                 Schema example
null     The absence of a value                      "null"
boolean  A binary value                              "boolean"
int      32-bit signed integer                       "int"
long     64-bit signed integer                       "long"
float    32-bit single-precision floating point      "float"
double   64-bit double-precision floating point      "double"
bytes    Sequence of 8-bit unsigned bytes            "bytes"
string   Unicode character string                    "string"

Note: a primitive type can also be written in a more verbose form using the type attribute, e.g. {"type": "null"}.
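The equivalence of the two forms can be checked programmatically. A minimal sketch, assuming the Avro Java library (org.apache.avro) is on the classpath; the class name PrimitiveSchemaDemo is illustrative:

```java
import org.apache.avro.Schema;

public class PrimitiveSchemaDemo {
    public static void main(String[] args) {
        // Shorthand form: just the type name as a JSON string
        Schema shortForm = new Schema.Parser().parse("\"null\"");
        // Verbose form: a JSON object with a "type" attribute
        Schema longForm = new Schema.Parser().parse("{\"type\": \"null\"}");
        // Both parse to the same primitive schema
        System.out.println(shortForm.getType());
        System.out.println(shortForm.equals(longForm));
    }
}
```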
2. Avro complex types

array   An ordered collection of objects. All objects in a particular array must have the same schema.
        {
          "type": "array",
          "items": "long"
        }

map     An unordered collection of key-value pairs. Keys must be strings and values may be any type, although within a particular map all values must have the same schema.
        {
          "type": "map",
          "values": "string"
        }

record  A collection of named fields of any type.
        {
          "type": "record",
          "name": "WeatherRecord",
          "doc": "A weather reading.",
          "fields": [
            {"name": "year", "type": "int"},
            {"name": "temperature", "type": "int"},
            {"name": "stationId", "type": "string"}
          ]
        }

enum    A set of named values.
        {
          "type": "enum",
          "name": "Cutlery",
          "doc": "An eating utensil.",
          "symbols": ["KNIFE", "FORK", "SPOON"]
        }

fixed   A fixed number of 8-bit unsigned bytes.
        {
          "type": "fixed",
          "name": "Md5Hash",
          "size": 16
        }

union   A union of schemas. A union is represented by a JSON array, where each element in the array is a schema. Data represented by a union must match one of the schemas in the union.
        [
          "null",
          "string",
          {"type": "map", "values": "string"}
        ]
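As a small sketch of the last entry (again assuming the Avro Java library; UnionSchemaDemo is an illustrative name), the union's JSON form parses to a schema whose branches can be inspected:

```java
import org.apache.avro.Schema;

public class UnionSchemaDemo {
    public static void main(String[] args) {
        // The union schema from the table: null, string, or map of strings
        String json = "[\"null\", \"string\", {\"type\": \"map\", \"values\": \"string\"}]";
        Schema union = new Schema.Parser().parse(json);
        // A union schema exposes its branches as a list of schemas
        for (Schema branch : union.getTypes()) {
            System.out.println(branch.getType());
        }
    }
}
```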
As described above, a program can pack many local small files into one large Avro file stored in HDFS, with each local file becoming a single Avro record. The program is shown in the code below:
// Writing the Avro data file: pack local small files into one file on HDFS
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class AVRO_WRITE {
    public static final String FIELD_CONTENTS = "contents";
    public static final String FIELD_FILENAME = "filename";
    // Schema: one record per small file, holding its name and raw contents
    public static final String SCHEMA_JSON = "{\"type\": \"record\",\"name\": \"SmallFilesTest\", "
            + "\"fields\": ["
            + "{\"name\":\"" + FIELD_FILENAME + "\",\"type\":\"string\"},"
            + "{\"name\":\"" + FIELD_CONTENTS + "\", \"type\":\"bytes\"}]}";
    public static final Schema SCHEMA = new Schema.Parser().parse(SCHEMA_JSON);

    public static void writeToAvro(File srcPath, OutputStream outputStream) throws IOException {
        DataFileWriter<Object> writer = new DataFileWriter<Object>(new GenericDatumWriter<Object>())
                .setSyncInterval(100);
        writer.setCodec(CodecFactory.snappyCodec());
        writer.create(SCHEMA, outputStream);
        // Append one record per file in the source directory (non-recursive)
        for (Object obj : FileUtils.listFiles(srcPath, null, false)) {
            File file = (File) obj;
            String filename = file.getAbsolutePath();
            byte[] content = FileUtils.readFileToByteArray(file);
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put(FIELD_FILENAME, filename);
            record.put(FIELD_CONTENTS, ByteBuffer.wrap(content));
            writer.append(record);
            // Print each file's MD5 so reads can later be verified against writes
            System.out.println(file.getAbsolutePath() + ":" + DigestUtils.md5Hex(content));
        }
        IOUtils.cleanup(null, writer);
        IOUtils.cleanup(null, outputStream);
    }

    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        FileSystem hdfs = FileSystem.get(config);
        File sourceDir = new File(args[0]);   // local directory of small files
        Path destFile = new Path(args[1]);    // destination Avro file in HDFS
        OutputStream os = hdfs.create(destFile);
        writeToAvro(sourceDir, os);
    }
}
// Reading the Avro data file: print each record's filename and contents MD5
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class AVRO_READ {
    private static final String FIELD_FILENAME = "filename";
    private static final String FIELD_CONTENTS = "contents";

    public static void readFromAvro(InputStream is) throws IOException {
        DataFileStream<Object> reader =
                new DataFileStream<Object>(is, new GenericDatumReader<Object>());
        for (Object o : reader) {
            GenericRecord r = (GenericRecord) o;
            // Print the same filename:md5 pairs that AVRO_WRITE printed
            System.out.println(r.get(FIELD_FILENAME) + ":"
                    + DigestUtils.md5Hex(((ByteBuffer) r.get(FIELD_CONTENTS)).array()));
        }
        IOUtils.cleanup(null, is);
        IOUtils.cleanup(null, reader);
    }

    public static void main(String... args) throws Exception {
        Configuration config = new Configuration();
        FileSystem hdfs = FileSystem.get(config);
        Path destFile = new Path(args[0]);   // Avro file in HDFS to read
        InputStream is = hdfs.open(destFile);
        readFromAvro(is);
    }
}
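The write/read pair above can be exercised end to end without an HDFS cluster by substituting in-memory streams. A minimal sketch, assuming only the Avro library and omitting the snappy codec; RoundTripDemo and the sample values are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class RoundTripDemo {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
                "{\"type\": \"record\", \"name\": \"SmallFilesTest\", \"fields\": ["
                + "{\"name\": \"filename\", \"type\": \"string\"},"
                + "{\"name\": \"contents\", \"type\": \"bytes\"}]}");

        // Write one record to an in-memory buffer instead of HDFS
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
        writer.create(schema, out);
        GenericRecord record = new GenericData.Record(schema);
        record.put("filename", "/tmp/a.txt");
        record.put("contents", ByteBuffer.wrap("hello".getBytes("UTF-8")));
        writer.append(record);
        writer.close();

        // Read it back the same way AVRO_READ does
        DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
                new ByteArrayInputStream(out.toByteArray()),
                new GenericDatumReader<GenericRecord>());
        for (GenericRecord r : reader) {
            System.out.println(r.get("filename"));
        }
        reader.close();
    }
}
```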