Using partitioned tables and buckets in Hive

zhangbaoming815


Using partitioned tables in Hive:

1. Create a partitioned table with ds as the partition column:

create table invites (id int, name string) partitioned by (ds string) row format delimited fields terminated by '\t' stored as textfile;

2. Load data into the partition for the date 2012-10-12:

load data local inpath '/home/hadoop/Desktop/data.txt' overwrite into table invites partition (ds='2012-10-12');

3. Load data into the partition for the date 2012-10-20:

load data local inpath '/home/hadoop/Desktop/data.txt' overwrite into table invites partition (ds='2012-10-20');

4. Query the data in a single partition:

select * from invites where ds ='2012-10-12';

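5. Write query results into one partition of a partitioned table (insert overwrite replaces whatever that partition already holds):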

insert overwrite table invites partition (ds='2012-10-12') select id,max(name) from test group by id;
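The difference between overwriting and appending is easy to miss in step 5, so here is a minimal sketch of both. It reuses the invites and test tables from the steps above and assumes a Hive version that supports insert into; it is an illustration, not part of the original steps.

```sql
-- Replaces everything currently stored in the ds='2012-10-12' partition.
insert overwrite table invites partition (ds='2012-10-12')
select id, max(name) from test group by id;

-- Appends to the same partition instead, keeping the rows already there.
insert into table invites partition (ds='2012-10-12')
select id, max(name) from test group by id;

-- The same distinction applies to load data: without the overwrite keyword
-- the file is added to the partition rather than replacing its contents.
load data local inpath '/home/hadoop/Desktop/data.txt'
into table invites partition (ds='2012-10-20');
```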

To see how the partitions are laid out on disk, use the command:

hadoop fs -ls /home/hadoop/hive/warehouse/invites
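The partitions can also be inspected from inside Hive. The sketch below uses the invites table from above; show partitions is a standard Hive statement, while the two queries are only illustrations of how the partition column behaves.

```sql
-- List every partition of the table. After steps 2 and 3 this should show
-- ds=2012-10-12 and ds=2012-10-20.
show partitions invites;

-- ds can be used like an ordinary column, although its value is stored in
-- the partition directory name rather than inside the data files.
select ds, count(*) as row_count
from invites
group by ds;

-- Filtering on ds lets Hive read only the matching partition directories.
select id, name
from invites
where ds >= '2012-10-12' and ds < '2012-10-20';
```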

If you want to see the results from within Eclipse, Hadoop still has to be running; start it with start-all.sh.

Using buckets in Hive:

1. Create a bucketed table:

create table bucketed_user(id int,name string) clustered by (id) sorted by(name) into 4 buckets row format delimited fields terminated by '\t' stored as textfile;

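2. Enforce bucketing, so that the insert runs with one reducer per bucket (otherwise you would have to set the number of reducers to match the bucket count yourself):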

set hive.enforce.bucketing=true;

3. Insert data into the table:

insert overwrite table bucketed_user select * from test;

4. List the table's directory in the warehouse; it now contains four files, one per bucket:

dfs -ls /home/hadoop/hive/warehouse/bucketed_user;

5. Read the data in each file, for example the first one:

dfs -cat /home/hadoop/hive/warehouse/bucketed_user/000000_0;

Buckets are assigned by hashing the clustering column, so the files will not necessarily hold the same number of rows.
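To see how unevenly the rows are spread, one option is to count them per bucket file using Hive's INPUT__FILE__NAME virtual column. The second query rests on an assumption: that rows are assigned with the classic scheme of hashing the clustering column and taking the result modulo the number of buckets. Newer Hive versions can use a different hash function for bucketing, so treat that query as an illustration rather than an exact prediction.

```sql
-- Count the rows stored in each bucket file of the table.
select INPUT__FILE__NAME as bucket_file, count(*) as row_count
from bucketed_user
group by INPUT__FILE__NAME;

-- Approximate which bucket each source row would land in
-- (assumes the classic hash-modulo scheme and 4 buckets).
select pmod(hash(id), 4) as bucket, count(*) as row_count
from test
group by pmod(hash(id), 4);
```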

6. Sample the data in the buckets:

select * from bucketed_user tablesample(bucket 1 out of 4 on name);

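Bucket numbers start at 1, so the query above asks for the first of 4 buckets, which is roughly a quarter of the rows. Because the table is clustered by id, Hive can read just the matching bucket file only when the sample is taken on id; sampling on another column such as name still returns about a quarter of the rows, but it scans the whole table.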

7. Return about half of the rows by asking for 1 bucket out of 2:

select * from bucketed_user tablesample(bucket 1 out of 2 on name);
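A simple way to sanity-check the sampling fractions is to compare row counts. The sketch below samples on id, the column the table is clustered by, so Hive can pick whole bucket files; with 4 buckets, asking for bucket 1 out of 2 should read buckets 1 and 3. The last query shows sampling on rand() for a random subset, which always scans the full table. The exact counts depend on your data, so none are shown.

```sql
-- Total number of rows in the table.
select count(*) from bucketed_user;

-- Roughly a quarter of the rows: the first of the 4 buckets.
select count(*) from bucketed_user tablesample(bucket 1 out of 4 on id);

-- Roughly half of the rows: buckets 1 and 3 of the 4 buckets.
select count(*) from bucketed_user tablesample(bucket 1 out of 2 on id);

-- A random sample of about a quarter of the rows, independent of bucketing.
select * from bucketed_user tablesample(bucket 1 out of 4 on rand());
```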

2012-07-12 20:14