Statistics in Hive (Hive statistics collection), translation

tobyqiu

Hive statistics collection

Motivation
Scope
Implementation
Usage

Configuration Variables
Newly Created Tables
Existing Tables

Examples

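Motivation

Statistics such as the number of rows in a table, the number of partitions, and column histograms are important information. Their key use is query optimization: statistics are fed into the optimizer's cost functions so that it can compare different query plans and choose among them. Statistics can sometimes also serve the user's query directly, because some basic questions can be answered from the stored statistics alone, without running an execution plan. Examples include the age distribution of users, the top 10 apps people use, and the number of distinct sessions.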

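Scope

The first milestone in supporting statistics was table-level and partition-level statistics. For newly created tables and existing tables alike, table and partition statistics are now stored in the Hive metastore. The following statistics are currently supported for partitions:

1. Number of rows

2. Number of files

3. Size in bytes

For tables, the same statistics are supported, plus the number of partitions in the table.

Column-level top K values can also be gathered along with partition-level statistics. See Top K Statistics.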

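Implementation

Statistics collection falls roughly into two cases: newly created tables and existing tables.

For a newly created table, the job that creates the table is a MapReduce job. During creation, every mapper gathers row counts while copying rows into the table's files and publishes them to a database (possibly MySQL). At the end of the MapReduce job, the published statistics are aggregated and stored in the metastore. A similar process happens for existing tables: a map-only job is created, every mapper gathers row statistics while scanning the table, and the rest of the process is the same.

One point needs to be made clear: a database is needed to store the temporary statistics. There are currently two implementations, one using MySQL and the other using HBase. Two interfaces, IStatsPublisher and IStatsAggregator, are provided, and developers can implement them to support any other storage. The interfaces are listed below: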

package org.apache.hadoop.hive.ql.stats;

import org.apache.hadoop.conf.Configuration;

/**
 * An interface for any possible implementation for publishing statistics.
 */
public interface IStatsPublisher {

  /**
   * This method does the necessary initializations according to the implementation requirements.
   */
  public boolean init(Configuration hconf);

  /**
   * This method publishes a given statistic into a disk storage, possibly HBase or MySQL.
   *
   * rowID : a string identifying the statistics to be published then gathered, possibly the table name + the partition specs.
   *
   * key : a string noting the key to be published. Ex: "numRows".
   *
   * value : an integer noting the value of the published key, passed as a string.
   */
  public boolean publishStat(String rowID, String key, String value);

  /**
   * This method executes the necessary termination procedures, possibly closing all database connections.
   */
  public boolean terminate();

}

package org.apache.hadoop.hive.ql.stats;

import org.apache.hadoop.conf.Configuration;

/**
 * An interface for any possible implementation for gathering statistics.
 */
public interface IStatsAggregator {

  /**
   * This method does the necessary initializations according to the implementation requirements.
   */
  public boolean init(Configuration hconf);

  /**
   * This method aggregates a given statistic from a disk storage.
   * After aggregation, this method does cleaning by removing all records from the disk storage that have the same given rowID.
   *
   * rowID : a string identifying the statistic to be gathered, possibly the table name + the partition specs.
   *
   * key : a string noting the key to be gathered. Ex: "numRows".
   */
  public String aggregateStats(String rowID, String key);

  /**
   * This method executes the necessary termination procedures, possibly closing all database connections.
   */
  public boolean terminate();

}
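
To make the publish-then-aggregate flow concrete, here is a minimal in-memory sketch. It is not one of Hive's implementations: the class name InMemoryStatsStore is hypothetical, it only mirrors the publishStat and aggregateStats signatures from the interfaces above, and it leaves out init(Configuration) and terminate() so that it compiles without Hadoop on the classpath. It also assumes that aggregation means summing the partial values, which fits a counter such as numRows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical in-memory stand-in for the temporary statistics store.
 * Mirrors publishStat/aggregateStats from the interfaces above; it is
 * an illustration only, not a class that exists in Hive.
 */
class InMemoryStatsStore {

  // rowID -> key -> partial values published by individual mappers
  private final Map<String, Map<String, List<Long>>> rows = new HashMap<>();

  /** Records one partial value, as a single mapper would. */
  public synchronized boolean publishStat(String rowID, String key, String value) {
    long parsed;
    try {
      parsed = Long.parseLong(value);
    } catch (NumberFormatException e) {
      return false; // reject values that are not integers
    }
    rows.computeIfAbsent(rowID, r -> new HashMap<>())
        .computeIfAbsent(key, k -> new ArrayList<>())
        .add(parsed);
    return true;
  }

  /** Sums the partial values for rowID/key, then removes all records for rowID. */
  public synchronized String aggregateStats(String rowID, String key) {
    Map<String, List<Long>> stats = rows.remove(rowID);
    long total = 0;
    if (stats != null && stats.containsKey(key)) {
      for (long v : stats.get(key)) {
        total += v;
      }
    }
    return Long.toString(total);
  }
}
```

With this sketch, two mappers publishing "200" and "300" under the same rowID and the key "numRows" lead to an aggregated value of "500". Because aggregation removes every record for that rowID, as the interface comment describes, a second call for the same rowID returns "0".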

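Usage

Configuration Variables

See the list of statistics configuration variables for the available parameters and how to use them.

Newly Created Tables

For newly created tables and/or partitions (populated through INSERT OVERWRITE), statistics are computed automatically by default. If the user sets hive.stats.autogather to false, statistics are neither computed automatically nor stored in the Hive metastore.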

set hive.stats.autogather=false;

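The user can also choose the temporary statistics store by setting the variable hive.stats.dbclass. For example, to use HBase as the temporary statistics store (the default is jdbc:derby), run: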

set hive.stats.dbclass=hbase;

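If the temporary store is accessed through JDBC (e.g. Derby or MySQL), the user should specify the appropriate connection string by setting hive.stats.dbconnectionstring. The JDBC driver can be specified through hive.stats.jdbcdriver: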

set hive.stats.dbclass=jdbc:derby;
set hive.stats.dbconnectionstring="jdbc:derby:;databaseName=TempStatsStore;create=true";
set hive.stats.jdbcdriver="org.apache.derby.jdbc.EmbeddedDriver";

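Queries may fail to collect statistics completely accurately. The setting hive.stats.reliable covers this case: when it is true, a query fails if its statistics cannot be collected reliably. It is false by default.

Existing Tables

For existing tables and/or partitions, the user can issue the ANALYZE command to gather statistics and write them to the Hive metastore. The syntax of the command is: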

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];

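When issuing the command, the user may or may not specify partition specs. If no partition spec is given, statistics are gathered for the table and for all of its partitions (if any). If partition specs are given, statistics are gathered only for those partitions. When statistics are computed across all partitions, the partition columns still need to be listed.

When the optional NOSCAN parameter is specified, the command does not scan files, so it runs faster. Instead of all statistics, it gathers only the following:

Number of files
Physical size in bytes

Examples

Suppose table Table1 has 4 partitions with the following specs: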

Partition1: (ds='2008-04-08', hr=11)
Partition2: (ds='2008-04-08', hr=12)
Partition3: (ds='2008-04-09', hr=11)
Partition4: (ds='2008-04-09', hr=12)

If the user issues the following command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS;

then statistics are gathered only for partition 3 (ds='2008-04-09', hr=11).

If the user issues the following command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS;

then statistics are gathered only for partitions 3 and 4.

If the user issues the following command:

ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS;

then statistics are gathered for all 4 partitions.
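
The selection rule behind the three commands above can be sketched in a few lines of Java. This only illustrates the matching semantics and is not how Hive implements ANALYZE: the class PartitionSpecMatcher and its methods are hypothetical, partitions are modeled as plain maps from column name to value, and a partition column listed without a value (such as hr in PARTITION(ds='2008-04-09', hr)) is modeled as a null entry.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that mimics how a partition spec selects partitions. */
class PartitionSpecMatcher {

  /** Builds an ordered column -> value map from alternating names and values. */
  static Map<String, String> spec(String... kv) {
    Map<String, String> m = new LinkedHashMap<>();
    for (int i = 0; i + 1 < kv.length; i += 2) {
      m.put(kv[i], kv[i + 1]);
    }
    return m;
  }

  /**
   * Returns the partitions selected by a spec. A null value in the spec means
   * the column was listed without a value, so it matches every partition.
   */
  static List<Map<String, String>> select(List<Map<String, String>> partitions,
                                          Map<String, String> spec) {
    List<Map<String, String>> selected = new ArrayList<>();
    for (Map<String, String> p : partitions) {
      boolean matches = true;
      for (Map.Entry<String, String> e : spec.entrySet()) {
        if (e.getValue() != null && !e.getValue().equals(p.get(e.getKey()))) {
          matches = false;
          break;
        }
      }
      if (matches) {
        selected.add(p);
      }
    }
    return selected;
  }
}
```

Applied to the four partitions of Table1, the spec (ds='2008-04-09', hr=11) selects one partition, (ds='2008-04-09', hr) selects two, and (ds, hr) selects all four, matching the outcomes described above.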

For a non-partitioned table, the following command can be used:

ANALYZE TABLE Table1 COMPUTE STATISTICS;

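If Table1 is a partitioned table, the partition columns must be specified as shown above; otherwise the semantic analyzer throws an error.

The user can view the stored statistics with the DESCRIBE command. Statistics are stored in the parameters array. To view the statistics for the whole table, issue the following command: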

DESCRIBE EXTENDED TABLE1;

which produces output like the following:

... , parameters:{numPartitions=4, numFiles=16, numRows=2000, totalSize=16384, ...}, ....

If the user issues the following command:

DESCRIBE EXTENDED TABLE1 PARTITION(ds='2008-04-09', hr=11);

the output looks like the following:

... , parameters:{numFiles=4, numRows=500, totalSize=4096, ...}, ....
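
When the statistics are needed programmatically, one option is to pull the numeric entries out of the DESCRIBE EXTENDED text. The sketch below is an assumption-laden illustration rather than a Hive API: the class DescribeParams is hypothetical, and it looks only for the four keys that appear in the sample output above (numPartitions, numFiles, numRows, totalSize), wherever they occur in the string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts statistics from DESCRIBE EXTENDED text. */
class DescribeParams {

  // Matches key=digits for the statistic keys shown in the sample output.
  private static final Pattern STAT =
      Pattern.compile("\\b(numPartitions|numFiles|numRows|totalSize)=(\\d+)\\b");

  /** Returns the statistics found in the text, in order of appearance. */
  static Map<String, Long> parse(String describeOutput) {
    Map<String, Long> stats = new LinkedHashMap<>();
    Matcher m = STAT.matcher(describeOutput);
    while (m.find()) {
      stats.put(m.group(1), Long.parseLong(m.group(2)));
    }
    return stats;
  }
}
```

Parsing the two sample outputs gives numRows=2000 for the table and numRows=500 for partition 3. The table-level figures (16 files, 2000 rows, 16384 bytes) are what four partitions of the size shown for partition 3 would add up to, although the article only shows that one partition.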

If the user issues the following command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS noscan;

then only the number of files and the physical size in bytes are gathered for partitions 3 and 4.

2014-05-13 21:49