joerong666

Q&A from the Impala introduction blog post

Original blog post:

http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

  • SONAL / OCTOBER 25, 2012 / 11:44 AM

Very excited to see Impala. The Dremel paper outlines efficient columnar storage for nested data. How does Impala achieve its speeds if data is not to be loaded into the system?

Thanks
Sonal

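The Dremel paper describes using columnar storage to store nested data efficiently. If the data is not loaded into the system, how does Impala achieve its speed?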

To address Sonal’s question:

The performance advantage you will see with Impala will always depend on the storage format of the data, among other things. Impala tries hard to be fast on ASCII-encoded data (text files and SequenceFile), but of course the parsing overhead will always show up as a performance penalty compared to something like ColumnIO or Trevni. Impala will also support Trevni in the GA release, as mentioned in the blog post.

Regarding data loading: we are working on background conversion into Trevni, in a way that enables a logical table to be backed by a mix of data formats. New data would show up in, say, SequenceFile format and eventually get converted into the more efficient Trevni columnar format, but all of the data would be queryable at all times, regardless of format.

Marcel

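Impala's performance advantage always depends on the storage format of the data. Impala works hard to process ASCII-encoded data quickly, but compared with ColumnIO or Trevni, the parsing overhead will inevitably cost performance. Impala will support Trevni in the GA release.

On data loading: the team is working on converting data into Trevni in the background, in a way that lets a single logical table be backed by a mix of formats. New data arrives in SequenceFile format and is eventually converted into the more efficient Trevni columnar format, but all of the data is queryable at all times, regardless of format.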
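To make the "logical table backed by a mix of data formats" idea concrete, here is a minimal in-memory sketch. It is not Impala code: `RowFile`, `ColumnarFile`, `LogicalTable`, and `convert_one` are hypothetical names invented for illustration, and plain Python lists stand in for files on HDFS. It only shows that a scan can return the same rows before and after a background conversion.

```python
from dataclasses import dataclass, field


@dataclass
class RowFile:
    """Hypothetical row-oriented file (think text or SequenceFile)."""
    rows: list  # list of dicts, one per row

    def scan(self):
        yield from self.rows


@dataclass
class ColumnarFile:
    """Hypothetical column-oriented file (think Trevni)."""
    columns: dict  # column name -> list of values

    def scan(self):
        names = list(self.columns)
        # Stitch the column lists back into row dicts.
        for values in zip(*(self.columns[n] for n in names)):
            yield dict(zip(names, values))


@dataclass
class LogicalTable:
    """One table whose files may be in different formats."""
    files: list = field(default_factory=list)

    def scan(self):
        # Every file is queryable, whatever its format.
        for f in self.files:
            yield from f.scan()

    def convert_one(self):
        """Assumed background step: rewrite one row file as columnar."""
        for i, f in enumerate(self.files):
            if isinstance(f, RowFile) and f.rows:
                names = list(f.rows[0])
                cols = {n: [r[n] for r in f.rows] for n in names}
                self.files[i] = ColumnarFile(cols)
                return True
        return False


table = LogicalTable([
    RowFile([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]),  # newly arrived data
    ColumnarFile({"id": [3], "v": ["c"]}),                 # already converted
])
before = list(table.scan())
converted = table.convert_one()
after = list(table.scan())
```

In this sketch the query result is identical before and after the conversion; only the physical layout of the first file changes.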

  • ALEX B / NOVEMBER 22, 2012 / 8:25 AM

Can you please comment on how Impala compares to Hadapt in terms of architecture? As far as I understand, in the case of Hadapt (and I could be wrong, of course), some transformation of the data to PostgreSQL is needed. That does not seem to be the case with Impala (at least in the current implementation)?

Thanks,
Alex

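How does Impala compare with Hadapt architecturally? Hadapt seems to require some transformation of the data into PostgreSQL; Impala does not appear to need that.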

Regarding Alex’s question:

That’s correct, Impala does read data directly from HDFS and HBase. Impala also relies on Apache Hive’s metastore for the mapping of files into tables, which means you can re-use your schema definitions if you’re already querying Hadoop through Hive.

Hadapt runs a PostgreSQL instance on each data node, and appears to require some form of data movement (and duplication of data storage) between Postgres and HDFS, but for the specifics of that architecture I would recommend consulting the Hadapt website.

Marcel

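Impala reads data directly from HDFS and HBase. It also relies on Hive's metastore to map files to tables, so if you already query data in Hadoop through Hive, you can reuse your schema definitions.

Hadapt runs a PostgreSQL instance on each data node and appears to require some form of data movement (and duplicated storage) between PostgreSQL and HDFS; for the details of that architecture, consult the Hadapt website.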

Great stuff! We have tried it, and Impala shows about a 2x speedup vs. Hive for a simple query on our test dataset.

Could Marcel explain more about the main reasons that make Impala faster?
1. About columnar storage: it seems that Hive can also benefit from columnar storage compared with text files.
2. About distributed scalable aggregation algorithms: are there any details or examples of the algorithms?
3. About joins: if the dataset cannot fit into memory, how does Impala stay fast if it has to use disk?
4. About main memory as a cache for table data: is it a cache in Impala for recently accessed data?

Thanks!
Kang

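We have tried Impala; on a simple query against our test dataset, it was about 2x faster than Hive.

Could Marcel explain the main reasons Impala is fast:

  1. Columnar storage: compared with text files, Hive can also benefit from columnar storage.
  2. Distributed scalable aggregation algorithms: are there details and examples of the algorithms?
  3. Joins: if the dataset cannot be read entirely into memory, how does Impala stay fast when it uses disk?
  4. Main memory as a cache for table data: does it cache the data Impala accessed most recently?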

Regarding Kang’s questions:

1. Yes, the Trevni columnar storage format will be an open and general purpose storage format that will be available for any of the Hadoop processing frameworks, including Hive, MapReduce, and Pig.

However, we expect to see greater performance gains from Trevni in Impala compared to what you’d see in Hive. The reason is that in a disk-based system, Impala is often I/O-bound, and a columnar format will reduce the total I/O volume, often by a substantial amount. Hive is often CPU-bound and will therefore benefit much less from a reduction in I/O volume.

2. At the moment, Impala does a simple 2-stage aggregation: pre-aggregation is done by all executing backends, followed by a single, central merge aggregation step in the coordinator. In an upcoming release Impala will also support repartitioning aggregation, where the result of the pre-aggregation step is hash-partitioned across all executing backends, so that the total merge aggregation work is also distributed.

3. Impala currently has the limitation that the right-hand side table of a join needs to fit into the memory of every executing backend. In the GA release, this will be relaxed, so that the right-hand side table will only have to fit into the *aggregate* memory of all executing backends. Disk-based join algorithms won’t be available until after the GA release.

4. Impala does not maintain its own cache; instead, it relies on the OS buffer cache in order to keep frequently-accessed data in memory.

Marcel
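The two aggregation strategies from answer 2 can be sketched in a few lines. This is only an illustration of the idea for a grouped SUM, not Impala's implementation: the function names are made up, "backends" are plain Python lists, and Python's built-in `hash` stands in for whatever partitioning function a real engine would use.

```python
from collections import defaultdict


def pre_aggregate(rows):
    """Stage 1, run on every backend: partial SUM per group key."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def central_merge(partials):
    """Stage 2, current behavior as described: one merge in the coordinator."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


def repartitioned_merge(partials, num_backends):
    """Planned variant: hash-partition partial results across backends,
    so each backend merges only the keys that land in its bucket."""
    buckets = [[] for _ in range(num_backends)]
    for partial in partials:
        for key, value in partial.items():
            buckets[hash(key) % num_backends].append((key, value))
    result = {}
    for bucket in buckets:  # each iteration models one backend's merge work
        result.update(pre_aggregate(bucket))
    return result


# Three backends, each holding a slice of (group_key, value) rows.
backend_rows = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5)],
]
partials = [pre_aggregate(rows) for rows in backend_rows]
merged_centrally = central_merge(partials)
merged_repartitioned = repartitioned_merge(partials, num_backends=2)
```

Both variants produce the same totals. The difference is where the merge work happens: in one place for the central merge, or spread over all backends, because a given key always hashes to the same bucket.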

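  1. The Trevni columnar storage format will be an open, general-purpose storage format available to all Hadoop processing frameworks, including Hive, MapReduce, and Pig.

     However, Trevni is expected to bring larger performance gains in Impala than in Hive. In a disk-based system, Impala is often I/O-bound, and a columnar format reduces the total I/O volume, often substantially. Hive is often CPU-bound, so it benefits less from reduced I/O volume.
  2. At present, Impala performs a simple two-stage aggregation: pre-aggregation runs on all executing backends, followed by a single, central merge aggregation step in the coordinator. In an upcoming release, Impala will also support repartitioning aggregation, in which the pre-aggregation results are hash-partitioned across all executing backends so that the merge aggregation work is distributed as well.
  3. Impala currently requires the right-hand side table of a join to fit into the memory of every executing backend. In the GA release this will be relaxed: the right-hand side table will only need to fit into the aggregate memory of all executing backends. Disk-based join algorithms will not be available until after the GA release.
  4. Impala does not maintain its own cache; instead, it relies on the OS buffer cache to keep frequently accessed data in memory.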
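The memory limits described in answer 3 can be illustrated with two hash-join layouts. Marcel does not name the technique the GA release will use, so treat this as an assumption: a partitioned hash join is one common way to make the right-hand side fit into aggregate rather than per-backend memory. All function names here are hypothetical, and rows are `(join_key, payload)` tuples.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows):
    """In-memory hash join: build a table on the right side, probe with the left."""
    table = defaultdict(list)
    for key, payload in right_rows:
        table[key].append(payload)
    return [(key, lp, rp) for key, lp in left_rows for rp in table.get(key, [])]


def broadcast_join(left_splits, right_rows):
    """Every backend joins its left split against the FULL right table,
    so the right table must fit into each backend's memory."""
    out = []
    for split in left_splits:  # one split per backend
        out.extend(hash_join(split, right_rows))
    return out


def partitioned_join(left_rows, right_rows, num_backends):
    """Both sides are hash-partitioned on the join key, so each backend
    holds only its share of the right table (assumed technique)."""
    left_parts = [[] for _ in range(num_backends)]
    right_parts = [[] for _ in range(num_backends)]
    for row in left_rows:
        left_parts[hash(row[0]) % num_backends].append(row)
    for row in right_rows:
        right_parts[hash(row[0]) % num_backends].append(row)
    out = []
    for lp, rp in zip(left_parts, right_parts):  # one pair per backend
        out.extend(hash_join(lp, rp))
    return out


left = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
right = [(1, "x"), (3, "y"), (3, "z")]
via_broadcast = sorted(broadcast_join([left[:2], left[2:]], right))
via_partition = sorted(partitioned_join(left, right, num_backends=3))
```

Both layouts return the same joined rows. The partitioned version trades extra data movement (the left side is reshuffled too) for a smaller per-backend memory footprint.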