`

《Pro Oracle SQL》Chapter 3 -- 3.3.3 Hash Joins

阅读更多

Hash Joins    哈希连接       (page 116)
     Hash joins, like sort-merge joins, first reads the two tables to be joined independently and applies the
criteria in the WHERE clause.  Based on table and index statistics, the table that is determined to return the fewest rows will be hashed in its entirety into memory.  This hash table includes all the row data for that table and is loaded into hash buckets based on a randomizing function that converts the join key to a hash value.  As long as there is enough memory available, this hash table will reside in memory.  However, if there is not enough memory available, the hash table may be written to temp disk space. 
    哈希连接,像排序-合并连接,首先分别读取待连接的两表且应用WHERE子句中的标准(筛选有效值)。基于表和索引统计,把确定返回最少行集的表整个哈希化后存入内存。 这个哈希表包括表中的所有行数据,根据随机函数转换连接键成哈希值再载入hash桶。 只要有足够可用的内存,哈希表将驻留在内存中。然而, 如果没有足够的有效的内存,哈希表可能就存在临时磁盘空间上了。
    The next step is for the other larger table to be read and the hash function is applied to the join key
column.  That hash value is then used to probe the smaller in memory hash table for the matching hash
bucket where the row data for the first table resides.  Each bucket has a list (represented by a bitmap) of the rows in that bucket.  That list is checked for matches with the probing row.  If a match is made, the row is returned; otherwise it is discarded.  The larger table is read only once and each row is checked for a match.  This is different from the nested loops join where the inner table is read multiple times.  So really in this case, the larger table is the driving table as it is read only once and the smaller hashed table is probed many times.  Unlike a nested loops join plan, however, the tables are listed in the plan output with the smaller hashed table first and the larger probe table second.
    下一步是对另一个较大的表进行读取,哈希函数应用于连接列。接下来用哈希值探查内存中的较小的哈希过的表以匹配哈希桶,而第一张表的行数据驻留在其中。每个桶有一个该桶所包含的行集的列表(由位图表示)。通过匹配探测行检查列表。如果匹配,行返回,否则抛弃。大表只读一次且一次匹配每行都要检查。这不同于嵌套循环连接,其内部表读取多次。这种情况很实际,较大的表是驱动表因为它只读一次而较小的哈希过的表则探测多次。 不像嵌套循环,计划输出所列出的是较小 的哈希过的表放在第一而较大表放在第二。
    Let’s use the same query used earlier and break it down into how the hash join would be processed.   
     让我们用之前相同的查询,分解它,看看哈希连接是如何进行的。
select empno, ename, dname, loc
from emp, dept
where emp.deptno = dept.deptno
 
    This query would be processed as if it we rewritten like the following pseudocode:
    该查询的处理过程如下重写的伪代码:

determine the smaller row set, or in the case of an outer join,  确定较小的行集,或者若是外连接,使用外连接表
  use the outer joined table

select dname, loc, deptno from dept
hash the deptno column and build a hash table                      对dept表的deptno列哈希,且建立哈希表(因为dept数据少,是较小行集)
select empno, ename, deptno from emp
hash the deptno column and probe the hash table                  对emp表的deptno列哈希,再探测哈希表
if match made, check bitmap to confirm row match                如果匹配,检查位图确认行匹配
if no match made, discard the row                                        如果不匹配,抛弃行
 
Listing 3-20 shows the plan for this query.
Listing 3-20. Hash Join 
----------------------------------------------------------------
| Id  | Operation                      | Name | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |    14    |   462 |     7  (15)    |
|*  1 |  HASH JOIN                 |           |    14    |   462 |     7  (15)    |
|   2 |   TABLE ACCESS FULL| DEPT  |     4     |    80 |     3   (0)     |
|   3 |   TABLE ACCESS FULL| EMP   |    14    |   182 |     3   (0)     |
----------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("EMP"."DEPTNO"="DEPT"."DEPTNO")
 
    In the hash join plan, the smaller hash table is listed first and the probe table is listed second. 
Keep in mind that the decision as to which table is smallest depends not just on the number of rows but
the size of those rows as well, since the entire row must be stored in the hash table.

    在哈希连接计划中,较小的哈希表列在第一而探测表列在第二。记住判断哪张表是最小的不仅依赖于行数还有行集的大小,因为整个行必须存储于哈希表中。
    Hash joins are considered more preferable when the row sources are larger and the result set is
larger as well.  Also, if one of the tables in the join is determined to always return the same row source,
a hash join would be preferable since it would only access that table once.  If a nested loops join was
chosen in that case, the row source would be accessed over and over again, requiring more work than a
single independent access.  Finally, if the smaller table can fit in memory, a hash join may be favored. 
    当行集较大而结果集也较大时哈希连接被认为是较完美的。如果要连接的一张表总是返回相同的行源,哈希连接将是完美的,因为只需要访问那张表一次。如果这种情况下选择的是嵌套循环连接,行源将一遍又一遍的访问,相比一次独立的访问要做更多的工作。最后,如果较小的表能在内存中放得下,哈希连接将是受惠的。
    Blocks are accessed for hash joins similar to how they are accessed for a sort-merge join.  The
blocks needed to build the hash table will be read and then the rest of the work will be done against the
hashed data stored in memory (from temp disk space if there isn’t enough memory).  So, when you do a
comparison of logical reads for a hash join to a sort-merge join, the block accesses will be
approximately identical.  But the logical reads as compared to a nested loops join will be less since the
blocks are read once and either placed into memory (for the hash table) where they are then accessed
or only read once (for the probe table).  
    哈希连接的块访问方式相似于排序-合并连接的。构建哈希表所需的块读入后对哈希数据的其余的工作将在内存(如果没有足够的内存就在临时磁盘空间)中完成。 因此,当你比较哈希连接和排序-合并连接的逻辑读,块访问大致相同。但是逻辑读相比嵌套循环连接要少很多,因为块读了一次且要么放在内存中(对哈希表而言),用于之后访问,要么只读一次(对探测表而言)。
    Hash joins are only possible if the join is an equi-join.   As mentioned previously, a sort-merge
join can be used to handle joins specified with an inequality condition.  The reason why hash joins
can’t be chosen unless the join is an equi-join is that the matches are made on hashed values and it
doesn’t make sense
to consider hashed values in a range.
  Listing 3-21 demonstrates how a computed
hash value doesn’t necessarily correspond to the key value being hashed (in terms of its numeric
value, in this case).
    哈希连接只可能是等值连接。 正如之前提到的,排序-合并连接能用于处理特定的不等值条件的连接。为什么哈希连接不能选中,除非连接是等值连接的原因是:匹配是通过哈希值进行的而认为哈希值是在一个范围内的是不合逻辑 的。 列表3-21演示了如何计算的哈希值并不一定对应被哈希的键值(根据它的数值,在本例中)。
Listing 3-21. Hash Values
SQL> select distinct deptno,
  2        ora_hash(deptno,1000) hv
  3    from scott.emp
  4   order by deptno;
 
         DEPTNO              HV
--------------- ---------------
             10             547
             20             486
             30             613
SQL>
SQL> select deptno
  2    from
  3  (
  4  select distinct deptno,
  5        ora_hash(deptno,1000) hv
  6    from scott.emp
  7   order by deptno
  8  )
  9   where hv between 100 and 500;
         DEPTNO
---------------
             20
SQL>
SQL> select distinct deptno,
  2        ora_hash(deptno,1000,50) hv
  3    from scott.emp
  4   order by deptno;
 
         DEPTNO              HV
--------------- ---------------
             10             839
             20             850
             30             290
SQL>
SQL> select deptno
  2    from
  3  (
  4  select distinct deptno,
  5        ora_hash(deptno,1000,50) hv
  6    from scott.emp
  7   order by deptno
  8  )
  9   where hv between 100 and 500;
 
         DEPTNO
---------------
             30
     I used the  ora_hash function to demonstrate how a hash value might be generated.  The  ora_hash
function takes up to three parameters: an input value of any base type, the maximum hash bucket value
(the minimum value is zero), and a seed value (also defaults to zero). 
So, for example, ora_hash(10,1000)  will return an integer value between zero and 1000.
    我使用ora_hash函数演示哈希可能是如何产生的。ora_hash函数接收三个参数:一个任意基础类型的输入值,最大哈希桶值(最小值是0),和一个种子值(默认值也是零)。 例如,ora_hash(10,1000)将返回一个0到1000之间的整数。
    In the two examples, I use the default seed in the first and a seed value of 50 for the second.  Notice
how the hash values for each deptno are quite different in each query.  So when I try to query a range of
hash values for each, I get a different result.  However, in both cases, if I was simply querying a range
of the column values, I could easily formulate what I wanted and be assured of always getting the right
answer.  This example is a bit forced, but I wanted to give you a visual on hash value comparisons so
you could better understand why they don’t work with inequality joins.
    在两个例子中,第一个我使用默认的的种子值,而第二个例子的种子值是50。注意每个deptno的哈希值是如何在每次查询中都不同的。这样当我每次试图查询哈希值的范围时,我获得不同的结果。然而,在上述两例中,如果我只是查询一个范围的列值,我本该能轻易的明确表达 我想要的,而且保证总是获得正确的答案。本例有点牵强, 但是我想给你形象的哈希值比较,这样你就能更好的理解为什么他们不能用于不等值连接了。

 

注:ora_hash()参考

http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10759/functions097.htm

1
1
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics