Open Addressing

whitesock


1 Overview

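Open addressing and chaining are two different strategies for resolving hash collisions. When several distinct keys map to the same slot, chaining stores all of the colliding entries in a linked list. Open addressing instead searches other slots of the same table, starting from that slot, until it finds either the matching key or a free slot; this process is called probing. Common probing strategies are linear probing, quadratic probing and double hashing.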

2 Chaining

2.1 Chaining in java.util.HashMap

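Before analyzing open addressing, here is a brief look, for comparison, at the chaining strategy used by most of the Java core collection classes. java.util.HashMap has a member variable table of type Entry[]; each non-null element of it is the head of a singly linked list.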

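put(): if there is a hash collision, the linked list at the corresponding table element becomes longer than 1.
get(): has to traverse the linked list; during the traversal it checks not only whether the hash values are equal, but also whether the keys are identical (==) or equals().
For both put() and get(), HashMap takes a special-case path when the key argument is null.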

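The main drawback of chaining is that every mapping needs an Entry object that holds the key, the value and a reference to the next node of the list (HashMap.Entry has four fields: key, value, next and the cached hash). This means more memory usage, and the overhead is proportionally largest when the key and value themselves are small. In addition, linked lists are not very friendly to the CPU cache.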
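
The Entry layout and the put()/get() traversal described above can be sketched as follows. This is a minimal illustration and not the JDK source: the class name ChainedMapSketch, the fixed table size of 16 and the absence of resizing are assumptions made to keep the example short.

```java
import java.util.Objects;

// Hypothetical minimal chained hash map; not the JDK implementation.
final class ChainedMapSketch<K, V> {
    // One node per mapping, with the four fields mentioned in the text.
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;
        final int hash;

        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    private static int hash(Object key) {
        return key == null ? 0 : key.hashCode();   // null keys share hash 0
    }

    private int indexFor(int hash) {
        return (hash & 0x7fffffff) % table.length;
    }

    V put(K key, V value) {
        int h = hash(key);
        int i = indexFor(h);
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            // compare the cached hash first, then == or equals()
            if (e.hash == h && Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        table[i] = new Entry<>(h, key, value, table[i]); // new head of the chain
        return null;
    }

    V get(Object key) {
        int h = hash(key);
        for (Entry<K, V> e = table[indexFor(h)]; e != null; e = e.next) {
            if (e.hash == h && Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

Every mapping costs one extra Entry object here, which is the memory overhead that open addressing avoids.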

3 Open Addressing

3.1 Probing

3.1.1 Linear probing

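The interval between two successive probe positions is a fixed value: each probe adds a constant (usually 1) to the previous slot position, for example P = (P + 1) mod SLOT_LENGTH. Its biggest advantage is that it is fast to compute, and it is also friendlier to the CPU cache. Its drawback is just as obvious: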

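Suppose key1, key2 and key3 all have the same hash code and key1 is mapped to slot(p). Then finding a position for key2 requires examining slot(p) and slot(p+1), and finding a position for key3 requires examining slot(p), slot(p+1) and slot(p+2). In other words, every key involved in the collision re-examines the positions that earlier probes have already visited. This phenomenon is called clustering.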
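
The growing probe count can be reproduced with a few lines of Java. This is only a sketch: the class name LinearProbingSketch, the table length of 11 and the use of key mod SLOT_LENGTH as the hash are assumptions chosen for illustration.

```java
// Hypothetical sketch: counts how many slots linear probing examines per insert.
final class LinearProbingSketch {
    static final int SLOT_LENGTH = 11;

    // Inserts key into slots (null means free) and returns the number of
    // slots examined. Assumes non-negative keys and a table that is not full.
    static int insert(Integer[] slots, int key) {
        int p = key % SLOT_LENGTH;
        int examined = 1;
        while (slots[p] != null) {
            p = (p + 1) % SLOT_LENGTH;   // P = (P + 1) mod SLOT_LENGTH
            examined++;
        }
        slots[p] = key;
        return examined;
    }

    public static void main(String[] args) {
        Integer[] slots = new Integer[SLOT_LENGTH];
        for (int key : new int[] {5, 16, 27}) {   // all three map to slot 5
            System.out.println(key + " examined " + insert(slots, key) + " slot(s)");
        }
    }
}
```

The three keys examine 1, 2 and 3 slots respectively, matching the key1/key2/key3 description above.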

3.1.2 Quadratic probing

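The interval between two successive probe positions grows linearly, for example P(i) = (P + c1*i + c2*i*i) mod SLOT_LENGTH, where c1 and c2 are constants and c2 is not 0 (if c2 is 0, this degenerates into linear probing). In most respects the performance of quadratic probing lies between that of linear probing and double hashing.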
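
As a small illustration of the formula, the following sketch lists the first few probe positions. The class and method names are made up for this example, and c1 = c2 = 1 with a table length of 11 are arbitrary choices.

```java
// Hypothetical sketch of the quadratic probe sequence P(i).
final class QuadraticProbingSketch {
    // Returns P(0) .. P(count - 1) for home slot p.
    // Assumes non-negative arguments that are small enough not to overflow int.
    static int[] probeSequence(int p, int c1, int c2, int slotLength, int count) {
        int[] seq = new int[count];
        for (int i = 0; i < count; i++) {
            seq[i] = (p + c1 * i + c2 * i * i) % slotLength;
        }
        return seq;
    }

    public static void main(String[] args) {
        int[] seq = probeSequence(5, 1, 1, 11, 5);
        System.out.println(java.util.Arrays.toString(seq));
    }
}
```

For p = 5 the positions are 5, 7, 0, 6, 3, so the step between successive probes is 2, 4, 6, 8 (mod 11): it grows linearly, as stated above.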

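3.1.3 Double hashing

The interval between two successive probe positions is fixed for a given key, but its value is produced by a second hash function, for example P = (P + INCREMENT(key)) mod SLOT_LENGTH, where INCREMENT is that second hash function. Here is a simple example: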

      H(key) = key mod 10
      INCREMENT(key) = 1 + (key mod 7)
      P(15): H(15) = 5;
      P(35): H(35) = 5, which collides with P(15), so a probe is needed; the position is (5 + INCREMENT(35)) mod 10 = 6
      P(25): H(25) = 5, which collides with P(15), so a probe is needed; the position is (5 + INCREMENT(25)) mod 10 = 0
      P(75): H(75) = 5, which collides with P(15), so a probe is needed; the position is (5 + INCREMENT(75)) mod 10 = 1

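As the example shows, this reduces the number of repeated lookups compared with linear probing: 35, 25 and 75 all collide with 15 at slot 5, but their first probes go to slots 6, 0 and 1 respectively, whereas linear probing would send all three to slot 6 first.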
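
The arithmetic of the example can be checked with a short sketch. H and INCREMENT are taken from the example; the class name DoubleHashingSketch and the probe(key, i) helper are assumptions introduced here.

```java
// Hypothetical sketch reproducing the double hashing example above.
final class DoubleHashingSketch {
    static final int SLOT_LENGTH = 10;

    static int h(int key) {
        return key % SLOT_LENGTH;          // H(key) = key mod 10
    }

    static int increment(int key) {
        return 1 + (key % 7);              // INCREMENT(key) = 1 + (key mod 7)
    }

    // Position of the i-th probe; i = 0 is the home slot. Assumes key >= 0.
    static int probe(int key, int i) {
        return (h(key) + i * increment(key)) % SLOT_LENGTH;
    }

    public static void main(String[] args) {
        for (int key : new int[] {35, 25, 75}) {
            System.out.println("P(" + key + "): " + probe(key, 0) + " -> " + probe(key, 1));
        }
    }
}
```

Note that the increment for 25 is 5, which shares a factor with the table length 10, so further probes for 25 would only alternate between slots 0 and 5. A prime table length avoids this, because every increment smaller than the length is then coprime with it.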

3.2 Load Factor

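The performance of a hash table based on open addressing is very sensitive to its load factor. Once the load factor exceeds about 0.7 (the default load factor of Trove maps/sets is 0.5), performance degrades markedly. The number of probes caused by hash collisions is proportional to loadFactor / (1 - loadFactor). As the load factor approaches 1, very few free slots remain in the table and the number of probes can become very large.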
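
To get a feeling for how quickly the ratio grows, it can simply be evaluated for a few load factors. The class and method names below are hypothetical; only the formula comes from the text.

```java
// Hypothetical helper that evaluates loadFactor / (1 - loadFactor).
final class LoadFactorSketch {
    static double probeRatio(double loadFactor) {
        if (loadFactor < 0.0 || loadFactor >= 1.0) {
            throw new IllegalArgumentException("loadFactor must be in [0, 1)");
        }
        return loadFactor / (1.0 - loadFactor);
    }

    public static void main(String[] args) {
        for (double lf : new double[] {0.5, 0.7, 0.9, 0.99}) {
            System.out.printf("loadFactor=%.2f ratio=%.1f%n", lf, probeRatio(lf));
        }
    }
}
```

The ratio is 1 at the Trove default of 0.5, a little over 2 at 0.7, about 9 at 0.9 and about 99 at 0.99.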

3.3 Open addressing in gnu.trove.THashMap

GNU Trove (http://trove4j.sourceforge.net/) is a Java collections library. In some scenarios the Trove collections perform better and use less memory. The following features of Trove are related to open addressing:

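Trove maps/sets do not use chaining to resolve hash collisions; they use open addressing.
Compared with chaining, open addressing places higher demands on the hash function. Through the TObjectHashingStrategy interface, Trove supports custom hash functions (for example, when the default hash function of String or of arrays is not wanted).
The capacity of a Trove map/set is always a prime number, which helps reduce hash collisions.
Unlike java.util.HashSet, Trove sets are not backed by maps, so no extra value reference has to be allocated per element.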

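Compared with java.util.HashMap, gnu.trove.THashMap has no member variable like Entry[] table; instead it stores keys and values directly in Object[] _set and V[] _values respectively. Logically, each element of Object[] _set is in one of three states: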

FREE: the slot is currently free;
REMOVED: the slot was used before, but its data has since been removed;
OCCUPIED: the slot is currently in use;

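The transitions between these three states are as follows:

On construction or resize, every _set element is assigned FREE (FREE state);
When a key/value pair is put into a THashMap, the corresponding _set element is assigned the key argument of put() (OCCUPIED state);
When a key is removed from a THashMap, the corresponding _set element is assigned REMOVED (note: not FREE, because a FREE slot would end the probe sequence of any key that had to probe past this slot);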

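The following is a simple example of the state transitions (:= denotes assignment, H(key) = key mod 11):

put(7, value): _set[7] := 7;
put(9, value): _set[9] := 9;
put(18, value): H(18) = 7 and _set[7] is OCCUPIED, so there is a hash collision; suppose the first probe yields 9: _set[9] is OCCUPIED, so there is still a collision; suppose the next probe yields 1: _set[1] is FREE, so _set[1] := 18;
get(18): _set[7] is OCCUPIED and not equal to 18; suppose the first probe yields 9: the value of _set[9] is not equal to 18; suppose the next probe yields 1: the value of _set[1] equals 18, so the corresponding value is returned;
remove(9): _set[9] := REMOVED;
get(18): the value of _set[7] is not equal to 18; suppose the first probe yields 9: _set[9] is REMOVED, so another probe is needed; suppose the next probe yields 1: the value of _set[1] equals 18, so the corresponding value is returned;
put(9, value): _set[9] := 9;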
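
The role of the REMOVED marker can be tried out with a small stand-alone sketch. It is not Trove source: the class TombstoneSketch, the int keys and the simplified add() are assumptions. It computes real probe positions with an increment of 1 + (key mod 9), so key 18 lands in slot 6 rather than following the hypothetical probe positions 9 and 1 used in the walkthrough.

```java
// Hypothetical sketch of the FREE / REMOVED / OCCUPIED slot states.
final class TombstoneSketch {
    static final Object FREE = new Object();
    static final Object REMOVED = new Object();

    final Object[] set = new Object[11];

    TombstoneSketch() {
        java.util.Arrays.fill(set, FREE);
    }

    // Returns the slot holding key, or -1 if it is absent.
    // Assumes non-negative keys and at least one FREE slot in the table.
    int index(int key) {
        int length = set.length;
        int index = key % length;
        int probe = 1 + (key % (length - 2));
        while (set[index] != FREE) {            // only FREE ends the search
            if (set[index] != REMOVED && set[index].equals(key)) {
                return index;
            }
            index -= probe;
            if (index < 0) index += length;
        }
        return -1;
    }

    // Simplification: assumes key is not already present.
    void add(int key) {
        int length = set.length;
        int index = key % length;
        int probe = 1 + (key % (length - 2));
        while (set[index] != FREE && set[index] != REMOVED) {
            index -= probe;
            if (index < 0) index += length;
        }
        set[index] = key;
    }

    void remove(int key) {
        int i = index(key);
        if (i >= 0) set[i] = REMOVED;           // REMOVED, not FREE
    }
}
```

After add(7), add(18) and remove(7), index(18) still returns 6 because the search steps over the REMOVED marker in slot 7. If remove() wrote FREE instead, the search for 18 would stop at slot 7 and report the key as absent.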

The following code fragment is related to the get() method:

public V get(Object key) {
    int index = index((K) key);
    return index < 0 ? null : _values[index];
}

protected int index(T obj) {
    final TObjectHashingStrategy<T> hashing_strategy = _hashingStrategy;

    final Object[] set = _set;
    final int length = set.length;
    final int hash = hashing_strategy.computeHashCode(obj) & 0x7fffffff;
    int index = hash % length;
    Object cur = set[index];

    if ( cur == FREE ) return -1;

    // NOTE: here it has to be REMOVED or FULL (some user-given value)
    if ( cur == REMOVED || ! hashing_strategy.equals((T) cur, obj)) {
        // see Knuth, p. 529
        final int probe = 1 + (hash % (length - 2));

        do {
            index -= probe;
            if (index < 0) {
                index += length;
            }
            cur = set[index];
        } while (cur != FREE
             && (cur == REMOVED || ! _hashingStrategy.equals((T) cur, obj)));
    }

    return cur == FREE ? -1 : index;
}

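As the code above shows, get() works as follows: it uses the hash value of the key to locate the corresponding _set element and checks whether there is a hash collision.

The _set element found this way is in one of the following situations:

FREE: the key does not exist in the THashMap;
OCCUPIED, and the value of the element equals the key argument of get(): the key exists in the THashMap;
Neither of the above: there is a hash collision, and probing continues until a _set element in one of the two situations above is found;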

The following code fragment is related to the put() method:

public V put(K key, V value) {
    int index = insertionIndex(key);
    return doPut(key, value, index);
}

private V doPut(K key, V value, int index) {
    V previous = null;
    Object oldKey;
    boolean isNewMapping = true;
    if (index < 0) {
        index = -index -1;
        previous = _values[index];
        isNewMapping = false;
    }
    oldKey = _set[index];
    _set[index] = key;
    _values[index] = value;
    if (isNewMapping) {
        postInsertHook(oldKey == FREE);
    }

    return previous;
}

protected int insertionIndex(T obj) {
    final TObjectHashingStrategy<T> hashing_strategy = _hashingStrategy;

    final Object[] set = _set;
    final int length = set.length;
    final int hash = hashing_strategy.computeHashCode(obj) & 0x7fffffff;
    int index = hash % length;
    Object cur = set[index];

    if (cur == FREE) {
        return index;       // empty, all done
    } else if (cur != REMOVED && hashing_strategy.equals((T) cur, obj)) {
        return -index -1;   // already stored
    } else {                // already FULL or REMOVED, must probe
        // compute the double hash
        final int probe = 1 + (hash % (length - 2));

        // if the slot we landed on is FULL (but not removed), probe
        // until we find an empty slot, a REMOVED slot, or an element
        // equal to the one we are trying to insert.
        // finding an empty slot means that the value is not present
        // and that we should use that slot as the insertion point;
        // finding a REMOVED slot means that we need to keep searching,
        // however we want to remember the offset of that REMOVED slot
        // so we can reuse it in case a "new" insertion (i.e. not an update)
        // is possible.
        // finding a matching value means that we've found that our desired
        // key is already in the table
        if (cur != REMOVED) {
            // starting at the natural offset, probe until we find an
            // offset that isn't full.
            do {
                index -= probe;
                if (index < 0) {
                    index += length;
                }
                cur = set[index];
            } while (cur != FREE
                     && cur != REMOVED
                     && ! hashing_strategy.equals((T) cur, obj));
        }

        // if the index we found was removed: continue probing until we
        // locate a free location or an element which equal()s the
        // one we have.
        if (cur == REMOVED) {
            int firstRemoved = index;
            while (cur != FREE
                   && (cur == REMOVED || ! hashing_strategy.equals((T) cur, obj))) {
                index -= probe;
                if (index < 0) {
                    index += length;
                }
                cur = set[index];
            }
            // NOTE: cur cannot == REMOVED in this block
            return (cur != FREE) ? -index -1 : firstRemoved;
        }
        // if it's full, the key is already stored
        // NOTE: cur cannot equal REMOVED here (would have returned already, see above)
        return (cur != FREE) ? -index -1 : index;
    }
}

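As the code above shows, THashMap uses double hashing. The hash function that computes the increment is final int probe = 1 + (hash % (length - 2)); If insertionIndex() returns a non-negative value, that value is the position of a usable slot; if it returns a negative value, the key has been stored before, and (-index - 1) is the position of the existing slot.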
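
The probe order produced by this increment can be listed with a short sketch. The class ProbeSequenceSketch and its method are hypothetical; the index arithmetic mirrors the excerpt above.

```java
// Hypothetical sketch: lists the slots visited by the decrementing double hash.
final class ProbeSequenceSketch {
    // Returns the first 'length' slots visited for the given non-negative hash.
    static int[] sequence(int hash, int length) {
        int[] visited = new int[length];
        int index = hash % length;
        int probe = 1 + (hash % (length - 2));   // always in [1, length - 2]
        for (int i = 0; i < length; i++) {
            visited[i] = index;
            index -= probe;
            if (index < 0) index += length;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(sequence(29, 11)));
    }
}
```

For hash 29 and length 11 the increment is 3 and the order is 7, 4, 1, 9, 6, 3, 0, 8, 5, 2, 10. Because the length is prime, the increment is always coprime with it, so every slot is visited exactly once before the sequence repeats.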

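put() works as follows: it uses the hash value of the key to locate the corresponding _set element and checks whether there is a hash collision.

The _set element found this way is in one of the following situations:

FREE: the new entry can be inserted at this position;
OCCUPIED, and the value of the element equals the key argument of put(): data with the same key has been inserted at this position before, so this put operation replaces the existing value;
Neither of the above: there is a hash collision and probing is required. During probing, a _set element in the REMOVED state must not be reused right away: only if the whole probe sequence contains no _set element equal to the key argument of put() may the first REMOVED element encountered during probing be reused.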
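
The rule about REMOVED slots and the sign convention of the return value can be condensed into a single loop. The sketch below is a simplified restructuring for int keys under assumed names (InsertionIndexSketch, firstRemoved as an int initialised to -1); it is not the Trove implementation.

```java
// Hypothetical single-loop sketch of the insertion index computation.
final class InsertionIndexSketch {
    static final Object FREE = new Object();
    static final Object REMOVED = new Object();

    // Result >= 0: slot to insert into.
    // Result <  0: key is already stored at slot (-result - 1).
    // Assumes non-negative keys and at least one FREE slot in the table.
    static int insertionIndex(Object[] set, int key) {
        int length = set.length;
        int index = key % length;
        int probe = 1 + (key % (length - 2));
        int firstRemoved = -1;
        while (set[index] != FREE) {
            if (set[index] == REMOVED) {
                if (firstRemoved < 0) {
                    firstRemoved = index;       // remember it, but keep probing
                }
            } else if (set[index].equals(key)) {
                return -index - 1;              // already stored
            }
            index -= probe;
            if (index < 0) index += length;
        }
        // The key is absent: prefer the first REMOVED slot seen on the way.
        return firstRemoved >= 0 ? firstRemoved : index;
    }
}
```

With slot 7 marked REMOVED and key 18 stored in slot 6 of an 11-element table, insertionIndex for 18 returns -7 (the existing slot 6) instead of reusing slot 7, while insertionIndex for 29 finds no equal key and returns the REMOVED slot 7.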


2010-07-07 17:59

#1 lzg406 2010-10-18

This brings back some of what I learned at university. Impressive work by the OP, respect.
