`
whitesock
  • 浏览: 478708 次
  • 性别: Icon_minigender_1
  • 来自: 大连
社区版块
存档分类
最新评论

Open Addressing

    博客分类:
  • SE
阅读更多

1 Overview

      Open addressing和Chaining是两种不同的解决hash冲突的策略。当多个不同的key被映射到相同的slot时,chaining方式采用链表保存所有的value。而Open addressing则尝试在该slot的邻近位置查找,直到找到对应的value或者空闲的slot, 这个过程被称作probing。常见的probing策略有Linear probing,Quadratic probing和Double hashing。

 

2 Chaining

2.1 Chaining in java.util.HashMap

      在分析open addressing策略之前,首先简单介绍一下大多数的Java 核心集合类采用的chaining策略,以便比较。 java.util.HashMap有一个Entry[]类型的成员变量table,其每个非null元素都是一个单向链表的表头。

  • put():如果存在hash冲突,那么对应的table元素的链表长度会大于1。
  • get():需要对链表进行遍历,在遍历的过程中不仅要判断key的hash值是否相等,还要判断key是否相等或者equals。
  • 对于put()和get()操作,如果其参数key为null,那么HashMap会有些特别的优化。

      Chaining策略的主要缺点是需要通过Entry保存key,value以及指向链表下个节点的引用(Map.Entry就有四个成员变量),这意味着更多的内存使用(尤其是当key,value本身使用的内存很小时,额外使用的内存所占的比例就显得比较大)。此外链表对CPU的高速缓存不太友好。

 

3 Open Addressing

3.1 Probing

3.1.1 Linear probing

      两次查找位置的间隔为一固定值,即每次查找时在原slot位置的基础上增加一个固定值(通常为1),例如:P = (P + 1) mod SLOT_LENGTH。其最大的优点在于计算速度快,另外对CPU高速缓存更友好。其缺点也非常明显:

      假设key1,key2,key3的hash code都相同并且key1被映射到slot(p),那么在计算key2的映射位置时需要查找slot(p), slot(p+1),计算key3的映射位置时需要查找slot(p), slot(p+1),slot(p+2)。也就是说对于导致hash冲突的所有key,在probing过程中会重复查找以前已经查找过的位置,这种现象被称为clustering。

 
3.1.2 Quadratic probing

      两次查找位置的间隔线性增长,例如P(i) = (P + c1*i + c2*i*i) mod SLOT_LENGTH,其中c1和c2为常量且c2不为0(如果为0,那么降级为Linear probing)。 Quadratic probing的各方面性能介于Linear probing和Double hashing之间。


3.1.3 Double hashing
       两次查找位置的间隔为一固定值,但是该值通过另外一个hash算法生成,例如P = (P + INCREMENT(key)) mod SLOT_LENGTH,其中INCREMENT即另外一个hash算法。以下是个简单的例子:

      H(key) = key mod 10
      INCREMENT(key) = 1 + (key mod 7)
      P(15): H(15) = 5;
      P(35): H(35) = 5, 与P(15)冲突,因此需要进行probe,位置是 (5 + INCREMENT(35)) mod 10 = 6
      P(25): H(25) = 5, 与P(15)冲突,因此需要进行probe,位置是 (5 + INCREMENT(25)) mod 10 = 0

      P(75): H(75) = 5, 与P(15)冲突,因此需要进行probe,位置是 (5 + INCREMENT(75)) mod 10 = 1

      从以上例子可以看出,跟Linear probing相比,减少了重复查找的次数。

 

3.2 Load Factor

      基于open addressing的哈希表的性能对其load factor属性值非常敏感。如果该值超过0.7 (Trove maps/sets的默认load factor是0.5),那么性能会下降的非常明显。由于hash冲突导致的probing次数跟(loadFactor) / (1 - loadFactor)成正比。当loadFactor为1时,如果哈希表中的空闲slot非常少,那么可能会导致probing的次数非常大。

 

3.3 Open addressing in gnu.trove.THashMap

      GNU Trove (http://trove4j.sourceforge.net/) 是一个Java 集合类库。在某些场景下,Trove集合类库提供了更好的性能,而且内存使用更少。以下是Trove中跟open addressing相关的几个特性:

  • Trove maps/sets没有使用chaining解决hash冲突,而是使用了open addressing。
  • 跟chaining相比,open addressing对hash算法的要求更高。通过TObjectHashingStrategy 接口, Trove支持定制hash算法(例如不希望使用String或者数组的默认hash算法)。
  • Trove提供的maps/sets的capaicity属性一定是质数,这有助于减少hash冲突。
  • 跟java.util.HashSet不同,Trove sets没有使用maps,因此不需要额外分配value的引用。

      跟java.util.HashMap相比,gnu.trove.THashMap没有Entry[] table之类的成员变量,而是分别通过Object[] _set,V[] _values直接保存key和value。在逻辑上,Object[] _set中的每个元素都有三种状态:

  • FREE:该slot目前空闲;
  • REMOVED:该slot之前被使用过,但是目前数据已被移除;
  • OCCUPIED:该slot目前被使用中;

      这三种状态的迁移过程如下:

  • 在构造或者resize时,所有_set元素都会赋值为FREE(FREE状态);
  • 向THashMap中put某个key/value对时,_set中对应的元素会被赋值为put()方法的参数key(OCCUPIED状态);
  • 从THashMap中以key进行remove的时,_set中对应的元素会被赋值为REMOVED(注意:不是赋值为FREE);

      以下是关于状态迁移的简单例子(:= 的含义是赋值, H(key) = key mod 11):

  • put(7, value):  _set[7] := 7;
  • put(9, value):  _set[9] := 9;
  • put(18, value):  由于_set[7]处于OCCUPIED状态,导致hash冲突;假设第一次probe计算得出9,由于_set[9]处于OCCUPIED状态,仍然hash冲突; 假设再次probe计算得到1, 由于_set[1]处于FREE状态,所以_set[1] := 18;
  • get(18): _set[7]处于OCCUPIED状态并且与18不等;假设第一次probe计算得出9,_set[9]的值与18不等;假设再次probe计算得到1, 由于_set[1] 的值等于18,所以返回对应value;
  • remove(9): _set[9] := REMOVED;
  • get(18): _set[7]的值与18不等;假设第一次probe计算得出9,由于_set[9]状态为REMOVED,需要再次probe;假设再次probe计算得到1, 由于_set[1] 的值等于18,所以返回对应value;
  • put(9, value):  _set[9] := 9;

      以下是与get()方法相关的代码片段:

public V get(Object key) {
    int index = index((K) key);
    return index < 0 ? null : _values[index];
}

protected int index(T obj) {
    final TObjectHashingStrategy<T> hashing_strategy = _hashingStrategy;

    final Object[] set = _set;
    final int length = set.length;
    final int hash = hashing_strategy.computeHashCode(obj) & 0x7fffffff;
    int index = hash % length;
    Object cur = set[index];

    if ( cur == FREE ) return -1;

    // NOTE: here it has to be REMOVED or FULL (some user-given value)
    if ( cur == REMOVED || ! hashing_strategy.equals((T) cur, obj)) {
        // see Knuth, p. 529
        final int probe = 1 + (hash % (length - 2));

        do {
            index -= probe;
            if (index < 0) {
                index += length;
            }
            cur = set[index];
        } while (cur != FREE
             && (cur == REMOVED || ! _hashingStrategy.equals((T) cur, obj)));
    }

    return cur == FREE ? -1 : index;
}

      从以上代码可以看出get()方法的流程如下, 根据key的hash值找到对应的set元素,判断是否存在hash冲突。

      如果不存在hash冲突,那么该set元素的可能状态如下:

  • FREE:意味着THashMap中不存在该key;
  • OCCUPIED,并且该元素的值等于get()方法的参数key:意味着THashMap中存在该key;
  • 非以上两种情况:意味着存在hash冲突,需要进行probe,直到找到状态为以上两种状态的set元素;

      以下是与put()方法相关的代码片段:

public V put(K key, V value) {
    int index = insertionIndex(key);
    return doPut(key, value, index);
}

private V doPut(K key, V value, int index) {
    V previous = null;
    Object oldKey;
    boolean isNewMapping = true;
    if (index < 0) {
        index = -index -1;
        previous = _values[index];
        isNewMapping = false;
    }
    oldKey = _set[index];
    _set[index] = key;
    _values[index] = value;
    if (isNewMapping) {
        postInsertHook(oldKey == FREE);
    }

    return previous;
}

protected int insertionIndex(T obj) {
    final TObjectHashingStrategy<T> hashing_strategy = _hashingStrategy;

    final Object[] set = _set;
    final int length = set.length;
    final int hash = hashing_strategy.computeHashCode(obj) & 0x7fffffff;
    int index = hash % length;
    Object cur = set[index];

    if (cur == FREE) {
        return index;       // empty, all done
    } else if (cur != REMOVED && hashing_strategy.equals((T) cur, obj)) {
        return -index -1;   // already stored
    } else {                // already FULL or REMOVED, must probe
        // compute the double hash
        final int probe = 1 + (hash % (length - 2));

        // if the slot we landed on is FULL (but not removed), probe
        // until we find an empty slot, a REMOVED slot, or an element
        // equal to the one we are trying to insert.
        // finding an empty slot means that the value is not present
        // and that we should use that slot as the insertion point;
        // finding a REMOVED slot means that we need to keep searching,
        // however we want to remember the offset of that REMOVED slot
        // so we can reuse it in case a "new" insertion (i.e. not an update)
        // is possible.
        // finding a matching value means that we've found that our desired
        // key is already in the table
        if (cur != REMOVED) {
            // starting at the natural offset, probe until we find an
            // offset that isn't full.
            do {
                index -= probe;
                if (index < 0) {
                    index += length;
                }
                cur = set[index];
            } while (cur != FREE
                     && cur != REMOVED
                     && ! hashing_strategy.equals((T) cur, obj));
        }

        // if the index we found was removed: continue probing until we
        // locate a free location or an element which equal()s the
        // one we have.
        if (cur == REMOVED) {
            int firstRemoved = index;
            while (cur != FREE
                   && (cur == REMOVED || ! hashing_strategy.equals((T) cur, obj))) {
                index -= probe;
                if (index < 0) {
                    index += length;
                }
                cur = set[index];
            }
            // NOTE: cur cannot == REMOVED in this block
            return (cur != FREE) ? -index -1 : firstRemoved;
        }
        // if it's full, the key is already stored
        // NOTE: cur cannot equal REMOVE here (would have retuned already (see above)
        return (cur != FREE) ? -index -1 : index;
    }
}

       从以上代码可以看出,THashMap使用Double hashing。用来计算增量的hash算法是final int probe = 1 + (hash % (length - 2));  如果insertionIndex()方法的返回值为正值,那么该值就是可用的slot位置;如果为负值,那么说明该key之前已经保存过,(-index-1)就是之前的slot位置。

 

      put()方法的流程如下, 根据key的hash值找到对应的set元素,判断是否存在hash冲突。

      如果不存在hash冲突,那么该set元素的可能状态如下:

  • FREE:意味着可以在该位置插入;
  • OCCUPIED,并且该元素的值等于put()方法的参数key:意味着该位置之前已经插入过相同key的数据,本次put操作需要对已有值进行替换;
  • 非以上两种情况:意味着存在hash冲突,需要进行probe;在probe的过程中不能轻易重用状态为REMOVED的set元素:如果在整个probe过程中没有发现与put()方法的参数key相等的set元素,那么才可以重用probe过程中遇到的第一个状态为REMOVED的set元素。
7
0
分享到:
评论
1 楼 lzg406 2010-10-18  
唤回大学学的一些知识,LZ强啊,佩服

相关推荐

    数据结构Advanced-Data-Structures

    Open addressing 476 Lazy deletion 479 Linear probing 479 Quadratic probing 480 Double hashing 484 Cuckoo hashing 486 Coalesced hashing 491 Perfect hash function 494 Universal hashing 496 Linear ...

    Introduction to Algorithms, 3rd edtion

    11.4 Open addressing 269 11.5 Perfect hashing 277 12 Binary Search Trees 286 12.1 What is a binary search tree? 286 12.2 Querying a binary search tree 289 12.3 Insertion and deletion 294 12.4 Randomly...

    算法导论_英文第三版

    11.4 Open addressing 269 11.5 Perfect hashing 277 12 Binary Search Trees 286 12.1 What is a binary search tree? 286 12.2 Querying a binary search tree 289 12.3 Insertion and deletion 294 ? 12.4 ...

    算法导论 第三版 英文原版 高清文字版

    11.4 Open addressing 269 11.5 Perfect hashing 277 Contents vii 12 Binary Search Trees 286 12.1 What is a binary search tree? 286 12.2 Querying a binary search tree 289 12.3 Insertion and deletion 294 ...

    算法导论英文版

    ll.4 Open addressing 237 ll.5 Perfect hashing 24S l2 Binary Search Trees 253 l2.l What is a binary search tree? 2S3 l2.2 Querying a binary search tree 2S6 l2.3 Insertion and deletion 261 l2.4 ...

    算法导论-introduction to algrithms 3rd edition

    11.4 Open addressing 269 ? 11.5 Perfect hashing 277 Contents vii 12 Binary Search Trees 286 12.1 What is a binary search tree? 286 12.2 Querying a binary search tree 289 12.3 Insertion and ...

    算法导论--Introduction.to.Algorithms

    11.4 Open addressing 269 11.5 Perfect hashing 277 12 Binary Search Trees 286 12.1 What is a binary search tree? 286 12.2 Querying a binary search tree 289 12.3 Insertion and deletion 294 12.4 Randomly...

    HashTable的java实现

    实现了链表法(Chaining)和开放地址寻址(Open Addressing)中的Hash表实现,开放地址寻址采用双重散列解决冲突

    11.3冲突处理方法1

    11.3 冲突处理方法开放定址法(Open Addressing)一旦产生了冲突(该地址已有其它元素),就按某 种规则去寻找另一空地址处理冲突的方法常用处理冲突

    open_addressing_hash_table:在 C++ 上打开寻址哈希表

    在 C++ 上打开寻址哈希表 希望在这里看到功能齐全的 C++ 容器在 C++ 上实现快速和紧凑的开放寻址哈希表。 概念证明(参见 benchmark.cpp)表明,这种映射可以比std::unordered_map快 3 倍。 API 应该很像std::...

    浅谈哈希表存储效率一般不超过50%的原因

    哈希表大多使用open addressing来解决collision,此时search的时间复杂度计算公式为: 1/( 1 – n/m ) 其中,n与m分别表示存储的记录数与哈希表的长度,即装填因子( load factor ) 故,若哈希表半满,即 n/m &gt;= 1/2...

    Open-Channel SSD Spec 2.0

    This specification defines the Physical Page Addressing Command Set extension to the NVMe specification. It enables Solid State Drives (SSDs) to expose their internal parallelism and allows the host ...

    Open Host Controller Interface Specification

    Open Host Controller Interface Specification for USB 4.3.1.3.6.3 System Errors.......................................................................................25 4.3.1.3.7 Special ...

    The Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004 Edition

    The 2004 edition incorporates Technical Corrigendum Number 1 and Technical Corrigendum 2 addressing problems discovered since the approval of the 2001 edition. These are mainly due to resolving ...

    google-maps-coordinates-tile-bounds-projection

    Addressing tiles: same tile bounds with different indexes There are three main systems of tile adressing: Google XYZ, Microsoft QuadTree and from the open-source world comming TMS (Tile Map Service).

    Packt.Hands-on.Blockchain.Development.with.Hyperledger

    Hyperledger is an open source project to create private blockchain applications for different domains including finance, banking, supply chain, IoT and much more. This book will be an easy reference ...

    CD and DVD Forensics(syngress安全图书)

    *This is the first book addressing using the CD/DVD Inspector product in a hands-on manner with a complete step-by-step guide for examining evidence discs * See how to open CD's and DVD'd and extract ...

    Deep Learning with Theano.pdf

    addressing supervised and unsupervised learning, generative models, reinforcement learning in the fields of image recognition, natural language processing, or game strategy. The book also discusses...

Global site tag (gtag.js) - Google Analytics