“Sketching” data structures store a summary of a data set in situations where the whole data would be prohibitively costly to store (at least in a fast-access place like the memory as opposed to the hard disk). Variants of trees, hash tables, etc. are not sketching structures, they just facilitate access to the data, but they still store the data itself. However, the concept of hashing is closely related to most sketching ideas as we will see.
The main feature of sketching data structures is that they can answer certain questions about the data extremely efficiently, at the price of the occasional error. The best part is that the probability of an error can be quantified and the programmer can trade off the expected error rate with the amount of resources (storage, time) afforded. At the limit of this trade-off (when no error is allowed) these sketching structures collapse into traditional data structures.
Sketching data structures are somewhat counter-intuitive, but they can be useful in many real applications. I look at two such structures mostly for my own benefit: As I try to understand them, I write down my notes. Perhaps someone else will find them useful. Links to further information can be found in the end. Leave comments if you know of other sketching data structures that you found useful or if you have some favorite elegant and unusual data structure.
The Count-Min (CM) sketch is less known than the Bloom filter, but it is somewhat similar (especially to the counting variants of the Bloom filter). The problem here is to store a numerical value associated with each element, say the number of occurrences of the element in a stream (for example when counting accesses from different IP addresses to a server). Surprisingly, this can be done using less space than the number of elements, with the trade-off that the result can be slightly off sometimes, but mostly on the small values. Again, the parameters of the data structure can be chosen such as to obtain a desired accuracy.
CM works as follows: we have k different hash functions and k different tables which are indexed by the outputs of these functions (note that the Bloom filter can be implemented in this way as well). The fields in the tables are now integer values. Initially we have all fields set to 0 (all unseen elements have count 0). When we increase the count of an element, we increment all the corresponding k fields in the different tables (given by the hash values of the element). If a decrease operation is allowed (which makes things more difficult), we similarly subtract a value from all k elements.
To obtain the count of an element, we take the minimum of the k fields that correspond to that element (as given by the hashes). This makes intuitive sense. Out of the k values, probably some have been incremented on other elements also (if there were collisions on the hash values). However, if not all k fields have been returned by the hash functions on other elements, the minimum will give the correct value. See illustration for an example on counting hits from IP addresses:
In this example the scenario could be that we want to notice if an IP address is responsible for a lot of traffic (to further investigate if there is a problem or some kind of attack). The CM structure allows us to do this without storing a record for each address. When we increment the fields corresponding to an address, simultaneously we check if the minimum is above some threshold and we do some costly operation if it is (which might be a false alert). On the other hand, the real count can never be larger than the reported number, so if the minimum is a small number, we don’t have to do anything (this holds for the presented simple variant that does not allow decreases). As the example shows, CM sketch is most useful for detecting “heavy hitters” in a stream.
It is interesting to note that if we take the CM data structure and make the counters such that they saturate at 1, we obtain the Bloom filter.
Refrences
http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
http://research.neustar.biz/2011/09/08/our-approach/
http://lkozma.net/blog/sketching-data-structures/
http://research.neustar.biz/tag/count-min-sketch/
相关推荐
使用多核处理器并行Min Sketch算法的研究,张玉,,在高速的网络监控中,不断增长的流量要求一个一个高性能的计算频繁项的解决方案。当前商业多核心处理器的越来越多的内核为并行开
YACMS 又一个 Count-Min 草图(countminsketch 和 count_min_sketch 已经被采用了。) 执照 新 BSD,请参阅许可证。
Count-Min 草图 Count-Min 草图的 Java 实现,用于在指定错误范围内查找数据流中的事件频率。 更多信息 参考: [1] [2] [3] [4] 问题 欢迎修复错误或改进! 请 fork 项目并在 github 上发送 pull request。...
count-min-sketch:C语言中的最小计数草图实现
需要资源请查询<Sketch算法所用数据>,基于网络流处理的Sketch算法的C语言实现,已经过测试证实准确可用
艺术家配对评估器 - Count-Min Sketch 关于 文本文件“Artist_lists_small.txt”包含来自的 1000 位用户最喜欢的音乐艺术家。 每行是一个最多 50 位艺术家的列表,格式如下: Radiohead,Pulp,Morrissey,Delays,...
资源分类:Python库 所属语言:Python 资源全名:count_min_sketch-1.0.0.tar.gz 资源来源:官方 安装方法:https://lanzao.blog.csdn.net/article/details/101784059
CountMinSketch **待办事项:添加说明**
草图最小计数
开源项目-seiflotfy-cmts.zip,Count-Min-Tree Sketch: Count-min sharing bits instead of bytes.
概率数据结构带有 Redis 存储的 Java 概率数据结构。 目标是使概率数据结构易于使用,依靠基础实现并... Muthukrishnan, S.,“使用 Count-Min Sketch 逼近数据”,软件,IEEE,第 29 卷,第 1 期,第 64,69 页,一月至
使用count-min sketch和2-universal散列函数实现了在高速流量下实时的分组处理和流量统计;使用Bloom filter在控制平面恢复流量的原始5元组信息,解决了sketch数据结构的不可逆问题。使用CERNET骨干网流量数据对原型...
该算法将数据流分段并利用改进的FP-growth算法挖掘分段中的频繁项集,然后利用Count-Min Sketch进行项集计数。算法解决了压缩统计和计算快速高效的问题。通过实验分析,FP-SegCount算法是有效的。
开源项目-seiflotfy-pmc.zip,PMC: to Count-Min is as HyperLogLog to Bloomfilter (practically a better Coun-Min Sketch implementation in Go)
Clojure中的草绘算法 安装 sketchy可从一个Maven构件 。 总览 该库包含各种基于草图/散列的算法,可用于构建大型数据集的紧凑摘要。 所有草图均使用香草Clojure数据结构组成。 这意味着不变性和易于序列化... [min-has
这包括稳定的Bloom过滤器,可伸缩的Bloom过滤器, Counting Bloom过滤器, Inverse Bloom过滤器, Cuckoo过滤器,传统Bloom过滤器的几种变体, HyperLogLog , Count-Min Sketch和MinHash 。 经典布隆过滤器通常...
再聊Count-MinCount-Sketchhttps://blog.csdn.net/u012332103/article/details/79702495
使用 `Count-Min Sketch` 更精确地估算点查的代价 支持分析更复杂的条件,尽可能充分的使用索引 支持通过 `STRAIGHT_JOIN` 语法手动指定 Join 顺序 `GROUP BY`子句为空时使用 Stream Aggregation 算子,提升性能 ...
实施论文时,由于当前解决方案不足,我们不得不为Count-Min Sketchs和Count-Sketches创建一个新的库。 C ++库很难使用,而Python包则往往很慢。 因此,我们编写了一个Python包,其中包含与C ++库的Python绑定,以...