
Blockchain "Fish and Bear's Paw" (Trade-offs) Series (7): Bloom Filter (SPV, Continued)

2017-02-11 11:23
A Bloom Filter is a space-efficient randomized data structure that is well suited to portable devices with limited storage. Its main drawback is that the probability of error (a false positive) cannot be eliminated, although it can be made small enough for practical applications. Computation time, look-up table size, and error probability are the three fundamental metrics that are traded off against one another in a Bloom Filter's design.

Q1: What is a Bloom Filter, and how should its parameters be chosen?



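First, suppose the set S to be mapped contains n elements, the target look-up table has a capacity of m bits, and there are k hash functions. Next, we apply each of the k hash functions to every element of S and set the bit at each resulting position in the look-up table to 1. Then, given a new element n4, how does the look-up table decide whether n4 is an element of S? We apply the k hash functions to n4 and check whether all k hash values map to set bits in the look-up table; in other words, none of the hash results may point to a bit that is 0. If this holds, n4 is reported as being in the set (which may be a false positive); otherwise n4 is definitely not in S. This is the basic principle of the Bloom Filter.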
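The insert-and-query procedure described above can be sketched in a few lines of Python. This is only an illustration, not the implementation linked later in this post: the class name `BloomFilter`, the method names, and the choice to derive the k hash functions from SHA-256 with an index prefix are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        if m <= 0 or k <= 0:
            raise ValueError("m and k must be positive")
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # look-up table, all bits start at 0

    def _positions(self, item: str):
        # Assumption: the i-th hash function is SHA-256 over "i:item".
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at each of the k hashed positions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True only if none of the k positions points to a 0 bit.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=3)
for element in ("n1", "n2", "n3"):
    bf.add(element)

print(bf.might_contain("n1"))  # an inserted element is always reported as present
print(bf.might_contain("n4"))  # usually False; True here would be a false positive
```

A negative answer is always correct, because inserting an element can only turn bits on. A positive answer only means "possibly in S".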

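If the look-up table is small and the set is large, hash collisions become likely: the Bloom Filter may wrongly report that n4 is in S. How can such errors be avoided, or at least minimized? Parameter choice is crucial. Assuming each hash function picks a bit position uniformly and independently, once every element of S has been mapped by all k hash functions, the probability that a given bit of the look-up table is still 0 is:

$$\left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$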

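It follows that the probability of a hash collision, that is, of all k bits probed for an element outside S being 1 (a false positive), is:

$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$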
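To get a feel for the numbers, the two probabilities can be evaluated directly. The sketch below uses the standard expressions for a Bloom Filter with independent, uniform hash functions (an assumption); the function names and the values of m, n, and k are hypothetical and chosen only for illustration.

```python
import math


def zero_bit_probability(m: int, n: int, k: int) -> float:
    """Probability that a given bit is still 0 after n insertions: (1 - 1/m)^(kn)."""
    return (1.0 - 1.0 / m) ** (k * n)


def false_positive_probability(m: int, n: int, k: int) -> float:
    """Probability that all k probed bits are 1 for an element outside S."""
    return (1.0 - zero_bit_probability(m, n, k)) ** k


def false_positive_approx(m: int, n: int, k: int) -> float:
    """Large-m approximation of the same quantity: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, n, k = 10_000, 1_000, 7  # hypothetical example values
print(zero_bit_probability(m, n, k))
print(false_positive_probability(m, n, k))
print(false_positive_approx(m, n, k))
```

With these values roughly half the bits stay 0 and the false positive rate comes out below 1%. Doubling m while keeping n and k fixed lowers it further.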


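Clearly, the hash collision probability must be kept small enough to be negligible. To that end, the 0 bits should make up a substantial share of the look-up table: once every bit is 1, every query collides and the table no longer carries any information.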
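One common way to keep the collision probability in a chosen range is to size the table from n and a target false positive rate p. The sketch below uses the widely cited sizing formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2; these are not stated in this post, and the helper names are hypothetical.

```python
import math


def size_for(n: int, p: float) -> tuple:
    """Return (m, k) for n elements and a target false positive rate p."""
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("n must be positive and p must lie in (0, 1)")
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))  # k must be a whole number
    return m, k


def expected_zero_fraction(m: int, n: int, k: int) -> float:
    """Approximate share of bits still 0 after n insertions: e^(-kn/m)."""
    return math.exp(-k * n / m)


m, k = size_for(n=1_000, p=0.01)
print(m, k)
print(expected_zero_fraction(m, 1_000, k))
```

With this sizing about half of the bits are expected to remain 0, which matches the requirement that 0 bits keep a substantial share of the table. Because k is rounded to an integer, the achieved rate can differ slightly from the target.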

GitHub

Bloom Filter in C/C++: https://github.com/ArashPartow/bloom


References

[1] Bloom B H. Space/time trade-offs in hash coding with allowable errors[J]. Communications of the ACM, 1970, 13(7): 422-426.

[2] Mitzenmacher M. Compressed bloom filters[J]. IEEE/ACM transactions on networking, 2002, 10(5): 604-612.