BLOM - Bloom Filter
From DCBase Wiki
|Inclusion date||1.0.3 2010-06-26|
|Discussion Page Available|
This page contains a discussion page.
Bloom filters allow the hub to filter certain searches using bitmap that represents the hashes of the files in the users share. BLOM is an extension that allows hub software to create a map (bloom filter) of the shared files on the hub, but with minimal effort, e.g. the hub doesn't keep a list of files, but a filter that never produces false negatives but only possible false positives. This can potentially save bandwidth and effort on the client side. When the user updates the share, the client must send an INF containing the flag SF. The hub may at any time request that the client sends an updated version of its bloom filter by sending a GET command to the client. The client will then respond using SND and send the bloom filter binary data.
|b||Number of bits used for file hashes|
|n||Number of files in the user's share|
|m||Size of the bloom filter in bits|
|k||Number of sub-hashes constructed from the file hash|
|h||Number of bits to use for each sub-hash|
|p||Propability of a false positive|
The hub chooses k, h and m.
|k * h < = b|
|h < = 64|
|2^h^ > m|
|m mod 64 == 0|
|p == (1 - (1 - 1 / m)^(k * n)^)^k^||False positives|
|p == 0||False negatives|
Signal BLOM in SUP.
For the SND type, adds H as message type.
For the GET type, adds I as message type.
For the GET type, adds "blom" as type.
For the GET type, "/" shall be used as namespace.
For the GET type, 0 (zero) shall be used as start position.
For the GET type, m / 8 shall be used as byte amount.
Updates GET with the following flags;
The client constructs the bloom filter by creating a bit array of m bits set to 0 (zero). For each file it then sets to "1" k positions constructed from the file hash. Seeing the file hash as a stream of bits (starting from the lowest bit of the first byte, ending at the highest bit of the last byte), the client should use h bits starting at the first bit of the first byte to create an integer and apply modulo m to get the position in the bit array, then redo the process k times advancing the starting position by h each time.
Once the hub has received the bloom filter bit array, for each search command it processes, if it contains a hash search term, it can skip broadcasting the search to a particular client if at least one of the k bits in that clients bit array is "0", calculating positions as the client does when setting bits to "1". The hub has to evaluate the filter for each client that it has a bloom filter for, for each search.
p = (1 - (1 - 1 / m)^(k * n)^)^k^, thus p becomes smaller as m grows and larger as n grows. Larger m means more bits to transfer but also fewer false positives. The optimum value for k given m and n is (m / n) * ln 2. The largest k supported by a hash of a certain size is b / h, so if the hub wants the smallest p possible, it should choose the smallest possible h which gives the largest k, and then calculate m = k * n/ln 2, checking that the resulting m < 2^h^. 2^h^ should much be larger than m (at least 3-4 times), because of how the modulo operator works. Also, with m grows the required bandwidth to transfer the bloom filter, so the hub may wish to cap m. In that case, it should still choose k according to m / n * ln 2, but send an h as big as possible to alleviate biasing problems.
For TTH roots, b is 192 and a reasonable value for h is 24, giving a maximum k = 8 which means that m = 8 * n / ln 2 ≈ 11.5 * n. The required bandwidth then becomes 11.5 * n / 8 bytes, so approximately 1.44 bytes per file in the users share. For 20000 files, m should then be 230016 (taking into account the modulo 64 requirement), giving a p = 0.004, in other words ~99.6% of all searches that don't match a users share would not be sent, saving valuable upload bandwidth for the hub. The client calculates i valid positions, if x is an array of bytes containing the hash of the file, on a little-endian machine, by doing pos = x[0+i*h/8] | (x[1+i*h/8] << 8) | (x[2+i*h/8] << 16) for i = [0;k). This is of course a special case where h % 8 = 0, the actual algorithm has to take into consideration values for h where the integer crosses byte boundaries.
For test vectors, see wiki talk page.
- Implementation in C++ and Object Pascal
- Implementation in Erlang
- Implementation in Haskell
- Implementation in Lisp
- Implementation in Perl
- Implementation in Python
- Implementation in Ruby
- Implementation in Python