Skip to content

Bug for bit sizes being a power of two (and maybe other values) #19

@qjerome

Description

@qjerome

Implementing the following test wouldn't make the test passing anymore.

--- FAIL: TestBugFalsePositives (1.24s)
    bloom_test.go:330: False positive probability is too high at  20.18335038131724 % vs  0.602097140729787 %
FAIL
exit status 1
FAIL	github.com/DCSO/bloom	1.763s
func TestBugFalsePositives(t *testing.T) {
	// this capacity + p would produce a power of 2 bit size
	capacity := uint64(109397)
	p := float64(0.01)
	fillingFactor := 0.9
	N := uint64(float64(capacity) * fillingFactor)
	filter, _ := GenerateExampleFilter(capacity, p, N)
	pAcceptable := math.Pow(1-math.Exp(-float64(filter.k)*float64(N)/float64(filter.m)), float64(filter.k))
	fingerprint := make([]uint64, filter.k)
	cnt := 0.0
	matches := 0.0
	for {
		cnt++
		value := GenerateTestValue(100)
		filter.Fingerprint(value, fingerprint)
		if filter.CheckFingerprint(fingerprint) {
			matches++
		}
		if cnt > float64(capacity)*10 {
			break
		}
	}
	//this might still fail sometimes...
	//we allow for a probability that is two times higher than the normally acceptable probability
	if matches/cnt > pAcceptable*2 {
		t.Error("False positive probability is too high at ", matches/cnt*100, "% vs ", pAcceptable*100, "%")
	}
}

After analysis, it is possible there is a flaw in the fingerprint generation.
Here is my theory (not verified mathematically). Your algorithm generates fingerprints with:

	for i := uint64(0); i < s.k; i++ {
		hn = (hn * g) % m
		fingerprint[i] = uint64(hn % s.m)

However the value of m used (i.e. 0xffffffffffffffc5) is so big that what the code does is equivalent for any x = (hn * g); x < m (which is very likely as we are working with uint64) to the following

	for i := uint64(0); i < s.k; i++ {
		hn = hn * g
		fingerprint[i] = uint64(hn % s.m)

So under those conditions hn is always a multiple of h0 (fnv hash) multiplied by g power i, very likely creating lots of collisions for very specific cases, such as this one.

NB: I did not verify if other bit sizes create the same behavior

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions