Bloom Filter UDF for GSQL

danmccreary · May 13, 2021, 12:22pm

I am working on a scale-out graph rules engine project that requires us to check if an item is NOT on one of many long lists of items (hundreds of thousands of items on thousands of lists). Other NoSQL systems use a probabilistic hashing algorithm called a Bloom filter to do this. Has anyone created a Bloom Filter User Defined Function (UDF) for GSQL yet?

If anyone else is interested in graph rule engine design, please let me know.

As a background, this article is very well written

Rules for Knowledge Graphs Rules

Richard_Henderson · May 13, 2021, 4:22pm

Doesn’t look like it, but should be very doable as a UDF.

There is some internal conversation!

My first thoughts:
Architecturally, we’d need a ListOfStuff vertex to contain the Bloom state. That would need to be an encoded representation (base64 string or whatever).

Then two queries, keeping them pure:

addValueToBloom(thing, state) => newState
checkValueNotInBloom(thing, state) => isNotContained

This would elaborate if we wanted to parameterise the Bloom filter width and hash functions, but might be enough for now.

Would that be enough for you?

danmccreary · May 13, 2021, 5:02pm

Thank you for your quick response.

Keeping the functions simple is a great idea. I like just having two queries, one for addValue and one for checkValue. One other option is to pass a collection Set attribute to an addValue since this is really the datatype we are working with. I wonder if that would be faster than calling addValue N times?

Richard_Henderson · May 13, 2021, 5:58pm

Yes it should be a little quicker, particularly the problematic bit in this design would be the serialization and deserialization of the filter state on every call.

The filter would be some small multiple of the bloom filter size which would depend on the number of elements etc. Parameterising those would require a createFilter function.

Bloom calculators are here:
https://hur.st/bloomfilter/?n=4000&p=1.0E-7&m=&k=

If the ruleset is static, i.e. we generate the filter once, and then test against it a lot, then more succinct structures can be used e.g. cuckoo or xor filters.

Then there wouldn’t be an addValue operation, just a createFilter operation taking all of your rules as a parameter and returning the serialized filter state.

Richard_Henderson · May 13, 2021, 6:00pm

It’s worth doing the calculations to see just how much room would be needed to contain the bloom filters.

danmccreary · May 20, 2021, 5:55pm

I have also found out that Xilinx has a FPGA implementation of Bloom filters:

https://xilinx.github.io/Vitis_Libraries/database/2019.2/guide/bloom_filter/bloom_filter.html