Hi, Racket Discourse.
Among other things, I am playing with a bloom filter to validate UUIDs as unique in a database with a very large number of records.
At first, I simply queried the DB whenever I needed to validate these UUIDs as being unique, but once the records started nudging the 5,000,000 mark, that method stopped being great. It isn't slow, but it eats memory very quickly.
So, having recently read about how SQLite uses bloom filters to speed up its queries, I thought that might be a good place to start.
Forgetting, for the moment, questions about overall efficiency, I am curious how one would set the bits in a bloom filter whose width is on the order of 2^(31) + 1 bits.[0]
At the moment I am simply inclusive-or-ing in the bits to be set, but one might imagine that shuffling these massive numbers around in memory cannot be ideal.
Absit any technical errors, the code currently looks like:
#lang racket/base
(require racket/match racket/stream racket/set racket/list uuid)
(struct uuid-filter [size bits count])
(define << arithmetic-shift)
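;; M: filter width in bits; K: number of bit indexes derived per UUID.
;; B presumably seeds the filter with a guard bit above index M, so the
;; bignum is allocated at full width from the start.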
(define M (1 . << . 28))
(define B (1 . << . (+ M 1)))
(define K 8)
(define (new-uuid-filter) (uuid-filter M B 0))
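;; Standard bloom-filter false-positive estimate, (1 - e^(-K*n/m))^K with
;; n = count and m = size; e.g. size = 2^28, K = 8, count = 5e6 gives ~1.3e-7.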
(define (uuid-filter-error ufilter)
  (match-define (uuid-filter size _ count) ufilter)
  (expt (- 1 (exp (* (- K) (/ count size)))) K))
(define (uuid-string->bytes uuid-string)
  ; should probably remove the version and variant from the uuid
  (define uuid-norm (regexp-replace* #px"[-]" uuid-string ""))
  (string->bytes/utf-8 uuid-norm))
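;; Split the 32 hex characters into K groups of 4 bytes; each group
;; later becomes one bit index.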
(define (uuid-string->byte-stream uuid-string)
  (define uuid-bytes (uuid-string->bytes uuid-string))
  (define byte-count (quotient (bytes-length uuid-bytes) K))
  (define byte-input (open-input-bytes uuid-bytes))
  (let group-bytes ([groups empty-stream])
    (cond
      [(eof-object? (peek-byte byte-input)) groups]
      [else
       (define group (read-bytes byte-count byte-input))
       (group-bytes (stream-cons group groups))])))
(define (uuid-string->bit-index-stream size uuid-string)
  (define byte-stream (uuid-string->byte-stream uuid-string))
  (for/stream ([group (in-stream byte-stream)])
    (remainder (integer-bytes->integer group #;signed? #false) size)))
(define (bitwise-bits-set? bits indexes)
  (for/and ([index (in-stream indexes)])
    (bitwise-bit-set? bits index)))
(define (uuid-filter-match? ufilter uuid-string)
  (match-define (uuid-filter size bits count) ufilter)
  (define bit-indexes (uuid-string->bit-index-stream size uuid-string))
  (bitwise-bits-set? bits bit-indexes))
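;; Build one mask with all requested bits set. Sorting lets each bit be
;; placed by shifting from the previous index, so the shifts stay small.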
(define (create-mask indexes)
  (define sorted-indexes
    (sort (set->list (for/set ([index (in-stream indexes)]) index)) <))
  (for/fold ([mask 0]
             [head 1]
             [prev 0]
             #:result mask)
            ([index (in-list sorted-indexes)])
    (define next (head . << . (- index prev)))
    (values (bitwise-ior mask next) next index)))
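;; Collect the indexes for a whole batch of UUIDs, build a single mask,
;; and ior it into the filter once.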
(define (uuid-filter-add ufilter uuid-strings)
  (match-define (uuid-filter size bits count) ufilter)
  (define-values (mask extra)
    (let-values ([(indexes inserted)
                  (for/fold ([indexes empty-stream]
                             [inserts 0])
                            ([uuid-string (in-stream uuid-strings)])
                    (define bit-indexes
                      (uuid-string->bit-index-stream size uuid-string))
                    (values (stream-append bit-indexes indexes)
                            (+ inserts 1)))])
      (values (create-mask indexes) inserted)))
  (uuid-filter size (bitwise-ior bits mask) (+ count extra)))
(define U (new-uuid-filter))
(uuid-filter-error U)
(define V
  (time (uuid-filter-add U (for/stream ([_ (in-range 10000)]) (uuid-string)))))
(uuid-filter-error V)
I have naively tried to at least keep the numbers in order, so as to minimize "jumping around" in the bits to be set; but having to sort all the prospective entries doesn't really help, especially because the method of using streams to lazily load the entries falls apart once they are turned into a set.
Now, if there were some way to set the bits more "cheaply", I probably wouldn't have to worry about that as much, or I could perhaps even batch the sorted entries to allow for shorter "bursts" of ordered bits to be set. Bucketing the bits of the bloom-filter into 128 numbers of 2^(24) bits each--or some similar scheme--may also help.
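To make the bucketing idea concrete, here is a rough sketch of what I have in mind (all names hypothetical): replace the single big integer with a mutable bytes buffer, so that setting a bit mutates one byte in place instead of reallocating the whole bignum.

(define FILTER-BITS (expt 2 28))
(define filter-bytes (make-bytes (quotient FILTER-BITS 8) 0))

;; Set the bit at index by mutating a single byte of the buffer.
(define (set-bit! index)
  (define-values (byte-pos bit-pos) (quotient/remainder index 8))
  (bytes-set! filter-bytes byte-pos
              (bitwise-ior (bytes-ref filter-bytes byte-pos)
                           (arithmetic-shift 1 bit-pos))))

;; Test the bit at index without touching the rest of the buffer.
(define (bit-set? index)
  (define-values (byte-pos bit-pos) (quotient/remainder index 8))
  (bitwise-bit-set? (bytes-ref filter-bytes byte-pos) bit-pos))

With this layout there is no sorting and no mask-building: each index costs one byte read and one byte write, regardless of order.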
The entries in the DB grow by roughly 100,000 per day, so the idea would be to initially do a slow load of the current entries and then serialize the filter and run it each day on just the new entries.
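A sketch of that serialize step, assuming the hypothetical filter-bytes buffer from above: the whole filter is just 2^25 bytes, so it can be dumped to and restored from disk directly.

;; Write the filter buffer to disk, replacing any previous snapshot.
(define (save-filter path)
  (call-with-output-file path
    (lambda (out) (write-bytes filter-bytes out))
    #:exists 'replace))

;; Refill the existing buffer from a snapshot on disk.
(define (load-filter! path)
  (call-with-input-file path
    (lambda (in) (read-bytes! filter-bytes in))))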
Is this basically as much as one can do in this regard, without being clever, or is there still room for improvement before looking at smarter ways of batching the entries, and so forth?
I don't have a great intuition for what's happening in the code, so my assumptions might be way off here.
[0] Haha, I gave my colleague such a surprise by showing him that available memory is the only bottleneck for such large integers in Racket!
Edit: a bit cleaner, and doesn't use a port, but this is neither here nor there:
; needs (require racket/sequence) for in-slice
(define (uuid-string->bit-index-stream size uuid-string)
  (define uuid-bytes (uuid-string->bytes uuid-string))
  (define byte-count (quotient (bytes-length uuid-bytes) K))
  (for/stream ([group (in-slice byte-count (in-bytes uuid-bytes))])
    (for/fold ([fuse 0]
               #:result (remainder fuse size))
              ([byte (in-list group)])
      (+ (fuse . << . 8) byte))))
Edit: the answer by @gneuner is the "right" one for business, but thanks to @soegaard and @dhamidi for their interesting links.