Setting bits in really wide numbers

Hi, Racket Discourse.

Among other things, I am playing with a bloom-filter to validate the uniqueness of the UUIDs used in a database with a very large number of records.

At first, I simply queried the DB whenever I needed to validate these UUIDs as unique, but after the records started nudging the 5,000,000 mark, that method stopped being great. It isn't slow, but it eats memory very quickly.

So, having recently read about how SQLite uses bloom-filters to speed up its queries, I thought it might be a good place to start.

Forgetting, for the moment, questions about overall efficiency, I am curious how one would set the bits in a bloom-filter with a width on the order of 2^(31) + 1 bits.[0]

At the moment I am simply inclusive-or-ing the bits to be set, but one might imagine that shuffling around these massive numbers in memory cannot be ideal.
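
Concretely, the core update is a functional bitwise-ior, so every batch allocates a fresh bignum of the full filter width. A toy illustration (the width and index here are just for show):

#lang racket/base

; Or-ing one bit into an immutable integer yields a brand-new bignum,
; about 2^28 / 8 = 32 MB at this width, copied from the old one.
(define width (arithmetic-shift 1 28))
(define bits (arithmetic-shift 1 width))   ; filter at full width
(define bits* (bitwise-ior bits (arithmetic-shift 1 12345)))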

Absent any technical errors, the code currently looks like:

#lang racket/base

(require
  racket/match racket/stream racket/sequence racket/set racket/list uuid)

(struct uuid-filter [size bits count])

(define << arithmetic-shift)

; M: the filter width in bits; B: the initial bit field, with a single
; sentinel bit just above the filter range so the bignum starts out at
; full width; K: the number of index slices taken from each UUID
(define M (1 . << . 28))
(define B (1 . << . (+ M 1)))
(define K 8)

(define (new-uuid-filter) (uuid-filter M B 0))

(define (uuid-filter-error ufilter)
  ; the usual bloom-filter false-positive estimate: (1 - e^(-K*n/m))^K
  (match-define (uuid-filter size _ count) ufilter)
  (expt (- 1 (exp (* (- K) (/ count size)))) K))

(define (uuid-string->bytes uuid-string)
  ; should probably remove the version and variant from the uuid
  (define uuid-norm (regexp-replace* #px"[-]" uuid-string ""))
  (string->bytes/utf-8 uuid-norm))

(define (uuid-string->byte-stream uuid-string)
  (define uuid-bytes (uuid-string->bytes uuid-string))
  (define byte-count (quotient (bytes-length uuid-bytes) K))
  (define byte-input (open-input-bytes uuid-bytes))
  (let group-bytes ([groups empty-stream])
    (cond
      [(eof-object? (peek-byte byte-input)) groups]
      [else
       (define group (read-bytes byte-count byte-input))
       (group-bytes (stream-cons group groups))])))

(define (uuid-string->bit-index-stream size uuid-string)
  (define byte-stream (uuid-string->byte-stream uuid-string))
  (for/stream ([group (in-stream byte-stream)])
    (remainder (integer-bytes->integer group #;signed? #false) size)))

(define (bitwise-bits-set? bits indexes)
  (for/and ([index (in-stream indexes)])
    (bitwise-bit-set? bits index)))

(define (uuid-filter-match? ufilter uuid-string)
  (match-define (uuid-filter size bits count) ufilter)
  (define bit-indexes (uuid-string->bit-index-stream size uuid-string))
  (bitwise-bits-set? bits bit-indexes))

(define (create-mask indexes)
  ; dedupe and sort the indexes, then fold up a single integer with all
  ; of those bits set; each shift amount is relative to the previous
  ; index, so the individual shifts stay small
  (define sorted-indexes
    (sort (set->list (for/set ([index (in-stream indexes)]) index)) <))
  (for/fold ([mask 0]
             [head 1]
             [prev 0]
             #:result mask)
            ([index (in-list sorted-indexes)])
    (define next (head . << . (- index prev)))
    (values (bitwise-ior mask next) next index)))

(define (uuid-filter-add ufilter uuid-strings)
  (match-define (uuid-filter size bits count) ufilter)

  (define-values (mask extra)
    (let-values ([(indexes count)
                  (for/fold ([indexes empty-stream]
                             [inserts 0])
                            ([uuid-string (in-stream uuid-strings)])
                    (define bit-indexes (uuid-string->bit-index-stream size uuid-string))
                    (values
                     (stream-append bit-indexes indexes) (+ inserts 1)))])
      (values
       (create-mask indexes) count)))
  
  (uuid-filter
   size (bitwise-ior bits mask) (+ count extra)))

(define U (new-uuid-filter))
(uuid-filter-error U)

(define V
  (time (uuid-filter-add U (for/stream ([_ (in-range 1E4)]) (uuid-string)))))
(uuid-filter-error V)

I have naively tried to at least keep the indexes in order, so as to minimize "jumping around" in the bits to be set; but having to sort all the prospective entries doesn't really help, especially because the lazy, stream-based loading of the entries falls apart once they are turned into a set.

Now, if there were some way to set the bits more "cheaply", I probably wouldn't have to worry about that as much, or could perhaps even batch the sorted entries to allow for shorter "bursts" of ordered bits to be set. Bucketing the bits of the bloom-filter into 128 numbers of 2^(24) bits each--or some similar scheme--may also help; a sketch of that idea follows.
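
If I were to try the bucketing, a minimal sketch might look like this (the word size and all the names are my own guesses, not any library's API):

#lang racket/base

; A bucketed bit field: a mutable vector of fixnum-sized words, so
; setting a bit updates one small integer in place instead of
; rebuilding one giant bignum.
(define WORD-BITS 60) ; 60-bit words stay within fixnum range on 64-bit Racket

(define (make-bit-field size)
  (make-vector (add1 (quotient size WORD-BITS)) 0))

(define (bit-field-set! field index)
  (define-values (word bit) (quotient/remainder index WORD-BITS))
  (vector-set! field word
               (bitwise-ior (vector-ref field word)
                            (arithmetic-shift 1 bit))))

(define (bit-field-set? field index)
  (define-values (word bit) (quotient/remainder index WORD-BITS))
  (bitwise-bit-set? (vector-ref field word) bit))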

The entries in the DB grow by roughly 100,000 per day, so the idea would be to initially do a slow load of the current entries, and from then on serialize the filter and run it each day on just the new entries.
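
For the serialization step, something like this save/load pair might do, reusing the uuid-filter struct from the listing above (the names are hypothetical, and write on a 2^28-bit integer is slow, but it only runs once per cycle):

(define (save-filter! ufilter path)
  (call-with-output-file path #:exists 'replace
    (lambda (out)
      (write (uuid-filter-size ufilter) out) (newline out)
      (write (uuid-filter-count ufilter) out) (newline out)
      (write (uuid-filter-bits ufilter) out))))

(define (load-filter path)
  (call-with-input-file path
    (lambda (in)
      (define size (read in))
      (define count (read in))
      (define bits (read in))
      (uuid-filter size bits count))))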

Is this basically as much as one can do in this regard, without being clever, or is there still some room left for improvement before looking at smarter ways of batching the entries, and so forth?

I don't have a great intuition for what's happening in the code, so my assumptions might be way off here.


[0] Haha, I gave my colleague such a surprise showing him that available memory is the only bottleneck to such large integers in Racket!


Edit: a bit cleaner, and doesn't use a port, but this is neither here nor there:

(define (uuid-string->bit-index-stream size uuid-string)
  (define uuid-bytes (uuid-string->bytes uuid-string))
  (define byte-count (quotient (bytes-length uuid-bytes) K))
  ; in-slice comes from racket/sequence
  (for/stream ([group (in-slice byte-count (in-bytes uuid-bytes))])
    (for/fold ([fuse 0]
               #:result (remainder fuse size))
              ([byte (in-list group)])
      (+ (fuse . << . 8) byte))))

Edit: the answer by @gneuner is the "right" one for business, but thanks to @soegaard and @dhamidi for their interesting links.

I don't have a direct solution to your problem, but someone went through the trouble of making all UUIDv4s enumerable and searchable, so maybe you can find inspiration here: Writing down (and searching through) every UUID · eieio.games


Bucketing the bits of the bloom-filter into 128 times 2^(24) numbers--or some similar scheme--may also help.

If you want to experiment with this type of representation, you can give bit-vector a spin.

Bit Vectors

Or perhaps bv:

Bitvectors
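
For instance, a rough sketch of swapping the filter's big integer for a bit-vector (the names here are just for illustration):

#lang racket/base
(require racket/stream data/bit-vector)

; The bits live in a mutable byte string, so setting a bit touches
; one byte instead of copying a 2^28-bit integer.
(define bits (make-bit-vector (arithmetic-shift 1 28)))

(define (set-indexes! indexes)
  (for ([i (in-stream indexes)])
    (bit-vector-set! bits i #t)))

(define (indexes-set? indexes)
  (for/and ([i (in-stream indexes)])
    (bit-vector-ref bits i)))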


You are better off querying the DB ... likely nothing you could write will be able to match it for performance. But with the DB growing so quickly, you are - or very soon will be - beyond the capabilities of SQLite.

SQLite is a library: all its work is done in memory allocated by your program. It can handle pretty big databases, but it really struggles with big databases that change rapidly.

I think it's time to move to a more capable client/server DBMS such as PostgreSQL or MySQL. The DBMS will then be a separate process using its own memory. Until your individual queries get quite complicated, you are not likely to be worrying about how much "work" space the DBMS requires to process them.
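
For example, with Racket's db library the uniqueness probe becomes a single indexed query (the connection details and table/column names below are placeholders):

#lang racket/base
(require db)

; One round-trip per check; the DBMS's own index does the work.
(define conn
  (postgresql-connect #:user "app" #:database "records" #:password "secret"))

(define (uuid-taken? uuid)
  (and (query-maybe-row conn "SELECT 1 FROM records WHERE uuid = $1" uuid)
       #t))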


Hi, @gneuner

Thank you for the advice. I suspect this will be the route to take going forward. I love messing around, but the "just use postgres" position has been getting closer in the mirror for a while now.

Not that the digression to bloom-filters wasn't a worthwhile investigation, but the show must go on.