Hi, Racket Discourse.
I am messing around with embeddings, and I need some help getting a custom type to work.
Using the pg_vector extension to Postgres (running version 17), I can send an embedding vector to the database just fine, like so:
(db:query-exec
pgc "INSERT INTO items (embedding) VALUES ($1::float4[])"
embedding) ;; embedding is a list of real?
But when I try and do something like this:
(db:query-list
pgc "SELECT 1 - (embedding <=> $1::vector) AS cosine_similarity FROM items"
embedding)
I run into problems.
query-list: cannot convert given value to SQL type
given: '(-0.0043678335 -0.026342869 -0.031660147 ...)
type: bytea
expected: bytes?
dialect: PostgreSQL
If you try and swap the ::vector for ::float4[], it complains that the types are wrong:
query-list: operator does not exist: vector <=> real[]
SQLSTATE: 42883
I can do this to make it work, but it seems like a "hack":
(db:query-list
pgc "SELECT 1 - (embedding <=> CAST($1::float4[] AS vector)) AS cosine_similarity FROM items"
embedding)
;=> '(0.5611949909970255 0.5611949909970255 0.7075591289715957)
Now, I've defined the custom type like so:
(define vector-typeid
(vector->values
(db:query-row pgc "SELECT oid FROM pg_type WHERE typname = $1" "vector")))
(send pgc add-custom-types
(list (pg:pg-custom-type vector-typeid 'vector)))
but I am assuming I should use the #:send and #:recv keywords correctly. I tried naively serializing the embedding to bytes, but it seems like the pg_vector extension expects a particular format for that.
Is there an "easy" way to get something like this to work, or do I just have to figure out the encoding/decoding and then it'll make sense?
Eventually I just looked at another implementation of the binary serialization to get the idea:
(define (encode-embedding e)
(apply
bytes-append
(list*
;; header (from the python `>HH`, meaning two big-endian, unsigned shorts)
(integer->integer-bytes (length e) 2 #false #true)
(integer->integer-bytes 0 2 #false #true)
;; content (I just guessed they would be big-endian, too)
(for/list ([real (in-list e)])
(real->floating-point-bytes real 4 #true)))))
(define read-four
(lambda (in) (read-bytes 4 in)))
(define (decode-embedding d)
(define in (open-input-bytes d))
(define dimms (read-bytes 4 in))
;; adds some noise, need to tune this
(for/list ([bytes (in-producer read-four eof in)])
(floating-point-bytes->real bytes #true)))
...
(send pgc add-custom-types
(list (pg:pg-custom-type vector-typeid 'vector
#:send encode-embedding
#:recv decode-embedding)))
Et voilĂ :
(db:query-list
pgc "SELECT 1 - (embedding <=> $1::vector) AS cosine_similarity FROM items"
embedding)
;=> '(0.5611949909970255 0.5611949909970255 0.7075591289715957)